Does taking some threads out of the equation add up to 99.999?

Adding a small amount of capacity to Amazon's Kinesis real-time data processing service broke Amazon Web Services in the US last week.

These are the findings of an investigation into what brought AWS to its knees in some parts of the US last week. The timing was embarrassing, coming right before the company's annual AWS bash.

According to AWS’ findings, adding the small amount of capacity “caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration,” which created a cascade of issues, shutting down some sites and services in the US for a shortish time.

The incident highlighted the interdependence of cloud services, as the Kinesis failure took out Amazon’s Cognito authentication service, its CloudWatch monitoring technology and its Lambda serverless computing infrastructure, among others.

Some big names

Companies affected included Adobe, Twilio, Flickr and Autodesk, plus New York’s Metropolitan Transportation Authority and the Washington Post, which, of course, is owned by Jeff Bezos.

The company made a big effort to move swiftly on from the relatively small outage, explaining it would very shortly be moving to servers with larger central processing units and more memory, meaning that there will be fewer of them and hence each will need fewer threads to communicate with the others.

Fewer threads or threats?

The explanatory post stated, “This will provide significant headroom in thread count used as the total threads each server must maintain is directly proportional to the number of servers in the fleet.”
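To make that proportionality concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each server holds one thread per peer plus a small fixed baseline; the function names, the fleet sizes and the 10,000-thread limit are hypothetical and are not figures from the AWS post.

```python
# Illustrative model only: every number and name here is an assumption,
# not a figure from AWS.

def threads_per_server(fleet_size: int, base_threads: int = 0) -> int:
    """Threads one server needs if it keeps one thread per peer server."""
    return base_threads + (fleet_size - 1)


def exceeds_limit(fleet_size: int, max_threads: int, base_threads: int = 0) -> bool:
    """True if a server in a fleet of this size would pass the OS thread limit."""
    return threads_per_server(fleet_size, base_threads) > max_threads


MAX_THREADS = 10_000  # hypothetical operating system limit
BASE_THREADS = 10     # hypothetical threads used for everything else

# A fleet just under the limit, the same fleet after a small capacity
# addition, and a smaller fleet of bigger servers.
for label, fleet_size in [
    ("before the addition", 9_990),
    ("after the addition", 10_010),
    ("fewer, larger servers", 5_000),
]:
    needed = threads_per_server(fleet_size, BASE_THREADS)
    over = exceeds_limit(fleet_size, MAX_THREADS, BASE_THREADS)
    print(f"{label}: {fleet_size} servers, {needed} threads each, over limit: {over}")
```

Under these assumptions, a fleet sitting just below the limit tips every server over it at once when a handful of machines are added, while halving the fleet roughly halves the threads each server must hold.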

Amazon said it was sorry and fixed the relatively contained problem pretty quickly. Not so fast: as network operators’ infrastructure, operations and services become ever more intertwined with public cloud, the innate fluffiness of cloud (the acceptance that occasional outages are part of the deal) sits very badly with the 99.999% ethos of telcos as we move towards cloud-native core networks.

And as 2020 has taught us, in spectacular fashion, without resilient connectivity and services, as individuals, communities, societies and countries, we are, well, screwed.