AWS Outage Post-Mortem

Published by Bentley Cook on Mar 10, 2017

Tuesday, February 28th, 2017 was an exciting day for a lot of engineering teams. A large portion of the internet was affected by Amazon’s outage, and as our status page update details, we were too. From approximately 12:45pm to 4:49pm ET, Amazon Web Services suffered a major service interruption to its S3 service in the US-East-1 region. Trello was down from 12:55pm to 3:57pm because our static assets (web client bundle, attachments, file previews, fonts, icons) are stored on CloudFront, which is backed by Amazon S3.

We’ll use this post to cover the technical details of where our architecture faltered because of its reliance on the S3 service in the US-East-1 region, our immediate response, and our longer-term plans to fix it twice.

The Day Of

At 12:40 pm ET, an engineer noticed anomalies in our Grafana dashboards. We were seeing spikes in active requests, response times, and sessions, but there wasn’t a corresponding spike in web process CPU or database times. Then our Nagios alerts started showing a large number of AWS API errors. Initially we tried resetting the entire site to clear everything out. Unfortunately, turning it off and back on again didn’t fix the problem, and by 1:00 pm ET Trello was completely down. A quick search on Twitter for AWS showed that we weren’t alone; it was clear that AWS was having an outage.

We pulled out the crisis plan, dusted it off, and got to work. Because most of Trello’s engineering team is remote, we have to be sure that our crisis plan accounts for a distributed team. This means that there has to be a centralized location where conversation takes place. We cover this by starting a group video chat. The second major step of the plan is to make sure support is included and ready to handle increased outbound communication. The support team works with the engineering team to put together clear, concise copy regarding what is going on and when we expect to be back online. They’ll use this copy to keep our status page up-to-date and to communicate with customers.

The Trello web-app is a single-page app, written in CoffeeScript, and served out of S3. The Trello API, however, is a stateless Express app running in a Node cluster on EC2. This meant that while AWS was having an outage for S3, our backend API service should have been running without a problem. However, when bringing the backend service online, we check to see if the web client is available, and if not, we halt the process. Furthermore, attachment requests handled by the backend service increased the load on those processes until they timed out or OOMed. Because of the web client check halting server startup, the crashed backend services were unable to restart.
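To make the coupling concrete, here’s a minimal sketch of the kind of startup gate we’re describing. It is illustrative only; the manifest URL, `checkWebClientAvailable`, and `startServer` are hypothetical names rather than our actual code.

```coffeescript
# Minimal, illustrative sketch of a startup gate that refuses to boot the API
# when the web client bundle can't be fetched from its S3-backed location.
https   = require 'https'
express = require 'express'

WEB_CLIENT_MANIFEST_URL = process.env.WEB_CLIENT_MANIFEST_URL

# Probe the location where the web client bundle lives.
checkWebClientAvailable = (url, callback) ->
  req = https.get url, (res) ->
    res.resume() # discard the body; only the status code matters
    callback res.statusCode is 200
  req.on 'error', -> callback false

startServer = ->
  app = express()
  # ... API routes would be registered here ...
  app.listen 3000

checkWebClientAvailable WEB_CLIENT_MANIFEST_URL, (available) ->
  if available
    startServer()
  else
    console.error 'Web client is not reachable on S3; refusing to start'
    process.exit 1 # this coupling is what kept our crashed API processes down
```

With S3 unavailable, a check like this fails, the process exits, and every restart lands in the same dead end.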

Despite the challenges we were facing, we were excited that we’d just launched offline-mode for mobile apps. Users of the Trello mobile apps were able to continue editing boards and cards while the web-app was down.

We put together an outline of a plan to serve Trello’s core JavaScript and CSS assets from another location, and began working on the necessary changes to get this working and tested. The proposed solution would restore access to Trello, but user-uploaded assets such as card attachments, board backgrounds, and avatars would still fail to load, and users would be unable to upload new assets. By 3:19 pm, the fix was in our staging environment and undergoing user testing.

There were promising signs that S3 was recovering, but without concrete evidence that it was going to be back online soon, we rolled out the fix at 3:56 pm and Trello was restored. By 5:11 pm, the AWS status page indicated that S3 was functioning normally, and our monitoring indicated that user-uploaded assets were working again.

Fix It Twice

Since Trello’s early days at Fog Creek, we’ve relied on a number of Fog Creek’s principles and philosophies to guide our journey. One that you hear often at Trello is: Fix It Twice. Fixing something twice means that you resolve the immediate problem and also take the time to fix the thing that caused the problem. You should never have to perform the exact same fix to a problem twice. Check out Fog Creek’s blog post on how they scaled customer service by fixing things twice.

Trello’s engineering team asks, “How can we fix this twice?” during every post-mortem. The S3 outage presented a number of opportunities for us to take the time to fix it twice. Below are a few of the important ones we wanted to share.

Problem: Web-app static assets are hosted only on S3, creating a single point of failure for the entire Trello web experience.

Fix It Twice: We plan to deploy a copy of all static assets somewhere else every time we have a new deployment. Additionally, we will add a switch to easily cut over to it. This is low cost and relatively easy to set up. It’ll be tested continuously by serving 50% of our internal traffic from the alternative static asset deployment location. This ensures that we have a good backup ready at all times.
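As a rough sketch of what that switch and the internal traffic split might look like (the origin URLs, the `FORCE_BACKUP_ASSETS` flag, and the `isInternalUser` helper are assumptions for illustration, not our actual configuration):

```coffeescript
# Illustrative sketch: choose which origin serves static assets for a request.
ASSET_ORIGINS =
  primary: 'https://assets.example.cloudfront.net' # S3-backed CloudFront distribution
  backup:  'https://assets-backup.example.com'     # mirror updated on every deploy

# Operator-controlled switch for cutting all traffic over to the backup.
forceBackup = process.env.FORCE_BACKUP_ASSETS is 'true'

isInternalUser = (user) -> user?.email?.endsWith('@example.com')

assetOriginFor = (user) ->
  return ASSET_ORIGINS.backup if forceBackup
  # Serve roughly half of internal traffic from the backup so it stays exercised.
  if isInternalUser(user) and Math.random() < 0.5
    ASSET_ORIGINS.backup
  else
    ASSET_ORIGINS.primary
```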

Problem: We tightly coupled our backend server with the web client such that when a server starts up, it checks for the existence of the web client on S3 and won’t start without it.

Fix It Twice: The check is a bit extreme, so instead of refusing to start the server without the web client, we’ll just log the error and continue. This will allow us to get the server (including the API) up and running so that mobile and other clients can still access it regardless of whether the web client is available.
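Continuing the hypothetical startup sketch from earlier, the change amounts to demoting the hard exit to a logged error:

```coffeescript
# Same illustrative startup gate as before, with the hard exit demoted to a warning.
checkWebClientAvailable WEB_CLIENT_MANIFEST_URL, (available) ->
  unless available
    console.error 'Web client is not reachable on S3; starting the API anyway'
  startServer() # mobile and other API clients stay up regardless of the web client
```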

Problem: User-uploaded assets on S3 may sometimes be unavailable. However, they aren’t critical to the core functionality of Trello, so we should default to degraded service (instead of complete failure) when they are unavailable.

Fix It Twice: We plan to create a switch that will stop our services from talking to S3 if it’s down. Things like attachments and avatars will fail more gracefully and quickly.
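Here’s a hedged sketch of what such a switch could look like on the backend, using the AWS SDK for Node; the `DISABLE_S3` flag, the bucket variable, and the `fetchAttachment` helper are made up for illustration:

```coffeescript
# Illustrative fail-fast switch for S3-backed assets such as attachments and avatars.
AWS = require 'aws-sdk'

s3 = new AWS.S3()
ATTACHMENT_BUCKET = process.env.ATTACHMENT_BUCKET

# Operator-controlled switch, flipped when S3 is having problems.
s3Disabled = -> process.env.DISABLE_S3 is 'true'

fetchAttachment = (key, callback) ->
  if s3Disabled()
    # Fail fast instead of holding the request open until it times out.
    return callback new Error('Attachments are temporarily unavailable')
  s3.getObject {Bucket: ATTACHMENT_BUCKET, Key: key}, callback
```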

Problem: The static page that we use when we are down indicates that we are down for scheduled maintenance.

Fix It Twice: We’re changing the language on our maintenance page to let users know that we’re working on a problem rather than being down for scheduled “maintenance”.