Problem with queuing servers

Incident Report for Transifex

Postmortem

From Thursday afternoon to Friday morning, Transifex's queuing systems experienced unusually heavy loads. As a result, customers who tried to load a page in Transifex either received a 503 error or experienced slow loading times.

I would like to apologize for the inconvenience this has caused. We know there are thousands of people and companies who depend on Transifex to provide their services. We understand that this incident has seriously affected your operations in ways we can't even predict.

What follows is an analysis of the background that led to the incident, the actions we took to address it, and our plans to prevent similar incidents in the future.

Background

We expected higher-than-usual levels of traffic to Transifex this week, and in preparation, had doubled our server capacity. However, this proved to be inadequate.

While Transifex is used to handling projects with thousands of users -- the Joomla community has over 2,500 translators, for instance -- these users typically log in at varying times. This week's increase in traffic and registered users brought a record number of concurrent logged-in users. At certain points, we had an order of magnitude more people than usual requesting pages that trigger background jobs. To make matters worse, some of these pages were a lot more complex than the average page.

In addition to the above, we had tens of thousands of users simultaneously translating the same content. On one day, we saw a 173% increase in active sessions. This unusual volume of concurrent traffic on pages with unusually complex operations caught us by surprise.

Issues

On Friday morning, EU time, our website started returning 503 errors. Our engineers quickly identified our queuing system as the culprit: one of our RabbitMQ queues had a large backlog that was not emptying properly. This in turn affected our application servers, which were trying, without success, to reach the queuing servers for major or minor jobs.

To mitigate the risk of affecting important events and notifications such as billing and registrations, we use multiple queues. The queue with the problems was the one sending minor notifications, such as when a user joins a team.
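
As an illustration of the idea, and not a description of our actual configuration, routing by event type might look like the following Python sketch. The event names, the queue names and the in-memory Router class are all hypothetical stand-ins for a real broker.

```python
from collections import defaultdict, deque

# Hypothetical event types that must never wait behind minor notifications.
CRITICAL_EVENTS = {"billing", "registration"}


def queue_for(event_type):
    """Return the name of the queue an event should be published to."""
    if event_type in CRITICAL_EVENTS:
        return "critical"
    return "minor_notifications"


class Router:
    """In-memory stand-in for a broker: one FIFO per named queue."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def publish(self, event_type, payload):
        name = queue_for(event_type)
        self.queues[name].append((event_type, payload))
        return name


router = Router()
router.publish("billing", {"invoice": 1})

# A burst of minor events piles up in its own queue only.
for user_id in range(1000):
    router.publish("team_join", {"user": user_id})

print({name: len(q) for name, q in router.queues.items()})
```

With this separation, a burst of team-join events grows only the minor queue. As the incident showed, though, a stalled queue can still hurt the application servers that publish to it.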

We first investigated whether the queue was filling up faster than it should. We then investigated whether the issue was related to the size of the queue, but that wasn't the case: our queuing system can empty thousands of jobs within seconds.
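
One simple way to test the first hypothesis is to compare queue depth readings over time. The Python sketch below assumes depth samples are already being collected by some monitoring tool. The function names and the sample values are made up for illustration and are not measurements from the incident.

```python
def net_growth_per_second(samples):
    """Average change in queue depth per second across the sample window.

    `samples` is a time-ordered list of (timestamp_seconds, queue_depth).
    A positive result means messages arrive faster than they are consumed.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, d0), (t1, d1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must increase")
    return (d1 - d0) / (t1 - t0)


def is_backing_up(samples, threshold=0.0):
    """True when the queue grows faster than `threshold` messages per second."""
    return net_growth_per_second(samples) > threshold


# Illustrative readings taken one minute apart.
growing = [(0, 100), (60, 700), (120, 1300)]
draining = [(0, 1300), (60, 700), (120, 100)]

print(net_growth_per_second(growing), is_backing_up(growing))
print(net_growth_per_second(draining), is_backing_up(draining))
```

A sustained positive growth rate indicates that consumers are not keeping up, regardless of how large the queue is at any single moment.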

To distribute the load and have the queue empty faster, we added 4 more servers to our infrastructure at 11:52 UTC and restarted our queue. We closely monitored our queues for the next few hours and everything looked normal. The issue was marked as resolved at 13:44 UTC.

A couple of minutes later another issue came up, which we cover in a separate post-mortem.

Plans for the future

We plan to review a number of items related to our queuing and notification systems over the next few weeks. Here are some of the actions we plan to take:

We're investigating how we can make our queuing system more robust. We're adding more Nagios alerts that track trends in the different queues, and we're load-testing the system against similar session spikes.
We are rewriting our notifications system to have more sane defaults and send fewer emails in general. Fewer emails is good, especially for our users' inboxes.
We're reviewing the way we queue user notifications on team operations. Large teams with thousands of translators may have dozens of events happening per minute, which causes tens of thousands of emails to be sent in a very short time.
We are reviewing and testing how we're handling batch operations when creating users over the API. User registrations are especially tricky, since they're an asynchronous operation which requires a confirmation action from the user, so we want to get them right.
We're growing our Infrastructure team. We're hiring an additional DevOps & Database Engineer to be part of our engineering team in Athens, Greece.
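
To make the notification volume concrete, the Python sketch below compares one email per event per member with a single digest per member for each batching window. It is only one possible approach, not the design we have settled on. It assumes every team event notifies every team member, and the function name, team name, addresses and counts are illustrative.

```python
from collections import defaultdict


def build_digests(events, members_by_team):
    """Collapse the events of one batching window into one message per recipient.

    `events` is a list of (team, description) pairs.
    `members_by_team` maps a team name to a list of member email addresses.
    Returns a dict mapping each recipient to the descriptions they should see.
    """
    per_team = defaultdict(list)
    for team, description in events:
        per_team[team].append(description)

    digests = defaultdict(list)
    for team, descriptions in per_team.items():
        for member in members_by_team.get(team, []):
            digests[member].extend(descriptions)
    return dict(digests)


# Illustrative numbers: a 2,500-member team with 30 events in one minute.
members = {"fr": [f"user{i}@example.com" for i in range(2500)]}
events = [("fr", f"translator{i} joined the team") for i in range(30)]

# One email per event per member.
naive_email_count = sum(len(members[team]) for team, _ in events)

# One digest per member for the whole window.
digests = build_digests(events, members)

print(naive_email_count, len(digests))
```

In this example, 30 events for a 2,500-member team would produce 75,000 individual emails, while a per-window digest produces 2,500.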

Summary

I would like to once again apologize for the impact that this incident had on your organization. We're working hard to provide the best service and we're committed to improving our consistency in providing the quality of service you expect from your localization platform. Thank you for your continued support of Transifex!

— Dimitris Glezos

Posted May 02, 2014 - 22:15 UTC

Resolved

All good so far. We're continuing to optimize small things here and there.

Posted May 02, 2014 - 13:44 UTC

Monitoring

All systems look great now. We're going to monitor them for the next few hours.

Posted May 02, 2014 - 12:34 UTC

Update

Upgrades underway. 4 new servers are live.

Posted May 02, 2014 - 11:52 UTC

Identified

We're increasing capacity on our servers and optimizing parts of the product to address the high loads.

Posted May 02, 2014 - 10:36 UTC

Update

This is related to notifications being sent for minor events on the site, such as when a person joins a team. We've been having an unusually high number of them.

Posted May 02, 2014 - 09:54 UTC

Investigating

We're facing high loads on our queuing servers, causing 503 errors on the site and slow pages. We apologize for any inconvenience.

Posted May 02, 2014 - 09:40 UTC