High level of server errors
Incident Report for Transifex
Postmortem

Summary

Between 07:00 and 08:00 GMT there was a high error rate for requests to our service, and the service was intermittently unavailable. Engineers tracked the problem down to requests coming from a particular range of IP addresses, which were rate-limited in order to restore the service level.

In total the service was unavailable for about 15 minutes, which is unacceptable to our users and to us. We apologize for this and are determined to minimize the occurrence and impact of similar events in the future.

Timeline

All times are in UTC/GMT.

  • 07:09 - Engineers received alerts that certain HTTP health-check requests were failing.

  • 07:10 - An investigation was started, and it was found that the load balancer had deemed certain hosts unhealthy and removed them from the pool.

  • 07:15 - An incident was opened on our status page.

  • 07:19 - Servers operating under resource exhaustion were identified; our engineers remedied the situation and service was restored.

  • 07:33 - While the initial cause was still being investigated, the problem reappeared on other servers.

  • 07:34 - Engineers detected unusual request patterns from certain IP ranges belonging to a particular customer. At the same time, errors peaked again and the service became unavailable.

  • 07:39 - Service was restored after the exhausted servers were fixed and restrictive rate limiting was applied to the IPs that had been identified as the source of the problem.

  • 07:46 - The problematic request pattern resumed from different IPs, causing another spike in errors. The new IPs were added to the rate-limiting policy.

  • 07:50 - Our monitoring systems reported that the situation was fully resolved.

Root cause

The problem was caused by a very large number of automated requests to our API from third-party systems. The systems in question belonged to a particular customer with elevated usage requirements, for which custom policies were in effect.

On November 30th, changes appear to have been made to these third-party systems that resulted in runaway requests, effectively an abuse of our API. Unfortunately, we were not notified of these changes and did not detect the change in the request pattern in time to apply a proactive fix.

Remediation and future steps

By applying more restrictive rate limiting to requests from the offending IP ranges, we were able to prevent the problem from reappearing. At the moment these limits are statically defined, but we plan to make the abuse-prevention system more flexible and to integrate it more closely with our monitoring tools. Splitting some services into independent systems will also limit the damage that similar situations can cause in the future.
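
For illustration only, the sketch below shows one way statically defined, per-IP-range rate limiting can work: a token bucket keyed on the client's source network. The CIDR ranges, limits, and function names are hypothetical examples, not the actual values or code used in production.

    import ipaddress
    import time

    # Hypothetical, statically defined limits (requests per second) for the
    # offending source ranges; the CIDRs below are documentation-only addresses.
    STATIC_LIMITS = {
        ipaddress.ip_network("203.0.113.0/24"): 5,
        ipaddress.ip_network("198.51.100.0/24"): 5,
    }
    DEFAULT_RATE = 100  # requests per second for all other clients

    class TokenBucket:
        """Simple token bucket: refills continuously, allows short bursts."""
        def __init__(self, rate):
            self.rate = float(rate)      # tokens added per second
            self.capacity = float(rate)  # burst size = one second of traffic
            self.tokens = float(rate)
            self.updated = time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    _buckets = {}  # one bucket per limited range, or per individual client IP

    def limit_for(client_ip):
        """Return (bucket key, rate) for a client, based on the static ranges."""
        addr = ipaddress.ip_address(client_ip)
        for network, rate in STATIC_LIMITS.items():
            if addr in network:
                return network, rate
        return addr, DEFAULT_RATE

    def allow_request(client_ip):
        """True if the request should be served, False if it should get a 429."""
        key, rate = limit_for(client_ip)
        if key not in _buckets:
            _buckets[key] = TokenBucket(rate)
        return _buckets[key].allow()

    # Example: the sixth request within a second from the throttled range is refused.
    # for _ in range(6):
    #     print(allow_request("203.0.113.7"))

In practice, limits like these are usually enforced at the load balancer or API gateway rather than in application code; the sketch only illustrates the general mechanism.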

Posted Dec 01, 2016 - 13:23 UTC

Resolved
This incident has been resolved.
Posted Dec 01, 2016 - 09:53 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Dec 01, 2016 - 07:43 UTC
Investigating
We are currently investigating this issue.
Posted Dec 01, 2016 - 07:16 UTC