Postmortem for 503 errors in Oregon on 2020-12-06

Between 13:34 and 17:33 PST on Sunday, December 6, 2020, many Render services returned 503 errors at an elevated rate. We are very sorry for the impact this had on our customers and their businesses.

We have since made improvements to our platform that would have prevented this incident and will make similar incidents less likely going forward. We are actively working on additional improvements: reliability is and always will be a top priority for Render.

What Happened

At 13:34 PST, a customer API hosted in our Oregon region started holding onto incoming requests until they timed out with an error. The load balancing algorithm that manages HTTP requests across our platform is designed to handle this failure mode by rejecting traffic to the failing service. This prevents TCP connection exhaustion, which can affect other services on the same load balancer. The service in question had one of the highest rates of incoming requests in the Oregon region, and its failure exposed a bug in the load balancing algorithm that caused requests to other services to be intermittently rejected as well. These requests were rejected with a 503 (Service Unavailable) response.
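The failure mode described above can be illustrated with a minimal sketch of per-service load shedding. This is not Render's load balancing code: the class `PerServiceLimiter`, its `admit` and `release` methods, and the service names are hypothetical, and the sketch assumes a simple cap on in-flight requests per service. It shows the intended behavior, where a service that holds its requests open is shed with a 503 while other services on the same load balancer keep being served.

```python
from collections import defaultdict


class PerServiceLimiter:
    """Hypothetical sketch: cap in-flight requests per service so that one
    slow service cannot consume every connection on a shared load balancer."""

    def __init__(self, max_in_flight_per_service: int):
        self.max_in_flight = max_in_flight_per_service
        # In-flight request count, tracked separately for each service.
        self.in_flight = defaultdict(int)

    def admit(self, service: str) -> int:
        """Return 200 if the request may be proxied, 503 if it is shed."""
        if self.in_flight[service] >= self.max_in_flight:
            # Only the overloaded service is rejected; the counters of
            # other services are never consulted here.
            return 503
        self.in_flight[service] += 1
        return 200

    def release(self, service: str) -> None:
        """Call when a proxied request completes or times out."""
        if self.in_flight[service] > 0:
            self.in_flight[service] -= 1


limiter = PerServiceLimiter(max_in_flight_per_service=2)
# "slow-api" holds its requests open, so release() is never called for it.
slow_statuses = [limiter.admit("slow-api") for _ in range(3)]
# A different service on the same load balancer is still admitted.
healthy_status = limiter.admit("healthy-api")
```

In this sketch the third request to `slow-api` is shed while `healthy-api` is unaffected. A bug of the kind described above would correspond to the rejection decision leaking across services instead of staying scoped to the failing one.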

Our monitoring system immediately notified us of the issue, and we took steps to mitigate the impact until the root cause was identified and fixed.

Why It Won’t Happen Again

A fix has been rolled out across all our load balancers to prevent incidents like this from happening again. We are working on additional mechanisms to increase isolation between services in Render’s routing and load balancing layers. We are also increasing test coverage for our load balancing code, and introducing additional synthetic failure states in our continuous integration test suites to increase platform resilience during unexpected events.
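As an illustration of what a synthetic failure state might look like in a test suite, the sketch below simulates a backend that accepts connections but never responds, which mirrors the "holds requests until they time out" behavior in this incident. It is an assumption about the general technique, not a description of Render's test suite; the helper names `start_hanging_backend` and `probe` are hypothetical, and only Python's standard `socket` module is used.

```python
import socket


def start_hanging_backend() -> socket.socket:
    """Listen on an ephemeral local port but never accept or reply,
    simulating a service that holds requests open indefinitely."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(8)
    return server


def probe(port: int, timeout_s: float) -> str:
    """Send a request and classify the outcome of waiting for a reply."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as conn:
            conn.sendall(b"GET / HTTP/1.1\r\nHost: test\r\n\r\n")
            data = conn.recv(1024)
            return "responded" if data else "closed"
    except (socket.timeout, TimeoutError):
        return "timed_out"


backend = start_hanging_backend()
backend_port = backend.getsockname()[1]
# A short timeout keeps the test fast while still exercising the hang.
outcome = probe(backend_port, timeout_s=0.2)
backend.close()
```

A test built on this pattern could place such a backend behind the routing layer under test and assert that requests to other, healthy backends still succeed while this one times out.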

Our team holds itself to the highest standards when it comes to reliability. We know we failed to meet that standard in this case, and we let you down. Still, we firmly believe that we will meet it going forward with these changes in place and a renewed focus on chaos engineering. We are grateful for your continued trust in us.

If you have questions, concerns, or comments related to this incident, please reach out to us here or email us at support@render.com.