Why does status.render.com report the Postgres outages incorrectly?

Render reports the Postgres outage for Oregon as 1 hour 26 minutes, when it was an outage of over 2 hours. Even if you look at the timestamps in the incident report (which are not entirely accurate, and err in your favor), it is well over 1:26. This matters because people will decide whether or not to use your service based on your uptime, and you guys are clearly not accurately reporting what happened yesterday.

Would love some kind of explanation, and ideally you guys would update the incident to accurately reflect the actual outage times.

These are all valid concerns, but the underlying details are complicated. In no particular order:

  • The uptime history is based on components being set as impacted in status updates. It isn’t live monitoring.
  • There have been incidents in the past where an engineer who was following along, but not primarily involved at the time, specifically updated the component history independent of textual updates (components are otherwise set during a given update item). There was no such engineer in this situation. During this incident everyone was playing a role, entirely focused on the task at hand in some meaningful capacity, and not worrying about granular details of the status page itself. We in Support were similarly overwhelmed with communication on all of our channels (tickets, chat, nominally reviewing forum posts here as well), as well as updating the status page when there was a broadly applicable update.
  • We in Support occasionally see customers say, when opening issues, that “the Status Page is wrong” because they are experiencing an outage not listed on the Status Page. The problem is that things can happen to a specific host that are not relevant for the Status Page. Status Pages fill the role of reporting a broadly applicable platform event: something that will impact a significant chunk of a specific service type at a minimum, or much of a region, or much more, but not only a single host or a small handful of hosts.
  • Gauging a company’s reliability solely off of their Status Page is not accurate, because it doesn’t represent the totality of platform-related events that could affect a single customer.
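
To illustrate the first point, here is a minimal sketch of how a duration derived from component status changes can come out shorter than the real outage when the component is flagged late. This is not how the status page is actually implemented; the function name, status strings, and every timestamp below are invented for illustration and are not taken from the incident.

```python
from datetime import datetime, timedelta


def reported_downtime(updates):
    """Sum the time a component spent in any non-operational state.

    `updates` is a time-sorted list of (timestamp, status) tuples.
    Any status other than "operational" is treated as impacted.
    """
    total = timedelta(0)
    impacted_since = None
    for ts, status in updates:
        if status != "operational" and impacted_since is None:
            # First update that flags the component as impacted.
            impacted_since = ts
        elif status == "operational" and impacted_since is not None:
            # Component flagged as recovered; close the interval.
            total += ts - impacted_since
            impacted_since = None
    return total


# Hypothetical timeline: the outage really starts at 10:00, but the
# component is only set to impacted in the 10:30 status update.
updates = [
    (datetime(2024, 1, 1, 10, 30), "major_outage"),
    (datetime(2024, 1, 1, 12, 0), "operational"),
]
actual = datetime(2024, 1, 1, 12, 10) - datetime(2024, 1, 1, 10, 0)
reported = reported_downtime(updates)

print("reported:", reported)
print("actual:  ", actual)
print("gap:     ", actual - reported)
```

In this made-up case the page would show 1:30 while the real outage lasted 2:10, purely because of when the component was marked and cleared.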

Nevertheless, I will bring up these concerns as part of our retrospective processes. I agree that now that we have room to breathe, as the timeline of events is cemented, we should keep that data available publicly as well.

“Gauging a company’s reliability solely off of their Status Page is not accurate, because it doesn’t represent the totality of platform-related events that could affect a single customer.”

Curious about this quote. We have our own SLAs with our clients that we are trying to meet, and I def used your status page to determine whether or not we could meet those requirements. If I am trying to determine the stability of Render versus another provider, what should I be using to determine how stable the platform is?

I understand total platform does not mean an individual service, but you guys do have individual service statuses as well. In this post, I am referring strictly to Oregon’s Postgres outage, which was over 2 hours during the event. Is it asking too much to accurately report your outages now that the event is over and all hands are not on deck working on the issue? It just feels slimy overall, like you guys are trying to downplay the seriousness of what happened.
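
For a rough sense of why the difference matters for an SLA, here is a minimal sketch of the arithmetic. The 30-day window and the 99.9% target are assumptions picked for illustration, not figures from any actual SLA; the two durations are the 1:26 shown on the status page and 2 hours as a lower bound for the outage described above.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(downtime_minutes, days_in_period=30):
    """Uptime over the period, as a percentage."""
    total_minutes = days_in_period * MINUTES_PER_DAY
    return 100 * (1 - downtime_minutes / total_minutes)


def downtime_budget_minutes(target_percent, days_in_period=30):
    """Minutes of downtime allowed by an uptime target over the period."""
    total_minutes = days_in_period * MINUTES_PER_DAY
    return total_minutes * (100 - target_percent) / 100


reported = uptime_percent(86)    # 1 hour 26 minutes, as shown on the page
actual = uptime_percent(120)     # 2 hours, a lower bound
budget = downtime_budget_minutes(99.9)  # hypothetical 99.9% target

print(f"reported uptime: {reported:.3f}%")
print(f"actual uptime:   {actual:.3f}%")
print(f"99.9% budget:    {budget:.1f} minutes")
```

Under these assumptions both durations already exceed a 99.9% monthly budget, and the gap between the two figures is about 0.08 percentage points of uptime.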

This is a really complicated topic that doesn’t have any single answer. I’ve worked in support nearly my entire professional life, and for developers specifically for the last 12 years. Is a provider unstable because an application crashes because it has Windows- or Mac-specific assumptions, and when deployed in a Linux environment things go a little sideways? Is a provider unstable because they provide a shared execution computing platform and an application takes longer and times out when running on a host that has many active services running? Is a provider unstable when a customer’s application is the most resource-hungry service on the host and causes host health issues due to resource demand?

Is a provider unstable when an application encounters an unexpected restart when hosts are being cycled out for software upgrades?

These are all examples of cases I’ve handled in the past where the customer pinned the blame on the provider (not necessarily Render) despite either their application being the catalyst, or despite hosts needing to be patched for security purposes, and their application misbehaving during shutdowns.

Is it asking too much to accurately report your outages now that the event is over and all hands are not on deck working on the issue?

Engineers are:

  • Continuing to investigate not only the root issue, but also some ancillary failures that happened during the incident. We didn’t post about it in text, but there was a Log Stream outage whose component we marked as impacted during various updates.
  • Working on mitigation tasks related to items that could be immediately addressed as a result of details that came up during the incident.
  • Working on new reports and timely day-to-day issues.
  • Working on things we were working on before Tuesday.
  • Working on the RCA.
  • Being human beings who take breaks, communicate with other people, and work a limited number of hours per day.

We are still in all-hands mode. There is less immediate urgency, but we are still engaged in long-tail tasks related to the incident on top of day-to-day work.

I haven’t updated the component outage durations because I haven’t looked for the timeline yet. This isn’t slimy, and it isn’t being downplayed; we’re doing the work for our customers. The status incident will be revisited because we have to post the RCA there as well. We haven’t gotten back to it yet because work on it is still being done.