What's next after multiple outages in 2024?

We are hosting in the Oregon region and today’s outage is the third major outage we’ve observed in 2024. It appears that this one is the second involving upstream issues with Cloudflare.

I have been a big fan and advocate of Render for the past year but we are losing patience quickly. What, if anything, will be done to prevent these moving forward?


Start looking at a new hosting provider. We are, at least.

We have enjoyed Render's pricing, but the outages have cost us more than a more stable provider would have (Heroku comes to mind). At the very least, we are looking into some kind of redundancy so we can switch over to a different provider if this happens again.


Same for us. We are moving to Heroku; it seems to be a better option.

Does Heroku offer zero-downtime deployments and automatic deployments?
PS: Aaah… Heroku seems extremely pricey.

Folks, we’re working as fast as we can to get everyone back up - we can’t apologize enough.

This incident affects services that use disks. Typically, this impacts our managed Postgres and Redis services the most.

Given the size and impact of this incident, a thorough post-mortem will be done and the findings shared publicly.

John, I completely understand. I love Render wholeheartedly, and we (or at least I) don't blame you for this. These things happen. I joined this discussion not to leave Render but to have some kind of backup. I have not found a better DX-focused service than Render :slight_smile: There's certainly a difference between talking to each other as people and as someone who needs to protect their app and business.

I look forward to the post-mortem and appreciate the transparency and communication. I love Render for its features, its DX, and how easy it makes DevOps, but its reliability has increasingly concerned me lately, especially as I am a few months away from another product launch. Ideally, I would like to stay with Render. Thank you for the update; I look forward to hearing more.

You will absolutely get that.

Personally, I've been here for nearly 3 years, and this is the first incident of this magnitude I've seen. We have regional incidents, but nothing like this, which spanned all of our regions and was on us to resolve. The timing of the Cloudflare incident was coincidental and not the cause here. The initial recovery of stateless services (i.e., services without disks) was fast; however, services that used disks, whether mounted disks, managed Postgres, or managed Redis, were the most impacted and took the longest to recover.

Was a post-mortem ever published for the outage? I've looked around but haven't seen it anywhere.

Yes, it was posted on the original status page incident.