Deploy gets stuck and never finishes

We’re investigating the issue and will post status updates here: Render Status - New deploys for dynamic apps failing in Frankfurt. Very sorry for the trouble.

It looks like, if I’m reading my tools correctly, since the deploys got stuck without timing out, the log stream no longer works (I’m not seeing anything in Papertrail) and the health check is no longer running (throughput on the URL is significantly down in New Relic). This is pretty concerning… :confused:

Site is fully down now. Frankfurt region.

Everything seems to be down: websites, databases, and deploys. As much as I like the product, these reliability issues you’ve been having are very concerning and frustrating, and we can’t have this when trying to run a business.

Happening to me too, all databases down. They look like they may be coming back, but definitely not reliable at the moment.

It’s happening to us as well, in the Frankfurt region. The deploy starts and never shows any deploy logs, just a blank box. Now when I try to deploy manually, it shows me a “Deployment Error” modal with the message “Error: Internal Server Error”.

This was due to an outage earlier today: Render Status - New deploys for dynamic apps failing in Frankfurt. We have emailed all affected users. Incredibly sorry for the disruption.

Why did it take about 5 hours for the incident to be declared on the status page? We’ve been having issues since the morning, and the status page was showing “operational” until late afternoon.

I would love to learn more about this too. My first deploy got stuck at 11am CET. I reported it here at 1pm CET. During that time it was impossible to deploy, and it looks like neither logs nor health checks were running either. The incident wasn’t acknowledged until 3.30pm CET (4.5 hours after it started), and even then only partially (“new deploys failing”). It took until 5pm CET (6 hours after it started) for the broader issues (“timeouts”) to be acknowledged.

Does Render have staff monitoring all relevant services 24/7? Is there a commitment to supporting European services? If the answer is yes, I’d love to learn more about what processes failed yesterday, in addition to the technical root causes.

Another question I had is why the status updates kept getting modified after they were posted. For example, at one point an update was sent that stated the following:

Engineers are working through a disaster recovery plan and making progress.

What happened to that disaster recovery plan and why was it suddenly removed from the status updates?

Hey @simon and @arunesh90,

It’s understandable that there are still a number of questions about what happened yesterday. We’ll be providing more transparency into the situation with the coming RCA. We will address the delayed status update and what we are going to do about it going forward.

The disaster recovery plan status update was incorrect and we later updated it to reflect the current state of the situation. Look out for more details on the events that occurred and the steps we took in the RCA.

Hey @jake

Just wondering if you happen to have an ETA or progress update on the RCA?

Isn’t there supposed to be an automated test to handle such a situation in Render?
Timeouts should be handled and reflected back to the user. Getting stuck and solving it is one issue; handling it so that it gives proper error messages is another.
I would imagine the latter to be easier to handle than fixing the actual problem itself?
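
To make the distinction concrete, here is a minimal sketch of the "reflect the timeout back to the user" half. It is not Render's deploy code: `DeployResult`, `run_with_timeout`, `hanging_step`, and `quick_step` are all hypothetical names, and the only assumption is that a deploy step can be awaited.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class DeployResult:
    ok: bool
    message: str


async def run_with_timeout(step, timeout_s: float) -> DeployResult:
    """Run a deploy step, turning a hang into an explicit, user-visible error."""
    try:
        await asyncio.wait_for(step(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The step never finished: report that instead of hanging forever.
        return DeployResult(False, f"Deploy timed out after {timeout_s:g}s")
    except Exception as exc:
        # Any other failure is also surfaced with its message.
        return DeployResult(False, f"Deploy failed: {exc}")
    return DeployResult(True, "Deploy succeeded")


async def hanging_step():
    # Hypothetical stand-in for a deploy that never finishes.
    await asyncio.sleep(3600)


async def quick_step():
    # Hypothetical stand-in for a deploy that completes immediately.
    await asyncio.sleep(0)
```

Calling `asyncio.run(run_with_timeout(hanging_step, 0.05))` returns a failed result with a "timed out" message rather than blocking, which is the behaviour asked for above. Fixing whatever makes the step hang is a separate problem.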

Hey @jake and @anurag

Just shooting another message, assuming the previous one got overlooked :slightly_smiling_face:
Hoping to get an ETA or a progress update on the RCA. It’s important for us to have it so we can explain to our clients why their projects went down and assure them that it won’t happen again so easily, and especially not for so long.

Hi @arunesh90,

My apologies for missing your previous message. I’ve asked the team that is writing the RCA, and they’ve let me know the RCA will be published by the end of the week.

What’s the status here? I can’t get my app to deploy. It fails every time, even on manual deploys + no build cache.

Hi @ajswell,

The downtime that was reported earlier in the thread was resolved on May 7. It’s likely that your application is experiencing a different problem. I can take a look at your service to see what’s happening, though. Can you share a link to your service on the Render dashboard?

@dan Link here: https://ratemyjudge.onrender.com

It’s a Strapi app. It builds fine. It just gets stuck on “Deploying…” and then fails.

@ajswell I’ve taken a look at your service, and I’m not seeing the same thing you’re seeing. Looking at the events for your service on the Render dashboard, I see that it is deploying successfully, but that your service process is then exiting with status 1.

I don’t see any errors in your service logs, so it might be useful to add additional logging to see why your process might be exiting 1.
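
One generic way to get that extra signal, sketched here as an assumption rather than anything Render provides, is to launch the service through a small wrapper that logs how the process exited. `run_and_report` is a hypothetical helper, and the start command you pass to it depends on your app.

```python
import subprocess
import sys
import time


def run_and_report(cmd: list[str]) -> int:
    """Run the service command and log its exit status and runtime."""
    started = time.monotonic()
    # The child inherits stdout/stderr, so its own logs still stream normally.
    proc = subprocess.run(cmd)
    elapsed = time.monotonic() - started
    print(
        f"[wrapper] {cmd!r} exited with status {proc.returncode} "
        f"after {elapsed:.1f}s",
        file=sys.stderr,
        flush=True,
    )
    # On POSIX, a negative status means the process was killed by a signal.
    return proc.returncode
```

For a Node app this might be invoked as `sys.exit(run_and_report(["npm", "run", "start"]))`, assuming that is the start command. The runtime in the log line helps tell a crash at boot from one that happens after the service has been up for a while.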

Well, not sure why, but the second I reached out about the issue, the deploy worked. Will keep an eye on it. Thanks.