Regular health check outages lasting only a few seconds

We have a Rails app running as a Web Service. Over the last few weeks we’ve been getting short outage notifications from Freshping, and a couple of users have reported seeing the Cloudflare page.

This is how the last 3 months look in the Freshping summary. We only deployed in July, but July is as clear as August.

This isn’t causing any major issues yet, but it’s concerning. Any ideas or pointers would be great.

Hi Colin,

Can you share which services had these failures so we can take a closer look?

Hi Tyler,

Thanks for the response. The service is called pba-rails-production-web. Around the time I posted the original message I also set up monitoring on a staging environment. Neither has shown any outages since then.

Just had another short outage, lasting only a few seconds.

I’m also experiencing a similar pattern that I have been investigating. I’ve added a lot of monitoring to our stack, and from what I can tell, networking on render.com becomes unstable during these incidents. This manifests as Cloudflare 502 errors to users and network connection errors in our application.

In the incidents I’ve got telemetry for, my processes on render.com are unable to connect to other network services both inside my application’s private network and outside of it like rollbar.com.

I have a theory that it’s a TCP issue as UDP log events seem to not be impacted – since logtail.com events detailing the networking failures are still delivered. I’m not sure about that last bit – it’s sort of dependent on how the logsink works internally to render.com.
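For anyone who wants to test the TCP theory on their own service, here is a minimal sketch of a connection probe. It only attempts a TCP handshake against each target and records the time, result, and error, so failures can be lined up against the 502 reports. The target list, function names, and intervals are all assumptions for illustration; swap in your own private-network and external hosts.

```python
import socket
import time
from datetime import datetime, timezone

# Hypothetical targets: replace with your own private-network and external services.
TARGETS = [
    ("my-internal-service", 10000),  # placeholder name and port, not a real service
    ("rollbar.com", 443),
]


def probe_tcp(host, port, timeout=3.0):
    """Attempt one TCP connection; return (ok, elapsed_seconds, error_text)."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - started, ""
    except OSError as exc:  # covers DNS failures, refusals, resets, and timeouts
        return False, time.monotonic() - started, f"{type(exc).__name__}: {exc}"


def probe_all(targets, timeout=3.0):
    """Probe every target once and return one result row per target."""
    rows = []
    for host, port in targets:
        ok, elapsed, error = probe_tcp(host, port, timeout)
        rows.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "target": f"{host}:{port}",
            "ok": ok,
            "elapsed_ms": round(elapsed * 1000, 1),
            "error": error,
        })
    return rows


def run_probe_loop(targets, interval_seconds=5.0, rounds=None):
    """Probe repeatedly, printing only failures; rounds=None runs until stopped."""
    completed = 0
    while rounds is None or completed < rounds:
        for row in probe_all(targets):
            if not row["ok"]:
                print(row, flush=True)
        completed += 1
        time.sleep(interval_seconds)
```

Calling `run_probe_loop(TARGETS)` from a background worker would print one line per failed connection. If internal and external targets fail at the same moments while the process itself stays up, that points at the network rather than at any single dependency.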

It’s particularly frustrating that I opened a ticket about this issue on October 15th but it still shows up as “unread” in intercom.

My service is daisychain-production and we’re running in the Frankfurt region.

We reported this same issue (in the Frankfurt region) to Render on July 13th via this forum: Daily web services outages in Frankfurt region.
After mentioning DDOS attacks, they told us they had implemented a “fix” and performed infrastructure upgrades in late July/early August. It did not fix the outages.

On August 23rd we emailed support@render.com (ticket #18792) as requested by @John_B and @al_ps. After we provided details about every incident we had, and argued that those outages were not due to load spikes or issues on our end, they finally admitted on September 30th that:

a critical system component was unavailable: when failover occurred, some services were unreachable

On October 7th, they told us that it was fixed:

Yes, the root case is fixed now. With increased redundancy, any underlying machine failure will not make the system service unavailable. It still can happen if all machines are failed (for example, data center outage), but the chance is very low.

Sadly, the outages are still happening: we saw no improvements on our web services, quite the opposite to be honest. On October 12th, almost all of our web services responded with 502 errors for 20 minutes. Despite emails about every incident, we have had no news from support@render.com since then.

We migrated services from Heroku to Render in early May, and we have been experiencing those outages on our web services ever since. We have been reporting those for more than 3 months now, but they still have not found the root cause.

@Nathan_Woodhull @colinagile good luck! There are more and more reports about similar outages (e.g. by @jnns here, by @jerin_ceo here, by @Merovex here), so maybe Render will become aware of the seriousness of this issue.

The lack of communication is of course concerning. It is disappointing that the Render status page never reports any of those 502 outages, even though support has acknowledged them.

It’s now been four days since I opened the support ticket. It still shows up as “unread” in intercom.

There are half a dozen customers on this forum with similar issues and no acknowledgment from the Render team.

I tried emailing sales to ask about signing up for premium support but did not receive a response on that front either.

I’m feeling a bit lost and confused by the silence.

Thanks for the other replies. It’s concerning that this seems to be widespread and not getting enough attention. We also migrated from Heroku (in July), hoping for a better experience, and it really has been, so hopefully the issues will be sorted before we start to lose faith.

Hi all, we appreciate all your patience as we sort out the various issues that have been affecting your services. The difficulty here is that there aren’t any specific trends as of yet that can address every unique case, but we will work with each of you to investigate and resolve the outages you are experiencing as best we can.

@colinagile I took a look at your service and was able to confirm a few 502 responses from your healthcheck path in the past month. Would you be able to provide us with a Cloudflare Ray Id so we can investigate further? If you can send that to support@render.com and mention this thread that would be very helpful.
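For anyone unsure where to find a Ray ID: Cloudflare includes it in the `CF-RAY` response header, so a health check probe can record it whenever a 502 comes back. Below is a minimal sketch using only the Python standard library. The function name, the returned fields, and the example URL are assumptions for illustration and are not part of any Render or Freshping tooling.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone


def check_health(url, timeout=5.0):
    """Fetch a health check URL once; return the status and the CF-RAY header if present."""
    checked_at = datetime.now(timezone.utc).isoformat()
    request = urllib.request.Request(url, headers={"User-Agent": "healthcheck-probe"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status, headers = response.status, response.headers
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses (including 502 pages) still carry response headers.
        status, headers = exc.code, exc.headers
    except OSError as exc:
        # No HTTP response at all (DNS failure, refused connection, timeout).
        return {"time": checked_at, "status": None, "ray_id": None, "error": str(exc)}
    return {
        "time": checked_at,
        "status": status,
        "ray_id": headers.get("CF-RAY"),  # None when the response did not pass through Cloudflare
        "error": "",
    }
```

Running something like `check_health("https://your-app.example.com/health")` on a schedule and saving any row where `status` is 502 gives you a timestamp and Ray ID to include in the email to support.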

Thanks Tyler, the ID has been sent. FWIW, there haven’t been any outages in the last 6 days.

Just to let anyone watching the thread know, the issue looks to have been resolved. After the last message shown here, I’ve had a few email exchanges, so Render have been following up. The issue was a load balancer config issue which was fixed on the 19th, which lines up with the fact we haven’t had any outages since then.

Thanks for the help.

We are also seeing a lot of small outages in the Oregon region with no apparent cause.
We tried bumping the number of instances but it did not help at all.
