I need help ASAP - one of my prod services stopped deploying

I need help ASAP - one of my services stopped deploying (no code changes since last successful deploy). I think the physical server hosting my app is borked and there’s no apparent way to preempt my service to a different physical server!

Service is Render · The Easiest Cloud For All Your Apps

We’re looking into it. Sorry about the delay in responding: since it’s the weekend the Render team hasn’t been as attentive on this forum (we have a separate escalation process for customers with an SLA).

We don’t see an issue with your Clickhouse instance, except for high load. It looks like your service was unavailable for the last several hours but a plan upgrade fixed it. Are you still seeing issues?

There hasn’t been a plan upgrade. If you look at the logs, the deploy was just stuck for 6+ hours. Take a look at the metrics graph.

Sorry, yes, the plan was the same. We don’t manage Clickhouse ourselves, so it’s hard to say what happened. What was the symptom when you triggered a manual deploy ~13 hours ago?

  • Around 8pm Taipei time yesterday the service went down and never came back up
  • For the next hour I tried all sorts of different things to bring the service back up but nothing worked
  • 4:28am Taipei time, a deploy finally goes through

Main things that are problematic

  1. Nothing I can do as a user to force the service onto a different host so it actually deploys when a physical server is borked
  2. No auto remediation on the Render side to bring said service back up
  3. No support for 12+ hours when things are down =/
  4. Deploy takes 8 hours? Process is completely opaque to the user. I’d prefer if things just failed fast and told me why.

What was the symptom when you triggered a manual deploy ~13 hours ago?

The instance just became unresponsive. Couldn’t ssh in, couldn’t connect via http clickhouse clients, couldn’t deploy. I was made aware of the problem from external monitoring services that said Clickhouse became unresponsive.
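For anyone who wants an external check like the one mentioned above, here is a minimal sketch of a probe with a hard deadline, so a hung instance is reported as failed rather than waited on for hours. It relies on ClickHouse's HTTP `/ping` endpoint, which answers `Ok.` when the server is up. The function names, the timeout values, and the host in the usage note are assumptions for illustration, not anything Render or the monitoring service actually runs.

```python
import http.client
import time
import urllib.request
from typing import Callable


def clickhouse_ping(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if GET {base_url}/ping answers 200 with body 'Ok.'."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout_s) as resp:
            return resp.status == 200 and resp.read().strip() == b"Ok."
    except (OSError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, HTTP error status,
        # or a malformed response: all count as "not healthy".
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    deadline_s: float,
    interval_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it succeeds or deadline_s has elapsed.

    Returns True on the first successful probe and False once the deadline
    passes, so the caller can fail fast instead of waiting indefinitely.
    """
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

Hypothetical usage, assuming the instance exposes the default HTTP port 8123 at a placeholder host: `wait_until_healthy(lambda: clickhouse_ping("http://clickhouse.example.internal:8123"), deadline_s=600)` gives up after ten minutes and returns `False`, which an alerting script can turn into a page.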

We’ll investigate and post our findings here. It’s certainly frustrating when you can’t do anything while the server is down. Sorry about that.

Hello @Yu_Chun_Kao,

We are continuing to investigate why your deploy took 8 hours and how we can add tools and observability to help if this happens again. I will post back once we know more.

Hi Sean, do you have any updates? I feel a massive amount of anxiety every time I deploy, knowing that there is a lingering bug that can cause hours of unexplained downtime. cc @anurag

Hey,
We have switched your service over to a custom plan with more memory, as requested. Can you try deploying your service on the new plan on Monday during business hours, so we can take a look if there are any issues?

Hey Sean, happy to, but that feels like we are papering over the problem. There are still open questions around observability, investigation, logs, etc. that remain unanswered. Am I to assume that we are out of avenues of investigation at this moment in time?

8 hours of unexplained downtime seems to warrant a more rigorous investigation. I’m using the out-of-the-box Render template, nothing custom. If this can affect me, it likely affects many others.

@Yu_Chun_Kao From our investigation, your Clickhouse service took a long time to deploy because it ran out of memory. We’ve checked all other instances of Clickhouse on Render and they are continuing to deploy within a few minutes. Your deploy time should now be faster with the custom plan.

Thanks for flagging the concern around observability and logging. We treated this as an internal incident, and we are prioritizing better tooling to close the gaps we found, in addition to existing planned work.