I need help ASAP - one of my prod services stopped deploying

Yu_Chun_Kao · February 27, 2022, 12:39pm

I need help ASAP - one of my services stopped deploying (no code changes since last successful deploy). I think the physical server hosting my app is borked and there’s no apparent way to preempt my service to a different physical server!

Service is Render · The Easiest Cloud For All Your Apps

anurag · February 28, 2022, 12:52am

We’re looking into it. Sorry about the delay in responding: since it’s the weekend, the Render team hasn’t been as attentive on this forum (we have a separate escalation process for customers with an SLA).

anurag · February 28, 2022, 1:02am

We don’t see an issue with your Clickhouse instance, except for high load. It looks like your service was unavailable for the last several hours but a plan upgrade fixed it. Are you still seeing issues?

Yu_Chun_Kao · February 28, 2022, 1:11am

There hasn’t been a plan upgrade. If you look at the logs, the deploy was just stuck for 6+ hours. Take a look at the metrics graph.

anurag · February 28, 2022, 1:14am

Sorry, yes, the plan was the same. We don’t manage Clickhouse ourselves, so it’s hard to say what happened. What was the symptom when you triggered a manual deploy ~13 hours ago?

Yu_Chun_Kao · February 28, 2022, 1:14am

Around 8pm Taipei time yesterday the service went down and never came back up
For the next hour I tried all sorts of different things to bring the service back up but nothing worked
4:28am Taipei time, a deploy finally goes through

Main things that are problematic

Nothing I can do as a user to preempt the service to actually deploy when a physical server is borked
No auto remediation on the render side to bring said service back up
No support for 12+ hours when things are down =/
Deploy takes 8 hours? Process is completely opaque to the user. I’d prefer if things just failed fast and told me why.

Yu_Chun_Kao · February 28, 2022, 1:15am

“What was the symptom when you triggered a manual deploy ~13 hours ago?”

The instance just became unresponsive. Couldn’t ssh in, couldn’t connect via http clickhouse clients, couldn’t deploy. I was made aware of the problem from external monitoring services that said Clickhouse became unresponsive.

anurag · February 28, 2022, 3:31am

We’ll investigate and post our findings here. It’s certainly frustrating when you can’t do anything while the server is down. Sorry about that.

Sean_Doughty · March 3, 2022, 8:19pm

Hello @Yu_Chun_Kao,

We are continuing to investigate why your deploy took 8 hours and how we can add tools and observability to help if this happens again. I will post back once we know more.

Yu_Chun_Kao · March 13, 2022, 1:16am

Hi Sean, do you have any updates? I feel a massive amount of anxiety every time I deploy knowing that there is a lingering bug that can cause hours of unexplained downtime. cc @anurag

Sean_Doughty · March 13, 2022, 2:08am

Hey,
We have your service switched over to a custom plan with more memory as requested. Can you try deploying your service to use the new plan on Monday during business hours so we can take a look if there are any issues?

Yu_Chun_Kao · March 13, 2022, 2:27am

Hey Sean, happy to, but that feels like we are papering over the problem. There are still open questions around observability, investigation, logs, etc. that remain unanswered. Am I to assume that we are out of avenues of investigation at this moment in time?

8 hours of unexplained downtime seems appropriate for a more rigorous investigation. I’m using the out of the box render template, nothing custom. It seems like if this can affect me, it likely affects many others.

Sean_Doughty · March 16, 2022, 4:41pm

@Yu_Chun_Kao From our investigation, your Clickhouse service took a long time to deploy because it ran out of memory. We’ve checked all other instances of Clickhouse on Render and they are continuing to deploy within a few minutes. Your deploy time should now be faster with the custom plan.

Thanks for flagging the concern around observability and logging. We treated this as an internal incident and are prioritizing the tooling gaps we found as a result, in addition to existing planned work.
