Zombie workers that we can't stop

We’ve recently migrated a number of worker processes from Heroku to Render, but last night we ran into an issue with zombie worker processes that are still running several hours later, with no way to stop them.

We’re currently stuck with 3 zombie worker processes that are flooding our logs with error messages and overloading one of our Redis instances. I tried re-deploying the workers and suspending them all (I even deleted one worker from the dashboard), but the zombie processes are still running. Can someone from Render kill these for us?

From our logs:
  • srv-c9eabmp0gd076846l900-78df977b87-r52mp service-r52mp
  • srv-c9gqfjkobjdb4g3bm720-794f594c4b-cwsxb service-cwsxb
  • data-worker worker-npndn

Please DM me if you need more info.

Context: last night we had a failover on our database (not hosted on Render), which caused our Render workers to error, which in turn caused some issues with our Redis instance (also not hosted on Render). Our Render workers got into a restart loop that lasted several hours. This morning we tried manually restarting and suspending all of the workers, but noticed that 3 workers were still running. I’ve had to move all of our workers back to Heroku.

We’re really enjoying Render so far, but not being able to actually stop a worker process after suspending it (or even deleting it in the dashboard) is really concerning, especially when the zombie processes could be destabilizing our infra.

Thanks,
Jason

It looks like all of the zombie workers have stopped now. One stopped at 09:30:24 PT and another at 10:30:54 PT.

Please let me know if this can be looked into. Ideally, suspending a worker would force-stop all instances of the service within some reasonable time (a minute or two).

Thanks,
Jason

Hi Jason,

Sorry for the delayed response, and for the issue you faced here. When you suspend a Render worker we immediately attempt to cancel any in-progress deploys and scale the worker down to 0 instances. It’s possible something went wrong with this process on our end given what you observed in your logs.

We will certainly look into this more and attempt to reconstruct what happened using our internal observability tools. Please let us know if you have any other information which might be helpful, or if you see this issue happen again. You can email support@render.com if your message includes information that you don’t want to share publicly.

Sincerely,
David

Thanks, David, for getting back to us and looking into this!

We’ve been working on our DB failover handling on Render, so our instances should be a little more robust to this now.
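For anyone else hardening workers against this failure mode, here’s a minimal sketch of the reconnect-with-backoff pattern we’ve been adding (the helper name and thresholds are illustrative, not our production code) — the idea is to wait increasingly long between reconnect attempts instead of crash-looping the way our workers did during the failover:

```python
import random
import time


def with_backoff(op, max_attempts=6, base_delay=0.5, max_delay=30.0,
                 retryable=(ConnectionError,)):
    """Retry `op` on transient errors with exponential backoff plus jitter.

    Illustrative sketch: instead of letting a worker crash-loop when its
    database or Redis goes away during a failover, it backs off between
    reconnect attempts and only gives up after `max_attempts` tries.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # give up and let the platform restart the worker
            delay = min(max_delay, base_delay * 2 ** attempt)
            # jitter spreads out reconnects so workers don't stampede the DB
            time.sleep(delay * random.uniform(0.5, 1.0))
```

In our case `op` would wrap the Redis reconnect — e.g. with redis-py, something like `lambda: client.ping()` while catching `redis.ConnectionError` (which subclasses the built-in `ConnectionError`).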

I am seeing two of our workers stuck in error loops right now, which look similar to last week’s:

  • srv-c9gqfjkobjdb4g3bm700-6bbb97b59b-sjh9w service-sjh9w
  • srv-c9krjumhb05g1acsouvg-5bf4ff8486-k6nc4 service-k6nc4

They have lost their worker names in our logs, which is what happened to the zombie workers last week - could they be stuck after a deploy or something similar?

Thanks,
Jason

Hi again -

I’m still seeing a number of zombie workers in my logs that have been running for 3+ days now that I can’t stop.

I emailed support@render.com with more details, but it would be great if they could be force-killed, since they are just spamming my logs with errors.

It seems like Render should have something watching for these zombie workers/containers and killing any stale ones, especially when the number of running instances exceeds what is configured?
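To illustrate what I mean, here’s a rough sketch of such a reaper (the `list_running`, `desired_count`, and `kill` callables are hypothetical stand-ins for whatever a platform-side watchdog would have — this is not a real Render API):

```python
import time


def reap_excess_instances(service, list_running, desired_count, kill,
                          grace_seconds=120):
    """Illustrative watchdog pass: if more instances of a service are running
    than configured (e.g. after a suspend that should have scaled it to 0),
    kill the oldest extras once they outlive a short grace period.

    `list_running`, `desired_count`, and `kill` are hypothetical callables;
    this is a sketch of the idea, not Render's actual implementation.
    """
    # oldest instances first, so the stale ones are reaped before fresh ones
    running = sorted(list_running(service), key=lambda inst: inst["started_at"])
    excess = len(running) - desired_count(service)
    now = time.time()
    for inst in running[:max(excess, 0)]:
        # grace period avoids racing a deploy that is still rolling out
        if now - inst["started_at"] > grace_seconds:
            kill(service, inst["id"])
```

The grace period matters: without it, a watchdog like this could kill instances that are legitimately overlapping during a zero-downtime deploy.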

Thanks,
Jason

We’re on it! I’ll update you soon.

We just followed up via email. Thanks for your patience Jason.

Thanks David for the help.

Render support found that the workers I was seeing in my logs were not in fact zombie workers, but rather excessive logging from several days ago that was still being delivered to our log stream days later. The workers had been stopped correctly; the delayed logging only made it look like they were still running.

Render support mentioned that an incident was opened for the delayed logging (Render Status - High latency for service logs in Ohio region) and that the logging issue has been fixed.

Thanks for troubleshooting and fixing this confusing issue!
Jason
