Unexplained exit status 137

At 12:53 AM UTC on 12/31, srv-btqcaaoti7j5oiis3nq0 exited with code 137, which typically indicates an OOM kill. However, the metrics page shows memory usage remaining fairly low:

While it’s possible that there was an issue in the application code, nothing in the logs or in New Relic indicates that, the data flowing through the service was normal, and I can’t think of anything that would have caused a 137 exit status.

Is it possible that Render restarted this service for non-memory reasons? If not, any thoughts on why this service was killed?

An exit status of 137 usually means the process was terminated by SIGKILL (128 + 9, where 9 is the signal number for SIGKILL). This can be due to memory issues, but that doesn’t seem to be the case here. It may also be due to some other resource constraint. It is possible for your app to exit with status 137 of its own accord, but I think that’d be unlikely unless you’re using some sort of wrapper that exits with 137.
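To make the arithmetic behind 137 concrete, here is a minimal Python sketch that decodes a shell-style exit status. It assumes the common shell convention that a status above 128 means "killed by signal (status - 128)", and that it runs on Linux, where signal 9 is SIGKILL. The function name `describe_exit_status` is made up for illustration.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (0-255).

    Assumes the usual shell convention: status > 128 means the process
    was terminated by signal number (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited normally with code {status}"


print(describe_exit_status(137))
```

Note that this is only a convention. A program or wrapper that calls `exit(137)` itself produces the same status, so 137 alone does not prove that a SIGKILL was delivered.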

After reviewing logs and metrics, it doesn’t appear to be a memory issue, which corroborates the metrics graph. My investigations haven’t revealed anything yet, but I’ve opened a case with the underlying service provider for further investigation.

Any update here? It’s been happening again every couple of days.

Hi @kai,

From everything I can tell, our system is not OOMKilling your process, though it definitely looks suspicious. I see some spikes in the memory usage right before the process crashes, which indicates something unusual is happening in the process. The metrics sampling isn’t realtime, so we’ll miss some of the metrics right before a crash. If a memory spike is particularly sudden, it can be invisible to metrics. That said, nothing I can find indicates that your process is getting OOMKilled, looking at both our platform logs/metrics as well as the Linux kernel logs.

The memory spike does indicate something unusual happening, so is there any instrumentation you can add to your process? Either in the form of memory profiling or more verbose logging?
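As one possible form of that instrumentation, here is a rough Python sketch that logs the process's peak resident memory from a background thread. It assumes a Python service running on Linux (where `ru_maxrss` is reported in KiB). The names `peak_rss_mib` and `start_memory_logger` and the 5-second interval are illustrative choices, not anything Render provides.

```python
import logging
import resource
import threading


def peak_rss_mib() -> float:
    """Peak resident set size of this process so far, in MiB.

    Assumption: Linux, where ru_maxrss is in KiB (macOS reports bytes).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def start_memory_logger(interval_s: float = 5.0) -> threading.Event:
    """Log peak RSS every interval_s seconds until the returned event is set."""
    stop = threading.Event()

    def run() -> None:
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval_s):
            logging.info("peak RSS so far: %.1f MiB", peak_rss_mib())

    threading.Thread(target=run, name="memory-logger", daemon=True).start()
    return stop
```

Calling `start_memory_logger()` once at startup (with logging configured at INFO level) would be enough. Because the value is a high-water mark, a spike that happens between two samples still shows up in the next log line, which sampled dashboard metrics can miss.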

If none of that helps, you may want to consider increasing your service’s plan temporarily to see if that reduces the number of restarts, and if the metrics show your service using more memory than your current plan allows.

Thanks Dan. New Relic doesn’t turn anything up, but I’ll see if I can correlate it to any other events. Is there any way to see the exact timestamp at which a server crashes?

On the Render dashboard, the events tab for your service will show all the “Server failed” events to minute precision. If you want the exact timestamp, you can open up your browser’s developer console and find the graphql response with ServerFailed events, which will have the full timestamp.

I had the same issue last night. Server failed and exited with code 137. Nothing in the logs as far as I can tell, memory at usual levels and far below the capacity of the Starter Plus plan. I also can’t see anything extraordinary in New Relic.

My questions:

  1. Any way to debug this further? No idea what to do to avoid this moving forward.
  2. Is there any way to automatically restart or redeploy the server? When this failure happened, all I got from Render was an email. I was lucky to see it within 5 mins. Previously I was assuming that the health check, which pings my app every few mins, would notice the downtime and auto-restart the service – but it didn’t. Are there any options?

Hi @simon,

I did some investigation and it looks like this may be because your health check path (set to /) was not responding anymore, so the system terminated your service’s instance for being unhealthy. This is definitely something we can make more clear on the dashboard (it took me a while to figure out even with my extra tools), so I’ve prioritized an internal issue to improve that.
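For anyone who wants the health check to depend on as little application logic as possible, here is a minimal Python sketch of a dedicated, cheap health endpoint. The `/healthz` path, the `HealthHandler` class, and the `make_server` helper are hypothetical names for illustration. This is not how Render implements its checks, and the health check path would need to be pointed at whatever route you actually expose.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":  # hypothetical health check path
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args) -> None:
        # Keep frequent health probes out of the application logs.
        pass


def make_server(host: str = "127.0.0.1", port: int = 0) -> ThreadingHTTPServer:
    # Port 0 lets the OS pick a free port; a deployed service would bind
    # to the host and port its platform expects instead.
    return ThreadingHTTPServer((host, port), HealthHandler)
```

The threading server handles each probe in its own thread, so one slow request does not block the health response. A check that answers without touching the database or rendering a full page is less likely to time out under load, though it also tells you less about whether the app is truly working.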

Render does automatically redeploy servers after failures, so it’s unclear why there would have been lag to replace it. This does seem to be a rare situation, but important, since you definitely shouldn’t have to manually deploy to get your service working again. I’ve also prioritized an internal issue to investigate this and fix the root cause. Since it’s rare, I wouldn’t expect that it will affect you again, but if it does, please let us know right away.

Hi @dan, thanks a lot for the quick response, I appreciate it.

Good to know that the health check wasn’t responding anymore, I wasn’t aware of that. Clearer logging / alerting would be super helpful in this case – I don’t think I can currently find this information anywhere?

Also good to hear that Render is expected to redeploy servers after health check failures. I was worried when I noticed that you didn’t, but am happy that it’s a bug, not the expected behavior. Please do let me know when you’ve found the root cause.

It may be completely unrelated, but FYI, I’ve also observed a handful of deployment timeouts recently. After a git push, the deployment starts, finishes the entire build incl. “Build successful”, but times out on the “Deploying” step. I didn’t report it because manually retriggering the deployment always worked, and it only affected ~5% of my deployments so far.