Unexplained exit status 137

At 12:53 AM 12/31 UTC, srv-btqcaaoti7j5oiis3nq0 exited with code 137, which typically indicates an OOM kill. However, the metrics page shows the memory usage remaining fairly low.

While it’s possible that there was an issue in the application code, there is nothing to indicate that in the logs or in New Relic. The data flowing through the service was normal, and I can’t think of anything that would have caused a 137 exit status code.

Is it possible that Render restarted this service for non-memory reasons? If not, any thoughts on why this service was killed?

An exit status of 137 usually means that the process was terminated by SIGKILL (128 + signal number 9, as the shell reports it). This can be due to memory issues, but that doesn’t seem to be the case here. It may also be due to some other resource constraint. It is possible for your app to exit with status 137 of its own accord, but I think that’d be unlikely unless you’re using some sort of wrapper that exits with 137.
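To make the arithmetic concrete, here is a minimal Python sketch of how a shell-style exit status maps back to a signal. The helper name `describe_exit_status` is made up for illustration, and it assumes a POSIX system where statuses above 128 follow the usual 128 + signal number convention.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper)."""
    if status > 128:
        # Shells report "killed by signal N" as 128 + N.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited normally with code {status}"


if __name__ == "__main__":
    print(describe_exit_status(137))
```

Note that this only tells you which signal ended the process, not who sent it. An OOM kill, a manual `kill -9`, and a platform-initiated kill all look the same from the exit status alone.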

After reviewing logs and metrics, it doesn’t appear to be a memory issue, which corroborates the metrics graph. My investigations haven’t revealed anything yet, but I’ve opened a case with the underlying service provider for further investigation.

Any update here? It’s been happening again every couple of days.

Hi @kai ,

From everything I can tell, our system is not OOMKilling your process, though it definitely looks suspicious. I see some spikes in the memory usage right before the process crashes, which indicates something unusual is happening in the process. The metrics sampling isn’t realtime, so we’ll miss some of the metrics right before a crash. If a memory spike is particularly sudden, it can be invisible to metrics. That said, nothing I can find indicates that your process is getting OOMKilled, looking at both our platform logs/metrics as well as the Linux kernel logs.

The memory spike does indicate something unusual happening, so is there any instrumentation you can add to your process? Either in the form of memory profiling or more verbose logging?
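As one possible form of that instrumentation, here is a rough sketch that logs the process's peak memory from a background thread, with timestamps so the last line before a crash can be lined up against the failure event. It assumes a Python service running on Linux, which may not match your stack, and the names `peak_rss_kib` and `start_memory_logger` are invented for this example.

```python
import logging
import resource
import threading
import time


def peak_rss_kib() -> int:
    # Peak resident set size of this process.
    # ru_maxrss is in KiB on Linux (it is in bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def start_memory_logger(interval_s: float = 1.0) -> threading.Event:
    """Log peak RSS every interval_s seconds until the returned event is set."""
    stop = threading.Event()

    def run() -> None:
        # Event.wait returns False on timeout, True once stop is set.
        while not stop.wait(interval_s):
            logging.info("peak_rss_kib=%d", peak_rss_kib())

    threading.Thread(target=run, daemon=True, name="memory-logger").start()
    return stop


if __name__ == "__main__":
    # Timestamped log lines make it easier to correlate with crash times.
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    stop = start_memory_logger(interval_s=0.5)
    time.sleep(2)  # stand-in for the real application work
    stop.set()
```

A short interval matters here: a sudden spike can fall between samples, so the tighter the interval, the better the odds of catching the last reading before the process is killed.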

If none of that helps, you may want to consider increasing your service’s plan temporarily to see whether that reduces the number of restarts, and whether the metrics then show your service using more memory than your current plan allows.

Thanks Dan. New Relic doesn’t turn anything up, but I’ll see if I can correlate it to any other events. Is there any way to see the exact timestamp at which a server crashes?

On the Render dashboard, the events tab for your service will show all the “Server failed” events to minute precision. If you want the exact timestamp, you can open up your browser’s developer console and find the GraphQL response containing the ServerFailed events, which will have the full timestamp.
