Rails apps can go into a bootloop

On some OOM kills, Render does not clear tmp/pids/server.pid (used by Puma, the Rails default server, to prevent two servers from running simultaneously). This causes Rails to fail to boot and the process to exit immediately; Render tries to restart it, only for the exact same thing to happen. The end result is downtime until you re-deploy.

I reported this a long time ago in Slack as well. Unfortunately, Render logs only go back about six minutes, so I've lost the exact logs (they aren't present on Logentries either), but below is a screenshot of tonight's event on srv-bho3afi0gk6pimeflgug:

I’ve figured out how to trigger this, and have a minimal reproducible example here: https://dirty-exit.onrender.com/ (note: it may be down for obvious reasons). The code for it is here: https://github.com/KMarshland/render-dirty-exit (to test out dirty exits on render.com).

It “works” by mallocing 1GB when you go to /dirty-exit. The logs look like:

Mar 2 11:13:15 PM  ==> Starting service with 'bundle exec rails s'
Mar 2 11:13:18 PM  => Booting Puma
Mar 2 11:13:18 PM  => Rails 6.1.3 application starting in production
Mar 2 11:13:18 PM  => Run `bin/rails server --help` for more startup options
Mar 2 11:13:18 PM  A server is already running. Check /tmp/puma-server.pid.
Mar 2 11:13:18 PM  Exiting
Mar 2 11:13:37 PM  ==> Starting service with 'bundle exec rails s'
Mar 2 11:13:39 PM  => Booting Puma
Mar 2 11:13:39 PM  => Rails 6.1.3 application starting in production
Mar 2 11:13:39 PM  => Run `bin/rails server --help` for more startup options
Mar 2 11:13:39 PM  A server is already running. Check /tmp/puma-server.pid.
Mar 2 11:13:39 PM  Exiting
Mar 2 11:14:10 PM  ==> Starting service with 'bundle exec rails s'
Mar 2 11:14:12 PM  => Booting Puma
Mar 2 11:14:12 PM  => Rails 6.1.3 application starting in production
Mar 2 11:14:12 PM  => Run `bin/rails server --help` for more startup options
Mar 2 11:14:12 PM  Exiting
Mar 2 11:14:12 PM  A server is already running. Check /tmp/puma-server.pid.

A hack to get it to work is to rm -f /tmp/puma-server.pid in the start command.
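For example, with the pidfile path from the logs above (a sketch only; the path and start command may differ per app), the start command becomes:

# clear any stale pidfile left behind by an ungraceful exit, then boot normally
rm -f /tmp/puma-server.pid && bundle exec rails s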

Note also that it’s able to recover as expected a reasonable percentage of the time – just not always.


Though I haven’t had time to test this, I suspect that it doesn’t actually matter how the process dies, as long as it’s not graceful – it’s just that OOM-killing is the most common cause of death in the real world. You could test this by causing a segfault or by doing something like:

#include <ruby.h>
#include <stdint.h>
#include <string.h>

VALUE rails_crasher_crash_stack(VALUE self) {
    uint8_t encoded[10*1024*1024]; // ruby's stack is iirc only two megabytes, so this should in theory cause a crash

    // write the buffer and add a compiler barrier (GCC/Clang) so it can't be optimized out
    memset(encoded, 0xff, sizeof(encoded));
    __asm__ __volatile__("" :: "r"(encoded) : "memory");

    return Qnil;
}

Hey Kai,

Thanks for the detailed write-up. Render simply restarts a failed container when it’s terminated and leaves the filesystem in the same state. From my understanding, Rails is generally responsible for managing the server.pid file, so I’m unsure whether Render should be getting involved in updating the state of the filesystem here.

My initial thought was what you suggested: removing the file before running the start command. It looks like you could also use a dynamic lockfile path with -P to avoid conflicts. Is there any reason this wouldn’t work for you? Do you have thoughts on what Render should be doing differently to handle this?
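A dynamic path could look something like this (an untested sketch; -P sets the Rails server’s PID file path, which is what acts as the lock here):

# give each boot its own pidfile so a stale one from a killed container never collides
bundle exec rails s -P "/tmp/puma-$(date +%s).pid"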

Hi Jake,

The general thought is that Render’s Rails configuration should work reliably out of the box. Workarounds are all well and good – and certainly much better than nothing! – but in my opinion the value of Render over, say, running on EC2 is being able to trust that Render will handle reliability issues like this one. It’s worth noting that there’s nothing special about my minimal reproducible example other than taking a normal action to its logical extreme. Plenty of gems have native extensions that malloc memory. Even your example Rails project is vulnerable to this, as it uses the pg gem. The issue isn’t that Rails is bad at cleaning up after itself; it’s that a Render OOM kill doesn’t allow Rails to exit gracefully.

If I were you, I would consider two fixes:

  1. Force-clear the server.pid (see the sketch after this list). As you say, Render typically isn’t involved in updating the filesystem, so this does come with disadvantages from a design perspective, but it’s quick and reliable. It’s reliable even in cases where it’s the user’s fault (e.g. if I were to try to put 10MB on the Ruby stack), which I like from the perspective of Render’s value proposition being “you can trust us to handle reliability”.
  2. Allow services to exit gracefully when they run out of memory. This could be done in a way that improves uptime in other respects while you’re at it, though I imagine it would take longer to implement.
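To illustrate the first option, here is a rough sketch of a pre-start wrapper (hypothetical; the paths cover both the Rails default and the one from my repro, and the exec pattern assumes the wrapper is handed the real start command):

#!/bin/sh
# Hypothetical pre-start wrapper: clear any pidfile left behind by an
# ungraceful exit (e.g. an OOM kill) before running the user's start command.
rm -f tmp/pids/server.pid /tmp/puma-server.pid
exec "$@"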

Good points here, Kai. I think it makes sense for us to do something to defend against this case.

I don’t believe the second suggestion is feasible in a Kubernetes environment. Kubernetes doesn’t support container memory swap and also doesn’t support sending a signal other than SIGKILL when a container runs out of memory.

I think the best route forward is to automatically clear the lock file on startup. I’m going to look into adding that functionality and I’ll let you know where I land with it within a couple of days.

Thanks for the thought-out replies here.


Thanks for the repro, Kai. I’m working on something that should fix this.

Thanks to both of you for being on top of it. If you have any questions about the example, let me know – I churned it out pretty quickly and didn’t pay attention to little details like “documentation”.

Hey @kai, a fix has been rolled out and should apply to your service after a rebuild.
