Docker service keeps rerunning `dockerCommand` indefinitely

Problem

I’m trying to start a Docker web service for a Django app. The problem is that `dockerCommand` runs over and over again indefinitely, and the app never actually becomes available.

My dockerCommand runs a build script currently containing:

poetry run ./manage.py migrate
poetry run gunicorn -b 0.0.0.0:8000 ibproduct.wsgi:application

This results in the two commands just running over and over again, with the deploy stuck on ‘In progress’.

I’ve tried upgrading my plan to Starter Plus after reading in another issue that this could be a memory problem. On that plan the commands don’t visibly repeat, but the deploy still gets stuck on ‘In progress’ and the app isn’t working.
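For reference, one way to make failures in a two-step start command visible is to wrap it in a script that aborts on error and execs gunicorn. This is a sketch, not the author’s actual setup; the `start.sh` filename is an assumption:

```shell
#!/usr/bin/env sh
# Hypothetical start.sh used as the dockerCommand.
# `set -e` stops the script if the migration fails, so gunicorn never
# starts against a half-migrated database; `exec` replaces the shell
# with gunicorn so it receives SIGTERM directly on redeploys.
set -e
poetry run ./manage.py migrate
exec poetry run gunicorn -b 0.0.0.0:8000 ibproduct.wsgi:application
```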

Questions:

  • Any thoughts on why this app isn’t working?
  • How does render decide a deploy is ‘finished’? After the dockerCommand is started and once the healthCheckUrl returns 200?
  • Is it expected behaviour for dockerCommand to retry multiple times under some conditions?

Thanks for any help you can provide.

I’ve been trying to get gunicorn to run just in the Render shell. It works fine within the same Docker image on my local machine, but on Render the following happens:

root@backend-vjls-shell:/usr/app# poetry run gunicorn -b 127.0.0.1:9000 ibproduct.wsgi
[2021-05-05 13:05:50 +0000] [164] [INFO] Starting gunicorn 20.1.0
[2021-05-05 13:05:50 +0000] [164] [INFO] Listening at: http://127.0.0.1:9000 (164)
[2021-05-05 13:05:50 +0000] [164] [INFO] Using worker: sync
[2021-05-05 13:05:50 +0000] [170] [INFO] Booting worker with pid: 170
[2021-05-05 13:05:50 +0000] [171] [INFO] Booting worker with pid: 171
[2021-05-05 13:05:50 +0000] [172] [INFO] Booting worker with pid: 172
[2021-05-05 13:05:50 +0000] [173] [INFO] Booting worker with pid: 173
[2021-05-05 13:06:15 +0000] [186] [INFO] Booting worker with pid: 186
[2021-05-05 13:06:21 +0000] [164] [WARNING] Worker with pid 178 was terminated due to signal 9
[2021-05-05 13:06:21 +0000] [164] [WARNING] Worker with pid 171 was terminated due to signal 9
[2021-05-05 13:06:21 +0000] [189] [INFO] Booting worker with pid: 189
[2021-05-05 13:06:21 +0000] [190] [INFO] Booting worker with pid: 190
[2021-05-05 13:06:29 +0000] [164] [WARNING] Worker with pid 183 was terminated due to signal 9
[2021-05-05 13:06:29 +0000] [193] [INFO] Booting worker with pid: 193

It seems like the workers keep getting killed for some reason? When I curl my app within the Render shell I just get an empty response. If I run it without gunicorn (`./manage.py runserver`), then I can at least query the app with curl within the Render shell.
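Workers terminated by signal 9 are usually the kernel’s OOM killer, which fits the memory theory below. With a fixed memory budget it can help to cap the gunicorn worker count by memory as well as CPU. A rough sketch; `per_worker_mb` is a guess at one Django worker’s footprint, not a measured value:

```python
import os

def suggested_workers(memory_mb, per_worker_mb=150, cpu_count=None):
    """Cap gunicorn workers by available memory as well as CPUs.

    per_worker_mb is an assumed resident size for one worker; measure
    your own app (e.g. with `ps`) before relying on it.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    by_cpu = 2 * cpu_count + 1            # gunicorn's usual heuristic
    by_memory = max(1, memory_mb // per_worker_mb)
    return min(by_cpu, by_memory)
```

On a 512 MB instance this suggests running far fewer workers than the CPU heuristic alone, e.g. `gunicorn --workers 1 -b 0.0.0.0:8000 ibproduct.wsgi:application`.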

Hey Andrew,

The fact that the command is getting repeatedly restarted makes me think this might be a memory issue. Since the command doesn’t restart on the higher plan, I suspect that is indeed the case and there is some other issue that is causing it to get stuck in progress. Can you share the service ID for this service?

How does render decide a deploy is ‘finished’? After the dockerCommand is started and once the healthCheckUrl returns 200?

That is correct. Render will consider your app live when it responds with a 200 for the healthcheck path.
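Since liveness is keyed to a 200 on the health-check path, a path that does no database or auth work is the safest target (unlike an admin login page, which can 500 for unrelated reasons). A framework-free WSGI sketch; the names `health_middleware` and `/healthz` are illustrative, not anything Render requires:

```python
# A WSGI wrapper that answers the health path itself and delegates
# every other request to the real application (e.g. ibproduct.wsgi).
def health_middleware(app, health_path="/healthz"):
    def wrapper(environ, start_response):
        if environ.get("PATH_INFO") == health_path:
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [b"ok"]
        return app(environ, start_response)
    return wrapper
```

In `wsgi.py` this would wrap `application` before gunicorn loads it, so `/healthz` returns 200 as soon as the process is up.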

Thanks for getting back to me.

I’m assuming this is the ID (taken from the URL): srv-c297k23onml6gjf2mnm0

I’m fine with us on a higher plan - we’ll probably want to increase that again relatively soon anyway.

Do you know why I might still be having the gunicorn issues even on the higher plan?

The issue while on the higher plan may have been a problem on our end. Could you try triggering another deploy now?

I’m seeing that the healthcheck path /stage/admin/login/ is returning a 500 response. This seems to indicate that the server is now up and responding to requests but hitting an error for that path. It won’t be marked as live as a result.

Do you have any insight into why that would return a 500?

I also just noticed the reference to 127.0.0.1 in your earlier post. You’ll want to make sure you’re listening on 0.0.0.0 rather than localhost.
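Relatedly, reading the port from the environment instead of hard-coding it avoids clashes with a platform-level port override. A sketch assuming the platform exports a `PORT` env var:

```shell
# Bind to all interfaces on the platform-provided port, falling back
# to 8000 for local runs.
exec poetry run gunicorn -b "0.0.0.0:${PORT:-8000}" ibproduct.wsgi:application
```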

I’ve been able to identify what the 500 error was and have fixed that for the health check. So now the deploy ends up at ‘live’.

However, if you actually go to that health check url (or any other path) after the deploy is live, it just returns a 502 error. Do you know why this could be?

Note that in the logs I’m seeing the health check coming through with 200:

May 5 05:28:45 PM  [2021-05-05 16:28:45 +0000] [25] [INFO] Starting gunicorn 20.1.0
May 5 05:28:45 PM  [2021-05-05 16:28:45 +0000] [25] [INFO] Listening at: http://0.0.0.0:8000 (25)
May 5 05:28:45 PM  [2021-05-05 16:28:45 +0000] [25] [INFO] Using worker: sync
May 5 05:28:45 PM  [2021-05-05 16:28:45 +0000] [31] [INFO] Booting worker with pid: 31
May 5 05:28:45 PM  [2021-05-05 16:28:45 +0000] [32] [INFO] Booting worker with pid: 32
May 5 05:28:45 PM  [2021-05-05 16:28:45 +0000] [33] [INFO] Booting worker with pid: 33
May 5 05:28:45 PM  [2021-05-05 16:28:45 +0000] [34] [INFO] Booting worker with pid: 34
May 5 05:28:49 PM  [05/May/2021 16:28:49] "GET /stage/admin/login/ HTTP/1.1" 200 1957
May 5 05:28:54 PM  [05/May/2021 16:28:54] "GET /stage/admin/login/ HTTP/1.1" 200 1957
May 5 05:28:59 PM  [05/May/2021 16:28:59] "GET /stage/admin/login/ HTTP/1.1" 200 1957
May 5 05:29:04 PM  [05/May/2021 16:29:04] "GET /stage/admin/login/ HTTP/1.1" 200 1957
May 5 05:29:09 PM  [05/May/2021 16:29:09] "GET /stage/admin/login/ HTTP/1.1" 200 1957
May 5 05:29:14 PM  [05/May/2021 16:29:14] "GET /stage/admin/login/ HTTP/1.1" 200 1957
May 5 05:29:19 PM  [05/May/2021 16:29:19] "GET /stage/admin/login/ HTTP/1.1" 200 1957

But I never even see the requests coming from me visiting the actual URL in the browser.

Your service seems to be up now. The issue on our end I referenced earlier is related to overriding the port with the PORT env var after the service is created. I failed to completely address it initially. Once the service was up and getting 502s I was able to identify the root cause and fully fix it.

I am going to see if we can get this issue prioritized so it doesn’t cause problems in the future. Let me know if you have any more issues.

Great, thanks for all your help.

Unfortunately now all my builds for this service are just hanging on ‘In progress’ with nothing to show in the logs. Any idea what’s happening?


Yeah having the same issue also over here with my builds just hanging on “In progress”


Is your service also a docker web service?

It’s a web application/server in a Docker image, if that’s what you mean


Getting 502s also on another Render project, all this while the status page says there are no problems… heh

All our systems have now been restored and we are monitoring to ensure everything stays healthy. Please let us know if you continue to see issues.

We will publish a postmortem after we conduct a full root cause analysis. Let us know if there’s anything else we can do to earn back your confidence. We know we fell short today, but we are working all the time towards higher reliability. It’s our top priority as a company.

I’m very sorry for the extended disruption.