I’m trying to deploy a Django app as a Docker service. The problem is that my dockerCommand runs over and over again indefinitely and the app never actually becomes available.
My dockerCommand runs a build script currently containing:
poetry run ./manage.py migrate
poetry run gunicorn -b 0.0.0.0:8000 ibproduct.wsgi:application
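For reference, here’s a sketch of that same script written out as a standalone start script; the set -e and exec lines are additions I’m assuming are safe, so a failed migration stops the deploy instead of falling through to gunicorn, and gunicorn receives container signals directly:
#!/usr/bin/env bash
# Stop immediately if any command fails, so a failed migration
# doesn't silently continue on to starting gunicorn.
set -e
# Apply database migrations first.
poetry run ./manage.py migrate
# exec replaces the shell process so gunicorn receives signals (e.g. shutdown) directly.
exec poetry run gunicorn -b 0.0.0.0:8000 ibproduct.wsgi:application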
This results in the two commands just running over and over again, with the deploy stuck on ‘In progress’.
I’ve tried upgrading my plan to Starter Plus after reading in another thread that this could be a memory problem. On the bigger plan the commands don’t visibly repeat, but the deploy still gets stuck on ‘In progress’ and the app isn’t working.
Questions:
Any thoughts on why this app isn’t working?
How does Render decide a deploy is ‘finished’? Is it once the dockerCommand has started and the healthCheckUrl returns 200?
Is it expected behaviour for dockerCommand to retry multiple times under some conditions?
I’ve been trying to get gunicorn to run just in the Render shell. It works fine within the same Docker image on my local machine, but on Render the following happens:
root@backend-vjls-shell:/usr/app# poetry run gunicorn -b 127.0.0.1:9000 ibproduct.wsgi
[2021-05-05 13:05:50 +0000] [164] [INFO] Starting gunicorn 20.1.0
[2021-05-05 13:05:50 +0000] [164] [INFO] Listening at: http://127.0.0.1:9000 (164)
[2021-05-05 13:05:50 +0000] [164] [INFO] Using worker: sync
[2021-05-05 13:05:50 +0000] [170] [INFO] Booting worker with pid: 170
[2021-05-05 13:05:50 +0000] [171] [INFO] Booting worker with pid: 171
[2021-05-05 13:05:50 +0000] [172] [INFO] Booting worker with pid: 172
[2021-05-05 13:05:50 +0000] [173] [INFO] Booting worker with pid: 173
[2021-05-05 13:06:15 +0000] [186] [INFO] Booting worker with pid: 186
[2021-05-05 13:06:21 +0000] [164] [WARNING] Worker with pid 178 was terminated due to signal 9
[2021-05-05 13:06:21 +0000] [164] [WARNING] Worker with pid 171 was terminated due to signal 9
[2021-05-05 13:06:21 +0000] [189] [INFO] Booting worker with pid: 189
[2021-05-05 13:06:21 +0000] [190] [INFO] Booting worker with pid: 190
[2021-05-05 13:06:29 +0000] [164] [WARNING] Worker with pid 183 was terminated due to signal 9
[2021-05-05 13:06:29 +0000] [193] [INFO] Booting worker with pid: 193
It seems like the workers keep getting killed for some reason. When I curl the app from within the Render shell I just get an empty response. If I run it without gunicorn (./manage.py runserver instead), I can at least query the app with curl from the Render shell.
The fact that the command is getting repeatedly restarted makes me think this might be a memory issue. Since the command doesn’t restart on the higher plan, I suspect that is indeed the case and that some other issue is causing the deploy to get stuck on ‘In progress’. Can you share the service ID for this service?
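If it is memory, signal 9 is what the kernel’s OOM killer sends. One thing worth trying (a sketch, assuming the sync worker class you’re already using) is to run fewer workers and recycle them periodically so the container stays inside its memory limit:
# The log above shows four workers booting; cap it at two and recycle each
# worker after 500 requests to limit per-process memory growth.
poetry run gunicorn -b 0.0.0.0:8000 --workers 2 --max-requests 500 ibproduct.wsgi:application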
How does Render decide a deploy is ‘finished’? Is it once the dockerCommand has started and the healthCheckUrl returns 200?
That is correct. Render will consider your app live when it responds with a 200 for the healthcheck path.
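One way to see exactly what the health checker sees is to hit the path yourself from the Render shell; a sketch, assuming the app is listening on port 8000 inside the container:
# Print just the HTTP status code for the health check path;
# the deploy is marked live once this returns 200.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8000/stage/admin/login/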
I’m seeing that the healthcheck path /stage/admin/login/ is returning a 500 response. This seems to indicate that the server is now up and responding to requests but hitting an error for that path. It won’t be marked as live as a result.
Do you have any insight into why that would return a 500?
I’ve been able to identify and fix the 500 error on the health check path, so the deploy now ends up ‘live’.
However, if you actually go to that health check URL (or any other path) after the deploy is live, it just returns a 502 error. Do you know why this could be?
Your service seems to be up now. The issue on our end that I referenced earlier is related to overriding the port with the PORT env var after the service is created; I hadn’t completely addressed it initially. Once the service was up and returning 502s, I was able to identify the root cause and fully fix it.
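For anyone else hitting 502s after a deploy goes live, a sketch of the app-side check, assuming the PORT env var holds the port traffic is routed to: bind gunicorn to it rather than a hard-coded port, falling back to 8000 when it isn’t set.
# Listen on the port the platform expects; default to 8000 when PORT is unset.
exec poetry run gunicorn -b 0.0.0.0:${PORT:-8000} ibproduct.wsgi:application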
I am going to see if we can get this issue prioritized so it doesn’t cause problems in the future. Let me know if you have any more issues.