I have an established setup with a Rails web service and a database service, using a custom URL. I run a few Puma workers. The application loads just fine… most of the time.
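For reference, the Puma setup is nothing unusual; roughly something like this (the worker/thread counts and env var names here are illustrative, not necessarily our exact values):

```ruby
# config/puma.rb -- a minimal sketch of this kind of clustered Puma setup.
# Values are illustrative placeholders, not taken from the actual app.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads, max_threads

port ENV.fetch("PORT", 3000)
environment ENV.fetch("RAILS_ENV", "production")

# Load the app before forking workers so they share memory via copy-on-write.
preload_app!
```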
Every once in a while, any GET or POST can result in a 502 gateway error. I got one this morning when going to the root URL. The timing seems random, and there is zero information in the logs (I have the request ID of my most recent one).
I made a request to your service and, conveniently enough, got a 502 error as well. Using the rayID that’s exposed on the error page, I was able to look up the request and can see that we responded with a 502 from the Render proxy. What would be useful is the value of the x-render-routing header, as that can further help debug the cause. All my subsequent requests were successful, though.
I have seen these errors pop up intermittently as well, and I managed to catch both the rayID and the x-render-routing header for my app. This seems to happen right as a new deploy is switching over from the old servers to the new servers.
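In case it helps anyone else catch these, here is roughly the kind of script I would use to grab the headers for a single request instead of relying on the browser. The URL is a placeholder, and I just dump every response header so that x-render-routing and whatever header carries the ray ID are not missed:

```ruby
# Sketch: make one request and print the status plus all response headers.
require "net/http"
require "uri"
require "time"

uri = URI("https://example.onrender.com/")  # placeholder URL
response = Net::HTTP.get_response(uri)

puts "#{Time.now.utc.iso8601} status=#{response.code}"
response.each_header do |name, value|
  puts "  #{name}: #{value}"
end
```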
I found this other thread referencing the “dynamic-paid-error”:
It suggests it is our application that served the error. Digging into our logs, I can see:
Our new service instance was spun up and served its first healthcheck request at 2022-10-28T20:15:32.468842Z. (syslog.appname = web-svrgt)
Render sent the shutdown signal to our old service instance running the Puma web process (we’re running Rails) at 2022-10-28T20:16:37Z. (syslog.appname = web-6d667)
The Date response header on the failed request indicates it happened at 2022-10-28T20:16:51Z (Fri, 28 Oct 2022 20:16:51 GMT).
On the new server instance, the only request logs around the time of the failed browser request were successful healthcheck requests.
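For context, a healthcheck endpoint like the one the new instance was answering can be as simple as a lightweight Rails route; a rough sketch (the /healthz path is an assumed example, not necessarily what our service uses):

```ruby
# config/routes.rb -- minimal healthcheck route sketch; path name is assumed.
Rails.application.routes.draw do
  # Respond 200 with a plain-text body so the platform's health probe passes.
  get "/healthz", to: proc { [200, { "content-type" => "text/plain" }, ["ok"]] }
end
```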
Given this, it seems like Render might have continued routing traffic to the old service instance (web-6d667) after sending it the shutdown signal. If that is true, it kept doing so for at least 14 seconds after the old instance was told to shut down (20:16:37 to 20:16:51), and based on what I saw in the browser, likely 20+ seconds.
Subjectively, this feels correct based on what I have observed. During the deploy, I was rapidly hitting refresh and saw the old version, then consistent 502 errors, then some 502 errors interspersed with the new version, then finally, the new version was fully live and everything worked great.
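For what it’s worth, this is roughly how I would script that refresh loop next time so the 502 window gets hard timestamps instead of being eyeballed in a browser. The URL, interval, and duration are placeholders:

```ruby
# Sketch: poll the site across a deploy and log status + x-render-routing per request.
require "net/http"
require "uri"
require "time"

uri = URI("https://example.onrender.com/")  # placeholder URL
deadline = Time.now + 120                   # poll for ~2 minutes spanning the deploy

while Time.now < deadline
  begin
    response = Net::HTTP.get_response(uri)
    routing = response["x-render-routing"]  # nil if the header isn't present
    puts "#{Time.now.utc.iso8601} status=#{response.code} x-render-routing=#{routing}"
  rescue StandardError => e
    puts "#{Time.now.utc.iso8601} error=#{e.class}: #{e.message}"
  end
  sleep 0.25
end
```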
Does my conclusion seem right? Let me know if this helps or if there is anything else I can add.