Background:
We have a public-facing service built on Elixir and Phoenix running on Render. Multiple clients open a WebSocket connection to our Phoenix service and keep it open indefinitely, often for several days. “Heartbeat” messages are sent through the channel periodically to keep it open. Apart from the problem described below, the connections work fine and we see no signs of misconfiguration.
Problem:
Our clients periodically report these WebSocket connections as “closed” with a code of 1001 after being open for some (as yet undetermined) period of time. Per the WebSocket RFC (RFC 6455), this code means the endpoint is “going away”, for example because the server is going down, but the closes are not being triggered by any of our deployed processes shutting down. This causes unexpected behavior on our end, and I’d like to understand what in the stack is causing it and whether we can work around it.
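To pin down how long connections live before the 1001 arrives, something like the following could be added on the client side. This is only a rough sketch: `describeClose`, `CloseRecord`, and `CLOSE_CODE_MEANINGS` are hypothetical names made up for illustration (they are not part of Phoenix or any library), and the code table is a small subset of the codes listed in RFC 6455.

```typescript
// Hypothetical helper for logging WebSocket closes; not part of any library.
interface CloseRecord {
  code: number;
  meaning: string;
  wasClean: boolean;
  openForSeconds: number;
}

// Subset of the close codes defined in RFC 6455, section 7.4.1.
const CLOSE_CODE_MEANINGS: Record<number, string> = {
  1000: "normal closure",
  1001: "going away",
  1006: "abnormal closure (no close frame received)",
  1011: "unexpected condition on server",
};

// Build a log record from a close event's fields and the time the
// connection was opened. Timestamps are milliseconds since the epoch.
function describeClose(
  code: number,
  wasClean: boolean,
  openedAtMs: number,
  closedAtMs: number
): CloseRecord {
  return {
    code,
    meaning: CLOSE_CODE_MEANINGS[code] ?? "unrecognized",
    wasClean,
    openForSeconds: Math.round((closedAtMs - openedAtMs) / 1000),
  };
}

// Usage in a browser client, assuming `socket` is a raw WebSocket:
//   const openedAt = Date.now();
//   socket.addEventListener("close", (e) =>
//     console.log(describeClose(e.code, e.wasClean, openedAt, Date.now())));
```

If the logged `openForSeconds` values cluster around a fixed duration, that would point toward a timeout or connection recycling somewhere in front of the application rather than something random.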
Other Testing:
I’ve done some testing to rule out a browser problem (this has occurred in multiple browser versions on multiple OSes) and to attempt to rule out a problem with our code or the Phoenix library (this does not happen when communicating internally with a development build of the software). This leads me to speculate that it has something to do with Render’s load balancing, or with whatever Cloudflare is doing in front of Render (all of our requests pass through a Cloudflare IP address).
I’d like to see if anyone else has experienced this or has any insight into what might be causing the problem.