Render Load Balancer Closing Connections

Background:

We have a public-facing service built on Elixir and Phoenix running on Render. We have multiple clients that open a WebSocket connection to our Phoenix service and keep the connection open indefinitely, often over multiple days. There are “heartbeat” messages sent through the channel periodically to keep the channel open. The connection is all working fine and we see no signs of any misconfiguration there.

Problem:

Our clients are periodically showing these WebSocket connections as “closed” with a code of 1001 after being open for some (as yet indeterminate) period of time. Per the WebSocket RFC, this indicates that the endpoint is “going away”, as in the server may be going down, but this is not being triggered by any of our deployed processes shutting down. This causes unexpected behavior on our end, and I’d like to understand what in the stack is causing this and whether we can work around it.

Other Testing:

I’ve done some testing to rule out things like a browser problem (this has occurred in multiple browser versions on multiple OSes) and to attempt to rule out a problem with our code or the Phoenix library (this does not happen when communicating internally with a development build of the software). This leaves me to speculate that it has something to do with Render’s load balancing, or with whatever Cloudflare is doing in front of Render (all of our requests pass through a Cloudflare IP address).

I’d like to see if anyone else has experienced this or has any insight into what might be causing the problem.

Hi Andrew,

It sounds like you’re on the right track with your investigation. The closed connections you’re experiencing are most likely occurring when we push updates to our routing layer or when we cycle those machines.

We do try to minimize this kind of interruption. However, with any long-lived connection over the internet, there’s a good chance of disconnection.

Our recommendation in these cases is to attempt reconnections and retries with a backoff strategy.
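
As a rough illustration of what a backoff strategy can look like on the client, here is a minimal TypeScript sketch. The names `BackoffOptions`, `defaults`, and `backoffDelay` are hypothetical, and the base delay, cap, and jitter values are assumptions to tune for your own traffic, not Render-recommended settings.

```typescript
// Hypothetical helper, not part of Phoenix or any Render SDK.
interface BackoffOptions {
  baseMs: number; // delay before the first retry
  maxMs: number;  // upper bound on any single delay
  jitter: number; // fraction of the delay to randomize, between 0 and 1
}

// Assumed values for illustration only.
const defaults: BackoffOptions = { baseMs: 1000, maxMs: 30000, jitter: 0.5 };

// Exponential backoff with jitter: the delay doubles with each failed
// attempt, is capped at maxMs, and is then reduced by a random amount so
// that many clients dropped at the same moment do not all reconnect at once.
function backoffDelay(
  attempt: number,
  opts: BackoffOptions = defaults,
  rand: () => number = Math.random
): number {
  const capped = Math.min(opts.maxMs, opts.baseMs * Math.pow(2, attempt));
  const spread = capped * opts.jitter;
  return Math.round(capped - spread * rand());
}

// Upper bound of the delay for the first six attempts (jitter disabled
// by passing a rand function that always returns 0).
const schedule = [0, 1, 2, 3, 4, 5].map((n) => backoffDelay(n, defaults, () => 0));
// schedule is [1000, 2000, 4000, 8000, 16000, 30000]
```

A client would typically call `backoffDelay(attempt)` from its WebSocket `close` handler, schedule the reconnect after that many milliseconds, and reset `attempt` to zero once a connection opens successfully. If your clients use the official `phoenix` JavaScript package, its `Socket` already retries on its own, and its `reconnectAfterMs` option should accept a function of the retry count, so a helper like this could be passed there instead of writing a separate reconnect loop.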

Feel free to let us know if you have any additional questions about this.

Regards,

Matt
