Hello! I recently noticed that it wasn't possible to achieve zero-downtime deploys for private services, since there are no health checks for them. I tried to get around this by having a web service listen on two ports: 10000 for public-facing traffic and 8080 for private traffic. So for example, I would expose the health check on port 10000 and the rest of the API on 8080 for private traffic only.
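Roughly what that setup looks like, as a minimal Go sketch (the /health path and the handlers are just placeholders, the ports are the ones I used):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Public-facing server on :10000 only exposes the health check.
	public := http.NewServeMux()
	public.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	})

	// Private server on :8080 serves the actual API over the private network.
	private := http.NewServeMux()
	private.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "private API response")
	})

	go func() {
		log.Fatal(http.ListenAndServe(":10000", public))
	}()
	log.Fatal(http.ListenAndServe(":8080", private))
}
```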
The above works fine, but I noticed that during a deploy the private endpoint would still be down for a couple of seconds, even though the deploy waited for the health check on the public port.
Any idea if it's possible to get around this somehow? I'm not sure what the point of the private network is if services can't be deployed without downtime, so it feels like there should be some way of solving this. I realize my approach might be a bit of a hack, but it would be really nice to have this working.
I did some more testing now and actually managed to get it working with both a web and a private service. What I did was add a delay in the SIGTERM handler, which keeps the app-to-be-killed running for a while before shutting down. I set it to 20s just to test, and it seems to have done the trick. I assume the load balancer keeps sending a couple of requests to the old service for some reason, and without the delay they would not be served(?).
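For reference, this is roughly the pattern I used, as a minimal Go sketch (assuming a single private port; the 20s grace period is just the value I tested with):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Private API server (the same idea applies to the public one).
	srv := &http.Server{Addr: ":8080"}

	go func() {
		// Wait for the SIGTERM that arrives when a new deploy goes live.
		stop := make(chan os.Signal, 1)
		signal.Notify(stop, syscall.SIGTERM)
		<-stop

		// Grace period: keep serving so any requests the load balancer
		// still routes to the old instance are not dropped. 20s is just
		// the value I tested with.
		time.Sleep(20 * time.Second)

		// Then drain remaining connections and stop.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		srv.Shutdown(ctx)
	}()

	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal(err)
	}
}
```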
I would really appreciate some official feedback on this to understand whether it's a viable approach or not. And if it is, maybe it should be documented somewhere. Overall, it would be great to have some more in-depth info on how the health checks and load balancer are set up, especially when it comes to deploys.
Hey! I think my main concern is that there seems to be downtime when connecting to a service (web or private) over the private network, even with a health check set up. I'm not sure if that's a bug or not, but as I wrote above, I managed to fix it by delaying shutdown in the SIGTERM handler. Is this a viable approach?
Regarding the health check docs, it's not really clear how the load balancer directs traffic during a deploy. It looks like it still sends a few requests to the old service even though the new one has a successful health check (according to my tests above). I don't really mind this, but if it's necessary to delay the SIGTERM handler, like you would usually do in a k8s cluster, I think it should be written down or explained somewhere.