In our configuration, we have two microservices (Private Services) and one API Gateway (Web Service). These services communicate internally over gRPC, and the API Gateway exposes a GraphQL interface for public use. The issue we’re running into is that the API Gateway occasionally receives “Connection Reset” errors when making gRPC calls, but they go away after a few more calls (presumably because the gRPC client reconnects?). We cannot reproduce this locally using Docker Compose, which leads me to believe it has something to do with the Private Service addresses or possibly a request timeout.
We are using Rust with Tonic as our gRPC library. Not sure how helpful it’ll be (since it isn’t very descriptive), but the full error message is below.
We could potentially just retry a couple of times on failure, but I’d like to understand the root cause before doing that. The interesting thing is that it works most of the time, yet occasionally fails with this error at seemingly random times.
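If we do end up retrying, it would probably be something like the minimal sketch below, which only retries on `Unavailable` (the status Tonic typically maps transport-level failures to). The helper name, attempt count, and backoff are placeholders, not something we actually run:

```rust
use std::time::Duration;
use tonic::{Code, Status};

// Hypothetical retry helper: re-issues a call a couple of times when the
// status looks like a transport-level failure (e.g. a connection reset).
// `call` is any closure that produces a fresh future per attempt.
async fn retry_transient<T, F, Fut>(mut call: F) -> Result<T, Status>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, Status>>,
{
    let mut attempts: u64 = 0;
    loop {
        match call().await {
            // Only retry Unavailable, and at most twice, with a small backoff.
            Err(status) if status.code() == Code::Unavailable && attempts < 2 => {
                attempts += 1;
                tokio::time::sleep(Duration::from_millis(100 * attempts)).await;
            }
            other => return other,
        }
    }
}
```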
Just to provide more detail: my initial thought was that the connection was idling (or being closed) due to inactivity. However, after adding an HTTP/2 keepalive I experienced the same behavior (again, only when deployed). Is there anything I should be aware of when using long-lived connections between Private Services and Web Services?
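For context, the keepalive was added on the Tonic client endpoint, roughly along the lines of the sketch below. The host, port, and durations here are illustrative placeholders rather than our exact values:

```rust
use std::time::Duration;
use tonic::transport::{Channel, Endpoint};

// Hypothetical internal address; replace with your Private Service host/port.
const USERS_SERVICE_URL: &str = "http://users-service:50051";

async fn connect_with_keepalive() -> Result<Channel, tonic::transport::Error> {
    Endpoint::from_static(USERS_SERVICE_URL)
        // Send HTTP/2 PING frames periodically so the connection stays warm.
        .http2_keep_alive_interval(Duration::from_secs(30))
        // Drop the connection if a keepalive ping goes unanswered this long.
        .keep_alive_timeout(Duration::from_secs(10))
        // Keep pinging even when there are no in-flight requests.
        .keep_alive_while_idle(true)
        // Optional: bound how long a single connect attempt may take.
        .connect_timeout(Duration::from_secs(5))
        .connect()
        .await
}
```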
When you added the keepalive settings, did this only happen on deployments? If so, my suspicion is that a connection to a service instance that has already shut down is being reused. Is your gRPC service shutting down gracefully and closing any open connections?
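For reference, a minimal sketch of a graceful shutdown with Tonic is below. It uses `serve_with_shutdown` so the server stops accepting new connections and drains in-flight RPCs when it receives SIGTERM (which is what most platforms send before replacing an instance). The address is a placeholder, and the gRPC health service from the `tonic-health` crate stands in for your own generated service:

```rust
use tokio::signal::unix::{signal, SignalKind};
use tonic::transport::Server;

async fn run_server() -> Result<(), Box<dyn std::error::Error>> {
    let addr = "0.0.0.0:50051".parse()?;

    // Stand-in service; swap in your generated service(s) here.
    let (_reporter, health_service) = tonic_health::server::health_reporter();

    Server::builder()
        .add_service(health_service)
        // serve_with_shutdown keeps serving until the shutdown future
        // resolves, then closes the listener and lets in-flight RPCs finish.
        .serve_with_shutdown(addr, async {
            let mut sigterm = signal(SignalKind::terminate())
                .expect("failed to install SIGTERM handler");
            sigterm.recv().await;
        })
        .await?;

    Ok(())
}
```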
How long did you set the keepalive interval to?
Since these are internal connections, can you try using <service-slug>-discovery as the host for your gRPC calls? Even though you are using the private network, there is still a proxy between your services. <service-slug>-discovery resolves directly to the IP addresses of your service instances, taking our proxy out of the equation.
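As a sketch, the only change on the Tonic side would be the hostname. The `users-service` slug and port below are placeholders for your own:

```rust
use tonic::transport::{Channel, Endpoint};

// "<service-slug>-discovery" resolves straight to the instance IPs,
// bypassing the internal proxy. Port stays whatever your gRPC server binds.
async fn connect_direct() -> Result<Channel, tonic::transport::Error> {
    Endpoint::from_static("http://users-service-discovery:50051")
        .connect()
        .await
}
```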