Two days ago we enabled auto scaling in our production environment (service name is pyvott-api-prod) with 60% target CPU utilization and 60% target memory utilization.
For about 36 hours we only had one instance, but last night at 07:48:38 PM MDT that instance had a CPU spike up to 83.2%, so the service scaled up to 2 instances.
After the scale-up, however, most requests to our service began timing out. Occasionally a request would succeed, but as far as we could tell our app was effectively down. Strangely, nothing in our logs indicated any problems, and our health check returned 200 OK.
After disabling auto scaling about 30 minutes ago, all requests are succeeding again. Can anyone from Render investigate this issue and help us understand what went wrong?
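For context on why a single 83.2% reading against a 60% target would add exactly one instance, here is a minimal sketch of proportional target tracking. Render's actual scaling algorithm is not stated in this thread; the rule below is the one Kubernetes' Horizontal Pod Autoscaler documents, assumed here purely for illustration, and the function name and parameters are hypothetical.

```python
import math


def desired_instances(current: int, utilization_pct: float, target_pct: float) -> int:
    """Proportional target-tracking rule (assumed, not confirmed as Render's).

    desired = ceil(current * (observed utilization / target utilization)),
    never dropping below one instance.
    """
    if current < 1 or target_pct <= 0:
        raise ValueError("current must be >= 1 and target_pct must be > 0")
    return max(1, math.ceil(current * (utilization_pct / target_pct)))


if __name__ == "__main__":
    # One instance at 83.2% CPU against the 60% target from the post above.
    print(desired_instances(1, 83.2, 60.0))
```

Under this assumed rule, the spike alone is enough to justify the second instance, which is consistent with the scale-up behaving as configured.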
I’d love to help figure out what’s going on here. For debugging purposes, could you try manually scaling to 2 instances to see if this causes the same issue? This will help us to determine whether it’s autoscaling or the number of instances that’s causing the issue.
After some digging on our end, it appears that when the autoscaling event occurred, it did correctly scale to 2 instances. Matching up with what you reported, the first instance was still handling the majority of the traffic.
However, I don’t believe this is an autoscaling bug specifically, as there were very few new requests after the scale-up event. This leads me to believe that the requests your app was already handling before the scale-up were stuck performing CPU-intensive tasks that ended up maxing out the first instance.
My recommendation here would be to either try raising your service’s plan so it can handle tasks with higher CPU usage, or to try optimizing these tasks within your code.
If something like this happens again, what’s the best way for us to diagnose and troubleshoot? We opened a shell via the Render web console and ran commands like top and ps aux, but we didn’t see any high-CPU processes. This raises the question: when we run commands like these via the web console shell, are we seeing the output from just one machine, or is the output somehow aggregated from all machines?
When you open a shell, you are connected to one single instance, so any commands you run are run for just that instance and not aggregated across your instances. We do not currently have a way to shell into a specific instance, unfortunately.
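Since the shell lands on only one instance, one workaround is to have the app tag its own log lines with an instance identifier and a load reading, so the logs show which instance was busy. This is a minimal sketch, not a Render-provided tool: it assumes the platform sets a RENDER_INSTANCE_ID environment variable (falling back to the hostname if it does not), and the names instance_label, InstanceFilter, and build_logger are hypothetical.

```python
import logging
import os
import socket


def instance_label() -> str:
    # Assumption: RENDER_INSTANCE_ID is set by the platform.
    # If it is missing, fall back to the container hostname.
    return os.environ.get("RENDER_INSTANCE_ID") or socket.gethostname()


class InstanceFilter(logging.Filter):
    """Attach the instance label and 1-minute load average to each record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.instance = instance_label()
        try:
            record.load1 = os.getloadavg()[0]
        except (AttributeError, OSError):
            # getloadavg is unavailable on some platforms.
            record.load1 = float("nan")
        return True


def build_logger(name: str = "app") -> logging.Logger:
    logger = logging.getLogger(name)
    if not any(isinstance(f, InstanceFilter) for f in logger.filters):
        logger.addFilter(InstanceFilter())
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s instance=%(instance)s load1=%(load1).2f %(message)s"
        ))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    build_logger().info("request handled")
```

Logging a line like this at the start and end of each request would make it possible to tell from the logs alone whether slow requests cluster on one instance, without needing a shell on that specific machine.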