I have a Node web service that runs a Bull queue. The queue manages a lot of somewhat CPU-intensive tasks (lots of reading from and writing to a Render Postgres service).
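For context, here's a simplified sketch of the shape of the setup (the queue name, table, and query are placeholders, not my real code):

```js
const Queue = require('bull');
const { Pool } = require('pg');

// Placeholder names; the real queue and queries differ.
const jobQueue = new Queue('db-jobs', process.env.REDIS_URL);

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10, // connection cap -- I've experimented with this, see below
});

jobQueue.process(async (job) => {
  // Each job does a batch of reads and writes against Postgres.
  await pool.query(
    'UPDATE items SET processed = true WHERE id = $1',
    [job.data.id]
  );
});
```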
The queue starts off just fine, but eventually stalls once it has roughly 6,000 jobs in it. From there I start getting an `Error: Can't reach database server at` message in the application logs. When I look at the metrics tabs for both my web service and Postgres, both are well under their limits. Even so, I have bumped both the application and the database up to one of the higher tiers, but I'm still seeing the same behavior. If I turn on autoscaling, it scales the service up to the maximum instance count (I tried 10; I didn't go higher) and then fails completely once it has scaled all the way up. This is strange to me because both the CPU and memory usage reported in the metrics tab are well under the threshold that should trigger scaling up.
I've read that going over the allocated CPU can cause a queue to stall by blocking Node's main event loop, which does line up with the behavior I'm observing. But like I said, I have tried moving to much higher CPU tiers and am still seeing the same behavior. I have also tried lowering the number of active connections to Postgres (all the way down to 1), but that hasn't helped either.
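For what it's worth, Bull's docs suggest moving CPU-heavy work off the main event loop with sandboxed processors, where you pass a file path instead of a function so jobs run in a separate child process. A rough sketch of that pattern (file names are hypothetical):

```js
// main.js – runs in the web service
const path = require('path');
const Queue = require('bull');

const jobQueue = new Queue('db-jobs', process.env.REDIS_URL);

// Passing a file path makes Bull spawn the processor in a child
// process, so CPU-heavy work can't block this process's event loop.
jobQueue.process(2, path.join(__dirname, 'processor.js'));
```

```js
// processor.js – runs in its own child process
module.exports = async (job) => {
  // CPU-heavy work and the Postgres reads/writes go here.
};
```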
It should be noted that the queue uses the hosted Redis service for orchestration, but from what I can tell there is no problem there.
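For reference, Bull exposes events that make Redis-side trouble visible, so something like this can be used to watch for it (a minimal sketch):

```js
// Bull emits these events; logging them surfaces Redis-side problems.
jobQueue.on('stalled', (job) => {
  console.warn(`job ${job.id} stalled and will be reprocessed`);
});
jobQueue.on('failed', (job, err) => {
  console.error(`job ${job.id} failed:`, err.message);
});
jobQueue.on('error', (err) => {
  console.error('queue/Redis error:', err);
});
```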
For months I had this same app running everything (app, Postgres, and Redis) on a single small DigitalOcean droplet (4 GB memory, 2 CPUs), executing these jobs every day with zero issues. I have allocated far more powerful resources to these jobs on Render and am still having problems, and it's not clear where the actual issue is. Does anyone out there have any idea what might be happening, or can maybe point me in the right direction?