We are on the pro version of the managed Postgres product. Occasionally (once every six weeks or so) we get CPU spikes that go to 100%, and basically everything in our app becomes bottlenecked because the DB is getting throttled. Most times I restart the DB and things eventually settle back to normal, but it's obviously very disruptive and also requires manual intervention on my part.
I have observability, logging, and as much tooling as I can think of in place to try to understand why this happens when it does, but with managed resources like this there is only so much info I can get. From everything I have looked at, nothing out of the ordinary seems to be happening during these events (site traffic looks the same, queries look the same as usual, etc.).
Does anyone have any suggestions on how they would go about digging deeper to understand this issue? Are the Postgres plans virtualized and on shared resources? Is it possible it could be noisy neighbors and nothing to do with my application itself? Has anyone kept their applications hosted on Render but moved the DB to a different provider (something like Neon), and if so, how were those experiences?
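For anyone wanting to compare notes, here is a rough sketch of the kind of in-database snapshot that should still be possible on a managed instance during a spike. It assumes a Postgres version with `backend_type` in `pg_stat_activity` (10 or later) and that rows come back as dicts from whatever client is in use. The names `ACTIVITY_SQL` and `summarize_activity` are made up for illustration and are not part of any provider's tooling.

```python
from collections import Counter

# Standard pg_stat_activity columns; run this with any Postgres client
# while the CPU spike is happening (hypothetical constant name).
ACTIVITY_SQL = """
SELECT pid, state, wait_event_type,
       now() - query_start AS runtime,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND pid <> pg_backend_pid()
ORDER BY runtime DESC NULLS LAST
"""


def summarize_activity(rows):
    """Count sessions by (state, wait_event_type), most common first.

    rows: iterable of dicts with 'state' and 'wait_event_type' keys,
    e.g. the result of running ACTIVITY_SQL with a dict-style cursor.
    """
    counts = Counter()
    for row in rows:
        state = row.get("state") or "unknown"
        # A NULL wait_event_type means the session is not waiting on anything.
        wait = row.get("wait_event_type") or "none"
        counts[(state, wait)] += 1
    return counts.most_common()
```

If most sessions show up as `active` with no wait event, the queries themselves are likely burning CPU; if they are mostly waiting on `Lock` or `IO`, the bottleneck is probably somewhere else. This alone cannot confirm or rule out noisy neighbors.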
Any insight would be super helpful.