Jason, I appreciate the quick response. Unfortunately, there is no Event on the Render Dashboard that corresponds to the moment when I think my app restarted, so I still have no idea why my app is having trouble. Do you have any other ideas or suggestions?
For context: about once a day I'll be using my app, it's working fine for a while, and then it hangs. After a few minutes it starts working again. Yesterday this happened and I found the precise moment in the logs.
It was yesterday at 1:17:01 PM (log time). The key log lines are:
[SolidQueue] Claimed 1 jobs
I, [2024-04-29T18:16:29.093278 #67] INFO -- : [ActiveJob] [GetNextAIMessageJob] [33930e21-87b8-4173-9c6e-edc5e39ecf4b] Performing GetNextAIMessageJob (Job ID: 33930e21-87b8-4173-9c6e-edc5e39ecf4b) from SolidQueue(default) enqueued at 2024-04-29T18:16:29.024459609Z with arguments: 3357, 3
### GetNextAIMessageJob.perform(3357, 3, 1)
### Wait for reply
[SolidQueue] Restarting fork[67] (status: )
[SolidQueue] Starting Worker(pid=91, hostname=srv-cn7qcf7sc6pc73cbk1ug-64d5657b7b-4kdb6, metadata={:polling_interval=>0.1, :queues=>"*", :thread_pool_size=>5})
[1] - Worker 0 (PID: 103) booted in 0.0s, phase: 0
I, [2024-04-29T18:18:21.404711 #103] INFO -- : [6ee21674-2654-445a-a387-2c64c24f786c] Started GET "/cable" for 23.112.39.144 at 2024-04-29 18:18:21 +0000
I, [2024-04-29T18:18:21.413428 #103] INFO -- : [6ee21674-2654-445a-a387-2c64c24f786c] Started GET "/cable" [WebSocket] for 23.112.39.144 at 2024-04-29 18:18:21 +0000
I, [2024-04-29T18:18:21.413564 #103] INFO -- : [6ee21674-2654-445a-a387-2c64c24f786c] Successfully upgraded to WebSocket (REQUEST_METHOD: GET, HTTP_CONNECTION: Upgrade, HTTP_UPGRADE: websocket)
I, [2024-04-29T18:18:21.514497 #103] INFO -- : Turbo::StreamsChannel is streaming from Z2lkOi8vaG9zdGVkLWdwdC9Db252ZXJzYXRpb24vMTI1
[SolidQueue] Pruning dead process 360 - {"polling_interval"=>0.1, "queues"=>"*", "thread_pool_size"=>5}
[SolidQueue] Claimed 1 jobs
I, [2024-04-29T18:23:10.309082 #91] INFO -- : [ActiveJob] [GetNextAIMessageJob] [33930e21-87b8-4173-9c6e-edc5e39ecf4b] Performing GetNextAIMessageJob (Job ID: 33930e21-87b8-4173-9c6e-edc5e39ecf4b) from SolidQueue(default) enqueued at 2024-04-29T18:16:29.024459609Z with arguments: 3357, 3
### GetNextAIMessageJob.perform(3357, 3, 1)
### Wait for reply
### Finished GetNextAIMessageJob.perform(3357, 3)
I, [2024-04-29T18:23:24.835017 #91] INFO -- : [ActiveJob] [GetNextAIMessageJob] [33930e21-87b8-4173-9c6e-edc5e39ecf4b] Performed GetNextAIMessageJob (Job ID: 33930e21-87b8-4173-9c6e-edc5e39ecf4b) from SolidQueue(default) in 14525.39ms
I, [2024-04-29T18:23:25.344451 #91] INFO -- : [ActiveJob] Enqueued Turbo::Streams::BroadcastStreamJob (Job ID: b0083a97-908d-4568-85ec-1ebceb374adf) to SolidQueue(default) with arguments: "Z2lkOi8vaG9zdGVkLWdwdC9Db252ZXJzYXRpb24vMTI1", {:content=>"<turbo-stream action=\"refresh\"></turbo-stream>"}
[SolidQueue] Claimed 1 jobs
I, [2024-04-29T18:23:25.365729 #91] INFO -- : [ActiveJob] [Turbo::Streams::BroadcastStreamJob] [b0083a97-908d-4568-85ec-1ebceb374adf] Performing Turbo::Streams::BroadcastStreamJob (Job ID: b0083a97-908d-4568-85ec-1ebceb374adf) from SolidQueue(default) enqueued at 2024-04-29T18:23:25.335394484Z with arguments: "Z2lkOi8vaG9zdGVkLWdwdC9Db252ZXJzYXRpb24vMTI1", {:content=>"<turbo-stream action=\"refresh\"></turbo-stream>"}
I, [2024-04-29T18:23:25.366418 #91] INFO -- : [ActiveJob] [Turbo::Streams::BroadcastStreamJob] [b0083a97-908d-4568-85ec-1ebceb374adf] Performed Turbo::Streams::BroadcastStreamJob (Job ID: b0083a97-908d-4568-85ec-1ebceb374adf) from SolidQueue(default) in 0.87ms
I, [2024-04-29T18:23:26.311872 #103] INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c] Started GET "/conversations/125/messages?version=1" for 108.162.245.166 at 2024-04-29 18:23:26 +0000
I, [2024-04-29T18:23:26.317496 #103] INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c] Processing by MessagesController#index as HTML
I, [2024-04-29T18:23:26.317594 #103] INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c] Parameters: {"version"=>"1", "conversation_id"=>"125"}
[1] - Worker 1 (PID: 124) booted in 0.0s, phase: 0
I, [2024-04-29T18:23:28.218480 #103] INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c] Rendered layout layouts/application.html.erb (Duration: 455.2ms | Allocations: 129565)
I, [2024-04-29T18:23:28.219448 #103] INFO -- : [5082870c-319c-444c-b71e-c36f29e7804c] Completed 200 OK in 1902ms (Views: 448.0ms | ActiveRecord: 43.2ms | Allocations: 179138)
(These lines are copied from the logs in the Render Dashboard.)
Note that the word "restart" is a bit of a distraction. What actually happened is that the job started running, but a second later, before it completed, the SolidQueue supervisor process noticed that one of its forks had been terminated and replaced it. The "Restarting fork[67] (status: )" line is missing a status for the fork being replaced, which suggests it was killed with SIGKILL, but I don't know that for sure. That was my working theory, but since I see no Event on the Render Dashboard that corresponds with this time, it now seems like maybe something else is going on? I don't have any other theories or ideas of what to investigate.
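To illustrate the SIGKILL guess, here is a minimal, hypothetical Ruby sketch (not SolidQueue's actual code): a process killed with SIGKILL never gets to exit normally, so the Process::Status its parent reaps carries only the terminating signal and no exit code.

```ruby
# Hypothetical illustration: what a child killed by SIGKILL looks like when reaped.
pid = fork { sleep }           # stand-in for a SolidQueue worker fork
Process.kill("KILL", pid)      # simulate an external, untrappable SIGKILL
_, status = Process.wait2(pid) # reap the child and inspect its Process::Status

status.exitstatus  # => nil  -- no normal exit code was ever produced
status.signaled?   # => true -- the process was terminated by a signal
status.termsig     # => 9    -- Signal.list["KILL"]
```

So if something outside the app sent that fork a SIGKILL, there would be no exit code for the supervisor to report, which is at least consistent with the empty status in the log line above, though not proof of it.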