Hi, I have a Celery worker that works sometimes but mostly fails with a KeyError when executing celery_tasks.crawl_linkmap_task. The first time I run the task, it usually completes correctly, but when I restart my service or call the task a second time, it always fails with a KeyError (see below). I'm not sure whether this is a Celery issue or a Render issue, but either way, I really appreciate any help!
===
Details below:
Every time it fails, I do a warm shutdown of my Celery service, wait for my Render queue to sync with my local environment, and start it back up from the top level of my local virtual environment with:
celery -A celery_tasks worker --loglevel=info
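In case it helps, here is a minimal sketch of how I can double-check registration from the broker side after the worker is back up. It uses Celery's standard inspect API against the app defined in celery_tasks.py below (the only assumption is that it runs in the same directory/venv as the worker):

# quick check: ask live workers which tasks they have registered
# app.control.inspect().registered() is standard Celery
from celery_tasks import app

if __name__ == "__main__":
    replies = app.control.inspect().registered()
    print(replies)  # maps each worker's name to the list of tasks it registered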
On my local machine I can see the task is present:
-------------- celery@Julians-MacBook-Pro.local v5.3.4 (emerald-rush)
--- ***** -----
-- ******* ---- macOS-10.16-x86_64-i386-64bit 2023-11-02 15:57:16
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app: celery_tasks:0x7fd1956dc8b0
- ** ---------- .> transport: rediss://red-cl007kas1bgc73fo0pt0:**@ohio-redis.render.com:6379//
- ** ---------- .> results: mongodb+srv://julianghadially:**@amati0.xwuxtdi.mongodb.net/
- *** --- * --- .> concurrency: 8 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> celery exchange=celery(direct) key=celery
[tasks]
. celery_tasks.crawl_linkmap_task
However, I get the following Celery KeyError in my Render queue:
Nov 2 04:01:42 PM [2023-11-02 21:01:42,503: ERROR/MainProcess] Received unregistered task of type 'celery_tasks.crawl_linkmap_task'.
Nov 2 04:01:42 PM The message has been ignored and discarded.
Nov 2 04:01:42 PM
Nov 2 04:01:42 PM Did you remember to import the module containing this task?
Nov 2 04:01:42 PM Or maybe you're using relative imports?
Nov 2 04:01:42 PM
Nov 2 04:01:42 PM Please see
Nov 2 04:01:42 PM Celery-Q - Programming Blog
Nov 2 04:01:42 PM for more information.
Nov 2 04:01:42 PM
Nov 2 04:01:42 PM The full contents of the message body was:
Nov 2 04:01:42 PM b'[[[""]], {"callbacks": null, "errbacks": null, "chain": null, "chord": null}]' (153b)
Nov 2 04:01:42 PM
Nov 2 04:01:42 PM Thw full contents of the message headers:
Nov 2 04:01:42 PM {'lang': 'py', 'task': 'celery_tasks.crawl_linkmap_task', 'id': 'e3ca819f-079c-4dd5-a4a9-7a702eedb15a', 'shadow': None, 'eta': None, 'expires': None, 'group': None, 'group_index': None, 'retries': 0, 'timelimit': [None, None], 'root_id': 'e3ca819f-079c-4dd5-a4a9-7a702eedb15a', 'parent_id': None, 'argsrepr': "([''],)", 'origin': 'gen13472@Julians-MacBook-Pro.local', 'ignore_result': False, 'stamped_headers': None, 'stamps': {}}
Nov 2 04:01:42 PM
Nov 2 04:01:42 PM The delivery info for this task is:
Nov 2 04:01:42 PM {'exchange': '', 'routing_key': 'celery'}
Nov 2 04:01:42 PM Traceback (most recent call last):
Nov 2 04:01:42 PM   File "/opt/render/project/src/.venv/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 591, in on_task_received
Nov 2 04:01:42 PM     strategy = strategies[type_]
Nov 2 04:01:42 PM KeyError: 'celery_tasks.crawl_linkmap_task'
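As I understand it, the KeyError means the Render worker's local task registry has no entry under that name. In case it's useful, here is a minimal sketch (my own debugging idea, not from the log) of how the registry can be printed in the Render environment, using the app from celery_tasks.py below:

# minimal sketch: print the task registry of the Celery app the worker imports
# app.tasks is Celery's built-in task registry (a dict keyed by task name)
from celery_tasks import app

if __name__ == "__main__":
    for name in sorted(app.tasks.keys()):
        print(name)  # 'celery_tasks.crawl_linkmap_task' should appear here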
If you want to see the code I'm running, here it is:
Filename: celery_tasks.py
from celery import Celery
import time
from time import sleep
import os
from celery.utils.log import get_task_logger
import json
import requests
from random import randint
from bs4 import BeautifulSoup
#tools
import tools
from tools import yyyymmdd_date
import datetime
from selenium.common.exceptions import WebDriverException
#import pymongo
mongo_key = os.environ.get('MONGODB_KEY')
redis_key = os.environ.get('REDIS_KEY')
print(redis_key)
bk = os.environ.get('BROWSERLESS_KEY')
print(bk)
#local = 'celery@Julians-MacBook-Pro.local'
redis = 'rediss://red-cl007kas1bgc73fo0pt0:'+redis_key+'@ohio-redis.render.com:6379'
mongo = 'mongodb+srv://julianghadially:'+mongo_key+'@amati0.xwuxtdi.mongodb.net/?retryWrites=true&w=majority'
logger = get_task_logger(__name__)
app = Celery('celery_tasks', broker=redis, backend=mongo)

@app.task()
def crawl_linkmap_task(linkmap, visited_urls=[], cap=None, bk=bk):
    logger.info('Got request - starting work')
    headers = {
        'Cache-Control': 'no-cache',
        'Content-Type': 'application/json',
    }
    date = yyyymmdd_date(datetime.date.today())
    page_texts_trackr = []
    links_trackr = []
    link_visited_dates_trackr = []
    if cap is not None:
        print("Capping urls in linkmap. Capping should cull recursively based on parsing rules. Update code.")
        cap = min(cap, len(linkmap))
        linkmap = linkmap[0:cap]
    for link in linkmap:
        try:
            if link not in visited_urls:
                t = time.time()
                print(link)
                # Request the rendered page through browserless
                data = {"url": link}
                data_json = json.dumps(data)
                response = requests.post(
                    "https://chrome.browserless.io/content?token=" + str(bk), headers=headers, data=data_json)
                logger.info(str(bk))
                if response.status_code == 200:
                    soup = BeautifulSoup(response.content, "html.parser")
                    result_textonly = soup.get_text()
                    page_texts_trackr.append(result_textonly)
                    links_trackr.append(link)
                    link_visited_dates_trackr.append(date)
                    logger.info("crawled a page")
                else:
                    logger.info("Response status code: " + str(response.status_code))
                # sleep so each request takes at least ~5 seconds, plus random jitter
                elapsed = time.time() - t
                remaining_wait = max([int(5.0 - elapsed), 0.5])
                sleep(remaining_wait + (randint(0, 100) / 100))
        except WebDriverException:
            print("WebDriverException: Failed to load site " + str(link))
            sleep(2 + (randint(0, 100) / 100))
    # driver.quit()
    logger.info('Work finished')
    return {'Text': page_texts_trackr, 'Links': links_trackr, 'Dates': link_visited_dates_trackr}
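For completeness, the call that enqueues the task looks roughly like this. It's a sketch rather than the exact caller, but it matches the argsrepr "([''],)" shown in the message headers above:

# sketch of the producer-side call that creates a message like the one logged;
# .delay(...) is the standard Celery way to enqueue a task
from celery_tasks import crawl_linkmap_task

result = crawl_linkmap_task.delay([''])  # args become ([''],), as in argsrepr
print(result.id)                         # AsyncResult id, like the 'id' header above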