Render server can't handle concurrent users

Hi,

I was doing a bit of stress testing on my application today and noticed that response times fall apart with just a few concurrent users: 1 user sees around a 6 second response time, 3 users is close to 20 seconds, and 10 users is over 30 seconds.
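To clarify what I mean by "users": the test just fires the same POST at the endpoint from N threads at once and times each response, roughly like the sketch below (simplified; the real run uses a valid session cookie and real prompts).

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://torsera-dev.onrender.com/generate_search_ideas"

def one_request(i):
    # time a single simulated user hitting the endpoint
    start = time.time()
    r = requests.post(
        URL,
        json={"userInput": "space exploration"},          # placeholder prompt
        cookies={"session": "<valid session cookie>"},    # placeholder cookie
        timeout=120,
    )
    return i, r.status_code, time.time() - start

N = 3  # number of simulated concurrent users
with ThreadPoolExecutor(max_workers=N) as pool:
    for i, status, elapsed in pool.map(one_request, range(N)):
        print(f"user {i}: HTTP {status} in {elapsed:.1f}s")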

I started adding some logs in my backend to try to find the bottleneck. I am tracking the start and end time of each request, and whether I test with one user or ten, my logs show a total compute time of around 5 seconds. My frontend, however, sees much more latency as users are added.

I have checked my memory, which never spikes above 50%, and my CPU usage, which hovers around 5%.

Here is a log from my server:

[62d44229] Response ready at: 7.769s

Nov 1 07:05:52 PM [62d44229] Response sent at: 7.769s

Nov 1 07:05:52 PM INFO: 69.180.179.235:0 - "POST /generate_search_ideas HTTP/1.1" 200 OK

Nov 1 07:05:52 PM [POST] 200 torsera-dev.onrender.com/generate_search_ideas clientIP="69.180.179.235" requestID="9b03e306-0a41-454b" responseTimeMS=16051 responseBytes=843 userAgent="python-requests/2.32.3"

The 7.769s comes from my backend logging, while the [POST] line, which shows a responseTimeMS of about 16s, is the server's request log.

I’d really appreciate some help getting to the bottom of this.

As currently written, no one can answer this question, because we don't know what your application does, how it is configured, etc.

What does /generate_search_ideas do? Does your application have concurrency? Does some aspect of /generate_search_ideas force requests to be processed one at a time?

These are not the only questions that need to be answered, but it’s a start.

Hey Jason, thanks for the response. I will update my post with more details. /generate_search_ideas is a function that calls some LLMs and returns data. Based on the logs it feels like the function itself is not being run concurrently, but from the code I have written it should be async.

@app.post('/generate_search_ideas')
async def generate_search_ideas(request: Request):
    start = time.time()
    request_id = str(uuid.uuid4())[:8]
    print(f"[{request_id}] Request received at: {start}")
    # Retrieve the session cookie from the request
    session_cookie = request.cookies.get('session')
    # If no session cookie is present, raise an Unauthorized error
    if not session_cookie:
        raise HTTPException(status_code=401, detail='Unauthorized')

    try:
        # Verify the session cookie and check if it has been revoked
        decoded_claims = auth.verify_session_cookie(session_cookie, check_revoked=True)
        # Extract the user_id from the decoded claims
        user_id = decoded_claims['user_id']
    except auth.InvalidSessionCookieError:
        # If the session cookie is invalid, raise an Unauthorized error
        raise HTTPException(status_code=401, detail='Unauthorized')

    try:
        data = await request.json()
        print(f"[{request_id}] Request parsed at: {time.time() - start:.3f}s")
        userInput = data['userInput']
        systemPrompt = "You are helpful and assist by generating related ideas for brainstorming."
        
        # Time the LLM calls
        llm_start = time.time()
        isFiction = await isFictionRelated(userInput)
        print(f"[{request_id}] Fiction check completed at: {time.time() - start:.3f}s")
        modifiedUserInput = userInput + " Unconventional ideas please." if isFiction else userInput
        initialIdeas = await fetchIdeas(modifiedUserInput, systemPrompt)
        print(f"[{request_id}] Initial ideas fetched at: {time.time() - start:.3f}s")
        print(f"""
        Time waiting for LLM: {time.time() - llm_start}
        Total request time: {time.time() - start}
        """)
        
        validInitialIdeas = list(filter(filterShortIdeas, initialIdeas))
        targetIdeaCount = 6

        if len(validInitialIdeas) < targetIdeaCount:
            print("running")
            additionalIdeas = await generateAdditionalIdeas(modifiedUserInput, validInitialIdeas, targetIdeaCount - len(validInitialIdeas))
            validAdditionalIdeas = list(filter(filterShortIdeas, additionalIdeas))
            allIdeas = validInitialIdeas + validAdditionalIdeas
            return JSONResponse(content= allIdeas, status_code=200)
        else:
            print(f"[{request_id}] Response ready at: {time.time() - start:.3f}s")
            response = JSONResponse(content=validInitialIdeas, status_code=200)
            print(f"[{request_id}] Response sent at: {time.time() - start:.3f}s")
            return response
    except Exception as e:
        logger.error(f"Error generating search ideas: {str(e)}")
        raise HTTPException(status_code=500, detail=f"An error occurred while generating search ideas: {str(e)}")

async def isFictionRelated(prompt):
    systemMessage = "Is the following prompt related to fiction or storytelling? Reply with 'yes' or 'no' only.";
    response = await openrouter_interface(openrouter_api_key,  prompt, model="mistralai/mixtral-8x7b-instruct", system_message=systemMessage, max_tokens=10, temperature=0.5);

    # this returns false if the response is not EXACTLY 'yes'   
    return response.strip().lower() == 'yes'

async def openrouter_interface(openrouter_api_key, prompt, model="", system_message="", max_tokens=750, temperature=0.9, top_p=0.7, top_k=50, repetition_penalty=1.09, max_retries=1, retry_delay=35):
    """
    Interface with Fireworks.ai API using OpenAI compatibility layer.
    """
    client = OpenAI(
        base_url="https://api.fireworks.ai/inference/v1",
        api_key=openrouter_api_key
    )
    
    # Map model names
    model_mapping = {
        "google/gemini-flash-1.5": "accounts/fireworks/models/llama-v3p1-70b-instruct",
        "cohere/command-r-08-2024": "accounts/fireworks/models/llama-v3p1-70b-instruct",
        "teknium/openhermes-2.5-mistral-7b": "accounts/fireworks/models/llama-v3p2-3b-instruct",
        "meta-llama/llama-3-8b-instruct:nitro": "accounts/fireworks/models/llama-v3p2-3b-instruct",
        "mistralai/mixtral-8x7b-instruct": "accounts/fireworks/models/llama-v3p2-3b-instruct"
    }
    
    fireworks_model = model_mapping.get(model, "accounts/fireworks/models/llama-v3p1-70b-instruct")
    messages = [{"role": "system", "content": system_message}, {"role": "user", "content": prompt}]
    
    print(f"System Message: {system_message}")

    for attempt in range(max_retries):
        try:
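            # call the Fireworks chat completions endpoint (synchronous client, no await;
            # this call blocks until the completion comes back)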
            completion = client.chat.completions.create(
                model=fireworks_model,
                messages=messages,
                max_tokens=max_tokens,
                temperature=temperature,
                top_p=1.0,
                presence_penalty=0.0,
                frequency_penalty=0.0,
                extra_body=dict(top_k=40)
            )
            
            response_text = completion.choices[0].message.content.strip()
            return response_text
        
        except Exception as err:
            print(f"Attempt {attempt + 1}: Error during request: {err}")
            if attempt + 1 == max_retries:
                return f"Failed to generate response after {max_retries} attempts. Please try again later."
            
            time.sleep(retry_delay)


async def fetchIdeas(input, systemPrompt):
   # category is never used
  #  category = await categorizeInput(input);
   commaCount = input.count(',')
   userMessagePrefix = ("Generate exactly 6 potential video related ideas for the following combined concepts that the user input -"
                        if commaCount >= 3 else
                        "Generate exactly 6 potential video related ideas for the following user input -")

   inputEnhancement = ""
   randomValue = random.random()
   if randomValue < 0.07:
       inputEnhancement = ", vox perspective"
   elif randomValue < 0.14:
       inputEnhancement = ", slight influence in terms of concepts if a speaker from TED talks were to be conceptualizing these"
   elif randomValue < 0.21:
       inputEnhancement = ", vsauce perspective"

   additionalInstruction = ""
   additionalRandomValue = random.random()
   if additionalRandomValue < 0.08:
       additionalInstruction = "Keep in mind, very strange/unexpected concepts are encouraged. Just make sure that they are compelling."
   elif additionalRandomValue < 0.16:
       additionalInstruction = "Keep in mind, very strange perspectives are encouraged."

   request_message = f'{userMessagePrefix} "{input}{inputEnhancement}". They are looking for video ideas related to this - so try to make inferences as to what they may find interesting/may be looking for. For each idea, format it as follows: First line should be the title in quotes (e.g. "The Hidden World of Dreams"), and the second line should be the description. Make sure each idea is separated by a line break. Half should be a little bit more loosely related to the user\'s query (for variety) and the other half can be a bit more closely related [Also make sure to add in a couple that are a bit more unconventional/out of left-field]. Also keep in mind that the context for these videos is that there will be generated (solo narrator) and for the visuals, we have an art generator generating imagery/visuals throughout the video that correlate with the narration - so you have some context; keep that in the back of your mind. Remember each description should be 2 sentences long. Make sure the ideas fall into the category of either interesting, entertaining, captivating, or compelling. {additionalInstruction} Remember that the videos are kind of driven by the narration, although the visuals are still important. You can provide some direction for both the narration and the visuals.'

   response = await openrouter_interface(openrouter_api_key, request_message, model="google/gemini-flash-1.5", system_message=systemPrompt, max_tokens=1600, temperature=0.65)
   
   # Split into individual ideas (title + description pairs)
   raw_ideas = [idea.strip() for idea in response.split('\n\n') if idea.strip()]

   parsed_ideas = []
   for idea_block in raw_ideas:
       lines = idea_block.split('\n')
       if len(lines) >= 2:
           # Extract title (removing quotes) and description
           title_match = re.search(r'"([^"]*)"', lines[0])
           if title_match:
               title = title_match.group(1)
               description = ' '.join(lines[1:]).strip()
               
               # Apply replacements
               title = re.sub(r'vox', 'thought-provoking', title, flags=re.IGNORECASE)
               title = re.sub(r'vsauce', 'intriguing', title, flags=re.IGNORECASE)
               title = re.sub(r'ted talk', 'thought-provoking talk', title, flags=re.IGNORECASE)
               
               description = re.sub(r'vox', 'thought-provoking', description, flags=re.IGNORECASE)
               description = re.sub(r'vsauce', 'intriguing', description, flags=re.IGNORECASE)
               description = re.sub(r'ted talk', 'thought-provoking talk', description, flags=re.IGNORECASE)
               
               parsed_ideas.append(f'"{title}" - {description}')


   if len(parsed_ideas) > 0:
       first_idea_is_intro = await isIntroduction(parsed_ideas[0])
       if first_idea_is_intro:
           print("First item is an introduction. Removing it from the list of ideas.")
           return parsed_ideas[1:]

   return parsed_ideas

What worker process is your app using?

uvicorn

Have you configured concurrency in Uvicorn?

Hey Jason, I am not exactly sure what you mean. I have tested the app with more and with fewer server workers and the results are the same.

How exactly do I configure uvicorn for concurrency?

Sorry to jump in, but this section of the docs might be useful?


Toby’s got it. We don't set $WEB_CONCURRENCY, and Uvicorn defaults to a single worker, so you need to set either $WEB_CONCURRENCY or --workers in the Start Command to parallelize requests.
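For example, a Start Command along these lines (main:app is an assumption here; use your actual module path and a worker count that fits your instance):

uvicorn main:app --host 0.0.0.0 --port $PORT --workers 4

Equivalently, add a WEB_CONCURRENCY environment variable (e.g. 4) and leave the Start Command as-is; Uvicorn uses it as the worker count when --workers isn't passed. That would also line up with the numbers above: with a single worker and roughly 5 seconds of LLM time per request, three simultaneous requests finish at about 5s, 10s, and 15s, matching the ~16s responseTimeMS even though each request only logs ~5-8s of its own compute.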
