How Do You Handle Latency When Your App Depends on AI APIs?
Your application works fine when all your dependencies respond in 50-200ms. Add a call to an LLM API and suddenly your p99 latency jumps to 2.5 seconds. Your users notice. Your Slack channel fills with performance complaints. You realize you’ve built your application around the wrong assumption: that all API calls are created equal.
They aren’t. An image classification API taking 800ms to respond is catastrophic on a mobile web form but acceptable for a background job. A summarization API taking 2 seconds is fine for a report generation page but terrible for an interactive chat.
Latency is the friction between what your users expect and what your architecture delivers. When you depend on AI APIs, that friction grows because you’re surrendering control of response time to an external service.
Why AI APIs Are Different — The Numbers
First, let’s be concrete about what you’re dealing with:
- Fast LLM inference: 300–800ms (small models, short prompts, warm caches)
- Standard LLM inference: 800ms–2s (medium models, typical load, full response)
- Complex LLM operations: 2–5s (long generations, large models, complex prompts)
- Vision APIs: 1–3s (image processing, analysis, description generation)
- Embedding/Search: 100–500ms (vector operations, retrieval ranking)
Compare that to your typical backend services:
- Database queries: 5–50ms
- Cache lookups: 1–5ms
- Internal service calls: 10–100ms
The math is stark: a single LLM call adds more latency than 20 database queries. If you’re not accounting for this architecturally, you’ve already lost.
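To make that arithmetic concrete, here is a quick sketch of a single request path using midpoints of the ranges above. These are illustrative estimates, not measurements from any particular provider:

```python
# Rough latency budget for one request path, using midpoints of the
# ranges above (illustrative estimates, not measurements)
components_ms = {
    "cache lookup": 3,
    "3 database queries": 75,
    "internal service call": 50,
    "standard LLM call": 1400,
}

total_ms = sum(components_ms.values())
for name, ms in components_ms.items():
    print(f"{name:>22}: {ms:5d} ms  ({ms / total_ms:4.0%} of total)")
print(f"{'total':>22}: {total_ms:5d} ms")
```

On these numbers the LLM call alone is over 90% of the budget, which is why the rest of this article is about restructuring around it rather than shaving milliseconds elsewhere.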
The User Experience Problem — When Latency Kills
Here’s where theory meets practice. Your users experience latency differently depending on the context:
Synchronous Operations (User Is Waiting)
The user initiates an action and waits for the result. Response time is directly visible. Tolerable ranges:
- Form submission: 200–500ms max
- Search results: 300–800ms max
- Chat message response: 1s max
- Page transition: 500ms–1s max
Exceed these and users perceive the app as slow, unresponsive, or broken. If your AI feature makes an operation cross these thresholds, you need to restructure the interaction.
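If a synchronous operation must stay under a threshold, one option is to enforce the budget with a hard timeout and a non-AI fallback. A minimal sketch using asyncio; the fake AI call and the fallback are stand-ins for your own client and default logic:

```python
import asyncio

# Stand-in for a real AI client call: sleeps to simulate LLM latency
async def fake_ai_suggest(query: str) -> list[str]:
    await asyncio.sleep(2.0)  # simulated slow LLM response
    return [f"AI suggestion for {query}"]

# Cheap non-AI fallback (e.g. precomputed popular items)
def default_suggestions(query: str) -> list[str]:
    return [f"popular items matching {query}"]

async def suggestions_with_budget(query: str, budget_s: float = 0.3) -> list[str]:
    """Return AI suggestions if they arrive within the budget,
    otherwise degrade gracefully instead of blocking the user."""
    try:
        return await asyncio.wait_for(fake_ai_suggest(query), timeout=budget_s)
    except asyncio.TimeoutError:
        return default_suggestions(query)

print(asyncio.run(suggestions_with_budget("laptops")))  # → ['popular items matching laptops']
```

The point is that the user never waits past the budget: when the AI is fast enough they get the better answer, and when it isn't they still get something usable.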
Asynchronous Operations (User Isn't Waiting)
The user triggers an action but doesn't wait for the result. They get notified when it's done (email, notification, page refresh). Tolerable ranges:
- Background job: 5–60 seconds
- Report generation: 30 seconds–5 minutes
- Batch processing: Minutes to hours
Here, AI latency isn’t a problem because you’re not blocking the user.
Streaming Operations (Progressive Delivery)
The response is delivered in chunks as it's generated. The user sees progress and gets value incrementally. Tolerable ranges:
- Stream time to first token: 500ms max
- Total stream time: Open-ended (user can close anytime)
Streaming changes the latency equation because users perceive it differently. A chat application where the first token arrives in 300ms feels snappy even if the full response takes 5 seconds.
The issue: many teams default to synchronous operations when asynchronous or streaming would serve the interaction far better.
Architecture Pattern 1 — Async Jobs for Long-Running Operations
Your first instinct should be: can this AI work happen in the background?
Example: A user uploads a document and wants AI-generated summaries and key points extracted. Instead of:
User uploads → AI processing → Return response to user
[Blocks for 2-3 seconds]
Do this:
User uploads → Queue background job → Return immediately with status
Background job → AI processing → Store result → Notify user
[User never waits]
Implementation:
# API endpoint - returns immediately
@app.post("/documents/upload")
async def upload_document(file: UploadFile):
    doc_id = save_file(file)

    # Queue the AI work
    background_jobs.enqueue(
        "process_document",
        doc_id=doc_id,
        user_id=current_user.id,
    )

    return {
        "doc_id": doc_id,
        "status": "processing",
        "status_url": f"/documents/{doc_id}/status",
    }
# Background worker
def process_document(doc_id, user_id):
    document_text = load_document(doc_id)

    # Now the latency doesn't matter—this runs whenever
    summary = ai_summarizer.summarize(document_text)
    key_points = ai_extractor.extract_key_points(document_text)

    # Store results
    save_results(doc_id, summary=summary, key_points=key_points)

    # Notify user
    send_notification(user_id, f"Document {doc_id} is ready")
Your API returns in 50ms. The AI work happens in the background. The user gets notified when it’s done. Everyone’s happy.
This pattern works for:
- Document analysis
- Report generation
- Image processing
- Content moderation
- Large-scale classification
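The `status_url` returned above implies a status endpoint the client can poll. A framework-free sketch of that handler's logic; the in-memory store and field names are assumptions about where the worker persisted its output:

```python
# In-memory stand-in for wherever the worker stored its results; in the
# setup above, this function would back GET /documents/{doc_id}/status
RESULTS: dict[str, dict] = {}

def document_status(doc_id: str) -> dict:
    results = RESULTS.get(doc_id)
    if results is None:
        # Worker hasn't finished yet—the client should keep polling
        return {"doc_id": doc_id, "status": "processing"}
    return {"doc_id": doc_id, "status": "complete", **results}

print(document_status("doc-1"))  # → {'doc_id': 'doc-1', 'status': 'processing'}
```

In production you would back this with the same database the worker writes to, and consider a push channel (webhook, websocket) instead of polling for high-traffic pages.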
Architecture Pattern 2 — Streaming for Interactive Operations
Some operations genuinely need user interaction and feedback. You can’t defer them. But you can stream the response, making latency feel less bad.
Example: A user is writing an article and wants AI suggestions for the next paragraph. They’re not going anywhere—they’re waiting for input. Streaming makes the wait feel shorter:
// User clicks "Generate next paragraph"
async function generateNextParagraph() {
  const response = await fetch('/api/generate-paragraph', {
    method: 'POST',
    body: JSON.stringify({ articleId, currentText })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder(); // reuse one decoder so multi-byte characters survive chunk boundaries
  let fullText = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    fullText += decoder.decode(value, { stream: true });
    // Update UI in real-time as tokens arrive
    updateArticlePreview(fullText);
  }
}
The backend:
@app.post("/api/generate-paragraph")
async def generate_paragraph(articleId: str, currentText: str):
    async def generate():
        response = await ai_model.generate_async(
            prompt=f"Continue this article:\n\n{currentText}",
            stream=True,
        )
        async for token in response:
            yield token.encode()

    # Raw token chunks; use text/event-stream only if you format them as SSE events
    return StreamingResponse(generate(), media_type="text/plain")
The user sees text appearing in real time. They don't wait 2 seconds for all of it—they see the first token in 300ms and can read as it streams in. The psychological difference is huge.
This pattern works for:
- Content generation
- Chat applications
- Search result refinement
- Code generation
- Interactive suggestions
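Because streaming shifts the metric that matters from total time to time-to-first-token, it's worth measuring TTFT explicitly. A framework-free sketch with a simulated token stream; the generator is a stand-in for your streaming AI client:

```python
import asyncio
import time

async def fake_token_stream():
    # Stand-in for a streaming AI response: slow first token, fast rest
    await asyncio.sleep(0.3)
    yield "Hello"
    for token in [" world", "!"]:
        await asyncio.sleep(0.05)
        yield token

async def measure_stream():
    start = time.perf_counter()
    first_token_at = None
    tokens = []
    async for token in fake_token_stream():
        if first_token_at is None:
            first_token_at = time.perf_counter() - start  # time to first token
        tokens.append(token)
    total = time.perf_counter() - start
    return first_token_at, total, "".join(tokens)

ttft, total, text = asyncio.run(measure_stream())
print(f"TTFT: {ttft * 1000:.0f}ms, total: {total * 1000:.0f}ms, text: {text!r}")
```

If TTFT is under your interactivity threshold, total stream time can be almost arbitrarily long without the experience feeling slow.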
Architecture Pattern 3 — Caching + Predictive Pre-fetching
For operations where you can’t use async or streaming, cache aggressively and pre-compute where possible.
Example: You have a "smart recommendation" feature that uses an LLM to personalize suggestions. The first request for a segment takes 2 seconds (unacceptable), but users in the same segment get similar recommendations, so cache by segment and category:
import json

import redis

cache = redis.Redis()

def get_segment_recommendation(user_segment, product_category):
    # Cache key based on user segment + category
    cache_key = f"recommendation:{user_segment}:{product_category}"

    # Try cache first
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss—call AI
    recommendation = ai_recommender.recommend(
        user_segment=user_segment,
        category=product_category,
    )

    # Cache for 24 hours
    cache.setex(cache_key, 86400, json.dumps(recommendation))
    return recommendation

def get_ai_recommendation(user_id, product_category):
    return get_segment_recommendation(segment_user(user_id), product_category)

Pre-fetch recommendations for segments you know users will request:

def prefetch_recommendations():
    """Run nightly to pre-warm the cache"""
    for segment in get_all_user_segments():
        for category in PRODUCT_CATEGORIES:
            get_segment_recommendation(segment, category)
Now when a user lands on your page, the recommendation exists in cache (5ms response) instead of hitting the API (2000ms response).
This pattern works for:
- Personalization engines
- Product recommendations
- Contextual suggestions
- Ranking operations
- Segmentation
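One caveat with this pattern: when a popular cache key expires, many concurrent requests can all miss at once and each fire the 2-second AI call (a cache stampede). A common mitigation is a per-key lock so only one caller recomputes while the rest wait and reuse the fresh value. Sketched here with a plain dict standing in for Redis; with Redis you would implement the lock with `SET key value NX EX seconds` in the same shape:

```python
import json
import threading

cache: dict[str, str] = {}                 # stand-in for Redis
locks: dict[str, threading.Lock] = {}      # one lock per cache key
locks_guard = threading.Lock()             # protects the locks dict itself

def get_with_stampede_protection(key: str, compute) -> dict:
    """Only one thread recomputes a missing key; the others block briefly
    on the per-key lock and then reuse the freshly cached value."""
    if key in cache:
        return json.loads(cache[key])

    with locks_guard:
        lock = locks.setdefault(key, threading.Lock())

    with lock:
        # Re-check: another thread may have filled the cache while we waited
        if key in cache:
            return json.loads(cache[key])
        value = compute()                  # the expensive AI call happens once
        cache[key] = json.dumps(value)
        return value
```

Without this, a single key expiry under load can multiply your AI API bill and latency spike by the number of concurrent requests.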
Architecture Pattern 4 — Hybrid Sync/Async for Required Context
Sometimes you need AI input to render the page, but you don’t need it to be blocking. Hybrid approach: show a partial result immediately, enhance it asynchronously.
Example: User clicks on a customer’s profile. You need:
- Profile data (20ms from database)
- AI-generated summary of interaction history (2s from LLM)
Instead of waiting for both:
@app.get("/customer/{customer_id}")
async def get_customer_profile(customer_id: str):
    # Get the data you have immediately
    customer = db.get_customer(customer_id)
    interaction_history = db.get_interactions(customer_id)

    # Queue the AI work
    background_jobs.enqueue(
        "generate_customer_summary",
        customer_id=customer_id,
    )

    # Return immediately with what you have
    return {
        "customer": customer,
        "interaction_history": interaction_history,
        "summary": None,  # Will be available via separate API
        "summary_status": "generating",
    }
The frontend shows the profile immediately and polls for the summary:
const profile = await fetch(`/customer/${customerId}`).then(r => r.json());

// Render immediately
renderCustomerProfile(profile);

// Poll for the summary until the background job finishes
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
while (true) {
  await sleep(1000);
  const updated = await fetch(`/customer/${customerId}/summary`).then(r => r.json());
  if (updated.summary) {
    updateSummarySection(updated.summary);
    break;
  }
}
User sees the profile in 50ms. The summary shows up in 2-3 seconds. The interaction feels responsive.
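The poll loop needs a `/customer/{customer_id}/summary` endpoint behind it. A framework-free sketch of the handler logic; the store and field names are assumptions about where the background job writes its result:

```python
# Stand-in store the generate_customer_summary job writes into; this logic
# would back the GET /customer/{customer_id}/summary route the frontend polls
SUMMARIES: dict[str, str] = {}

def get_customer_summary(customer_id: str) -> dict:
    summary = SUMMARIES.get(customer_id)
    if summary is None:
        # Background job hasn't finished; the frontend keeps polling
        return {"summary": None, "summary_status": "generating"}
    return {"summary": summary, "summary_status": "ready"}
```

In practice you would also cap the polling (give up after, say, 30 seconds and show a retry affordance) so a failed background job doesn't leave the UI spinning forever.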
Measuring the Tradeoff
Before you pick a pattern, measure what you’re dealing with:
- Measure baseline latency without AI
- Measure AI API latency under realistic load (p50, p99)
- Measure acceptable latency for your use case
- Calculate the gap
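A quick way to get the p50/p99 numbers in step 2 is to time repeated calls and read off percentiles. A sketch with a simulated call; swap `fake_ai_call` for your real AI client under realistic load:

```python
import random
import statistics
import time

def measure(fn, n=100):
    """Call fn n times and return (p50_ms, p99_ms)."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points
    return cuts[49], cuts[98]                    # p50, p99

def fake_ai_call():
    # Stand-in for your real AI client: mostly ~5ms with a slow tail
    time.sleep(0.02 if random.random() < 0.05 else 0.005)

random.seed(42)  # deterministic for the example
p50_ms, p99_ms = measure(fake_ai_call)
print(f"p50={p50_ms:.1f}ms  p99={p99_ms:.1f}ms")
```

The p99 is the number to design around: it's what your unluckiest users see, and with AI APIs the tail is usually several times the median.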
If your form needs to be under 200ms and AI adds 1200ms, async is mandatory. If you have a report page and AI adds 2s, maybe streaming is enough. If you’re building a batch processing system, latency doesn’t matter at all.
The teams that nail this aren’t the ones with the fastest AI APIs. They’re the ones that matched their architecture to the constraints.
Your dependency on AI APIs is a latency constraint. Architect around it deliberately, not by accident.