How Do You Handle Latency When Your App Depends on AI APIs?
Your application works fine when all your dependencies respond in 50-200ms. Add a call to an LLM API and suddenly your p99 latency jumps to 2.5 seconds. Your users notice. Your Slack channel fills with performance complaints. You realize you’ve built your application around the wrong assumption: that all API calls are created equal.
They aren’t. An image classification API taking 800ms to respond is catastrophic on a mobile web form but acceptable for a background job. A summarization API taking 2 seconds is fine for a report generation page but terrible for an interactive chat.
Latency is the friction between what your users expect and what your architecture delivers. When you depend on AI APIs, that friction grows because you’re surrendering control of response time to an external service.
Why AI APIs Are Different — The Numbers
First, let’s be concrete about what you’re dealing with:
- Fast LLM inference: 300–800ms (small models, short prompts, warm caches)
- Standard LLM inference: 800ms–2s (medium models, typical load, full response)
- Complex LLM operations: 2–5s (long generations, large models, complex prompts)
- Vision APIs: 1–3s (image processing, analysis, description generation)
- Embedding/Search: 100–500ms (vector operations, retrieval ranking)
Compare that to your typical backend services:
- Database queries: 5–50ms
- Cache lookups: 1–5ms
- Internal service calls: 10–100ms
The math is stark: a single LLM call adds more latency than 20 database queries. If you’re not accounting for this architecturally, you’ve already lost.
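To make that arithmetic concrete, here is a quick sketch of a single request path using midpoints of the ranges above. These are illustrative estimates, not measurements from any particular provider:

```python
# Rough latency budget for one request path, using midpoints of the
# ranges above (illustrative estimates, not measurements)
components_ms = {
    "cache lookup": 3,
    "3 database queries": 75,
    "internal service call": 50,
    "standard LLM call": 1400,
}

total_ms = sum(components_ms.values())
for name, ms in components_ms.items():
    print(f"{name:>22}: {ms:5d} ms  ({ms / total_ms:4.0%} of total)")
print(f"{'total':>22}: {total_ms:5d} ms")
```

On these numbers the LLM call alone is over 90% of the budget, which is why the rest of this article is about restructuring around it rather than shaving milliseconds elsewhere.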
The User Experience Problem — When Latency Kills
Here’s where theory meets practice. Your users experience latency differently depending on the context:
Synchronous Operations (User Is Waiting)
The user initiates an action and waits for the result. Response time is directly visible. Tolerable ranges:
- Form submission: 200–500ms max
- Search results: 300–800ms max
- Chat message response: 1s max
- Page transition: 500ms–1s max
Exceed these and users perceive the app as slow, unresponsive, or broken. If your AI feature makes an operation cross these thresholds, you need to restructure the interaction.
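If a synchronous operation must stay under a threshold, one option is to enforce the budget with a hard timeout and a non-AI fallback. A minimal sketch using asyncio; the fake AI call and the fallback are stand-ins for your own client and default logic:

```python
import asyncio

# Stand-in for a real AI client call: sleeps to simulate LLM latency
async def fake_ai_suggest(query: str) -> list[str]:
    await asyncio.sleep(2.0)  # simulated slow LLM response
    return [f"AI suggestion for {query}"]

# Cheap non-AI fallback (e.g. precomputed popular items)
def default_suggestions(query: str) -> list[str]:
    return [f"popular items matching {query}"]

async def suggestions_with_budget(query: str, budget_s: float = 0.3) -> list[str]:
    """Return AI suggestions if they arrive within the budget,
    otherwise degrade gracefully instead of blocking the user."""
    try:
        return await asyncio.wait_for(fake_ai_suggest(query), timeout=budget_s)
    except asyncio.TimeoutError:
        return default_suggestions(query)

print(asyncio.run(suggestions_with_budget("laptops")))  # → ['popular items matching laptops']
```

The point is that the user never waits past the budget: when the AI is fast enough they get the better answer, and when it isn't they still get something usable.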
Asynchronous Operations (User Isn't Waiting)
The user triggers an action but doesn't wait for the result. They get notified when it's done (email, notification, page refresh). Tolerable ranges:
- Background job: 5–60 seconds
- Report generation: 30 seconds–5 minutes
- Batch processing: Minutes to hours
Here, AI latency isn’t a problem because you’re not blocking the user.
Streaming Operations (Progressive Delivery)
The response is delivered in chunks as it's generated. The user sees progress and gets value incrementally. Tolerable ranges:
- Stream time to first token: 500ms max
- Total stream time: Open-ended (user can close anytime)
Streaming changes the latency equation because users perceive it differently. A chat application where the first token arrives in 300ms feels snappy even if the full response takes 5 seconds.
The issue: many teams default to synchronous operations when asynchronous or streaming would serve the interaction far better.
Architecture Pattern 1 — Async Jobs for Long-Running Operations
Your first instinct should be: can this AI work happen in the background?
Example: A user uploads a document and wants AI-generated summaries and key points extracted. Instead of:
User uploads → AI processing → Return response to user
[Blocks for 2-3 seconds]
Do this:
User uploads → Queue background job → Return immediately with status
Background job → AI processing → Store result → Notify user
[User never waits]
Implementation:
# API endpoint - returns immediately
@app.post("/documents/upload")
async def upload_document(file: UploadFile):
    doc_id = save_file(file)

    # Queue the AI work
    background_jobs.enqueue(
        "process_document",
        doc_id=doc_id,
        user_id=current_user.id,
    )

    return {
        "doc_id": doc_id,
        "status": "processing",
        "status_url": f"/documents/{doc_id}/status",
    }
# Background worker
def process_document(doc_id, user_id):
    document_text = load_document(doc_id)

    # Now the latency doesn't matter—this runs whenever
    summary = ai_summarizer.summarize(document_text)
    key_points = ai_extractor.extract_key_points(document_text)

    # Store results
    save_results(doc_id, summary=summary, key_points=key_points)

    # Notify user
    send_notification(user_id, f"Document {doc_id} is ready")
Your API returns in 50ms. The AI work happens in the background. The user gets notified when it’s done. Everyone’s happy.
This pattern works for:
- Document analysis
- Report generation
- Image processing
- Content moderation
- Large-scale classification
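The `status_url` returned above implies a status endpoint the client can poll. A framework-free sketch of that handler's logic; the in-memory store and field names are assumptions about where the worker persisted its output:

```python
# In-memory stand-in for wherever the worker stored its results; in the
# setup above, this function would back GET /documents/{doc_id}/status
RESULTS: dict[str, dict] = {}

def document_status(doc_id: str) -> dict:
    results = RESULTS.get(doc_id)
    if results is None:
        # Worker hasn't finished yet—the client should keep polling
        return {"doc_id": doc_id, "status": "processing"}
    return {"doc_id": doc_id, "status": "complete", **results}

print(document_status("doc-1"))  # → {'doc_id': 'doc-1', 'status': 'processing'}
```

In production you would back this with the same database the worker writes to, and consider a push channel (webhook, websocket) instead of polling for high-traffic pages.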
Architecture Pattern 2 — Streaming for Interactive Operations
Some operations genuinely need user interaction and feedback. You can’t defer them. But you can stream the response, making latency feel less bad.
Example: A user is writing an article and wants AI suggestions for the next paragraph. They’re not going anywhere—they’re waiting for input. Streaming makes the wait feel shorter:
// User clicks "Generate next paragraph"
async function generateNextParagraph() {
  const response = await fetch('/api/generate-paragraph', {
    method: 'POST',
    body: JSON.stringify({ articleId, currentText })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder(); // reuse one decoder so multi-byte characters survive chunk boundaries
  let fullText = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    fullText += decoder.decode(value, { stream: true });
    // Update UI in real-time as tokens arrive
    updateArticlePreview(fullText);
  }
}
The backend:
@app.post("/api/generate-paragraph")
async def generate_paragraph(articleId: str, currentText: str):
    async def generate():
        response = await ai_model.generate_async(
            prompt=f"Continue this article:\n\n{currentText}",
            stream=True,
        )
        async for token in response:
            yield token.encode()

    # Raw token chunks; use text/event-stream only if you format them as SSE events
    return StreamingResponse(generate(), media_type="text/plain")
The user sees text appearing in real time. They don't wait 2 seconds for all of it—they see the first token in 300ms and can read as it streams in. The psychological difference is huge.
This pattern works for:
- Content generation
- Chat applications
- Search result refinement
- Code generation
- Interactive suggestions
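Because streaming shifts the metric that matters from total time to time-to-first-token, it's worth measuring TTFT explicitly. A framework-free sketch with a simulated token stream; the generator is a stand-in for your streaming AI client:

```python
import asyncio
import time

async def fake_token_stream():
    # Stand-in for a streaming AI response: slow first token, fast rest
    await asyncio.sleep(0.3)
    yield "Hello"
    for token in [" world", "!"]:
        await asyncio.sleep(0.05)
        yield token

async def measure_stream():
    start = time.perf_counter()
    first_token_at = None
    tokens = []
    async for token in fake_token_stream():
        if first_token_at is None:
            first_token_at = time.perf_counter() - start  # time to first token
        tokens.append(token)
    total = time.perf_counter() - start
    return first_token_at, total, "".join(tokens)

ttft, total, text = asyncio.run(measure_stream())
print(f"TTFT: {ttft * 1000:.0f}ms, total: {total * 1000:.0f}ms, text: {text!r}")
```

If TTFT is under your interactivity threshold, total stream time can be almost arbitrarily long without the experience feeling slow.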
Architecture Pattern 3 — Caching + Predictive Pre-fetching
For operations where you can’t use async or streaming, cache aggressively and pre-compute where possible.
Example: You have a "smart recommendation" feature that uses an LLM to personalize suggestions. The first request for a segment takes 2 seconds (unacceptable), but users in the same segment get similar recommendations, so cache by segment and category:
import json

import redis

cache = redis.Redis()

def get_segment_recommendation(user_segment, product_category):
    # Cache key based on user segment + category
    cache_key = f"recommendation:{user_segment}:{product_category}"

    # Try cache first
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss—call AI
    recommendation = ai_recommender.recommend(
        user_segment=user_segment,
        category=product_category,
    )

    # Cache for 24 hours
    cache.setex(cache_key, 86400, json.dumps(recommendation))
    return recommendation

def get_ai_recommendation(user_id, product_category):
    return get_segment_recommendation(segment_user(user_id), product_category)

Pre-fetch recommendations for segments you know users will request:

def prefetch_recommendations():
    """Run nightly to pre-warm the cache"""
    for segment in get_all_user_segments():
        for category in PRODUCT_CATEGORIES:
            get_segment_recommendation(segment, category)
Now when a user lands on your page, the recommendation exists in cache (5ms response) instead of hitting the API (2000ms response).
This pattern works for:
- Personalization engines
- Product recommendations
- Contextual suggestions
- Ranking operations
- Segmentation
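One caveat with this pattern: when a popular cache key expires, many concurrent requests can all miss at once and each fire the 2-second AI call (a cache stampede). A common mitigation is a per-key lock so only one caller recomputes while the rest wait and reuse the fresh value. Sketched here with a plain dict standing in for Redis; with Redis you would implement the lock with `SET key value NX EX seconds` in the same shape:

```python
import json
import threading

cache: dict[str, str] = {}                 # stand-in for Redis
locks: dict[str, threading.Lock] = {}      # one lock per cache key
locks_guard = threading.Lock()             # protects the locks dict itself

def get_with_stampede_protection(key: str, compute) -> dict:
    """Only one thread recomputes a missing key; the others block briefly
    on the per-key lock and then reuse the freshly cached value."""
    if key in cache:
        return json.loads(cache[key])

    with locks_guard:
        lock = locks.setdefault(key, threading.Lock())

    with lock:
        # Re-check: another thread may have filled the cache while we waited
        if key in cache:
            return json.loads(cache[key])
        value = compute()                  # the expensive AI call happens once
        cache[key] = json.dumps(value)
        return value
```

Without this, a single key expiry under load can multiply your AI API bill and latency spike by the number of concurrent requests.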
Architecture Pattern 4 — Hybrid Sync/Async for Required Context
Sometimes you need AI input to render the page, but you don’t need it to be blocking. Hybrid approach: show a partial result immediately, enhance it asynchronously.
Example: User clicks on a customer’s profile. You need:
- Profile data (20ms from database)
- AI-generated summary of interaction history (2s from LLM)
Instead of waiting for both:
@app.get("/customer/{customer_id}")
async def get_customer_profile(customer_id: str):
    # Get the data you have immediately
    customer = db.get_customer(customer_id)
    interaction_history = db.get_interactions(customer_id)

    # Queue the AI work
    background_jobs.enqueue(
        "generate_customer_summary",
        customer_id=customer_id,
    )

    # Return immediately with what you have
    return {
        "customer": customer,
        "interaction_history": interaction_history,
        "summary": None,  # Will be available via separate API
        "summary_status": "generating",
    }
The frontend shows the profile immediately and polls for the summary:
const profile = await fetch(`/customer/${customerId}`).then(r => r.json());

// Render immediately
renderCustomerProfile(profile);

// Poll for the summary until the background job finishes
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
while (true) {
  await sleep(1000);
  const updated = await fetch(`/customer/${customerId}/summary`).then(r => r.json());
  if (updated.summary) {
    updateSummarySection(updated.summary);
    break;
  }
}
User sees the profile in 50ms. The summary shows up in 2-3 seconds. The interaction feels responsive.
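The poll loop needs a `/customer/{customer_id}/summary` endpoint behind it. A framework-free sketch of the handler logic; the store and field names are assumptions about where the background job writes its result:

```python
# Stand-in store the generate_customer_summary job writes into; this logic
# would back the GET /customer/{customer_id}/summary route the frontend polls
SUMMARIES: dict[str, str] = {}

def get_customer_summary(customer_id: str) -> dict:
    summary = SUMMARIES.get(customer_id)
    if summary is None:
        # Background job hasn't finished; the frontend keeps polling
        return {"summary": None, "summary_status": "generating"}
    return {"summary": summary, "summary_status": "ready"}
```

In practice you would also cap the polling (give up after, say, 30 seconds and show a retry affordance) so a failed background job doesn't leave the UI spinning forever.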
Measuring the Tradeoff
Before you pick a pattern, measure what you’re dealing with:
- Measure baseline latency without AI
- Measure AI API latency under realistic load (p50, p99)
- Measure acceptable latency for your use case
- Calculate the gap
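A quick way to get the p50/p99 numbers in step 2 is to time repeated calls and read off percentiles. A sketch with a simulated call; swap `fake_ai_call` for your real AI client under realistic load:

```python
import random
import statistics
import time

def measure(fn, n=100):
    """Call fn n times and return (p50_ms, p99_ms)."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points
    return cuts[49], cuts[98]                    # p50, p99

def fake_ai_call():
    # Stand-in for your real AI client: mostly ~5ms with a slow tail
    time.sleep(0.02 if random.random() < 0.05 else 0.005)

random.seed(42)  # deterministic for the example
p50_ms, p99_ms = measure(fake_ai_call)
print(f"p50={p50_ms:.1f}ms  p99={p99_ms:.1f}ms")
```

The p99 is the number to design around: it's what your unluckiest users see, and with AI APIs the tail is usually several times the median.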
If your form needs to be under 200ms and AI adds 1200ms, async is mandatory. If you have a report page and AI adds 2s, maybe streaming is enough. If you’re building a batch processing system, latency doesn’t matter at all.
The teams that nail this aren’t the ones with the fastest AI APIs. They’re the ones that matched their architecture to the constraints.
Your dependency on AI APIs is a latency constraint. Architect around it deliberately, not by accident.