LLM observability with Langfuse — MVP setup

By HarmanJyot Kaur · May 2, 2026 · 5 min read

Most teams starting with LLM observability do one of two things wrong:

They wait until something breaks in production, then spend a week debugging blind.
They install three vendors at once, end up paying for two of them, and use none.

This is the minimum-viable Langfuse setup we put on every AI engagement, in the order we ship it.

Day 1: trace every model call

Wrap your LLM client. That’s it. Every call gets a trace.

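from langfuse import Langfuse  # used for manual traces on Day 2
from langfuse.openai import openai  # drop-in replacement for `import openai`

# Credentials come from the environment:
# LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST, OPENAI_API_KEY.
# Same OpenAI client API; traces happen automatically.
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],
    # user_id and session_id come from your own request context.
    metadata={"user_id": user_id, "session_id": session_id},
)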

After this, you can answer: which model was called, what was the prompt, what came back, and how long did it take. That’s 80% of the value.
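
If it helps to see what a wrapper of this kind records, here is a toy version using only the standard library. It is an illustration of the idea, not how the Langfuse integration is implemented: `traced`, `TRACES`, and `fake_completion` are hypothetical names, and the "model" just echoes its input.

```python
import time
from functools import wraps

TRACES = []  # in-memory stand-in for a trace store (hypothetical)


def traced(fn):
    """Record model, input, output, and latency for every call to fn."""
    @wraps(fn)
    def wrapper(*, model, messages, **kwargs):
        start = time.perf_counter()
        output = fn(model=model, messages=messages, **kwargs)
        TRACES.append({
            "model": model,
            "input": messages,
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output
    return wrapper


@traced
def fake_completion(*, model, messages):
    """Hypothetical stand-in for a real model call; echoes the last message."""
    return f"echo: {messages[-1]['content']}"


fake_completion(
    model="example-model",
    messages=[{"role": "user", "content": "hi"}],
)
print(TRACES[0]["model"], TRACES[0]["output"])
```

The four recorded fields map onto the four questions above: which model, what prompt, what response, how long.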

Day 2: structure the traces

Don’t let traces be a flat soup. Wrap multi-step operations in a single trace with named spans:

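from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* keys and host from the environment

# One trace per user-facing operation, one span per step (assumes the v2-style Python SDK).
trace = langfuse.trace(name="rag-query", user_id=user_id)
retrieval_span = trace.span(name="retrieval", input={"query": q})
docs = vector_db.search(q)  # your vector store client
retrieval_span.end(output={"docs": docs})

llm_span = trace.span(name="generation", input={"prompt": prompt})
answer = llm.invoke(prompt)  # your LLM client
llm_span.end(output={"answer": answer})

langfuse.flush()  # send buffered events before a short-lived process exits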

Now you can answer: how long did retrieval take vs. generation? Which step is the bottleneck? Which retrieval result correlated with a hallucination?
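
To make the bottleneck question concrete, here is a minimal sketch that works on span timings after the fact. It does not call Langfuse; `SpanTiming` and `slowest_step` are hypothetical names and the timings are made up. In practice you would read the same durations off the trace view.

```python
from dataclasses import dataclass


@dataclass
class SpanTiming:
    """Hypothetical record of one span: its name plus start/end in seconds."""
    name: str
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def slowest_step(spans: list[SpanTiming]) -> tuple[str, float]:
    """Return the slowest span's name and its share of the summed span time."""
    total = sum(s.duration_s for s in spans)
    worst = max(spans, key=lambda s: s.duration_s)
    return worst.name, worst.duration_s / total


# Made-up timings for one "rag-query" trace.
spans = [
    SpanTiming("retrieval", start_s=0.0, end_s=0.5),
    SpanTiming("generation", start_s=0.5, end_s=2.5),
]
name, share = slowest_step(spans)
print(f"{name} took {share:.0%} of the traced time")
```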

Day 3: cost attribution

Tag every trace with the tenant or user, as the user_id on the Day 2 trace does. Langfuse derives cost from token usage and rolls it up per user automatically. This single thing has paid for itself in customer-account profitability conversations more often than any other observability investment we’ve made.
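
The rollup itself is simple arithmetic. Below is a sketch of what per-tenant cost attribution computes, independent of Langfuse. The record fields, the model name, and above all the prices are placeholders chosen for easy arithmetic, not real vendor pricing.

```python
from collections import defaultdict

# Placeholder prices in USD per 1M tokens; NOT real vendor pricing.
PRICE_PER_1M = {"example-model": {"input": 1.00, "output": 2.00}}


def trace_cost(model, input_tokens, output_tokens):
    """Cost of one call: tokens times the per-token price for each direction."""
    price = PRICE_PER_1M[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_by_tenant(traces):
    """Sum call costs per tenant tag."""
    totals = defaultdict(float)
    for t in traces:
        totals[t["tenant"]] += trace_cost(t["model"], t["input_tokens"], t["output_tokens"])
    return dict(totals)


# Made-up trace records; the tenant names are hypothetical.
traces = [
    {"tenant": "acme", "model": "example-model", "input_tokens": 1_000_000, "output_tokens": 500_000},
    {"tenant": "acme", "model": "example-model", "input_tokens": 500_000, "output_tokens": 0},
    {"tenant": "globex", "model": "example-model", "input_tokens": 200_000, "output_tokens": 100_000},
]
for tenant, usd in sorted(cost_by_tenant(traces).items()):
    print(f"{tenant}: ${usd:.2f}")
```

A trace without a tenant tag cannot be attributed afterwards, which is why the tag goes on from the first day you care about cost.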

Defer until you actually need them

Eval pipelines. Useful, but premature without a hypothesis. Start with traces and human review.
Prompt management UI. Tempting. Most teams find prompts-in-git serves them better.
A/B testing inside Langfuse. Use your existing experimentation infra; don’t fragment the eval surface.
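
For the prompts-in-git option, one pattern that keeps traces tied to an exact prompt version is to hash the template file and attach the hash as trace metadata. This is a sketch under assumptions: the file name, the `load_prompt` helper, and the metadata keys are all hypothetical, and a temporary directory stands in for your repository.

```python
import hashlib
import tempfile
from pathlib import Path
from string import Template


def load_prompt(path: Path) -> tuple[Template, str]:
    """Read a prompt template and return it with a short content hash."""
    text = path.read_text(encoding="utf-8")
    version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return Template(text), version


# Stand-in for a file tracked in your repo (hypothetical name and content).
with tempfile.TemporaryDirectory() as repo:
    prompt_file = Path(repo) / "answer.txt"
    prompt_file.write_text(
        "Answer using only this context:\n$context\n\nQuestion: $question\n",
        encoding="utf-8",
    )
    template, prompt_version = load_prompt(prompt_file)

prompt = template.substitute(context="...", question="...")

# Attach this to the trace so every trace points at an exact prompt version.
trace_metadata = {"prompt_file": "answer.txt", "prompt_version": prompt_version}
print(trace_metadata)
```

A content hash changes whenever the template text changes, so two traces with the same prompt_version were produced by the same prompt text.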

When Langfuse stops being enough

You’ll hit one of these:

Volume cost. At very high request volumes, hosted Langfuse gets expensive. Self-host (it’s open-source, deploys cleanly on EKS/GKE).
Compliance. PHI / PCI / on-prem requirements push you to self-host or an alternative (Phoenix, Helicone with on-prem).
Custom metrics. When you need richer eval (factuality scoring, custom rubrics), you’ll integrate a separate eval framework — but Langfuse remains the trace store.

What this gets you

An LLM application where, when something looks wrong in production, you can pull up the trace and see the prompt, the retrieved context, the response, the cost, and the latency within 60 seconds of the report. That’s the bar.

If you’d like a second pair of eyes on your AI observability stack, we do this regularly.
