CloudWizz

Blog · May 2, 2026 · 5 min

LLM observability with Langfuse — the minimum viable setup

What to wire up first, what to defer, and how to get useful production traces without becoming an observability vendor's full customer.

By HarmanJyot Kaur

Most teams starting with LLM observability do one of two things wrong:

  1. They wait until something breaks in production, then spend a week debugging blind.
  2. They install three vendors at once, end up paying for two of them, and use none.

This is the minimum-viable Langfuse setup we put on every AI engagement, in the order we ship it.

Day 1: trace every model call

Wrap your LLM client. That’s it. Every call gets a trace.

# Drop-in replacement for the OpenAI SDK import.
from langfuse.openai import openai

# Same OpenAI client API; traces happen automatically.
# Assumes the v2 Python SDK, where user_id and session_id are passed as
# top-level keyword arguments so the trace is attributed to them.
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],
    user_id=user_id,
    session_id=session_id,
)

After this, you can answer: which model was called, what was the prompt, what came back, and how long did it take. That’s 80% of the value.
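
The wrapper can only send traces if the SDK finds credentials. If they are missing, the model call may still succeed while nothing reaches Langfuse, so a start-up check is cheap insurance. A minimal sketch, assuming the standard `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` environment variables; `missing_langfuse_env` is a hypothetical helper name, not part of the SDK:

```python
import os

# Environment variables the Langfuse SDK reads for credentials.
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_langfuse_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_langfuse_env()
    if missing:
        print("Langfuse tracing is not configured; missing:", ", ".join(missing))
```

Run it once at service start-up and fail loudly, or at least log, when the list is non-empty.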

Day 2: structure the traces

Don’t let traces be a flat soup. Wrap multi-step operations in a single trace with named spans:

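from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the environment.
langfuse = Langfuse()

# One trace per user-facing operation; each step becomes a named span.
trace = langfuse.trace(name="rag-query", user_id=user_id)

retrieval_span = trace.span(name="retrieval", input={"query": q})
docs = vector_db.search(q)
retrieval_span.end(output={"docs": docs})

llm_span = trace.span(name="generation", input={"prompt": prompt})
answer = llm.invoke(prompt)
llm_span.end(output={"answer": answer})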

Now you can answer: how long did retrieval take vs. generation? Which step is the bottleneck? Which retrieval result correlated with a hallucination?
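
Manually paired `span(...)` and `end(...)` calls are easy to leave open when a step raises, which makes latency numbers misleading. One way to guard against that is a small context manager. This is a sketch, not part of the Langfuse SDK: `traced_step`, `FakeTrace`, and `FakeSpan` are hypothetical names, and the `level` and `status_message` arguments to `end()` are assumed to behave as in the v2 Python SDK.

```python
from contextlib import contextmanager


@contextmanager
def traced_step(trace, name, input):
    """Open a span on `trace` and always end it, even if the step raises."""
    span = trace.span(name=name, input=input)
    result = {}  # the caller stores the step's output here
    try:
        yield result
    except Exception as exc:
        # Assumption: end() accepts level and status_message, as in the v2 SDK.
        span.end(level="ERROR", status_message=str(exc))
        raise
    else:
        span.end(output=result.get("output"))


# Stand-ins so the sketch runs without a Langfuse account (hypothetical classes).
class FakeSpan:
    def __init__(self, name):
        self.name = name
        self.ended_with = None

    def end(self, **kwargs):
        self.ended_with = kwargs


class FakeTrace:
    def __init__(self):
        self.spans = []

    def span(self, name, input):
        span = FakeSpan(name)
        self.spans.append(span)
        return span


trace = FakeTrace()
with traced_step(trace, "retrieval", {"query": "q"}) as step:
    step["output"] = {"docs": ["doc-1"]}
```

With a real trace object in place of `FakeTrace`, each step of the RAG query would be wrapped in its own `with traced_step(...)` block.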

Day 3: cost attribution

Tag every trace with the tenant or user ID. Langfuse derives cost from token usage for the models it has prices for and aggregates it per user, so there is no extra bookkeeping on your side. This single thing has paid for itself in customer-account profitability conversations more often than any other observability investment we’ve made.
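
Attribution only works if every call site tags traces the same way. A minimal sketch of a shared helper, assuming the v2 Python SDK's `trace()` accepts `user_id`, `session_id`, `tags`, and `metadata`; the `attribution` function and the `tenant:` tag prefix are illustrative conventions, not something Langfuse prescribes:

```python
def attribution(tenant_id, user_id, session_id=None):
    """Build keyword arguments that tag a trace with its tenant and user."""
    kwargs = {
        "user_id": user_id,
        # A tag makes the tenant filterable; metadata keeps the raw ID.
        "tags": [f"tenant:{tenant_id}"],
        "metadata": {"tenant_id": tenant_id},
    }
    if session_id is not None:
        kwargs["session_id"] = session_id
    return kwargs
```

A call site would then look like `langfuse.trace(name="rag-query", **attribution("acme", "user-42"))`, where `"acme"` and `"user-42"` are placeholder IDs.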

Defer until you actually need them

  • Eval pipelines. Useful, but premature without a hypothesis. Start with traces and human review.
  • Prompt management UI. Tempting. Most teams find prompts-in-git serves them better.
  • A/B testing inside Langfuse. Use your existing experimentation infra; don’t fragment the eval surface.
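
If prompts live in git, it still helps to know which version produced a given trace. A minimal sketch, assuming prompts are stored as plain-text files in the repo; `load_prompt` and the 12-character fingerprint length are hypothetical choices:

```python
import hashlib
from pathlib import Path


def load_prompt(path):
    """Read a prompt template from the repo and fingerprint its contents."""
    text = Path(path).read_text(encoding="utf-8")
    # A short content hash identifies the prompt version without a UI.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return text, digest
```

Attaching the fingerprint to the trace metadata, for example under a key such as `prompt_sha`, lets you tie a bad response back to the exact prompt text in version control.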

When Langfuse stops being enough

You’ll hit one of these:

  • Volume cost. At very high request volumes, hosted Langfuse gets expensive. Self-host (it’s open-source, deploys cleanly on EKS/GKE).
  • Compliance. PHI / PCI / on-prem requirements push you to self-host or an alternative (Phoenix, Helicone with on-prem).
  • Custom metrics. When you need richer eval (factuality scoring, custom rubrics), you’ll integrate a separate eval framework — but Langfuse remains the trace store.

What this gets you

An LLM application where, when something looks wrong in production, you can pull up the trace and see the prompt, the retrieved context, the response, the cost, and the latency within 60 seconds of the report. That’s the bar.

If you’d like a second pair of eyes on your AI observability stack, we do this regularly.

Tags

ai · llm · observability · langfuse
