Ibrahim Muhammad AI Engineer & Founder

How to Monitor Your LLM Conversations

Building a chat product or autonomous agent is different from anything that came before it. Traditional products have clear metrics: did a user take a certain action? The answer is in your database or your product analytics tool. For conversations, “useful” is much harder to define. Was that a good interaction? What was the user even trying to do? Without evals, you’re mostly guessing.

Test Cases

When developing your product, having offline evals is essential. Your agent must pass them before a new version ships, though pass/fail may not be binary. You could define a threshold success rate for what’s acceptable.
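A threshold gate like this can be very small. The sketch below is illustrative only: `EvalResult`, `pass_rate`, `can_ship`, the case names, and the threshold values are all made up for the example, and how each case gets its pass/fail verdict is left out.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    case_id: str   # hypothetical identifier for one offline eval case
    passed: bool   # verdict from whatever scorer you use


def pass_rate(results):
    """Fraction of eval cases that passed (0.0 for an empty run)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def can_ship(results, threshold=0.9):
    """Gate a release on the pass rate meeting the chosen threshold."""
    return pass_rate(results) >= threshold


# Made-up results for four eval cases.
results = [
    EvalResult("refund-policy", True),
    EvalResult("order-status", True),
    EvalResult("cancel-subscription", False),
    EvalResult("update-address", True),
]

print(f"pass rate: {pass_rate(results):.2f}")
print(f"ship at 0.7 threshold: {can_ship(results, threshold=0.7)}")
print(f"ship at 0.9 threshold: {can_ship(results, threshold=0.9)}")
```

The right threshold depends on your product; the point is only that it is written down and checked before every release.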

The hard part is deciding what goes in. Evals need to represent production data: not the most relevant benchmark you found online, not the handful of examples from the PRD, not synthetically generated hypotheticals. Those all have their place, but if your evals don’t match what actually happens in production, you’re not measuring the right thing.

Prompts

Past the initial wow factor, you realize the agent isn’t doing what it’s supposed to. So you start prompt engineering or tweaking the harness. Over time the prompt grows to tens or even hundreds of statements. And despite explicitly telling the agent that a certain behavior matters, you still see it doing the opposite in production. How do you even know whether your agent is following the rules? Often you find out by accident. That’s not good enough.

LLM Observability

There are various LLM observability tools, but most feel like systems monitoring dashboards rather than tools built to catch whether your agent is following your instructions. Scorers and LLM-as-a-Judge can help, but model-based approaches have their inaccuracies. You still need humans reviewing the data.

With a lot of data points, random selection only gets you so far. The real question is how to prioritize what to look at.

Review Queues

If hundreds of conversations ask the same question or trigger the same action, reviewing the same thing repeatedly is a waste. You need diverse examples, selected by embedding distance or by looking at extremes in tools used, answer length, latency, or other signals.
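One way to use embedding distance for this is greedy farthest-point sampling: keep picking the conversation that is furthest from everything already picked. This is a minimal sketch, not a recommendation of a specific method. It assumes you already have one embedding vector per conversation; the toy 2-D vectors and the `pick_diverse` helper are invented for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; treats a zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def pick_diverse(embeddings, k):
    """Greedy farthest-point sampling: indices of up to k mutually distant items."""
    if not embeddings or k <= 0:
        return []
    chosen = [0]  # arbitrary seed: start from the first conversation
    # Distance from each item to its nearest already-chosen item.
    nearest = [cosine_distance(e, embeddings[0]) for e in embeddings]
    while len(chosen) < min(k, len(embeddings)):
        remaining = [i for i in range(len(embeddings)) if i not in chosen]
        candidate = max(remaining, key=lambda i: nearest[i])
        chosen.append(candidate)
        for i, e in enumerate(embeddings):
            nearest[i] = min(nearest[i], cosine_distance(e, embeddings[candidate]))
    return chosen


# Toy embeddings: items 1 and 3 are near-duplicates of item 0.
embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
    [0.98, 0.1],
    [-1.0, 0.0],
]

picked = pick_diverse(embeddings, 3)
print(picked)
```

With these toy vectors the near-duplicates are skipped in favor of the conversations that point in different directions. The same loop works with any distance function, so extremes in answer length or latency could be folded in as extra dimensions.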

Some issues can be auto-flagged: the agent didn’t follow an explicit prompt instruction, or a groundedness checker found a claim that doesn’t exist in the knowledge base. A good review system surfaces these first.
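Surfacing flagged conversations first can be as simple as a sort. In this sketch the `Conversation` class, the flag names, and `build_review_queue` are hypothetical; the checks that produce the flags (instruction-following, groundedness) are assumed to run elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    conv_id: str
    # Names of automated checks that fired; empty if nothing was flagged.
    flags: list = field(default_factory=list)


def build_review_queue(conversations):
    """Order conversations so those with the most auto-flags come first.

    sorted() is stable, so conversations with the same number of flags
    keep their original relative order.
    """
    return sorted(conversations, key=lambda c: len(c.flags), reverse=True)


conversations = [
    Conversation("c1"),
    Conversation("c2", ["instruction_violation"]),
    Conversation("c3", ["ungrounded_claim", "instruction_violation"]),
    Conversation("c4"),
]

queue = build_review_queue(conversations)
print([c.conv_id for c in queue])
```

Counting flags is the crudest possible priority. Weighting flags by severity would be a natural next step, but the weights are a product decision.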

Labelling

As you review conversations, annotate them:

Labelling requires building and maintaining a taxonomy of problems specific to your application. It should evolve as you learn more about how your product is used. The key word is specific: this isn’t about generic helpfulness or toxicity, it’s about what matters for your use case.

Insights

Three ways to get more out of your conversation data:

Cost Implications

Human review is expensive but irreplaceable. LLM-as-a-Judge is cheaper, but costs add up fast when running it on every conversation. The practical answer is layering: small classifiers trained on human labels handle the bulk of the data cheaply; LLM-as-a-Judge runs on a subsample; humans review the most ambiguous or high-value examples.
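The layering can be expressed as a routing rule. The sketch below assumes a small classifier that returns a confidence score between 0 and 1 for each conversation; the `route` function, the ambiguity band, and the sampling rate are illustrative values, not tuned recommendations.

```python
import random


def route(confidence, high_value, rng, ambiguity_band=(0.4, 0.6), judge_rate=0.1):
    """Decide which layer reviews a conversation.

    confidence: classifier score in [0, 1] (assumed to come from a small
                model trained on human labels).
    high_value: True if the conversation matters enough to always get a human.
    rng:        a random.Random instance, passed in so sampling is reproducible.
    """
    low, high = ambiguity_band
    # Humans see the ambiguous or high-value cases.
    if high_value or low <= confidence <= high:
        return "human"
    # A random subsample of the rest goes to LLM-as-a-Judge.
    if rng.random() < judge_rate:
        return "llm_judge"
    # Everything else relies on the cheap classifier alone.
    return "classifier"


rng = random.Random(0)
print(route(0.5, False, rng))   # ambiguous score
print(route(0.95, True, rng))   # confident score, but high value
print(route(0.95, False, rng))  # confident score, routed by sampling
```

Passing the random generator in explicitly keeps the subsample reproducible, which helps when you want to compare judge output across runs.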


How are you keeping track of your chat and agent sessions?

