Building a chat product or autonomous agent is different from anything that came before it. Traditional products have clear metrics: did a user take a certain action? It’s in your database or your product analytics tool. For conversations, “useful” is much harder to define. Was that a good interaction? What was the user even trying to do? Without evals, you’re mostly guessing.
Test Cases
When developing your product, having offline evals is essential. Your agent must pass them before a new version ships, though passing doesn’t have to be all-or-nothing: you can define a threshold success rate for what’s acceptable.
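A threshold gate like this can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed setup: `EvalCase`, `pass_rate`, `release_gate`, and the toy agent are all hypothetical names, and the 0.9 default is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # returns True when the agent's reply is acceptable


def pass_rate(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose check accepts the agent's reply."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(1 for case in cases if case.check(agent(case.prompt)))
    return passed / len(cases)


def release_gate(agent: Callable[[str], str], cases: list[EvalCase],
                 threshold: float = 0.9) -> bool:
    """Allow a release only when the pass rate meets the chosen threshold."""
    return pass_rate(agent, cases) >= threshold


# Toy stand-in for a real agent: it just upper-cases the prompt.
def toy_agent(prompt: str) -> str:
    return prompt.upper()


cases = [
    EvalCase("hello", lambda reply: reply == "HELLO"),
    EvalCase("refund policy", lambda reply: "REFUND" in reply),
    EvalCase("bye", lambda reply: reply == "goodbye"),  # this one fails
]

rate = pass_rate(toy_agent, cases)  # 2 of 3 cases pass
print(f"pass rate: {rate:.2f}, ship: {release_gate(toy_agent, cases)}")
```

In practice the checks would be drawn from production conversations, and the threshold is a product decision rather than something the code can pick for you.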
The hard part is deciding what goes in. Evals need to represent production data: not the most relevant benchmark you found online, not the handful of examples from the PRD, not synthetically generated hypotheticals. Those all have their place, but if your evals don’t match what actually happens in production, you’re not measuring the right thing.
Prompts
Past the initial wow factor, you realize the agent isn’t doing what it’s supposed to. So you start prompt engineering or tweaking the harness. Over time the prompt grows to tens or even hundreds of statements. And despite explicitly telling the agent that a certain behavior matters, you still see it doing the opposite in production. How do you even know whether your agent is following the rules? Often you find out by accident. That’s not good enough.
LLM Observability
There are various LLM observability tools, but most feel like systems monitoring dashboards rather than tools built to catch whether your agent is following your instructions. Scorers and LLM-as-a-Judge can help, but model-based approaches have their inaccuracies. You still need humans reviewing the data.
With a lot of data points, random selection only gets you so far. The real question is how to prioritize what to look at.
Review Queues
If hundreds of conversations ask the same question or trigger the same action, reviewing the same thing repeatedly is a waste. You need diverse examples: pick them by embedding distance, or by looking at extremes in tools used, answer length, latency, or other signals.
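One way to pick diverse examples by embedding distance is greedy farthest-point sampling: repeatedly pick the conversation that is farthest from everything already picked. The sketch below is one possible approach, not the only one. It assumes each conversation already has an embedding vector, and the function name and toy 2-D vectors are made up for illustration.

```python
import math


def farthest_point_sample(embeddings: list[list[float]], k: int) -> list[int]:
    """Greedily pick up to k indices whose embeddings are far apart from each other."""
    if k <= 0 or not embeddings:
        return []
    selected = [0]  # seed with the first item; any seed works
    # Distance from each item to its nearest already-selected item.
    min_dist = [math.dist(e, embeddings[0]) for e in embeddings]
    while len(selected) < min(k, len(embeddings)):
        nxt = max(range(len(embeddings)), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0.0:
            break  # everything left duplicates an item we already picked
        selected.append(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(e, embeddings[nxt]))
    return selected


# Toy 2-D "embeddings": two tight clusters plus one outlier.
embeddings = [
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.1],
    [5.0, 5.0],
    [5.1, 5.0],
    [-4.0, 3.0],
]

picked = farthest_point_sample(embeddings, 3)
print(picked)  # one index from each cluster plus the outlier: [0, 4, 5]
```

The same loop works with other distance functions, such as cosine distance, if that suits your embedding model better.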
Some issues can be auto-flagged: the agent didn’t follow an explicit prompt instruction, or a groundedness checker found a claim that doesn’t exist in the knowledge base. A good review system surfaces these first.
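Putting auto-flagged conversations at the front of the queue can be sketched as follows. The two rules are hypothetical examples of explicit prompt instructions ("never mention the competitor", "always include the disclaimer"), and `AcmeCorp` is an invented name. Real rules would come from your own prompt, and a groundedness checker would plug in the same way.

```python
from typing import Callable

# Each rule returns True when the reply VIOLATES an explicit instruction.
# Both rules are made-up examples for illustration.
RULES: dict[str, Callable[[str], bool]] = {
    "mentions_competitor": lambda reply: "acmecorp" in reply.lower(),
    "missing_disclaimer": lambda reply: "not financial advice" not in reply.lower(),
}


def flags_for(reply: str) -> list[str]:
    """Names of all rules the reply violates."""
    return [name for name, violated in RULES.items() if violated(reply)]


def build_queue(conversations: dict[str, str]) -> list[tuple[str, list[str]]]:
    """Order conversations so the ones with the most flags are reviewed first."""
    flagged = [(cid, flags_for(reply)) for cid, reply in conversations.items()]
    return sorted(flagged, key=lambda item: len(item[1]), reverse=True)


conversations = {
    "c1": "You should buy index funds.",
    "c2": "Index funds are popular. This is not financial advice.",
    "c3": "AcmeCorp offers better rates, so switch to them.",
}

queue = build_queue(conversations)
for cid, flags in queue:
    print(cid, flags)
```

Here `c3` lands first because it breaks both rules, `c1` second, and the clean `c2` last.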
Labelling
As you review conversations, annotate them:
- Flag issues in agent behavior. Include a description of the problem and why it matters. This helps find similar cases and measure whether the problem improves over time. Add these as test cases in your offline evals.
- Note the correct behavior. Specific notes on what good looks like build a corpus of positive examples, which can also be used as training data.
Labelling requires building and maintaining a taxonomy of problems specific to your application. It should evolve as you learn more about how your product is used. The key word is specific: this isn’t about generic helpfulness or toxicity, it’s about what matters for your use case.
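As a rough sketch of what an annotation record might look like: each label carries a note and, for problems, a tag from a taxonomy you control. The tag names and the `Annotation` class are hypothetical. The point is only that tags are validated against your own list, so counts per issue stay comparable over time.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical, application-specific taxonomy; expected to evolve over time.
TAXONOMY = {"wrong_refund_policy", "ignored_escalation_rule", "invented_feature"}


@dataclass
class Annotation:
    conversation_id: str
    note: str                    # what went wrong and why, or what good looks like
    issue: Optional[str] = None  # taxonomy tag; None marks a positive example

    def __post_init__(self) -> None:
        if self.issue is not None and self.issue not in TAXONOMY:
            raise ValueError(f"unknown issue tag: {self.issue!r}")


def issue_counts(annotations: list[Annotation]) -> Counter:
    """Count flagged issues per taxonomy tag, ignoring positive examples."""
    return Counter(a.issue for a in annotations if a.issue is not None)


annotations = [
    Annotation("c1", "Quoted a 60-day window; policy is 30 days.", "wrong_refund_policy"),
    Annotation("c2", "Did not hand off after the user asked for a human.", "ignored_escalation_rule"),
    Annotation("c3", "Quoted the 30-day window and linked the policy page."),
    Annotation("c4", "Offered a refund outside the policy window.", "wrong_refund_policy"),
]

counts = issue_counts(annotations)
print(counts.most_common())
```

Rejecting unknown tags at creation time is a design choice: it forces taxonomy changes to be deliberate instead of accumulating near-duplicate labels.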
Insights
Three ways to get more out of your conversation data:
- Clustering: Group similar conversations to understand what people are talking about at a high level, then drill into specific clusters for detail.
- Topic / use-case classification: Break down conversations by topic to understand how your tool is used. The taxonomy should be something you control and can update yourself.
- Scorers: A piece of code or a model that adds metadata to a conversation. If you want short responses, response length is a score. If the agent should never output code, an LLM-as-a-Judge module can flag when it does.
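The two kinds of scorer described above can share one interface: a function that takes a reply and returns metadata. In this sketch, `length_scorer` is plain code, while `make_judge_scorer` wraps whatever judge you supply. The `fake_code_judge` here is only a keyword heuristic standing in for a model call, not a real LLM-as-a-Judge, and all names are illustrative.

```python
from typing import Callable

Scorer = Callable[[str], dict[str, float]]


def length_scorer(reply: str) -> dict[str, float]:
    """Code-based scorer: response length in whitespace-separated words."""
    return {"response_words": float(len(reply.split()))}


def make_judge_scorer(judge: Callable[[str], bool]) -> Scorer:
    """Wrap a yes/no judge (e.g. an LLM call) as a scorer."""
    def scorer(reply: str) -> dict[str, float]:
        return {"contains_code": 1.0 if judge(reply) else 0.0}
    return scorer


def fake_code_judge(reply: str) -> bool:
    # Stand-in for a model call; a real judge would prompt an LLM instead.
    return "def " in reply or "import " in reply


def score(reply: str, scorers: list[Scorer]) -> dict[str, float]:
    """Run every scorer and merge the results into one metadata dict."""
    metadata: dict[str, float] = {}
    for scorer in scorers:
        metadata.update(scorer(reply))
    return metadata


scores = score("Use this: import os", [length_scorer, make_judge_scorer(fake_code_judge)])
print(scores)  # {'response_words': 4.0, 'contains_code': 1.0}
```

Keeping the judge behind a plain callable makes it easy to swap the heuristic for a model later, or to replace the model with a cheaper classifier.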
Cost Implications
Human review is expensive but irreplaceable. LLM-as-a-Judge is cheaper, but costs add up fast when running it on every conversation. The practical answer is layering: small classifiers trained on human labels handle the bulk of the data cheaply; LLM-as-a-Judge runs on a subsample; humans review the most ambiguous or high-value examples.
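The layering can be expressed as a small routing function. This sketch makes several assumptions that are not from any particular tool: the classifier returns a confidence between 0 and 1 that a conversation has a problem, "ambiguous" means that confidence falls in a middle band, and the judge subsample is chosen by hashing the conversation ID so the same session always lands in the same tier. The band and rate are placeholder values.

```python
import hashlib


def in_subsample(conversation_id: str, rate: float) -> bool:
    """Deterministically place a fraction `rate` of IDs in the subsample."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # stable value in [0, 1)
    return bucket < rate


def route(conversation_id: str, classifier_confidence: float, *,
          ambiguity_band: tuple[float, float] = (0.4, 0.6),
          judge_rate: float = 0.1) -> str:
    """Decide which review tier handles a conversation."""
    low, high = ambiguity_band
    if low <= classifier_confidence <= high:
        return "human"       # classifier is unsure: most valuable for a person
    if in_subsample(conversation_id, judge_rate):
        return "judge"       # spot-check the classifier with LLM-as-a-Judge
    return "classifier"      # cheap default for the bulk of the data


# Hypothetical sessions mapped to classifier confidence.
sessions = {"conv-1": 0.97, "conv-2": 0.52, "conv-3": 0.08}
tiers = {cid: route(cid, confidence) for cid, confidence in sessions.items()}
print(tiers)  # conv-2 falls in the ambiguity band, so it goes to human review
```

Disagreements between the judge subsample and the classifier are a natural source of new examples for the human queue.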
How are you keeping track of your chat and agent sessions?