
the agent quality problem

everyone's building agents. almost nobody's measuring whether they're actually good. evaluation is the difference between a demo and a product.

engineering · wenzel orland · 2026-03-15 · 5 min

there's a pattern in AI right now. someone builds an agent. they demo it. it looks impressive. it gets deployed. six weeks later, nobody uses it.

the reason is almost always quality. the agent doesn't understand the domain well enough. it hallucinates in ways that erode trust. it can't recover from errors gracefully. it works in demos because demos are controlled. production isn't.

this is the agent quality problem. and most teams aren't even measuring it.

at doobls, evaluation is built into the platform. every agent interaction generates metrics: response relevance, memory retrieval accuracy, tool selection appropriateness, factual consistency. these are the signals that tell you whether expertise is being captured and applied correctly.
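here's roughly what that per-interaction scoring can look like. a minimal sketch in python, not the actual doobls schema; the field names and the 0.8 threshold are illustrative assumptions.

```python
# a minimal sketch of per-interaction eval scoring.
# field names and threshold are illustrative, not the doobls API.
from dataclasses import dataclass

@dataclass
class InteractionEval:
    response_relevance: float         # 0-1: does the answer address the query
    memory_retrieval_accuracy: float  # 0-1: were the right memories pulled
    tool_selection: float             # 0-1: was the chosen tool appropriate
    factual_consistency: float        # 0-1: does the answer contradict sources

    def passes(self, threshold: float = 0.8) -> bool:
        """an interaction only counts as good if every signal clears the bar."""
        return all(
            score >= threshold
            for score in (
                self.response_relevance,
                self.memory_retrieval_accuracy,
                self.tool_selection,
                self.factual_consistency,
            )
        )
```

the point of scoring every signal separately is that an agent can be fluent and still wrong: high relevance with low factual consistency is exactly the failure mode that erodes trust.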

we run automated evals across model providers. when a new model ships (and they ship constantly), we benchmark it against our evaluation suite before adopting it. metrics, not vibes.
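the adoption gate can be as simple as "the new model has to beat the current one on the suite." a hedged sketch below; the scorer, model ids, and margin parameter are stand-ins, not our actual pipeline.

```python
# a sketch of a "benchmark before adopting" gate.
# the scorer and margin are illustrative assumptions, not doobls internals.
from statistics import mean
from typing import Callable

EvalCase = dict  # e.g. {"prompt": ..., "expected": ...}
Scorer = Callable[[str, EvalCase], float]  # (model_id, case) -> score in 0..1

def should_adopt(candidate: str, incumbent: str,
                 cases: list[EvalCase], score: Scorer,
                 margin: float = 0.0) -> bool:
    """adopt the new model only if it beats the current one on the eval suite."""
    candidate_avg = mean(score(candidate, c) for c in cases)
    incumbent_avg = mean(score(incumbent, c) for c in cases)
    return candidate_avg >= incumbent_avg + margin
```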

but quantitative evals only get you so far. the harder question is: does the agent actually sound like the expert it's supposed to embody? does it make the same judgment calls? does it catch the same edge cases?

this is where the 3-layer memory architecture matters. an agent with episodic memory can recall specific precedents. one with narrative memory understands patterns and themes. one with strategic memory follows learned procedures. together, they produce answers that are right because they're grounded in real expertise.
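to make the idea concrete, here's an illustrative sketch of the three layers feeding one answer. the MemoryLayer protocol and retrieve() calls are hypothetical stand-ins, not the doobls API.

```python
# an illustrative sketch of three memory layers grounding one response.
# the MemoryLayer protocol and retrieve() are hypothetical, not the doobls API.
from typing import Protocol

class MemoryLayer(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]: ...

def build_context(query: str,
                  episodic: MemoryLayer,
                  narrative: MemoryLayer,
                  strategic: MemoryLayer) -> str:
    """ground the agent's answer in all three layers of captured expertise."""
    precedents = episodic.retrieve(query, k=3)   # specific past cases
    themes = narrative.retrieve(query, k=2)      # patterns and recurring themes
    playbooks = strategic.retrieve(query, k=2)   # learned procedures to follow
    return "\n\n".join(
        ["precedents:", *precedents,
         "themes:", *themes,
         "procedures:", *playbooks]
    )
```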

if you're building agents and you're not measuring quality rigorously, you're building demos. the gap between demo and product is trust. trust comes from consistent, measurable, verifiable quality. everything else is theatre.
