The agent can still delete a production database - and then gleefully reported its action back to you in the same way a dog might excitedly drop a dead bird at your feet.
In the not-so-distant past there was a brief period when picking an LLM meant picking a vendor. You wrote against their SDK, you paid their bill, and when their servers fell over on a Tuesday afternoon so did yours. Switching was easy enough but did require Ctrl+F'ing your way through the codebase - or asking your coding agent of choice to perform this insultingly simple task for you instead.
Then on the sixth day came the LLM gateway! One endpoint, dozens of models behind it, swap them in and out faster than the UK swaps Prime Ministers (and for international readers here is the context on that one). The gateway is, on its own terms, a genuinely nice piece of engineering, and there is now a number of teams shipping model routers with various bells and whistles: Portkey, LiteLLM, OpenRouter, TrueFoundry, Kong, Cloudflare.
Sitting alongside the gateway, usually paid for separately and instrumented by a different team three weeks later, is the LLM observability stack. Langfuse, Helicone, Arize Phoenix, Braintrust, LangSmith, Datadog now too. Pick your favourite, they mostly do the same thing. They capture every prompt and every completion, stitch them into traces, and present you with dashboards.
Between the two, your AI product is being watched harder than a Premier League VAR decision.
What Is Observability?
LiteLLM shipped to solve a pretty annoying problem: every provider's API was almost-but-not-quite the same. Portkey and OpenRouter commercialised the idea, TrueFoundry and Bifrost followed, and the existing API-gateway crowd (Kong, Cloudflare) extended their products to cover LLMs the moment it became clear LLMs were here to stay whether anyone liked it or not.
The observability side has a similar shape. Langfuse, Helicone, Arize Phoenix, Braintrust all turned up in 2023 to build native LLM observability from scratch. The companies that already did ML observability - Fiddler, WhyLabs, Galileo - pivoted with the arrival of ChatGPT and have spent the last three years insisting they were actually always about LLMs really anyway.
By 2026 both categories had matured enough to start being acquired. Portkey went to Palo Alto Networks in April for somewhere around $120-140M. Helicone went to Mintlify the month before. Langfuse was rolled into ClickHouse's Series D in January. Cisco picked up Galileo and folded it into Splunk.
All Data, No Insight
So you now have hundreds of thousands, possibly millions, of traces composting nicely in a database somewhere. Every prompt, every completion, every tool call, every retry. The gateway has them as individual requests. The observability tool has them stitched into trajectories. Between the two, your AI product is the most thoroughly documented thing in your engineering org.
The gateway, by design, treats each call as a self-contained unit - the only way to route, cache, fall back and bill at scale. It doesn't have a concept of the trajectory because it doesn't need one. As far as the gateway is concerned, each call was a separate transaction.
The observability tool does have a concept of the trajectory. It will stitch the spans together, draw you a lovely waterfall diagram, and tell you exactly how the agent arrived at whatever it arrived at. What it won't tell you is whether what it arrived at was any good - because good is a judgement, and the observability tool is, as its name suggests, there to observe.
Every call along the way can pass every check either tool runs. A 200 from the provider, the right shape of completion, no PII, latency inside SLO, and cost on budget. The agent can still invent a company policy, ignored an instruction from two steps earlier, and make three perfectly correct-looking calls on its way to deleting a production database - and then gleefully reported its action back to you in the same way a dog might excitedly drop a dead bird at your feet.
Where Observability Fails
The observability vendors aren't naive about any of this. Most of them ship some version of trace review - LangSmith has annotation queues, Braintrust has human review alongside LLM-as-judge scoring, Langfuse has dataset workflows, Arize Phoenix ships evals out of the box. Teams use them. Engineers were spending afternoons grading traces and writing scoring functions and tagging failure modes, and the tools make this less painful than it would otherwise be.
What you get at the end of all that effort is a graded slice of traces. A few thousand examples marked good or bad, often with structured scores attached, sometimes with notes. This is useful - for catching regressions, for spotting drift, for arguing with your model provider about why their new release broke something. It is not, on its own, a training or eval dataset, and it is several steps short of a model that has actually improved.
LLMs are intelligently dumb, they will learn an approximation of what you teach them, as such you need to teach them exactly what you want them to do. A graded trace is an observation. A dataset is a curated, balanced, deduplicated set of examples organised around the behaviours you want the model to learn. The difference between them is curation logic, clustering, sampling strategy, and a clear view of what the model is supposed to get better at - none of which the observability tool produces.
And even once you have the dataset, the pipeline from there to a better agent has its own shape. You need to fine-tune a model or train a LORA adapter, eval-gate the result against a held-out slice, deploy the new weights behind your routing layer, and watch the next week of traces to see whether the change actually moved the agent's behaviour or just shuffled the failure modes around. Each of those steps is its own piece of infrastructure.
Bridging The Gap
Some of the vendors have started shipping bolt-ons aimed at closing parts of this. Portkey's Autonomous Fine-Tuning wires gateway logs into a provider's training API and hands off the dataset. LangChain's Engine edits prompts and code from production traces. Braintrust's Loop generates evaluators and datasets from natural-language descriptions of failure modes. Each is a real move in the right direction, and each stops one step short of the model itself. The dataset gets handed off and the trail goes cold.
This is the gap between what's possible and what's shipped. The technology to take a production trace, score it, cluster it, curate it into a dataset, fine-tune a model on it, eval-gate the result and deploy it back through your routing layer - all of that exists, in pieces, sold by different companies with different pricing pages. What doesn't exist yet is a single layer that does the whole loop. You can probably see where i’m going with this but lets not jump ahead.
The Opportunity
Cursor is the example everyone reaches for, and for good reason. They trained their Composer model on production agent traces using online reinforcement learning, and it now beats the frontier at its size on cost and performance. The model that ships to Cursor users is the model trained on what Cursor users actually do. The labelling encodes the judgement. The fine-tune encodes the labelling. The product encodes the fine-tune. It compounds to create a very powerful product.
Doing this used to require a research team and the patience of an academic institution. It doesn't anymore. Open-weights bases - Llama, Qwen, Mistral, DeepSeek - are now genuinely competitive on scoped tasks. A LoRA fine-tune on a few thousand labelled trajectories runs for low hundreds of dollars on Modal or Together or your own GPUs if you're feeling brave. Unsloth and Tinker have made the mechanics broadly accessible to anyone willing to spend a weekend with them, with the small caveat that you do still currently have to be a bit of a genius to use them properly.
What this adds up to is something most teams haven't fully reckoned with yet. The data is sitting in the gateway. The traces are sitting in the observability tool. The curation can be automated. The fine-tuning is cheap. The deployment is a config change. The only thing standing between any reasonably-sized team and a continuously-improving fleet of specialist models trained on their own production traffic is whether anyone has joined the pieces up.
Overmind
And we arrive at the aforementioned point. We built Overmind on the same bet we wrote about in Think Smaller: the teams who win the next phase of agentic AI will be the ones who own their models, not the ones who rent the most powerful ones, and certainly not the ones who own the most expensive plumbing in front of those rented models.
Overmind is the layer that turns the data your gateway and observability stack are already collecting into models that improve because of it. It plugs into the trace store you've already got, runs the curation, builds the dataset, fine-tunes the model, eval-gates the result and deploys the new weights behind your existing routing layer. The gateway keeps doing what it's good at. The observability stack keeps doing what it's good at. Overmind does the bit in between that nobody else is doing.
Keep your gateway. Keep your observability stack. Just stop pretending they're enough on their own.
FAQ
What's the difference between an LLM gateway and LLM observability? Gateways route and rate-limit calls between your app and model providers. Observability tools record traces and let you inspect them. Neither curates the data or trains models.
Why isn't observability enough for production agents? Observability shows you what happened. It doesn't tell you whether the trajectory was good, and it doesn't turn that judgement into model improvements. That gap is the bottleneck.
What does Overmind do that observability tools don't? Overmind sits on top of your trace store, curates the data into training-grade datasets, fine-tunes specialist models, evaluates them, and deploys the new weights behind your existing routing layer.
Tyler Edwards is Co-Founder and CEO of Overmind. He writes about agent infrastructure, fine-tuning, and what it takes to ship AI that actually improves in production. Connect on LinkedIn.


