
One Throat to Choke Means No Independent Witness

Eval and observability used to be sidecars. You ran a model from one vendor, scored it with a second tool, and traced it with a third. That constellation is collapsing. Model platforms are pulling eval, red-teaming, provenance, permissioning, and backup closer to the runtime itself. OpenAI's reported absorption of eval infrastructure, including a rumored Promptfoo acquisition, would fit this pattern if confirmed.

Enterprises have reasons to like the move. When an agent leaks data or hallucinates a refund policy, procurement does not want to mediate a blame war between a model vendor, an eval vendor, and an observability vendor. A single accountable platform is easier to buy, govern, and blame. Unified accountability sells, and the vendor that owns model plus eval plus deployment has a credible enterprise pitch.

The inversion is that one throat to choke can also mean no independent witness. The sidecar architecture looked like integration overhead, but it functioned as a separation of powers. When your eval layer is sold by the same vendor whose model it grades, the party being scored gains control over the scorer.

What the sidecar actually bought you

Consider what an independent eval layer did for a buyer:

It produced failure evidence the model vendor could not easily reframe. If your third-party eval suite showed a regression after a silent model update, you had leverage in the renewal conversation.
It made benchmarks portable. The same harness scored GPT-class models, Claude-class models, and open weights, which kept switching costs low and pricing honest.
It separated incident forensics from the party with the most to lose. Provenance logs held by the vendor that caused the incident are weaker evidence than logs held by a neutral system.

Fold all of that into the runtime and each property can degrade. The regression report may now come from the vendor that shipped the regression. The benchmark harness may speak one vendor's API natively and everyone else's grudgingly. The audit trail may live inside the blast radius of the thing being audited.

The accountability trade is real but asymmetric

The consolidation case is not fake. Multi-vendor finger-pointing is a real cost, and a single accountable platform can reduce it. For buyers without the staff to run eval infrastructure, the integrated stack may be the right default.

But the trade is asymmetric over time. The blame-war cost is paid during incidents, which are occasional. The independent-witness cost is paid continuously and quietly: in every renewal negotiation where you cannot produce neutral performance data, in every migration estimate inflated by eval lock-in, in every silent model update you cannot independently check.

One plausible failure mode is a platform shipping a model update that improves average benchmark scores but degrades your specific workload. With an independent eval harness, you catch it in your own regression suite and escalate with data. With evals absorbed into the platform, you are relying on the vendor's dashboards to surface a problem caused by the vendor's own change. They might surface it quickly. They might also define the regression differently than you do.
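
A minimal sketch of that failure mode, assuming you already have per-case scores from two runs of your own suite. The case names, scores, and the 0.02 tolerance are hypothetical and chosen only to illustrate the point: the average improves while one workload-critical case gets worse.

```python
from statistics import mean


def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {case_id: drop} for every case whose score fell by more than `tolerance`.

    `baseline` and `candidate` map test-case ids to scores in [0, 1].
    """
    regressions = {}
    for case_id, old_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None:
            continue  # case missing from the new run; worth flagging separately
        drop = old_score - new_score
        if drop > tolerance:
            regressions[case_id] = round(drop, 4)
    return regressions


# Hypothetical scores before and after a platform-side model update.
baseline = {
    "refund_policy": 0.90,
    "pii_redaction": 0.95,
    "general_qa_1": 0.70,
    "general_qa_2": 0.70,
}
candidate = {
    "refund_policy": 0.60,
    "pii_redaction": 0.95,
    "general_qa_1": 0.90,
    "general_qa_2": 0.90,
}

# The aggregate view: the update looks like an improvement.
avg_delta = mean(candidate.values()) - mean(baseline.values())

# The per-case view: one case your business depends on has regressed.
regressed = find_regressions(baseline, candidate)

print(f"average score change: {avg_delta:+.3f}")
print(f"regressed cases: {regressed}")
```

The design choice that matters here is that the regression is defined per case, by you, rather than by an aggregate the vendor selects.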

What buyers should actually do

Buyers can adopt integrated platforms and still keep a thin independent layer alongside them:

Keep a small, portable eval suite outside the platform, even if the platform's built-in evals are better. Its job is to be a witness the vendor does not control, not to provide coverage.
Export provenance and trace data to storage you own, on a schedule. Treat platform-native logs as convenient, not canonical.
Negotiate eval portability into contracts. If your test cases and scoring configs cannot leave the platform, your benchmarks are hostages.
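
As a rough illustration of how thin that independent layer can be, here is a sketch in which the model is just a callable from prompt to text and the test cases serialize to plain JSON Lines. Every name here (EvalCase, run_suite, the substring scoring rule, the stand-in model) is an assumption made for the example, not any platform's API.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List


@dataclass
class EvalCase:
    case_id: str
    prompt: str
    must_contain: str  # deliberately simple scoring rule: case-insensitive substring


def run_suite(cases: List[EvalCase], model: Callable[[str], str]) -> List[dict]:
    """Run each case through `model` and record the output and a pass/fail flag."""
    results = []
    for case in cases:
        output = model(case.prompt)
        results.append({
            "case_id": case.case_id,
            "passed": case.must_contain.lower() in output.lower(),
            "output": output,
        })
    return results


def cases_to_jsonl(cases: List[EvalCase]) -> str:
    """Serialize cases to JSON Lines so they can live in storage you own."""
    return "\n".join(json.dumps(asdict(case)) for case in cases)


def cases_from_jsonl(text: str) -> List[EvalCase]:
    """Load cases back from JSON Lines text."""
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


def fake_model(prompt: str) -> str:
    """Stand-in for a real vendor call; swap in any function with this shape."""
    return "Refunds are available within 30 days of purchase."


cases = [
    EvalCase("refund_window", "What is the refund window?", "30 days"),
    EvalCase("late_refund", "Can I get a refund after two years?", "not eligible"),
]

results = run_suite(cases, fake_model)
for result in results:
    print(result["case_id"], "PASS" if result["passed"] else "FAIL")
```

Because the suite only depends on a prompt-to-text function and a text file, pointing it at a different vendor means writing one small adapter, and the cases and results never have to live inside the platform being graded.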

The market is consolidating because unified accountability is a real product. But accountability you cannot verify from the outside is just a promise with better packaging. The vendors winning the platform play understand this asymmetry better than most buyers do. The buyers who do best in this cycle will be the ones who accept the integrated runtime and still keep one small grader the vendor cannot touch.