Evals, provenance, and execution controls are merging into the agent runtime. The advantage is shifting from model quality to who owns the deployment surface.

The Moat Moved to the Runtime

Competition in AI is usually framed as a contest over the model: better weights, bigger context, higher scores on the benchmark of the month. That frame is losing explanatory power. Three recent signals point elsewhere. The fight is moving to the deployment surface, and evaluation and safety are being pulled inside the runtime that executes the agent.

Three signals, one direction

If reports hold, OpenAI has moved to acquire Promptfoo, an evaluation tooling company. Buying an eval vendor is a bet on owning the layer where teams decide whether an agent is behaving, not on scoring a model on a leaderboard.
GitHub has published a threat model and security architecture for running agents inside GitHub Actions. It treats safety as a property of the execution environment: the platform documents how it helps you run agents safely, as a built-in feature rather than an outside report someone commissioned.
OpenClaw's 2026.3.8 release added ACP provenance, backup verification, and SSRF (server-side request forgery) hardening. These ship as native runtime features rather than add-on audits.

The pattern: capabilities that used to sit outside the system as an external check are being absorbed into the system as built-in behavior.

The inversion

Independent evaluation was supposed to be a check on the platform. An outside eval harness, a third-party audit, a red team you did not employ. The value of that arrangement was distance. The referee was not on the home team.

Absorbing evals and safety into the runtime removes that distance, and that is why it is strategically valuable. Whoever owns the runtime shapes the default definition of "passing," holds the provenance trail that says what ran, and sets the controls that decide what an agent is allowed to do. That control is not total. Enterprise buyers can still demand external telemetry, standardized formats, or independent audits. But the default path runs through the platform, and defaults are sticky. A feature that looked neutral becomes a moat once it lives inside the platform. Ownership of the measurement can be worth more than being good at the thing being measured.

The practical consequence: a slightly worse model on a surface that controls evaluation, provenance, and execution can beat a slightly better model that has to import those things from outside. Model quality is converging and copyable. The runtime, for now, is neither.

What the runtime actually holds

Provenance. A record of which agent did what, with what inputs. This is trust infrastructure, and it compounds. The longer it runs, the harder it is to reconstruct elsewhere.
Execution controls. SSRF hardening and sandboxing decide what an agent can reach. These are policies enforced at the boundary, not properties of the model.
Backup and verification. Recoverability is a platform guarantee, not a prompt.
Evaluation. If your definition of correct behavior is enforced by the same system that runs the behavior, switching platforms means rewriting your standards, not just your prompts.
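To make the provenance item concrete, here is a minimal sketch of a tamper-evident log, assuming a simple SHA-256 hash chain. It is not how OpenClaw's ACP provenance or any specific platform works; `ProvenanceLog`, `append`, and `verify` are hypothetical names chosen for illustration. It shows why a trail compounds: every record commits to all the records before it.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Hypothetical starting value for the chain; any fixed constant would do.
GENESIS = "0" * 64


def _digest(prev_hash: str, entry: dict) -> str:
    """Hash an entry together with the hash of the record before it."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


@dataclass
class ProvenanceLog:
    """Illustrative append-only log of which agent did what, with what inputs."""

    records: list = field(default_factory=list)

    def append(self, agent: str, action: str, inputs: dict) -> str:
        prev_hash = self.records[-1]["hash"] if self.records else GENESIS
        entry = {"agent": agent, "action": action, "inputs": inputs}
        record = {"entry": entry, "prev": prev_hash, "hash": _digest(prev_hash, entry)}
        self.records.append(record)
        return record["hash"]

    def verify(self) -> bool:
        """Recompute every link; any edited or reordered record breaks the chain."""
        prev_hash = GENESIS
        for record in self.records:
            if record["prev"] != prev_hash:
                return False
            if record["hash"] != _digest(prev_hash, record["entry"]):
                return False
            prev_hash = record["hash"]
        return True


log = ProvenanceLog()
log.append("triage-agent", "open_issue", {"repo": "example/repo"})
log.append("triage-agent", "add_label", {"label": "bug"})
chain_ok = log.verify()
```

Whoever stores the chain is the party that can attest to what ran, which is the position the runtime owner holds by default. An exported copy can be re-verified by anyone, but only if the record format is documented.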

Each of these raises switching cost, and none is settled by picking a better base model.
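The execution-control item can be illustrated the same way. The following is a minimal sketch of an outbound-request check of the kind SSRF hardening implies, assuming an HTTPS-only policy that rejects any destination resolving to a non-public address. The function name `egress_allowed` and the injectable `resolve` parameter are hypothetical, and this is not any platform's actual implementation.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Assumed policy for this sketch: agents may only make HTTPS requests.
ALLOWED_SCHEMES = {"https"}


def is_public_ip(address: str) -> bool:
    """True only for globally routable addresses (not loopback, private, link-local)."""
    try:
        return ipaddress.ip_address(address).is_global
    except ValueError:
        return False


def egress_allowed(url: str, resolve=socket.getaddrinfo) -> bool:
    """Decide at the boundary whether an agent may fetch this URL."""
    try:
        parts = urlsplit(url)
        port = parts.port or 443
    except ValueError:
        return False  # malformed URL or port
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    try:
        infos = resolve(parts.hostname, port)
    except OSError:
        return False  # unresolvable hosts are denied
    addresses = {info[4][0] for info in infos}
    # Deny if ANY resolved address is internal, not just the first one.
    return bool(addresses) and all(is_public_ip(a) for a in addresses)


def echo_resolver(host, port):
    """Test stand-in that treats the hostname as an already-resolved IP literal."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", (host, port))]


public_ok = egress_allowed("https://8.8.8.8/data", resolve=echo_resolver)
metadata_blocked = egress_allowed("https://169.254.169.254/latest", resolve=echo_resolver)
```

The check never consults the model or the prompt, which is the sense in which the policy belongs to the boundary. A production control would also need to connect to the exact address it validated, since a second DNS lookup could return a different answer.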

What this does not prove

This is a directional inference from co-occurring signals, not a proven trend, and it deserves hedging. The Promptfoo acquisition, as of this writing, is reported rather than a closed and confirmed deal, and an acquisition can be read more than one way. Platform-native safety and independent evaluation are not mutually exclusive. The more likely near-term outcome is that both coexist, with native controls handling the default path and outside checks handling the cases that matter most. And a moat on the deployment surface is only as durable as the lock-in it actually creates. If provenance and eval formats standardize and become portable, the surface advantage thins out fast.

What to watch

If this read is right, expect more eval and safety tooling to be acquired or built directly into agent platforms rather than sold as standalone products. Expect provenance and execution policy to become procurement questions, not afterthoughts. And expect the interesting competitive move to be less "we trained a better model" and more "we own the surface where your agents actually run."

The teams that treat the runtime as plumbing will keep optimizing the model. The teams that treat the runtime as the product are the ones worth watching.