When NVIDIA launched Nemotron 3 Super in mid-March 2026, the headline was a 120-billion-parameter open model. That is the wrong number to anchor on.
The number that actually drives the strategy is 12 billion: the parameters active per forward pass in its mixture-of-experts design. A model can advertise 120B total capacity and still compute like a 12B model on each token, though all 120B parameters must still be held in GPU memory, so the VRAM footprint stays that of a 120B model. That gap between advertised size and active compute size is the whole point.
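A back-of-the-envelope sketch makes the gap concrete. It assumes 16-bit weights (2 bytes per parameter) and the common rule of thumb of roughly 2 FLOPs per active parameter per token. Both are illustrative assumptions, not NVIDIA's published figures, and the estimate ignores KV cache, activations, and attention cost.

```python
def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


def approx_flops_per_token(active_params: float) -> float:
    """Rule of thumb: about 2 FLOPs per active parameter per token."""
    return 2.0 * active_params


TOTAL_PARAMS = 120e9    # advertised size, from the article
ACTIVE_PARAMS = 12e9    # active per forward pass, from the article
BYTES_PER_PARAM = 2.0   # assumption: 16-bit weights, no quantization

memory_gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM)
sparse_flops = approx_flops_per_token(ACTIVE_PARAMS)
dense_flops = approx_flops_per_token(TOTAL_PARAMS)
compute_ratio = dense_flops / sparse_flops

print(f"weights in memory: {memory_gb:.0f} GB")
print(f"approx FLOPs per token (sparse): {sparse_flops:.2e}")
print(f"dense 120B would cost about {compute_ratio:.0f}x more compute per token")
```

Under those assumptions the weights alone occupy about 240 GB, while per-token compute is roughly a tenth of what a dense 120B model would need.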
The inversion
For most of the open-model era, the competitive question was "how large is your model." Larger meant more capable, and the leaderboard rewarded scale. Nemotron 3 Super inverts that framing. The implicit pitch is not that it is bigger but that it is cheaper to serve at the context lengths agents actually need. It also reads as a deliberate U.S. counterweight to Alibaba's Qwen family, which has anchored much of the open-model conversation.
That shift matters because the cost of an open model is not paid at download time. It is paid every hour the model runs in production. A team can pull weights for free and still go broke serving them. The focus is shifting to the unit economics of inference.
What the architecture is actually optimizing
Three choices in Nemotron 3 Super line up behind one goal:
- A hybrid design that pairs Mamba-style sequence modeling with Transformer attention. Mamba layers scale more gracefully with sequence length than pure attention, though the remaining attention layers in a hybrid model will still create bottlenecks at the 1M-token extreme.
- Mixture-of-experts routing, so only about 12B of the 120B parameters fire on any given token. You pay compute for the experts a token is routed to, not for the full bank.
- A 1M-token context window, which NVIDIA positions as matching frontier models like GPT-5.4.
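The routing idea in the second bullet can be sketched generically. This is plain top-k gating, a common mixture-of-experts pattern, and not NVIDIA's actual router: the expert count, the logits, and the `top_k_route` helper are all made up for illustration.

```python
import math


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


# Hypothetical router scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
routing = top_k_route(logits, k=2)
active_fraction = len(routing) / len(logits)

print(f"experts used for this token: {sorted(routing)}")
print(f"fraction of experts doing work: {active_fraction:.2f}")
```

Only the selected experts run for that token, which is why compute tracks the active parameter count while the full expert bank still has to be loaded.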
These choices target a specific constraint for agent workloads: long-context throughput per dollar, rather than peak benchmark scores on short prompts. The pairing is not arbitrary. It reads as a distillation of 2025's scaling experiments, where teams learned that dense attention at extreme context lengths gets expensive faster than the capability gains justify.
Why agents are the right test case
Agent systems are unusually punishing on context. They accumulate tool outputs, retrieved documents, intermediate reasoning, and long histories. A dense model that handles a 4K prompt cheaply can become unaffordable at 200K tokens, because the cost of full attention grows roughly quadratically with context length.
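A rough scaling sketch shows how quickly that bites. It isolates the sequence-mixing term only, assuming cost proportional to length squared for full attention and proportional to length for a linear-time sequence layer. Real models also carry per-token costs that scale linearly, so treat the ratios as intuition rather than measurements.

```python
def relative_cost(tokens: int, baseline_tokens: int, exponent: float) -> float:
    """Cost at `tokens` relative to `baseline_tokens`, assuming cost ~ length ** exponent."""
    return (tokens / baseline_tokens) ** exponent


SHORT_PROMPT = 4_000     # the 4K prompt from the text
LONG_CONTEXT = 200_000   # the 200K context from the text

quadratic_ratio = relative_cost(LONG_CONTEXT, SHORT_PROMPT, exponent=2.0)
linear_ratio = relative_cost(LONG_CONTEXT, SHORT_PROMPT, exponent=1.0)

print(f"full attention term:  {quadratic_ratio:.0f}x the 4K cost")
print(f"linear-time layer:    {linear_ratio:.0f}x the 4K cost")
```

Going from 4K to 200K is a 50x increase in length, but the quadratic term grows by 2,500x under this simplification.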
Consider an agent that spends an afternoon working a single ticket: it reads a codebase, calls a dozen tools, holds onto error traces, and keeps prior steps in context so it does not repeat work. By the end its working context can balloon into the hundreds of thousands of tokens. On a dense model, the cost of each new step climbs as that history grows. This is exactly where the architecture is meant to earn its keep.
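A toy trace shows the accumulation. The starting context, step count, and tokens added per step are invented numbers, and the sketch assumes every step attends over the full accumulated history with nothing trimmed or summarized.

```python
def context_after_each_step(initial_tokens: int, tokens_per_step: int, steps: int) -> list[int]:
    """Context length after each step, assuming nothing is ever dropped."""
    return [initial_tokens + tokens_per_step * step for step in range(1, steps + 1)]


# Hypothetical afternoon: a 20K-token starting context, then 12 tool calls
# that each append about 15K tokens of output and reasoning.
contexts = context_after_each_step(initial_tokens=20_000, tokens_per_step=15_000, steps=12)

final_context = contexts[-1]
total_attended = sum(contexts)            # history tokens attended to across all steps
last_vs_first = contexts[-1] / contexts[0]

print(f"final context: {final_context:,} tokens")
print(f"history attended across all steps: {total_attended:,} tokens")
print(f"last step sees {last_vs_first:.1f}x the context of the first")
```

The total work is driven by the sum over all steps, not by the final context alone, which is why late steps dominate the bill.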
The combination of MoE sparsity and Mamba sequence handling targets these scaling issues. If the long-context throughput claims hold, this architecture could meaningfully lower the per-request compute cost of running an agent over a large working context, compared to a dense model of similar advertised size.
NVIDIA reports throughput exceeding competing open models such as Qwen3.5-122B and GPT-OSS-120B in long-context tests. Those are vendor claims measured under conditions NVIDIA chose. They are plausible given the architecture, but they have not been independently confirmed, least of all on your workload.
What to do with this
Do not take the throughput numbers as settled. Treat them as a hypothesis worth testing.
- Benchmark Nemotron 3 Super on your own agent traces at the context lengths you actually hit, not on short synthetic prompts.
- Measure cost per completed task, not tokens per second in isolation. MoE routing and long context interact in ways a raw throughput chart will not show.
- Compare against Qwen3.5-122B and GPT-OSS-120B on the same hardware and the same traces, since those are the relevant alternatives.
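The second bullet's metric can be computed with very little machinery. The sketch below assumes you log GPU-seconds and a pass/fail outcome for each replayed trace; the `TraceRun` record, the sample values, and the hourly rate are hypothetical placeholders, not measurements of any model.

```python
from dataclasses import dataclass


@dataclass
class TraceRun:
    gpu_seconds: float   # total GPU time consumed by the run
    completed: bool      # whether the agent actually finished the task


def cost_per_completed_task(runs: list[TraceRun], gpu_hourly_rate_usd: float) -> float:
    """Total spend across all runs divided by the number of successful runs."""
    total_cost = sum(run.gpu_seconds for run in runs) / 3600.0 * gpu_hourly_rate_usd
    completed = sum(1 for run in runs if run.completed)
    if completed == 0:
        return float("inf")  # nothing finished, so every dollar was wasted
    return total_cost / completed


# Hypothetical results for one model on three replayed traces.
runs = [
    TraceRun(gpu_seconds=1800.0, completed=True),
    TraceRun(gpu_seconds=3600.0, completed=False),
    TraceRun(gpu_seconds=1800.0, completed=True),
]
result = cost_per_completed_task(runs, gpu_hourly_rate_usd=4.0)
print(f"cost per completed task: ${result:.2f}")
```

Failed runs still count toward spend, so a model that is fast per token but finishes fewer tasks can come out more expensive. Running the same function over each candidate's results on identical traces and hardware gives a like-for-like comparison.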
When you evaluate open models from here forward, the first question should increasingly be "what does this cost to run at my context length, under my load." Nemotron 3 Super signals that at least some vendors are adapting their pitch to that question, even if many still heavily market dense peak capabilities. Your evaluation should move with it.