
NVIDIA's Nemotron 3 Ultra: An Open Model Built for Agent Orchestration

Key points

Nemotron 3 Ultra is a 550B-parameter MoE model optimized for long-running agent workflows, not single-turn chat.

Multi-Teacher On-Policy Distillation (MOPD) uses specialized teachers to improve the student model across domains without catastrophic forgetting.

NVIDIA released open weights, data, and recipes under a permissive license, positioning the model for enterprise deployment of agentic systems.

NVIDIA releases Nemotron 3 Ultra, a 550B MoE model with 55B active parameters, targeting long-running agent workflows with claimed throughput gains of up to 5.9x and a new distillation technique.

Tessera Newsroom · 5 min read · June 7, 2026

Source: Nemotron 3 Ultra by NVIDIA (producthunt.com)

On June 4, NVIDIA released Nemotron 3 Ultra, a 550-billion-parameter Mixture-of-Experts model with 55 billion active parameters. The model is open: weights, training data, and recipes are all published. That alone would be notable. But what makes Nemotron 3 Ultra interesting is not its raw benchmark scores, which are competitive but not dominant. It is the architectural and training choices NVIDIA made to optimize for a specific use case: orchestrating long-running, multi-turn agent workflows.

The model is the final release in the Nemotron 3 family. It uses a hybrid Mamba-Transformer architecture, where Mamba layers handle long-context efficiency and Transformer layers preserve precise recall. It is quantized in NVFP4, a 4-bit floating-point format that lets the same checkpoint run on Hopper, Blackwell, and Ampere GPUs using specialized kernels. It includes LatentMoE for expert routing and multi-token prediction layers for faster inference. These are not academic experiments. They are engineering decisions aimed at a concrete bottleneck: agent loops that balloon token counts across many turns.
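To put the 4-bit format in perspective, here is a back-of-envelope estimate of weight storage for a 550B-parameter model. It is a rough sketch, not NVIDIA's published footprint: the helper `weight_gigabytes` is hypothetical, the block size of 16 and the 8-bit per-block scale are assumptions about the NVFP4 layout rather than figures from the release, and real checkpoints typically keep some tensors in higher precision.

```python
def weight_gigabytes(params, bits_per_weight, block_size=None, scale_bits=0):
    """Approximate weight storage in decimal gigabytes.

    params          -- number of parameters (550e9 is from the article)
    bits_per_weight -- 16 for BF16, 4 for a 4-bit format
    block_size      -- ASSUMED number of weights sharing one scale factor
    scale_bits      -- ASSUMED size of each per-block scale factor
    """
    bits = params * bits_per_weight
    if block_size:
        # Block-scaled formats store one extra scale value per block.
        bits += (params / block_size) * scale_bits
    return bits / 8 / 1e9


if __name__ == "__main__":
    total_params = 550e9
    bf16 = weight_gigabytes(total_params, 16)
    fp4 = weight_gigabytes(total_params, 4, block_size=16, scale_bits=8)
    print(f"BF16 weights:  ~{bf16:.0f} GB")
    print(f"4-bit weights: ~{fp4:.0f} GB (with assumed block scales)")
```

Under these assumptions the 4-bit checkpoint needs a bit more than a quarter of the BF16 storage, which is part of why the quantized model is easier to fit and serve on a single multi-GPU node.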

NVIDIA claims Nemotron 3 Ultra achieves 5.9x, 4.8x, and 1.6x higher inference throughput than GLM-5.1-754B-A40B, Kimi-K2.6-1T-A32B, and Qwen-3.5-397B-A17B, respectively, with 8k-token inputs and 64k-token outputs. On agentic benchmarks, it used up to 30% fewer total tokens per completed task, according to NVIDIA’s experiments on SWE-bench and Terminal-Bench 2.0. The throughput advantage comes from the architecture and the quantization, not from model size alone: at 550B total parameters, Nemotron 3 Ultra is smaller than GLM-5.1 and Kimi-K2.6, but its 55B active parameters exceed the active counts listed for all three comparison models.
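The size comparison can be checked directly from the model names. The short script below tabulates total and active parameter counts and the resulting active fraction. It assumes the "-A40B" style suffix denotes active parameters and reads Qwen's trailing "17B" the same way; the `MODELS` table and `active_fraction` helper are illustrative, not part of any NVIDIA tooling.

```python
# (total, active) parameter counts in billions.
# Nemotron's figures come from the article; the others are read off the
# model names in the throughput comparison (ASSUMPTION: the trailing
# size in each name is the active-parameter count).
MODELS = {
    "Nemotron 3 Ultra": (550, 55),
    "GLM-5.1": (754, 40),
    "Kimi-K2.6": (1000, 32),
    "Qwen-3.5": (397, 17),
}


def active_fraction(total_b, active_b):
    """Share of parameters used per token in a Mixture-of-Experts model."""
    return active_b / total_b


if __name__ == "__main__":
    for name, (total, active) in MODELS.items():
        frac = active_fraction(total, active)
        print(f"{name:18s} total={total:>5}B  active={active:>3}B  fraction={frac:.1%}")
```

Read this way, Nemotron 3 Ultra activates more parameters per token than any of the three comparison models, so the claimed throughput lead cannot be explained by a smaller active footprint.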

The distillation method that matters

The most novel contribution in the release is Multi-Teacher On-Policy Distillation, or MOPD. NVIDIA trained more than 10 specialized teacher models, each with its own domain-specific pipeline. During MOPD, the student model (Nemotron 3 Ultra) generates rollouts across domains and receives dense reward signals from the corresponding teachers. The process runs asynchronously: student rollout generation, teacher scoring, and student optimization are fully pipelined.

MOPD is also iterative. After producing a checkpoint, new rounds of teacher training are initialized from the updated student model, and the improvements are merged into the next MOPD stage. This co-evolution between student and teachers is a departure from static distillation, where a fixed teacher transfers knowledge to a student once. NVIDIA says this enables continuous capability improvement and progressively stronger specialization across domains.

The method is significant because it addresses a problem that has become acute in frontier model development: how to improve a model across many domains simultaneously without catastrophic forgetting or plateauing. Most labs train on broad, general data and then fine-tune for specific capabilities. MOPD flips the approach. It starts with specialized teachers and lets the student learn from all of them at once, during its own generation process. Whether this generalizes beyond NVIDIA’s training infrastructure remains to be seen, but the recipe is open.
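The core loop described in this section, in which the student samples its own outputs and domain teachers score them, can be sketched in a toy setting. This is a minimal illustration under stated assumptions, not NVIDIA's recipe: the per-token reward `log p_teacher - log p_student` is a common on-policy distillation formulation assumed here (the release is described only as using "dense reward signals"), the "models" are single categorical distributions over a four-token vocabulary, the loop is synchronous rather than pipelined, and teachers are fixed rather than retrained. The names `mopd_toy` and `reverse_kl` are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def reverse_kl(p_student, p_teacher):
    """KL(student || teacher): the quantity the toy loop drives down."""
    return float(np.sum(p_student * (np.log(p_student) - np.log(p_teacher))))


def mopd_toy(teachers, steps=300, batch=256, lr=0.5, seed=0):
    """Toy multi-teacher on-policy distillation.

    teachers -- dict mapping a domain name to a teacher token distribution
                (all probabilities must be strictly positive)
    Returns the student's learned distribution for each domain.
    """
    rng = np.random.default_rng(seed)
    names = list(teachers)
    vocab = len(next(iter(teachers.values())))
    # One row of student logits per domain (a stand-in for prompt conditioning).
    logits = np.zeros((len(names), vocab))
    for _ in range(steps):
        for d, name in enumerate(names):
            p_s = softmax(logits[d])
            # On-policy: the student generates the samples it is graded on.
            tokens = rng.choice(vocab, size=batch, p=p_s)
            # Dense reward from the matching teacher (ASSUMED form).
            reward = np.log(teachers[name][tokens]) - np.log(p_s[tokens])
            advantage = reward - reward.mean()  # baseline to reduce variance
            onehot = np.eye(vocab)[tokens]
            # Score-function (REINFORCE) gradient estimate w.r.t. the logits.
            grad = (advantage[:, None] * (onehot - p_s)).mean(axis=0)
            logits[d] += lr * grad
    return {name: softmax(logits[d]) for d, name in enumerate(names)}


if __name__ == "__main__":
    teachers = {
        "code": np.array([0.7, 0.1, 0.1, 0.1]),
        "legal": np.array([0.1, 0.1, 0.1, 0.7]),
    }
    student = mopd_toy(teachers)
    for name, p in student.items():
        print(name, np.round(p, 3), "KL:", round(reverse_kl(p, teachers[name]), 4))
```

Because each domain has its own logits here, the toy cannot exhibit forgetting. In a real model the domains share parameters, which is where scoring all domains in the same training stage would matter.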

Open data, open weights, open licensing

NVIDIA released four checkpoints: a base model in BF16, a post-trained model in BF16, a quantized NVFP4 model, and a GenRM model used for RLHF. The data releases are substantial. They include 173 billion tokens of fresh code data from GitHub through September 30, 2025, a collection of synthetic legal datasets, and specialized datasets for factual recall, moral scenarios, and diverse generative tasks. For post-training, NVIDIA released 10 million new SFT samples and 1 million new RL tasks across multiple domains, bringing the cumulative Nemotron open data totals to 50 million SFT samples and 2 million RL tasks.

The licensing has also changed. Nemotron model releases are moving to OpenMDW-1.1, the Linux Foundation’s permissive license for open AI model distributions. This covers architecture, parameters, documentation, software, and related artifacts under a single framework. It reduces the licensing ambiguity that has slowed adoption of other open models.

What this means for AI builders

Nemotron 3 Ultra is not a benchmark-chasing model. Its scores on standard evaluations are at best on par with other state-of-the-art open LLMs, not ahead. On PinchBench, it scores 91%, tied with Kimi-K2.6. On Terminal-Bench 2.0, it scores 54%, behind Kimi-K2.6 at 67%. On EnterpriseOps-Gym, a long-horizon planning benchmark, it scores 33%, behind GLM-5.1 at 40%. These are not headline numbers.

But the model is not designed for single-turn chat. It is designed for agent orchestration: planning, tool calling, sub-agent delegation, validation, and error recovery across many turns. NVIDIA’s benchmarks show it completing tasks with fewer total tokens and lower cost. For teams building agentic systems, that tradeoff matters more than a two-point gain on a static benchmark.
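One reason total tokens matter so much in agent loops: if every turn re-reads the accumulated history, input tokens grow roughly quadratically with the number of turns. The sketch below illustrates that with made-up numbers. The function `cumulative_input_tokens` and all of its parameter values are hypothetical, and it ignores prefix caching and context compaction, which change the cost but not the basic shape.

```python
def cumulative_input_tokens(turns, system_tokens, tokens_per_turn):
    """Total input tokens processed across an agent loop.

    Assumes each turn feeds the full history back to the model and then
    appends `tokens_per_turn` new tokens (tool call plus tool result).
    """
    total = 0
    context = system_tokens
    for _ in range(turns):
        total += context            # the whole history is read again
        context += tokens_per_turn  # history grows after every turn
    return total


if __name__ == "__main__":
    # Hypothetical workload: 50 turns, 2,000-token system prompt.
    baseline = cumulative_input_tokens(50, 2_000, 2_000)
    # Same workload with 30% fewer tokens generated per turn.
    leaner = cumulative_input_tokens(50, 2_000, 1_400)
    print(f"baseline: {baseline:,} input tokens")
    print(f"leaner:   {leaner:,} input tokens")
    print(f"saving:   {1 - leaner / baseline:.1%}")
```

In this simplified model, a reduction in tokens emitted per turn carries through to every later turn that re-reads them, which is why per-task token counts are a more useful measure for agent workloads than single-turn latency.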

The model also supports a 1 million token context length and outperforms other open LLMs on the RULER benchmark at that length. For workflows that involve long codebases, research literature, or multi-session conversations, this is a practical advantage.

The open model landscape shifts

NVIDIA is positioning Nemotron 3 Ultra as a platform play. The model integrates with NVIDIA’s NeMo libraries for fine-tuning, Dynamo for deployment, and Hermes Agent for orchestration. The release includes recipes for SFT, LoRA, GRPO, and MOPD. The data is open. The weights are open. The licensing is permissive.

This is a different strategy from building a closed, API-accessible frontier model. NVIDIA is not competing directly with GPT-5 or Claude 4.7 on general intelligence. It is building an open foundation for agentic systems, with the expectation that enterprises and sovereign AI developers will fine-tune and deploy the model on NVIDIA hardware.

The bet is that agent workflows will become the dominant mode of AI deployment, and that the winners in that market will be the platforms that provide the best model for the loop, not the best model for the leaderboard. Nemotron 3 Ultra is the strongest evidence yet that NVIDIA is making that bet.
