Tuesday, 23 December 2025

NVIDIA Nemotron 3: Scaling Hybrid Mamba to 1M Tokens

Introduction

Hybrid Mamba-Transformer models offer a way past the quadratic scaling constraints of dense attention: state-space models (SSMs) provide long-range memory, while Transformer layers handle fine-grained structure. Meanwhile, training methodologies are moving past strict supervision: models develop reasoning skills across coding, mathematics, and tool-use environments through joint Reinforcement Learning (RL) approaches such as concurrent multi-environment RL with verifiable rewards (RLVR) using NeMo Gym, while novel data-synthesis schemes like InfiniByte cross-breed different scientific fields to produce reasoning trajectories unlikely to appear naturally on the Web.

Nemotron 3 pushes this frontier by integrating a sparse hybrid architecture, synthetic data, and reinforcement-learning alignment in a fully controllable, open-weights setting. Rather than chasing sheer size, Nemotron 3 demonstrates that small- to mid-scale models can deliver the long-horizon reasoning, throughput, and agentic stability typical of much larger systems, offering a blueprint for logically consistent, efficient, real-time AI that works within the resource constraints of enterprise settings. The next few sections explore how.

What is Nemotron 3?

Nemotron 3 is a family of Sparse Hybrid Mixture-of-Experts (MoE) large language models optimized for the accuracy-to-compute frontier. Unlike previous generations that relied on dense hybrid structures, Nemotron 3 utilizes a granular expert routing system that allows it to scale parameter counts into the hundreds of billions while maintaining the inference cost of much smaller models.

Model Variants

Nemotron 3 is available in three size variants, covering large-scale production needs with differing reasoning abilities.

  • Nemotron 3 Nano: A model with roughly 30 billion total parameters (the 30B-A3B configuration), of which about 3 billion are active on each forward pass. It is optimised for high-speed applications such as software debugging or local deployment on high-performance workstations.
  • Nemotron 3 Super: A mid-sized model with approximately 100 billion total parameters. Super uses a Latent Mixture-of-Experts (MoE) design with about 10 billion active parameters, targeting higher precision for IT-support automation and multi-agent collaboration.
  • Nemotron 3 Ultra: The flagship of the Nemotron 3 line, with approximately 500 billion total parameters, engineered for the largest and most complex enterprise workloads. Ultra employs NVFP4 (4-bit floating point) to deliver a strong cost-to-accuracy ratio on state-of-the-art Blackwell-generation hardware.

Key Features of Nemotron 3

Nemotron 3 maintains its uniqueness through a number of exclusive technological innovations, which emphasize control and performance:

  • 1-Million Token Context Support: The model adds a long-context phase at the end of pretraining to handle up to 1M tokens, outperforming Qwen3 on the RULER long-context tasks.
  • Granular MoE Routing: Rather than the conventional 8 or 16 experts found in other models' MoE layers, Nemotron 3 Nano uses 128 routed experts plus 1 shared expert, activating just 6 of them per token.
  • Multi-Token Prediction (MTP): The Super and Ultra models include MTP layers that predict multiple future tokens in one step, raising throughput for structured outputs and long reasoning chains.
  • Hardware-Aware Design: The design targets NVIDIA H200 and Blackwell GPUs natively and adopts the NVFP4 format to maximize inference throughput with minimal accuracy loss.
  • Controllable Reasoning: An enable_thinking flag lets users expose the model's internal reasoning trace, which can be a hard requirement in domains such as legal and scientific work.
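The MTP feature above can be illustrated with a speculative-decoding-style acceptance loop. How Nemotron 3 actually consumes its MTP heads at inference is not detailed here, so treat this as a generic sketch: a draft head proposes several future tokens, the base model verifies them in a single pass, and the longest agreeing prefix is accepted.

```python
def accept_draft(draft, verify):
    """Accept the longest prefix of drafted tokens that the main
    model's own predictions agree with (speculative-decoding style;
    how Nemotron 3 uses its MTP heads is an assumption here)."""
    accepted = []
    for d, v in zip(draft, verify):
        if d != v:
            break
        accepted.append(d)
    return accepted

# An MTP head drafts 4 future tokens; the base model verifies them in one pass.
draft  = ["def", "main", "(", ")"]
verify = ["def", "main", "(", ":"]     # disagreement at position 3
print(accept_draft(draft, verify))     # ['def', 'main', '(']
```

When the draft head is usually right, several tokens are emitted per verification pass, which is where the throughput gain for long reasoning chains comes from.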

Use Cases for Nemotron 3

The flexibility of Nemotron 3 makes possible a wide variety of high-value applications in various fields:

  • Enterprise IT & Automation: The Super model is specifically tailored for automating IT tickets and teamwork involving multiple agents, in which the workload has to be handled both quickly and precisely.
  • Software Engineering & Local Debugging: Since the Nano model activates only about 3 billion parameters per token, developers can run it on local machines for code completion, transpilation, and debugging without the latency of cloud APIs.
  • STEM & Scientific Research: By leveraging the InfiniByte data set, it is highly adept at interdisciplinary problem-solving across physics, chemistry, and advanced mathematics.
  • Agentic Tool Use: These models can be fine-tuned on targeted data such as Nemotron-Agentic-v1, after which they can drive multi-turn dialog systems that analyze complex tasks, invoke external tools, and interpret the results.
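The agentic loop in the last bullet can be sketched as a minimal tool-execution harness. The tool name, registry, and JSON call format below are illustrative assumptions, not the Nemotron-Agentic-v1 format:

```python
import json

# Hypothetical tool registry; the names and call schema are illustrative,
# not Nemotron-specific.
TOOLS = {"add": lambda a, b: a + b}

def run_agent_step(model_output):
    """Parse a (toy) JSON tool call emitted by the model, execute the
    named tool, and package the observation to feed back into the dialog."""
    call = json.loads(model_output)
    result = TOOLS[call["name"]](*call["args"])
    return {"role": "tool", "name": call["name"], "content": result}

# The model proposes a call; the harness executes it and returns the result.
obs = run_agent_step('{"name": "add", "args": [2, 3]}')
print(obs)  # {'role': 'tool', 'name': 'add', 'content': 5}
```

In a real agent, the returned observation would be appended to the conversation so the model can interpret the tool's output in its next turn.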

How does Nemotron 3 work?

Nemotron 3 uses a Sparse Hybrid MoE architecture: Mamba-2 layers provide linear-time processing of huge context windows, while Transformer layers with Grouped-Query Attention (GQA) preserve the fine-grained structure needed for high accuracy. The combination delivers the strengths of both. The two layer types are tied together through a granular MoE design with 128 routed experts: a learned MLP router scores the experts for each token and activates only the top six. By engaging only the experts that specialize in a given token's input, the model concentrates its compute where it matters.
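A minimal sketch of the routing step described above, assuming standard top-k softmax gating (the exact normalization Nemotron 3 uses is not public):

```python
import math

def route_token(router_logits, k=6):
    """Pick the top-k experts for one token and renormalize their
    softmax weights so the selected gates sum to 1 (a common MoE
    convention; Nemotron 3's exact scheme is an assumption here)."""
    # Softmax over all routed experts.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k highest-probability experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# 128 routed experts; one token's router logits (toy values).
logits = [0.01 * i for i in range(128)]
gates = route_token(logits, k=6)
print(sorted(gates))                  # experts 122..127 win for this toy token
print(round(sum(gates.values()), 6))  # 1.0
```

Only the six selected experts run their MLPs for this token; the other 122 stay idle, which is what keeps active compute near the 3B-parameter level despite the much larger total.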

Nemotron 3 hybrid architecture.
source - https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/

The Super and Ultra models are constructed differently, using a Latent MoE. Instead of each expert operating directly on distinct full-width token embeddings, experts operate on a shared latent representation of the token. Because roughly four times as many experts can then fit in the same compute budget, the model achieves significantly higher knowledge density without a corresponding increase in inference time.
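The general idea behind a latent MoE can be shown with a toy sketch: a shared projection maps each token from model width D into a smaller latent width L, the expert operates in that cheaper space, and a shared projection maps back. All dimensions and the exact formulation below are illustrative assumptions, not Nemotron 3's published configuration:

```python
import random

random.seed(0)
D, L = 16, 4          # model hidden size vs. shared latent size (toy values)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W_down = rand_matrix(L, D)     # shared projection into the latent space
W_up   = rand_matrix(D, L)     # shared projection back to model width
expert = rand_matrix(L, L)     # each expert now acts on L-dim vectors

token  = [1.0] * D
latent = matvec(W_down, token)   # D -> L
mixed  = matvec(expert, latent)  # the expert works in the cheap space
out    = matvec(W_up, mixed)     # L -> D
print(len(latent), len(out))     # 4 16
```

The per-expert cost drops from O(D^2) to O(L^2), so many more experts fit in the same budget, which matches the knowledge-density claim above.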

Performance Evaluation

The results for Nemotron 3 Nano demonstrate a considerable efficiency improvement. In standard evaluations, Nemotron 3 Nano 30B-A3B scored 78.05% on HumanEval (0-shot) and 92.34% on GSM8K (8-shot), as reported in the technical report's accuracy tables. Notably, it rivals and often outperforms larger or comparably complex models such as GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507.

Accuracy and throughput comparisons
source - https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf

In inference throughput, a critical criterion for real-time tasks, Nemotron 3 Nano is 3.3 times faster than Qwen3-30B-A3B and 2.2 times faster than GPT-OSS-20B on heavy generation workloads (8K input, 16K output) on a single H200 GPU. The gap widens further on long-context tasks: the model beats its competitors on the RULER tests across context lengths up to 1M tokens.

Nemotron 3 Nano evaluations across a broad suite of established benchmarks
source - https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf

Supplemental assessments also show strong general knowledge and tool-use capability: 78.56% on MMLU (5-shot) and 53.8% on the Berkeley Function Calling Leaderboard, validating the model's readiness for complex multi-step tasks. The model also performs strongly on mathematics, scoring 78.63% on MATH-500 with advanced reasoning enabled.

How to Access and Use Nemotron 3

Nemotron 3 models are available through several channels to suit both cloud-native and local-first developers. Base, BF16, and FP8 weights can be accessed on the Hugging Face model hub under the nvidia namespace. For production serving, the models are available through NVIDIA NIM microservices, NVIDIA's optimized inference API. Instructions for running the models locally can be found in the GitHub repos and on the NVIDIA Research webpage. Nemotron 3 is released under the NVIDIA Open Model License; research and commercial applications are broadly permitted, but consult the model card page for specifics.

Limitations 

Nemotron 3 also has certain limitations. Serving a 1M-token context requires substantial memory, well beyond what typical consumer setups (often capped around 256K tokens) can handle. A review of the training data also shows an imbalance toward 'male' and 'White' identifiers, a common issue in large pretrained models that calls for per-prompt bias examination. Looking ahead to the first half of 2026, NVIDIA plans to extend the family with the Super (~100B) and Ultra (~500B) releases and to finalize NVFP4-standardized Latent MoE models to further scale reasoning capability.
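The memory pressure at 1M tokens comes mostly from the attention layers, since the Mamba layers keep a constant-size state regardless of context length. A back-of-envelope KV-cache calculator for the attention layers alone makes the point; all dimensions below are illustrative assumptions, not Nemotron 3's published configuration:

```python
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per=2):
    """Rough KV-cache size for the attention layers only: K and V tensors,
    per layer, per token. Every dimension here is an illustrative
    assumption, not Nemotron 3's published configuration."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per / 2**30

# Toy config: 1M tokens, 12 GQA layers, 8 KV heads of dim 128, BF16 (2 bytes).
print(round(kv_cache_gib(1_000_000, 12, 8, 128, 2), 1))  # 45.8 (GiB)
```

Even with most layers being constant-state Mamba blocks, tens of GiB just for the cache explains why 1M-token serving exceeds typical consumer hardware.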

Possible Technological Advancements and Future Directions

There are many ways Nemotron 3 could continue to evolve by incorporating new technology into its existing system. Dynamic, hardware-aware routing would move past the static limits on expert activation, adapting to the complexity of a given task and the amount of available system memory. That flexibility at inference time would let workloads scale across different kinds of infrastructure, especially within enterprise environments.

Another new direction is recursive synthetic logic evolution. This involves the iterative creation of reasoning scenarios based on observed gaps within a model’s internal reasoning traces using synthetic data pipelines. This self-correcting feedback loop would allow for the improvement of infrequent yet complex failure modes, which are difficult to capture with human-created training datasets alone. Neural symbolic verification of reasoning chains and the use of formal solvers should be added to ensure compliance with regulatory and logical constraints.

Over time, multi-modal state-space layers could also extend these efficient hybrid systems to reasoning over continuously fed data sources, such as video and sensor streams, letting them scale over such data the way they scale over text today.

Conclusion

For the expert, the value is not only in the benchmark results, but also in the controllability – the possibility of turning reasoning traces on and off and leveraging data recipes such as InfiniByte for specific tasks that can never be addressed by natural data. This is an AI model that is as efficient as it is smart.

source:
Research: https://research.nvidia.com/labs/nemotron/Nemotron-3/
News: https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models
Blog : https://developer.nvidia.com/blog/inside-nvidia-nemotron-3-techniques-tools-and-data-that-make-it-efficient-and-accurate/
Tech document: https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf
Nemotron 3 collections: https://huggingface.co/collections/nvidia/nvidia-nemotron-v3
Nano Base-BF16: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
Nano A3B-BF16: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Nano A3B-FP8:  https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8


Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are encouraged to conduct their own research and due diligence.
