April 8, 2025
Nvidia Cracks AI Code With Llama-3.1 Nemotron Ultra Model

Nvidia Unveils Llama-3.1 Nemotron Ultra, A Dense Large Language Model Outperforming DeepSeek R1 at Half the Size
In a significant development in the field of artificial intelligence (AI), Nvidia has released its latest large language model (LLM), built on Meta’s earlier Llama-3.1-405B-Instruct model. The new model, dubbed Llama-3.1-Nemotron-Ultra-253B-v1, posts near-top results on various third-party benchmarks and outperforms DeepSeek R1, the much-vaunted open-source reasoning model, on several of them.
The release reflects Nvidia’s continued focus on performance optimization through architectural innovation and targeted post-training. The model is designed to support advanced reasoning, instruction following, and AI assistant workflows, making it an attractive option for developers looking to build robust AI applications. Its architecture introduces structural variations such as skipped attention layers, fused feedforward networks (FFNs), and variable FFN compression ratios.
This overhaul reduces memory footprint and computational demands without severely impacting output quality, enabling deployment on a single 8x H100 GPU node. The result is a model that offers strong performance while being more cost-effective to deploy in data center environments. Additional hardware compatibility includes support for Nvidia’s B100 and Hopper microarchitectures, with configurations validated in both BF16 and FP8 precision modes.
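One way to picture these structural variations is as a per-block layer plan in which some transformer blocks omit attention entirely and FFN widths are compressed by differing ratios. The sketch below is purely illustrative; every concrete value and name in it is invented and does not reflect Nemotron’s actual layout.

```python
# Illustrative sketch only: representing a non-uniform layer plan of the kind
# the article describes (skipped attention layers, variable FFN compression).
# All concrete values below are invented for illustration.
from dataclasses import dataclass


@dataclass
class BlockPlan:
    has_attention: bool     # skipped-attention blocks drop this sublayer entirely
    ffn_compression: float  # fraction of the baseline FFN width that is kept


# A toy 4-block plan; the real model has far more blocks and a tuned layout.
plan = [
    BlockPlan(has_attention=True,  ffn_compression=1.0),
    BlockPlan(has_attention=False, ffn_compression=0.5),
    BlockPlan(has_attention=True,  ffn_compression=0.25),
    BlockPlan(has_attention=False, ffn_compression=1.0),
]


def relative_ffn_params(plan: list[BlockPlan]) -> float:
    """FFN parameter count relative to an uncompressed, uniform baseline."""
    return sum(b.ffn_compression for b in plan) / len(plan)
```

Skipping attention in some blocks and shrinking FFNs in others is what lets total parameters and activation memory drop without uniformly thinning every layer, which is how the model fits on a single 8x H100 node.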
Post-training for reasoning and alignment
Nvidia enhanced the base model through a multi-phase post-training pipeline. This included supervised fine-tuning across domains such as math, code generation, chat, and tool use, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to further boost instruction-following and reasoning performance.
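GRPO’s key idea is to replace PPO’s learned value function with a group-relative baseline: for each prompt, several responses are sampled and scored, and each response’s advantage is its reward normalized against the group’s mean and standard deviation. A minimal sketch of that normalization step (a schematic of the published GRPO objective, not Nvidia’s training code) might look like:

```python
# Hedged sketch of the advantage computation at the core of GRPO: rewards for
# a group of sampled responses to the same prompt are normalized against the
# group's own mean and standard deviation, so no separate value network is
# needed. This is illustrative, not Nvidia's actual implementation.
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Return each reward normalized by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All responses scored identically: no relative signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

These advantages then weight the policy-gradient update for each response, pushing the model toward completions that scored above their group’s average.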
The model underwent a knowledge distillation phase over 65 billion tokens, followed by continual pretraining on an additional 88 billion tokens. Training datasets included sources like FineWeb, Buzz-V1.2, and Dolma. Post-training prompts and responses were drawn from a combination of public corpora and synthetic generation methods, including datasets that taught the model to differentiate between its reasoning modes.
Improved performance across numerous domains and benchmarks
Evaluation results show notable gains when the model operates in reasoning-enabled mode. On the MATH500 benchmark, performance increased from 80.40% in standard mode to 97.00%. Similarly, results on the AIME25 benchmark rose from 16.67% to 72.50%, and LiveCodeBench scores more than doubled, jumping from 29.03% to 66.31%.
Performance gains were also observed in tool-based tasks like BFCL V2 and function composition, as well as on the graduate-level question-answering benchmark GPQA, where the model scored 76.01% in reasoning mode versus 56.60% without.
Compared to DeepSeek R1, a state-of-the-art mixture-of-experts (MoE) model with 671 billion total parameters, Llama-3.1-Nemotron-Ultra-253B shows competitive results despite having fewer than half as many parameters. The results suggest that Nvidia’s dense model matches or exceeds MoE alternatives on reasoning and general instruction-alignment tasks.
Usage and integration
The model is compatible with the Hugging Face Transformers library (version 4.48.3 recommended) and supports input and output sequences up to 128,000 tokens. Developers can control reasoning behavior via system prompts and select decoding strategies based on task requirements. For reasoning tasks, Nvidia recommends using temperature sampling (0.6) with a top-p value of 0.95. For deterministic outputs, greedy decoding is preferred.
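The reasoning toggle and decoding recommendations above can be wrapped in a small helper. The sketch below assumes the system-prompt mechanism described on Nvidia’s model card (the exact toggle strings are an assumption) and simply encodes the recommended sampling settings; it is not Nvidia’s reference code.

```python
# Hedged sketch: build a chat request for Llama-3.1-Nemotron-Ultra following
# the recommendations in the text. The "detailed thinking on/off" system
# prompt is the toggle described on Nvidia's model card; treat the exact
# strings as an assumption.

def build_request(user_prompt: str, reasoning: bool):
    """Return (messages, generation kwargs) for the chosen reasoning mode."""
    messages = [
        # System prompt toggles reasoning mode on or off.
        {"role": "system",
         "content": "detailed thinking on" if reasoning else "detailed thinking off"},
        {"role": "user", "content": user_prompt},
    ]
    if reasoning:
        # Recommended for reasoning tasks: temperature 0.6, top-p 0.95.
        gen_kwargs = {"do_sample": True, "temperature": 0.6, "top_p": 0.95}
    else:
        # Greedy decoding for deterministic outputs.
        gen_kwargs = {"do_sample": False}
    return messages, gen_kwargs
```

In practice, the messages would be passed through the tokenizer’s `apply_chat_template` and the kwargs forwarded to `model.generate` in the Transformers library.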
The model also supports multilingual applications, with capabilities in English and several additional languages, including German, French, Italian, Portuguese, Hindi, Spanish, and Thai. It is suitable for common LLM use cases such as chatbot development, AI agent workflows, retrieval-augmented generation (RAG), and code generation.
Released under the Nvidia Open Model License and governed by the Llama 3.1 Community License Agreement, the model is licensed for commercial use. According to Oleksii Kuchaiev, Director of AI Model Post-Training at Nvidia, the team was excited to share the open release, describing it as a dense 253B model whose reasoning can be toggled on and off, released with open weights and data.