QUICK ANSWER
Nemotron 3 Ultra is live as of June 4. Access via: Hugging Face (download weights - requires datacenter GPUs), NVIDIA NIM at build.nvidia.com (managed API, pay per token), OpenRouter (multi-model API routing), or ModelScope (Asia-Pacific). The model scores 48 on the Artificial Analysis Intelligence Index - #1 US open-weights - and runs at 300+ tokens/second. Consumer hardware cannot run it. On-premises options via HP DGX Station arrive August 2026.
How to Access Nemotron 3 Ultra on Each Platform
Hugging Face - Download Weights
Search nvidia/Nemotron-3-Ultra-550B on Hugging Face. The weights, training recipes, and dataset cards are published under NVIDIA's open model license. Download requires accepting the license terms. Running the model requires datacenter GPUs - minimum an A100 or H100 cluster with sufficient VRAM and NVLink bandwidth for the full 550B parameter model.
Best for: teams with their own GPU infrastructure who want full control over inference.
NVIDIA NIM - build.nvidia.com
NVIDIA NIM microservices provide managed API access to Nemotron 3 Ultra without requiring your own GPU cluster. Go to build.nvidia.com, select Nemotron 3 Ultra from the model catalog, and authenticate with your NVIDIA developer account. NIM uses an OpenAI-compatible API format - swap the base URL and model name and your existing client code works unchanged.
Best for: developers who want API access without infrastructure overhead.
OpenRouter
Nemotron 3 Ultra is available on OpenRouter's multi-model API. If you are already using OpenRouter for model routing, add Nemotron 3 Ultra as an option by selecting it from the model dropdown at openrouter.ai. OpenRouter pricing for Ultra has not been confirmed at time of writing - check openrouter.ai/models for current rates.
Best for: teams already on OpenRouter who want to route specific tasks to Ultra.
ModelScope
Available on ModelScope at modelscope.cn - same weights as the Hugging Face release. Preferred by developers in Asia-Pacific who use ModelScope as their primary model repository.
Best for: Asia-Pacific developers preferring ModelScope infrastructure.
Hardware Requirements - What You Actually Need to Run It
| Setup |
Can Run Ultra? |
Notes |
| Consumer PC (RTX 4090 / 5090) |
No |
Insufficient VRAM even at 4-bit quantization for 550B |
| Mac Studio / Mac Pro (M3/M4 Ultra) |
No |
Unified memory insufficient; CUDA not supported |
| 8x A100 80GB server |
Marginal |
Possible at aggressive quantization; throughput will be limited |
| 8x H100 80GB server |
Yes |
Comfortable at FP8; production-grade throughput achievable |
| NVIDIA DGX H100 / H200 |
Yes |
Optimal; native NVLink bandwidth enables full-speed inference |
| HP DGX Station (GB300 Grace Blackwell Ultra) |
August 2026 |
775GB coherent unified memory; deskside on-premises deployment |
The hardware picture changes significantly in August 2026 when HP ships the DGX Station for Windows - NVIDIA's new deskside workstation running the GB300 Grace Blackwell Ultra superchip with 775GB of coherent unified memory. That is enough to run Nemotron 3 Ultra at production-grade throughput locally, without cloud dependency. Asus, Dell, Supermicro, MSI, and Gigabyte have Q4 2026 DGX Station releases planned. RTX Spark laptops (the Surface Laptop Ultra and others) arrive in fall 2026 with enough unified memory to run the Nano and Super tiers locally but not Ultra.
What Nemotron 3 Ultra Is Good For - and Where It Falls Short
Ultra is designed specifically for agentic AI workloads - tasks that require planning, multi-step execution, and tool use across long sessions. NVIDIA benchmarked it on scenarios like autonomous code review pipelines, multi-document research summarization, and complex data transformation workflows. The 300+ tokens per second throughput matters here: when an agent is running multiple steps in sequence, latency compounds. At 3-6x the output speed of comparable Chinese open models, Ultra can complete agentic tasks faster than Kimi K2.6 or DeepSeek V4 Pro despite trailing them on raw intelligence scores.
Where Ultra falls short: the raw intelligence gap with Chinese open models (Intelligence Index 48 vs Kimi K2.6's 54, DeepSeek V4 Pro above that) is real. For tasks that require maximum reasoning depth - hard mathematics, competitive programming problems, complex multi-hop research - the Chinese open leaders still outperform. For enterprise teams that need US-jurisdiction open-weights infrastructure, or teams where inference speed and cost matter more than maximum reasoning depth, Ultra is the right choice today.
Frequently Asked Questions
Is Nemotron 3 Ultra free to use?
The weights on Hugging Face are free to download under NVIDIA's open model license. Running them requires your own GPU infrastructure, which is not free. API access via NVIDIA NIM and OpenRouter is paid on a per-token basis - pricing was not confirmed at time of writing. Check build.nvidia.com and openrouter.ai/models for current rates.
Can I fine-tune Nemotron 3 Ultra?
Yes - open weights means you can fine-tune. NVIDIA published training recipes alongside the weights. Fine-tuning a 550B parameter model requires significant GPU infrastructure and is not practical for most teams without access to a DGX cluster or equivalent. LoRA and QLoRA fine-tuning on the 55B active-parameter MoE architecture is more tractable than full fine-tuning - check the NVIDIA-NeMo GitHub repository for fine-tuning recipes and examples.
How does the Nemotron 3 license differ from Meta Llama or Mistral?
NVIDIA's open model license is more permissive than Meta's Llama license for most commercial use cases. Check the license file on the Hugging Face model card for the exact terms - as with all open-weight models, commercial use restrictions vary and should be verified before production deployment.