Decision Guide · Updated March 2026

Which Open LLM Should Power Your Agent?

NemoClaw Nano, Super, or Ultra — the tier you pick shapes your cost, latency, and quality ceiling. This guide walks you through every decision axis so you get it right the first time.

Open-source stacks only · No vendor lock-in · Deployment-agnostic
Nano — Edge · <100 ms · 1B–7B params
Super — Data centre · 100–300 ms · 13B–70B params
Ultra — Cloud/cluster · 300–800 ms · 70B+ params

The Three Tiers at a Glance

Pick the tier that matches your latency budget, hardware, and task complexity — then verify with the decision tree below.

NemoClaw Nano

Edge-first

Designed for edge devices, browser inference, and latency-critical tasks where every millisecond counts. Runs comfortably on a single consumer GPU or modern edge accelerator with no server infra.

Model size: 1B – 7B params
First-token latency: < 100 ms
GPU requirement: RTX 4060+ or edge SoC
VRAM: 4 – 8 GB
Est. infra cost/hr: $0.10 – 0.50
Best for
  • Real-time assistants (<100 ms SLA)
  • Edge / IoT / offline inference
  • Browser/WebGPU agents
  • High-QPS cost-sensitive APIs
  • Simple classification + extraction
Not ideal for
  • Complex multi-step reasoning
  • Long context (>16K tokens)
  • Legal / medical accuracy requirements
Recommended models: Nemotron-Nano-4B, Llama 3.2-3B, Phi-3-mini
Most Popular
NemoClaw Super

Production-grade

The sweet spot for most production agents. Strong enough for coding assist, customer support, data extraction, and multi-step pipelines — without the infrastructure cost of running 70B+ weights.

Model size: 13B – 70B params
First-token latency: 100 – 300 ms
GPU requirement: A10G / A100 (1–2×)
VRAM: 16 – 40 GB
Est. infra cost/hr: $0.50 – 2.00
Best for
  • Coding assist + code review
  • Customer support automation
  • Structured data extraction
  • RAG + retrieval-augmented pipelines
  • Multi-step workflows (5–15 steps)
  • B2B SaaS agents at scale
Not ideal for
  • Sub-50 ms latency requirements
  • Very long context (>64K tokens) on budget
  • Maximum quality for high-stakes domains
Recommended models: Nemotron-Super-49B, Llama 3.1-70B-Instruct, Mistral-Large
NemoClaw Ultra

Maximum quality

For tasks where quality is the only metric that matters. Frontier open-weight performance that rivals proprietary accuracy — but it requires serious GPU infrastructure and a tolerance for higher latency.

Model size: 70B+ params
First-token latency: 300 – 800 ms
GPU requirement: H100 / A100 (4–8×)
VRAM: 80 – 640 GB
Est. infra cost/hr: $2 – 8
Best for
  • Legal review & contract analysis
  • Medical/clinical decision support
  • Complex research synthesis
  • Advanced code generation (>500-line PRs)
  • High-stakes financial modelling
  • Deep multi-agent orchestration (15+ steps)
Not ideal for
  • Latency-sensitive UX (<200 ms)
  • Small teams without MLOps
  • Budget under $5K/month infra
Recommended models: Nemotron-Ultra-253B, Llama 3.1-405B, DeepSeek-V3
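A quick sanity check on the VRAM figures in the tier cards above: a common rule of thumb is weight bytes (parameters × bits ÷ 8) plus roughly 20% overhead for KV cache and activations. The helper below is an illustrative sketch of that rule only, not a sizing tool — real requirements depend on context length, batch size, and serving stack.

```python
def estimated_vram_gb(params_billion: float, bits: int = 16, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% overhead for KV cache/activations."""
    weight_gb = params_billion * (bits / 8)  # GB per billion params at the given precision
    return round(weight_gb * overhead, 1)

# A 7B Nano model quantised to 4-bit fits the 4-8 GB tier budget:
print(estimated_vram_gb(7, bits=4))    # ~4.2 GB
# A 70B model at fp16 is multi-GPU territory:
print(estimated_vram_gb(70, bits=16))  # ~168 GB
```

Note that the tier budgets implicitly assume quantisation at the low end — a 7B model at fp16 alone would exceed Nano's 8 GB ceiling.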

Decision Tree: Find Your Tier in 5 Questions

Answer each question to narrow down the right NemoClaw tier. Or use our interactive selector for a ranked recommendation across all open and proprietary models.

1. What is your latency requirement?

<100 ms (real-time UX, voice, IoT) → Nano
100–300 ms (standard API, async pipelines) → Super (Ultra's 300 ms+ first token rules it out here)
300 ms+ acceptable (batch, background tasks) → Ultra for quality

2. Where will the model run?

Edge device / IoT / browser (WebGPU) → Nano only
Single data-centre GPU (A10G, A100) → Super
Multi-GPU cluster / managed inference API → Ultra

3. What is the task complexity?

Simple: classification, extraction, short QA → Nano sufficient
Medium: coding assist, summarisation, structured output → Super recommended
Complex: multi-step reasoning, legal/medical, 100K+ tokens → Ultra required

4. What is your monthly token volume?

<10M tokens/month (early-stage product) → Nano (cheapest compute)
10M – 500M tokens/month (growth stage) → Super (best cost/quality)
500M+ tokens/month (scale) → Ultra with dedicated infra

5. How strict are your compliance requirements?

None / standard commercial → Any tier — or consider managed APIs
GDPR / SOC 2 — data must stay in region → Super on dedicated cloud node
HIPAA / air-gapped / on-prem required → Ultra on private cluster
💡 Default recommendation: If you answered "Super" to 3 or more questions, start with Super. Nano is a constraint-driven choice (latency or edge hardware forces it). Ultra is a deliberate choice (quality over cost). Super handles 80% of production agent workloads well.
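The five questions above reduce to a simple tally. This is a hedged sketch of that tally (a hypothetical helper, not part of NemoClaw), applying the default rule stated above: three or more "Super" answers means start with Super.

```python
from collections import Counter

def recommend_tier(answers: list[str]) -> str:
    """Tally per-question tier votes; mixed results default to Super."""
    votes = Counter(answers)
    if votes["super"] >= 3:
        return "super"
    # Otherwise require a clear majority for Nano or Ultra; else default to Super.
    top, count = votes.most_common(1)[0]
    return top if count > len(answers) / 2 else "super"

# Answers to: latency, deployment, complexity, volume, compliance
print(recommend_tier(["nano", "super", "super", "super", "nano"]))  # super
print(recommend_tier(["nano", "nano", "nano", "nano", "super"]))    # nano
```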

Deployment Context Matrix

Which tier fits which infrastructure context — at a glance.

Deployment context → viable tier(s) · notes
• Edge / IoT device → Nano · the only viable option at <8 GB VRAM
• Browser / WebGPU → Nano · WebLLM supports 4B and smaller models only
• Single A10G (24 GB VRAM) → Super · 13B–30B runs comfortably here
• Single A100 (80 GB VRAM) → Super; partial Ultra · 70B works, 100B+ needs multi-GPU
• Multi-GPU cluster (4–8× H100) → Ultra · full Ultra range (70B–405B) runs here
• Managed inference API (Together, Fireworks) → all tiers · available via serverless; check rate limits
• AWS Bedrock / GCP Vertex (via API) → limited · managed serving with IAM; check model availability
• Air-gapped / on-prem → all tiers · ensure hardware meets VRAM requirements
• Kubernetes with GPU operator → any tier with sufficient GPUs · vLLM or TensorRT-LLM recommended for serving

Task Complexity → Tier Mapping

Real agent tasks mapped to the tier that handles them reliably in production.

• Classify support ticket (3 categories) → Nano · complexity 1/10
  Single-pass classification. Nano handles it with 95%+ accuracy.
• Extract structured fields from an invoice PDF → Nano · complexity 2/10
  Constrained extraction from short context. Nano with JSON mode.
• Summarise a 5-page document → Nano · complexity 2/10
  Short context, straightforward summarisation. Nano is sufficient.
• Write a 500-word blog post draft → Super · complexity 4/10
  Coherent long-form generation benefits from larger-model quality.
• Review a 200-line code change for bugs → Super · complexity 5/10
  Code reasoning at this scale is reliable on Super (13B+).
• Multi-step research: query 5 APIs, synthesise results → Super · complexity 6/10
  Agentic loop with tool use across multiple steps. Super handles it well.
• Answer questions from a 50-page technical spec → Super · complexity 6/10
  RAG with retrieval keeps context short enough for Super.
• Write a production-ready 2,000-line feature with tests → Ultra · complexity 8/10
  Long, coherent code generation needs Ultra reasoning quality.
• Review a 500-page legal contract, flag risk clauses → Ultra · complexity 9/10
  Long context plus domain accuracy requires Ultra. Consider a Claude fallback.
• Plan and execute a 50-step autonomous research project → Ultra · complexity 10/10
  Maximum-complexity agentic task. Ultra only. Add human checkpoints.
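Several of the Nano tasks above (ticket classification, invoice extraction) assume constrained JSON output. Whichever model you route to, validate the response before trusting it downstream. This is a minimal sketch with a hypothetical field list — not NemoClaw's actual API or schema.

```python
import json

REQUIRED_FIELDS = {"invoice_number", "total", "currency"}  # hypothetical schema

def parse_extraction(raw: str) -> dict:
    """Parse a model's JSON-mode response and check required fields are present."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"model omitted fields: {sorted(missing)}")
    return data

# A well-formed model response passes validation:
resp = '{"invoice_number": "INV-102", "total": 412.50, "currency": "EUR"}'
print(parse_extraction(resp)["total"])  # 412.5
```

Failing fast on malformed output is what lets you safely keep extraction on the cheapest tier: a validation error can trigger a retry or an escalation to Super.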

The Hybrid Strategy: Route by Complexity

The best production systems don't pick one tier — they route tasks to the cheapest model that can handle them reliably.

Example: Coding Agent Pipeline

1. Classify issue → bug / feature / question — Nano · $0.01/req
2. Extract structured requirements from issue body — Nano · $0.02/req
3. Write implementation plan (10 steps) — Super · $0.15/req
4. Generate code changes (100–500 lines) — Super · $0.40/req
5. Review large PR (500+ lines, test coverage) — Ultra · $1.50/req
Result: Average effective cost ~$0.42/request. A naive all-Ultra pipeline would cost ~$3.08/request. 7× savings with identical final quality.
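The arithmetic behind that result can be checked from the per-step prices listed above, reading "$0.42" as the mean per-step cost; the $3.08 all-Ultra figure is taken from the text as given rather than derived.

```python
# Per-step prices from the pipeline above: Nano, Nano, Super, Super, Ultra
step_costs = [0.01, 0.02, 0.15, 0.40, 1.50]
avg_cost = sum(step_costs) / len(step_costs)
all_ultra = 3.08  # quoted cost of a naive all-Ultra pipeline

print(f"${avg_cost:.2f}/step average")        # $0.42/step average
print(f"{all_ultra / avg_cost:.0f}x savings")  # 7x savings
```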
💡 Implementation tip: Add a lightweight complexity classifier (Nano-sized!) that scores each incoming request 1–10 and routes to Nano (<4), Super (4–7), or Ultra (>7). NemoClaw's pluggable model backend makes this a config-level change, not a rewrite.
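The tip above maps directly to a few lines of routing logic. This is a sketch only — the classifier call itself is stubbed out, and the thresholds are the ones suggested in the tip (Nano below 4, Super 4–7, Ultra above 7):

```python
def route_by_complexity(score: int) -> str:
    """Route a 1-10 complexity score to the cheapest adequate tier."""
    if not 1 <= score <= 10:
        raise ValueError("complexity score must be 1-10")
    if score < 4:
        return "nano"
    if score <= 7:
        return "super"
    return "ultra"

# In production the score would come from a Nano-sized classifier call;
# here we just exercise the thresholds:
print([route_by_complexity(s) for s in (1, 4, 7, 8)])
# ['nano', 'super', 'super', 'ultra']
```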

Still not sure which tier?

Use our interactive LLM Selector. Answer 5 questions, get 7 ranked model recommendations — NemoClaw Nano, Super, Ultra, plus GPT-4o, Claude, Gemini, and more.

Get a Personalised Recommendation →