How NVIDIA AI-Q Reached #1 on DeepResearch Bench I and II

Dataemia


Contributors: Raja Biswas, Divyansh Jain, Ivan Sorokin, Alessio Devoto, Chantal D Gama Rose, Ajay Thorve, David Austin, Jean-Francois Puget

NVIDIA AI-Q deep research agent recently achieved first place on both DeepResearch Bench (55.95) and DeepResearch Bench II (54.50), the two primary benchmarks for evaluating deep research agents. This marks a meaningful step for open, portable deep research: one configurable stack leading on both benchmarks shows that developer-accessible models and tooling can power state-of-the-art agentic research.

What sets AI-Q apart? AI-Q is an open blueprint for building AI agents that reason over enterprise and web data to deliver well-cited responses. AI-Q provides a fully open and modular architecture that enterprises can own, inspect, customize, and configure per use case. The deep researcher is one workflow within the larger AI-Q blueprint that includes intent routing, query clarification, and shallow research. The deep researcher adopts a multi-agent architecture consisting of planner, researcher, and orchestrator built on NVIDIA NeMo Agent Toolkit and fine-tuned NVIDIA Nemotron 3 Super models, with an optional ensemble and report refiner for maximum report quality. One stack: flexible by design, tunable to your needs.




Why Winning Both Benchmarks Matters

DeepResearch Bench I and II evaluate research agents in complementary ways.

  • DeepResearch Bench scores report quality against a reference report along comprehensiveness, depth of insight, instruction-following, and readability dimensions. Doing well here rewards polished, well-structured narratives and strong synthesis.

  • DeepResearch Bench II uses 70+ fine-grained, binary rubrics per task to check whether an agent retrieves the right information (Information Recall), synthesizes it into higher-level analysis (Analysis), and presents findings clearly (Presentation). Doing well here rewards granular factual correctness and analytical rigor.

Leading on both benchmarks means the AI-Q deep researcher produces polished, well-cited reports and gets the underlying retrieval and reasoning right.
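As a toy illustration of the Bench II scoring style, a per-task score can be computed as the fraction of binary rubric checks a report passes, grouped by dimension. The rubric names and outcomes below are invented for the example, not the benchmark's actual rubrics:

```python
# Toy sketch of binary-rubric scoring (rubric names and values are invented).
# Each task has many fine-grained yes/no checks; the task score is the
# fraction of checks the report passes, optionally grouped by dimension.

def score_report(rubric_results: dict[str, list[bool]]) -> dict[str, float]:
    """rubric_results maps a dimension name to its binary check outcomes."""
    per_dimension = {
        dim: sum(checks) / len(checks) for dim, checks in rubric_results.items()
    }
    all_checks = [c for checks in rubric_results.values() for c in checks]
    per_dimension["overall"] = sum(all_checks) / len(all_checks)
    return per_dimension

results = score_report({
    "information_recall": [True, True, False, True],
    "analysis": [True, False],
    "presentation": [True, True],
})
print(results["overall"])  # 6 of 8 checks passed -> 0.75
```

Granular, binary checks like these reward factual correctness directly: a report cannot compensate for a missing fact with polished prose alone.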




Architecture at a Glance

The AI-Q deep researcher architecture behind both results centers on three components: an orchestrator that coordinates the research loop, a planner that maps the information landscape and designs an evidence-grounded research plan, and a researcher that dispatches parallel specialists to gather and synthesize evidence across multiple analytical lenses. Each agent can be powered by a different LLM. An optional ensemble runs multiple agents in parallel and merges their outputs for maximum report quality and coverage of information. Figure 1 shows the full architecture.


Figure 1. AI-Q deep researcher: orchestrator, planner, and researcher pipeline (right) with optional ensemble (left).





Core Stack: NVIDIA and Deep Research

The same underlying stack powers both leaderboard submissions. It is open, reproducible, and built on:

  • NVIDIA NeMo Agent Toolkit for workflow wiring, function registration, and evaluation. The NeMo Agent Toolkit open source library provides config-driven composition of LLMs and tools and the ability to plug in different agent graphs.
  • LangChain DeepAgents for the multi-phase planner–researcher–orchestrator flow with subagent middleware where applicable.
  • NVIDIA Nemotron 3 LLMs powering the agent pipeline. Nemotron models can be fine-tuned to excel at research synthesis and long-horizon tool calling, and can be served via NVIDIA Build or NVIDIA NIM for model inference.

The core is always multi-step research (plan → gather → synthesize), web search (Tavily) and academic paper search (Serper), and citation-backed reports. Optionally, an ensemble layer and report refiner can be added on top for maximum report quality.
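The plan → gather → synthesize loop at the core can be sketched as follows. The `call_llm` and `web_search` helpers are placeholder stubs for illustration, not the actual AI-Q or Tavily/Serper APIs:

```python
# Minimal sketch of a plan -> gather -> synthesize research loop.
# `call_llm` and `web_search` are placeholder stubs, not AI-Q APIs.

def call_llm(prompt: str) -> str:
    return f"LLM({prompt[:30]}...)"  # stand-in for a model call

def web_search(query: str) -> list[str]:
    return [f"result for {query}"]  # stand-in for Tavily/Serper calls

def deep_research(question: str, max_rounds: int = 3) -> str:
    # Plan: decide what to look for before searching.
    plan = call_llm(f"Plan targeted search queries for: {question}")
    # Gather: iterate searches derived from the plan.
    evidence: list[str] = []
    for round_idx in range(max_rounds):
        queries = [f"{question} (round {round_idx})"]  # derived from the plan
        for q in queries:
            evidence.extend(web_search(q))
    # Synthesize: write a citation-backed report from gathered evidence.
    return call_llm(f"Write a cited report from {len(evidence)} sources")

report = deep_research("impact of MoE models on inference cost")
```

In the real system, each of these phases is its own agent with its own LLM, tools, and context window rather than a plain function call.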



Key Ingredients in AI-Q

Four ingredients were central to the result:

  1. Multi-agent architecture with evidence-grounded planning and specialist researchers, built on NVIDIA NeMo Agent Toolkit and LangChain DeepAgents.
  2. Fine-tuned NVIDIA Nemotron 3 Super: Roughly 67k SFT trajectories generated from a few seed datasets of research questions, filtered with a principle-based judge. This model powers the researcher and its sub-agents.
  3. Custom middleware for long-horizon reliability. NeMo Agent Toolkit and LangChain middleware are extended with components that improve reliability and robustness.
  4. Ensemble researcher and report refiner (optional): parallel pipeline outputs merged by an LLM, with a post-hoc refiner for maximum report quality.

Each is detailed in the sections that follow.



Fine-Tuned NVIDIA Nemotron 3 Super: Data and Training

A major factor in the results is a custom fine-tuned NVIDIA Nemotron-3-Super-120B-A12B model. We chose it for this workflow because it aligns well with multi-step agentic reasoning, tool use, and citation-grounded reporting; fine-tuning on real search-and-synthesis trajectories makes it effective for planner, researcher, and orchestrator roles at scale.

Trajectory generation

  • We collected research questions from multiple open-source datasets: about 17k questions from OpenScholar [https://allenai.org/blog/openscilm], 21k from ResearchQA [https://researchqa.cylumn.com/], and 2,457 questions from Fathom-DeepResearch-SFT [https://huggingface.co/datasets/FractalAIResearch/DeepResearch-SFT].
  • We then generated ~80k trajectories for the full workflow using the open-source GPT-OSS-120B model. Each trajectory covers planner, researcher, and orchestrator behavior.
  • It’s worth noting that these trajectories include real web search results from the Tavily and Serper APIs so the model learns to navigate and perform multi-step searches and synthesis on real data.

Principle-based filtering

  • Many trajectories did not complete or were stopped for exceeding the tool-call limit; for those that produced the expected results, we additionally applied filtering with a judge model.
  • The completed trajectories were scored with the nvidia/Qwen3-Nemotron-32B-GenRM-Principle judge model, which predicts quality along dimensions such as comprehensiveness, readability, accuracy, and relevance.
  • After filtering, ~67k trajectories were retained for training.
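The filtering step above can be sketched as scoring each completed trajectory along the judge's quality dimensions and keeping those above a threshold. The score schema and the 0.7 threshold here are assumptions for illustration, not the actual judge output format:

```python
# Sketch of principle-based trajectory filtering. The score fields and the
# threshold are illustrative assumptions, not the exact judge output schema.

def keep_trajectory(scores: dict[str, float], threshold: float = 0.7) -> bool:
    """Retain a trajectory only if its mean judge score clears the bar."""
    dims = ["comprehensiveness", "readability", "accuracy", "relevance"]
    mean_score = sum(scores[d] for d in dims) / len(dims)
    return mean_score >= threshold

trajectories = [
    {"comprehensiveness": 0.9, "readability": 0.8, "accuracy": 0.9, "relevance": 0.8},
    {"comprehensiveness": 0.4, "readability": 0.6, "accuracy": 0.5, "relevance": 0.4},
]
kept = [t for t in trajectories if keep_trajectory(t)]  # only the first survives
```

Filtering on judged quality rather than completion alone keeps low-value trajectories out of the SFT mix.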

SFT training

  • Model: NVIDIA Nemotron-3-Super-120B-A12B
  • Setup: One epoch, 5,615 steps, approximately 25 hours on 16×8 NVIDIA H100 GPUs.



AI-Q Deep Researcher

AI-Q deep researcher adopts a multi-agent architecture (Orchestrator, Planner, and Researcher) with iterative plan → gather → synthesize loops, citation management, and custom middleware for long-horizon reliability. An optional ensemble and report refiner layer can be enabled for maximum report quality. The multi-agent design also serves as a long-context strategy: each subagent works within its own context window and returns only its synthesized output, so the orchestrator never sees the raw tool responses. This keeps the orchestrator’s context focused and prevents long, noisy search results from degrading its reasoning.

Orchestrator: Coordinates the full research loop. Calls the Planner to produce an evidence-grounded research plan, then calls the Researcher multiple times with focused research tasks derived from that plan. After research completes, the orchestrator reviews the plan's quality constraints, dispatches targeted gap-filling research, and writes the long-form report. An optional refiner step makes edits to the report, leveraging raw researcher briefs in a fresh context window as a second evidence-recovery point.

Planner: Runs in two phases. A Scout subagent first maps the information landscape through broad searches. An Architect subagent then designs the research plan including report outline, targeted search queries, and quality constraints, while running its own searches to validate structural choices.

Evidence-grounded planning is key to producing reliable, high-quality reports. Our planner knows the information landscape before it commits to a structure. It decides where to go deep and broad based on what it actually found, not assumptions.

Researcher: Dispatches multiple specialist subagents in parallel, each with a distinct lens:

  • Evidence Gatherer: facts, statistics, specific numbers from authoritative sources
  • Mechanism Explorer: causal explanations, theoretical frameworks
  • Comparator: benchmarks, head-to-head data, trade-off analyses
  • Critic: counterarguments, limitations, failure cases
  • Horizon Scanner: recent developments, emerging trends

They share the same search tools, but with different analytical framing. Diverse specialists researching the same topic often surface evidence that a single generalist would miss.

The researcher synthesizes specialist findings into a unified, cited brief. An LLM then cross-checks this synthesis against the raw specialist outputs in a fresh context window, recovering any relevant information.
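The specialist fan-out can be sketched with parallel workers that share one toolset but carry different framing prompts. The prompts and helper functions below are illustrative placeholders, not the AI-Q implementation:

```python
# Sketch of dispatching specialist subagents in parallel. Each shares the
# same tools but frames its task differently. Prompts are invented.
from concurrent.futures import ThreadPoolExecutor

SPECIALISTS = {
    "evidence_gatherer": "Find facts and statistics on",
    "mechanism_explorer": "Explain causal mechanisms behind",
    "comparator": "Compare alternatives for",
    "critic": "Find counterarguments and limitations of",
    "horizon_scanner": "Find recent developments in",
}

def run_specialist(name: str, framing: str, topic: str) -> str:
    # Stand-in for an LLM-plus-search subagent call.
    return f"{name}: {framing} {topic}"

def research(topic: str) -> list[str]:
    with ThreadPoolExecutor(max_workers=len(SPECIALISTS)) as pool:
        futures = [
            pool.submit(run_specialist, name, framing, topic)
            for name, framing in SPECIALISTS.items()
        ]
        return [f.result() for f in futures]

briefs = research("mixture-of-experts inference")  # one brief per specialist
```

Because each specialist runs in its own context and returns only a brief, the synthesizing step never has to wade through raw search output.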

Config-Driven Flexibility
Every component is swappable. LLMs, tools, and agent graphs can be configured through YAML. Planner, researcher, and orchestrator can each be powered by a different LLM. For the benchmark submission, a fine-tuned Nemotron 3 drives the researcher, which processes 4x more tokens than the planner and orchestrator combined.

Custom Middleware for Long-Horizon Reliability

Each agent and subagent interleaves LLM and tool calls across many steps (often 32+). At that scale, the system can fail in ways that short interactions never expose. Our agent harness provides custom middleware to detect and mitigate these failure modes:

  • Tool name sanitization: LLMs may hallucinate tool names mid-run. This middleware applies pattern-based cleaning, alias resolution, and fuzzy matching to recover the intended tool.
  • Reasoning-aware retry: LLMs with reasoning sometimes produce thinking tokens without a tool call or final response, which would silently terminate the agent loop. Middleware detects this, preserves the reasoning in context, and retries.
  • Budget enforcement: Each agent and subagent has its own tool-call cap. When the limit is reached, middleware nudges the LLM to synthesize first, then removes tools entirely to force a text-only response.
  • Report validation: Before returning output, middleware checks minimum length and section structure. Incomplete reports get retried with a continuation prompt.

Each middleware addresses failure patterns observed in agent traces. Together they keep long-horizon runs reliable.

Ensemble
When enabled, N independent deep-research pipelines run in parallel. An LLM reads all N outputs, selects one as the structural base, and integrates unique content from the others. The ensemble produces broader evidence coverage than any single pipeline, directly improving comprehensiveness and information recall. A proofread pass removes process artifacts so the output reads as a single-authored work.
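The ensemble step can be sketched as N parallel runs followed by an LLM merge. The pipeline and merge functions below are placeholders standing in for full deep-research runs and the merging LLM:

```python
# Sketch of the ensemble: run N independent pipelines, pick a base report,
# and fold in unique content from the rest. All functions are placeholders.
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(question: str, seed: int) -> str:
    return f"report {seed} on {question}"  # stand-in for one full deep-research run

def merge_reports(reports: list[str]) -> str:
    # Stand-in for an LLM that selects a structural base and integrates
    # unique evidence from the other reports.
    base, *others = reports
    return base + f" (+{len(others)} merged)"

def ensemble_research(question: str, n: int = 3) -> str:
    with ThreadPoolExecutor(max_workers=n) as pool:
        reports = list(pool.map(lambda s: run_pipeline(question, s), range(n)))
    return merge_reports(reports)

final = ensemble_research("open deep research agents")
```

Independent runs explore different search paths, so the merged report covers evidence no single run would have found.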

Post-hoc Refiner
An optional final report refiner step can run over the report with structured instructions to quantify vague claims, deepen entity coverage, cut scaffolding, ground risks, build comparison tables, and strengthen causal reasoning. The rewriting prompt is derived via self-supervised meta-learning against reference reports generated from our pipeline with frontier LLMs only.




Takeaways

NVIDIA AI-Q reached first place on both Deep Research Bench and Deep Research Bench II with a single stack: a multi-agent deep researcher built on NVIDIA NeMo Agent Toolkit, fine-tuned NVIDIA Nemotron 3 models, and custom middleware, with an optional ensemble and refiner when maximum report quality is needed. The stack is open, reproducible, and configurable to your needs. State-of-the-art results without compromising on transparency or control.

Join us at NVIDIA GTC in San Jose the week of March 16, 2026 to learn more.

  • S81706 – Evaluation-Driven Development: Best Practices for Building Reliable Agents
  • DLIT81725 – Develop Production Agents with Eval-Driven Design (Dhruv Nandakumar)
  • S81570 – From Data to Decisions: Enabling AI Agents with Business Knowledge
  • S81569 – Self-Coding Agents: Architectures, Data Flywheels, and Autonomous Code Repair
  • S81789 – Open Source AI Shaping the Next Era of Intelligent Digital Workers


