What is the implementation of the benchmark "AIME25 (with tools)"?
I want to know how the model uses tools while thinking.
@YF-T, we provide a reference implementation for stateful Python code execution in Nemo-Skills. If you want to run it, please follow the docs to set up Nemo-Skills and then run the following command:
```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=120000 "
        "++inference.temperature=1.0 "
        "++inference.top_p=1.0 "
        # connect the stateful Python tool through MCP
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster='local',  # or slurm
    model='nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16',
    server_type='vllm',
    server_gpus=1,
    benchmarks="aime25",
    server_args="--enable-auto-tool-choice --enable-prefix-caching --tool-call-parser qwen3_coder --mamba_ssm_cache_dtype float32",
    output_dir='/workspace/test-nano-v3-single',
    with_sandbox=True,  # launches the code execution sandbox
)
```
Important arguments here are `with_sandbox=True`, which launches our code execution sandbox, and `++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]`, which configures the Python tool and connects it through MCP. If you have an existing Python code execution MCP server that supports stateful execution (like the built-in tool for gpt-oss models), you can most likely use it directly; a minimal sketch of such a server is shown below. You can also adjust any other general arguments, e.g. use more GPUs or run multiple repeats with `benchmarks="aime25:16"`.
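For reference, here is a minimal sketch of what such a stateful Python execution MCP server can look like, built with the official MCP Python SDK's `FastMCP` helper. This is not the Nemo-Skills `PythonTool` implementation; the server name, tool name, and error formatting are illustrative assumptions. The key idea is a single namespace dict that persists across tool calls, so variables defined in one call remain available in the next:

```python
# Hypothetical sketch of a stateful Python execution MCP server.
# NOT the Nemo-Skills PythonTool; it only illustrates the idea of
# keeping one namespace alive across tool calls ("stateful" execution).
import contextlib
import io

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("python-exec")

# A single dict serves as the persistent global namespace: variables
# defined in one call remain visible in later calls.
_namespace: dict = {}

@mcp.tool()
def run_python(code: str) -> str:
    """Execute Python code in a persistent namespace and return stdout."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, _namespace)  # state persists in _namespace
    except Exception as exc:  # report errors back to the model
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```

Note that because the namespace lives in the server process, each benchmark sample would need its own server instance (or session) so state does not leak between problems.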
We will also update the official evaluation reproduction instructions in Nemo-Evaluator shortly with an example of how to run AIME25 with tools.