What is the implementation of the "AIME25 (with tools)" benchmark?

#42
by YF-T - opened

I want to know how the model uses tools while thinking.

NVIDIA org

@YF-T, we provide a reference implementation for stateful Python code execution in Nemo-Skills. If you want to run it, please follow the docs to set up Nemo-Skills and then run the following command:

from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=120000 "
        "++inference.temperature=1.0 "
        "++inference.top_p=1.0 "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster='local',  # or slurm
    model='nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16',
    server_type='vllm',
    server_gpus=1,
    benchmarks="aime25",
    server_args="--enable-auto-tool-choice --enable-prefix-caching --tool-call-parser qwen3_coder --mamba_ssm_cache_dtype float32",
    output_dir='/workspace/test-nano-v3-single',
    with_sandbox=True,
)

Important arguments here are with_sandbox=True (launches our code execution sandbox) and ++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool], which configures the Python tool and connects it through MCP. If you have an existing Python code execution MCP server (one that supports stateful execution, like the built-in tool for gpt-oss models), you can most likely use it directly. You can also adjust other general arguments, e.g. use more GPUs or run multiple repeats with benchmarks="aime25:16".
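For example, a scaled-up variant of the same call could look like the sketch below (the GPU count, repeat count, and output path are hypothetical values chosen for illustration, not tested settings):

from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=120000 "
        "++inference.temperature=1.0 "
        "++inference.top_p=1.0 "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster='local',  # or slurm
    model='nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16',
    server_type='vllm',
    server_gpus=8,  # hypothetical: more GPUs for the vLLM server
    benchmarks="aime25:16",  # run 16 repeats of AIME25
    server_args="--enable-auto-tool-choice --enable-prefix-caching --tool-call-parser qwen3_coder --mamba_ssm_cache_dtype float32",
    output_dir='/workspace/test-nano-v3-16x',  # hypothetical output path
    with_sandbox=True,  # launches the code execution sandbox
)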

We will also update the official evaluation reproduction instructions in Nemo-Evaluator shortly with an example of how to run AIME25 with tools.
