What is the implementation of the benchmark "AIME25 (with tools)"?
I want to know how the model uses tools while thinking.
@YF-T, we provide a reference implementation for stateful Python code execution in Nemo-Skills. If you want to run it, please follow the docs to set up Nemo-Skills and then run the following command:
```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=120000 "
        "++inference.temperature=1.0 "
        "++inference.top_p=1.0 "
        # connect the stateful Python tool through MCP
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster='local',  # or slurm
    model='nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16',
    server_type='vllm',
    server_gpus=1,
    benchmarks="aime25",
    server_args="--enable-auto-tool-choice --enable-prefix-caching --tool-call-parser qwen3_coder --mamba_ssm_cache_dtype float32",
    output_dir='/workspace/test-nano-v3-single',
    with_sandbox=True,  # launches the code execution sandbox
)
```
Important arguments here are `with_sandbox=True`, which launches our code execution sandbox, and `++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool]`, which configures the Python tool and connects it through MCP. If you have an existing Python code execution MCP server that supports stateful execution (like the built-in tool for gpt-oss models), you can most likely use it directly; a minimal sketch of such a server is shown below. You can also adjust any other general arguments, e.g. use more GPUs or run multiple repeats with `benchmarks="aime25:16"`.
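For reference, here is a minimal sketch of what such a stateful Python execution MCP server can look like, built with the official MCP Python SDK's `FastMCP` helper. This is not the Nemo-Skills `PythonTool` implementation; the server name, tool name, and error formatting are illustrative assumptions. The key idea is a single namespace dict that persists across tool calls, so variables defined in one call remain available in the next:

```python
# Hypothetical sketch of a stateful Python execution MCP server.
# NOT the Nemo-Skills PythonTool; it only illustrates the idea of
# keeping one namespace alive across tool calls ("stateful" execution).
import contextlib
import io

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("python-exec")

# A single dict serves as the persistent global namespace: variables
# defined in one call remain visible in later calls.
_namespace: dict = {}

@mcp.tool()
def run_python(code: str) -> str:
    """Execute Python code in a persistent namespace and return stdout."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, _namespace)  # state persists in _namespace
    except Exception as exc:  # report errors back to the model
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```

Note that because the namespace lives in the server process, each benchmark sample would need its own server instance (or session) so state does not leak between problems.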
We will also update the official evaluation reproduction instructions in Nemo-Evaluator shortly with an example of how to run AIME25 with tools.