Qwen/Qwen3-Next-80B-A3B-Thinking reports MMLU-Pro 82.7, but you get 0.7271

#2, opened by hlxxxxxx

What explains the difference?

Quantization?

Not possible. I mean the MMLU-Pro benchmark score of the baseline (BF16) model.

We use lm-eval-harness to test the model, which is widely adopted in the community. As we only care about the gap between the BF16 and INT4 models, we have no extra bandwidth to root-cause the issue. Instead, you could submit an issue to lm-eval-harness.
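
For anyone trying to reproduce the gap, here is a minimal sketch of how such a BF16-vs-INT4 comparison can be run with lm-eval-harness's Python API. The `mmlu_pro` task name and the `simple_evaluate` function come from lm-eval-harness itself; the INT4 checkpoint path is a hypothetical placeholder, and this is not necessarily the exact setup used for the numbers above:

```python
# Minimal sketch: compare BF16 vs INT4 MMLU-Pro scores with lm-eval-harness.
# The INT4 checkpoint path below is a placeholder, not a real model ID.
import lm_eval

checkpoints = {
    "bf16": "pretrained=Qwen/Qwen3-Next-80B-A3B-Thinking,dtype=bfloat16",
    "int4": "pretrained=<your-int4-checkpoint>",  # hypothetical placeholder
}

for label, model_args in checkpoints.items():
    results = lm_eval.simple_evaluate(
        model="hf",              # Hugging Face transformers backend
        model_args=model_args,
        tasks=["mmlu_pro"],
    )
    # Print the aggregated MMLU-Pro group metrics for this checkpoint.
    print(label, results["results"]["mmlu_pro"])
```

What matters for the quantization study is that both runs use the same harness, task version, and prompting, so any harness-level discrepancy with the officially reported 82.7 cancels out when comparing the two scores.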
