Qwen/Qwen3-Next-80B-A3B-Thinking has MMLU_PRO 82.7 but you guys get 0.7271
#2 by hlxxxxxx
What is the difference?
quantization?
Not possible; I mean the MMLU-Pro benchmark score of the baseline (unquantized) model.
We use lm-eval-harness to test the model, which is widely adopted in the community. As we only care about the gap between the BF16 and INT4 models, we have no extra bandwidth to root-cause the issue. Instead, you could submit an issue to lm-eval-harness.
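For reference, here is a minimal sketch of how such a score could be reproduced with the lm-eval-harness Python API. The task name `mmlu_pro` and the model arguments are assumptions; check the task list of your installed harness version:

```python
# Minimal sketch: scoring MMLU-Pro with lm-eval-harness (pip install lm-eval).
# Assumption: the installed harness version exposes a task named "mmlu_pro".
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen3-Next-80B-A3B-Thinking,dtype=bfloat16",
    tasks=["mmlu_pro"],
    batch_size="auto",
)

# The harness reports accuracy as a fraction, not a percentage,
# so 0.7271 corresponds to 72.71%.
print(results["results"]["mmlu_pro"])
```

Note that the harness's fractional score (0.7271, i.e. 72.71%) is on a different scale than the reported 82.7; gaps of this size between frameworks are commonly due to differences in prompting and answer extraction.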