Why does the model size appear to be 1B?
Just curious about the model size shown on the model card. How could a 30B model be condensed down to 1B? Is there a mistake?
This is a display bug in Hugging Face Spaces related to quantized models.
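For what it's worth, one common way this happens is that size widgets count raw stored tensor elements, and 4-bit checkpoints pack several weights into each stored integer, so the count comes out far too low. A rough sketch of the effect (the numbers here are illustrative only, not taken from this repo, and the exact displayed figure also depends on which layers stay unquantized):

from math import floor

true_params = 30e9            # nominal parameter count (illustrative)
values_per_int32 = 8          # GPTQ/AutoRound-style W4 packs 8 x 4-bit weights per int32
packed_elements = floor(true_params / values_per_int32)

# A naive element count reports ~3.75B instead of 30B
print(f"{packed_elements / 1e9:.2f}B elements stored")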
Thank you. Could you also explain why this step is listed after running the model with vLLM:
Generate the model
Please make sure you have installed the auto_round package from the correct branch:
pip install git+https://github.com/intel/auto-round.git@enable_glm4_moe_lite_quantization
auto_round \
    --model=zai-org/GLM-4.7-Flash \
    --scheme "W4A16" \
    --ignore_layers="shared_experts,layers.0.mlp" \
    --format=auto_round \
    --enable_torch_compile \
    --output_dir=./tmp_autoround
We have already run the model with vLLM, so why do we need this step? Sorry for the trouble; I'm not familiar with auto-round. Thanks for your guidance.
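For reference, this is roughly how we are serving the model with vLLM today (a minimal sketch; the model path and sampling settings are placeholders from our own setup, not from the model card):

from vllm import LLM, SamplingParams

# Hypothetical local path to the quantized checkpoint; adjust to your setup.
llm = LLM(model="./tmp_autoround")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)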