Why does the model size appear to be 1B?
Just curious about the model size shown on the model card. How could a 30B model be condensed down to 1B? Is there a mistake?
This is a display bug in Hugging Face Spaces related to quantized models.
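For what it's worth, one common way this happens is that size widgets count raw stored tensor elements, and 4-bit checkpoints pack several weights into each stored integer, so the count comes out far too low. A rough sketch of the effect (the numbers here are illustrative only, not taken from this repo, and the exact displayed figure also depends on which layers stay unquantized):

from math import floor

true_params = 30e9            # nominal parameter count (illustrative)
values_per_int32 = 8          # GPTQ/AutoRound-style W4 packs 8 x 4-bit weights per int32
packed_elements = floor(true_params / values_per_int32)

# A naive element count reports ~3.75B instead of 30B
print(f"{packed_elements / 1e9:.2f}B elements stored")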
Thank you. Could you also explain why this step is listed after running the model with vLLM:
Generate the model
Please make sure you have installed the auto_round package from the correct branch:
pip install git+https://github.com/intel/auto-round.git@enable_glm4_moe_lite_quantization
auto_round \
    --model=zai-org/GLM-4.7-Flash \
    --scheme "W4A16" \
    --ignore_layers="shared_experts,layers.0.mlp" \
    --format=auto_round \
    --enable_torch_compile \
    --output_dir=./tmp_autoround
We have already run the model with vLLM, so why do we need this step? Sorry for the trouble; I'm not familiar with auto-round. Thanks for your guidance.
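For reference, this is roughly how we are serving the model with vLLM today (a minimal sketch; the model path and sampling settings are placeholders from our own setup, not from the model card):

from vllm import LLM, SamplingParams

# Hypothetical local path to the quantized checkpoint; adjust to your setup.
llm = LLM(model="./tmp_autoround")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)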