---
language:
- en # English
- zh # Chinese
- es # Spanish
- pt # Portuguese
- de # German
- ja # Japanese
- ko # Korean
- fr # French
- ru # Russian
- id # Indonesian
- sv # Swedish
- it # Italian
- he # Hebrew
- nl # Dutch
- pl # Polish
- no # Norwegian
- tr # Turkish
- th # Thai
- ar # Arabic
- hu # Hungarian
- ca # Catalan
- cs # Czech
- da # Danish
- fa # Persian
- af # Afrikaans
- hi # Hindi
- fi # Finnish
- et # Estonian
- aa # Afar
- el # Greek
- ro # Romanian
- vi # Vietnamese
- bg # Bulgarian
- is # Icelandic
- sl # Slovenian
- sk # Slovak
- lt # Lithuanian
- sw # Swahili
- uk # Ukrainian
- kl # Kalaallisut
- lv # Latvian
- hr # Croatian
- ne # Nepali
- sr # Serbian
- tl # Filipino (ISO 639-1; common engineering alias: fil)
- yi # Yiddish
- ms # Malay
- ur # Urdu
- mn # Mongolian
- hy # Armenian
- jv # Javanese
license: mit
pipeline_tag: automatic-speech-recognition
tags:
- ASR
- Transcription
- Diarization
- Speech-to-Text
library_name: transformers
---
## VibeVoice-ASR
**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords** and over **50 languages**.
➡️ **Code:** [microsoft/VibeVoice](https://github.com/microsoft/VibeVoice)
➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)
➡️ **Report:** [VibeVoice-ASR Technical Report](https://arxiv.org/pdf/2601.18184)
➡️ **Finetuning:** [Finetuning](https://github.com/microsoft/VibeVoice/blob/main/finetuning-asr/README.md)
➡️ **vLLM:** [vLLM-VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-vllm-asr.md)
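
To make the Who / When / What structure concrete, here is a minimal sketch of how a structured transcription could be represented and printed on the consumer side. The `Segment` class, its field names, and the `render` helper are hypothetical and chosen only for illustration. They are not the model's actual output schema or API, so consult the linked code repository for the real format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript entry. Field names are hypothetical, not the model's schema."""
    speaker: str   # Who: speaker label
    start: float   # When: start time in seconds
    end: float     # When: end time in seconds
    text: str      # What: transcribed content


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS.ss."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:05.2f}"


def render(segments: list[Segment]) -> str:
    """Return one line per segment, ordered by start time."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        span = f"{format_timestamp(seg.start)} - {format_timestamp(seg.end)}"
        lines.append(f"[{span}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data, not real model output.
    demo = [
        Segment("Speaker 2", 75.5, 80.0, "Thanks for joining."),
        Segment("Speaker 1", 0.0, 3.25, "Welcome, everyone."),
    ]
    print(render(demo))
```

Keeping speaker, time span, and text together in one record makes it straightforward to sort, filter by speaker, or export to subtitle formats downstream.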