It seems you haven't specified your code repository, and I have some questions about the model.
The repository only provides the model files. What is the address of the code repository?
For example: scripts/onnx_inference_pure.py
Thanks.
Hello, I see you've already updated your repository.
Please reply to this discussion once you've finished preparing, thank you. ☺️
I have another question:
In the official CosyVoice3 examples, prompt_text supports specifying the style and emotion of the generated speech before <|endofprompt|>.
Does the model in this repository support this?
```python
import torchaudio  # needed to save the generated audio


def cosyvoice3_example():
    """CosyVoice3 usage; check https://funaudiollm.github.io/cosyvoice3/ for more details."""
    cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
    # zero_shot usage: (tts_text, prompt_text, prompt_wav)
    for i, j in enumerate(cosyvoice.inference_zero_shot(
            '八百标兵奔北坡，北坡炮兵并排跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。',
            'You are a helpful assistant. 请用尽可能快的语速说一句话。<|endofprompt|>希望你以后能够做的比我还好呦。',
            './asset/zero_shot_prompt.wav', stream=False)):
        torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```
@Genalp520
Hello! I've finished the preparation.
As for the prompt_text, this repository does not support that feature. I tested it, but adding style or emotion tags before <|endofprompt|> interfered with the inference process, resulting in improper audio output.
@ayousanz
Thank you for your quick fix; the results are fantastic! However, after testing, I found some behavior that doesn't quite match your description:
- When using a Chinese prompt_wav, you must add <|endofprompt|> and a prompt text, otherwise the generated audio will be abnormal.
- A style tag placed before <|endofprompt|> will produce some effect, but the effect is not as obvious as with the original model.
Here are some examples:

- Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
  Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
  Prompt text: You are a helpful assistant. Please say a sentence as quick as possible.<|endofprompt|>八百标兵奔北坡，北坡炮兵并排跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。
- Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
  Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
  Prompt text: You are a helpful assistant. Please say a sentence as slow as possible.<|endofprompt|>八百标兵奔北坡，北坡炮兵并排跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。
- Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
  Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
  Prompt text: You are a helpful assistant. 我想体验一下小猪佩奇风格，可以吗？<|endofprompt|>八百标兵奔北坡，北坡炮兵并排跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。
- Error: the prompt appeared in the audio.
  Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
  Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
  Prompt text: 八百标兵奔北坡，北坡炮兵并排跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。
@Genalp520
Thank you so much for your detailed testing and feedback!
I'm surprised to learn that the style tags do have some effect. In my initial tests, the audio output was unstable when using them, so I assumed they were not properly supported. It is very helpful to know that they work to some extent and that <|endofprompt|> is actually required for Chinese prompts.
I really appreciate you pointing this out and providing these examples. This clarifies the model's behavior significantly.
@Genalp520
Current research on similar LLM architectures indicates that INT8 quantization leads to a loss in accuracy. Therefore, we generally plan to support up to FP16 for now.
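As a toy illustration of why INT8 typically loses more precision than FP16 (random weights at a typical scale, not a measurement on this model's actual tensors):

```python
import numpy as np

# Compare round-trip error of FP16 casting vs. symmetric per-tensor INT8
# quantization on synthetic weights. This is only an illustration of the
# precision gap, not a benchmark of Fun-CosyVoice3-0.5B itself.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10000).astype(np.float32)

# FP16: cast down and back
fp16_err = np.abs(w - w.astype(np.float16).astype(np.float32)).mean()

# INT8: map to [-127, 127] with one scale for the whole tensor, then dequantize
scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
int8_err = np.abs(w - q.astype(np.float32) * scale).mean()

print('mean abs error  fp16: %.2e   int8: %.2e' % (fp16_err, int8_err))
```

On this toy tensor the INT8 round-trip error is well above the FP16 one, which is why stopping at FP16 is the safer default.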
However, if you are interested, I have actually experimented with the export in the branch below. Please feel free to try it out: https://github.com/ayutaz/CosyVoice/tree/feature/onnx-unity-export