It seems you haven't specified your code repository, and some questions about the model

#1
by Genalp520 - opened

This repository only provides the model files. What is the address of the code repository?

For example: scripts/onnx_inference_pure.py

Thanks

Hello, I see you've already updated your repository.
Please reply to this discussion once you've finished preparing. Thank you. ā˜ŗļø

I have another question:
In the official CosyVoice3 examples, prompt_text supports specifying the style and emotion of the generated speech before the <|endofprompt|> token.
Does the model in this repository support this?

import torchaudio  # needed for torchaudio.save below
# AutoModel is provided by the CosyVoice code repository; the exact import
# path depends on the version you install.

def cosyvoice3_example():
    """CosyVoice3 usage; see https://funaudiollm.github.io/cosyvoice3/ for more details."""
    cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
    # zero-shot usage: the style/speed instruction goes before <|endofprompt|>,
    # the transcript of the prompt audio goes after it
    for i, j in enumerate(cosyvoice.inference_zero_shot(
            'å…«ē™¾ę ‡å…µå„”åŒ—å”ļ¼ŒåŒ—å”ē‚®å…µå¹¶ęŽ’č·‘ļ¼Œē‚®å…µę€•ęŠŠę ‡å…µē¢°ļ¼Œę ‡å…µę€•ē¢°ē‚®å…µē‚®ć€‚',
            'You are a helpful assistant. čÆ·ē”Øå°½åÆčƒ½åæ«åœ°čÆ­é€ŸčÆ“äø€å„čÆć€‚<|endofprompt|>åøŒęœ›ä½ ä»„åŽčƒ½å¤Ÿåšēš„ęÆ”ęˆ‘čæ˜å„½å‘¦ć€‚',
            './asset/zero_shot_prompt.wav', stream=False)):
        torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

@Genalp520
Hello! I've finished the preparation.

As for prompt_text, this repository does not support that feature. In my tests, adding style or emotion tags before <|endofprompt|> interfered with inference and produced improper audio output.

@ayousanz
Thank you for the quick fix; the results are fantastic! However, in testing I found some behavior that doesn't quite match your description:

  1. When using a Chinese prompt_wav, you must include <|endofprompt|> and a prompt text; otherwise the generated audio is abnormal.
  2. A style tag before <|endofprompt|> does have some effect, but it is less pronounced than with the original model.

Here are some examples:

  1. Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
    Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
    Prompt text: You are a helpful assistant. Please say a sentence as quick as possible.<|endofprompt|>å…«ē™¾ę ‡å…µå„”åŒ—å”ļ¼ŒåŒ—å”ē‚®å…µå¹¶ęŽ’č·‘ļ¼Œē‚®å…µę€•ęŠŠę ‡å…µē¢°ļ¼Œę ‡å…µę€•ē¢°ē‚®å…µē‚®ć€‚
  2. Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
    Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
    Prompt text: You are a helpful assistant. Please say a sentence as slow as possible.<|endofprompt|>å…«ē™¾ę ‡å…µå„”åŒ—å”ļ¼ŒåŒ—å”ē‚®å…µå¹¶ęŽ’č·‘ļ¼Œē‚®å…µę€•ęŠŠę ‡å…µē¢°ļ¼Œę ‡å…µę€•ē¢°ē‚®å…µē‚®ć€‚
  3. Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
    Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
    Prompt text: You are a helpful assistant. ęˆ‘ęƒ³ä½“éŖŒäø€äø‹å°ēŒŖä½©å„‡é£Žę ¼ļ¼ŒåÆä»„å—ļ¼Ÿ<|endofprompt|>å…«ē™¾ę ‡å…µå„”åŒ—å”ļ¼ŒåŒ—å”ē‚®å…µå¹¶ęŽ’č·‘ļ¼Œē‚®å…µę€•ęŠŠę ‡å…µē¢°ļ¼Œę ‡å…µę€•ē¢°ē‚®å…µē‚®ć€‚
  4. Error: the prompt text appeared in the generated audio.
    Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
    Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
    Prompt text: å…«ē™¾ę ‡å…µå„”åŒ—å”ļ¼ŒåŒ—å”ē‚®å…µå¹¶ęŽ’č·‘ļ¼Œē‚®å…µę€•ęŠŠę ‡å…µē¢°ļ¼Œę ‡å…µę€•ē¢°ē‚®å…µē‚®ć€‚
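
For reference, the prompt_text layout these examples share can be captured in a small helper. build_prompt_text is a hypothetical convenience function (not part of this repository); the placement rule it encodes is taken from the test results above.

```python
# Hypothetical helper mirroring the prompt_text layout in the examples:
# an optional style/speed instruction goes before <|endofprompt|>, and the
# transcript of the prompt audio goes after it.
def build_prompt_text(transcript, instruction=None):
    if instruction:
        return f'You are a helpful assistant. {instruction}<|endofprompt|>{transcript}'
    # Per the test results above, Chinese prompts need the token even
    # without an instruction (see example 4's failure).
    return f'<|endofprompt|>{transcript}'

transcript = 'å…«ē™¾ę ‡å…µå„”åŒ—å”ļ¼ŒåŒ—å”ē‚®å…µå¹¶ęŽ’č·‘ļ¼Œē‚®å…µę€•ęŠŠę ‡å…µē¢°ļ¼Œę ‡å…µę€•ē¢°ē‚®å…µē‚®ć€‚'
fast = build_prompt_text(transcript, 'Please say a sentence as quick as possible.')
plain = build_prompt_text(transcript)
```

The resulting strings can then be passed as the prompt_text argument of inference_zero_shot, as in the earlier snippet.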

@Genalp520
Thank you so much for your detailed testing and feedback!

I'm surprised to learn that the style tags do have some effect. In my initial tests, the audio output was unstable when using them, so I assumed they were not properly supported. It is very helpful to know that they work to some extent and that <|endofprompt|> is actually required for Chinese prompts.

I really appreciate you pointing this out and providing these examples. This clarifies the model's behavior significantly.

@ayousanz

Thanks for confirming; the model's TTS quality is amazing. I have another question:
Most of the TTS time is spent in the LLM and Flow stages. Do you have plans to develop a lighter version, such as INT8 versions of the LLM and Flow models?

@Genalp520
Current research on similar LLM architectures indicates that INT8 quantization leads to a noticeable loss in accuracy, so for now the plan is to go no lower than FP16.
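
To illustrate where that accuracy loss comes from, here is a generic, self-contained sketch of symmetric per-tensor INT8 quantization error in pure Python. This is only an illustration of the round-trip error, not the actual CosyVoice export pipeline:

```python
# Generic illustration of symmetric per-tensor INT8 quantization error;
# not the actual CosyVoice export pipeline.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0  # map max |w| to +/-127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.0123, -0.987, 0.456, 0.001, -0.333]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# each weight is off by at most scale/2 after the round trip, so weights
# much smaller than the tensor's maximum lose most of their relative
# precision -- the accuracy loss mentioned above
errors = [abs(a - b) for a, b in zip(weights, restored)]
```

FP16, by contrast, keeps the relative error roughly constant per weight, which is why it is a safer floor for these models.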

However, if you are interested, I have actually experimented with the export in the branch below. Please feel free to try it out: https://github.com/ayutaz/CosyVoice/tree/feature/onnx-unity-export
