CosyVoice is a multilingual speech generation model open-sourced by Alibaba, designed for building TTS (Text-to-Speech) tools. It supports speech generation with built-in preset voices, voice cloning, and natural-language-controlled speech synthesis. One of the standout features of CosyVoice is its fine-grained control over the emotion and prosody of the generated speech, achieved through rich text or natural language input. This control mechanism significantly enhances the emotional expressiveness of the synthesized speech, making it more lifelike and emotionally rich. The system supports speech generation in five languages: Chinese, English, Japanese, Cantonese, and Korean, and its speech synthesis quality far exceeds that of traditional models.
Github Repo link: https://github.com/FunAudioLLM/CosyVoice
Official online demo link: https://www.modelscope.cn/studios/iic/CosyVoice-300M
There are no strict hardware requirements. An ordinary personal computer can run it, but inference will take longer, so that setup is only suitable for trial runs. If the machine has an NVIDIA GPU, NVIDIA CUDA can be used for acceleration. The machine deployed in this article uses GPUMart's RTX A4000 VPS, which features an NVIDIA RTX A4000 GPU with 16GB of VRAM.
First clone the official project and create an independent Python virtual environment.
# Because the Matcha-TTS project is referenced internally, remember to use the --recursive parameter
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# Create a Python 3.8+ environment and activate it
conda create -n cosyvoice python=3.8
conda activate cosyvoice
At this point, the virtual environment has been activated. Now download the third-party packages that the project depends on.
# On a GPUMart server with a US IP
pip install -r requirements.txt
# If the server is in China
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
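After the dependencies are installed, a quick optional check confirms that PyTorch can see the GPU mentioned in the hardware section (this assumes the requirements pulled in a CUDA-enabled PyTorch build):

import torch

# Sanity check: verify CUDA acceleration is available before running inference
print(torch.__version__)
print(torch.cuda.is_available())            # True if a CUDA-capable GPU is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))    # e.g. "NVIDIA RTX A4000" on the machine used here

If this prints False, inference will fall back to the CPU and be noticeably slower.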
Pynini is a library for compiling and manipulating weighted finite-state grammars, used for string matching and transformation tasks such as text normalization.
conda install -y -c conda-forge pynini==2.1.5
According to the documentation, the models must be downloaded in advance. Instead of using Alibaba's ModelScope SDK, Git is used for the download; the prerequisite is that the git lfs plugin is installed:
# git model download, please make sure git lfs is installed
mkdir -p pretrained_models
git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M
git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT
git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct
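If you would rather use the ModelScope SDK mentioned above, the official README also shows an equivalent download snippet along the following lines (it assumes the modelscope package has been installed with pip):

# Alternative: download the same models with the ModelScope SDK (pip install modelscope)
from modelscope import snapshot_download
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')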
Optionally, you can download, unzip the ttsfrd resource and install the ttsfrd package for better text normalization performance. Please note that this step is not required. If you don't install the ttsfrd package, we will use WeTextProcessing by default.
git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd-0.3.6-cp38-cp38-linux_x86_64.whl
The model files are very large and take a long time to download. Once the download completes, start the service using the following command:
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
One thing to note: if you need to access the WebUI from an external network, you need to change the following line in webui.py.
demo.launch(server_port=args.port)
Change to
demo.launch(server_port=args.port, server_name="0.0.0.0")
If you only access it locally, this can be ignored. This issue does not exist in the latest version, which supports external access by default. You can then open the WebUI application, built with the Gradio library, by visiting the LAN IP at port 50000.
Taking voice cloning as an example: the first step is to upload the original reference audio file (it may need some processing for better results), the second step is to enter the transcript corresponding to that reference audio, the third step is to enter the text you want synthesized, and the last step is to click Generate and wait patiently.
When using the voice cloning function, in addition to providing the text, a reference speech sample is needed for the model to imitate its timbre, intonation, reading habits, and so on. The quality of this sample has a very large impact on the final result. The input sample can also be pre-processed, for example by trimming its length (the official recommendation is a 3-10 s sample; longer samples consume more inference time) and applying noise reduction.
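As a minimal sketch of such pre-processing (the file names are placeholders, and torchaudio is assumed to be available from the earlier dependency installation), you could resample the reference recording to 16 kHz mono and trim it to the recommended window:

import torchaudio

# Load the raw reference recording (placeholder file name)
waveform, sr = torchaudio.load('raw_prompt.wav')
# Mix down to mono and resample to 16 kHz, the rate used for prompt audio in the examples below
waveform = waveform.mean(dim=0, keepdim=True)
waveform = torchaudio.functional.resample(waveform, sr, 16000)
# Keep at most the first 10 seconds, in line with the recommended 3-10 s sample length
waveform = waveform[:, :16000 * 10]
torchaudio.save('zero_shot_prompt.wav', waveform, 16000)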
If you need to develop applications based on the model, or adjust more detailed parameters, you need to wrap the API provided by the model and build on it. For zero-shot / cross-lingual inference, use the CosyVoice-300M model. For SFT inference, use the CosyVoice-300M-SFT model. For instruct inference, use the CosyVoice-300M-Instruct model. First, add third_party/Matcha-TTS to your PYTHONPATH.
export PYTHONPATH=third_party/Matcha-TTS
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav
import torchaudio

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')
# sft usage
print(cosyvoice.list_avaliable_spks())
# change stream=True for chunk stream inference
for i, j in enumerate(cosyvoice.inference_sft('你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?', '中文女', stream=False)):
    torchaudio.save('sft_{}.wav'.format(i), j['tts_speech'], 22050)

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M')
# zero_shot usage, <|zh|><|en|><|jp|><|yue|><|ko|> for Chinese/English/Japanese/Cantonese/Korean
prompt_speech_16k = load_wav('zero_shot_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], 22050)

# cross_lingual usage
prompt_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k, stream=False)):
    torchaudio.save('cross_lingual_{}.wav'.format(i), j['tts_speech'], 22050)

cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-Instruct')
# instruct usage, supports [laughter][breath]
for i, j in enumerate(cosyvoice.inference_instruct('在面对挑战时,他展现了非凡的勇气与智慧。', '中文男', 'Theo \'Crimson\', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.', stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], 22050)
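As a starting point for the application development mentioned above, here is a minimal sketch (not part of the official project) that wraps SFT inference in a simple HTTP endpoint using FastAPI; the route name and query parameters are illustrative assumptions.

# Hypothetical example: expose CosyVoice SFT inference as a small HTTP service.
# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
import io
import torch
import torchaudio
from fastapi import FastAPI
from fastapi.responses import Response
from cosyvoice.cli.cosyvoice import CosyVoice

app = FastAPI()
cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT')  # load the SFT model once at startup

@app.get('/tts')
def tts(text: str, spk: str = '中文女'):
    # Run non-streaming inference and concatenate the returned chunks into one waveform
    chunks = [out['tts_speech'] for out in cosyvoice.inference_sft(text, spk, stream=False)]
    speech = torch.cat(chunks, dim=1)
    # Encode the waveform as WAV in memory and return it as the HTTP response body
    buf = io.BytesIO()
    torchaudio.save(buf, speech, 22050, format='wav')
    return Response(content=buf.getvalue(), media_type='audio/wav')

A request such as http://<server-ip>:8000/tts?text=... would then return a WAV file, which makes it easy to call the model from other applications.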