SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED). This blog will guide you through the installation and usage of the SenseVoice model, making it as user-friendly as possible.
Github Repo link: https://github.com/FunAudioLLM/SenseVoice
Online Demo link: https://huggingface.co/spaces/FunAudioLLM/SenseVoice
SenseVoice focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
Multilingual Speech Recognition: Trained on more than 400,000 hours of data and supporting over 50 languages, SenseVoice delivers recognition performance that surpasses the Whisper model.
Efficient Inference: The SenseVoice-Small model utilizes a non-autoregressive end-to-end framework, leading to exceptionally low inference latency. It requires only 70ms to process 10 seconds of audio, which is 15 times faster than Whisper-Large.
Audio Event Detection: Offers sound event detection capabilities, supporting the detection of various common human-computer interaction events such as background music (BGM), applause, laughter, crying, coughing, and sneezing.
Convenient Finetuning: Provides convenient finetuning scripts and strategies, allowing users to easily address long-tail sample issues according to their business scenarios.
Service Deployment: Offers a service deployment pipeline that supports multiple concurrent requests, with client-side languages including Python, C++, HTML, Java, and C#, among others.
First clone the official project and create an independent Python virtual environment.
git clone https://github.com/FunAudioLLM/SenseVoice.git
cd SenseVoice

# Create a Python 3.8+ environment and activate it
conda create -n sensevoice python=3.8
conda activate sensevoice
At this point, the virtual environment is activated. Next, install the third-party packages that the project depends on.
# On a GPUMart server with a US IP
pip install -r requirements.txt

# If the server is in China, use the Aliyun PyPI mirror
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
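Before moving on, it is worth confirming that the key dependencies installed correctly and that PyTorch can see the GPU. The snippet below is a quick, optional sanity check that only uses the standard library and the packages installed above.

# Quick sanity check: run inside the activated sensevoice environment.
from importlib.metadata import version

import torch

print("funasr version:", version("funasr"))
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())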
The model files are large and are downloaded automatically the first time the service starts, so the first launch can take a while. Start the service with the following command:
python webui.py
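If you prefer to pre-download the model so the first launch starts faster, something like the sketch below may work. It assumes the modelscope package (which FunASR uses to fetch models from the ModelScope hub) is available in the environment; the model ID matches the one used throughout this post.

# Optional: pre-fetch the SenseVoiceSmall weights before launching webui.py
from modelscope import snapshot_download

local_dir = snapshot_download("iic/SenseVoiceSmall")
print("Model cached at:", local_dir)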
Now you can access the WebUI application, which is built with the Gradio library, by visiting the server's LAN IP address on port 7860.
Using the WebUI is simple: upload your audio file, select the language (optional), click the Start button, and wait for the background processing to finish. The recognized text is output in the Results area.
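The WebUI can also be driven programmatically instead of through the browser. The sketch below uses the gradio_client package (installed separately with pip install gradio_client); the exact endpoint names and argument order depend on how webui.py defines its Gradio interface, so inspect view_api() rather than assuming a particular signature.

# Connect to the running WebUI and list its callable endpoints.
from gradio_client import Client

client = Client("http://127.0.0.1:7860")  # replace with your server's LAN IP
client.view_api()  # prints the available endpoints and their parameters
# client.predict(..., api_name="/...")  # call an endpoint once its signature is known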
We used GPUMart's RTX A4000 to test 90 minutes of audio, which took about seven or eight minutes.
If you need to build an application on top of the model, or adjust more detailed parameters, you will need to work with the inference API directly and wrap it for your own use (a minimal service-wrapper sketch follows the inference example below).
Inference usage sample: supports audio input in any format and of any duration.
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = AutoModel(
    model=model_dir,
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

# en
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,  #
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
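Building on the snippet above, the inference call can be wrapped into a small HTTP service for application development. The sketch below uses FastAPI purely as an illustration: FastAPI, uvicorn, and python-multipart are extra dependencies, and the /transcribe endpoint name and response shape are illustrative choices, not something SenseVoice ships.

# Minimal HTTP wrapper around the inference code above (illustrative only).
# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
import tempfile
from pathlib import Path

from fastapi import FastAPI, File, UploadFile
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

app = FastAPI()

# Load the model once at startup and reuse it for every request.
model = AutoModel(
    model="iic/SenseVoiceSmall",
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...), language: str = "auto"):
    # Write the upload to a temporary file so FunASR can read it from disk.
    suffix = Path(file.filename or "audio.wav").suffix
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(await file.read())
        audio_path = tmp.name
    res = model.generate(
        input=audio_path,
        cache={},
        language=language,
        use_itn=True,
        batch_size_s=60,
        merge_vad=True,
    )
    return {"text": rich_transcription_postprocess(res[0]["text"])}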
For more advanced users, exporting the model to ONNX or Libtorch is also possible, as shown in the following code:
# Take Libtorch as an example
from pathlib import Path

from funasr_torch import SenseVoiceSmall
from funasr_torch.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"
model = SenseVoiceSmall(model_dir, batch_size=10, device="cuda:0")

wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]

res = model(wav_or_scp, language="auto", use_itn=True)
print([rich_transcription_postprocess(i) for i in res])
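An equivalent ONNX example is sketched below. It assumes the funasr_onnx package mirrors the funasr_torch interface used above (as it does in the upstream SenseVoice README); quantize=True requests the quantized ONNX export.

# ONNX variant of the example above.
from pathlib import Path

from funasr_onnx import SenseVoiceSmall
from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"
model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True)

# The example file is cached under the ModelScope hub directory after download.
wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]

res = model(wav_or_scp, language="auto", use_itn=True)
print([rich_transcription_postprocess(i) for i in res])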
Note: The exported Libtorch (or ONNX) model is saved to the original model directory.