SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED). This blog will guide you through the installation and usage of the SenseVoice model, making it as user-friendly as possible.
Github Repo link: https://github.com/FunAudioLLM/SenseVoice
Online Demo link: https://huggingface.co/spaces/FunAudioLLM/SenseVoice
SenseVoice focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
Multilingual Speech Recognition: Trained on more than 400,000 hours of data and supporting over 50 languages, SenseVoice delivers recognition performance that surpasses the Whisper model.
Efficient Inference: The SenseVoice-Small model utilizes a non-autoregressive end-to-end framework, leading to exceptionally low inference latency. It requires only 70ms to process 10 seconds of audio, which is 15 times faster than Whisper-Large.
Audio Event Detection: Offers sound event detection capabilities, supporting the detection of various common human-computer interaction events such as background music (BGM), applause, laughter, crying, coughing, and sneezing.
Convenient Finetuning: Provides convenient finetuning scripts and strategies, allowing users to easily address long-tail sample issues according to their business scenarios.
Service Deployment: Offers a service deployment pipeline that supports multiple concurrent requests, with client-side languages including Python, C++, HTML, Java, and C#, among others.
First clone the official project and create an independent Python virtual environment.
git clone https://github.com/FunAudioLLM/SenseVoice.git
cd SenseVoice

# Create a Python 3.8+ environment and activate it
conda create -n sensevoice python=3.8
conda activate sensevoice
At this point, the virtual environment is activated. Next, install the third-party packages that the project depends on.
# On a GPUMart server with a US IP
pip install -r requirements.txt

# If the server is in China, use the Aliyun PyPI mirror
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
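Before moving on, it is worth confirming that the key dependencies installed correctly and that PyTorch can see the GPU. The snippet below is a quick, optional sanity check that only uses the standard library and the packages installed above.

# Quick sanity check: run inside the activated sensevoice environment.
from importlib.metadata import version

import torch

print("funasr version:", version("funasr"))
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())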
The model files are large and are downloaded automatically the first time the service starts, so the first launch can take a while. Start the service with the following command:
python webui.py
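If you prefer to pre-download the model so the first launch starts faster, something like the sketch below may work. It assumes the modelscope package (which FunASR uses to fetch models from the ModelScope hub) is available in the environment; the model ID matches the one used throughout this post.

# Optional: pre-fetch the SenseVoiceSmall weights before launching webui.py
from modelscope import snapshot_download

local_dir = snapshot_download("iic/SenseVoiceSmall")
print("Model cached at:", local_dir)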
Now you can access the WebUI application, which is built with the Gradio library, by visiting the server's LAN IP address on port 7860.
Using the WebUI is simple: upload your audio file, select the language (optional), click the Start button, and wait for the background processing to finish. The recognized text is output in the Results area.
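The WebUI can also be driven programmatically instead of through the browser. The sketch below uses the gradio_client package (installed separately with pip install gradio_client); the exact endpoint names and argument order depend on how webui.py defines its Gradio interface, so inspect view_api() rather than assuming a particular signature.

# Connect to the running WebUI and list its callable endpoints.
from gradio_client import Client

client = Client("http://127.0.0.1:7860")  # replace with your server's LAN IP
client.view_api()  # prints the available endpoints and their parameters
# client.predict(..., api_name="/...")  # call an endpoint once its signature is known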
We used GPUMart's RTX A4000 to test 90 minutes of audio, which took about seven or eight minutes.
If you need to build an application on top of the model, or adjust more detailed parameters, you will need to work with the inference API directly and wrap it for your own use (a minimal service-wrapper sketch follows the inference example below).
Inference usage sample: supports audio input in any format and of any duration.
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = AutoModel(
    model=model_dir,
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

# en
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,  #
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
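Building on the snippet above, the inference call can be wrapped into a small HTTP service for application development. The sketch below uses FastAPI purely as an illustration: FastAPI, uvicorn, and python-multipart are extra dependencies, and the /transcribe endpoint name and response shape are illustrative choices, not something SenseVoice ships.

# Minimal HTTP wrapper around the inference code above (illustrative only).
# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
import tempfile
from pathlib import Path

from fastapi import FastAPI, File, UploadFile
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

app = FastAPI()

# Load the model once at startup and reuse it for every request.
model = AutoModel(
    model="iic/SenseVoiceSmall",
    trust_remote_code=True,
    remote_code="./model.py",
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...), language: str = "auto"):
    # Write the upload to a temporary file so FunASR can read it from disk.
    suffix = Path(file.filename or "audio.wav").suffix
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(await file.read())
        audio_path = tmp.name
    res = model.generate(
        input=audio_path,
        cache={},
        language=language,
        use_itn=True,
        batch_size_s=60,
        merge_vad=True,
    )
    return {"text": rich_transcription_postprocess(res[0]["text"])}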
For more advanced users, exporting the model to ONNX or Libtorch is also possible, as shown in the following code:
# Take Libtorch as an example
from pathlib import Path

from funasr_torch import SenseVoiceSmall
from funasr_torch.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"
model = SenseVoiceSmall(model_dir, batch_size=10, device="cuda:0")

wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]

res = model(wav_or_scp, language="auto", use_itn=True)
print([rich_transcription_postprocess(i) for i in res])
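An equivalent ONNX example is sketched below. It assumes the funasr_onnx package mirrors the funasr_torch interface used above (as it does in the upstream SenseVoice README); quantize=True requests the quantized ONNX export.

# ONNX variant of the example above.
from pathlib import Path

from funasr_onnx import SenseVoiceSmall
from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"
model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True)

# The example file is cached under the ModelScope hub directory after download.
wav_or_scp = ["{}/.cache/modelscope/hub/{}/example/en.mp3".format(Path.home(), model_dir)]

res = model(wav_or_scp, language="auto", use_itn=True)
print([rich_transcription_postprocess(i) for i in res])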
Note: The exported Libtorch (or ONNX) model is saved to the original model directory.