Whisper is a general-purpose speech recognition model. It is trained on a large and diverse audio dataset and is a multi-task model that can perform multilingual speech recognition, speech translation, and language identification.
A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many stages of a traditional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
Dedicated GPU P620 server (https://www.gpu-mart.com/quadro-k620), Ubuntu20 OS
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model; actual speed may vary depending on many factors including the available hardware.
The .en models for English-only applications tend to perform better, especially for the tiny.en and base.en models. We observed that the difference becomes less significant for the small.en and medium.en models.
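The naming rule described above can be sketched in a few lines of Python. The helper `pick_model_name` is a hypothetical illustration and not part of the whisper package; it assumes only the five size names and the `.en` suffix mentioned in the text.

```python
# Hypothetical helper (not part of whisper): turn a size into a model name.
SIZES = ("tiny", "base", "small", "medium", "large")
# Only four sizes have an English-only variant; "large" is multilingual only.
ENGLISH_ONLY_SIZES = ("tiny", "base", "small", "medium")


def pick_model_name(size, english_only=False):
    """Return e.g. "base.en" for English-only use, or "base" otherwise."""
    if size not in SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    if english_only and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    # Fall back to the multilingual model when no .en variant exists.
    return size
```

The returned string is the kind of name that whisper.load_model() accepts, for example whisper.load_model(pick_model_name("base", english_only=True)).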
To install the NVIDIA driver, refer to the instructions at https://portal.databasemart.com/kb/a536/nvidia-driver-installation-on-ubuntu.aspx
sudo apt install python3-pip
pip install -U openai-whisper  # or: pip install git+https://github.com/openai/whisper.git
Whisper also requires the command-line tool ffmpeg to be installed on your system; it is available from most package managers.
sudo apt update && sudo apt install ffmpeg
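Before running Whisper, it can be worth confirming that ffmpeg is actually on the PATH. The following is a minimal sketch using only the Python standard library; the function name `ffmpeg_available` is a hypothetical choice for illustration.

```python
import shutil


def ffmpeg_available():
    """Return True if an ffmpeg executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found at", shutil.which("ffmpeg"))
    else:
        print("ffmpeg not found; install it with: sudo apt install ffmpeg")
```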
1. Create the transcription script: vim whisper_01.py
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])
Replace audio.mp3 with the name of your own file. You can download sample audio online or record it yourself. Chinese speech can also be recognized, although the accuracy may not be very high.
2. Run the script:
python3 whisper_01.py
The script prints the transcribed text to the terminal.
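Besides the "text" field used in the script, the dictionary returned by model.transcribe() also contains a "segments" list whose entries carry "start", "end" and "text" values. As a rough sketch, the hypothetical helper `format_segments` below (not part of whisper) turns such a result into timestamped lines. It works on a plain dictionary, so it can be tried without loading a model.

```python
from typing import List


def format_segments(result: dict) -> List[str]:
    """Format each transcribed segment as "[start -> end] text"."""
    lines = []
    for seg in result.get("segments", []):
        # start and end are offsets in seconds from the beginning of the audio.
        lines.append(
            f"[{seg['start']:.2f} -> {seg['end']:.2f}] {seg['text'].strip()}"
        )
    return lines
```

In whisper_01.py this could be used as: print("\n".join(format_segments(result)))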
The following command will transcribe speech in audio files, using the medium model:
> whisper audio.flac audio.mp3 audio.wav --model medium
The default setting (which selects the small model) works well for transcribing English. To transcribe an audio file containing non-English speech, you can specify the language using the --language option:
> whisper japanese.wav --language Japanese
Adding --task translate will translate the speech into English:
> whisper japanese.wav --language Japanese --task translate
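The same options can also be passed from Python, since model.transcribe() accepts language and task as keyword arguments. The sketch below is one possible way to do it: the function names `transcribe_options` and `run` are hypothetical, and the language code "ja" and the file name japanese.wav are assumptions carried over from the commands above.

```python
def transcribe_options(language=None, translate=False):
    """Build keyword arguments for model.transcribe()."""
    options = {"task": "translate" if translate else "transcribe"}
    if language is not None:
        options["language"] = language
    return options


def run(path, model_name="medium", **options):
    """Load a model and return the recognized (or translated) text."""
    import whisper  # imported here so the helper above works without whisper

    model = whisper.load_model(model_name)
    return model.transcribe(path, **options)["text"]


if __name__ == "__main__":
    # Assumes japanese.wav exists in the current directory.
    print(run("japanese.wav", **transcribe_options("ja", translate=True)))
```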
Run the following to view all available options:
> whisper --help