How to Install Whisper on Ubuntu Server

Learn how to install Whisper on Ubuntu with this simple guide. Explore its powerful speech-to-text transcription capabilities today!

What is Whisper?

Whisper is a general-purpose speech recognition model. It is trained on a large and diverse audio dataset and is a multi-task model that can perform multilingual speech recognition, speech translation, and language identification.

Under the hood, Whisper is a Transformer sequence-to-sequence model trained on several speech processing tasks: multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, which allows a single model to replace many stages of a traditional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
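To make the language identification task concrete, here is a minimal sketch that can be run once Whisper is installed (see the steps further down). It follows the lower-level calls shown in the upstream Whisper README (load_audio, pad_or_trim, log_mel_spectrogram, detect_language). The helper names most_likely_language and detect_language are made up for this illustration, and the default spectrogram settings are assumed to match the "base" model.

```python
def most_likely_language(probs):
    """Return the (language_code, probability) pair with the highest probability."""
    code = max(probs, key=probs.get)
    return code, probs[code]


def detect_language(path, model_name="base"):
    """Identify the spoken language in the first 30 seconds of an audio file.

    Hypothetical helper for illustration; it needs Whisper and ffmpeg installed.
    """
    import whisper  # imported lazily so the helper above works without Whisper

    model = whisper.load_model(model_name)

    # Load the audio and pad or trim it to the 30-second window the model expects.
    audio = whisper.load_audio(path)
    audio = whisper.pad_or_trim(audio)

    # Build the log-Mel spectrogram and move it to the same device as the model.
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect_language returns a dict mapping language codes to probabilities.
    _, probs = model.detect_language(mel)
    return most_likely_language(probs)
```

Calling detect_language("audio.mp3") returns a language code such as "en" together with the probability the model assigns to it.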

Setup Environment

Dedicated GPU P620 server (https://www.gpu-mart.com/quadro-k620), Ubuntu20 OS

Available Models and Languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model; actual speed may vary depending on many factors including the available hardware.

[Table: Whisper models]

The .en models for English-only applications tend to perform better, especially for the tiny.en and base.en models. We observed that the difference becomes less significant for the small.en and medium.en models.
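As a rough aid for choosing a model, the sketch below picks the largest model that fits into a given amount of GPU memory and switches to the .en variant when only English is needed. The memory figures are approximate values recalled from the upstream Whisper README for the five-model lineup and should be treated as assumptions, not measurements. The function name choose_model is hypothetical, and "largest that fits" is only a heuristic, since a smaller model may be preferable when speed matters.

```python
# Approximate VRAM needs in GB, ordered from smallest to largest model.
# These numbers are assumptions based on the upstream README, not measured values.
APPROX_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]

# The large model has no English-only version.
HAS_ENGLISH_ONLY = {"tiny", "base", "small", "medium"}


def choose_model(vram_gb, english_only=False):
    """Return the name of the largest model that fits into vram_gb of GPU memory."""
    fitting = [name for name, need in APPROX_VRAM_GB if need <= vram_gb]
    if not fitting:
        raise ValueError("not enough GPU memory for any Whisper model")
    name = fitting[-1]
    if english_only and name in HAS_ENGLISH_ONLY:
        name += ".en"
    return name


print(choose_model(2))                     # card with 2 GB of memory
print(choose_model(2, english_only=True))  # same card, English-only audio
```

The returned name can be passed directly to whisper.load_model() or to the --model option of the command-line tool.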

How to Install Whisper

Step 1 - Install GPU driver

Step 2 - Install pip3

sudo apt update && sudo apt install python3-pip

Step 3 - Install Whisper

pip install -U openai-whisper
# Or install the latest commit directly from GitHub:
pip install git+https://github.com/openai/whisper.git

Step 4 - Install ffmpeg

Whisper also requires the command-line tool ffmpeg to be installed on your system. It is available from most package managers.

sudo apt update && sudo apt install ffmpeg
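Before moving on, it can help to confirm that the four steps above took effect. The following sketch checks that ffmpeg is on the PATH, that the whisper package is importable, and that PyTorch can see a CUDA GPU. The function name check_environment is made up for this example, and the CUDA check assumes PyTorch was pulled in as a dependency of openai-whisper.

```python
import importlib.util
import shutil


def check_environment():
    """Report which pieces of the Whisper setup are present on this machine."""
    status = {
        # ffmpeg must be reachable on the PATH for Whisper to decode audio files.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_spec locates the package without actually importing it.
        "whisper": importlib.util.find_spec("whisper") is not None,
        # Stays False unless PyTorch is installed and reports a usable GPU.
        "cuda": False,
    }
    try:
        import torch

        status["cuda"] = bool(torch.cuda.is_available())
    except ImportError:
        pass
    return status


for name, ok in check_environment().items():
    print(f"{name}: {'OK' if ok else 'missing'}")
```

If cuda is reported as missing, Whisper can still run on the CPU, only more slowly.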

How to Use Whisper

After the installation is done, you can try out speech-to-text transcription.

Python usage

1. Create the transcription script, for example with vim whisper_01.py:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

Replace audio.mp3 with the name of your own file. You can download an audio sample online or record one yourself. Chinese speech can be recognized, but the accuracy may not be very high.

2. Run the script:

python3 whisper_01.py

The script prints the transcribed text, similar to this:

[Screenshot: Whisper output]
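Besides the full text, the dictionary returned by transcribe() also contains a "segments" list in which each entry carries "start" and "end" times in seconds plus the segment "text". The sketch below prints one line per segment with timestamps. The helper names format_timestamp, format_segments, and transcribe_with_timestamps are hypothetical and exist only for this illustration.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to HH:MM:SS, dropping fractions of a second."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments):
    """Turn Whisper segments into '[start -> end] text' lines."""
    return [
        f"[{format_timestamp(s['start'])} -> {format_timestamp(s['end'])}] "
        f"{s['text'].strip()}"
        for s in segments
    ]


def transcribe_with_timestamps(path, model_name="base"):
    """Transcribe an audio file and return one timestamped line per segment."""
    import whisper  # imported lazily; requires the installation steps above

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return format_segments(result["segments"])
```

For example, print("\n".join(transcribe_with_timestamps("audio.mp3"))) writes the timestamped transcript to the terminal.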

Command-line usage

The following command will transcribe speech in audio files, using the medium model:

whisper audio.flac audio.mp3 audio.wav --model medium

The default setting (which selects the small model) works well for transcribing English. To transcribe an audio file containing non-English speech, you can specify the language using the --language option:

whisper japanese.wav --language Japanese

Adding --task translate will translate the speech into English:

whisper japanese.wav --language Japanese --task translate
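When many files need to be processed, the same commands can be driven from a Python script. The sketch below assembles the argument list from the options shown above (--model, --language, --task) and hands it to subprocess. The function names build_whisper_command and run_whisper are hypothetical, and the sketch assumes the whisper executable is on the PATH.

```python
import subprocess


def build_whisper_command(files, model=None, language=None, task=None):
    """Assemble the whisper CLI call as an argument list, skipping unset options."""
    cmd = ["whisper", *files]
    if model:
        cmd += ["--model", model]
    if language:
        cmd += ["--language", language]
    if task:
        cmd += ["--task", task]
    return cmd


def run_whisper(files, **options):
    """Run the whisper CLI and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_whisper_command(files, **options), check=True)


# Show the command that would be executed, without running it.
print(" ".join(build_whisper_command(
    ["japanese.wav"], language="Japanese", task="translate")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.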

Run the following to view all available options:

whisper --help