OpenAI Whisper: Transcribe and Translate Texts
Whisper from OpenAI is a speech recognition system that can transcribe and translate spoken language. The program supports many languages.
Whisper is an automatic speech recognition system from OpenAI with an encoder-decoder transformer architecture. It was trained on 680,000 hours of multilingual and multitask supervised data collected from the web. This large, diverse training set should lead to improved robustness to accents, background noise, and technical language.
- Spoken audio can be transcribed in multiple languages, and speech in those languages can be translated into English.
- The Whisper architecture is a simple end-to-end approach implemented as an encoder-decoder transformer. The input signal is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed to an encoder.
- A decoder is trained to predict the corresponding text, interleaved with special tokens that direct the single model to perform tasks such as language identification and phrase-level timestamping.
- Since Whisper was trained on a large and diverse dataset rather than fine-tuned to a specific one, it is more robust than models tuned to a single benchmark.
- About one-third of Whisper's audio dataset is non-English. Whisper is trained alternately on transcribing audio in its original language and on translating it into English. This approach is particularly effective for learning speech-to-text translation.
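The 30-second chunking step described above can be sketched in plain Python. This is a simplified illustration, not Whisper's actual implementation: the real pipeline works on 16 kHz mono audio and pads the final window with silence to a full 30 seconds before computing the log-Mel spectrogram.

```python
SAMPLE_RATE = 16_000                          # Whisper resamples all audio to 16 kHz mono
CHUNK_SECONDS = 30                            # fixed window length the encoder expects
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per 30-second chunk

def split_into_chunks(samples):
    """Split a list of PCM samples into 30-second chunks,
    zero-padding the last chunk to the full window length."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))  # pad with silence
        chunks.append(chunk)
    return chunks
```

Each resulting chunk would then be converted to a log-Mel spectrogram and fed to the encoder independently.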
Several model sizes to choose from
Whisper is available in five model sizes that can be run on a local computer. In addition, there is an API to a hosted version of Whisper; this is a paid service, billed according to the length of the audio. The following model sizes are offered:
- Tiny: around 40 million parameters, English-only and multilingual variants, around 1 gigabyte of VRAM required, relative speed around 32x
- Base: over 70 million parameters, English-only and multilingual variants, around 1 gigabyte of VRAM, around 16x
- Small: around 250 million parameters, English-only and multilingual variants, around 2 gigabytes of VRAM, around 6x
- Medium: around 770 million parameters, English-only and multilingual variants, around 5 gigabytes of VRAM, around 2x
- Large: over 1.5 billion parameters, multilingual only, around 10 gigabytes of VRAM, 1x
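The trade-off in the table above can be captured in a small helper. The function `pick_model` is a hypothetical illustration (not part of the Whisper library) that picks the largest model fitting into the available VRAM; the English-only variants carry a `.en` suffix, which the large model does not offer.

```python
# Rough VRAM requirements in gigabytes per model size, from the table above.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def pick_model(available_vram_gb, english_only=False):
    """Hypothetical helper: choose the largest Whisper model that fits in VRAM.
    English-only variants use the '.en' suffix (not available for 'large')."""
    for name in ("large", "medium", "small", "base", "tiny"):
        if VRAM_GB[name] <= available_vram_gb:
            if english_only and name != "large":
                return name + ".en"
            return name
    raise ValueError("not enough VRAM for any Whisper model")
```

For example, with 4 GB of VRAM the helper would suggest the small model, since medium already needs around 5 GB.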
- Conclusion: Whisper is a free and open-source alternative to Google Speech-to-Text. The AI-based speech recognition system identifies the input language, transcribes speech in around 100 languages, places punctuation correctly, and translates the transcribed text into English.
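As a usage sketch, transcription and translation with the open-source package look roughly like this. It assumes `openai-whisper` is installed (`pip install openai-whisper`) along with ffmpeg; the file path is a placeholder, and `whisper_task` is a small helper added here for illustration.

```python
def whisper_task(translate):
    # Whisper exposes two tasks: "transcribe" (keep the original
    # language) and "translate" (always translate into English).
    return "translate" if translate else "transcribe"

def transcribe_file(path, model_name="base", translate=False):
    """Sketch: load a local Whisper model and run it on one audio file.
    Requires `pip install openai-whisper` and ffmpeg on the PATH."""
    import whisper  # imported lazily so the helper above works without the package
    model = whisper.load_model(model_name)  # "tiny", "base", "small", "medium", "large"
    result = model.transcribe(path, task=whisper_task(translate))
    return result["text"]

# Example call (path is a placeholder):
# text = transcribe_file("interview.mp3", model_name="small", translate=True)
```

The same workflow is available on the command line, e.g. `whisper interview.mp3 --model small --task translate`.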