OpenAI Whisper: automated AI transcription and translation is here? (And how badly I speak French)
We've all tried to manually transcribe audio at some point, and it takes ages! But what if computers could finally understand the Scottish accent?
And what if computers could also 'listen' to regional Chinese, Korean, Finnish or Urdu audio (with equally challenging accents) and translate it into English too?
OpenAI recently (21st September 2022) published their 'Whisper' paper and model. It makes transcribing audio automatically astoundingly simple, and it can translate too. So naturally I tried speaking German and French to it to see how good the translation is.
Apologies in advance for my French!
> This article is entirely standing on the shoulders of giants, researchers in their fields of expertise. I am a mere engineer researching and assembling new tools in this brave new world.
Voice recognition is notoriously bad at understanding accents such as Scottish, as pointed out by this hilarious voice-activated elevator comedy sketch by The Scottish Comedy Channel. For personal research I took a recording of that sketch and used the Whisper model to test the accuracy of the transcription. See for yourself below: the accuracy was astounding.
Then, I spoke some German and French and got Whisper to transcribe and translate that audio too.
We'll start with the voice-activated elevator comedy sketch, which has thick Scottish accents and multiple people talking. How well can Whisper 'listen' to the audio and transcribe it for us?
For this research, the input was the audio recording of the sketch. We simply ask Whisper to listen to the audio and transcribe it for us:
whisper elevator-sketch-audio.wav --model medium --language English
> Interestingly, the first time I ran this, Whisper 'thought' the audio was in Welsh, so I had to explicitly 'tell' Whisper that no, this is in fact English.
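For anyone curious about what Whisper 'thinks' the language is before forcing `--language`, here is a rough sketch using the `whisper` Python package, following the language-detection example in the project's README. The helper names `most_likely_language` and `detect_language` are made up for illustration, and the idea that Welsh narrowly out-scored English on this recording is a guess, not something measured here.

```python
from typing import Dict, Tuple


def most_likely_language(probs: Dict[str, float]) -> str:
    """Return the language code with the highest detection probability."""
    return max(probs, key=probs.get)


def detect_language(audio_path: str, model_name: str = "medium") -> Tuple[str, Dict[str, float]]:
    """Ask a Whisper model which language it hears in one 30-second window.

    Returns the top language code and the full {code: probability} table.
    """
    # Imported lazily so the helper above works without Whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    # Load the audio and pad or trim it to the 30-second window the model expects.
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    # Build the log-Mel spectrogram on the same device as the model.
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)
    return most_likely_language(probs), probs
```

Calling `detect_language("elevator-sketch-audio.wav")` should return a code such as `en` (English) or `cy` (Welsh) together with the probability table, which makes it easy to see how close the runner-up was.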
Whisper's output was the following text, entirely automated and AI-transcribed using OpenAI Whisper. If you've ever tried to transcribe audio like this manually, you can appreciate how much time this saves. Add the fact that the accent is 'understood', plus the ability to translate (scroll down), and the Whisper research is thrilling:
WEBVTT
00:08.800 --> 00:10.000
Where's the buttons?
00:10.000 --> 00:15.000
Oh no, they installed voice recognition technology in this lift, I heard about this.
00:15.000 --> 00:20.200
Voice recognition technology? In a lift? In Scotland?
00:20.200 --> 00:23.000
You ever tried voice recognition technology?
00:23.000 --> 00:23.700
No.
00:23.700 --> 00:27.400
They don't do Scottish accents.
00:27.400 --> 00:29.500
Eleven!
00:29.500 --> 00:33.200
Could you please repeat that?
00:33.200 --> 00:35.200
Eleven!
00:35.200 --> 00:37.800
Eleven!
00:37.800 --> 00:38.800
Eleven!
00:38.800 --> 00:40.300
Eleven!
00:40.300 --> 00:43.500
Could you please repeat that?
00:43.500 --> 00:46.100
Eleven!
00:46.100 --> 00:48.500
Whose idea was this?
00:48.500 --> 00:51.500
You need to try an American accent.
00:51.500 --> 00:54.300
Eleven!
00:54.300 --> 00:55.300
Eleven!
00:55.300 --> 00:57.300
That sounds Irish, not American.
00:57.300 --> 00:59.300
Not, doesn't it?
00:59.300 --> 01:00.500
Eleven!
01:00.500 --> 01:01.900
Where in America is that? Dublin?
01:01.900 --> 01:05.900
I'm sorry, could you please repeat that?
01:05.900 --> 01:09.700
Try an English accent, right?
01:09.700 --> 01:11.800
Eleven!
01:11.800 --> 01:13.600
Eleven!
01:13.600 --> 01:16.100
Are you from the same part of England as Dick Van Dyke?
01:16.100 --> 01:18.200
Is he using smart house?
01:18.200 --> 01:20.700
Please speak slowly and clearly.
01:20.700 --> 01:23.100
Smart house!
01:23.100 --> 01:25.000
Eleven!
01:25.000 --> 01:29.500
I'm sorry, could you please repeat that?
01:29.500 --> 01:30.500
Eleven!
01:30.500 --> 01:34.600
If you don't understand a lingo, a way back came to your own country.
01:34.600 --> 01:38.200
Ooh, what's that talk now? Is it a way back to your own country?
01:38.200 --> 01:42.100
Oh, don't start, Mr. Bleeding Har, how can you be racist to a lift?
01:42.100 --> 01:46.000
Please speak slowly and clearly.
01:46.000 --> 01:48.000
Eleven!
01:48.000 --> 01:50.000
Eleven!
01:50.000 --> 01:52.200
Eleven!
01:52.200 --> 01:52.900
Eleven!
01:52.900 --> 01:55.100
They're just saying it the same way.
01:55.100 --> 01:59.200
You're going to keep saying it until they understand Scottish, all right?
01:59.200 --> 02:01.100
Eleven!
02:01.100 --> 02:03.000
Eleven!
02:03.000 --> 02:04.900
Eleven!
02:04.900 --> 02:05.600
Eleven!
02:05.600 --> 02:07.700
I'll just say it is anywhere, you cow!
02:07.700 --> 02:09.100
Just open the doors!
02:09.100 --> 02:12.000
This is a voice-activated elevator.
02:12.000 --> 02:17.000
Please state which floor you would like to go to in a clear and calm manner.
02:17.000 --> 02:17.800
Calm.
02:17.800 --> 02:20.200
Calm.
02:20.200 --> 02:21.100
Where's that coming from?
02:21.100 --> 02:22.900
Why is it telling people to be calm?
02:22.900 --> 02:27.800
Because they knew they'd be selling us to Scottish people who'd be going up for much sitting!
02:27.800 --> 02:29.900
You have not selected a floor.
02:29.900 --> 02:31.200
Aye, we have!
02:31.200 --> 02:32.800
Eleven!
02:32.800 --> 02:38.300
If you would like to get out of the elevator without selecting a floor, simply say,
02:38.300 --> 02:41.300
open the doors, please.
02:41.300 --> 02:42.300
Please?
02:42.300 --> 02:44.200
Please?
02:44.200 --> 02:46.000
Suck my wally.
02:46.000 --> 02:49.300
Maybe we should just say please.
02:49.300 --> 02:56.300
I'm not begging that for nothing.
02:56.300 --> 02:59.300
Open the doors, please.
02:59.300 --> 03:00.400
Please.
03:00.400 --> 03:01.800
Pathetic.
03:01.800 --> 03:03.300
Please remain calm.
03:03.300 --> 03:05.300
Oh, my God!
03:05.300 --> 03:06.400
Where did you end up tonight?
03:06.400 --> 03:08.400
You're not there!
03:08.400 --> 03:13.500
Aye, just wait for it to speak.
03:13.500 --> 03:15.000
You have not selected a floor.
03:15.000 --> 03:17.000
Oh, Bules, you cow!
03:17.000 --> 03:18.400
You don't let us through these doors!
03:18.400 --> 03:23.100
I'm going to come to America, I'm going to find whatever desperate actress gave you a voice,
03:23.100 --> 03:25.400
and I'm going to go in an electric chair for you!
03:25.400 --> 03:27.500
Scotland, you bastard!
03:27.500 --> 03:28.800
Scotland!
03:28.800 --> 03:30.000
Scotland!
03:30.000 --> 03:32.300
Scotland!
03:32.300 --> 03:34.500
Freedom!
03:34.500 --> 03:35.900
Freedom!
03:35.900 --> 03:38.300
Freedom!
03:38.300 --> 03:45.300
Goin' up?
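The transcript above is in WebVTT form: a header line, then timing lines of the form `start --> end`, each followed by the text spoken in that window. As a minimal sketch of how such output could be loaded for searching or checking, the snippet below parses it with the Python standard library only. The names `Cue`, `to_seconds` and `parse_vtt` are illustrative and not part of Whisper.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str


# Matches "MM:SS.mmm --> MM:SS.mmm", with an optional leading "H:" part.
_TIMING = re.compile(
    r"^\s*((?:\d+:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2}\.\d{3})"
)


def to_seconds(stamp: str) -> float:
    """Convert 'MM:SS.mmm' or 'H:MM:SS.mmm' into seconds."""
    parts = stamp.split(":")
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + int(parts[-2]) * 60 + float(parts[-1])


def parse_vtt(text: str) -> List[Cue]:
    """Split WebVTT text into cues; lines before the first timing are ignored."""
    cues: List[Cue] = []
    start = end = None
    lines: List[str] = []
    for raw in text.splitlines():
        match = _TIMING.match(raw)
        if match:
            # A new timing line closes the previous cue, if there was one.
            if start is not None:
                cues.append(Cue(start, end, " ".join(lines)))
            start, end = to_seconds(match.group(1)), to_seconds(match.group(2))
            lines = []
        elif start is not None and raw.strip():
            lines.append(raw.strip())
    if start is not None:
        cues.append(Cue(start, end, " ".join(lines)))
    return cues
```

With the cues in a list, something like `[c for c in parse_vtt(text) if "Eleven" in c.text]` would pull out every attempt at the floor number along with its timing.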
What about German audio? Can Whisper turn that into English text automatically? Yes.
To test this, I recorded myself speaking (very poor) German: https://youtu.be/pMNf-IPo2lU
I converted the audio into English text using OpenAI Whisper, and the result was not only accurate, but also translated from German into English:
WEBVTT
00:00.000 --> 00:23.000
It's not that easy when I don't have another person to talk to, I need another person and then it's easy to have a conversation.
00:23.000 --> 00:29.000
What do you think?
This is especially impressive given the poor German and the background noise.
For variety, the same was done with French:
Input: (very poor) spoken French: https://youtu.be/UJay09BhecY
We ask Whisper to perform the transcription and translation:
whisper speaking-french.wav --model medium --language French --task translate
Output: we get an English translation of the French audio:
WEBVTT
00:00.000 --> 00:09.000
Hello, my name is Chris. I live in England with my friends. I play the guitar and I love music.
Again, this is super impressive (Whisper, that is; certainly not my French, as my pronunciation is awful), and yet the automatic transcription is accurate for this audio.
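The two commands shown in this article differ only in their flags, so they are easy to script. The sketch below is one possible way to assemble the same command lines from Python. The helper names `build_whisper_command` and `run_whisper` are made up for illustration, it assumes the `whisper` executable is on the PATH, and it only uses the `--model`, `--language` and `--task` flags that appear above.

```python
import subprocess
from typing import List, Optional


def build_whisper_command(
    audio_path: str,
    model: str = "medium",
    language: Optional[str] = None,
    task: str = "transcribe",
) -> List[str]:
    """Build the argument list for the whisper command line tool."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    cmd = ["whisper", audio_path, "--model", model]
    # Leaving the language out lets Whisper guess it from the audio.
    if language is not None:
        cmd += ["--language", language]
    # Only pass --task when asking for a translation into English.
    if task == "translate":
        cmd += ["--task", task]
    return cmd


def run_whisper(audio_path: str, **options) -> None:
    """Run whisper on one file; raises CalledProcessError if it fails."""
    subprocess.run(build_whisper_command(audio_path, **options), check=True)
```

For example, `run_whisper("speaking-french.wav", language="French", task="translate")` would issue the same command as the French test above. Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.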
Warnings and caveats
Whilst this is all good fun, the reality is that the pace of change feels unprecedented, and people are questioning whether legislation is keeping pace with AI and machine learning (Professor Stephen Roberts, Higgs Lecture 2018). Companies such as Mind Foundry, Ripjar and FPComplete are working out how to apply engineering, industry and legislation to these topics. As Professor Roberts pointed out in his Higgs Lecture, amongst the hype there is still a lot of focus needed on the engineering, ethics and understanding.