OpenAI Whisper: automated AI transcription and translation is here? (And how badly I speak French)

We've all tried to manually transcribe audio at some point, and it takes ages! But what if computers could finally understand the Scottish accent?

And what if computers could also 'listen' to regional Chinese, Korean, Finnish or Urdu audio (with equally challenging accents) and translate it into English?

OpenAI recently (21st September 2022) published their 'Whisper' paper and model. It makes automatic transcription of audio astoundingly simple, and translation too. So naturally I tried speaking German and French to it to see how good the translation is.

Apologies in advance for my French!

> This article is entirely standing on the shoulders of giants, researchers in their fields of expertise. I am a mere engineer researching and assembling new tools in this brave new world.

Voice recognition is notoriously bad at understanding accents such as Scottish, as pointed out by this hilarious voice-activated elevator comedy sketch by The Scottish Comedy Channel. For personal research I took a recording of that sketch and used the Whisper model to test the accuracy of the transcription. See for yourself (below): the accuracy was astounding.

Elevator Recognition https://www.youtube.com/watch?v=HbDnxzrbxn4 

Then, I spoke some German and French and got Whisper to transcribe and translate that audio too.

We'll start with the voice activated elevator comedy sketch, which has thick Scottish accents and multiple people talking. How well can Whisper 'listen' to the audio and transcribe it for us?

For this research, the input was an audio recording of the sketch. We simply ask Whisper to listen to the audio and transcribe it for us:

whisper elevator-sketch-audio.wav --model medium --language English

> Interestingly, the first time I ran this, Whisper 'thought' the audio was in Welsh, so I had to explicitly 'tell' Whisper that no, this is in fact English.
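If you run Whisper over several recordings, it can help to build the command in a script so that the language is always pinned and never left to auto-detection. The following is a minimal sketch, not an official wrapper: `build_whisper_command` and `run_whisper` are hypothetical helper names, and the only flags used are the ones shown in this article (`--model`, `--language`, `--task`). It assumes the `whisper` command is installed and on your PATH.

```python
import shlex
import subprocess
from typing import List, Optional


def build_whisper_command(
    audio_path: str,
    model: str = "medium",
    language: Optional[str] = None,
    task: Optional[str] = None,
) -> List[str]:
    """Build the argument list for the whisper CLI (hypothetical helper)."""
    command = ["whisper", audio_path, "--model", model]
    if language is not None:
        # Pinning the language skips auto-detection (which guessed Welsh here).
        command += ["--language", language]
    if task is not None:
        command += ["--task", task]
    return command


def run_whisper(command: List[str]) -> None:
    """Run the command; assumes the whisper CLI is installed and on PATH."""
    subprocess.run(command, check=True)


command = build_whisper_command("elevator-sketch-audio.wav", language="English")
print(shlex.join(command))
# prints: whisper elevator-sketch-audio.wav --model medium --language English
```

Calling `run_whisper(command)` would then perform the actual transcription.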

The output was the following text, entirely automated and AI-transcribed using OpenAI Whisper. If you've ever tried to transcribe audio like this manually, you can appreciate how much time this saves. If you then consider the accent being 'understood', and the ability to translate (scroll down), the Whisper research is thrilling:

WEBVTT

00:08.800 --> 00:10.000
 Where's the buttons?

00:10.000 --> 00:15.000
 Oh no, they installed voice recognition technology in this lift, I heard about this.

00:15.000 --> 00:20.200
 Voice recognition technology? In a lift? In Scotland?

00:20.200 --> 00:23.000
 You ever tried voice recognition technology?

00:23.000 --> 00:23.700
 No.

00:23.700 --> 00:27.400
 They don't do Scottish accents.

00:27.400 --> 00:29.500
 Eleven!

00:29.500 --> 00:33.200
 Could you please repeat that?

00:33.200 --> 00:35.200
 Eleven!

00:35.200 --> 00:37.800
 Eleven!

00:37.800 --> 00:38.800
 Eleven!

00:38.800 --> 00:40.300
 Eleven!

00:40.300 --> 00:43.500
 Could you please repeat that?

00:43.500 --> 00:46.100
 Eleven!

00:46.100 --> 00:48.500
 Whose idea was this?

00:48.500 --> 00:51.500
 You need to try an American accent.

00:51.500 --> 00:54.300
 Eleven!

00:54.300 --> 00:55.300
 Eleven!

00:55.300 --> 00:57.300
 That sounds Irish, not American.

00:57.300 --> 00:59.300
 Not, doesn't it?

00:59.300 --> 01:00.500
 Eleven!

01:00.500 --> 01:01.900
 Where in America is that? Dublin?


01:01.900 --> 01:05.900
 I'm sorry, could you please repeat that?


01:05.900 --> 01:09.700
 Try an English accent, right?

01:09.700 --> 01:11.800
 Eleven!

01:11.800 --> 01:13.600
 Eleven!

01:13.600 --> 01:16.100
 Are you from the same part of England as Dick Van Dyke?

01:16.100 --> 01:18.200
 Is he using smart house?

01:18.200 --> 01:20.700
 Please speak slowly and clearly.

01:20.700 --> 01:23.100
 Smart house!

01:23.100 --> 01:25.000
 Eleven!

01:25.000 --> 01:29.500
 I'm sorry, could you please repeat that?

01:29.500 --> 01:30.500
 Eleven!

01:30.500 --> 01:34.600
 If you don't understand a lingo, a way back came to your own country.

01:34.600 --> 01:38.200
 Ooh, what's that talk now? Is it a way back to your own country?

01:38.200 --> 01:42.100
 Oh, don't start, Mr. Bleeding Har, how can you be racist to a lift?

01:42.100 --> 01:46.000
 Please speak slowly and clearly.

01:46.000 --> 01:48.000
 Eleven!


01:48.000 --> 01:50.000
 Eleven!

01:50.000 --> 01:52.200
 Eleven!

01:52.200 --> 01:52.900
 Eleven!

01:52.900 --> 01:55.100
 They're just saying it the same way.


01:55.100 --> 01:59.200
 You're going to keep saying it until they understand Scottish, all right?

01:59.200 --> 02:01.100
 Eleven!

02:01.100 --> 02:03.000
 Eleven!

02:03.000 --> 02:04.900
 Eleven!

02:04.900 --> 02:05.600
 Eleven!

02:05.600 --> 02:07.700
 I'll just say it is anywhere, you cow!

02:07.700 --> 02:09.100
 Just open the doors!

02:09.100 --> 02:12.000
 This is a voice-activated elevator.

02:12.000 --> 02:17.000
 Please state which floor you would like to go to in a clear and calm manner.

02:17.000 --> 02:17.800
 Calm.

02:17.800 --> 02:20.200
 Calm.

02:20.200 --> 02:21.100
 Where's that coming from?

02:21.100 --> 02:22.900
 Why is it telling people to be calm?

02:22.900 --> 02:27.800
 Because they knew they'd be selling us to Scottish people who'd be going up for much sitting!

02:27.800 --> 02:29.900
 You have not selected a floor.

02:29.900 --> 02:31.200
 Aye, we have!

02:31.200 --> 02:32.800
 Eleven!

02:32.800 --> 02:38.300
 If you would like to get out of the elevator without selecting a floor, simply say,

02:38.300 --> 02:41.300
 open the doors, please.



02:41.300 --> 02:42.300
 Please?

02:42.300 --> 02:44.200
 Please?

02:44.200 --> 02:46.000
 Suck my wally.

02:46.000 --> 02:49.300
 Maybe we should just say please.

02:49.300 --> 02:56.300
 I'm not begging that for nothing.

02:56.300 --> 02:59.300
 Open the doors, please.

02:59.300 --> 03:00.400
 Please.

03:00.400 --> 03:01.800
 Pathetic.

03:01.800 --> 03:03.300
 Please remain calm.

03:03.300 --> 03:05.300
 Oh, my God!

03:05.300 --> 03:06.400
 Where did you end up tonight?

03:06.400 --> 03:08.400
 You're not there!

03:08.400 --> 03:13.500
 Aye, just wait for it to speak.


03:13.500 --> 03:15.000
 You have not selected a floor.

03:15.000 --> 03:17.000
 Oh, Bules, you cow!

03:17.000 --> 03:18.400
 You don't let us through these doors!

03:18.400 --> 03:23.100
 I'm going to come to America, I'm going to find whatever desperate actress gave you a voice,

03:23.100 --> 03:25.400
 and I'm going to go in an electric chair for you!

03:25.400 --> 03:27.500
 Scotland, you bastard!

03:27.500 --> 03:28.800
 Scotland!

03:28.800 --> 03:30.000
 Scotland!

03:30.000 --> 03:32.300
 Scotland!

03:32.300 --> 03:34.500
 Freedom!

03:34.500 --> 03:35.900
 Freedom!

03:35.900 --> 03:38.300
 Freedom!

03:38.300 --> 03:45.300
 Goin' up?
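Because the transcript is plain WebVTT text, it is easy to post-process. The sketch below is one possible way to read the cues back into Python using only the standard library, for example to count how many times someone shouts "Eleven!". The names `Cue`, `to_seconds` and `parse_vtt` are my own illustrative choices, and the parser only handles the simple timing-line-plus-text layout shown above, not the full WebVTT format.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: float  # seconds from the start of the audio
    end: float
    text: str


# A timing line such as "00:08.800 --> 00:10.000"; the hour field is optional.
STAMP = r"(?:\d+:)?\d{2}:\d{2}\.\d{3}"
TIMING = re.compile(rf"^({STAMP})\s+-->\s+({STAMP})")


def to_seconds(stamp: str) -> float:
    """Convert "MM:SS.mmm" or "HH:MM:SS.mmm" into seconds."""
    parts = stamp.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


def parse_vtt(vtt: str) -> List[Cue]:
    """Collect cues: each timing line starts a cue, following text lines fill it."""
    cues: List[Cue] = []
    for raw in vtt.splitlines():
        line = raw.strip()
        match = TIMING.match(line)
        if match:
            cues.append(Cue(to_seconds(match.group(1)), to_seconds(match.group(2)), ""))
        elif line and cues:
            # Text before the first timing line (the header) is skipped.
            last = cues[-1]
            cues[-1] = last._replace(text=f"{last.text} {line}".strip())
    return cues


sample = """WEBVTT

00:27.400 --> 00:29.500
 Eleven!

00:29.500 --> 00:33.200
 Could you please repeat that?
"""

cues = parse_vtt(sample)
elevens = sum(1 for cue in cues if cue.text == "Eleven!")
print(len(cues), elevens)
```

Blank lines are ignored, so a stray empty line between a timing line and its text does not break the parse.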

What about German audio? Can Whisper turn that into English text automatically? Yes.


To test this, I recorded myself speaking (very poor) German: https://youtu.be/pMNf-IPo2lU

I converted the audio into English text using OpenAI Whisper, and the output was not only accurate but also translated from German into English:

WEBVTT


00:00.000 --> 00:23.000
 It's not that easy when I don't have another person to talk to, I need another person and then it's easy to have a conversation.


00:23.000 --> 00:29.000
 What do you think?

This is especially impressive given the poor German and the background noise.


For variety, the same was done with French:

Input: (very poor) spoken French: https://youtu.be/UJay09BhecY

We ask Whisper to perform the transcription and translation:

whisper speaking-french.wav --model medium --language French --task translate

Output: we get an English translation of the French audio:

WEBVTT


00:00.000 --> 00:09.000
 Hello, my name is Chris. I live in England with my friends. I play the guitar and I love music.

Again, this is super impressive (Whisper, that is, certainly not my French; my pronunciation is awful), and yet the automatic translation is accurate for this audio.
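The same translation step can also be driven from Python instead of the command line. This is a rough sketch under a few assumptions: the `openai-whisper` package is installed, the helper names `load_model` and `translate_to_english` are mine, and the language is passed as a code such as `"fr"`. It uses `whisper.load_model` and the model's `transcribe` method with `task="translate"`, which mirrors the `--task translate` flag used above.

```python
from typing import Any, Dict


def load_model(name: str = "medium") -> Any:
    """Load a Whisper model; the import is lazy so it is only needed when called."""
    import whisper  # assumes the openai-whisper package is installed

    return whisper.load_model(name)


def translate_to_english(model: Any, audio_path: str, source_language: str) -> str:
    """Ask the model to translate speech in source_language into English text."""
    result: Dict[str, Any] = model.transcribe(
        audio_path,
        language=source_language,  # pin the language rather than auto-detect
        task="translate",          # translate into English instead of transcribing
    )
    return result["text"].strip()


# Usage (downloads the model on first run, so it is left commented out):
# model = load_model("medium")
# print(translate_to_english(model, "speaking-french.wav", "fr"))
```

Passing the model in as an argument keeps it loaded once and reusable across several recordings.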

Warnings and caveats

Whilst this is all good fun, the reality is that the pace of change feels unprecedented, and people are questioning whether legislation is keeping pace with AI and machine learning (Professor Stephen Roberts, Higgs Lecture 2018). Companies such as Mind Foundry, Ripjar and FPComplete are working out how to apply engineering, industry and legislation to these topics. As Professor Roberts pointed out in his Higgs Lecture, amongst the hype there is still a lot of focus needed on engineering, ethics and understanding.