Working With Audio Files
00:00 Working with audio files. Before you continue, you’ll need an audio file to work with. The one I’m working with in this course is included in the course files, and you should make sure you save it to the same directory in which your Python interpreter session is running.
SpeechRecognition makes working with audio files easy thanks to the handy AudioFile class. This class can be initialized with the path to an audio file and provides a context manager interface for reading and working with the file’s contents.
If you’re working on an x86-based Linux, macOS, or Windows machine, you should be able to work with FLAC files without a problem. On other platforms, you’ll need to install a FLAC encoder and ensure you have access to the flac command line tool.
The context manager opens the file and reads its contents, storing the data in an AudioFile instance called source. Then the .record() method records the data from the entire file into an AudioData instance.

You can now invoke .recognize_google() to attempt to recognize any speech in the audio. Depending on your internet connection speed, you may have to wait several seconds before seeing the result.
If you’re wondering where the phrases in the harvard.wav file come from, they are examples of Harvard Sentences. These phrases were published by the IEEE in 1965 for use in speech intelligibility testing of telephone lines.
02:30 They’re still used in VoIP and cellular testing today. The Harvard Sentences comprise 72 lists of ten phrases. You can find freely available recordings of these phrases on the Open Speech Repository website. Recordings are available in English, Mandarin Chinese, French, and Hindi.
The .record() method accepts a duration keyword argument that stops the recording after a specified number of seconds. For example, the following captures any speech in the first four seconds of the file.
The .record() method, when used inside a with block, always moves ahead in the file stream. This means that if you record once for four seconds and then record again for four seconds, the second call returns the four seconds of audio that follow the first four.
Notice that audio2 contains a portion of the third phrase in the file. When you specify a duration, the recording might stop mid-phrase, or even mid-word, which can hurt the accuracy of the transcription. More on this in a bit.
In addition to duration, the .record() method accepts an offset keyword argument. This value represents the number of seconds from the beginning of the file to ignore before starting to record. To capture only the second phrase in the file, you could start with an offset of 4 seconds and record for a few seconds.
The offset and duration keyword arguments are useful for segmenting an audio file if you have prior knowledge of the structure of the speech in the file. However, using them hastily can result in poor transcriptions. To see this effect, try the following in your REPL.
05:42 There is another reason you may get inaccurate transcriptions. Noise! The previous examples worked well because the audio file is reasonably clean. In the real world, unless you have the opportunity to process audio files beforehand, you can’t expect the audio to be noise-free. In the next section, you’ll see some techniques to deal with noise in audio files.