Working With Audio Files
00:00 Working with audio files. Before you continue, you’ll need an audio file to work with. The one I’m working with in this course is included in the course files, and you should make sure you save it to the same directory in which your Python interpreter session is running.
SpeechRecognition makes working with audio files easy thanks to the handy AudioFile class. This class can be initialized with the path to an audio file and provides a context manager interface for reading and working with the file’s contents.
If you’re working on an x86-based Linux, macOS, or Windows machine, you should be able to work with FLAC files without a problem. On other platforms, you’ll need to install a FLAC encoder and ensure you have access to the flac command line tool.
The context manager opens the file and reads its contents, storing the data in an AudioFile instance called source. Then the .record() method records the data from the entire file into an AudioData instance.

You can now invoke .recognize_google() to attempt to recognize any speech in the audio. Depending on your internet connection speed, you may have to wait several seconds before seeing the result.
If you’re wondering where the phrases in the harvard.wav file come from, they are examples of Harvard Sentences. These phrases were published by the IEEE in 1965 for use in speech intelligibility testing of telephone lines.
02:30 They’re still used in VoIP and cellular testing today. The Harvard Sentences comprise 72 lists of ten phrases. You can find freely available recordings of these phrases on the Open Speech Repository website. Recordings are available in English, Mandarin Chinese, French, and Hindi.
The .record() method accepts a duration keyword argument that stops the recording after a specified number of seconds. For example, the following captures any speech in the first four seconds of the file.
The .record() method, when used inside a with block, always moves ahead in the file stream. This means that if you record once for four seconds and then record again for four seconds, the second call returns the four seconds of audio that follow the first four.
Notice that audio2 contains a portion of the third phrase in the file. When you specify a duration, the recording might stop mid-phrase, or even mid-word, which can hurt the accuracy of the transcription. More on this in a bit.
In addition to duration, the .record() method accepts an offset keyword argument. This value represents the number of seconds from the beginning of the file to ignore before starting to record. To capture only the second phrase in the file, you could start with an offset of 4 seconds and record for a few seconds.
The offset and duration keyword arguments are useful for segmenting an audio file if you have prior knowledge of the structure of the speech in the file. However, using them hastily can result in poor transcriptions. To see this effect, try the following in your REPL.
05:42 There is another reason you may get inaccurate transcriptions. Noise! The previous examples worked well because the audio file is reasonably clean. In the real world, unless you have the opportunity to process audio files beforehand, you can’t expect the audio to be noise-free. In the next section, you’ll see some techniques to deal with noise in audio files.