Learning How Speech Recognition Works

00:00 How speech recognition works: an overview.

00:05 Before we get to the nitty-gritty of doing speech recognition in Python, let’s take a moment to talk about how speech recognition works. A full discussion would fill a book, so I won’t bore you with all of the technical details here. In fact, this section is not a prerequisite for the rest of the course.

00:22 If you’d like to get straight to the point, then feel free to skip ahead. Speech recognition has its roots in research done at Bell Labs in the early 1950s.

00:33 Early systems were restricted to a single speaker and had vocabularies of only about a dozen words. Modern speech recognition systems have come a long way since their ancient counterparts: they can recognize speech from multiple speakers and have enormous vocabularies in numerous languages.

00:49 The first component of speech recognition is, of course, speech! Speech must be converted from a physical sound to an electrical signal with a microphone, and then to digital data with an analog-to-digital converter.
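To make the analog-to-digital step a little more concrete, here is a minimal sketch of sampling and quantization in plain Python. The 440 Hz tone, the 16 kHz sample rate, and the 16-bit depth are illustrative assumptions, not values from the course, and the function name `sample_and_quantize` is made up for this example.

```python
import math


def sample_and_quantize(signal, duration_s, sample_rate_hz, bit_depth):
    """Sample a continuous signal and quantize it to signed integers.

    `signal` is any function of time (in seconds) returning a value
    in the range [-1.0, 1.0], standing in for the microphone voltage.
    """
    max_level = 2 ** (bit_depth - 1) - 1  # 32767 for 16-bit audio
    num_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(num_samples):
        t = n / sample_rate_hz                  # time of the n-th sample
        value = max(-1.0, min(1.0, signal(t)))  # clip to the valid range
        samples.append(round(value * max_level))
    return samples


def tone(t):
    """A 440 Hz sine wave standing in for the analog signal (assumption)."""
    return math.sin(2 * math.pi * 440 * t)


pcm = sample_and_quantize(tone, duration_s=0.01, sample_rate_hz=16000, bit_depth=16)
print(len(pcm), pcm[:5])
```

A real converter does this in hardware, but the idea is the same: measure the signal at regular intervals and round each measurement to the nearest representable integer.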

01:02 Once digitized, several models can be used to transcribe the audio to text. Most modern speech recognition systems rely on what is known as a Hidden Markov Model.

01:13 This approach works on the assumption that a speech signal, when viewed on a short enough timescale—say, ten milliseconds—can be reasonably approximated as a stationary process.

01:23 That is, a process in which statistical properties do not change over time.

01:29 In a typical HMM, the speech signal is divided into 10-millisecond fragments. The power spectrum of each fragment, which is essentially a plot of the signal’s power as a function of frequency, is mapped to a vector of real numbers known as cepstral coefficients. The dimension of this vector is usually small—sometimes as low as 10, although more accurate systems may have dimension 32 or more.
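The framing and feature steps described above can be sketched roughly as follows. This is a simplified illustration in pure Python, assuming a naive DFT for the power spectrum and a plain cepstrum (log power spectrum followed by a DCT). Production systems typically add windowing and a mel filter bank, which are left out here. The 8 kHz sample rate, the 1 kHz test tone, and all function names are assumptions for the example.

```python
import cmath
import math


def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Cut the signal into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def power_spectrum(frame):
    """Power at each non-negative frequency bin, using a naive DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        spectrum.append(abs(total) ** 2 / n)
    return spectrum


def cepstral_coefficients(frame, num_coeffs=12):
    """Log power spectrum followed by a DCT-II, truncated to num_coeffs."""
    # The small constant avoids taking the log of zero.
    log_power = [math.log(p + 1e-10) for p in power_spectrum(frame)]
    m = len(log_power)
    return [sum(lp * math.cos(math.pi * q * (j + 0.5) / m)
                for j, lp in enumerate(log_power))
            for q in range(num_coeffs)]


# 50 ms of a 1 kHz tone sampled at 8 kHz (illustrative values).
sample_rate_hz = 8000
signal = [math.sin(2 * math.pi * 1000 * n / sample_rate_hz) for n in range(400)]

frames = split_into_frames(signal, sample_rate_hz)
spectrum = power_spectrum(frames[0])
features = [cepstral_coefficients(frame) for frame in frames]
print(len(frames), len(frames[0]), len(features[0]))
```

Each 10-millisecond frame ends up as one short vector of real numbers, which is the kind of compact representation the recognizer works with.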

01:55 The final output of the HMM is a sequence of these vectors.

02:00 To decode the speech into text, groups of vectors are matched to one or more phonemes—a fundamental unit of speech. This calculation requires training, since the sound of a phoneme varies from speaker to speaker, and even varies from one utterance to another by the same speaker.

02:17 A special algorithm is then applied to determine the most likely word—or words—that produce the given sequence of phonemes. One could imagine that this whole process may be computationally expensive. In many modern speech recognition systems, neural networks are used to simplify the speech signal using techniques for feature transformation and dimensionality reduction before the HMM recognition step.
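The video doesn't name the decoding algorithm. A common choice for HMMs is the Viterbi algorithm, so the sketch below uses it on a toy model to find the most likely phoneme sequence. The phoneme labels, the acoustic symbols ("burst", "voiced", "click"), and every probability are invented for illustration, and real systems work with continuous feature vectors rather than three discrete symbols.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               the state that path came from)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)

    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Toy model for the word "cat" (all numbers are made up).
states = ["k", "ae", "t"]
start_p = {"k": 0.8, "ae": 0.1, "t": 0.1}
trans_p = {
    "k": {"k": 0.3, "ae": 0.6, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.4, "t": 0.5},
    "t": {"k": 0.1, "ae": 0.1, "t": 0.8},
}
emit_p = {
    "k": {"burst": 0.7, "voiced": 0.1, "click": 0.2},
    "ae": {"burst": 0.1, "voiced": 0.8, "click": 0.1},
    "t": {"burst": 0.2, "voiced": 0.1, "click": 0.7},
}

path = viterbi(["burst", "voiced", "voiced", "click"],
               states, start_p, trans_p, emit_p)
print(path)  # ['k', 'ae', 'ae', 't']
```

Note that the vowel spans two frames here. Collapsing repeated states gives the phoneme sequence k, ae, t, which a pronunciation dictionary could then map to a word.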

02:42 Voice activity detectors are also used to reduce an audio signal to only the portions that are likely to contain speech. This prevents the recognizer from wasting time analyzing unnecessary parts of the signal. Fortunately, as a Python programmer, you don’t have to worry about any of this.
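As a rough idea of what a voice activity detector does, here is a minimal energy-based sketch: frames whose average energy exceeds a threshold are treated as speech. This is one simple approach, not necessarily what any particular recognizer uses, and the frame length, threshold, and function name are assumptions chosen for the example.

```python
import math


def detect_voice_activity(samples, frame_len, threshold):
    """Return (start, end) sample indices of runs of high-energy frames."""
    segments = []
    start = None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len  # mean squared amplitude
        if energy > threshold:
            if start is None:
                start = i                  # a speech-like run begins here
        elif start is not None:
            segments.append((start, i))    # the run ended at this quiet frame
            start = None
    if start is not None:
        # The signal ended while still inside a run.
        segments.append((start, (len(samples) // frame_len) * frame_len))
    return segments


# Silence, then a burst of sound, then silence again (synthetic data).
silence = [0.0] * 100
burst = [0.5 * math.sin(2 * math.pi * n / 10) for n in range(200)]
audio = silence + burst + silence

segments = detect_voice_activity(audio, frame_len=50, threshold=0.01)
print(segments)  # [(100, 300)]
```

Only the samples inside the returned segments would be passed on to the recognizer, so the silent stretches cost nothing to analyze.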

03:00 A number of speech recognition services are available for use online through an API, and many of these services offer Python SDKs.

03:09 In the next section, you’ll see an overview of available Python packages.
