Learning How Speech Recognition Works
00:05 Before we get to the nitty-gritty of doing speech recognition in Python, let’s take a moment to talk about how speech recognition works. A full discussion would fill a book, so I won’t bore you with all of the technical details here. In fact, this section is not a prerequisite for the rest of the course.
00:33 Early systems were limited to a single speaker and had limited vocabularies of about a dozen words. Modern speech recognition systems have come a long way since their ancient counterparts, and they can recognize speech from multiple speakers and have enormous vocabularies in numerous languages.
00:49 The first component of speech recognition is, of course, speech! Speech must be converted from a physical sound to an electrical signal with a microphone, and then to digital data with an analog-to-digital converter.
01:29 In a typical HMM, the speech signal is divided into 10-millisecond fragments. The power spectrum of each fragment, which is essentially a plot of the signal’s power as a function of frequency, is mapped to a vector of real numbers known as cepstral coefficients. The dimension of this vector is usually small—sometimes as low as 10, although more accurate systems may have dimension 32 or more.
02:00 To decode the speech into text, groups of vectors are matched to one or more phonemes—a fundamental unit of speech. This calculation requires training, since the sound of a phoneme varies from speaker to speaker, and even varies from one utterance to another by the same speaker.
02:17 A special algorithm is then applied to determine the most likely word—or words—that produce the given sequence of phonemes. One could imagine that this whole process may be computationally expensive. In many modern speech recognition systems, neural networks are used to simplify the speech signal using techniques for feature transformation and dimensionality reduction before the HMM recognition step.
02:42 Voice activity detectors are also used to reduce an audio signal to only the portions that are likely to contain speech. This prevents the recognizer from wasting time analyzing unnecessary parts of the signal. Fortunately, as a Python programmer, you don’t have to worry about any of this.
Become a Member to join the conversation.