All nerds above the age of 20 probably remember thinking how cool it would be to have the computer in Star Trek: TNG because you could talk to it and it would understand you. And now we have those computers . . . sort of.
It seems like the first application to gain popular acceptance was the Ford Sync system, despite its flaws. And the movement has just rolled on from there. Now you have Siri, Google Now, Dragon Dictation, and on and on. But how does the computer interpret what we’re saying? There has been a ton of work on speech recognition software since the 1950s, and lots of different approaches have evolved from all that research.
Statistics and Probability Models
Automatic speech recognition work began in earnest in the 1950s. Back then, the best theory of speech was called “acoustic-phonetics.” The acoustic-phonetic theory tried to break speech down into phonemes, the individual sounds of speech. These could be vowels, consonants, or blended sounds like the “th” in “the”. Each phoneme has a distinct frequency pattern, which could then be used to identify the sound being spoken. Using this approach, researchers were able to build systems that could recognize vowels and very simple, regular words, especially single-digit numbers. None of these systems could recognize continuous speech.
In the late ’60s, a new method of analyzing and processing speech called Linear Predictive Coding came about. Speaking from my layperson point of view, Linear Predictive Coding is essentially a computer simulation of how humans create speech: the vibration of the vocal cords at a particular pitch, the noise introduced by friction from the mouth, tongue, teeth and lips, and the resonant tube formed by the mouth and throat. By running Linear Predictive Coding in one direction, you can analyze and break speech down into its components. By running it in the other direction, you can actually produce speech, such as Siri telling you where the nearest Chinese restaurant is.
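To make the “analysis” direction a little more concrete, here’s a toy sketch of the core math: linear prediction models each audio sample as a weighted combination of the samples right before it, and the classic way to find those weights is the autocorrelation method with Levinson-Durbin recursion. This is my simplified illustration, not production speech code, and the signal, sample rate, and function names are all invented for the example.

```python
# Toy linear predictive coding sketch: find coefficients that predict each
# sample from the `order` samples before it (autocorrelation method with
# Levinson-Durbin recursion). Purely illustrative, not a real speech codec.
import math

def autocorrelation(signal, max_lag):
    """Correlation of the signal with delayed copies of itself."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def lpc(signal, order):
    """Return `order` linear-prediction coefficients via Levinson-Durbin."""
    r = autocorrelation(signal, order)
    a = [0.0] * order      # predictor coefficients, built up stage by stage
    error = r[0]           # prediction error energy so far
    for i in range(order):
        # Reflection coefficient for this stage.
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / error
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        error *= (1 - k * k)
    return a

# A pure 440 Hz sine (sampled at an invented 8 kHz rate) is highly
# predictable, so a 2-coefficient model captures it almost perfectly.
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
coeffs = lpc(signal, order=2)
```

Running the recursion the other way, using the coefficients to generate new samples instead of checking predictions, is the “produce speech” direction the paragraph above describes.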
After Linear Predictive Coding came on the scene, speech recognition made much greater strides. With funding from the Advanced Research Projects Agency, Carnegie Mellon University developed a system that parsed speech and compared the parsed speech against a database of known words. Other projects attempted to use context to determine which, among a number of possible interpretations of the parsed speech, was the most likely correct one. These programs did not have the level of success that Carnegie Mellon’s project did, but they still influence the field to this day.
In the ’70s, IBM developed a statistical program that, using a large database of known words, would learn from repeated exposure which words were likely to follow the words it had just seen. This type of model, called an n-gram model, was particularly useful for transcription programs in offices. AT&T also developed speech recognition methods for voice-controlled phone menus (think of calling a customer service line and speaking the menu option rather than pressing a button).
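The n-gram idea is simple enough to sketch in a few lines. Here’s a toy bigram (n-gram with n = 2) model: count which word follows which in some training text, then predict the most common follower. The training “corpus” below is made up for illustration; this isn’t IBM’s actual system, just the counting idea behind it.

```python
# Toy bigram model: count (word, next_word) pairs in a training corpus,
# then predict the most likely next word. Corpus invented for illustration.
from collections import Counter, defaultdict

corpus = ("please transcribe this memo please transcribe this letter "
          "please file this memo").split()

# Tally how often each word follows each other word.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def most_likely_next(word):
    """Return the most frequent follower of `word` and its probability."""
    counts = follows[word]
    total = sum(counts.values())
    best, n = counts.most_common(1)[0]
    return best, n / total

# In this corpus, "this" is followed by "memo" twice and "letter" once,
# so the model predicts "memo" with probability 2/3.
word, prob = most_likely_next("this")
```

For a transcription program, a table like this lets the recognizer break ties: when the audio is ambiguous, prefer the word that usually follows what was just said.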
Hidden Markov Model
In the 1980s, there was a sea change in the approach to speech recognition because of a new model for analyzing systems with unknown (hidden) data. In these systems, the process moves from one state to another, and each state depends on the states that came before it. Think of frames in a continuous shot from a movie: each frame is an increment that depends on the frame before it. People don’t teleport across rooms; they have to walk across them. Speech can be considered a similar system: each frequency depends on the previous ones. In the case of speech, the “hidden” data is which phoneme produced each observed frequency.
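The “each state depends on the one before it” property has a nice consequence: the probability of a whole sequence of states is just the product of one-step transition probabilities. Here’s a minimal sketch of that idea, with states and numbers invented purely for illustration.

```python
# Markov chain sketch: the probability of a state sequence is the product
# of one-step transition probabilities, because each state depends only on
# the state immediately before it. All probabilities here are invented.
transitions = {
    ("f", "a"): 0.6, ("f", "i"): 0.4,
    ("a", "r"): 0.7, ("a", "t"): 0.3,
    ("r", "t"): 0.9, ("r", "d"): 0.1,
}

def sequence_probability(states, start_prob=1.0):
    """Multiply one-step transition probabilities along the chain."""
    prob = start_prob
    for pair in zip(states, states[1:]):
        prob *= transitions.get(pair, 0.0)  # unseen transitions get 0
    return prob

# P(f -> a -> r -> t) = 0.6 * 0.7 * 0.9 = 0.378
p = sequence_probability(["f", "a", "r", "t"])
```

The “hidden” part of a hidden Markov model is that you never observe these states directly, only the frequencies they produce.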
The hidden Markov model is particularly useful for analyzing these types of systems. Instead of doing a one-shot comparison between the frequencies and the phonemes, it looks at a series of frequencies and determines the most likely word that the series represents. Let’s work through a simple word that illustrates the process, if I understand it correctly: fart. The first frequency the computer will see is the “f” sound. However, that phoneme has several different letter combinations associated with it: f, ph, gh (like in “rough”). It’s not likely to be “gh”, though, because it’s at the start of the word. So that leaves “f” and “ph”. The next frequency looks like “a”, so it notes that. The next frequency is clearly associated with “r”, so it notes that. And the last frequency looks like “t”, but could also be a “d”. Between “phart”, “fart”, “fard” and “phard”, the probability is highest that the word is “fart”, so the program goes with that word.
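The standard algorithm for this “pick the most probable letter sequence” step is called Viterbi decoding, and the word example above can be sketched with it directly. Every probability below is invented to match the story (the “f” sound could come from “f” or “ph”, the final sound is probably “t” but maybe “d”); real systems learn these numbers from data.

```python
# Toy Viterbi decoder for the word example above: given a series of
# observed "frequencies", find the most likely hidden letter sequence.
# All states and probabilities are invented for illustration.
states = ["f", "ph", "a", "r", "t", "d"]

start = {"f": 0.5, "ph": 0.3, "a": 0.1, "t": 0.05, "d": 0.05}
trans = {                      # which letter tends to follow which
    ("f", "a"): 1.0, ("ph", "a"): 1.0,
    ("a", "r"): 1.0,
    ("r", "t"): 0.7, ("r", "d"): 0.3,
}
emit = {                       # how likely each letter makes each sound
    ("f", "F_SOUND"): 0.9, ("ph", "F_SOUND"): 0.9,
    ("a", "A_SOUND"): 1.0, ("r", "R_SOUND"): 1.0,
    ("t", "T_SOUND"): 0.8, ("d", "T_SOUND"): 0.2,
}

def viterbi(observations):
    """Return the most probable hidden letter sequence for the sounds."""
    # paths maps each state to (probability, best path ending in it)
    paths = {s: (start.get(s, 0.0) * emit.get((s, observations[0]), 0.0), [s])
             for s in states}
    for obs in observations[1:]:
        new_paths = {}
        for s in states:
            best_prev, best_prob = None, 0.0
            for prev in states:
                p = paths[prev][0] * trans.get((prev, s), 0.0)
                if p > best_prob:
                    best_prev, best_prob = prev, p
            new_paths[s] = (best_prob * emit.get((s, obs), 0.0),
                            paths[best_prev][1] + [s] if best_prev else [s])
        paths = new_paths
    return max(paths.values())[1]

# "fart" wins over "phart", "fard" and "phard" under these numbers.
decoded = viterbi(["F_SOUND", "A_SOUND", "R_SOUND", "T_SOUND"])
```

The key trick is that the decoder keeps only the best path into each state at each step, so it never has to enumerate all possible spellings explicitly.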
Around the same time that the hidden Markov model was gaining popularity, neural networks started gaining attention for speech recognition. The basic idea behind neural networks is that they try to imitate the way a human brain works. Individual nodes take in information, apply weights to the different inputs, and pass the result along to the next node. The more complex the system, the more nodes and processing you have.
For speech recognition, these models do not work from pre-known statistics like the hidden Markov model does. Instead, the neural network learns through training, much like a human does. The problem is that early neural networks were too limited; they worked great for phonemes and short words, but not so well on longer sentences.
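Here’s what a single one of those nodes boils down to, in a tiny sketch: weight the inputs, sum them, and squash the result through an activation function before handing it to the next node. The weights below are invented; in a trained network they would be learned from examples.

```python
# Minimal neural-network node sketch: weighted sum of inputs, plus a bias,
# squashed through a sigmoid activation. All weights are invented.
import math

def node(inputs, weights, bias):
    """One artificial neuron: weighted sum pushed through a sigmoid."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid squashes to (0, 1)

# Two hidden nodes feed one output node, forming a tiny two-layer network.
features = [0.8, 0.2]                  # e.g. two frequency measurements
h1 = node(features, [1.5, -2.0], 0.1)
h2 = node(features, [-0.5, 1.0], 0.0)
output = node([h1, h2], [2.0, -1.0], -0.5)
```

Training consists of nudging those weights until the output node fires for the sound you want it to recognize; stack enough of these layers and you get the “more nodes and processing” the paragraph above mentions.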
We’re pretty close to living in a world where computers understand our spoken words, and they’re only getting better at it. Personally, I can’t wait, because it is much easier to write by dictation than by typing.