

10/08/2004 09:56:44 AM

Automatic Speech Recognition – A Brief History of the Technology Development

B.H. Juang# & Lawrence R. Rabiner*
(# Georgia Institute of Technology, Atlanta; * Rutgers University and the University of California, Santa Barbara)

Abstract

Designing a machine that mimics human behavior, particularly the capability of speaking naturally and responding properly to spoken language, has intrigued engineers and scientists for centuries. Since the 1930s, when Homer Dudley of Bell Laboratories proposed a system model for speech analysis and synthesis [1, 2], the problem of automatic speech recognition has been approached progressively, from a simple machine that responds to a small set of sounds to a sophisticated system that responds to fluently spoken natural language and takes into account the varying statistics of the language in which the speech is produced. Based on major advances in statistical modeling of speech in the 1980s, automatic speech recognition systems today find widespread application in tasks that require a human-machine interface, such as automatic call processing in the telephone network and query-based information systems that do things like provide updated travel information, stock price quotations, weather reports, etc. In this article, we review some major highlights in the research and development of automatic speech recognition during the last few decades so as to provide a technological perspective and an appreciation of the fundamental progress that has been made in this important area of information and communication technology.

Keywords

Speech recognition, speech understanding, statistical modeling, spectral analysis, hidden Markov models, acoustic modeling, language modeling, finite state network, office automation, automatic transcription, keyword spotting, dialog systems, neural networks, pattern recognition, time normalization

1. Introduction

Speech is the primary means of communication between people. For reasons ranging from technological curiosity about the mechanisms for mechanical realization of human speech capabilities, to the desire to automate simple tasks inherently requiring human-machine interactions, research in automatic speech recognition (and speech synthesis) by machine has attracted a great deal of attention over the past five decades.

The desire for automation of simple tasks is not a modern phenomenon, but one that goes back more than one hundred years in history. By way of example, in 1881 Alexander Graham Bell, his cousin Chichester Bell and Charles Sumner Tainter invented a recording device that used a rotating cylinder with a wax coating on which up-and-down grooves could be cut by a stylus, which responded to incoming sound pressure (in much the same way as a microphone that Bell invented earlier for use with the telephone). Based on this invention, Bell and Tainter formed the Volta Graphophone Co. in 1888 in order to manufacture machines for the recording and reproduction of sound in office environments. The American Graphophone Co., which later became the Columbia Graphophone Co., acquired the patent in 1907 and trademarked the term "Dictaphone." Just about the same time, Thomas Edison invented the phonograph using a tinfoil based cylinder, which was subsequently adapted to wax, and developed the "Ediphone" to compete directly with Columbia. The purpose of these products was to record dictation of notes and letters for a secretary (likely in a large pool that offered the service as shown in Figure 1) who would later type them out (offline), thereby circumventing the need for costly stenographers.

This turn-of-the-century concept of "office mechanization" spawned a range of electric and electronic implements and improvements, including the electric typewriter, which changed the face of office automation in the mid-part of the twentieth century. It does not take much imagination to envision the obvious interest in creating an "automatic typewriter" that could directly respond to and transcribe a human's voice without having to deal with the annoyance of recording and handling the speech on wax cylinders or other recording media.

A similar kind of automation took place a century later in the 1990's in the area of "call centers." A call center is a concentration of agents or associates that handle telephone calls from customers requesting assistance. Among the tasks of such call centers is routing the incoming calls to the proper department, where specific help is provided or where transactions are carried out. One example of such a service was the AT&T Operator line which helped a caller place calls, arrange payment methods, and conduct credit card transactions. The number of agent positions (or stations) in a large call center could reach several thousand. Automatic speech recognition technologies provided the capability of automating these call handling functions, thereby reducing the large operating cost of a call center. By way of example, the AT&T Voice Recognition Call Processing (VRCP) service, which was introduced into the AT&T Network in 1992, routinely handles about 1.2 billion voice transactions with machines each year using automatic speech recognition technology to appropriately route and handle the calls [3].

Speech recognition technology has also been a topic of great interest to a broad general population since it became popularized in several blockbuster movies of the 1960's and 1970's, most notably Stanley Kubrick's acclaimed movie "2001: A Space Odyssey". In this movie, an intelligent computer named "HAL" spoke in a natural sounding voice and was able to recognize and understand fluently spoken speech, and respond accordingly. This anthropomorphism of HAL made the general public aware of the potential of intelligent machines. In the famous Star Wars saga, George Lucas extended the abilities of intelligent machines by making them mobile as well as intelligent, and the droids like R2D2 and C3PO were able to speak naturally, recognize and understand fluent speech, and move around and interact with their environment, with other droids, and with the human population at large. More recently (in 1988), in the technology community, Apple Computer created a vision of speech technology and computers for the year 2011, titled "Knowledge Navigator", which defined the concepts of a Speech User Interface (SUI) and a Multimodal User Interface (MUI) along with the theme of intelligent voice-enabled agents. This video had a dramatic effect in the technical community and focused technology efforts, especially in the area of visual talking agents.

Figure 1 An early 20th century transcribing pool at Sears, Roebuck and Co. The women are using cylinder dictation machines, and listening to the recordings with ear-tubes (David Morton, the history of Sound Recording History, )

Today speech technologies are commercially available for a limited but interesting range of tasks. These technologies enable machines to respond correctly and reliably to human voices, and provide useful and valuable services. While we are still far from having a machine that converses with humans on any topic like another human, many important scientific and technological advances have taken place, bringing us closer to the "Holy Grail" of machines that recognize and understand fluently spoken speech. This article attempts to provide a historic perspective on key inventions that have enabled progress in speech recognition and language understanding, briefly reviews several technology milestones, and enumerates some of the remaining challenges that lie ahead of us.

2. From Speech Production Models to Spectral Representations

Attempts to develop machines to mimic a human's speech communication capability appear to have started in the second half of the 18th century. The early interest was not in recognizing and understanding speech but instead in creating a speaking machine, perhaps due to the readily available knowledge of acoustic resonance tubes which were used to approximate the human vocal tract. In 1773, the Russian scientist Christian Kratzenstein, a professor of physiology in Copenhagen, succeeded in producing vowel sounds using resonance tubes connected to organ pipes [4]. Later, Wolfgang von Kempelen in Vienna constructed an "Acoustic-Mechanical Speech Machine" (1791) [5] and in the mid-1800's Charles Wheatstone [6] built a version of von Kempelen's speaking machine using resonators made of leather, the configuration of which could be altered or controlled with a hand to produce different speech-like sounds, as shown in Figure 2.

Figure 2 Wheatstone's version of von Kempelen's speaking machine (Flanagan [7]).

During the first half of the 20th century, work by Fletcher [8] and others at Bell Laboratories documented the relationship between a given speech spectrum (which is the distribution of power of a speech sound across frequency), and its sound characteristics as well as its intelligibility, as perceived by a human listener. In the 1930's Homer Dudley, influenced greatly by Fletcher's research, developed a speech synthesizer called the VODER (Voice Operating Demonstrator) [2], which was an electrical equivalent (with mechanical control) of Wheatstone's mechanical speaking machine. Figure 3 shows a block diagram of Dudley's VODER which consisted of a wrist bar for selecting either a relaxation oscillator output or noise as the driving signal, and a foot pedal to control the oscillator frequency (the pitch of the synthesized voice). The driving signal was passed through ten bandpass filters whose output levels were controlled by the operator's fingers. These ten bandpass filters were used to alter the power distribution of the source signal across a frequency range, thereby determining the characteristics of the speech-like sound at the loudspeaker. Thus to synthesize a sentence, the VODER operator had to learn how to control and "play" the VODER so that the appropriate sounds of the sentence were produced. The VODER was demonstrated at the World's Fair in New York City in 1939 (shown in Figure 4) and was considered an important milestone in the evolution of speaking machines.

Figure 3 A block schematic of Homer Dudley's VODER [2].

Speech pioneers like Harvey Fletcher and Homer Dudley firmly established the importance of the signal spectrum for reliable identification of the phonetic nature of a speech sound. Following the convention established by these two outstanding scientists, most modern systems and algorithms for speech recognition are based on the concept of measurement of the (time-varying) speech power spectrum (or its variants such as the cepstrum), in part due to the fact that measurement of the power spectrum from a signal is relatively easy to accomplish with modern digital signal processing techniques.

3. Early Automatic Speech Recognizers

Early attempts to design systems for automatic speech recognition were mostly guided by the theory of acoustic-phonetics, which describes the phonetic elements of speech (the basic sounds of the language) and tries to explain how they are acoustically realized in a spoken utterance. These elements include the phonemes and the corresponding place and manner of articulation used to produce the sound in various phonetic contexts. For example, in order to produce a steady vowel sound, the vocal cords need to vibrate (to excite the vocal tract), and the air that propagates through the vocal tract results in sound with natural modes of resonance similar to what occurs in an acoustic tube. These natural modes of resonance, called the formants or formant frequencies, are manifested as major regions of energy concentration in the speech power spectrum.

In 1952, Davis, Biddulph, and Balashek of Bell Laboratories built a system for isolated digit recognition for a single speaker [9], using the formant frequencies measured (or estimated) during vowel regions of each digit. Figure 5 shows a block diagram of the digit recognizer developed by Davis et al., and Figure 6 shows plots of the formant trajectories along the dimensions of the first and the second formant frequencies for each of the ten digits, "one" through "nine" and "oh". These trajectories served as the "reference pattern" for determining the identity of an unknown digit utterance as the best matching digit.

Figure 4 The VODER at the 1939 World's Fair in NYC.
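To make the notion of "regions of energy concentration in the power spectrum" concrete, the following is a minimal sketch, not a reconstruction of any system cited above. It computes the power spectrum of one short frame with a direct discrete Fourier transform and reports the strongest spectral peaks. The sample rate, frame length, and the two-sinusoid stand-in for a vowel are all assumptions made for illustration; real speech frames are usually windowed and analyzed with an FFT.

```python
import cmath
import math

def power_spectrum(frame):
    """Power |X[k]|^2 / N for bins k = 0..N/2, using a direct O(N^2) DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def strongest_peaks(spectrum, sample_rate, frame_len, count=2):
    """Frequencies (Hz) of the `count` largest local maxima, in ascending order."""
    peaks = [
        (spectrum[k], k * sample_rate / frame_len)
        for k in range(1, len(spectrum) - 1)
        if spectrum[k] > spectrum[k - 1] and spectrum[k] > spectrum[k + 1]
    ]
    peaks.sort(reverse=True)  # largest power first
    return sorted(freq for _, freq in peaks[:count])

# Assumed analysis settings (hypothetical, chosen so both tones fall on DFT bins).
SAMPLE_RATE = 8000  # Hz
FRAME_LEN = 160     # samples, i.e. 20 ms

# Toy stand-in for a vowel: two sinusoids at assumed resonance frequencies.
frame = [
    math.sin(2 * math.pi * 500 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 1500 * t / SAMPLE_RATE)
    for t in range(FRAME_LEN)
]

peaks = strongest_peaks(power_spectrum(frame), SAMPLE_RATE, FRAME_LEN)
print(peaks)
```

In this toy signal the two peaks play the role of the first and second formant frequencies; tracking such peaks frame by frame would give a trajectory of the kind used as a reference pattern.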

Forgie built a speaker-independent 10-vowel recognizer [11]. In the 1960's, several Japanese laboratories demonstrated their capability of building special purpose hardware to perform a speech recognition task. Most notable were the vowel recognizer of Suzuki and Nakata at the Radio Research Lab in Tokyo [12], the phoneme recognizer of Sakai and Doshita at Kyoto University [13], and the digit recognizer of NEC Laboratories [14]. The work of Sakai and Doshita involved the first use of a speech segmenter for analysis and recognition of speech in different portions of the input utterance. In contrast, an isolated digit recognizer implicitly assumed that the unknown utterance contained a complete digit (and no other speech sounds or words) and thus did not need an explicit "segmenter." Kyoto University's work could be considered a precursor to a continuous speech recognition system.

In another early recognition system Fry and Denes, at University College in England, built a phoneme recognizer to recognize 4 vowels and 9 consonants [15]. By incorporating statistical information about allowable phoneme sequences in English, they increased the overall phoneme recognition accuracy for words consisting of two or more phonemes. This work marked the first use of statistical syntax (at the phoneme level) in automatic speech recognition.

An alternative to the use of a speech segmenter was the concept of adopting a non-uniform time scale for aligning speech patterns. This concept started to gain acceptance in the 1960's through the work of Tom Martin at RCA Laboratories [16] and Vintsyuk in the Soviet Union [17]. Martin recognized the need to deal with the temporal non-uniformity in repeated speech events and suggested a range of solutions, including detection of utterance endpoints, which greatly enhanced the reliability of the recognizer performance [16].
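One common way to realize a non-uniform time scale is to align two patterns by dynamic programming. The sketch below illustrates the idea under simplifying assumptions: each frame is a single number rather than a spectral vector, the local distance is an absolute difference, and the allowed steps are the three basic ones (advance in either pattern or in both). It is a textbook-style illustration and does not reproduce the formulation of any cited paper.

```python
def dtw_distance(a, b):
    """Minimum accumulated distance between sequences a and b under time warping."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# Hypothetical one-dimensional "utterances".
template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]  # same shape, half speed
other = [3.0, 1.0, 3.0, 1.0, 3.0]                          # different shape

print(dtw_distance(template, slow))
print(dtw_distance(template, other))
```

The slowed-down repetition aligns with the template at zero cost even though it is twice as long, whereas the differently shaped pattern does not, which is the behavior a fixed, uniform time scale cannot provide.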
Vintsyuk proposed the use of dynamic programming for time alignment between two utterances in order to derive a meaningful assessment of their similarity [17]. His work, though largely unknown in the West, appears to have preceded that of Sakoe and Chiba [18] as well as others who proposed more formal methods, generally known as dynamic time warping, in speech pattern matching. Since the late 1970's, mainly due to the publication by Sakoe and Chiba, dynamic programming, in numerous variant forms (including the Viterbi algorithm [19] which came from the communication theory community), has become an indispensable technique in automatic speech recognition.

4. Technology Drivers since the 1970's

In the late 1960's, Atal and Itakura independently formulated the fundamental concepts of Linear Predictive Coding (LPC) [20, 21], which greatly simplified the estimation of the vocal tract response from speech waveforms. By the mid 1970's, the basic ideas of applying fundamental pattern recognition technology to speech recognition, based on LPC methods, were proposed by Itakura [22], Rabiner and Levinson [23] and others.

Also during this time period, based on his earlier success at aligning speech utterances, Tom Martin founded the first commercial speech recognition company, Threshold Technology, Inc., and developed the first real ASR product, the VIP-100 System. The system was only used in a few simple applications, such as by television faceplate manufacturing firms (for quality control) and by FedEx (for package sorting on a conveyor belt), but its main importance was the way it influenced the Advanced Research Projects Agency (ARPA) of the U.S. Department of Defense to fund the Speech Understanding Research (SUR) program during the early 1970's.

Among the systems built by the contractors of the ARPA program was Carnegie Mellon University's "Harpy" (Lowerre [24]) which was shown to be able to recognize speech using a vocabulary of 1,011 words, and with reasonable accuracy. One particular contribution from the Harpy system was the concept of doing a graph search, where the speech recognition language was represented as a connected network derived from lexical representations of words, with syntactical production rules and word boundary rules. In the proposed Harpy system, the input speech, after going through a parametric analysis, was segmented and the segmented parametric sequence of speech was then subjected to phone template matching using the Itakura distance [22]. The graph search, based on a beam search algorithm, compiled, hypothesized, pruned, and then verified the recognized sequence of words (or sounds) that satisfied the knowledge constraints with the highest matching score (smallest distance to the reference patterns).
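The idea of a beam search over a network can be sketched as follows. This is a toy illustration under stated assumptions, not the Harpy implementation: the network, its labels, and its template values are hypothetical; each arc consumes exactly one input segment; and an absolute difference stands in for the Itakura distance. At each step every surviving hypothesis is extended along the arcs leaving its state, and only the `beam_width` hypotheses with the smallest accumulated distance are kept.

```python
# Hypothetical network: state -> list of (label, template_value, next_state).
NETWORK = {
    "start": [("a", 1.0, "s1"), ("b", 4.0, "s2")],
    "s1": [("c", 2.0, "end"), ("d", 5.0, "end")],
    "s2": [("c", 2.0, "end")],
    "end": [],
}

def beam_search(network, segments, beam_width=2, start="start", final="end"):
    """Return (distance, labels) of the best path ending in `final`, or None."""
    beam = [(0.0, start, [])]  # (accumulated distance, state, labels so far)
    for value in segments:
        candidates = []
        for dist, state, labels in beam:
            for label, template, nxt in network[state]:
                # Assumed local distance between a segment and a template.
                candidates.append((dist + abs(value - template), nxt, labels + [label]))
        candidates.sort(key=lambda c: c[0])
        beam = candidates[:beam_width]  # prune everything outside the beam
    finished = [(dist, labels) for dist, state, labels in beam if state == final]
    return min(finished, key=lambda c: c[0]) if finished else None

segments = [1.2, 4.6]  # hypothetical measured segment values
best = beam_search(NETWORK, segments)
print(best)
```

Pruning keeps the amount of work per segment bounded by the beam width rather than by the number of paths in the network, at the price that the globally best path can be lost if it falls outside the beam early on.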
The Harpy system was perhaps the first to take advantage of a finite state network to reduce computation and efficiently determine the closest matching string. However, methods which optimized the resulting finite state network (FSN) (for performance as well as to eliminate redundancy) did not come about until the early 1990's [25] (see section 5).

Other systems developed under DARPA's SUR program included CMU's Hearsay(-II) and BBN's HWIM [26]. Neither Hearsay-II nor HWIM (Hear What I Mean) met the DARPA program's performance goal at its conclusion in 1976. However, the approach proposed by Hearsay-II of using parallel asynchronous processes that simulate the component knowledge sources in a speech system was a pioneering concept. The Hearsay-II system extended sound identity analysis (to higher level hypotheses) given the detection of a certain type of (lower level) information or evidence, which was provided to a global "blackboard" where knowledge from parallel sources was integrated to produce the next level of hypothesis. BBN's HWIM system, on the other hand, was known for its interesting ideas including a lexical decoding network incorporating sophisticated phonological rules (aimed at phoneme recognition accuracy), its handling of segmentation ambiguity by a lattice of alternative hypotheses, and the concept of word verification at the parametric level. Another system worth noting of the time was the DRAGON system by Jim Baker, who moved to Massachusetts to start a company with the same name in the early 1980s.

In parallel to the ARPA-initiated efforts, two broad directions in speech recognition research started to take shape in the 1970's, with IBM and AT&T Bell Laboratories essentially representing two different schools of thought as to the applicability of automatic speech recognition systems for commercial applications.

IBM's effort, led by Fred Jelinek, was aimed at creating a "voice-activated typewriter" (VAT), the main function of which was to convert a spoken sentence into a sequence of letters and words that could be shown on a display or typed on paper [27]. The recognition system, called Tangora, was essentially a speaker-dependent system (i.e., the typewriter had to be trained by each individual user). The technical focus was on the size of the recognition vocabulary (as large as possible, with a primary target being one used in office correspondence), and the structure of the language model (the grammar), which was represented by statistical syntactical rules that described how likely, in a probabilistic sense, a sequence of language symbols (e.g., phonemes or words) was to appear in the speech signal. This type of speech recognition task is generally referred to as transcription. The set of statistical grammatical or syntactical rules was called a language model, of which the n-gram model, which defined the probability of occurrence of an ordered sequence of n words, was the most frequently used variant.
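As a small illustration of what an n-gram model computes, the sketch below estimates a bigram (n = 2) model from word counts and uses it to score a sentence. The three-sentence corpus, the sentence boundary markers `<s>` and `</s>`, and the unsmoothed relative-frequency estimate are assumptions made for this example; practical systems train on far larger text collections and smooth the estimates so that unseen word pairs do not receive zero probability.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and word pairs, with <s> and </s> marking sentence boundaries."""
    contexts, pairs = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(words[:-1])
        pairs.update(zip(words[:-1], words[1:]))
    return contexts, pairs

def bigram_prob(contexts, pairs, prev, word):
    """Relative-frequency estimate of P(word | prev); 0.0 for an unseen context."""
    if contexts[prev] == 0:
        return 0.0
    return pairs[(prev, word)] / contexts[prev]

def sentence_prob(contexts, pairs, sentence):
    """Probability of a whole sentence as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        prob *= bigram_prob(contexts, pairs, prev, word)
    return prob

# Hypothetical training text.
corpus = ["call home", "call the office", "call home now"]
contexts, pairs = train_bigram(corpus)

print(bigram_prob(contexts, pairs, "call", "home"))
print(sentence_prob(contexts, pairs, "call home"))
print(sentence_prob(contexts, pairs, "home call"))
```

A recognizer can use such scores to prefer word sequences that are likely in the task language; here "call home" receives a nonzero probability while the unseen ordering "home call" receives zero.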
Although both the n-gram language model and a traditional grammar are manifestations of the rules of the language, their roles were fundamentally different. The n-gram model, which characterized the word relationship within a span of n words, was purely a convenient and powerful statistical representation of a grammar. Its effectiveness in guiding a word search for speech recognition, however, was strongly validated by the famous word game of Claude Shannon [28] which involved a competition between a human and a computer. In this competition both the computer and the human are asked to sequentially guess the next word in an arbitrary sentence. The human guesses based on native experience with language; the computer uses the accumulated word statistics to make its best guess based on maximum probability from the estimated word frequencies. It was shown that once the span of the words, n, exceeded 3, the computer was very likely to win (make better guesses as to the next word in the sequence) over the human player.

Since their introduction in the 1980's, n-gram language models and their variants have become indispensable in large vocabulary speech recognition systems.

At AT&T Bell Laboratories, the goal of the research program was to provide automated telecommunication services to the public, such as voice dialing, and command and control for routing of phone calls. These automated systems were expected to work well for a vast population (literally tens of millions) of talkers without the need for individual speaker training. The focus at Bell Laboratories was on the design of a speaker-independent system that could deal with the acoustic variability intrinsic in the speech signals coming from many different talkers, often with notably different regional accents. This led to the creation of a range of speech clustering algorithms for creating word and sound reference patterns (initially templates but ultimately statistical models) that could be used across a wide range of talkers and accents. Furthermore, research to understand and to control the acoustic variability of various speech representations across talkers led to the study of a range of spectral distance measures (e.g., the Itakura distance [22]) and statistical modeling techniques [30] that produced sufficiently rich representations of the utterances from a vast population. (As will be discussed in the next section, the technique of mixture density hidden Markov models [31, 32] has since become the prevalent representation of speech units for speaker independent continuous speech recognition.) Since applications such as voice dialing and call routing usually involved only short utterances of limited vocabulary, consisting of only a few words, the research at Bell Laboratories emphasized what is generally called the acoustic model (the spectral representation of sounds or words) over the language model (the representation of the grammar or syntax of the task).

Also, of great importance in the Bell Laboratories approach was the concept of keyword spotting as a primitive form of speech understanding [33]. The technique of keyword spotting aimed at detecting a keyword or a key-phrase of some particular significance that was embedded in a longer utterance where there was no semantic significance to the other words in the utterance. The need for such keyword spotting was to accommodate talkers who preferred to speak in natural sentences rather than using rigid command sequences when requesting services (i.e., as if they were speaking to a human operator). For example, a telephone caller requesting a credit card charge might speak the sentence "I'd like to charge it to my credit card" rather than just say "credit card". In a limited domain application, the presence of the key-phrase "credit card" in an otherwise naturally spoken sentence was generally sufficient to indicate the caller's intent to make a credit card call. The detected keyword or key-phrase would then trigger a prescribed action (or sequence of actions) as part of the service, in response to the talker's spoken utterance. The technique of keyword
