Spoken Language Processing: A Guide to Theory, Algorithm and System Development

by Xuedong Huang; Alex Acero; Hsiao-Wuen Hon
  • ISBN13: 9780130226167
  • ISBN10: 0130226165
  • Edition: 1st
  • Format: Paperback
  • Copyright: 2001-04-25
  • Publisher: Pearson
List Price: $94.00

Summary

Preface

Our primary motivation in writing this book is to share our working experience to bridge the gap between the knowledge of industry gurus and newcomers to the spoken language processing community. Many powerful techniques hide in conference proceedings and academic papers for years before becoming widely recognized by the research community or the industry. We spent many years pursuing spoken language technology research at Carnegie Mellon University before we started spoken language R&D at Microsoft. We fully understand that it is by no means a small undertaking to transfer a state-of-the-art spoken language research system into a commercially viable product that can truly help people improve their productivity. Our experience in both industry and academia is reflected in the content of this book, which presents a contemporary and comprehensive description of both theoretical and practical issues in spoken language processing. This book is intended for people of diverse academic and practical backgrounds. Speech scientists, computer scientists, linguists, engineers, physicists, and psychologists all have a unique perspective on spoken language processing. This book will be useful to all of these special interest groups.

Spoken language processing is a diverse subject that relies on knowledge at many levels, including acoustics, phonology, phonetics, linguistics, semantics, pragmatics, and discourse. The diverse nature of spoken language processing requires knowledge in computer science, electrical engineering, mathematics, syntax, and psychology. There are a number of excellent books on the subfields of spoken language processing, including speech recognition, text-to-speech conversion, and spoken language understanding, but there is no single book that covers both theoretical and practical aspects of these subfields and spoken language interface design. We devote many chapters to systematically introducing the fundamental theories needed to understand how speech recognition, text-to-speech synthesis, and spoken language understanding work. Even more important is the fact that the book highlights what works well in practice, which is invaluable if you want to build a practical speech recognizer, a practical text-to-speech synthesizer, or a practical spoken language system. Using numerous real examples from the development of Microsoft's spoken language systems, we concentrate on showing how the fundamental theories can be applied to solve real problems in spoken language processing.

Author Biography

Alex Acero is a Senior Researcher at Microsoft Research and a Senior Member of the IEEE.

Table of Contents

Foreword xxi
Preface xxv
Introduction
1(18)
Motivations
2(2)
Spoken Language Interface
2(1)
Speech-to-Speech Translation
3(1)
Knowledge Partners
3(1)
Spoken Language System Architecture
4(4)
Automatic Speech Recognition
4(2)
Text-to-Speech Conversion
6(1)
Spoken Language Understanding
7(1)
Book Organization
8(2)
Part I: Fundamental Theory
9(1)
Part II: Speech Processing
9(1)
Part III: Speech Recognition
9(1)
Part IV: Text-to-Speech Systems
10(1)
Part V: Spoken Language Systems
10(1)
Target Audiences
10(1)
Historical Perspective and Further Reading
11(8)
PART I: FUNDAMENTAL THEORY
Spoken Language Structure
19(54)
Sound and Human Speech Systems
21(15)
Sound
21(3)
Speech Production
24(5)
Speech Perception
29(7)
Phonetics and Phonology
36(15)
Phonemes
36(11)
The Allophone: Sound and Context
47(2)
Speech Rate and Coarticulation
49(2)
Syllables and Words
51(7)
Syllables
51(2)
Words
53(5)
Syntax and Semantics
58(10)
Syntactic Constituents
58(5)
Semantic Roles
63(1)
Lexical Semantics
64(3)
Logical Form
67(1)
Historical Perspective and Further Reading
68(5)
Probability, Statistics, and Information Theory
73(60)
Probability Theory
74(24)
Conditional Probability and Bayes' Rule
75(2)
Random Variables
77(2)
Mean and Variance
79(3)
Covariance and Correlation
82(1)
Random Vectors and Multivariate Distributions
83(2)
Some Useful Distributions
85(7)
Gaussian Distributions
92(6)
Estimation Theory
98(15)
Minimum/Least Mean Squared Error Estimation
99(5)
Maximum Likelihood Estimation
104(3)
Bayesian Estimation and MAP Estimation
107(6)
Significance Testing
113(7)
Level of Significance
114(1)
Normal Test (Z-Test)
115(1)
χ² Goodness-of-Fit Test
116(2)
Matched-Pairs Test
118(2)
Information Theory
120(8)
Entropy
120(3)
Conditional Entropy
123(1)
The Source Coding Theorem
124(2)
Mutual Information and Channel Coding
126(2)
Historical Perspective and Further Reading
128(5)
Pattern Recognition
133(68)
Bayes' Decision Theory
134(6)
Minimum-Error-Rate Decision Rules
135(3)
Discriminant Functions
138(2)
How to Construct Classifiers
140(10)
Gaussian Classifiers
142(2)
The Curse of Dimensionality
144(2)
Estimating the Error Rate
146(2)
Comparing Classifiers
148(2)
Discriminative Training
150(13)
Maximum Mutual Information Estimation
150(6)
Minimum-Error-Rate Estimation
156(2)
Neural Networks
158(5)
Unsupervised Estimation Methods
163(12)
Vector Quantization
163(7)
The EM Algorithm
170(2)
Multivariate Gaussian Mixture Density Estimation
172(3)
Classification and Regression Trees
175(15)
Choice of Question Set
177(1)
Splitting Criteria
178(3)
Growing the Tree
181(1)
Missing Values and Conflict Resolution
182(1)
Complex Questions
182(2)
The Right-Sized Tree
184(6)
Historical Perspective and Further Reading
190(11)
PART II: SPEECH PROCESSING
Digital Signal Processing
201(74)
Digital Signals and Systems
202(6)
Sinusoidal Signals
203(3)
Other Digital Signals
206(1)
Digital Systems
206(2)
Continuous-Frequency Transforms
208(8)
The Fourier Transform
208(3)
Z-Transform
211(1)
Z-Transforms of Elementary Functions
212(3)
Properties of the Z- and Fourier Transforms
215(1)
Discrete-Frequency Transforms
216(13)
The Discrete Fourier Transform (DFT)
218(1)
Fourier Transforms of Periodic Signals
219(3)
The Fast Fourier Transform (FFT)
222(5)
Circular Convolution
227(1)
The Discrete Cosine Transform (DCT)
228(1)
Digital Filters and Windows
229(13)
The Ideal Low-Pass Filter
229(1)
Window Functions
230(2)
FIR Filters
232(6)
IIR Filters
238(4)
Digital Processing of Analog Signals
242(6)
Fourier Transform of Analog Signals
243(1)
The Sampling Theorem
243(2)
Analog-to-Digital Conversion
245(1)
Digital-to-Analog Conversion
246(2)
Multirate Signal Processing
248(3)
Decimation
248(1)
Interpolation
249(1)
Resampling
250(1)
Filterbanks
251(9)
Two-Band Conjugate Quadrature Filters
251(3)
Multiresolution Filterbanks
254(1)
The DFT as a Filterbank
255(3)
Modulated Lapped Transforms
258(2)
Stochastic Processes
260(10)
Statistics of Stochastic Processes
261(3)
Stationary Processes
264(3)
LTI Systems with Stochastic Inputs
267(1)
Power Spectral Density
268(1)
Noise
269(1)
Historical Perspective and Further Reading
270(5)
Speech Signal Representations
275(62)
Short-Time Fourier Analysis
276(7)
Spectrograms
281(2)
Pitch-Synchronous Analysis
283(1)
Acoustical Model of Speech Production
283(7)
Glottal Excitation
284(1)
Lossless Tube Concatenation
284(4)
Source-Filter Models of Speech Production
288(2)
Linear Predictive Coding
290(16)
The Orthogonality Principle
291(1)
Solution of the LPC Equations
292(8)
Spectral Analysis via LPC
300(1)
The Prediction Error
301(2)
Equivalent Representations
303(3)
Cepstral Processing
306(9)
The Real and Complex Cepstrum
307(1)
Cepstrum of Pole-Zero Filters
308(3)
Cepstrum of Periodic Signals
311(1)
Cepstrum of Speech Signals
312(2)
Source-Filter Separation via the Cepstrum
314(1)
Perceptually Motivated Representations
315(4)
The Bilinear Transform
315(1)
Mel-Frequency Cepstrum
316(2)
Perceptual Linear Prediction (PLP)
318(1)
Formant Frequencies
319(5)
Statistical Formant Tracking
320(4)
The Role of Pitch
324(8)
Autocorrelation Method
324(3)
Normalized Cross-Correlation Method
327(2)
Signal Conditioning
329(1)
Pitch Tracking
330(2)
Historical Perspective and Further Reading
332(5)
Speech Coding
337(40)
Speech Coder Attributes
338(2)
Scalar Waveform Coders
340(8)
Linear Pulse Code Modulation (PCM)
340(2)
μ-law and A-law PCM
342(2)
Adaptive PCM
344(1)
Differential Quantization
345(3)
Scalar Frequency Domain Coders
348(5)
Benefits of Masking
349(1)
Transform Coders
350(1)
Consumer Audio
351(1)
Digital Audio Broadcasting (DAB)
352(1)
Code Excited Linear Prediction (CELP)
353(8)
LPC Vocoder
353(1)
Analysis by Synthesis
353(3)
Pitch Prediction: Adaptive Codebook
356(1)
Perceptual Weighting and Postfiltering
357(1)
Parameter Quantization
358(1)
CELP Standards
359(2)
Low-Bit Rate Speech Coders
361(10)
Mixed-Excitation LPC Vocoder
362(1)
Harmonic Coding
363(4)
Waveform Interpolation
367(4)
Historical Perspective and Further Reading
371(6)
PART III: SPEECH RECOGNITION
Hidden Markov Models
377(38)
The Markov Chain
378(2)
Definition of the Hidden Markov Model
380(14)
Dynamic Programming and DTW
383(2)
How to Evaluate an HMM---The Forward Algorithm
385(2)
How to Decode an HMM---The Viterbi Algorithm
387(2)
How to Estimate HMM Parameters---Baum-Welch Algorithm
389(5)
Continuous and Semicontinuous HMMs
394(4)
Continuous Mixture Density HMMs
394(2)
Semicontinuous HMMs
396(2)
Practical Issues in Using HMMs
398(7)
Initial Estimates
398(1)
Model Topology
399(2)
Training Criteria
401(1)
Deleted Interpolation
401(2)
Parameter Smoothing
403(1)
Probability Representations
404(1)
HMM Limitations
405(4)
Duration Modeling
406(2)
First-Order Assumption
408(1)
Conditional Independence Assumption
409(1)
Historical Perspective and Further Reading
409(6)
Acoustic Modeling
415(62)
Variability in the Speech Signal
416(3)
Context Variability
417(1)
Style Variability
418(1)
Speaker Variability
418(1)
Environment Variability
419(1)
How to Measure Speech Recognition Errors
419(2)
Signal Processing---Extracting Features
421(7)
Signal Acquisition
422(1)
End-Point Detection
422(2)
MFCC and Its Dynamic Features
424(2)
Feature Transformation
426(2)
Phonetic Modeling---Selecting Appropriate Units
428(11)
Comparison of Different Units
429(1)
Context Dependency
430(2)
Clustered Acoustic-Phonetic Units
432(4)
Lexical Baseforms
436(3)
Acoustic Modeling---Scoring Acoustic Features
439(5)
Choice of HMM Output Distributions
439(2)
Isolated vs. Continuous Speech Training
441(3)
Adaptive Techniques---Minimizing Mismatches
444(9)
Maximum a Posteriori (MAP)
445(2)
Maximum Likelihood Linear Regression (MLLR)
447(3)
MLLR and MAP Comparison
450(2)
Clustered Models
452(1)
Confidence Measures: Measuring the Reliability
453(4)
Filler Models
453(1)
Transformation Models
454(2)
Combination Models
456(1)
Other Techniques
457(7)
Neural Networks
457(2)
Segment Models
459(5)
Case Study: Whisper
464(1)
Historical Perspective and Further Reading
465(12)
Environmental Robustness
477(68)
The Acoustical Environment
478(8)
Additive Noise
478(2)
Reverberation
480(2)
A Model of the Environment
482(4)
Acoustical Transducers
486(11)
The Condenser Microphone
486(3)
Directionality Patterns
489(7)
Other Transduction Categories
496(1)
Adaptive Echo Cancellation (AEC)
497(7)
The LMS Algorithm
499(1)
Convergence Properties of the LMS Algorithm
500(1)
Normalized LMS Algorithm
501(1)
Transform-Domain LMS Algorithm
502(1)
The RLS Algorithm
503(1)
Multimicrophone Speech Enhancement
504(11)
Microphone Arrays
505(5)
Blind Source Separation
510(5)
Environment Compensation Preprocessing
515(13)
Spectral Subtraction
516(3)
Frequency-Domain MMSE from Stereo Data
519(1)
Wiener Filtering
520(2)
Cepstral Mean Normalization (CMN)
522(3)
Real-Time Cepstral Normalization
525(1)
The Use of Gaussian Mixture Models
525(3)
Environmental Model Adaptation
528(10)
Retraining on Corrupted Speech
528(2)
Model Adaptation
530(1)
Parallel Model Combination
531(4)
Vector Taylor Series
535(2)
Retraining on Compensated Features
537(1)
Modeling Nonstationary Noise
538(2)
Historical Perspective and Further Reading
540(5)
Language Modeling
545(46)
Formal Language Theory
546(8)
Chomsky Hierarchy
547(2)
Chart Parsing for Context-Free Grammars
549(5)
Stochastic Language Models
554(6)
Probabilistic Context-Free Grammars
554(4)
N-gram Language Models
558(2)
Complexity Measure of Language Models
560(2)
N-Gram Smoothing
562(12)
Deleted Interpolation Smoothing
564(1)
Backoff Smoothing
565(5)
Class N-grams
570(3)
Performance of N-gram Smoothing
573(1)
Adaptive Language Models
574(4)
Cache Language Models
574(1)
Topic-Adaptive Models
575(1)
Maximum Entropy Models
576(2)
Practical Issues
578(6)
Vocabulary Selection
578(2)
N-gram Pruning
580(1)
CFG vs. N-gram Models
581(3)
Historical Perspective and Further Reading
584(7)
Basic Search Algorithms
591(54)
Basic Search Algorithms
592(16)
General Graph Searching Procedures
593(4)
Blind Graph Search Algorithms
597(4)
Heuristic Graph Search
601(7)
Search Algorithms for Speech Recognition
608(5)
Decoder Basics
609(1)
Combining Acoustic and Language Models
610(1)
Isolated Word Recognition
610(1)
Continuous Speech Recognition
611(2)
Language Model States
613(9)
Search Space with FSM and CFG
613(3)
Search Space with the Unigram
616(1)
Search Space with Bigrams
617(2)
Search Space with Trigrams
619(2)
How to Handle Silences Between Words
621(1)
Time-Synchronous Viterbi Beam Search
622(4)
The Use of Beam
624(1)
Viterbi Beam Search
625(1)
Stack Decoding (A* Search)
626(14)
Admissible Heuristics for Remaining Path
630(1)
When to Extend New Words
631(3)
Fast Match
634(4)
Stack Pruning
638(1)
Multistack Search
639(1)
Historical Perspective and Further Reading
640(5)
Large-Vocabulary Search Algorithms
645(44)
Efficient Manipulation of a Tree Lexicon
646(13)
Lexical Tree
646(2)
Multiple Copies of Pronunciation Trees
648(2)
Factored Language Probabilities
650(3)
Optimization of Lexical Trees
653(3)
Exploiting Subtree Polymorphism
656(2)
Context-Dependent Units and Inter-word Triphones
658(1)
Other Efficient Search Techniques
659(4)
Using Entire HMM as a State in Search
659(1)
Different Layers of Beams
660(1)
Fast Match
661(2)
N-Best and Multipass Search Strategies
663(11)
N-best Lists and Word Lattices
664(2)
The Exact N-best Algorithm
666(1)
Word-Dependent N-best and Word-Lattice Algorithm
667(3)
The Forward-Backward Search Algorithm
670(3)
One-Pass vs. Multipass Search
673(1)
Search-Algorithm Evaluation
674(2)
Case Study---Microsoft Whisper
676(5)
The CFG Search Architecture
676(1)
The N-gram Search Architecture
677(4)
Historical Perspective and Further Reading
681(8)
PART IV: TEXT-TO-SPEECH SYSTEMS
Text and Phonetic Analysis
689(50)
Modules and Data Flow
690(7)
Modules
692(2)
Data Flows
694(2)
Localization Issues
696(1)
Lexicon
697(2)
Document Structure Detection
699(7)
Chapter and Section Headers
700(1)
Lists
701(1)
Paragraphs
702(1)
Sentences
702(2)
Email
704(1)
Web Pages
705(1)
Dialog Turns and Speech Acts
705(1)
Text Normalization
706(14)
Abbreviations and Acronyms
709(3)
Number Formats
712(6)
Domain-Specific Tags
718(1)
Miscellaneous Formats
719(1)
Linguistic Analysis
720(4)
Homograph Disambiguation
724(1)
Morphological Analysis
725(3)
Letter-to-Sound Conversion
728(2)
Evaluation
730(2)
Case Study: Festival
732(3)
Lexicon
733(1)
Text Analysis
733(2)
Phonetic Analysis
735(1)
Historical Perspective and Further Reading
735(4)
Prosody
739(54)
The Role of Understanding
740(3)
Prosody Generation Schematic
743(1)
Speaking Style
744(1)
Character
744(1)
Emotion
744(1)
Symbolic Prosody
745(16)
Pauses
747(2)
Prosodic Phrases
749(2)
Accent
751(2)
Tone
753(4)
Tune
757(2)
Prosodic Transcription Systems
759(2)
Duration Assignment
761(2)
Rule-Based Methods
762(1)
CART-Based Durations
763(1)
Pitch Generation
763(20)
Attributes of Pitch Contours
764(4)
Baseline F0 Contour Generation
768(6)
Parametric F0 Generation
774(4)
Corpus-Based F0 Generation
778(5)
Prosody Markup Languages
783(1)
Prosody Evaluation
784(1)
Historical Perspective and Further Reading
785(8)
Speech Synthesis
793(60)
Attributes of Speech Synthesis
794(2)
Formant Speech Synthesis
796(8)
Waveform Generation from Formant Values
797(3)
Formant Generation by Rule
800(3)
Data-Driven Formant Generation
803(1)
Articulatory Synthesis
803(1)
Concatenative Speech Synthesis
804(14)
Choice of Unit
805(5)
Optimal Unit String: The Decoding Process
810(7)
Unit Inventory Design
817(1)
Prosodic Modification of Speech
818(13)
Synchronous Overlap and Add (SOLA)
818(2)
Pitch Synchronous Overlap and Add (PSOLA)
820(2)
Spectral Behavior of PSOLA
822(1)
Synthesis Epoch Calculation
823(2)
Pitch-Scale Modification Epoch Calculation
825(1)
Time-Scale Modification Epoch Calculation
826(1)
Pitch-Scale Time-Scale Epoch Calculation
827(1)
Waveform Mapping
827(1)
Epoch Detection
828(1)
Problems with PSOLA
829(2)
Source-Filter Models for Prosody Modification
831(3)
Prosody Modification of the LPC Residual
832(1)
Mixed Excitation Models
832(2)
Voice Effects
834(1)
Evaluation of TTS Systems
834(10)
Intelligibility Tests
837(3)
Overall Quality Tests
840(2)
Preference Tests
842(1)
Functional Tests
842(1)
Automated Tests
843(1)
Historical Perspective and Further Reading
844(9)
PART V: SPOKEN LANGUAGE SYSTEMS
Spoken Language Understanding
853(66)
Written vs. Spoken Languages
855(4)
Style
856(1)
Disfluency
857(1)
Communicative Prosody
858(1)
Dialog Structure
859(8)
Units of Dialog
860(1)
Dialog (Speech) Acts
861(5)
Dialog Control
866(1)
Semantic Representation
867(6)
Semantic Frames
867(5)
Conceptual Graphs
872(1)
Sentence Interpretation
873(8)
Robust Parsing
874(4)
Statistical Pattern Matching
878(3)
Discourse Analysis
881(5)
Resolution of Relative Expression
882(3)
Automatic Inference and Inconsistency Detection
885(1)
Dialog Management
886(8)
Dialog Grammars
887(1)
Plan-Based Systems
888(4)
Dialog Behavior
892(2)
Response Generation and Rendition
894(7)
Response Content Generation
895(4)
Concept-to-Speech Rendition
899(2)
Other Renditions
901(1)
Evaluation
901(5)
Evaluation in the ATIS Task
901(2)
PARADISE Framework
903(3)
Case Study---Dr. Who
906(7)
Semantic Representation
906(2)
Semantic Parser (Sentence Interpretation)
908(1)
Discourse Analysis
909(1)
Dialog Manager
910(3)
Historical Perspective and Further Reading
913(6)
Applications and User Interfaces
919(38)
Application Architecture
920(1)
Typical Applications
921(10)
Computer Command and Control
921(3)
Telephony Applications
924(2)
Dictation
926(3)
Accessibility
929(1)
Handheld Devices
930(1)
Automobile Applications
930(1)
Speaker Recognition
931(1)
Speech Interface Design
931(12)
General Principles
931(6)
Handling Errors
937(4)
Other Considerations
941(1)
Dialog Flow
942(1)
Internationalization
943(2)
Case Study---MiPad
945(7)
Specifying the Application
946(2)
Rapid Prototyping
948(1)
Evaluation
949(2)
Iterations
951(1)
Historical Perspective and Further Reading
952(5)
Index 957

