Spoken Language Processing: A Guide to Theory, Algorithm and System Development

by Xuedong Huang; Alex Acero; Hsiao-Wuen Hon
  • ISBN13: 9780130226167
  • ISBN10: 0130226165
  • Edition: 1st
  • Format: Paperback
  • Copyright: 2001-04-25
  • Publisher: Pearson
List Price: $94.00

Summary

Preface

Our primary motivation in writing this book is to share our working experience to bridge the gap between the knowledge of industry gurus and newcomers to the spoken language processing community. Many powerful techniques hide in conference proceedings and academic papers for years before becoming widely recognized by the research community or the industry. We spent many years pursuing spoken language technology research at Carnegie Mellon University before we started spoken language R&D at Microsoft. We fully understand that it is by no means a small undertaking to transfer a state-of-the-art spoken language research system into a commercially viable product that can truly help people improve their productivity. Our experience in both industry and academia is reflected in the content of this book, which presents a contemporary and comprehensive description of both theoretical and practical issues in spoken language processing. This book is intended for people of diverse academic and practical backgrounds. Speech scientists, computer scientists, linguists, engineers, physicists, and psychologists all have a unique perspective on spoken language processing. This book will be useful to all of these special interest groups.

Spoken language processing is a diverse subject that relies on knowledge at many levels, including acoustics, phonology, phonetics, linguistics, semantics, pragmatics, and discourse. The diverse nature of spoken language processing requires knowledge in computer science, electrical engineering, mathematics, syntax, and psychology. There are a number of excellent books on the subfields of spoken language processing, including speech recognition, text-to-speech conversion, and spoken language understanding, but there is no single book that covers both theoretical and practical aspects of these subfields and spoken language interface design. We devote many chapters to systematically introducing the fundamental theories needed to understand how speech recognition, text-to-speech synthesis, and spoken language understanding work. Even more important is the fact that the book highlights what works well in practice, which is invaluable if you want to build a practical speech recognizer, a practical text-to-speech synthesizer, or a practical spoken language system. Using numerous real examples from the development of Microsoft's spoken language systems, we concentrate on showing how the fundamental theories can be applied to solve real problems in spoken language processing.

Author Biography

Alex Acero is a Senior Researcher at Microsoft Research and a Senior Member of the IEEE.

Table of Contents

Foreword xxi
Preface xxv
Introduction
1(18)
Motivations
2(2)
Spoken Language Interface
2(1)
Speech-to-Speech Translation
3(1)
Knowledge Partners
3(1)
Spoken Language System Architecture
4(4)
Automatic Speech Recognition
4(2)
Text-to-Speech Conversion
6(1)
Spoken Language Understanding
7(1)
Book Organization
8(2)
Part I: Fundamental Theory
9(1)
Part II: Speech Processing
9(1)
Part III: Speech Recognition
9(1)
Part IV: Text-to-Speech Systems
10(1)
Part V: Spoken Language Systems
10(1)
Target Audiences
10(1)
Historical Perspective and Further Reading
11(8)
PART I: FUNDAMENTAL THEORY
Spoken Language Structure
19(54)
Sound and Human Speech Systems
21(15)
Sound
21(3)
Speech Production
24(5)
Speech Perception
29(7)
Phonetics and Phonology
36(15)
Phonemes
36(11)
The Allophone: Sound and Context
47(2)
Speech Rate and Coarticulation
49(2)
Syllables and Words
51(7)
Syllables
51(2)
Words
53(5)
Syntax and Semantics
58(10)
Syntactic Constituents
58(5)
Semantic Roles
63(1)
Lexical Semantics
64(3)
Logical Form
67(1)
Historical Perspective and Further Reading
68(5)
Probability, Statistics, and Information Theory
73(60)
Probability Theory
74(24)
Conditional Probability and Bayes' Rule
75(2)
Random Variables
77(2)
Mean and Variance
79(3)
Covariance and Correlation
82(1)
Random Vectors and Multivariate Distributions
83(2)
Some Useful Distributions
85(7)
Gaussian Distributions
92(6)
Estimation Theory
98(15)
Minimum/Least Mean Squared Error Estimation
99(5)
Maximum Likelihood Estimation
104(3)
Bayesian Estimation and MAP Estimation
107(6)
Significance Testing
113(7)
Level of Significance
114(1)
Normal Test (Z-Test)
115(1)
χ² Goodness-of-Fit Test
116(2)
Matched-Pairs Test
118(2)
Information Theory
120(8)
Entropy
120(3)
Conditional Entropy
123(1)
The Source Coding Theorem
124(2)
Mutual Information and Channel Coding
126(2)
Historical Perspective and Further Reading
128(5)
Pattern Recognition
133(68)
Bayes' Decision Theory
134(6)
Minimum-Error-Rate Decision Rules
135(3)
Discriminant Functions
138(2)
How to Construct Classifiers
140(10)
Gaussian Classifiers
142(2)
The Curse of Dimensionality
144(2)
Estimating the Error Rate
146(2)
Comparing Classifiers
148(2)
Discriminative Training
150(13)
Maximum Mutual Information Estimation
150(6)
Minimum-Error-Rate Estimation
156(2)
Neural Networks
158(5)
Unsupervised Estimation Methods
163(12)
Vector Quantization
163(7)
The EM Algorithm
170(2)
Multivariate Gaussian Mixture Density Estimation
172(3)
Classification and Regression Trees
175(15)
Choice of Question Set
177(1)
Splitting Criteria
178(3)
Growing the Tree
181(1)
Missing Values and Conflict Resolution
182(1)
Complex Questions
182(2)
The Right-Sized Tree
184(6)
Historical Perspective and Further Reading
190(11)
PART II: SPEECH PROCESSING
Digital Signal Processing
201(74)
Digital Signals and Systems
202(6)
Sinusoidal Signals
203(3)
Other Digital Signals
206(1)
Digital Systems
206(2)
Continuous-Frequency Transforms
208(8)
The Fourier Transform
208(3)
Z-Transform
211(1)
Z-Transforms of Elementary Functions
212(3)
Properties of the Z- and Fourier Transforms
215(1)
Discrete-Frequency Transforms
216(13)
The Discrete Fourier Transform (DFT)
218(1)
Fourier Transforms of Periodic Signals
219(3)
The Fast Fourier Transform (FFT)
222(5)
Circular Convolution
227(1)
The Discrete Cosine Transform (DCT)
228(1)
Digital Filters and Windows
229(13)
The Ideal Low-Pass Filter
229(1)
Window Functions
230(2)
FIR Filters
232(6)
IIR Filters
238(4)
Digital Processing of Analog Signals
242(6)
Fourier Transform of Analog Signals
243(1)
The Sampling Theorem
243(2)
Analog-to-Digital Conversion
245(1)
Digital-to-Analog Conversion
246(2)
Multirate Signal Processing
248(3)
Decimation
248(1)
Interpolation
249(1)
Resampling
250(1)
Filterbanks
251(9)
Two-Band Conjugate Quadrature Filters
251(3)
Multiresolution Filterbanks
254(1)
The DFT as a Filterbank
255(3)
Modulated Lapped Transforms
258(2)
Stochastic Processes
260(10)
Statistics of Stochastic Processes
261(3)
Stationary Processes
264(3)
LTI Systems with Stochastic Inputs
267(1)
Power Spectral Density
268(1)
Noise
269(1)
Historical Perspective and Further Reading
270(5)
Speech Signal Representations
275(62)
Short-Time Fourier Analysis
276(7)
Spectrograms
281(2)
Pitch-Synchronous Analysis
283(1)
Acoustical Model of Speech Production
283(7)
Glottal Excitation
284(1)
Lossless Tube Concatenation
284(4)
Source-Filter Models of Speech Production
288(2)
Linear Predictive Coding
290(16)
The Orthogonality Principle
291(1)
Solution of the LPC Equations
292(8)
Spectral Analysis via LPC
300(1)
The Prediction Error
301(2)
Equivalent Representations
303(3)
Cepstral Processing
306(9)
The Real and Complex Cepstrum
307(1)
Cepstrum of Pole-Zero Filters
308(3)
Cepstrum of Periodic Signals
311(1)
Cepstrum of Speech Signals
312(2)
Source-Filter Separation via the Cepstrum
314(1)
Perceptually Motivated Representations
315(4)
The Bilinear Transform
315(1)
Mel-Frequency Cepstrum
316(2)
Perceptual Linear Prediction (PLP)
318(1)
Formant Frequencies
319(5)
Statistical Formant Tracking
320(4)
The Role of Pitch
324(8)
Autocorrelation Method
324(3)
Normalized Cross-Correlation Method
327(2)
Signal Conditioning
329(1)
Pitch Tracking
330(2)
Historical Perspective and Further Reading
332(5)
Speech Coding
337(40)
Speech Coder Attributes
338(2)
Scalar Waveform Coders
340(8)
Linear Pulse Code Modulation (PCM)
340(2)
μ-law and A-law PCM
342(2)
Adaptive PCM
344(1)
Differential Quantization
345(3)
Scalar Frequency Domain Coders
348(5)
Benefits of Masking
349(1)
Transform Coders
350(1)
Consumer Audio
351(1)
Digital Audio Broadcasting (DAB)
352(1)
Code Excited Linear Prediction (CELP)
353(8)
LPC Vocoder
353(1)
Analysis by Synthesis
353(3)
Pitch Prediction: Adaptive Codebook
356(1)
Perceptual Weighting and Postfiltering
357(1)
Parameter Quantization
358(1)
CELP Standards
359(2)
Low-Bit Rate Speech Coders
361(10)
Mixed-Excitation LPC Vocoder
362(1)
Harmonic Coding
363(4)
Waveform Interpolation
367(4)
Historical Perspective and Further Reading
371(6)
PART III: SPEECH RECOGNITION
Hidden Markov Models
377(38)
The Markov Chain
378(2)
Definition of the Hidden Markov Model
380(14)
Dynamic Programming and DTW
383(2)
How to Evaluate an HMM---The Forward Algorithm
385(2)
How to Decode an HMM---The Viterbi Algorithm
387(2)
How to Estimate HMM Parameters---Baum-Welch Algorithm
389(5)
Continuous and Semicontinuous HMMs
394(4)
Continuous Mixture Density HMMs
394(2)
Semicontinuous HMMs
396(2)
Practical Issues in Using HMMs
398(7)
Initial Estimates
398(1)
Model Topology
399(2)
Training Criteria
401(1)
Deleted Interpolation
401(2)
Parameter Smoothing
403(1)
Probability Representations
404(1)
HMM Limitations
405(4)
Duration Modeling
406(2)
First-Order Assumption
408(1)
Conditional Independence Assumption
409(1)
Historical Perspective and Further Reading
409(6)
Acoustic Modeling
415(62)
Variability in the Speech Signal
416(3)
Context Variability
417(1)
Style Variability
418(1)
Speaker Variability
418(1)
Environment Variability
419(1)
How to Measure Speech Recognition Errors
419(2)
Signal Processing---Extracting Features
421(7)
Signal Acquisition
422(1)
End-Point Detection
422(2)
MFCC and Its Dynamic Features
424(2)
Feature Transformation
426(2)
Phonetic Modeling---Selecting Appropriate Units
428(11)
Comparison of Different Units
429(1)
Context Dependency
430(2)
Clustered Acoustic-Phonetic Units
432(4)
Lexical Baseforms
436(3)
Acoustic Modeling---Scoring Acoustic Features
439(5)
Choice of HMM Output Distributions
439(2)
Isolated vs. Continuous Speech Training
441(3)
Adaptive Techniques---Minimizing Mismatches
444(9)
Maximum a Posteriori (MAP)
445(2)
Maximum Likelihood Linear Regression (MLLR)
447(3)
MLLR and MAP Comparison
450(2)
Clustered Models
452(1)
Confidence Measures: Measuring the Reliability
453(4)
Filler Models
453(1)
Transformation Models
454(2)
Combination Models
456(1)
Other Techniques
457(7)
Neural Networks
457(2)
Segment Models
459(5)
Case Study: Whisper
464(1)
Historical Perspective and Further Reading
465(12)
Environmental Robustness
477(68)
The Acoustical Environment
478(8)
Additive Noise
478(2)
Reverberation
480(2)
A Model of the Environment
482(4)
Acoustical Transducers
486(11)
The Condenser Microphone
486(3)
Directionality Patterns
489(7)
Other Transduction Categories
496(1)
Adaptive Echo Cancellation (AEC)
497(7)
The LMS Algorithm
499(1)
Convergence Properties of the LMS Algorithm
500(1)
Normalized LMS Algorithm
501(1)
Transform-Domain LMS Algorithm
502(1)
The RLS Algorithm
503(1)
Multimicrophone Speech Enhancement
504(11)
Microphone Arrays
505(5)
Blind Source Separation
510(5)
Environment Compensation Preprocessing
515(13)
Spectral Subtraction
516(3)
Frequency-Domain MMSE from Stereo Data
519(1)
Wiener Filtering
520(2)
Cepstral Mean Normalization (CMN)
522(3)
Real-Time Cepstral Normalization
525(1)
The Use of Gaussian Mixture Models
525(3)
Environmental Model Adaptation
528(10)
Retraining on Corrupted Speech
528(2)
Model Adaptation
530(1)
Parallel Model Combination
531(4)
Vector Taylor Series
535(2)
Retraining on Compensated Features
537(1)
Modeling Nonstationary Noise
538(2)
Historical Perspective and Further Reading
540(5)
Language Modeling
545(46)
Formal Language Theory
546(8)
Chomsky Hierarchy
547(2)
Chart Parsing for Context-Free Grammars
549(5)
Stochastic Language Models
554(6)
Probabilistic Context-Free Grammars
554(4)
N-gram Language Models
558(2)
Complexity Measure of Language Models
560(2)
N-Gram Smoothing
562(12)
Deleted Interpolation Smoothing
564(1)
Backoff Smoothing
565(5)
Class N-grams
570(3)
Performance of N-gram Smoothing
573(1)
Adaptive Language Models
574(4)
Cache Language Models
574(1)
Topic-Adaptive Models
575(1)
Maximum Entropy Models
576(2)
Practical Issues
578(6)
Vocabulary Selection
578(2)
N-gram Pruning
580(1)
CFG vs. N-gram Models
581(3)
Historical Perspective and Further Reading
584(7)
Basic Search Algorithms
591(54)
Basic Search Algorithms
592(16)
General Graph Searching Procedures
593(4)
Blind Graph Search Algorithms
597(4)
Heuristic Graph Search
601(7)
Search Algorithms for Speech Recognition
608(5)
Decoder Basics
609(1)
Combining Acoustic and Language Models
610(1)
Isolated Word Recognition
610(1)
Continuous Speech Recognition
611(2)
Language Model States
613(9)
Search Space with FSM and CFG
613(3)
Search Space with the Unigram
616(1)
Search Space with Bigrams
617(2)
Search Space with Trigrams
619(2)
How to Handle Silences Between Words
621(1)
Time-Synchronous Viterbi Beam Search
622(4)
The Use of Beam
624(1)
Viterbi Beam Search
625(1)
Stack Decoding (A* Search)
626(14)
Admissible Heuristics for Remaining Path
630(1)
When to Extend New Words
631(3)
Fast Match
634(4)
Stack Pruning
638(1)
Multistack Search
639(1)
Historical Perspective and Further Reading
640(5)
Large-Vocabulary Search Algorithms
645(44)
Efficient Manipulation of a Tree Lexicon
646(13)
Lexical Tree
646(2)
Multiple Copies of Pronunciation Trees
648(2)
Factored Language Probabilities
650(3)
Optimization of Lexical Trees
653(3)
Exploiting Subtree Polymorphism
656(2)
Context-Dependent Units and Inter-word Triphones
658(1)
Other Efficient Search Techniques
659(4)
Using Entire HMM as a State in Search
659(1)
Different Layers of Beams
660(1)
Fast Match
661(2)
N-Best and Multipass Search Strategies
663(11)
N-best Lists and Word Lattices
664(2)
The Exact N-best Algorithm
666(1)
Word-Dependent N-best and Word-Lattice Algorithm
667(3)
The Forward-Backward Search Algorithm
670(3)
One-Pass vs. Multipass Search
673(1)
Search-Algorithm Evaluation
674(2)
Case Study---Microsoft Whisper
676(5)
The CFG Search Architecture
676(1)
The N-gram Search Architecture
677(4)
Historical Perspective and Further Reading
681(8)
PART IV: TEXT-TO-SPEECH SYSTEMS
Text and Phonetic Analysis
689(50)
Modules and Data Flow
690(7)
Modules
692(2)
Data Flows
694(2)
Localization Issues
696(1)
Lexicon
697(2)
Document Structure Detection
699(7)
Chapter and Section Headers
700(1)
Lists
701(1)
Paragraphs
702(1)
Sentences
702(2)
Email
704(1)
Web Pages
705(1)
Dialog Turns and Speech Acts
705(1)
Text Normalization
706(14)
Abbreviations and Acronyms
709(3)
Number Formats
712(6)
Domain-Specific Tags
718(1)
Miscellaneous Formats
719(1)
Linguistic Analysis
720(4)
Homograph Disambiguation
724(1)
Morphological Analysis
725(3)
Letter-to-Sound Conversion
728(2)
Evaluation
730(2)
Case Study: Festival
732(3)
Lexicon
733(1)
Text Analysis
733(2)
Phonetic Analysis
735(1)
Historical Perspective and Further Reading
735(4)
Prosody
739(54)
The Role of Understanding
740(3)
Prosody Generation Schematic
743(1)
Speaking Style
744(1)
Character
744(1)
Emotion
744(1)
Symbolic Prosody
745(16)
Pauses
747(2)
Prosodic Phrases
749(2)
Accent
751(2)
Tone
753(4)
Tune
757(2)
Prosodic Transcription Systems
759(2)
Duration Assignment
761(2)
Rule-Based Methods
762(1)
CART-Based Durations
763(1)
Pitch Generation
763(20)
Attributes of Pitch Contours
764(4)
Baseline F0 Contour Generation
768(6)
Parametric F0 Generation
774(4)
Corpus-Based F0 Generation
778(5)
Prosody Markup Languages
783(1)
Prosody Evaluation
784(1)
Historical Perspective and Further Reading
785(8)
Speech Synthesis
793(60)
Attributes of Speech Synthesis
794(2)
Formant Speech Synthesis
796(8)
Waveform Generation from Formant Values
797(3)
Formant Generation by Rule
800(3)
Data-Driven Formant Generation
803(1)
Articulatory Synthesis
803(1)
Concatenative Speech Synthesis
804(14)
Choice of Unit
805(5)
Optimal Unit String: The Decoding Process
810(7)
Unit Inventory Design
817(1)
Prosodic Modification of Speech
818(13)
Synchronous Overlap and Add (SOLA)
818(2)
Pitch Synchronous Overlap and Add (PSOLA)
820(2)
Spectral Behavior of PSOLA
822(1)
Synthesis Epoch Calculation
823(2)
Pitch-Scale Modification Epoch Calculation
825(1)
Time-Scale Modification Epoch Calculation
826(1)
Pitch-Scale Time-Scale Epoch Calculation
827(1)
Waveform Mapping
827(1)
Epoch Detection
828(1)
Problems with PSOLA
829(2)
Source-Filter Models for Prosody Modification
831(3)
Prosody Modification of the LPC Residual
832(1)
Mixed Excitation Models
832(2)
Voice Effects
834(1)
Evaluation of TTS Systems
834(10)
Intelligibility Tests
837(3)
Overall Quality Tests
840(2)
Preference Tests
842(1)
Functional Tests
842(1)
Automated Tests
843(1)
Historical Perspective and Further Reading
844(9)
PART V: SPOKEN LANGUAGE SYSTEMS
Spoken Language Understanding
853(66)
Written vs. Spoken Languages
855(4)
Style
856(1)
Disfluency
857(1)
Communicative Prosody
858(1)
Dialog Structure
859(8)
Units of Dialog
860(1)
Dialog (Speech) Acts
861(5)
Dialog Control
866(1)
Semantic Representation
867(6)
Semantic Frames
867(5)
Conceptual Graphs
872(1)
Sentence Interpretation
873(8)
Robust Parsing
874(4)
Statistical Pattern Matching
878(3)
Discourse Analysis
881(5)
Resolution of Relative Expression
882(3)
Automatic Inference and Inconsistency Detection
885(1)
Dialog Management
886(8)
Dialog Grammars
887(1)
Plan-Based Systems
888(4)
Dialog Behavior
892(2)
Response Generation and Rendition
894(7)
Response Content Generation
895(4)
Concept-to-Speech Rendition
899(2)
Other Renditions
901(1)
Evaluation
901(5)
Evaluation in the ATIS Task
901(2)
PARADISE Framework
903(3)
Case Study---Dr. Who
906(7)
Semantic Representation
906(2)
Semantic Parser (Sentence Interpretation)
908(1)
Discourse Analysis
909(1)
Dialog Manager
910(3)
Historical Perspective and Further Reading
913(6)
Applications and User Interfaces
919(38)
Application Architecture
920(1)
Typical Applications
921(10)
Computer Command and Control
921(3)
Telephony Applications
924(2)
Dictation
926(3)
Accessibility
929(1)
Handheld Devices
930(1)
Automobile Applications
930(1)
Speaker Recognition
931(1)
Speech Interface Design
931(12)
General Principles
931(6)
Handling Errors
937(4)
Other Considerations
941(1)
Dialog Flow
942(1)
Internationalization
943(2)
Case Study---MiPad
945(7)
Specifying the Application
946(2)
Rapid Prototyping
948(1)
Evaluation
949(2)
Iterations
951(1)
Historical Perspective and Further Reading
952(5)
Index 957

