Automatic speech recognition (ASR) systems are finding increasing use in everyday life. Many of the everyday environments in which these systems are used are noisy: for example, a user may call a voice search system from a busy cafeteria or a street. This can degrade the speech recordings and adversely affect the performance of speech recognition systems. As the use of ASR systems increases, knowledge of state-of-the-art techniques for dealing with such problems becomes critical to the system and application engineers and researchers who work with or on ASR technologies. This book presents a comprehensive survey of state-of-the-art techniques for improving the robustness of speech recognition systems to these degrading external influences.

Key features:
Reviews all the main noise-robust ASR approaches, including signal separation, voice activity detection, robust feature extraction, model compensation and adaptation, missing-data techniques, and recognition of reverberant speech.
Acts as a timely exposition of the topic in light of the expected wider use of ASR technology in challenging environments.
Addresses robustness issues and signal degradation, both of which are key concerns for practitioners of ASR.
Includes contributions from top ASR researchers from leading research units in the field.

Tuomas Virtanen, Tampere University of Technology, Finland
Dr. Virtanen is a senior researcher at Tampere University of Technology. He previously worked as a research associate at Cambridge University, UK. His main research contributions are in sound source separation and its application to robust speech recognition, audio content analysis, and music information retrieval. He is well known for his work on source separation based on non-negative matrix factorization, which is currently widely used in the field. He has published numerous journal and conference articles on the above topics.

Rita Singh, Carnegie Mellon University, USA
Dr. Singh is the CEO of a speech-technology startup but remains an adjunct faculty member of the Language Technologies Institute at Carnegie Mellon University. She has been a major contributor to the open-source CMU Sphinx and is one of the main architects of the popular Sphinx4 Java-based open-source speech recognition system. In addition to her work on core speech recognition technology, she has developed several algorithms for noise compensation, and was the prime architect of CMU's award-winning submission to the Naval Research Lab's 2001 challenge on automatic recognition of speech in noisy environments (SPINE).

Bhiksha Raj, Carnegie Mellon University, USA
Dr. Raj is an associate professor in the Language Technologies Institute and in Electrical and Computer Engineering at Carnegie Mellon University. He has worked extensively on robustness algorithms for speech recognition, and is well known for his contributions to the highly popular vector Taylor series (VTS) approach to noise compensation, as well as to missing-feature-based techniques for noise compensation. He has published extensively on, and holds patents for, algorithms for microphone array processing and signal separation.

List of Contributors xv

Acknowledgments xvii

1 Introduction 1
Tuomas Virtanen, Rita Singh, Bhiksha Raj

1.1 Scope of the Book 1

1.2 Outline 2

1.3 Notation 4

Part One FOUNDATIONS

2 The Basics of Automatic Speech Recognition 9
Rita Singh, Bhiksha Raj, Tuomas Virtanen

2.1 Introduction 9

2.2 Speech Recognition Viewed as Bayes Classification 10

2.3 Hidden Markov Models 11

2.3.1 Computing Probabilities with HMMs 12

2.3.2 Determining the State Sequence 17

2.3.3 Learning HMM Parameters 19

2.3.4 Additional Issues Relating to Speech Recognition Systems 20

2.4 HMM-Based Speech Recognition 24

2.4.1 Representing the Signal 24

2.4.2 The HMM for a Word Sequence 25

2.4.3 Searching through all Word Sequences 26

References 29

3 The Problem of Robustness in Automatic Speech Recognition 31
Bhiksha Raj, Tuomas Virtanen, Rita Singh

3.1 Errors in Bayes Classification 31

3.1.1 Type 1 Condition: Mismatch Error 33

3.1.2 Type 2 Condition: Increased Bayes Error 34

3.2 Bayes Classification and ASR 35

3.2.1 All We Have is a Model: A Type 1 Condition 35

3.2.2 Intrinsic Interferences—Signal Components that are Unrelated to the Message: A Type 2 Condition 36

3.2.3 External Interferences—The Data are Noisy: Type 1 and Type 2 Conditions 36

3.3 External Influences on Speech Recordings 36

3.3.1 Signal Capture 37

3.3.2 Additive Corruptions 41

3.3.3 Reverberation 42

3.3.4 A Simplified Model of Signal Capture 43

3.4 The Effect of External Influences on Recognition 44

3.5 Improving Recognition under Adverse Conditions 46

3.5.1 Handling the Model Mismatch Error 46

3.5.2 Dealing with Intrinsic Variations in the Data 47

3.5.3 Dealing with Extrinsic Variations 47

References 50

Part Two SIGNAL ENHANCEMENT

4 Voice Activity Detection, Noise Estimation, and Adaptive Filters for Acoustic Signal Enhancement 53
Rainer Martin, Dorothea Kolossa

4.1 Introduction 53

4.2 Signal Analysis and Synthesis 55

4.2.1 DFT-Based Analysis Synthesis with Perfect Reconstruction 55

4.2.2 Probability Distributions for Speech and Noise DFT Coefficients 57

4.3 Voice Activity Detection 58

4.3.1 VAD Design Principles 58

4.3.2 Evaluation of VAD Performance 62

4.3.3 Evaluation in the Context of ASR 62

4.4 Noise Power Spectrum Estimation 65

4.4.1 Smoothing Techniques 65

4.4.2 Histogram and GMM Noise Estimation Methods 67

4.4.3 Minimum Statistics Noise Power Estimation 67

4.4.4 MMSE Noise Power Estimation 68

4.4.5 Estimation of the A Priori Signal-to-Noise Ratio 69

4.5 Adaptive Filters for Signal Enhancement 71

4.5.1 Spectral Subtraction 71

4.5.2 Nonlinear Spectral Subtraction 73

4.5.3 Wiener Filtering 74

4.5.4 The ETSI Advanced Front End 75

4.5.5 Nonlinear MMSE Estimators 75

4.6 ASR Performance 80

4.7 Conclusions 81

References 82

5 Extraction of Speech from Mixture Signals 87
Paris Smaragdis

5.1 The Problem with Mixtures 87

5.2 Multichannel Mixtures 88

5.2.1 Basic Problem Formulation 88

5.2.2 Convolutive Mixtures 92

5.3 Single-Channel Mixtures 98

5.3.1 Problem Formulation 98

5.3.2 Learning Sound Models 100

5.3.3 Separation by Spectrogram Factorization 101

5.3.4 Dealing with Unknown Sounds 105

5.4 Variations and Extensions 107

5.5 Conclusions 107

References 107

6 Microphone Arrays 109
John McDonough, Kenichi Kumatani

6.1 Speaker Tracking 110

6.2 Conventional Microphone Arrays 113

6.3 Conventional Adaptive Beamforming Algorithms 120

6.3.1 Minimum Variance Distortionless Response Beamformer 120

6.3.2 Noise Field Models 122

6.3.3 Subband Analysis and Synthesis 123

6.3.4 Beamforming Performance Criteria 126

6.3.5 Generalized Sidelobe Canceller Implementation 129

6.3.6 Recursive Implementation of the GSC 130

6.3.7 Other Conventional GSC Beamformers 131

6.3.8 Beamforming based on Higher Order Statistics 132

6.3.9 Online Implementation 136

6.3.10 Speech-Recognition Experiments 140

6.4 Spherical Microphone Arrays 142

6.5 Spherical Adaptive Algorithms 148

6.6 Comparative Studies 149

6.7 Comparison of Linear and Spherical Arrays for DSR 152

6.8 Conclusions and Further Reading 154

References 155

Part Three FEATURE ENHANCEMENT

7 From Signals to Speech Features by Digital Signal Processing 161
Matthias Wölfel

7.1 Introduction 161

7.1.1 About this Chapter 162

7.2 The Speech Signal 162

7.3 Spectral Processing 163

7.3.1 Windowing 163

7.3.2 Power Spectrum 165

7.3.3 Spectral Envelopes 166

7.3.4 LP Envelope 166

7.3.5 MVDR Envelope 169

7.3.6 Warping the Frequency Axis 171

7.3.7 Warped LP Envelope 175

7.3.8 Warped MVDR Envelope 176

7.3.9 Comparison of Spectral Estimates 177

7.3.10 The Spectrogram 179

7.4 Cepstral Processing 179

7.4.1 Definition and Calculation of Cepstral Coefficients 180

7.4.2 Characteristics of Cepstral Sequences 181

7.5 Influence of Distortions on Different Speech Features 182

7.5.1 Objective Functions 182

7.5.2 Robustness against Noise 185

7.5.3 Robustness against Echo and Reverberation 187

7.5.4 Robustness against Changes in Fundamental Frequency 189

7.6 Summary and Further Reading 191

References 191

8 Features Based on Auditory Physiology and Perception 193
Richard M. Stern, Nelson Morgan

8.1 Introduction 193

8.2 Some Attributes of Auditory Physiology and Perception 194

8.2.1 Peripheral Processing 194

8.2.2 Processing at more Central Levels 200

8.2.3 Psychoacoustical Correlates of Physiological Observations 202

8.2.4 The Impact of Auditory Processing on Conventional Feature Extraction 206

8.2.5 Summary 208

8.3 “Classic” Auditory Representations 208

8.4 Current Trends in Auditory Feature Analysis 213

8.5 Summary 221

Acknowledgments 222

References 222

9 Feature Compensation 229
Jasha Droppo

9.1 Life in an Ideal World 229

9.1.1 Noise Robustness Tasks 229

9.1.2 Probabilistic Feature Enhancement 230

9.1.3 Gaussian Mixture Models 231

9.2 MMSE-SPLICE 232

9.2.1 Parameter Estimation 233

9.2.2 Results 236

9.3 Discriminative SPLICE 237

9.3.1 The MMI Objective Function 238

9.3.2 Training the Front-End Parameters 239

9.3.3 The Rprop Algorithm 240

9.3.4 Results 241

9.4 Model-Based Feature Enhancement 242

9.4.1 The Additive Noise-Mixing Equation 243

9.4.2 The Joint Probability Model 244

9.4.3 Vector Taylor Series Approximation 246

9.4.4 Estimating Clean Speech 247

9.4.5 Results 247

9.5 Switching Linear Dynamic System 248

9.6 Conclusion 249

References 249

10 Reverberant Speech Recognition 251
Reinhold Haeb-Umbach, Alexander Krueger

10.1 Introduction 251

10.2 The Effect of Reverberation 252

10.2.1 What is Reverberation? 252

10.2.2 The Relationship between Clean and Reverberant Speech Features 254

10.2.3 The Effect of Reverberation on ASR Performance 258

10.3 Approaches to Reverberant Speech Recognition 258

10.3.1 Signal-Based Techniques 259

10.3.2 Front-End Techniques 260

10.3.3 Back-End Techniques 262

10.3.4 Concluding Remarks 265

10.4 Feature Domain Model of the Acoustic Impulse Response 265

10.5 Bayesian Feature Enhancement 267

10.5.1 Basic Approach 268

10.5.2 Measurement Update 269

10.5.3 Time Update 270

10.5.4 Inference 271

10.6 Experimental Results 272

10.6.1 Databases 272

10.6.2 Overview of the Tested Methods 273

10.6.3 Recognition Results on Reverberant Speech 274

10.6.4 Recognition Results on Noisy Reverberant Speech 276

10.7 Conclusions 277

Acknowledgment 278

References 278

Part Four MODEL ENHANCEMENT

11 Adaptation and Discriminative Training of Acoustic Models 285
Yannick Estève, Paul Deléglise

11.1 Introduction 285

11.1.1 Acoustic Models 286

11.1.2 Maximum Likelihood Estimation 287

11.2 Acoustic Model Adaptation and Noise Robustness 288

11.2.1 Static (or Offline) Adaptation 289

11.2.2 Dynamic (or Online) Adaptation 289

11.3 Maximum A Posteriori Reestimation 290

11.4 Maximum Likelihood Linear Regression 293

11.4.1 Class Regression Tree 294

11.4.2 Constrained Maximum Likelihood Linear Regression 297

11.4.3 CMLLR Implementation 297

11.4.4 Speaker Adaptive Training 298

11.5 Discriminative Training 299

11.5.1 MMI Discriminative Training Criterion 301

11.5.2 MPE Discriminative Training Criterion 302

11.5.3 I-smoothing 303

11.5.4 MPE Implementation 304

11.6 Conclusion 307

References 308

12 Factorial Models for Noise Robust Speech Recognition 311
John R. Hershey, Steven J. Rennie, Jonathan Le Roux

12.1 Introduction 311

12.2 The Model-Based Approach 313

12.3 Signal Feature Domains 314

12.4 Interaction Models 317

12.4.1 Exact Interaction Model 318

12.4.2 Max Model 320

12.4.3 Log-Sum Model 321

12.4.4 Mel Interaction Model 321

12.5 Inference Methods 322

12.5.1 Max Model Inference 322

12.5.2 Parallel Model Combination 324

12.5.3 Vector Taylor Series Approaches 326

12.5.4 SNR-Dependent Approaches 331

12.6 Efficient Likelihood Evaluation in Factorial Models 332

12.6.1 Efficient Inference using the Max Model 332

12.6.2 Efficient Vector-Taylor Series Approaches 334

12.6.3 Band Quantization 335

12.7 Current Directions 337

12.7.1 Dynamic Noise Models for Robust ASR 338

12.7.2 Multi-Talker Speech Recognition using Graphical Models 339

12.7.3 Noise Robust ASR using Non-Negative Basis Representations 340

References 341

13 Acoustic Model Training for Robust Speech Recognition 347
Michael L. Seltzer

13.1 Introduction 347

13.2 Traditional Training Methods for Robust Speech Recognition 348

13.3 A Brief Overview of Speaker Adaptive Training 349

13.4 Feature-Space Noise Adaptive Training 351

13.4.1 Experiments using fNAT 352

13.5 Model-Space Noise Adaptive Training 353

13.6 Noise Adaptive Training using VTS Adaptation 355

13.6.1 Vector Taylor Series HMM Adaptation 355

13.6.2 Updating the Acoustic Model Parameters 357

13.6.3 Updating the Environmental Parameters 360

13.6.4 Implementation Details 360

13.6.5 Experiments using NAT 361

13.7 Discussion 364

13.7.1 Comparison of Training Algorithms 364

13.7.2 Comparison to Speaker Adaptive Training 364

13.7.3 Related Adaptive Training Methods 365

13.8 Conclusion 366

References 366

Part Five COMPENSATION FOR INFORMATION LOSS

14 Missing-Data Techniques: Recognition with Incomplete Spectrograms 371
Jon Barker

14.1 Introduction 371

14.2 Classification with Incomplete Data 373

14.2.1 A Simple Missing Data Scenario 374

14.2.2 Missing Data Theory 376

14.2.3 Validity of the MAR Assumption 378

14.2.4 Marginalising Acoustic Models 379

14.3 Energetic Masking 381

14.3.1 The Max Approximation 381

14.3.2 Bounded Marginalisation 382

14.3.3 Missing Data ASR in the Cepstral Domain 384

14.3.4 Missing Data ASR with Dynamic Features 386

14.4 Meta-Missing Data: Dealing with Mask Uncertainty 388

14.4.1 Missing Data with Soft Masks 388

14.4.2 Sub-band Combination Approaches 391

14.4.3 Speech Fragment Decoding 393

14.5 Some Perspectives on Performance 395

References 396

15 Missing-Data Techniques: Feature Reconstruction 399
Jort Florent Gemmeke, Ulpu Remes

15.1 Introduction 399

15.2 Missing-Data Techniques 401

15.3 Correlation-Based Imputation 402

15.3.1 Fundamentals 402

15.3.2 Implementation 404

15.4 Cluster-Based Imputation 406

15.4.1 Fundamentals 406

15.4.2 Implementation 408

15.4.3 Advances 409

15.5 Class-Conditioned Imputation 411

15.5.1 Fundamentals 411

15.5.2 Implementation 412

15.5.3 Advances 413

15.6 Sparse Imputation 414

15.6.1 Fundamentals 414

15.6.2 Implementation 416

15.6.3 Advances 418

15.7 Other Feature-Reconstruction Methods 420

15.7.1 Parametric Approaches 420

15.7.2 Nonparametric Approaches 421

15.8 Experimental Results 421

15.8.1 Feature-Reconstruction Methods 422

15.8.2 Comparison with Other Methods 424

15.8.3 Advances 426

15.8.4 Combination with Other Methods 427

15.9 Discussion and Conclusion 428

Acknowledgments 429

References 430

16 Computational Auditory Scene Analysis and Automatic Speech Recognition 433
Arun Narayanan, DeLiang Wang

16.1 Introduction 433

16.2 Auditory Scene Analysis 434

16.3 Computational Auditory Scene Analysis 435

16.3.1 Ideal Binary Mask 435

16.3.2 Typical CASA Architecture 438

16.4 CASA Strategies 440

16.4.1 IBM Estimation Based on Local SNR Estimates 440

16.4.2 IBM Estimation using ASA Cues 442

16.4.3 IBM Estimation as Binary Classification 448

16.4.4 Binaural Mask Estimation Strategies 451

16.5 Integrating CASA with ASR 452

16.5.1 Uncertainty Transform Model 454

16.6 Concluding Remarks 458

Acknowledgment 458

References 458

17 Uncertainty Decoding 463
Hank Liao

17.1 Introduction 463

17.2 Observation Uncertainty 465

17.3 Uncertainty Decoding 466

17.4 Feature-Based Uncertainty Decoding 468

17.4.1 SPLICE with Uncertainty 470

17.4.2 Front-End Joint Uncertainty Decoding 471

17.4.3 Issues with Feature-Based Uncertainty Decoding 472

17.5 Model-Based Joint Uncertainty Decoding 473

17.5.1 Parameter Estimation 475

17.5.2 Comparisons with Other Methods 476

17.6 Noisy CMLLR 477

17.7 Uncertainty and Adaptive Training 480

17.7.1 Gradient-Based Methods 481

17.7.2 Factor Analysis Approaches 482

17.8 In Combination with Other Techniques 483

17.9 Conclusions 484

References 485

Index 487

Techniques for Noise Robustness in Automatic Speech Recognition

ISBN-13: 9781119970880

ISBN-10: 1119970881