Class-tested and coherent, this textbook teaches classical and web information retrieval from basic concepts, including web search and the related areas of text classification and text clustering. It gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it well suited to introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured to make teaching more natural and effective. Slides and additional exercises (with solutions for lecturers) are also available through the book's supporting website to help course instructors prepare their lectures.

Christopher D. Manning is Associate Professor of Computer Science and Linguistics at Stanford University. Prabhakar Raghavan is Head of Yahoo! Research and a Consulting Professor of Computer Science at Stanford University. Hinrich Schütze is Chair of Theoretical Computational Linguistics at the Institute for Natural Language Processing, University of Stuttgart.

Table of Notation	p. xi
Preface	p. xv
Boolean retrieval	p. 1
An example information retrieval problem	p. 3
A first take at building an inverted index	p. 6
Processing Boolean queries	p. 9
The extended Boolean model versus ranked retrieval	p. 13
References and further reading	p. 16
The term vocabulary and postings lists	p. 18
Document delineation and character sequence decoding	p. 18
Determining the vocabulary of terms	p. 21
Faster postings list intersection via skip pointers	p. 33
Positional postings and phrase queries	p. 36
References and further reading	p. 43
Dictionaries and tolerant retrieval	p. 45
Search structures for dictionaries	p. 45
Wildcard queries	p. 48
Spelling correction	p. 52
Phonetic correction	p. 58
References and further reading	p. 59
Index construction	p. 61
Hardware basics	p. 62
Blocked sort-based indexing	p. 63
Single-pass in-memory indexing	p. 66
Distributed indexing	p. 68
Dynamic indexing	p. 71
Other types of indexes	p. 73
References and further reading	p. 76
Index compression	p. 78
Statistical properties of terms in information retrieval	p. 79
Dictionary compression	p. 82
Postings file compression	p. 87
References and further reading	p. 97
Scoring, term weighting, and the vector space model	p. 100
Parametric and zone indexes	p. 101
Term frequency and weighting	p. 107
The vector space model for scoring	p. 110
Variant tf-idf functions	p. 116
References and further reading	p. 122
Computing scores in a complete search system	p. 124
Efficient scoring and ranking	p. 124
Components of an information retrieval system	p. 132
Vector space scoring and query operator interaction	p. 136
References and further reading	p. 137
Evaluation in information retrieval	p. 139
Information retrieval system evaluation	p. 140
Standard test collections	p. 141
Evaluation of unranked retrieval sets	p. 142
Evaluation of ranked retrieval results	p. 145
Assessing relevance	p. 151
A broader perspective: System quality and user utility	p. 154
Results snippets	p. 157
References and further reading	p. 159
Relevance feedback and query expansion	p. 162
Relevance feedback and pseudo relevance feedback	p. 163
Global methods for query reformulation	p. 173
References and further reading	p. 177
XML retrieval	p. 178
Basic XML concepts	p. 180
Challenges in XML retrieval	p. 183
A vector space model for XML retrieval	p. 188
Evaluation of XML retrieval	p. 192
Text-centric versus data-centric XML retrieval	p. 196
References and further reading	p. 198
Probabilistic information retrieval	p. 201
Review of basic probability theory	p. 202
The probability ranking principle	p. 203
The binary independence model	p. 204
An appraisal and some extensions	p. 212
References and further reading	p. 216
Language models for information retrieval	p. 218
Language models	p. 218
The query likelihood model	p. 223
Language modeling versus other approaches in information retrieval	p. 229
Extended language modeling approaches	p. 230
References and further reading	p. 232
Text classification and Naive Bayes	p. 234
The text classification problem	p. 237
Naive Bayes text classification	p. 238
The Bernoulli model	p. 243
Properties of Naive Bayes	p. 245
Feature selection	p. 251
Evaluation of text classification	p. 258
References and further reading	p. 264
Vector space classification	p. 266
Document representations and measures of relatedness in vector spaces	p. 267
Rocchio classification	p. 269
k nearest neighbor	p. 273
Linear versus nonlinear classifiers	p. 277
Classification with more than two classes	p. 281
The bias-variance tradeoff	p. 284
References and further reading	p. 291
Support vector machines and machine learning on documents	p. 293
Support vector machines: The linearly separable case	p. 294
Extensions to the support vector machine model	p. 300
Issues in the classification of text documents	p. 307
Machine-learning methods in ad hoc information retrieval	p. 314
References and further reading	p. 318
Flat clustering	p. 321
Clustering in information retrieval	p. 322
Problem statement	p. 326
Evaluation of clustering	p. 327
K-means	p. 331
Model-based clustering	p. 338
References and further reading	p. 343
Hierarchical clustering	p. 346
Hierarchical agglomerative clustering	p. 347
Single-link and complete-link clustering	p. 350
Group-average agglomerative clustering	p. 356
Centroid clustering	p. 358
Optimality of hierarchical agglomerative clustering	p. 360
Divisive clustering	p. 362
Cluster labeling	p. 363
Implementation notes	p. 365
References and further reading	p. 367
Matrix decompositions and latent semantic indexing	p. 369
Linear algebra review	p. 369
Term-document matrices and singular value decompositions	p. 373
Low-rank approximations	p. 376
Latent semantic indexing	p. 378
References and further reading	p. 383
Web search basics	p. 385
Background and history	p. 385
Web characteristics	p. 387
Advertising as the economic model	p. 392
The search user experience	p. 395
Index size and estimation	p. 396
Near-duplicates and shingling	p. 400
References and further reading	p. 404
Web crawling and indexes	p. 405
Overview	p. 405
Crawling	p. 406
Distributing indexes	p. 415
Connectivity servers	p. 416
References and further reading	p. 419
Link analysis	p. 421
The Web as a graph	p. 422
PageRank	p. 424
Hubs and authorities	p. 433
References and further reading	p. 439
Bibliography	p. 441
Index	p. 469
