Preface

Introduction

What Is Data Mining?

Motivating Challenges

The Origins of Data Mining

Data Mining Tasks

Scope and Organization of the Book

Bibliographic Notes

Exercises

Data

Types of Data

Attributes and Measurement

Types of Data Sets

Data Quality

Measurement and Data Collection Issues

Issues Related to Applications

Data Preprocessing

Aggregation

Sampling

Dimensionality Reduction

Feature Subset Selection

Feature Creation

Discretization and Binarization

Variable Transformation

Measures of Similarity and Dissimilarity

Basics

Similarity and Dissimilarity between Simple Attributes

Dissimilarities between Data Objects

Similarities between Data Objects

Examples of Proximity Measures

Issues in Proximity Calculation

Selecting the Right Proximity Measure

Bibliographic Notes

Exercises

Exploring Data

The Iris Data Set

Summary Statistics

Frequencies and the Mode

Percentiles

Measures of Location: Mean and Median

Measures of Spread: Range and Variance

Multivariate Summary Statistics

Other Ways to Summarize the Data

Visualization

Motivations for Visualization

General Concepts

Techniques

Visualizing Higher-Dimensional Data

Do's and Don'ts

OLAP and Multidimensional Data Analysis

Representing Iris Data as a Multidimensional Array

Multidimensional Data: The General Case

Analyzing Multidimensional Data

Final Comments on Multidimensional Data Analysis

Bibliographic Notes

Exercises

Classification: Basic Concepts, Decision Trees, and Model Evaluation

Preliminaries

General Approach to Solving a Classification Problem

Decision Tree Induction

How a Decision Tree Works

How to Build a Decision Tree

Methods for Expressing Attribute Test Conditions

Measures for Selecting the Best Split

Algorithm for Decision Tree Induction

An Example: Web Robot Detection

Characteristics of Decision Tree Induction

Model Overfitting

Overfitting Due to Presence of Noise

Overfitting Due to Lack of Representative Samples

Overfitting and the Multiple Comparison Procedure

Estimation of Generalization Errors

Handling Overfitting in Decision Tree Induction

Evaluating the Performance of a Classifier

Holdout Method

Random Subsampling

Cross-Validation

Bootstrap

Methods for Comparing Classifiers

Estimating a Confidence Interval for Accuracy

Comparing the Performance of Two Models

Comparing the Performance of Two Classifiers

Bibliographic Notes

Exercises

Classification: Alternative Techniques

Rule-Based Classifier

How a Rule-Based Classifier Works

Rule-Ordering Schemes

How to Build a Rule-Based Classifier

Direct Methods for Rule Extraction

Indirect Methods for Rule Extraction

Characteristics of Rule-Based Classifiers

Nearest-Neighbor Classifiers

Algorithm

Characteristics of Nearest-Neighbor Classifiers

Bayesian Classifiers

Bayes Theorem

Using the Bayes Theorem for Classification

Naive Bayes Classifier

Bayes Error Rate

Bayesian Belief Networks

Artificial Neural Network (ANN)

Perceptron

Multilayer Artificial Neural Network

Characteristics of ANN

Support Vector Machine (SVM)

Maximum Margin Hyperplanes

Linear SVM: Separable Case

Linear SVM: Nonseparable Case

Nonlinear SVM

Characteristics of SVM

Ensemble Methods

Rationale for Ensemble Method

Methods for Constructing an Ensemble Classifier

Bias-Variance Decomposition

Bagging

Boosting

Random Forests

Empirical Comparison among Ensemble Methods

Class Imbalance Problem

Alternative Metrics

The Receiver Operating Characteristic Curve

Cost-Sensitive Learning

Sampling-Based Approaches

Multiclass Problem

Bibliographic Notes

Exercises

Association Analysis: Basic Concepts and Algorithms

Problem Definition

Frequent Itemset Generation

The Apriori Principle

Frequent Itemset Generation in the Apriori Algorithm

Candidate Generation and Pruning

Support Counting

Computational Complexity

Rule Generation

Confidence-Based Pruning

Rule Generation in Apriori Algorithm

An Example: Congressional Voting Records

Compact Representation of Frequent Itemsets

Maximal Frequent Itemsets

Closed Frequent Itemsets

Alternative Methods for Generating Frequent Itemsets

FP-Growth Algorithm

FP-Tree Representation

Frequent Itemset Generation in FP-Growth Algorithm

Evaluation of Association Patterns

Objective Measures of Interestingness

Measures beyond Pairs of Binary Variables

Simpson's Paradox

Effect of Skewed Support Distribution

Bibliographic Notes

Exercises

Association Analysis: Advanced Concepts

Handling Categorical Attributes

Handling Continuous Attributes

Discretization-Based Methods

Statistics-Based Methods

Non-discretization Methods

Handling a Concept Hierarchy

Sequential Patterns

Problem Formulation

Sequential Pattern Discovery

Timing Constraints

Alternative Counting Schemes

Subgraph Patterns

Graphs and Subgraphs

Frequent Subgraph Mining

Apriori-like Method

Candidate Generation

Candidate Pruning

Support Counting

Infrequent Patterns

Negative Patterns

Negatively Correlated Patterns

Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns

Techniques for Mining Interesting Infrequent Patterns

Techniques Based on Mining Negative Patterns

Techniques Based on Support Expectation

Bibliographic Notes

Exercises

Cluster Analysis: Basic Concepts and Algorithms

Overview

What Is Cluster Analysis?

Different Types of Clusterings

Different Types of Clusters

K-means

The Basic K-means Algorithm

K-means: Additional Issues

Bisecting K-means

K-means and Different Types of Clusters

Strengths and Weaknesses

K-means as an Optimization Problem

Agglomerative Hierarchical Clustering

Basic Agglomerative Hierarchical Clustering Algorithm

Specific Techniques

The Lance-Williams Formula for Cluster Proximity

Key Issues in Hierarchical Clustering

Strengths and Weaknesses

DBSCAN

Traditional Density: Center-Based Approach

The DBSCAN Algorithm

Strengths and Weaknesses

Cluster Evaluation

Overview

Unsupervised Cluster Evaluation Using Cohesion and Separation

Unsupervised Cluster Evaluation Using the Proximity Matrix

Unsupervised Evaluation of Hierarchical Clustering

Determining the Correct Number of Clusters

Clustering Tendency

Supervised Measures of Cluster Validity

Assessing the Significance of Cluster Validity Measures

Bibliographic Notes

Exercises

Cluster Analysis: Additional Issues and Algorithms

Characteristics of Data, Clusters, and Clustering Algorithms

Example: Comparing K-means and DBSCAN

Data Characteristics

Cluster Characteristics

General Characteristics of Clustering Algorithms

Prototype-Based Clustering

Fuzzy Clustering

Clustering Using Mixture Models

Self-Organizing Maps (SOM)

Density-Based Clustering

Grid-Based Clustering

Subspace Clustering

DENCLUE: A Kernel-Based Scheme for Density-Based Clustering

Graph-Based Clustering

Sparsification

Minimum Spanning Tree (MST) Clustering

OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS

Chameleon: Hierarchical Clustering with Dynamic Modeling

Shared Nearest Neighbor Similarity

The Jarvis-Patrick Clustering Algorithm

SNN Density

SNN Density-Based Clustering

Scalable Clustering Algorithms

Scalability: General Issues and Approaches

BIRCH

CURE

Which Clustering Algorithm?

Bibliographic Notes

Exercises

Anomaly Detection

Preliminaries

Causes of Anomalies

Approaches to Anomaly Detection

The Use of Class Labels

Issues

Statistical Approaches

Detecting Outliers in a Univariate Normal Distribution

Outliers in a Multivariate Normal Distribution

A Mixture Model Approach for Anomaly Detection

Strengths and Weaknesses

Proximity-Based Outlier Detection

Strengths and Weaknesses

Density-Based Outlier Detection

Detection of Outliers Using Relative Density

Strengths and Weaknesses

Clustering-Based Techniques

Assessing the Extent to Which an Object Belongs to a Cluster

Impact of Outliers on the Initial Clustering

The Number of Clusters to Use

Strengths and Weaknesses

Bibliographic Notes

Exercises

Appendix A Linear Algebra

Vectors

Definition

Vector Addition and Multiplication by a Scalar

Vector Spaces

The Dot Product, Orthogonality, and Orthogonal Projections

Vectors and Data Analysis

Matrices

Matrices: Definitions

Matrices: Addition and Multiplication by a Scalar

Matrices: Multiplication

Linear Transformations and Inverse Matrices

Eigenvalue and Singular Value Decomposition

Matrices and Data Analysis

Bibliographic Notes

Appendix B Dimensionality Reduction

PCA and SVD

Principal Components Analysis (PCA)

SVD

Other Dimensionality Reduction Techniques

Factor Analysis

Locally Linear Embedding (LLE)

Multidimensional Scaling, FastMap, and ISOMAP

Common Issues

Bibliographic Notes

Appendix C Probability and Statistics

Probability

Expected Values

Statistics

Point Estimation

Central Limit Theorem

Interval Estimation

Hypothesis Testing

Appendix D Regression

Preliminaries

Simple Linear Regression

Least Square Method

Analyzing Regression Errors

Analyzing Goodness of Fit

Multivariate Linear Regression

Alternative Least-Square Regression Methods

Appendix E Optimization

Unconstrained Optimization

Numerical Methods

Constrained Optimization

Equality Constraints

Inequality Constraints

Author Index

Subject Index

Introduction to Data Mining

ISBN-13: 9780321321367

ISBN-10: 0321321367
