9781439810187

This book provides a self-contained introduction to the use of R for exploratory data mining and machine learning. Employing a practical, learn-by-doing approach, the author presents a series of representative case studies from ecology, financial prediction, fraud detection, and bioinformatics, including all of the necessary steps, code, and data. These examples demonstrate how to address important data mining issues, such as handling data sets with too many variables, and illustrate key concepts, including outlier detection and semisupervised learning. A supporting web page provides additional code and data for further study.

Preface
Acknowledgments
List of Figures
List of Tables
Introduction
How to Read This Book?
A Short Introduction to R
Starting with R
R Objects
Vectors
Vectorization
Factors
Generating Sequences
Sub-Setting
Matrices and Arrays
Lists
Data Frames
Creating New Functions
Objects, Classes, and Methods
Managing Your Sessions
A Short Introduction to MySQL
Predicting Algae Blooms
Problem Description and Objectives
Data Description
Loading the Data into R
Data Visualization and Summarization
Unknown Values
Removing the Observations with Unknown Values
Filling in the Unknowns with the Most Frequent Values
Filling in the Unknown Values by Exploring Correlations
Filling in the Unknown Values by Exploring Similarities between Cases
Obtaining Prediction Models
Multiple Linear Regression
Regression Trees
Model Evaluation and Selection
Predictions for the Seven Algae
Summary
Predicting Stock Market Returns
Problem Description and Objectives
The Available Data
Handling Time-Dependent Data in R
Reading the Data from the CSV File
Getting the Data from the Web
Reading the Data from a MySQL Database
Loading the Data into R Running on Windows
Loading the Data into R Running on Linux
Defining the Prediction Tasks
What to Predict?
Which Predictors?
The Prediction Tasks
Evaluation Criteria
The Prediction Models
How Will the Training Data Be Used?
The Modeling Tools
Artificial Neural Networks
Support Vector Machines
Multivariate Adaptive Regression Splines
From Predictions into Actions
How Will the Predictions Be Used?
Trading-Related Evaluation Criteria
Putting Everything Together: A Simulated Trader
Model Evaluation and Selection
Monte Carlo Estimates
Experimental Comparisons
Results Analysis
The Trading System
Evaluation of the Final Test Data
An Online Trading System
Summary
Detecting Fraudulent Transactions
Problem Description and Objectives
The Available Data
Loading the Data into R
Exploring the Dataset
Data Problems
Unknown Values
Few Transactions of Some Products
Defining the Data Mining Tasks
Different Approaches to the Problem
Unsupervised Techniques
Supervised Techniques
Semi-Supervised Techniques
Evaluation Criteria
Precision and Recall
Lift Charts and Precision/Recall Curves
Normalized Distance to Typical Price
Experimental Methodology
Obtaining Outlier Rankings
Unsupervised Approaches
The Modified Box Plot Rule
Local Outlier Factors (LOF)
Clustering-Based Outlier Rankings (OR_h)
Supervised Approaches
The Class Imbalance Problem
Naive Bayes
AdaBoost
Semi-Supervised Approaches
Summary
Classifying Microarray Samples
Problem Description and Objectives
Brief Background on Microarray Experiments
The ALL Dataset
The Available Data
Exploring the Dataset
Gene (Feature) Selection
Simple Filters Based on Distribution Properties
ANOVA Filters
Filtering Using Random Forests
Filtering Using Feature Clustering Ensembles
Predicting Cytogenetic Abnormalities
Defining the Prediction Task
The Evaluation Metric
The Experimental Procedure
The Modeling Techniques
Random Forests
k-Nearest Neighbors
Comparing the Models
Summary
Bibliography
Subject Index
Index of Data Mining Topics
Index of R Functions
Table of Contents provided by Ingram. All Rights Reserved.

Data Mining with R: Learning with Case Studies

1439810184
