Data Mining with R: Learning with Case Studies

  • Format: Hardcover
  • Copyright: 2010-11-09
  • Publisher: Chapman & Hall




This book provides a self-contained introduction to the use of R for exploratory data mining and machine learning. Employing a practical, learn-by-doing approach, the author presents a series of representative case studies from ecology, financial prediction, fraud detection, and bioinformatics, including all of the necessary steps, code, and data. These examples demonstrate how to address important data mining issues, such as handling data sets with too many variables, and illustrate key concepts, including outlier detection and semisupervised learning. A supporting web page provides additional code and data for further study.

Table of Contents

Preface
Acknowledgments
List of Figures
List of Tables
Introduction
How to Read This Book?
A Short Introduction to R
Starting with R
R Objects
Vectors
Vectorization
Factors
Generating Sequences
Sub-Setting
Matrices and Arrays
Lists
Data Frames
Creating New Functions
Objects, Classes, and Methods
Managing Your Sessions
A Short Introduction to MySQL
Predicting Algae Blooms
Problem Description and Objectives
Data Description
Loading the Data into R
Data Visualization and Summarization
Unknown Values
Removing the Observations with Unknown Values
Filling in the Unknowns with the Most Frequent Values
Filling in the Unknown Values by Exploring Correlations
Filling in the Unknown Values by Exploring Similarities between Cases
Obtaining Prediction Models
Multiple Linear Regression
Regression Trees
Model Evaluation and Selection
Predictions for the Seven Algae
Summary
Predicting Stock Market Returns
Problem Description and Objectives
The Available Data
Handling Time-Dependent Data in R
Reading the Data from the CSV File
Getting the Data from the Web
Reading the Data from a MySQL Database
Loading the Data into R Running on Windows
Loading the Data into R Running on Linux
Defining the Prediction Tasks
What to Predict?
Which Predictors?
The Prediction Tasks
Evaluation Criteria
The Prediction Models
How Will the Training Data Be Used?
The Modeling Tools
Artificial Neural Networks
Support Vector Machines
Multivariate Adaptive Regression Splines
From Predictions into Actions
How Will the Predictions Be Used?
Trading-Related Evaluation Criteria
Putting Everything Together: A Simulated Trader
Model Evaluation and Selection
Monte Carlo Estimates
Experimental Comparisons
Results Analysis
The Trading System
Evaluation of the Final Test Data
An Online Trading System
Summary
Detecting Fraudulent Transactions
Problem Description and Objectives
The Available Data
Loading the Data into R
Exploring the Dataset
Data Problems
Unknown Values
Few Transactions of Some Products
Defining the Data Mining Tasks
Different Approaches to the Problem
Unsupervised Techniques
Supervised Techniques
Semi-Supervised Techniques
Evaluation Criteria
Precision and Recall
Lift Charts and Precision/Recall Curves
Normalized Distance to Typical Price
Experimental Methodology
Obtaining Outlier Rankings
Unsupervised Approaches
The Modified Box Plot Rule
Local Outlier Factors (LOF)
Clustering-Based Outlier Rankings (ORh)
Supervised Approaches
The Class Imbalance Problem
Naive Bayes
AdaBoost
Semi-Supervised Approaches
Summary
Classifying Microarray Samples
Problem Description and Objectives
Brief Background on Microarray Experiments
The ALL Dataset
The Available Data
Exploring the Dataset
Gene (Feature) Selection
Simple Filters Based on Distribution Properties
ANOVA Filters
Filtering Using Random Forests
Filtering Using Feature Clustering Ensembles
Predicting Cytogenetic Abnormalities
Defining the Prediction Task
The Evaluation Metric
The Experimental Procedure
The Modeling Techniques
Random Forests
k-Nearest Neighbors
Comparing the Models
Summary
Bibliography
Subject Index
Index of Data Mining Topics
Index of R Functions
Table of Contents provided by Ingram. All Rights Reserved.
