Preface | p. ix |
Acknowledgments | p. xi |
List of Figures | p. xiii |
List of Tables | p. xv |
Introduction | p. 1 |
How to Read This Book? | p. 2 |
A Short Introduction to R | p. 3 |
Starting with R | p. 3 |
R Objects | p. 5 |
Vectors | p. 7 |
Vectorization | p. 10 |
Factors | p. 11 |
Generating Sequences | p. 14 |
Sub-Setting | p. 16 |
Matrices and Arrays | p. 19 |
Lists | p. 23 |
Data Frames | p. 26 |
Creating New Functions | p. 30 |
Objects, Classes, and Methods | p. 33 |
Managing Your Sessions | p. 34 |
A Short Introduction to MySQL | p. 35 |
Predicting Algae Blooms | p. 39 |
Problem Description and Objectives | p. 39 |
Data Description | p. 40 |
Loading the Data into R | p. 41 |
Data Visualization and Summarization | p. 43 |
Unknown Values | p. 52 |
Removing the Observations with Unknown Values | p. 53 |
Filling in the Unknowns with the Most Frequent Values | p. 55 |
Filling in the Unknown Values by Exploring Correlations | p. 56 |
Filling in the Unknown Values by Exploring Similarities between Cases | p. 60 |
Obtaining Prediction Models | p. 63 |
Multiple Linear Regression | p. 64 |
Regression Trees | p. 71 |
Model Evaluation and Selection | p. 77 |
Predictions for the Seven Algae | p. 91 |
Summary | p. 94 |
Predicting Stock Market Returns | p. 95 |
Problem Description and Objectives | p. 95 |
The Available Data | p. 96 |
Handling Time-Dependent Data in R | p. 97 |
Reading the Data from the CSV File | p. 101 |
Getting the Data from the Web | p. 102 |
Reading the Data from a MySQL Database | p. 104 |
Loading the Data into R Running on Windows | p. 105 |
Loading the Data into R Running on Linux | p. 107 |
Defining the Prediction Tasks | p. 108 |
What to Predict? | p. 108 |
Which Predictors? | p. 111 |
The Prediction Tasks | p. 117 |
Evaluation Criteria | p. 118 |
The Prediction Models | p. 120 |
How Will the Training Data Be Used? | p. 121 |
The Modeling Tools | p. 123 |
Artificial Neural Networks | p. 123 |
Support Vector Machines | p. 126 |
Multivariate Adaptive Regression Splines | p. 129 |
From Predictions into Actions | p. 130 |
How Will the Predictions Be Used? | p. 130 |
Trading-Related Evaluation Criteria | p. 132 |
Putting Everything Together: A Simulated Trader | p. 133 |
Model Evaluation and Selection | p. 141 |
Monte Carlo Estimates | p. 141 |
Experimental Comparisons | p. 143 |
Results Analysis | p. 148 |
The Trading System | p. 156 |
Evaluation of the Final Test Data | p. 156 |
An Online Trading System | p. 162 |
Summary | p. 163 |
Detecting Fraudulent Transactions | p. 165 |
Problem Description and Objectives | p. 165 |
The Available Data | p. 166 |
Loading the Data into R | p. 166 |
Exploring the Dataset | p. 167 |
Data Problems | p. 174 |
Unknown Values | p. 175 |
Few Transactions of Some Products | p. 179 |
Defining the Data Mining Tasks | p. 183 |
Different Approaches to the Problem | p. 183 |
Unsupervised Techniques | p. 184 |
Supervised Techniques | p. 185 |
Semi-Supervised Techniques | p. 186 |
Evaluation Criteria | p. 187 |
Precision and Recall | p. 188 |
Lift Charts and Precision/Recall Curves | p. 188 |
Normalized Distance to Typical Price | p. 193 |
Experimental Methodology | p. 194 |
Obtaining Outlier Rankings | p. 195 |
Unsupervised Approaches | p. 196 |
The Modified Box Plot Rule | p. 196 |
Local Outlier Factors (LOF) | p. 201 |
Clustering-Based Outlier Rankings (ORh) | p. 205 |
Supervised Approaches | p. 208 |
The Class Imbalance Problem | p. 209 |
Naive Bayes | p. 211 |
AdaBoost | p. 217 |
Semi-Supervised Approaches | p. 223 |
Summary | p. 230 |
Classifying Microarray Samples | p. 233 |
Problem Description and Objectives | p. 233 |
Brief Background on Microarray Experiments | p. 233 |
The ALL Dataset | p. 234 |
The Available Data | p. 235 |
Exploring the Dataset | p. 238 |
Gene (Feature) Selection | p. 241 |
Simple Filters Based on Distribution Properties | p. 241 |
ANOVA Filters | p. 244 |
Filtering Using Random Forests | p. 246 |
Filtering Using Feature Clustering Ensembles | p. 248 |
Predicting Cytogenetic Abnormalities | p. 251 |
Defining the Prediction Task | p. 251 |
The Evaluation Metric | p. 252 |
The Experimental Procedure | p. 253 |
The Modeling Techniques | p. 254 |
Random Forests | p. 254 |
k-Nearest Neighbors | p. 255 |
Comparing the Models | p. 258 |
Summary | p. 267 |
Bibliography | p. 269 |
Subject Index | p. 279 |
Index of Data Mining Topics | p. 285 |
Index of R Functions | p. 287 |
Table of Contents provided by Ingram. All Rights Reserved. |