Modern analysis of high energy physics (HEP) data requires advanced statistical tools to separate signal from background. This is the first book to focus on machine learning techniques for this purpose. It will be of interest to almost every high energy physicist and, given its broad coverage, is also suitable for students.
The authors are experts in the use of statistics in particle physics data analysis. Frank C. Porter is Professor of Physics at the California Institute of Technology and has lectured extensively at Caltech, the SLAC laboratory at Stanford, and elsewhere. Ilya Narsky is Senior MATLAB Developer at The MathWorks, a leading developer of technical computing software for engineers and scientists, and the initiator of StatPatternRecognition, a C++ package for statistical analysis of HEP data. Together, they have taught courses for graduate students and postdocs.
Table of Contents
PARAMETRIC UNBINNED LIKELIHOOD FITS: Fits for Small Statistics; Fits Near the Boundary of the Physical Region; Likelihood Ratio Test for Presence of Signal; sPlots
GOODNESS OF FIT PROBLEM: Binned Goodness-of-Fit Tests; Statistics Converging to Chi-Square; Univariate Unbinned Goodness-of-Fit Tests: Kolmogorov-Smirnov, Anderson-Darling, Watson, and Neyman Smooth Tests; Multivariate Tests
RESAMPLING TECHNIQUES: Jackknife, Bootstrap and Cross-Validation; Choice of the Optimal Resampling Method: Bias, Variance and the Learning Curve; Resampling Weighted Observations
NON-PARAMETRIC DENSITY ESTIMATION: Equal-Bin and Adaptive Histograms; Optimal Binning; Density Estimation by Kernels; Optimal Kernel Size; The Curse of Dimensionality
BASIC CONCEPTS AND DEFINITIONS OF MACHINE LEARNING: Supervised, Unsupervised and Semi-Supervised Learning; Batch and Online Learning; Sequential and Parallel Learning; Classification and Regression; Training, Validation and Test; Categorical Variables; Missing Values
DATA PRE-PROCESSING: Linear Transformations and Dimensionality Reduction; Principal and Independent Component Analysis; Partial Least Squares
INTRODUCTION TO CLASSIFICATION: Forms of Classification Loss; Perfect Classifier to Bayes 0-1 Loss; Bias and Variance of a Classifier; Data with Unbalanced Classes and Unequal Misclassification Costs
MONITORING CLASSIFIER PERFORMANCE: ROC and Other Performance Curves; Confidence Bounds for ROC Curves; Testing if One Classifier Outperforms Another; Comparison Across Multiple Classifiers
LINEAR AND QUADRATIC DISCRIMINANT ANALYSIS: Testing Multivariate Normality; Logistic Regression for Data with Two Classes
BUMP HUNTING IN HIGH-DIMENSIONAL DATA: Search for Rectangular Regions by High-Dimensional Optimization and PRIM Algorithms; Voronoi Tessellation and SLEUTH Algorithm
NEURAL NETWORKS: Back-Propagation; Activation Functions; Optimization of Neural Nets by Genetic Algorithms
LOCAL METHODS: Nearest Neighbors; Radial Basis Functions, Thin-Plate Splines, Regularized Least Squares, and Support Vector Machines (SVM); Multiclass Extensions of SVM; The Curse of Dimensionality
DECISION TREES: Splitting Criteria; Binary and Multiway Splits; Pruning Decision Trees; Surrogate Splits and Their Uses
ENSEMBLE METHODS: Boosting: AdaBoostM1 Algorithm; Boosting Multiclass Learners; Bagging and Random Forest; Boosting as Fitting a Stagewise Additive Model; Convex Loss Functions and Label Noise; Pruning (Post-Fitting) Ensembles
UNIFIED APPROACH FOR REDUCING MULTICLASS TO BINARY: Error Correcting Output Code
COMBINING CLASSIFIERS: Trainable Combiners; Mixture of Experts; Optimal Linear Combination of Classifiers; Stacked Generalization
METHODS FOR VARIABLE SELECTION: Filters, Wrappers and Embedded Methods; Relevance and Redundancy of Variables; Variable Ranking and Optimal Subset Selection; Generalized Sequential Forward Addition and Backward Elimination; Variable Importance Using Nearest Neighbors (ReliefF Algorithm); Variable Importance from Randomized Subsets; Variable Importance from Decision Trees
SURVEY OF SOFTWARE PACKAGES FOR MACHINE LEARNING