
Data Mining : Practical Machine Learning Tools and Techniques with Java Implementations

by Ian H. Witten; Eibe Frank
  • ISBN13: 9781558605527
  • ISBN10: 1558605525
  • Format: Paperback
  • Copyright: 1999-10-11
  • Publisher: Elsevier Science

Note: Supplemental materials are not guaranteed with Rental or Used book purchases.

Purchase Benefits

  • Free Shipping On Orders Over $35!
    Your order must be $35 or more to qualify for free economy shipping. Bulk sales, POs, Marketplace items, eBooks, and apparel do not qualify for this offer.
  • Get Rewarded for Ordering Your Textbooks! Enroll Now
  • Complimentary 7-Day eTextbook Access
    When you rent or buy this book, you will receive complimentary 7-day online access to the eTextbook version from your PC, Mac, tablet, or smartphone. This feature is not included on Marketplace items.
List Price: $60.95 (save up to $15.24)

  • Buy Used: $45.71
    Free shipping. Usually ships in 2-4 business days. Includes 7-day eTextbook access.

Summary

This book offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. Inside, you'll learn all you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining, including both tried-and-true techniques of the past and Java-based methods at the leading edge of contemporary research. If you're involved at any level in the work of extracting usable knowledge from large collections of data, this clearly written and effectively illustrated book will prove an invaluable resource. Complementing the authors' instruction is a fully functional, platform-independent Java software system for machine learning, available for download. Apply it to the sample datasets provided to refine your data mining skills, apply it to your own data to discern meaningful patterns and generate valuable insights, adapt it for your specialized data mining applications, or use it to develop your own machine learning schemes.

  • Helps you select appropriate approaches to particular problems and compare and evaluate the results of different techniques.
  • Covers performance improvement techniques, including input preprocessing and combining output from different methods.
  • Comes with downloadable machine learning software: use it to master the techniques covered inside, apply it to your own projects, or customize it to meet special needs.
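The downloadable software is the authors' Weka workbench, written in Java (the table of contents below covers its weka.core and weka.classifiers packages and the ARFF data format). As a rough illustration of how it is driven from code, here is a minimal sketch, not taken from the book, that loads a sample ARFF dataset, induces a decision tree, and cross-validates it. It assumes the later Weka 3 package layout (the edition listed here shipped older paths such as weka.classifiers.j48.J48) and a weather.arff file on disk.

    // Minimal sketch (not from the book): load an ARFF dataset, build a
    // C4.5-style decision tree, and estimate accuracy by cross-validation.
    // Package paths assume the Weka 3 API and may differ in older releases.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class WeatherDemo {
        public static void main(String[] args) throws Exception {
            // weather.arff is one of the small sample datasets shipped with Weka.
            Instances data = new Instances(
                    new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1); // last attribute = class

            J48 tree = new J48();       // decision tree learner (C4.5 descendant)
            tree.buildClassifier(data); // induce the tree from all instances
            System.out.println(tree);   // print the tree in textual form

            // Estimate predictive performance with 10-fold cross-validation.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }

The book's "Nuts and bolts" chapter walks through the same ground in depth: running the schemes from the command line, embedding them in your own Java programs, and writing new classifiers and filters.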

Author Biography

Ian H. Witten is professor of computer science at the University of Waikato in New Zealand. He is a fellow of the ACM and the Royal Society of New Zealand and a member of professional computing, information retrieval, and engineering associations in the U.K., U.S., Canada, and New Zealand. Eibe Frank is a researcher in the Machine Learning group at the University of Waikato. He holds a degree in computer science from the University of Karlsruhe in Germany and is the author of several papers presented at machine learning conferences and published in machine learning journals.

Table of Contents

Foreword v
Preface xix
What's it all about? 1(36)
Data mining and machine learning 2(6)
Describing structural patterns 4(1)
Machine learning 5(2)
Data mining 7(1)
Simple examples: The weather problem and others 8(12)
The weather problem 8(3)
Contact lenses: An idealized problem 11(2)
Irises: A classic numeric dataset 13(2)
CPU performance: Introducing numeric prediction 15(1)
Labor negotiations: A more realistic example 16(1)
Soybean classification: A classic machine learning success 17(3)
Fielded applications 20(6)
Decisions involving judgment 21(1)
Screening images 22(1)
Load forecasting 23(1)
Diagnosis 24(1)
Marketing and sales 25(1)
Machine learning and statistics 26(1)
Generalization as search 27(5)
Enumerating the concept space 28(1)
Bias 29(3)
Data mining and ethics 32(2)
Further reading 34(3)
Input: Concepts, instances, attributes 37(20)
What's a concept? 38(3)
What's in an example? 41(4)
What's in an attribute? 45(3)
Preparing the input 48(7)
Gathering the data together 48(1)
Arff format 49(2)
Attribute types 51(1)
Missing values 52(1)
Inaccurate values 53(1)
Getting to know your data 54(1)
Further reading 55(2)
Output: Knowledge representation 57(20)
Decision tables 58(1)
Decision trees 58(1)
Classification rules 59(4)
Association rules 63(1)
Rules with exceptions 64(3)
Rules involving relations 67(3)
Trees for numeric prediction 70(2)
Instance-based representation 72(3)
Clusters 75(1)
Further reading 76(1)
Algorithms: The basic methods 77(42)
Inferring rudimentary rules 78(4)
Missing values and numeric attributes 80(1)
Discussion 81(1)
Statistical modeling 82(7)
Missing values and numeric attributes 85(3)
Discussion 88(1)
Divide and conquer: Constructing decision trees 89(8)
Calculating information 93(1)
Highly branching attributes 94(3)
Discussion 97(1)
Covering algorithms: Constructing rules 97(7)
Rules versus trees 98(1)
A simple covering algorithm 98(5)
Rules versus decision lists 103(1)
Mining association rules 104(8)
Item sets 105(1)
Association rules 105(3)
Generating rules efficiently 108(3)
Discussion 111(1)
Linear models 112(2)
Numeric prediction 112(1)
Classification 113(1)
Discussion 113(1)
Instance-based learning 114(2)
The distance function 114(1)
Discussion 115(1)
Further reading 116(3)
Credibility: Evaluating what's been learned 119(38)
Training and testing 120(3)
Predicting performance 123(2)
Cross-validation 125(2)
Other estimates 127(2)
Leave-one-out 127(1)
The bootstrap 128(1)
Comparing data mining schemes 129(4)
Predicting probabilities 133(4)
Quadratic loss function 134(1)
Informational loss function 135(1)
Discussion 136(1)
Counting the cost 137(10)
Lift charts 139(2)
ROC curves 141(3)
Cost-sensitive learning 144(1)
Discussion 145(2)
Evaluating numeric prediction 147(3)
The minimum description length principle 150(4)
Applying MDL to clustering 154(1)
Further reading 155(2)
Implementations: Real machine learning schemes 157(72)
Decision trees 159(11)
Numeric attributes 159(2)
Missing values 161(1)
Pruning 162(2)
Estimating error rates 164(3)
Complexity of decision tree induction 167(1)
From trees to rules 168(1)
C4.5: Choices and options 169(1)
Discussion 169(1)
Classification rules 170(18)
Criteria for choosing tests 171(1)
Missing values, numeric attributes 172(1)
Good rules and bad rules 173(1)
Generating good rules 174(1)
Generating good decision lists 175(2)
Probability measure for rule evaluation 177(1)
Evaluating rules using a test set 178(3)
Obtaining rules from partial decision trees 181(3)
Rules with exceptions 184(3)
Discussion 187(1)
Extending linear classification: Support vector machines 188(5)
The maximum margin hyperplane 189(2)
Nonlinear class boundaries 191(2)
Discussion 193(1)
Instance-based learning 193(8)
Reducing the number of exemplars 194(1)
Pruning noisy exemplars 194(1)
Weighting attributes 195(1)
Generalizing exemplars 196(1)
Distance functions for generalized exemplars 197(2)
Generalized distance functions 199(1)
Discussion 200(1)
Numeric prediction 201(9)
Model trees 202(1)
Building the tree 202(1)
Pruning the tree 203(1)
Nominal attributes 204(1)
Missing values 204(1)
Pseudo-code for model tree induction 205(3)
Locally weighted linear regression 208(1)
Discussion 209(1)
Clustering 210(19)
Iterative distance-based clustering 211(1)
Incremental clustering 212(5)
Category utility 217(1)
Probability-based clustering 218(3)
The EM algorithm 221(2)
Extending the mixture model 223(2)
Bayesian clustering 225(1)
Discussion 226(3)
Moving on: Engineering the input and output 229(36)
Attribute selection 232(6)
Scheme-independent selection 233(2)
Searching the attribute space 235(1)
Scheme-specific selection 236(2)
Discretizing numeric attributes 238(9)
Unsupervised discretization 239(1)
Entropy-based discretization 240(3)
Other discretization methods 243(1)
Entropy-based versus error-based discretization 244(2)
Converting discrete to numeric attributes 246(1)
Automatic data cleansing 247(3)
Improving decision trees 247(1)
Robust regression 248(1)
Detecting anomalies 249(1)
Combining multiple models 250(13)
Bagging 251(3)
Boosting 254(4)
Stacking 258(2)
Error-correcting output codes 260(3)
Further reading 263(2)
Nuts and bolts: Machine learning algorithms in Java 265(56)
Getting started 267(4)
Javadoc and the class library 271(6)
Classes, instances, and packages 272(1)
The weka.core package 272(2)
The weka.classifiers package 274(2)
Other packages 276(1)
Indexes 277(1)
Processing datasets using the machine learning programs 277(20)
Using M5' 277(2)
Generic options 279(3)
Scheme-specific options 282(1)
Classifiers 283(3)
Meta-learning schemes 286(3)
Filters 289(5)
Association rules 294(2)
Clustering 296(1)
Embedded machine learning 297(9)
A simple message classifier 299(7)
Writing new learning schemes 306(15)
An example classifier 306(8)
Conventions for implementing classifiers 314(1)
Writing filters 314(2)
An example filter 316(1)
Conventions for writing filters 317(4)
Looking forward 321(18)
Learning from massive datasets 322(3)
Visualizing machine learning 325(4)
Visualizing the input 325(2)
Visualizing the output 327(2)
Incorporating domain knowledge 329(2)
Text mining 331(4)
Finding key phrases for documents 331(2)
Finding information in running text 333(1)
Soft parsing 334(1)
Mining the World Wide Web 335(1)
Further reading 336(3)
References 339(12)
Index 351(20)
About the authors 371

Supplemental Materials

What is included with this book?

The New copy of this book will include any supplemental materials advertised. Please check the title of the book to determine if it should include any access cards, study guides, lab manuals, CDs, etc.

The Used, Rental and eBook copies of this book are not guaranteed to include any supplemental materials. Typically, only the book itself is included. This is true even if the title states it includes any access cards, study guides, lab manuals, CDs, etc.
