Statistical Horizons Seminar: Data Mining


A two-day seminar taught by Dr Robert Stine

Data mining has become respectable, useful, even essential.

Data mining is based on highly automated statistical algorithms that are able to identify reproducible patterns in “wide” data sets. Wide data sets have many columns (or variables), often more columns than rows.

If you have any of the following problems, then data mining can help:
• Your models fit well when you build them, but do poorly when implemented.
• You have so many variables, you don’t know where to begin.
• Your predictions just are not accurate enough to beat the competition.

Rather than building a model that relies on a few carefully chosen measurements in an experiment, data mining commonly involves a search for patterns in a wide data set. These searches might consider 100,000 or more features, looking for the few that predict the response. Data mining is also useful if you don’t have the world’s largest data set and just want to make better use of the information you do have.
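The risk in such a search can be sketched with a small simulation. This is an illustration only, not material from the course: the function names, sample sizes, and seed are all assumptions chosen for the example. Every feature here is pure noise, yet the one that looks best on the data used to pick it appears strongly related to the response, and that relationship fades on rows held out from the search.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def noise_search(n_train=20, n_test=200, n_features=2000, seed=1):
    """Pick the noise feature most correlated with y on the training rows.

    All sizes and the seed are hypothetical values for illustration.
    Returns (training correlation, holdout correlation) of the winner.
    """
    rng = random.Random(seed)
    n = n_train + n_test
    # The response is unrelated to every feature by construction.
    y = [rng.gauss(0, 1) for _ in range(n)]
    best_r, best_col = 0.0, None
    for _ in range(n_features):
        col = [rng.gauss(0, 1) for _ in range(n)]
        # Judge each feature only on the rows used to build the model.
        r = pearson(col[:n_train], y[:n_train])
        if abs(r) > abs(best_r):
            best_r, best_col = r, col
    # Re-evaluate the winning feature on rows the search never saw.
    holdout_r = pearson(best_col[n_train:], y[n_train:])
    return best_r, holdout_r

train_r, test_r = noise_search()
print(f"build-time correlation: {train_r:+.2f}")
print(f"holdout correlation:    {test_r:+.2f}")
```

With far more columns than rows, some noise feature will nearly always look predictive on the build data. This is why checking a selected feature on held-out rows matters when the search is wide.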

Learn data mining by focusing on applications.

This course takes you through cases that use data mining to solve real-world problems like these:
• Identify patients who are most at risk of a disease
• Pick out the best prospective job candidates
• Predict which credit applications are fraudulent

Success with data mining requires more than fitting a model. The analyses of these and other cases start by identifying the problem, continue by digging through the relevant data, and wrap up with a discussion of how to present the results to others. Better predictions won’t solve any problems unless you can present those results in a way that others can grasp, appreciate, and act on.

The course leaves you able to start using these methods right away. Data mining does not require exotic hardware or software. Once you understand how to use regression for data mining, you’ll be able to appreciate the strengths and weaknesses of newer methods. With that in mind, the class starts with a data miner’s view of regression and then moves to neural networks, classification and regression trees, boosting, random projections, and support vector machines. Real-time demonstrations in class use JMP (the interactive software from SAS) and R.


This course is designed for researchers and data analysts with a modest statistical background who want to use data mining methods on their own data sets. No previous background in data mining is necessary, but participants should have a good working knowledge of the basic principles of statistical inference (e.g., standard errors, hypothesis tests, confidence intervals) and a good understanding of the basic theory and practice of linear regression. Some acquaintance with logistic regression is also helpful.

Schedule and materials

The class will meet from 9 a.m. to 4 p.m. each day, with a 1-hour lunch break.

Participants receive a bound manual containing detailed lecture notes (with equations and graphics), examples of computer output, and many other useful features. The manual frees participants from the distracting task of note taking.

Registration and Lodging

The fee of $895 includes all course materials. The early registration fee of $795 is available until March 11.

Lodging Reservation Instructions

A block of rooms has been reserved at the Club Quarters Hotel, 1628 Chestnut St., Philadelphia, PA at a rate of $142 per night for a Standard room. This hotel is about a 5-minute walk from the course location. To reserve a room, call 203-905-2100 during business hours and mention the group code STA410. To guarantee the rate and availability, make your reservation before March 10, 2014.

Course outline

1 Introduction and overview
• Examples
• Data mining process
• Books
• Software

2 Data preparation and exploration
• Brushing and linking
• Aggregation and subsampling
• Surface plots
• Outliers
• Data sparsity

3 Foundations
• Role of sample size and classical inference
• Loss functions, goals
• Overfitting
• Cross validation
• Bagging and ensemble methods
• Dealing with missing data
• Regression examples

4 Data mining with regression
• Feature creation, transformation
• PCA, random projection, RKHS (kernel trick)
• Streaming features
• Calibration
• Variable/feature selection
• Lasso vs subset (L0 vs L1)
• Ridge regression, shrinkage
• Penalization

5 Classification
• Confusion matrix
• Loss functions for classification
• Discriminant analysis (i.e., regression)
• Logistic regression
• Neural networks

6 Trees and methods based on partitioning
• Local vs global models
• Interactions
• Splitting and degrees of freedom
• Boosting
• Treed regression

7 Clustering
• Unsupervised learning
• k-means, hierarchical/agglomerative clustering
• Bayesian, Dirichlet process

8 Text mining, special systems
• Bag of words
• Markov models
• Recommender systems
• Network models
• Hybrid combinations
• Wrap up, summary
