The paper describes a journey between questions, models and data analysis to reach specific goals. This journey is typical of industrial, engineering, biology and social science applications. It contrasts with regulated clinical research, where a statistical analysis plan is declared before data collection. The case study used in the paper consists of data from 63 sensors collected during the testing of an electronic system; a car is a familiar example of a system tracked by many sensors. The analyst asks a sequence of questions, and the paper shows how each was tackled by statistical analysis to meet the analysis goal. The methods considered include random forests, ridge regression, lasso and elastic nets. Eventually the analyst was able to provide a robust, parsimonious and effective model for predicting the system condition using a subset of the 63 sensors. In handling this problem, the paper develops and applies several innovative methods and insights that can prove useful in a general data analysis life cycle. The questions, the life cycle steps and the analytic tools used in the case study are listed in the table below.
Question | Life cycle step | Analytic tool |
Question 1: Is the sensor data differentiating between good and failed systems? | (1) problem elicitation | Multivariate T2 |
Question 2: Which sensors contribute to the T2 control chart? | (8) impact assessment | Contribution plots |
Question 3: Can we predict failure modes from sensor data? | (6) operationalization of findings | Bootstrap forest |
Question 4: How robust are the column contributions to alternative choices of training and validation subsets? | (5) formulation of findings | Sensitivity analysis of column contribution |
Question 5: Is there a subset of sensors we can focus on to adequately predict test results? | (5) formulation of findings | Variable clustering |
Question 6: What is the performance of penalized regression models using only Cluster 1 sensors? | (8) impact assessment | Penalized regression |
Question 7: Can we have simple rules that translate sensor data into test result predictions? | (6) operationalization of findings | Decision tree |
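To make the first two rows of the table concrete, the following is a minimal sketch of a multivariate T2 statistic and a per-sensor contribution breakdown. It is not the paper's implementation: the data are synthetic, only 5 hypothetical sensors are used instead of 63, the function names `hotelling_t2` and `t2_contributions` are invented for illustration, and the contribution formula is one simple decomposition among several in use.

```python
import numpy as np


def _center_and_invert(reference, observations):
    """Center observations on the reference mean; return them with the inverse covariance."""
    mean = reference.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(reference, rowvar=False))
    return observations - mean, cov_inv


def hotelling_t2(reference, observations):
    """T2 distance of each observation from the reference (good systems) mean."""
    centered, cov_inv = _center_and_invert(reference, observations)
    # Quadratic form x' S^-1 x, evaluated row by row.
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)


def t2_contributions(reference, observations):
    """Per-sensor terms that sum to T2 for each observation (one simple decomposition)."""
    centered, cov_inv = _center_and_invert(reference, observations)
    return centered * (centered @ cov_inv)


# Synthetic stand-in for the sensor data (assumption: 5 sensors, not 63).
rng = np.random.default_rng(0)
n_sensors = 5
good = rng.normal(size=(200, n_sensors))
failed = rng.normal(size=(20, n_sensors))
failed[:, 2] += 4.0  # assumption: failures shift the sensor at index 2

t2_good = hotelling_t2(good, good)
t2_failed = hotelling_t2(good, failed)
contrib = t2_contributions(good, failed)

print("mean T2, good systems:  ", t2_good.mean())
print("mean T2, failed systems:", t2_failed.mean())
print("sensor with largest mean contribution:", int(contrib.mean(axis=0).argmax()))
```

In this sketch the failed systems sit far from the good-system mean, so their T2 values are larger, and the contribution breakdown points at the shifted sensor. On real sensor data the covariance estimate would need more care, for example when sensors are highly correlated or outnumber the observations.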
This journey demonstrates how the study goal, data, questions and analysis interact. To quote the famous mathematician and bio-scientist Sam Karlin: “The purpose of models is not to fit the data but to sharpen the question”. https://onlinelibrary.wiley.com/doi/10.1002/9781118445112.stat08377