Each week, we publish lay abstracts of new articles from our prestigious portfolio of journals in statistics. The aim is to highlight the latest research to a broader audience in an accessible format.
The article featured today is from Stat, and the full article is now available to read as Open Access.
A methodology is proposed for handling ordered factors used as explanatory variables, which typically enter the linear predictor of the model under consideration. For any given ordered factor with K levels, say, a set of K numeric values is introduced, with one value assigned to each factor level. In the end, the original factor is effectively replaced by a numeric variable. This scheme refines the elementary scoring system given by the sequence of integers from 1 to K, a simple, time-honoured option for dealing with ordered factors that is not always appropriate.
The actual construction of numeric scores proceeds by selecting K quantiles of a distribution belonging to some suitable parametric family. The adoption of a sufficiently flexible parametric family helps to find a scoring system appropriate for the data under consideration. A concomitant product of this scheme is the identification of numeric values which indicate how the K levels are “really” spaced. Taken together, these two aspects make interpretability of the resulting construction a key feature of the proposal.
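To make the idea of quantile-based scores concrete, here is a minimal sketch in Python. It is not the method of the paper or the code of the R package: the Kumaraswamy family, the probability levels (k - 0.5)/K, the function names and the factor levels are all assumptions chosen for illustration, because this family has a closed-form quantile function and two shape parameters.

```python
def kumaraswamy_quantile(p, a, b):
    """Quantile function of the Kumaraswamy(a, b) distribution on (0, 1).

    Assumed family for illustration only; its CDF is 1 - (1 - x**a)**b.
    """
    return (1.0 - (1.0 - p) ** (1.0 / b)) ** (1.0 / a)


def quantile_scores(K, a, b):
    """Return K scores, taken as quantiles at the assumed levels (k - 0.5) / K."""
    probs = [(k - 0.5) / K for k in range(1, K + 1)]
    return [kumaraswamy_quantile(p, a, b) for p in probs]


# Hypothetical ordered factor with K = 4 levels.
levels = ["low", "medium", "high", "very high"]

# a = b = 1 is the uniform distribution, so the scores are equally spaced,
# which matches the integer scores 1..K up to location and scale.
equal_scores = dict(zip(levels, quantile_scores(len(levels), a=1.0, b=1.0)))

# Other shape values stretch or compress the gaps between adjacent levels.
skewed_scores = dict(zip(levels, quantile_scores(len(levels), a=1.0, b=3.0)))

print(equal_scores)
print(skewed_scores)
```

In this sketch the shape parameters are fixed by hand. In an actual analysis they would be chosen so that the resulting numeric variable fits the data well, and the fitted scores would then suggest how far apart the levels are.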
The proposed method represents an alternative to the use of contrasts based on orthogonal polynomials, which is commonly the method of choice for ordered factors. In the logic put forward, the constructed scores are intended to be used, and interpreted, without further manipulation. Hence, for instance, building a polynomial form from one such variable would diverge somewhat from the proposed logic, although it is still conceivable. With a single numeric variable representing a given factor, one cannot expect to achieve the same numerical fit to the data as that obtained with the polynomial contrasts built for the original factor, when these contrasts involve high-degree polynomials and, correspondingly, several parameters. However, a range of numerical explorations has indicated that in many cases the resulting fit is equal or similar to the one achieved via polynomial contrasts, with a non-negligible simplification of the model specification and easier interpretation.
In a nutshell, the aim of the approach is to achieve a satisfactory data fit while improving on model parsimony, with simple interpretability of the scoring scheme.
The methodology has been implemented in an R package freely available at https://CRAN.R-project.org/package=smof, whose documentation also shows how to reproduce the numerical illustrations of the paper.