The lay abstract featured today is for "Improved Variance Estimation From Trimmed Samples" by Daniel Andrade, published in Stat; the full Open Access article is available to read here.
Andrade, D. (2024), Improved Variance Estimation From Trimmed Samples. Stat, 13: e70018. https://doi.org/10.1002/sta4.70018
In data analysis, a fundamental task is to summarize how much a set of numbers varies around its average, i.e. the spread around the mean. The variance or the standard deviation is typically used for this purpose, but their naive estimates are not robust to outliers: just a few extreme values can produce completely wrong estimates.
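The effect is easy to reproduce. The following short sketch (illustrative only, not from the paper) shows how a single extreme value wrecks the naive sample variance:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=100)   # true variance is 1
contaminated = np.append(clean, 1000.0)            # one gross outlier

var_clean = np.var(clean, ddof=1)        # should be close to 1
var_bad = np.var(contaminated, ddof=1)   # off by orders of magnitude
```

Because the variance squares deviations from the mean, the one outlier dominates the entire sum.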
The author proposes a new estimator of the variance that guarantees robustness to up to 50% outliers, while ensuring that the estimation accuracy remains high when no outliers are present.
Moreover, depending on the needs of the user, the maximal robustness can also be set below 50%, with the benefit of increased accuracy.
The estimator is based on the popular Qn estimator, which trims the pairs of samples with the largest absolute distances. It works by taking a linear combination of trimming estimators with different trimming ratios, in a way that theoretically ensures the optimal achievable performance: either the minimal mean squared error, or the minimal variance under unbiasedness.
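For context, the classical Qn estimator of Rousseeuw and Croux that this work builds on can be sketched in a few lines. This is a simple O(n²) version of the standard Qn scale estimate, not the author's combined estimator:

```python
import numpy as np

def qn_scale(x):
    """Classical Qn scale estimate: the k-th smallest pairwise absolute
    distance, rescaled for consistency at the normal distribution."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # all pairwise absolute differences |x_i - x_j| for i < j
    i, j = np.triu_indices(n, k=1)
    dists = np.abs(x[i] - x[j])
    h = n // 2 + 1
    order = h * (h - 1) // 2          # rank of the order statistic used
    d = 2.2219                        # Gaussian consistency factor
    return d * np.partition(dists, order - 1)[order - 1]

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=2.0, size=500)
est = qn_scale(x)   # should be close to the true sigma = 2
```

Because only a low order statistic of the pairwise distances is used, the largest (outlier-driven) pairs are effectively trimmed away.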
Additionally, the theory and simulations show that the proposed estimator outperforms both the median absolute deviation (MAD) and the standard Qn estimator. The author therefore suggests that practitioners use this new estimation method instead of MAD and others, and provides implementations in Python so that everybody can quickly try out the new estimators.
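As a baseline for such comparisons, the MAD itself is a one-liner. The following sketch (again illustrative, not the author's code) shows the robust/naive contrast that motivates the paper:

```python
import numpy as np

def mad_scale(x):
    """Median absolute deviation, scaled by 1.4826 to estimate the
    standard deviation under normality."""
    x = np.asarray(x, dtype=float)
    return 1.4826 * np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
clean_est = mad_scale(x)        # close to the true sigma = 1
x[:100] = 50.0                  # contaminate 10% of the sample
robust_est = mad_scale(x)       # inflates only modestly
naive_est = np.std(x, ddof=1)   # explodes
```

The paper's point is that while MAD is robust, more accurate robust estimators are available, which is where the proposed combined trimming approach comes in.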
More Details
