Nearest Neighbour Classification Based on Imbalanced Data: A Statistical Approach – lay abstract

The lay abstract featured today (for Nearest Neighbour Classification Based on Imbalanced Data: A Statistical Approach by Anvit Garg, Anil K. Ghosh, Soham Sarkar) is from Stat, with the full article now available to read here.

How to Cite

Garg, A., A. Ghosh, and S. Sarkar. 2025. Nearest Neighbour Classification Based on Imbalanced Data: A Statistical Approach. Stat 14, no. 4: e70110. https://doi.org/10.1002/sta4.70110.

Lay Abstract

In many real-world classification problems, like spotting a rare disease from medical records or flagging fraudulent transactions, the “interesting” examples are scarce. Popular tools such as the nearest-neighbour classifier tend to be dominated by the commonly occurring examples, which often leads to missed detections. The article introduces a simple fix to this problem. Instead of always checking a fixed number of neighbours, the method keeps looking outward until it has seen a set number of rare examples. It then asks a straightforward statistical question: had the labels been random, what is the probability of having to look this far? The answer provides a score that helps decide fairly between the rare and common cases. A multi-scale version of the method calculates these scores for several different numbers of rare examples and combines them to give the final classification. Two extensions of the method handle multi-class classification problems. Theoretically, the method is shown to learn the optimal decision rule as more data arrive. In tests on simulated and public benchmark datasets, including multi-class problems, it consistently matches or outperforms the standard nearest-neighbour classifier and popular data-balancing techniques, especially when the class imbalance is severe.
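To make the core idea concrete, here is a toy Python sketch (not the authors' exact procedure, and the function names and example data are illustrative assumptions). It ranks training points by distance to a query, finds the rank at which the r-th rare example appears, and asks how unlikely it would be, under randomly shuffled labels, to have seen r rare points that early; that left-tail event corresponds to a hypergeometric tail probability.

```python
from math import comb

def hypergeom_sf(r_minus_1, n_total, n_rare, k):
    """P(X >= r_minus_1 + 1) for X ~ Hypergeometric:
    number of rare points among k draws without replacement
    from n_total points of which n_rare are rare."""
    total = comb(n_total, k)
    return sum(
        comb(n_rare, j) * comb(n_total - n_rare, k - j)
        for j in range(r_minus_1 + 1, min(n_rare, k) + 1)
    ) / total

def rarity_score(distances, labels, rare_label, r):
    """Toy score for a single query point (illustrative, not the paper's
    exact rule): look outward until the r-th rare example is seen, then
    return P(seeing r rare examples within that many neighbours) under
    random labels.  A small score means the rare class sits unusually
    close to the query."""
    order = sorted(range(len(distances)), key=distances.__getitem__)
    # 1-based ranks (by distance) at which rare examples appear
    rare_ranks = [i + 1 for i, idx in enumerate(order)
                  if labels[idx] == rare_label]
    k_obs = rare_ranks[r - 1]           # rank of the r-th rare neighbour
    # P(rank of r-th rare point <= k_obs)
    #   = P(at least r rare among the first k_obs neighbours)
    return hypergeom_sf(r - 1, len(labels), len(rare_ranks), k_obs)

# Tiny hand-made example: three rare points (label 1) very close to the
# query, five common points (label 0) far away.
dists = [0.1, 0.1, 0.1, 2.8, 3.0, 3.0, 2.8, 4.1]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
p = rarity_score(dists, labels, rare_label=1, r=3)
# p == C(3,3)*C(5,0)/C(8,3) == 1/56, i.e. about 0.018: under random
# labels it would be very surprising to meet 3 rare points this early,
# so the query looks like a rare-class point.
```

The paper's multi-scale version would, in this sketch, compute such scores for several values of `r` and combine them; the combination rule and the multi-class extensions are described in the full article.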

More Details