## Estimating the number of clusters

### Abstract

Hartigan (1975) defines the number q of clusters in a d-variate statistical population as the number of connected components of the set {f > c}, where f denotes the underlying density function on ℝ^d and c is a given constant. Some common clustering algorithms treat q as an input that must be specified in advance. The authors propose a method for estimating this parameter, based on computing the number of connected components of an estimate of {f > c}. This set estimator is constructed as a union of balls centred at the points of an appropriate subsample, which is selected via a nonparametric density estimator of f. The asymptotic behaviour of the proposed method is analyzed. A simulation study and an example with real data are also included.
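The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, the function names `gaussian_kde_values` and `estimate_num_clusters`, and the tuning parameters `bandwidth` and `radius` are all assumptions made for illustration, since the abstract does not specify them. The sketch keeps the sample points whose estimated density exceeds c, places a ball of fixed radius around each, and counts the connected components of the union. Two closed balls of radius r intersect exactly when their centres are at most 2r apart, so the count reduces to the connected components of a graph on the retained points.

```python
import numpy as np


def gaussian_kde_values(sample, bandwidth):
    """Evaluate a Gaussian kernel density estimate at every sample point.

    sample has shape (n, d); bandwidth is an assumed smoothing parameter.
    """
    n, d = sample.shape
    sq_dists = ((sample[:, None, :] - sample[None, :, :]) ** 2).sum(axis=2)
    norm = n * (2.0 * np.pi * bandwidth ** 2) ** (d / 2.0)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2)).sum(axis=1) / norm


def estimate_num_clusters(sample, c, bandwidth, radius):
    """Count connected components of a union-of-balls estimate of {f > c}."""
    # Subsample: points where the estimated density exceeds the level c.
    density = gaussian_kde_values(sample, bandwidth)
    centres = sample[density > c]
    m = len(centres)
    if m == 0:
        return 0

    # Balls of equal radius intersect iff their centres are within 2 * radius.
    sq = ((centres[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    adjacent = sq <= (2.0 * radius) ** 2

    # Union-find over the intersection graph of the balls.
    parent = list(range(m))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    rows, cols = np.nonzero(np.triu(adjacent, k=1))
    for i, j in zip(rows.tolist(), cols.tolist()):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j

    return len({find(i) for i in range(m)})
```

In this sketch the result depends on all three of `c`, `bandwidth`, and `radius`: a larger `radius` merges nearby components, while a larger `c` discards low-density points such as isolated outliers. The pairwise distance matrices use O(n²) memory, so the sketch is only suitable for moderate sample sizes.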
