Abstract
Selecting an optimal clustering solution is a longstanding problem. In model-based clustering, this amounts to choosing the architecture of the model mixture distribution: decisions include the cluster prototype distribution, the number of mixture components, and possibly restrictions on the clusters' geometry. Classical methods address this issue via penalized model selection criteria based on the observed likelihood. We compare these methods with the less explored cross-validation alternative, which is almost the default option in the prediction-oriented paradigm. We introduce a framework for "scoring" clustering solutions, where scores are intimately connected with likelihood and information-theoretic quantities. We propose to estimate scores and their confidence intervals based on resampling methods and to use these estimates to formulate selection rules. Theoretical guarantees are given. Both real and artificial data sets are analyzed to assess the relative performance of the proposed methodology.
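To illustrate the general idea of cross-validation as an alternative to penalized likelihood criteria, the sketch below scores candidate numbers of Gaussian mixture components by held-out log-likelihood. It is a minimal illustration under assumed settings (scikit-learn's GaussianMixture, 5 folds, toy data), not the scoring framework or resampling-based confidence intervals proposed in the talk.

```python
# Minimal sketch (assumed, not the authors' method): cross-validated
# log-likelihood as a "score" for choosing the number of mixture components.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian clusters in 2D (illustrative only).
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(150, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(150, 2)),
])

def cv_loglik(X, n_components, n_splits=5, seed=0):
    """Mean and sd of held-out log-likelihood per observation across folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(X):
        gm = GaussianMixture(n_components=n_components, random_state=seed)
        gm.fit(X[train_idx])
        # score() returns the average log-likelihood of the held-out sample.
        scores.append(gm.score(X[test_idx]))
    return np.mean(scores), np.std(scores)

for k in range(1, 6):
    mean, sd = cv_loglik(X, k)
    print(f"k={k}: held-out log-likelihood {mean:.3f} (sd {sd:.3f})")
```

The fold-to-fold spread reported here is only a rough stand-in for the resampling-based confidence intervals the abstract refers to.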
Organizer
Christian Hennig