Statistics seminar 2016: "Clustering high dimensional mixed data: joint analysis of phenotypic and genotypic data"

  • Date: 21 June 2016, 14:30 to 16:00

  • Venue: Dipartimento di Scienze Statistiche "P. Fortunati" - Via Belle Arti 41 - Aula II

Claire Gormley
(University College Dublin)

Abstract
The LIPGENE-SU.VI.MAX study, like many others, recorded high dimensional continuous phenotypic data and categorical genotypic data. Interest lies in clustering the study participants into homogeneous groups, or sub-phenotypes, by jointly considering their phenotypic and genotypic data, and in determining which variables are discriminatory. A novel latent variable model that elegantly accommodates high dimensional, mixed data is developed to cluster participants using a Bayesian finite mixture model. A computationally efficient variable selection algorithm is incorporated, estimation is via a Gibbs sampling algorithm, and an approximate BIC-MCMC criterion is developed to select the optimal model. Two clusters or sub-phenotypes (‘healthy’ and ‘at risk’) are uncovered. A small subset of variables is deemed discriminatory; notably, it includes both phenotypic and genotypic variables, highlighting the need to consider the two types of data jointly. Further, seven years after the data were collected, participants underwent further analysis to diagnose the presence or absence of the metabolic syndrome (MetS). The two uncovered sub-phenotypes correspond strongly to the seven-year follow-up disease classification, highlighting the role of phenotypic and genotypic factors in the MetS and emphasising the potential utility of the clustering approach in early screening. Additionally, the ability of the proposed approach to define the uncertainty in sub-phenotype membership at the participant level is synonymous with the concepts of precision medicine and nutrition.

Contact person
Cinzia Viroli