Selective inference on trees
Abstract
The ever-increasing scope and scale of data collection has shifted its focus away from testing pre-specified hypotheses and towards hypothesis generation. Researchers often perform exploratory data analysis on a data set to generate hypotheses, and then validate those hypotheses on that same data via tests of significance. Unfortunately, this type of “double-dipping” can lead to extremely inflated type I error rates.
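A small simulation can make this inflation concrete. The sketch below is illustrative only and is not taken from the talk: it assumes one-dimensional Gaussian data with no true clusters, uses Ward hierarchical clustering from SciPy as the exploratory step, and then applies an ordinary two-sample t-test to the two resulting clusters. The function name `naive_rejection_rate` and all parameter values are hypothetical choices made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import ttest_ind


def naive_rejection_rate(n_sims=200, n=50, alpha=0.05, seed=0):
    """Fraction of null data sets where a naive t-test between two clusters rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # Null data: a single homogeneous Gaussian population, so no true clusters exist.
        x = rng.normal(size=(n, 1))
        # Exploratory step: Ward hierarchical clustering, with the tree cut into two clusters.
        labels = fcluster(linkage(x, method="ward"), t=2, criterion="maxclust")
        a, b = x[labels == 1, 0], x[labels == 2, 0]
        if len(a) < 2 or len(b) < 2:
            continue  # too few points for a t-test; counted as no rejection
        # Validation step on the same data: ignores that the clusters were chosen by the data.
        if ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / n_sims


rate = naive_rejection_rate()
print(f"Naive rejection rate at alpha = 0.05: {rate:.2f}")
```

Because the clusters are constructed to be well separated, the naive test typically rejects far more often than the nominal 5% level, even though every data set is generated under the null.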
In this talk, I will consider double-dipping on trees. First, I will focus on trees generated by hierarchical clustering, and consider testing for differences between clusters obtained by cutting the tree. I will propose a selective inference approach to test for a difference in means between two clusters that properly accounts for the fact that the choice of null hypothesis was made based on the data. Second, I will consider trees generated using the CART algorithm, and will again use a selective inference approach to conduct inference on the means of the terminal nodes. Applications include single-cell RNA-sequencing data and the Box Lunch Study.
This is joint work with Jacob Bien (University of Southern California), Daniela Witten (University of Washington), and Anna Neufeld (University of Washington).
Biography
Lucy Gao is an assistant professor in the Department of Statistics and Actuarial Science at the University of Waterloo. She holds a PhD in biostatistics from the University of Washington, where she studied under the supervision of Daniela Witten.
Dr. Gao is interested in developing statistical methodology related to multivariate statistics, statistical learning, and selective inference, with applications to scientific problems in biology and the health sciences. She also works on the optimal design of experiments.