Statistical Inference for High-Dimensional Genetic Data

Date

2019-03-05

Authors

Li, Xuan

Abstract

This dissertation focuses on three types of high-dimensional genetic data: protein sequences, DNA methylation data, and microRNA (miRNA) expression data. Its four major parts are presented in Chapters 2 through 5.

In Chapter 2, we develop a new clustering method for protein sequences. First, we reduce the dimensionality based on entropy. Second, we cluster the sequences using the Hamming distance vectors of the chosen sites. We apply this new method to an influenza A H3N2 HA data set, which consists of 1960 viral sequences. Our method aggregates these sequences into 23 clusters. Based on the temporal evolution pattern of these clusters, we find that the dominant clusters change over time and are often different from the clusters housing vaccine strains.
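The two steps described for Chapter 2 can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the Shannon-entropy cutoff `threshold`, the distance cutoff `max_dist`, the function names, and the simple single-linkage grouping used as the clustering step are all assumptions made for illustration, and the sequences are assumed to be aligned to equal length.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in bits) of the residues observed at one aligned site."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def select_sites(seqs, threshold):
    """Keep the sites whose entropy exceeds `threshold` (hypothetical cutoff)."""
    return [j for j in range(len(seqs[0]))
            if site_entropy([s[j] for s in seqs]) > threshold]


def hamming(a, b, sites):
    """Number of chosen sites at which sequences a and b differ."""
    return sum(a[j] != b[j] for j in sites)


def cluster_by_hamming(seqs, sites, max_dist):
    """Stand-in clustering step: single-linkage grouping of sequences whose
    Hamming distance on the chosen sites is at most `max_dist`."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if hamming(seqs[i], seqs[j], sites) <= max_dist:
                parent[find(j)] = find(i)

    # Relabel group representatives as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(seqs))]


# Toy aligned sequences (not real HA data): only sites 1 and 3 vary.
toy = ["ACGT", "ACGT", "ATGA", "ATGA"]
chosen = select_sites(toy, threshold=0.5)
toy_labels = cluster_by_hamming(toy, chosen, max_dist=0)
```

Dropping low-entropy sites first means that conserved positions, which cannot separate sequences, contribute nothing to the pairwise distance computation.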

In Chapter 3, we conduct systematic simulation studies and real data analyses to compare the performance of seven statistical tests of the equal-variance hypothesis. Our results show that the Brown-Forsythe test and the trimmed-mean-based Levene's test perform better on DNA methylation data than the other tests.
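For readers who want to try the two tests named above, the following sketch shows one way to compute them with SciPy, whose `scipy.stats.levene` function covers the classical Levene test (`center="mean"`), the Brown-Forsythe test (`center="median"`), and a trimmed-mean variant (`center="trimmed"`). The helper name, the trimming proportion of 0.1, and the simulated normal data are assumptions for illustration; they are not the settings or data used in the dissertation, and the remaining tests it compares are not shown.

```python
import numpy as np
from scipy import stats


def equal_variance_pvalues(group_a, group_b, trim=0.1):
    """P-values of three Levene-type tests of equal variance in two groups.

    `trim` is the proportion cut from each tail for the trimmed-mean variant
    (an illustrative choice, not a value taken from the dissertation).
    """
    return {
        "levene_mean": stats.levene(group_a, group_b, center="mean").pvalue,
        "brown_forsythe": stats.levene(group_a, group_b, center="median").pvalue,
        "levene_trimmed": stats.levene(
            group_a, group_b, center="trimmed", proportiontocut=trim
        ).pvalue,
    }


# Simulated example: same mean, standard deviations 1 and 3.
rng = np.random.default_rng(0)
cases = rng.normal(loc=0.0, scale=1.0, size=200)
controls = rng.normal(loc=0.0, scale=3.0, size=200)
pvals = equal_variance_pvalues(cases, controls)
```

The median and trimmed-mean centers are less sensitive to outliers and skewness than the mean, which is the usual motivation for preferring them on non-normal data.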

The detection of differential DNA methylation and differential variability has received considerable attention in the literature. In Chapter 4, we derive the asymptotic distribution of a joint score test (AW) proposed by Anh and Wang (2013). Furthermore, we propose three improved joint score tests, namely iAW.Lev, iAW.BF, and iAW.TM. Systematic simulation studies show that at least one of the proposed tests performs better than the existing tests for data with outliers or from non-normal distributions. The real data analyses demonstrate that the three proposed tests have higher true validation rates than the existing tests.

Besides DNA methylation, microRNA (miRNA) regulation is another important epigenetic mechanism. In Chapter 5, we propose a novel model-based clustering method to detect differentially variable (DV) miRNAs. We impose biologically meaningful structures on the covariance matrix of each cluster of miRNAs. Simulation studies show that the proposed method performs better than other model-based methods when miRNA expression levels follow a multivariate normal distribution. In the real data analysis, the proposed method has a higher validation rate than the other methods.

Keywords

Statistics
