Statistical Inference for High-Dimensional Genetic Data

Li, Xuan

dc.contributor.advisor	Fu, Yuejiao
dc.contributor.advisor	Wang, Xiaogang
dc.creator	Li, Xuan
dc.date.accessioned	2019-03-05T14:57:06Z
dc.date.available	2019-03-05T14:57:06Z
dc.date.copyright	2018-11-23
dc.date.issued	2019-03-05
dc.date.updated	2019-03-05T14:57:05Z
dc.degree.discipline	Mathematics & Statistics
dc.degree.level	Doctoral
dc.degree.name	PhD - Doctor of Philosophy
dc.description.abstract	This dissertation focuses on three types of high-dimensional genetic data: protein sequences, DNA methylation data, and microRNA expression data. The four major parts are presented in Chapters 2-5, respectively. In Chapter 2, we develop a new clustering method for protein sequences. First, we reduce the dimensionality based on entropy. Second, the sequences are clustered using the Hamming distance vectors of chosen sites. We apply this new method to an influenza A H3N2 HA data set, which consists of 1960 viral sequences. Our method aggregates these sequences into 23 clusters. Based on the temporal evolution pattern of these clusters, we find that the dominant clusters change over time and often differ from the clusters housing vaccine strains. In Chapter 3, we conduct systematic simulation studies and real data analyses to compare the performance of seven statistical tests of the equal-variance hypothesis. Our results show that the Brown-Forsythe test and the trimmed-mean-based Levene's test perform better on DNA methylation data than the other tests. Detection of differential DNA methylation and differential variability has received a lot of attention in the literature. In Chapter 4, we derive the asymptotic distribution of a joint score test (AW), proposed by Anh and Wang (2013). Furthermore, we propose three improved joint score tests, namely iAW.Lev, iAW.BF, and iAW.TM. Systematic simulation studies show that at least one of the proposed tests performs better than the existing tests for data with outliers or from non-normal distributions. The real data analyses demonstrate that the three proposed tests have higher true validation rates than the existing tests. Besides DNA methylation, microRNA regulation is another important epigenetic mechanism. In Chapter 5, we propose a novel model-based clustering method to detect differentially variable (DV) miRNAs. We impose biologically meaningful structures on the covariance matrices for each cluster of miRNAs. Simulation studies show that the proposed method performs better than other model-based methods when miRNA expression levels follow a multivariate normal distribution. In the real data analysis, the proposed method has a higher validation rate than other methods.
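The two steps described for Chapter 2 (entropy-based site selection, then Hamming distances on the chosen sites) can be illustrated with a minimal Python sketch. This is one possible reading of the abstract, not the dissertation's actual procedure: the function names, the rule of keeping the k highest-entropy sites, the toy sequences, and the use of Shannon entropy in bits are all assumptions made for illustration.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in bits) of the residues observed at one aligned site."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_sites(seqs, k):
    """Assumed rule: keep the k sites with the highest entropy."""
    length = len(seqs[0])
    entropies = [site_entropy([s[i] for s in seqs]) for i in range(length)]
    ranked = sorted(range(length), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


def hamming(a, b, sites):
    """Number of chosen sites at which sequences a and b differ."""
    return sum(a[i] != b[i] for i in sites)


def distance_vectors(seqs, sites):
    """One row per sequence: its Hamming distance to every sequence."""
    return [[hamming(a, b, sites) for b in seqs] for a in seqs]


# Toy aligned sequences (hypothetical, not from the H3N2 HA data set).
seqs = ["ACDE", "ACDF", "GCDE", "GCKF"]
sites = select_sites(seqs, 2)
vectors = distance_vectors(seqs, sites)
```

The rows of `vectors` could then be passed to any standard clustering routine; which routine the dissertation uses is not stated in the abstract.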
dc.identifier.uri	http://hdl.handle.net/10315/35894
dc.language.iso	en
dc.rights	Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subject	Statistics
dc.subject.keywords	Statistical genetics
dc.subject.keywords	High-dimensional data
dc.subject.keywords	Clustering categorical data
dc.subject.keywords	Model-based clustering
dc.subject.keywords	Two-sample problem
dc.title	Statistical Inference for High-Dimensional Genetic Data
dc.type	Electronic Thesis or Dissertation

Files

Original bundle

Name: LI_XUAN_2018_PhD.pdf
Size: 977.5 KB
Format: Adobe Portable Document Format

Collections

Mathematics & Statistics