High-Dimensional Data Integration with Multiple Heterogeneous and Outlier Contaminated Tasks
Abstract
Data integration is the process of extracting information from multiple sources and analyzing different related data sets simultaneously. The aggregated information can reduce the sample bias caused by low-quality data, boost the statistical power of joint inference, and improve model prediction. This dissertation focuses on the development and implementation of statistical methods for data integration.
In clinical research, study outcomes usually comprise several pieces of patient information recorded in response to treatment. Because joint inference across related data sets can yield more efficient estimates than marginal approaches, analyzing multiple clinical endpoints simultaneously gives a better understanding of treatment effects. However, data from different studies are usually heterogeneous, mixing continuous and discrete endpoints. To alleviate the resulting computational difficulties, we apply the pairwise composite likelihood method. We show that the resulting estimators are consistent and asymptotically normal, with the asymptotic variance determined by the Godambe information.
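For reference, the standard textbook form of the pairwise composite likelihood and its asymptotic variance can be sketched as follows. The notation is assumed here for illustration and is not taken from the dissertation: n subjects, m endpoints y_{ij}, bivariate densities f_{jk}, parameter \theta, and p\ell_i for the contribution of subject i.

```latex
\begin{align*}
p\ell(\theta) &= \sum_{i=1}^{n} p\ell_i(\theta)
  = \sum_{i=1}^{n} \sum_{1 \le j < k \le m} \log f_{jk}(y_{ij}, y_{ik}; \theta), \\
H(\theta) &= \mathbb{E}\!\left[-\nabla_\theta^{2}\, p\ell_i(\theta)\right], \qquad
J(\theta) = \operatorname{Var}\!\left[\nabla_\theta\, p\ell_i(\theta)\right], \\
G(\theta) &= H(\theta)\, J(\theta)^{-1} H(\theta), \\
\sqrt{n}\,\bigl(\hat{\theta} - \theta_0\bigr) &\xrightarrow{d} N\!\left(0,\; G(\theta_0)^{-1}\right).
\end{align*}
```

Because the pairwise likelihood is not a full likelihood, H and J generally differ, so the sandwich form G does not reduce to the ordinary Fisher information.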
Under high dimensionality, the joint model must select the important features in order to capture the intrinsic relatedness among the data sets. Multi-task feature learning is widely used to recover this union support through the penalized M-estimation framework. However, heterogeneity among the data sets can make the joint model difficult to formulate. Thus, we propose the mixed
When data from multiple sources are contaminated by large outliers, multi-task learning methods lose efficiency. We therefore propose robust multi-task feature learning, which combines adaptive Huber regression tasks with mixed regularization. The robustification parameters can be chosen to adapt to the sample size, model dimension, and error moments, striking a balance between unbiasedness and robustness. We consider heavy-tailed distributions for multiple data sets that have bounded
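The role of the robustification parameter in Huber regression can be illustrated with a minimal single-task sketch. The function names, the fixed-point solver, and the rule in `illustrative_tau` are assumptions made for this example (the rule is a scaling of the kind used in the adaptive Huber literature for finite-variance errors); none of them is the dissertation's own implementation or tuning rule.

```python
import math


def huber_loss(residual: float, tau: float) -> float:
    """Huber loss: quadratic for |residual| <= tau, linear beyond it."""
    a = abs(residual)
    if a <= tau:
        return 0.5 * residual * residual
    return tau * a - 0.5 * tau * tau


def huber_score(residual: float, tau: float) -> float:
    """Derivative of the Huber loss: the residual clipped to [-tau, tau]."""
    return max(-tau, min(tau, residual))


def huber_location(y, tau: float, n_iter: int = 200) -> float:
    """Huber estimate of a location parameter (intercept-only regression).

    Uses the fixed-point update mu <- mu + mean(clip(y_i - mu, -tau, tau)),
    which is gradient descent with unit step on the average Huber loss.
    """
    mu = 0.0
    for _ in range(n_iter):
        mu += sum(huber_score(v - mu, tau) for v in y) / len(y)
    return mu


def illustrative_tau(sigma: float, n: int, d: int) -> float:
    """Hypothetical rule tau = sigma * sqrt(n / log d), requiring d > 1.

    Assumption for illustration only: tau grows with the sample size n and
    shrinks with the dimension d, scaled by an error-moment proxy sigma.
    """
    return sigma * math.sqrt(n / math.log(d))


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 100.0]  # one gross outlier
    print("sample mean:   ", sum(data) / len(data))
    print("Huber location:", huber_location(data, tau=1.0))
```

A small tau clips the influence of the outlier at the cost of some bias, while a large tau recovers the least-squares fit; letting tau grow with the sample size is what allows the bias to vanish asymptotically.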