High-Dimensional Data Integration with Multiple Heterogeneous and Outlier Contaminated Tasks
dc.contributor.advisor | Xu, Wei | |
dc.contributor.advisor | Gao, Xin | |
dc.contributor.author | Zhong, Yuan | |
dc.date.accessioned | 2023-10-03T20:07:30Z | |
dc.date.available | 2023-10-03T20:07:30Z | |
dc.date.issued | 2023-02 | |
dc.date.updated | 2023-10-03T20:07:29Z | |
dc.degree.discipline | Mathematics & Statistics | |
dc.degree.level | Doctoral | |
dc.degree.name | PhD - Doctor of Philosophy | |
dc.description.abstract | Data integration is the process of extracting information from multiple sources and analyzing different related data sets simultaneously. The aggregated information can reduce the sample biases caused by low-quality data, boost the statistical power for joint inference, and enhance model prediction. Therefore, this dissertation focuses on the development and implementation of statistical methods for data integration. In clinical research, study outcomes usually consist of various pieces of patient information related to the treatment. Since joint inference across related data sets can provide more efficient estimates than marginal approaches, analyzing multiple clinical endpoints simultaneously gives a better understanding of treatment effects. Meanwhile, data from different studies are usually heterogeneous, with both continuous and discrete endpoints. To alleviate computational difficulties, we apply the pairwise composite likelihood method to analyze the data. We show that the estimators are consistent and asymptotically normally distributed based on the Godambe information. Under high dimensionality, the joint model needs to select the important features to analyze the intrinsic relatedness among all data sets. Multi-task feature learning is widely used to recover this union support through the penalized M-estimation framework. However, the heterogeneity among different data sets may cause difficulties in formulating the joint model. Thus, we propose a mixed $\ell_{2,1}$ regularized composite quasi-likelihood function to perform multi-task feature learning. In our framework, we relax the distributional assumption on the responses, and our result establishes the sign recovery consistency and estimation error bounds of the penalized estimates. When data from multiple sources are contaminated by large outliers, multi-task learning methods suffer a loss of efficiency. 
Next, we propose a robust multi-task feature learning method that combines adaptive Huber regression tasks with mixed regularization. The robustification parameters can be chosen to adapt to the sample size, model dimension, and error moments while striking a balance between unbiasedness and robustness. We consider heavy-tailed distributions for multiple data sets that have a bounded $(1+\omega)$th moment for any $\omega>0$. Our method is shown to achieve estimation consistency and sign recovery consistency. In addition, a robust information criterion enables joint inference on related tasks with consistent model selection. | |
dc.identifier.uri | https://hdl.handle.net/10315/41453 | |
dc.language | en | |
dc.rights | Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests. | |
dc.subject | Statistics | |
dc.subject.keywords | Data integration | |
dc.subject.keywords | Composite likelihood | |
dc.subject.keywords | Penalized M-estimation | |
dc.subject.keywords | Robust M-estimation | |
dc.subject.keywords | Mixed ℓ2,1 Regularization | |
dc.subject.keywords | Adaptive Huber regression | |
dc.subject.keywords | Outlier contamination | |
dc.title | High-Dimensional Data Integration with Multiple Heterogeneous and Outlier Contaminated Tasks | |
dc.type | Electronic Thesis or Dissertation |