Statistical Modeling for Complex Data
Abstract
In this dissertation, we focus on statistical modeling techniques for exploring complex data with features such as high dimensionality, nonstationary structure, heavy-tailed distributions, and missing data. We study four problems: dimension reduction in high-dimensional data, clarifying complex patterns in nonstationary spatial data, improving hierarchical Bayesian modeling of spatio-temporal data with a staircase pattern of missing observations, and detecting change points in spatio-temporal data with outliers and heavy-tailed observations.
Sufficient dimension reduction has drawn considerable attention over the last twenty years as the dimension of the covariates has grown rapidly. The semiparametric approach to dimension reduction proposed by Ma and Zhu [2012] is a novel approach that differs fundamentally from the existing literature on dimension-reduction problems. We present a theoretical result that relaxes a critical condition required by the semiparametric approach. The asymptotic normality of the estimators still holds under the weaker assumptions. This improvement broadens the applicability of the semiparametric approach.
For spatial data, nonstationarity makes it difficult to learn the underlying processes and, more specifically, to characterize spatial dependence using the semivariogram model. We improve the dimension expansion modeling technique proposed by Bornn et al. [2012] by taking the correlation structure into account. We propose two generalized least-squares methods. Simulation studies and real data analyses demonstrate that both methods provide more accurate parameter estimates than the least-squares method.
As spatio-temporal data are usually observed over a large area and over many years, modeling them is non-trivial. Missing data make the task even more challenging. One of the problems discussed in this dissertation is modeling ozone concentrations in a region in the presence of missing data. We propose a method that makes no assumptions on the correlation structure and estimates the covariance matrix through the dimension expansion method for modeling semivariograms in nonstationary fields, based on the estimates from the hierarchical Bayesian spatio-temporal modeling technique [Le and Zidek, 2006]. For demonstration, we apply the method to ozone concentrations at 25 stations in the Pittsburgh region studied in Jin et al. [2012]. The proposed method is compared with the one in Jin et al. [2012] through leave-one-out cross-validation, which shows that the proposed method is more general and applicable.
The last problem, which also concerns spatio-temporal data, is detecting structural changes in spatio-temporal data with missing values in the presence of outliers and heavy-tailed observations. We improve the estimation algorithm of a general spatio-temporal autoregressive (GSTAR) model proposed by Wu et al. [2017]. We propose an M-estimation-based EM algorithm and a change-point detection procedure. Through data examples, we compare the proposed algorithm and change-point detection procedure with the existing ones and show that our method provides more robust estimation and detects change points more accurately in the presence of outliers and/or heavy-tailed observations.
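To give a sense of why M-estimation helps with outliers and heavy tails, the following minimal sketch computes a Huber M-estimate of a location parameter by iterative reweighting. It is not the M-estimation-based EM algorithm for the GSTAR model described above; the function name `huber_location`, the tuning constant `c = 1.345`, the MAD-based scale, and the toy data are all assumptions chosen for illustration.

```python
import statistics


def huber_location(values, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted averaging.

    Illustrative only: `c` and the MAD-based scale are conventional
    choices assumed here, not taken from the dissertation.
    """
    values = list(values)
    mu = statistics.median(values)  # robust starting point
    # Robust scale: median absolute deviation, rescaled so that it is
    # consistent for the standard deviation under normality.
    mad = statistics.median([abs(v - mu) for v in values])
    scale = 1.4826 * mad if mad > 0 else 1.0
    for _ in range(max_iter):
        # Huber weights: 1 inside the threshold, downweighted outside it.
        weights = []
        for v in values:
            r = abs(v - mu) / scale
            weights.append(1.0 if r <= c else c / r)
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu


# Hypothetical toy sample with one gross outlier (55.0).
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 55.0]
mean = sum(data) / len(data)
robust = huber_location(data)
print(f"sample mean: {mean:.2f}, Huber estimate: {robust:.2f}")
```

In this toy sample the single outlier pulls the sample mean far from the bulk of the data, while the Huber estimate stays close to it because large residuals receive small weights.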