A Generic Sure Independence Screening Procedure
Introduction
Extracting important features from ultra-high dimensional data is one of the primary tasks in statistical learning, information theory, precision medicine and biological discovery. Many of the sure independence screening methods developed to meet these needs are suitable only for special models that follow certain assumptions. With the availability of more data types and possible models, a model-free generic procedure with fewer and less restrictive assumptions on the data is required. In this paper, we propose a generic nonparametric sure independence screening procedure, called SBI-SIS, on the basis of a recently developed universal dependence measure: standardized ball information. We show that the proposed procedure has strong screening consistency even when the dimensionality is an exponential order of the sample size without subexponential moment assumptions on the data. We investigate the flexibility of this procedure by considering three commonly encountered challenging settings in biological discovery or precision medicine: iterative SBI-SIS, interaction pursuit, and survival outcomes. We use simulation studies and real data analyses to illustrate the versatility and practicability of our SBI-SIS method.
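The general shape of a marginal screening step can be sketched as follows. This is a minimal illustration, not the SBI-SIS implementation: the dependence measure used here is the absolute Pearson correlation, swapped in as a simple stand-in for standardized ball information, and the function names and toy data are hypothetical.

```python
import math


def pearson_abs(x, y):
    """Absolute Pearson correlation; a stand-in dependence measure, NOT SBI."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def marginal_screen(features, response, keep, measure=pearson_abs):
    """Score each feature marginally against the response; keep the top `keep`."""
    scores = [measure(col, response) for col in features]
    order = sorted(range(len(features)), key=lambda j: scores[j], reverse=True)
    return order[:keep], scores


# Toy data (hypothetical): feature 1 drives the response, features 0 and 2 do not.
features = [
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 2.0, -1.0, -1.0, 0.5, 0.5],
]
response = [0.1, 2.0, 4.1, 5.9, 8.0, 10.1]
selected, scores = marginal_screen(features, response, keep=1)
```

Any other marginal dependence measure with the same two-argument signature could be passed as `measure`; a model-free measure is what lets the screening step avoid model assumptions.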
Software Download
References
1. Pan, W., Wang, X.Q., Xiao, W.N., Zhu, H.T., A Generic Sure Independence Screening Procedure, Submitted.
Diseased Region Detection of Longitudinal Knee Magnetic Resonance Imaging Data
Introduction
Magnetic resonance imaging (MRI) has become an important imaging technique for quantifying the spatial location and magnitude/direction of longitudinal cartilage morphology changes in patients with osteoarthritis (OA). Although several analytical methods, such as subregion-based analysis, have been developed to refine and improve quantitative cartilage analyses, they can be suboptimal due to two major issues: the lack of spatial correspondence across subjects and time and the spatial heterogeneity of cartilage progression across subjects. The aim of this paper is to present a statistical method for longitudinal cartilage quantification in OA patients, while addressing these two issues. The 3D knee image data is preprocessed to establish spatial correspondence across subjects and/or time. Then, a Gaussian hidden Markov model (GHMM) is proposed to deal with the spatial heterogeneity of cartilage progression across both time and OA subjects. To estimate unknown parameters in GHMM, we employ a pseudo-likelihood function and optimize it by using an expectation-maximization (EM) algorithm. The proposed model can effectively detect diseased regions in each OA subject and present a localized analysis of longitudinal cartilage thickness within each latent subpopulation. Our GHMM integrates the strengths of two standard statistical methods including the local subregion-based analysis and the ordered value approach. We use simulation studies and the Pfizer longitudinal knee MRI dataset to evaluate the finite sample performance of GHMM in the quantification of longitudinal cartilage morphology changes. Our results indicate that GHMM significantly outperforms several standard analytical methods.
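To give a feel for the EM iteration mentioned above, here is a simplified sketch that fits a two-component univariate Gaussian mixture. It is an assumption-laden stand-in: it omits the hidden Markov spatial structure and the pseudo-likelihood used by GHMM, and all names and data are hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, n_iter=50, var_floor=1e-6):
    """EM for a two-component Gaussian mixture (no spatial or Markov structure)."""
    n = len(data)
    mu = [min(data), max(data)]  # crude initial means
    mean = sum(data) / n
    var0 = max(sum((x - mean) ** 2 for x in data) / n, var_floor)
    var = [var0, var0]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation
        resp = []
        for x in data:
            dens = [weight[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = dens[0] + dens[1]
            resp.append([d / total for d in dens])
        # M-step: weighted updates of mixing weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            ss = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data))
            var[k] = max(ss / nk, var_floor)
    return weight, mu, var


# Hypothetical data: two well-separated groups of measurements.
group_a = [-0.2, -0.1, 0.0, 0.1, 0.2]
group_b = [4.8, 4.9, 5.0, 5.1, 5.2]
weight, mu, var = em_two_gaussians(group_a + group_b)
```

On well-separated data like this, the estimated means settle near the two group centers and the mixing weights near one half each.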
Software Download
References
1. Huang, C., Shan, L., Charles, H., Niethammer, M., and Zhu, H. T. Diseased Region Detection of Longitudinal Knee MRI Data. Information Processing in Medical Imaging, 7917, 632-643, 2013.
2. Huang, C., Shan, L., Charles, H., Wirth, W., Niethammer, M., and Zhu, H. T. Diseased Region Detection of Longitudinal Knee Magnetic Resonance Imaging Data. IEEE Transactions on Medical Imaging, 34, 1914-1927, 2015.
Clustering High-Dimensional Landmark-Based Two-Dimensional Shape Data
Introduction
An important goal in image analysis is to cluster and recognize objects of interest according to the shapes of their boundaries. Clustering such objects faces at least four major challenges including a curved shape space, a high-dimensional feature space, a complex spatial correlation structure, and shape variation associated with some covariates (e.g., age or gender). The aim of this article is to develop a penalized model-based clustering framework to cluster landmark-based planar shape data, while explicitly addressing these challenges. Specifically, a mixture of offset-normal shape factor analyzers (MOSFA) is proposed with mixing proportions defined through a regression model (e.g., logistic) and an offset-normal shape distribution in each component for data in the curved shape space. A latent factor analysis model is introduced to explicitly model the complex spatial correlation. A penalized likelihood approach with both adaptive pairwise fused Lasso penalty function and L2 penalty function is used to automatically realize variable selection via thresholding and deliver a sparse solution. Our real data analysis has confirmed the excellent finite-sample performance of MOSFA in revealing meaningful clusters in the corpus callosum shape data obtained from the Attention Deficit Hyperactivity Disorder-200 (ADHD-200) study. Supplementary materials for this article are available online.
Software Download
References
1. Huang, C., Styner, M., and Zhu, H. T. Clustering High-Dimensional Landmark-Based Two-Dimensional Shape Data. Journal of the American Statistical Association, 110, 946-961, 2015.
Regression Models on Riemannian Symmetric Spaces
Introduction
The aim of this paper is to develop a general regression framework for the analysis of manifold-valued response in a Riemannian symmetric space (RSS) and its association with multiple covariates of interest, such as age or gender, in Euclidean space. Such RSS-valued data arises frequently in medical imaging, surface modeling, and computer vision, among many others. We develop an intrinsic regression model solely based on an intrinsic conditional moment assumption, avoiding specifying any parametric distribution in RSS. We propose various link functions to map from the Euclidean space of multiple covariates to the RSS of responses. We develop a two-stage procedure to calculate the parameter estimates and determine their asymptotic distributions. We construct the Wald and geodesic test statistics to test hypotheses of unknown parameters. We systematically investigate the geometric invariant property of these estimates and test statistics. Simulation studies and a real data analysis are used to evaluate the finite sample properties of our methods.
Software Download
References
1. Cornea, E., Zhu, H. T., Kim, P., and Ibrahim, J. G. Intrinsic regression model for data in Riemannian symmetric space. Journal of the Royal Statistical Society B, 2016.
FVGWAS: Fast Voxelwise Genome Wide Association Analysis
Introduction
The aim of this tool is to develop a Fast Voxelwise Genome Wide Association analysiS (FVGWAS) framework to efficiently carry out whole-genome analyses of whole-brain data. FVGWAS consists of three components: a heteroscedastic linear model, a global sure independence screening (GSIS) procedure, and a detection procedure based on wild bootstrap methods. Specifically, for standard linear association, the computational complexity is O(n*N_V*N_C) for the voxelwise genome wide association analysis (VGWAS) method compared with O((N_C+N_V)*n^2) for FVGWAS. FVGWAS may be a valuable statistical toolbox for large-scale imaging genetic analysis as the field is rapidly advancing with ultra-high-resolution imaging and whole-genome sequencing.
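As a rough arithmetic illustration of the two complexity orders quoted above, the sketch below evaluates both expressions for one hypothetical study size. It reads n as the sample size, N_V as the number of voxels, and N_C as the number of genetic markers, which is an interpretation since the text does not define them, and it ignores constant factors, so the ratio is only indicative.

```python
def vgwas_cost(n, n_v, n_c):
    """Leading-order operation count O(n * N_V * N_C) quoted for VGWAS."""
    return n * n_v * n_c


def fvgwas_cost(n, n_v, n_c):
    """Leading-order operation count O((N_C + N_V) * n^2) quoted for FVGWAS."""
    return (n_c + n_v) * n ** 2


# Hypothetical study size, chosen only for illustration.
n, n_v, n_c = 1000, 1_000_000, 500_000
ratio = vgwas_cost(n, n_v, n_c) / fvgwas_cost(n, n_v, n_c)
```

For these illustrative sizes the ratio is in the hundreds, because n is much smaller than both N_V and N_C; the advantage shrinks as n grows relative to them.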
Software Download
References
1. Huang, M. Y., Nichols, T., Huang, C., Yu, Y., Lu, Z. H., Knickmeyer, R. C., Feng, Q. J., Zhu, H. T., and for the Alzheimer's Disease Neuroimaging Initiative. FVGWAS: Fast Voxelwise Genome Wide Association Analysis of Large-scale Imaging Genetic Data. NeuroImage, 2015, accepted.
Functional Mixed Processes Models
Introduction
The aim of this tool is to implement a functional analysis pipeline for the joint analysis of longitudinally measured functional data and clinical data, such as age, gender and disease status. FMPM (functional mixed processes models) consists of a functional mixed effects model for characterizing the association of functional response with covariates of interest by incorporating complex spatial–temporal correlation structure, an efficient method for spatially smoothing varying coefficient functions, an estimation method for estimating the spatial–temporal correlation structure, a test procedure with local and global test statistics for testing hypotheses of interest associated with functional response, and a simultaneous confidence band for quantifying the uncertainty in the estimated coefficient functions.
Software Download
References
1. Yuan, Y., Gilmore, J. H., Geng, X. J., Styner, M., Chen, K. H., Wang, J. L., and Zhu, H. T. FMEM: Functional mixed effects modeling for the analysis of longitudinal white matter Tract data. NeuroImage, 2014;84:753-764.
2. Yuan, Y., Gilmore, J. H., Geng, X. J., Styner, M., Chen, K. H., Wang, J. L., and Zhu, H. T. A longitudinal functional analysis framework for analysis of white matter tract statistics. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2013;7917 LNCS:220-231.
Perturbation and Scaled Cook's Distance
Introduction
Cook's distance (Cook, 1977) is one of the most important diagnostic tools for detecting influential individual observations or subsets of observations in linear regression for cross-sectional data. However, for many complex data structures (e.g., longitudinal data), no rigorous approach has been developed to address a fundamental issue: deleting subsets with different numbers of observations introduces different degrees of perturbation to the current model fitted to the data, and the magnitude of Cook's distance is associated with the degree of the perturbation. The aim of this paper is to address this issue in general parametric models with complex data structures. We propose a new quantity for measuring the degree of the perturbation introduced by deleting a subset. We use stochastic ordering to quantify the stochastic relationship between the degree of the perturbation and the magnitude of Cook's distance. We develop several scaled Cook's distances to resolve the comparison of Cook's distance for different subset deletions. Theoretical and numerical examples are examined to highlight the broad spectrum of applications of these scaled Cook's distances in a formal influence analysis.
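For orientation, the classical single-observation Cook's distance in a simple linear regression can be computed as in the sketch below. This shows only the unscaled baseline quantity; the scaled Cook's distances proposed in the paper are not reproduced here, and the function name and data are hypothetical.

```python
def cooks_distance(x, y):
    """Classical Cook's distance for each observation in a simple linear regression."""
    n = len(x)
    p = 2  # number of regression parameters: intercept and slope
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)  # residual variance estimate
    distances = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx  # leverage of this observation
        distances.append(e ** 2 / (p * s2) * h / (1.0 - h) ** 2)
    return distances


# Hypothetical data: the last point sits far from the others in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]
d = cooks_distance(x, y)
most_influential = d.index(max(d))
```

The formula combines the squared residual with the leverage, which is why a point that is both far out in x and poorly fitted dominates. Deleting subsets of different sizes changes the scale of this quantity, which is the comparison problem the scaled versions address.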
Software Download
Supplement Document
References
1. Zhu HT, Ibrahim JG, Cho HS. Perturbation and Scaled Cook's distance. Annals of Statistics, in revision, 2012.
Local Polynomial Regression for Symmetric Positive Definite Matrices
Introduction
Abstract: Local polynomial regression has received extensive attention for the nonparametric estimation of regression functions when both the response and the covariate are in Euclidean space. However, little has been done when the response is in a Riemannian manifold. We develop an intrinsic local polynomial regression estimate for the analysis of symmetric positive definite (SPD) matrices as responses that lie in a Riemannian manifold with covariate in Euclidean space. The primary motivation and application of the proposed methodology is in computer vision and medical imaging. We examine two commonly used metrics, the trace metric and the Log-Euclidean metric, on the space of SPD matrices. For each metric, we develop a cross-validation bandwidth selection method, derive the asymptotic bias, variance, and normality of the intrinsic local constant and local linear estimators, and compare their asymptotic mean square errors. Simulation studies are further used to compare the estimators under the two metrics and to examine their finite sample performance. We use our method to detect diagnostic differences between diffusion tensors along fiber tracts in a study of human immunodeficiency virus.
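To make the Log-Euclidean metric concrete, the sketch below computes the Log-Euclidean distance between two SPD matrices, i.e. the Frobenius norm of the difference of their matrix logarithms. It is restricted to the 2x2 case so that the matrix logarithm has a closed form without external libraries; this restriction and the function names are assumptions of the example, not part of the toolkit.

```python
import math


def spd_log_2x2(s):
    """Matrix logarithm of a 2x2 symmetric positive definite matrix [[a, b], [b, c]]."""
    (a, b), (_, c) = s
    mean = (a + c) / 2.0
    half_gap = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    lam1, lam2 = mean + half_gap, mean - half_gap  # eigenvalues, lam1 >= lam2
    if lam2 <= 0:
        raise ValueError("matrix is not positive definite")
    if half_gap == 0.0:
        # s is a multiple of the identity
        return [[math.log(lam1), 0.0], [0.0, math.log(lam1)]]
    # log(S) = alpha * S + beta * I, chosen to match log() on both eigenvalues
    alpha = (math.log(lam1) - math.log(lam2)) / (lam1 - lam2)
    beta = math.log(lam2) - alpha * lam2
    return [[alpha * a + beta, alpha * b], [alpha * b, alpha * c + beta]]


def log_euclidean_distance(s1, s2):
    """Frobenius norm of log(s1) - log(s2)."""
    l1, l2 = spd_log_2x2(s1), spd_log_2x2(s2)
    return math.sqrt(sum((l1[i][j] - l2[i][j]) ** 2
                         for i in range(2) for j in range(2)))


identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[math.e ** 2, 0.0], [0.0, 1.0]]
dist = log_euclidean_distance(identity, stretched)
```

Because the logarithm maps SPD matrices to the vector space of symmetric matrices, ordinary Euclidean tools such as averaging and least squares can be applied after the transform.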
Software Download
Supplement Document
References
1. Ying Yuan, Hongtu Zhu, Weili Lin, J. S. Marron (2011). Local Polynomial Regression for Symmetric Positive Definite Matrices. JRSSB
Fixed and Random Effects Selection in Mixed Effects Toolkit
Introduction
Abstract: We consider selecting both fixed and random effects in a general class of mixed effects models using maximum penalized likelihood (MPL) estimation along with the smoothly clipped absolute deviation (SCAD) and adaptive least absolute shrinkage and selection operator (ALASSO) penalty functions. The MPL estimates are shown to possess consistency and sparsity properties and asymptotic normality. A model selection criterion, called the ICQ statistic, is proposed for selecting the penalty parameters (Ibrahim, Zhu, and Tang, 2008, Journal of the American Statistical Association 103, 1648-1658). The variable selection procedure based on ICQ is shown to consistently select important fixed and random effects. The methodology is very general and can be applied to numerous situations involving random effects, including generalized linear mixed models. Simulation studies and a real data set from a Yale infant growth study are used to illustrate the proposed methodology.
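The two penalty functions named above have simple closed forms for a single coefficient, sketched below. The SCAD form follows Fan and Li (2001) with the conventional default a = 3.7, and the adaptive lasso weight uses an initial estimate raised to a power gamma. How the toolkit applies these penalties to fixed effects and to groups of random effects is not shown here, and the function names are hypothetical.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty evaluated at a single coefficient (requires a > 2, lam >= 0)."""
    t = abs(theta)
    if t <= lam:
        return lam * t  # behaves like the lasso near zero
    if t <= a * lam:
        return (2.0 * a * lam * t - t ** 2 - lam ** 2) / (2.0 * (a - 1.0))
    return lam ** 2 * (a + 1.0) / 2.0  # constant: large coefficients are not shrunk further


def alasso_penalty(theta, lam, initial_estimate, gamma=1.0):
    """Adaptive lasso penalty: lasso weighted by 1 / |initial_estimate|**gamma."""
    return lam * abs(theta) / abs(initial_estimate) ** gamma


small = scad_penalty(0.5, lam=1.0)
large = scad_penalty(10.0, lam=1.0)
weighted = alasso_penalty(2.0, lam=1.0, initial_estimate=4.0)
```

Both penalties leave large coefficients nearly unpenalized while still shrinking small ones toward zero, which is what supports the sparsity and consistency properties described above.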
Documents Download
Software Download
References
1. Joseph G. Ibrahim, Hongtu Zhu, Ramon I. Garcia, and Ruixin Guo (2011). "Fixed and Random Effects Selection in Mixed Effects Models". Biometrics.
Bayesian Lasso for Semiparametric Structural Equation Models Toolkit
Introduction
There has been great interest in developing nonlinear structural equation models and associated statistical inference procedures, including estimation and model selection methods. In this paper a general semiparametric structural equation model (SSEM) is developed in which the structural equation is composed of nonparametric functions of exogenous latent variables and fixed covariates on a set of latent endogenous variables. A basis representation is used to approximate these nonparametric functions in the structural equation and the Bayesian Lasso method coupled with a Markov Chain Monte Carlo (MCMC) algorithm is used for simultaneous estimation and model selection. The proposed method is illustrated using a simulation study and data from the Affective Dynamics and Individual Differences (ADID) study. Results demonstrate that our method can accurately estimate the unknown parameters and correctly identify the true underlying model. Key words: Bayesian Lasso; Latent variable; Spline; Structural equation model.
Documents Download
Software Download
References
1. Ruixin Guo, Hongtu Zhu, Sy-Miin Chow, and Joseph G. Ibrahim (2011). "Bayesian Lasso for Semiparametric Structural Equation Models". Submitted.
DTI-Statistics FADTTS Toolkit
Introduction
FADTTS is a functional analysis pipeline for delineating the structure of the variability of multiple diffusion properties along major white matter fiber bundles and their association with a set of covariates of interest, such as age, diagnostic status and gender, in various diffusion tensor imaging studies. FADTTS integrates five statistical tools: a multivariate varying coefficient model for allowing the varying coefficient functions to characterize the varying association between fiber bundle diffusion properties and a set of covariates, a weighted least squares estimation to estimate the varying coefficient functions, a functional principal component analysis to delineate the structure of the variability in fiber bundle diffusion properties, a global test statistic to test hypotheses of interest, and a simultaneous confidence band to quantify the uncertainty in the estimated coefficient functions. FADTTS can be used to facilitate understanding normal brain development, the neural bases of neuropsychiatric disorders, and the joint effects of environmental and genetic factors on white matter fiber bundles.
Documents Download
Software Download (with GUI)
References
1. Zhu, H. T., Li, R. Z., and Kong, L. L. (2011). "Multivariate varying coefficient models for functional responses", Technical report, University of North Carolina at Chapel Hill. [pdf]
2. Hongtu Zhu, Linglong Kong, Runze Li, Martin Styner, Guido Gerig, Weili Lin, and John H. Gilmore (2010). "FADTTS: Functional Analysis of Diffusion Tensor Tract Statistics", Submitted.
3. Zhu, H., Styner, M., Li, Y., Kong, L., Shi, Y., Lin, W., Coe, C., and Gilmore, J. (2010). "Multivariate varying coefficient models for dti tract statistics". In Jiang, T., Navab, N., Pluim, J., and Viergever, M., editors, Medical Image Computing and Computer-Assisted Intervention MICCAI 2010, volume 6361 of Lecture Notes in Computer Science, pages 690-697. Springer Berlin / Heidelberg. [pdf]
4. Audrey R Verde, Jean-Baptiste Berger, Aditya Gupta, Mahshid Farzinfar, Adrien Kaiser, Vicki W Chanon, Charlotte Boettiger, Hans Johnson, Joy Matsui, Anuja Sharma, Casey Goodlett, Yundi Shi, Hongtu Zhu, Guido Gerig, Sylvain Gouttard, Clement Vachet, Martin Styner (2013). "UNC-Utah NA-MIC DTI framework: Atlas Based Fiber Tract Analysis with Application to a Study of Nicotine Smoking Addiction". In Medical Imaging 2013: Image Processing, edited by Sebastien Ourselin, David R. Haynor, Proc. of SPIE Vol. 8669, 86692D, doi: 10.1117/12.2007093. [pdf]
DTI-Statistics FRATS Toolkit
Introduction
FRATS (Functional Regression Analysis of DTI Tract Statistics) is a functional regression framework for the analysis of multiple diffusion properties along fiber bundles as functions in an infinite dimensional space and their association with a set of covariates of interest, such as age, diagnostic status and gender, in real applications. It consists of four integrated components: the local polynomial kernel method for smoothing multiple diffusion properties along individual fiber bundles, a functional linear model for characterizing the association between fiber bundle diffusion properties and a set of covariates, a global test statistic for testing hypotheses of interest, and a resampling method for approximating the p-value of the global test statistic. The resulting analysis pipeline can be used for understanding normal brain development, the neural bases of neuropsychiatric disorders, and the joint effects of environmental and genetic factors on white matter fiber bundles.
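The first component, local polynomial kernel smoothing, can be illustrated with a local linear smoother for a single curve, as sketched below. The Gaussian kernel, the fixed bandwidth, and the function name are assumptions made for this example; the toolkit's actual kernel and bandwidth selection are not shown.

```python
import math


def local_linear(x, y, x0, h):
    """Local linear estimate of E[y | x = x0] with a Gaussian kernel and bandwidth h."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(x, y):
        d = xi - x0
        w = math.exp(-0.5 * (d / h) ** 2)  # kernel weight, larger near x0
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    # Intercept of the kernel-weighted least squares line fitted around x0
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


# Hypothetical measurements along a tract, here an exactly linear profile.
x = [i / 10.0 for i in range(11)]
y = [2.0 * xi + 1.0 for xi in x]
fit = local_linear(x, y, x0=0.35, h=0.2)
```

A local linear fit reproduces an exactly linear profile regardless of the bandwidth, which makes a convenient sanity check: here the estimate at 0.35 should agree with 2 * 0.35 + 1 up to rounding.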
Documents Download
Software Download (with GUI)
References
1. Zhu, H. T., Styner, M., Tang, N. S., Liu, Z. X., Lin, W. L., and Gilmore, J. (2010). "FRATS: Functional Regression Analysis of DTI Tract Statistics", IEEE Transactions on Medical Imaging, 29:1039 - 1049.[pdf]
DTI-Statistics MARM Toolkit
Introduction
Neuroimaging studies aim to analyze imaging data with complex spatial patterns in a large number of locations (called voxels) on a two-dimensional (2D) surface or in a 3D volume. Conventional analyses of imaging data include two sequential steps: spatially smoothing imaging data and then independently fitting a statistical model at each voxel. However, conventional analyses suffer from the same amount of smoothing throughout the whole image, the arbitrary choice of smoothing extent, and low statistical power in detecting spatial patterns. We propose a multiscale adaptive regression model (MARM) to integrate the propagation-separation (PS) approach (Polzehl and Spokoiny, 2000, 2006) with statistical modeling at each voxel for spatial and adaptive analysis of neuroimaging data from multiple subjects. MARM has three features: being spatial, being hierarchical, and being adaptive. We use a multiscale adaptive estimation and testing procedure (MAET) to utilize imaging observations from the neighboring voxels of the current voxel to adaptively calculate parameter estimates and test statistics. Theoretically, we establish consistency and asymptotic normality of the adaptive estimates and the asymptotic distribution of the adaptive test statistics.
Documents
Download
References
1. Yimei Li, Hongtu Zhu, Dinggang Shen, Weili Lin, John H. Gilmore and Joseph G Ibrahim (2010). "Multiscale Adaptive Regression Models for Neuroimaging Data", JRSS, Series B, under revision.
Varying Coefficient Model For Modeling Diffusion Tensors Along White Matter Tracts
Introduction
Diffusion tensor imaging provides important information on tissue structure and orientation of fiber tracts in brain white matter in vivo. It results in diffusion tensors, which are 3×3 symmetric positive definite (SPD) matrices, along fiber bundles. This paper develops a functional data analysis framework to model diffusion tensors along fiber tracts as functional data in a Riemannian manifold with a set of covariates of interest, such as age and gender. We propose a statistical model with varying coefficient functions to characterize the dynamic association between functional SPD matrix-valued responses and covariates. We calculate weighted least squares estimators of the varying coefficient functions for the Log-Euclidean metric in the space of SPD matrices. We also develop a global test statistic to test specific hypotheses about these coefficient functions and construct their simultaneous confidence bands. Simulated data are further used to examine the finite sample performance of the estimated varying coefficient functions. We apply our model to study potential gender differences and find a statistically significant aspect of the development of diffusion tensors along the right internal capsule tract in a clinical study of neurodevelopment.
Documents
Download
References
1. Ying Yuan, Hongtu Zhu, Martin Styner, John H. Gilmore and J. S. Marron. "Varying Coefficient Model For Modeling Diffusion Tensors Along White Matter Tracts", Annals of Applied Statistics, under revision.
Projection Regression Models for Multivariate Imaging Phenotype
Introduction
This paper presents a projection regression model (PRM) to assess the relationship between a multivariate phenotype and a set of covariates, such as a genetic marker, age and gender. In the existing literature, a standard statistical approach to this problem is to fit a multivariate linear model to the multivariate phenotype and then use Hotelling's $T^2$ to test hypotheses of interest. An alternative approach is to fit a simple linear model and test hypotheses for each individual phenotype and then correct for multiplicity. However, even when the dimension of the multivariate phenotype is relatively small, say 5, such standard approaches can suffer from the issue of low statistical power in detecting the association between the multivariate phenotype and the covariates. The PRM generalizes a statistical method based on the principal component of heritability for association analysis in genetic studies of complex multivariate phenotypes. The key components of the PRM include an estimation procedure for extracting several principal directions of multivariate phenotypes relating to covariates and a test procedure based on wild-bootstrap method for testing for the association between the weighted multivariate phenotype and explanatory variables. Simulation studies and an imaging genetic dataset are used to examine the finite sample performance of the PRM.
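The idea behind a wild-bootstrap test of association can be sketched in a much simpler setting than the PRM: a single response, a single covariate, and the least squares slope as the test statistic. Everything in this example is an illustrative assumption, including the Rademacher (plus or minus one) multipliers, the choice of statistic, and the function names; it is not the PRM test procedure.

```python
import random


def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx


def wild_bootstrap_pvalue(x, y, n_boot=500, seed=0):
    """Wild-bootstrap p-value for the null hypothesis of no association."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    ybar = sum(y) / len(y)
    resid = [yi - ybar for yi in y]  # residuals under the null (intercept-only) model
    exceed = 0
    for _ in range(n_boot):
        # Flip the sign of each null residual at random, keeping its magnitude
        y_star = [ybar + e * rng.choice((-1.0, 1.0)) for e in resid]
        if abs(slope(x, y_star)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)


# Hypothetical data with a strong linear association.
x = [float(i) for i in range(20)]
y = [3.0 * xi + 1.0 for xi in x]
p_value = wild_bootstrap_pvalue(x, y)
```

Multiplying each residual by its own random sign preserves observation-specific variances, which is why wild bootstrap methods are commonly paired with heteroscedastic models.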
Documents
Download
References
BIAS-SSPM Toolkits
Introduction
Many large-scale longitudinal imaging studies have been or are being widely conducted to better understand the progress of neuropsychiatric and neurodegenerative disorders and normal brain development. The goal of this article is to develop a multiscale adaptive generalized estimation equation (MAGEE) method for spatial and adaptive analysis of neuroimaging data from longitudinal studies. MAGEE is applicable to making statistical inference on regression coefficients in both balanced and unbalanced longitudinal designs and even in twin and familial studies, whereas standard software platforms have several major limitations in handling these complex studies. Specifically, conventional voxel-based analyses in these software platforms involve Gaussian smoothing of the imaging data and then independently fitting a statistical model at each voxel. However, the conventional smoothing methods suffer from the lack of spatial adaptivity to the shape and spatial extent of the region of interest and the arbitrary choice of smoothing extent, while independently fitting statistical models across voxels does not account for the spatial properties of imaging observations and noise distribution. To address such drawbacks, we adapt a powerful propagation-separation (PS) procedure to sequentially incorporate the neighboring information of each voxel and develop a novel strategy to solely update a set of parameters of interest, while fixing other nuisance parameters at their initial estimators. Simulation studies and real data analysis show that MAGEE significantly outperforms voxel-based analysis.
Documents
Download
References
1. Yimei Li, Hongtu Zhu, Dinggang Shen, Weili Lin, John H. Gilmore and Joseph G Ibrahim (2010). "Multiscale Adaptive Regression Models for Neuroimaging Data", JRSS, Series B, under revision.
2. Li Y, Gilmore JH, Shen D, Styner M, Lin W, Zhu H. (2013). Multiscale adaptive generalized estimating equations for longitudinal neuroimaging data. Neuroimage, 72:91-105.
Software Agreement
THE SOFTWARE PACKAGES ARE PROVIDED ``AS IS'', AND ONLY FOR NON-PROFIT USE. CURRENTLY THERE IS NO FORMAL SUPPORT ON IT; FURTHER ASSISTANCE BY THE AUTHORS REGARDING APPLICATION OF THIS SOFTWARE WILL NOT BE PROVIDED, IN GENERAL. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA ETC.) CAUSED AND ON ANY THEORY OF LIABILITY ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE PACKAGE.