My research is motivated by large-scale complex data sets in recent genomic and familial studies and by important biological questions that emerge from the analysis of these data. My current interests can be divided into two parts: 1) Development of methods and software for the accurate measurement of high-throughput genomic data; 2) Development and validation of statistical approaches and software for personalized cancer risk prediction (see publications and software for details).
1). It is non-trivial to extract genomic information of interest from the raw signals that come directly from chemical or physical reactions. Current high-throughput technologies all inevitably incorporate multi-level confounders that affect the observed signals. The large amount of data they produce also makes it difficult to calibrate these technologies against "gold standards", which are usually generated by experiments that are more accurate but low-throughput and expensive. My work in this part focuses on the accurate interpretation of raw high-throughput signals using statistical modeling. I have worked with high-throughput data measuring genomes, such as copy number and single nucleotide variants, as well as data measuring transcriptomes, such as expression and alternative splicing. Currently, my research group is focusing on integrating multi-platform information to better understand complex human diseases. In particular, we are looking into non-small cell lung cancer (NSCLC) in collaboration with clinicians and pathologists at MDACC.
2). Cancer results from the accumulation of multiple genetic mutations. A germline mutation in a cancer gene predisposes the carrier to the development of cancer, known as "inherited susceptibility". This inheritance results in familial clustering of cancers, known as "familial cancer syndromes". Clinical researchers use model-based prediction algorithms to identify cancer patients at earlier and more treatable stages and/or to identify healthy individuals at high risk of developing cancer in the future. Mendelian carrier probability models, which are based on Bayesian methods and take detailed family history as input, have performed better than empirical models that use regression or classification trees alone. My work in this part is focused on a) applying Mendelian models to cancers of interest for personalized risk assessment and b) developing methodologies for evaluation of risk assessment models using family and correlated data.
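The Bayesian step behind a Mendelian carrier probability model can be illustrated with a deliberately small sketch. It is not any specific published model: it considers one affected mother and her child, ignores the father, age, and new mutations, and every number and function name below (the prior, the two penetrances, `posterior_carrier`, `child_carrier_probability`) is a hypothetical chosen only for illustration.

```python
def posterior_carrier(prior, pen_carrier, pen_noncarrier, affected):
    """Posterior probability that one person carries the mutation (Bayes' rule).

    prior          -- population carrier probability (hypothetical value)
    pen_carrier    -- probability of cancer given carrier status (penetrance)
    pen_noncarrier -- probability of cancer given non-carrier status
    affected       -- True if the person developed cancer
    """
    like_carrier = pen_carrier if affected else 1.0 - pen_carrier
    like_noncarrier = pen_noncarrier if affected else 1.0 - pen_noncarrier
    numerator = prior * like_carrier
    return numerator / (numerator + (1.0 - prior) * like_noncarrier)


def child_carrier_probability(parent_posterior, transmission=0.5):
    """Probability that a child inherited the mutation from this parent.

    Assumes a heterozygous carrier parent (Mendelian transmission of 0.5)
    and ignores the other parent, which is a simplification.
    """
    return transmission * parent_posterior


# Hypothetical inputs, for illustration only.
mother = posterior_carrier(prior=0.01, pen_carrier=0.6,
                           pen_noncarrier=0.1, affected=True)
counselee = child_carrier_probability(mother)
print(round(mother, 4), round(counselee, 4))
```

A full Mendelian model applies the same update jointly over every relative in the pedigree, with age-dependent penetrance, which is why detailed family history is needed as input.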
Main Projects
Tumor heterogeneity and evolution

DeMix
We develop DeMixT, a novel method for gene expression deconvolution of three compartments in cancer patient samples: tumor, immune, and surrounding stromal cells. In validation studies using mixed cell line and laser-capture microdissection data, DeMixT yielded accurate estimates of both cell proportions and compartment-specific expression profiles. Application to head and neck cancer data shows that DeMixT-based deconvolution provides an important step toward linking tumor transcriptome data with clinical outcomes.
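To make the idea of three-compartment deconvolution concrete, here is a minimal sketch of a linear mixing model. It is not the DeMixT algorithm: it assumes the expression profile of every compartment is already known and recovers only the proportions by ordinary least squares, whereas DeMixT also estimates compartment-specific expression. The function name and all numbers are hypothetical.

```python
def estimate_proportions(mixed, tumor, immune, stroma):
    """Least-squares proportions under mixed = pt*tumor + pi*immune + ps*stroma.

    The constraint pt + pi + ps = 1 is imposed by substituting
    ps = 1 - pt - pi, which leaves a two-unknown linear problem.
    """
    a = [t - s for t, s in zip(tumor, stroma)]   # tumor contrast vs. stroma
    b = [i - s for i, s in zip(immune, stroma)]  # immune contrast vs. stroma
    r = [y - s for y, s in zip(mixed, stroma)]   # mixed sample vs. stroma

    aa = sum(x * x for x in a)
    bb = sum(x * x for x in b)
    ab = sum(x * y for x, y in zip(a, b))
    ar = sum(x * y for x, y in zip(a, r))
    br = sum(x * y for x, y in zip(b, r))

    # Solve the 2x2 normal equations with Cramer's rule.
    det = aa * bb - ab * ab
    if det == 0:
        raise ValueError("compartment profiles are not distinguishable")
    p_tumor = (ar * bb - ab * br) / det
    p_immune = (aa * br - ab * ar) / det
    return p_tumor, p_immune, 1.0 - p_tumor - p_immune


# Hypothetical expression profiles over four genes.
tumor = [10.0, 2.0, 5.0, 8.0]
immune = [1.0, 9.0, 4.0, 2.0]
stroma = [3.0, 3.0, 7.0, 1.0]
true_props = (0.5, 0.2, 0.3)

# Simulate a noise-free mixed sample, then recover the proportions.
mixed = [true_props[0] * t + true_props[1] * i + true_props[2] * s
         for t, i, s in zip(tumor, immune, stroma)]
estimated = estimate_proportions(mixed, tumor, immune, stroma)
print([round(p, 3) for p in estimated])
```

With real data the profiles are noisy and partly unknown, so a plain least-squares fit like this one can return negative proportions. That is one reason a statistical model is used instead.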

CliP
Tumors usually consist of different subpopulations (subclones) that are characterized by somatic mutations. The composition of these subpopulations may affect cancer prognosis and treatment efficacy, and understanding the subclonal structure helps infer the evolution of tumor cells, which can further guide the discovery of driver mutations. Currently, most subclonal reconstruction methods are based on the Dirichlet Process (DP). They are computationally expensive, since an MCMC algorithm is commonly used for inference, and they require careful post-processing because the number of clusters scales with the number of single nucleotide variations (SNVs) in the DP setting. We present CliP (Clonal structure identification through penalizing pairwise difference), a fast algorithm for calling subclonal structures that requires minimal post-processing.
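The general idea of clustering by penalizing pairwise differences can be sketched as an objective function: a data-fit term plus a penalty on the difference between every pair of per-SNV parameters, so that parameters pulled to the same value form one cluster. The sketch below only evaluates such an objective. It is not the CliP model or its optimizer, and the squared-error fit, the absolute-value penalty, the function names, and all numbers are assumptions made for illustration.

```python
def penalized_objective(observed, theta, lam):
    """Squared-error fit plus lam times the sum of pairwise |theta_i - theta_j|."""
    fit = sum((x - t) ** 2 for x, t in zip(observed, theta))
    penalty = sum(abs(theta[i] - theta[j])
                  for i in range(len(theta))
                  for j in range(i + 1, len(theta)))
    return fit + lam * penalty


def count_clusters(theta, tol=1e-8):
    """Number of distinct values in theta, treating values within tol as equal."""
    distinct = []
    for t in sorted(theta):
        if not distinct or t - distinct[-1] > tol:
            distinct.append(t)
    return len(distinct)


# Hypothetical per-SNV estimates (for example, cellular prevalence).
observed = [0.10, 0.12, 0.50, 0.52]

unfused = list(observed)           # every SNV keeps its own value
fused = [0.11, 0.11, 0.51, 0.51]   # SNVs merged into two groups

lam = 0.1
print(penalized_objective(observed, unfused, lam), count_clusters(unfused))
print(penalized_objective(observed, fused, lam), count_clusters(fused))
```

With this penalty weight the two-group solution has the lower objective, so the number of clusters is read directly off the fitted values rather than reached by post-processing MCMC samples.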
Somatic mutation calling using paired tumor-normal sequencing data

MuSE
Subclonal mutations reveal important features of the genetic architecture of tumors. However, accurate detection of mutations in genetically heterogeneous tumor cell populations using next-generation sequencing remains challenging. We develop MuSE (http://bioinformatics.mdanderson.org/main/MuSE), Mutation calling using a Markov Substitution model for Evolution, a novel approach for modeling the evolution of the allelic composition of the tumor and normal tissue at each reference base. MuSE adopts a sample-specific error model that reflects the underlying tumor heterogeneity to greatly improve the overall accuracy. We demonstrate the accuracy of MuSE in calling subclonal mutations in the context of large-scale tumor sequencing projects using whole exome and whole genome sequencing.