Dr. Wenyi Wang

My research is motivated by large-scale complex data sets from recent genomic and familial studies and by important biological questions that emerge from the analysis of these data. My current interests can be divided into two parts: 1) Development of methods and software for the accurate measurement of high-throughput genomic data; 2) Development of software and validation of statistical approaches for personalized cancer risk prediction (see publications and software for details).

1) It is non-trivial to extract genomic information of interest from the raw signals that come directly from chemical or physical reactions. Current high-throughput technologies all inevitably incorporate multi-level confounders that affect the observed signals. The large amount of data they produce also makes it difficult to calibrate these technologies against "gold standards", which are usually generated by experiments that are more accurate but low-throughput and expensive. In this area, my work is focused on the accurate interpretation of raw high-throughput signals using statistical modeling. I have worked with high-throughput data measuring genomes, such as copy number and single nucleotide variants, as well as data measuring transcriptomes, such as expression and alternative splicing. Currently, my research group is focusing on integrating multi-platform information in order to better understand complex human diseases. In particular, we are looking into non-small cell lung cancer (NSCLC) through collaboration with clinicians and pathologists at the University of Texas MD Anderson Cancer Center.

2) Cancer results from the accumulation of multiple genetic mutations. Germline mutations in specific genes predispose the carrier to the development of cancer; this increased risk is also known as "inherited susceptibility". This inheritance results in familial clustering of cancers, known as "familial cancer syndromes". Clinical researchers use model-based prediction algorithms to identify mutation-carrying cancer patients at earlier and more treatable stages and/or to identify healthy individuals at high risk of developing cancer in the future. Mendelian carrier probability models, which are based on Bayesian methods and take detailed family history as input, have shown better performance than empirical models that use regression or classification trees alone. In this area, my work is focused on a) applying Mendelian models to cancers of interest for personalized risk assessment and b) developing methodologies for evaluation of risk assessment models using family and correlated data.

Main Projects

Tumor Heterogeneity



Tumors usually consist of different subpopulations (subclones) that are characterized by somatic mutations. The composition of such subpopulations may affect cancer prognosis and treatment efficacy. Understanding the subclonal structure also helps infer the evolution of tumor cells, which can further guide the discovery of driver mutations. Currently, most subclonal reconstruction methods are based on the Dirichlet process (DP). They require expensive computation, since an MCMC algorithm is commonly adopted to solve the problem, and careful post-processing, because the number of clusters scales with the number of single nucleotide variations (SNVs) in the DP setting. In this article, we present CliP (Clonal structure identification through penalizing pairwise difference), a fast algorithm for calling subclonal structures that requires minimal post-processing.
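The idea of penalizing pairwise differences can be illustrated with a minimal sketch. It assumes a squared-error loss and a plain L1 penalty on every pair of fitted per-SNV prevalences; these are simplifications chosen for illustration and are not claimed to be the loss or penalty used by CliP. The function names and the example numbers are hypothetical.

```python
from itertools import combinations


def pairwise_penalized_objective(observed, fitted, lam):
    """Squared-error loss plus an L1 penalty on all pairwise differences.

    Assumption: this loss and penalty are illustrative stand-ins, not the
    published CliP objective.
    """
    loss = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    penalty = sum(abs(a - b) for a, b in combinations(fitted, 2))
    return loss + lam * penalty


def count_clusters(fitted, tol=1e-8):
    """Number of distinct fitted values, i.e. the implied number of clusters."""
    centers = []
    for value in sorted(fitted):
        if not centers or value - centers[-1] > tol:
            centers.append(value)
    return len(centers)


# Hypothetical per-SNV prevalence estimates for four SNVs.
observed = [0.98, 1.02, 0.49, 0.51]
fit_per_snv = list(observed)            # one value per SNV, zero loss
fit_clustered = [1.0, 1.0, 0.5, 0.5]    # SNVs fused into two groups

obj_per_snv = pairwise_penalized_objective(observed, fit_per_snv, lam=0.1)
obj_clustered = pairwise_penalized_objective(observed, fit_clustered, lam=0.1)
```

With this penalty weight the fused fit has the lower objective, so SNVs with similar prevalence are pulled to a shared value and the number of clusters falls out of the fit instead of being set in post-processing.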



Consensus Clustering for Subclonal Structure Reconstruction (CSR) was originally created for the Heterogeneity and Evolution working group of the Pan-Cancer Analysis of Whole Genomes (PCAWG) project of the International Cancer Genome Consortium (ICGC), during the Heterogeneity project. It was used to build a consensus subclonal architecture from the results of 11 participating methods.



We developed DeMixT, a novel method for the gene expression deconvolution of three compartments in cancer patient samples: tumor, immune and surrounding stromal cells. In validation studies using mixed cell line and laser-capture microdissection data, DeMixT yielded accurate estimates for both cell proportions and compartment-specific expression profiles. Application to head and neck cancer data shows that DeMixT-based deconvolution provides an important step to link tumor transcriptome data with clinical outcomes.
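To make the three-compartment setup concrete, here is a minimal sketch of the mixing relation only. It assumes that the observed expression of each gene is a proportion-weighted sum of the stromal, immune and tumor expression, and that the proportions and the two non-tumor profiles are already known. DeMixT estimates proportions and profiles jointly under a statistical model, which this sketch does not attempt; all function names and numbers are hypothetical.

```python
def mix(stroma, immune, tumor, pi_stroma, pi_immune):
    """Observed per-gene expression under an assumed linear mixture."""
    pi_tumor = 1.0 - pi_stroma - pi_immune
    return [
        pi_stroma * s + pi_immune * i + pi_tumor * t
        for s, i, t in zip(stroma, immune, tumor)
    ]


def recover_tumor(mixed, stroma, immune, pi_stroma, pi_immune):
    """Solve the mixing relation for the tumor profile, gene by gene."""
    pi_tumor = 1.0 - pi_stroma - pi_immune
    if pi_tumor <= 0.0:
        raise ValueError("tumor proportion must be positive")
    return [
        (y - pi_stroma * s - pi_immune * i) / pi_tumor
        for y, s, i in zip(mixed, stroma, immune)
    ]


# Hypothetical expression values for three genes.
stroma = [120.0, 30.0, 8.0]
immune = [15.0, 200.0, 40.0]
tumor = [60.0, 10.0, 500.0]

observed = mix(stroma, immune, tumor, pi_stroma=0.3, pi_immune=0.2)
tumor_hat = recover_tumor(observed, stroma, immune, pi_stroma=0.3, pi_immune=0.2)
```

In practice none of the right-hand quantities are observed directly, which is why the deconvolution has to be posed as an estimation problem.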

Variant Calling



De novo germline mutations have been increasingly recognized as causal factors for rare diseases. Given this important role, the ability to identify de novo mutation carriers among all mutation carriers would allow researchers to study the contribution of de novo mutations to these diseases. To fill this need, using Li-Fraumeni syndrome as a representative rare disease, we developed and tested a method called Famdenovo that predicts the parental mutation status (de novo or familial) of TP53 mutation carriers. We validated Famdenovo on 182 families with known parental mutation status and applied Famdenovo to 318 families with TP53 mutations.



Our method (Family-Based Sequencing Program, FamSeq) integrates Mendelian transmission information and raw sequencing reads for variant calling. In a large family affected with Wilms tumor, 84% of variants uniquely identified by FamSeq were confirmed by Sanger sequencing. In children with early-onset neurodevelopmental disorders from 26 families, de novo variant calls in disease candidate genes were corrected by FamSeq as Mendelian variants, and the number of uniquely identified variants in affected individuals increased proportionally as additional family members were included in the analysis. To gain insight into maximizing variant detection, we studied factors impacting actual improvements of family-based calling, including pedigree structure, allele frequency (common vs. rare variants), prior settings of minor allele frequency, sequence signal-to-noise ratio, and coverage depth (20× to >200×). These data will help guide the design, analysis, and interpretation of family-based sequencing studies to improve the ability to identify new disease-associated genes.
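How Mendelian transmission can sharpen a genotype call is sketched below for a single parent-parent-child trio. The sketch assumes a binomial read model with a fixed error rate, Hardy-Weinberg priors for the parents, no de novo mutation, and brute-force enumeration over parental genotypes. These are illustrative assumptions and hypothetical function names, not the FamSeq implementation.

```python
from math import comb

GENOTYPES = (0, 1, 2)  # number of alternate alleles


def read_likelihood(alt_reads, depth, genotype, error=0.01):
    """Binomial probability of the alternate-read count given a genotype."""
    p_alt = {0: error, 1: 0.5, 2: 1.0 - error}[genotype]
    return comb(depth, alt_reads) * p_alt**alt_reads * (1.0 - p_alt) ** (depth - alt_reads)


def hardy_weinberg_prior(genotype, maf):
    """Founder genotype prior from the minor allele frequency."""
    return {0: (1.0 - maf) ** 2, 1: 2.0 * maf * (1.0 - maf), 2: maf**2}[genotype]


def transmission(child, father, mother):
    """P(child genotype | parental genotypes), assuming no de novo mutation."""
    pf, pm = father / 2.0, mother / 2.0  # chance each parent passes the alt allele
    return {
        0: (1.0 - pf) * (1.0 - pm),
        1: pf * (1.0 - pm) + (1.0 - pf) * pm,
        2: pf * pm,
    }[child]


def child_posterior(child_reads, father_reads, mother_reads, maf=0.01):
    """Posterior over the child's genotype; each *_reads is (alt_reads, depth)."""
    weights = []
    for c in GENOTYPES:
        total = 0.0
        for f in GENOTYPES:
            for m in GENOTYPES:
                total += (
                    hardy_weinberg_prior(f, maf) * read_likelihood(*father_reads, f)
                    * hardy_weinberg_prior(m, maf) * read_likelihood(*mother_reads, m)
                    * transmission(c, f, m) * read_likelihood(*child_reads, c)
                )
        weights.append(total)
    norm = sum(weights)
    return [w / norm for w in weights]


# Hypothetical counts: the child's 3 of 20 alternate reads are ambiguous alone.
with_ref_parents = child_posterior((3, 20), (0, 30), (0, 30))
with_het_father = child_posterior((3, 20), (15, 30), (0, 30))
```

The same ambiguous child reads lead to different calls depending on the parents: with two clearly reference parents the heterozygous call is essentially ruled out, while a heterozygous father makes it plausible.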



Sequencing by hybridization to oligonucleotides has evolved into an inexpensive, reliable and fast technology for targeted sequencing. Hundreds of human genes can now be sequenced within a day using a single hybridization to a resequencing microarray. However, several issues inherent to these arrays (e.g. cross-hybridization, variable probe/target affinity) cause sequencing errors and have prevented more widespread applications. We developed an R package for resequencing microarray data analysis that integrates a novel statistical algorithm, sequence robust multi-array analysis (SRMA), for rare variant detection with high sensitivity (false negative rate, FNR 5%) and accuracy (false positive rate, FPR 1 × 10^-5). The SRMA package consists of five modules for quality control, data normalization, single array analysis, multi-array analysis and output analysis. The entire workflow is efficient and identifies rare DNA single nucleotide variations and structural changes such as gene deletions with high accuracy and sensitivity.

Somatic Mutation Calling



Subclonal mutations reveal important features of the genetic architecture of tumors. However, accurate detection of mutations in genetically heterogeneous tumor cell populations using next-generation sequencing remains challenging. We developed MuSE (Mutation calling using a Markov Substitution model for Evolution; http://bioinformatics.mdanderson.org/main/MuSE), a novel approach that models the evolution of the allelic composition of the tumor and normal tissue at each reference base. MuSE adopts a sample-specific error model that reflects the underlying tumor heterogeneity to greatly improve the overall accuracy. We demonstrate the accuracy of MuSE in calling subclonal mutations in the context of large-scale tumor sequencing projects using whole exome and whole genome sequencing.

Risk Prediction


LFS-Cancer Specific

Penetrance, which plays a key role in genetic research, is defined as the proportion of individuals with the genetic variants (i.e., the genotype) that cause a particular trait and who have clinical symptoms of the trait (i.e., the phenotype). We propose a Bayesian semiparametric approach to estimate the cancer-specific age-at-onset penetrance in the presence of the competing risk of multiple cancers. We employ a Bayesian semiparametric competing risk model to model the duration until individuals in a high-risk group develop different cancers, and accommodate family data using family-wise likelihoods. We tackle the ascertainment bias arising when family data are collected through probands in a high-risk population in which disease cases are more likely to be observed. We apply the proposed method to a cohort of 186 families with Li-Fraumeni syndrome identified through probands with sarcoma treated at MD Anderson Cancer Center from 1944 to 1982.


LFS-Multiple Primary

A common phenomenon in cancer syndromes is that the same individual may have multiple primary cancers at different sites during his/her lifetime. In Li-Fraumeni syndrome (LFS), a rare pediatric cancer syndrome, TP53 mutation carriers are known to have a higher probability of developing a second primary cancer than non-carriers. In this context, modeling the onset of multiple primary cancers is desirable for better clinical management of LFS. In this article, we propose a Bayesian recurrent event model based on a non-homogeneous Poisson process in order to estimate a set of penetrances for multiple primary cancers related to LFS.
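As a small illustration of the counting model behind a non-homogeneous Poisson process, the sketch below computes event-count probabilities over an age interval. It assumes a power-law cumulative intensity with hypothetical shape and scale values and omits the covariates, family structure and Bayesian estimation of the proposed model.

```python
from math import exp, factorial


def cumulative_intensity(age, shape, scale):
    """Assumed power-law cumulative intensity: (age / scale) ** shape."""
    return (age / scale) ** shape


def prob_k_events(k, start_age, end_age, shape, scale):
    """P(exactly k events in (start_age, end_age]) under the process."""
    mean = cumulative_intensity(end_age, shape, scale) - cumulative_intensity(
        start_age, shape, scale
    )
    return exp(-mean) * mean**k / factorial(k)


def prob_at_least_one(start_age, end_age, shape, scale):
    """P(at least one event in the interval), a penetrance-like quantity."""
    return 1.0 - prob_k_events(0, start_age, end_age, shape, scale)


# Hypothetical parameters, chosen only to make the example concrete.
SHAPE, SCALE = 2.0, 60.0
risk_20_to_40 = prob_at_least_one(20.0, 40.0, SHAPE, SCALE)
risk_20_to_60 = prob_at_least_one(20.0, 60.0, SHAPE, SCALE)
```

Because the intensity varies with age, the number of events in an interval depends on where the interval sits on the age axis and not only on its length, which is what lets the model describe a first and a subsequent primary cancer with age-dependent risk.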



Li-Fraumeni syndrome (LFS) is associated with germline TP53 mutations and a very high lifetime cancer risk. Algorithms that assess a patient's risk of inherited cancer predisposition are often used in clinical counseling. The existing LFS criteria have limitations, suggesting the need for an advanced prediction tool to support clinical decision making for TP53 mutation testing and LFS management. Based on a Mendelian model, LFSPRO estimates TP53 mutation probability through the Elston-Stewart algorithm and consequently estimates future risk of cancer. With independent datasets of 1,353 tested individuals from 867 families, we evaluated the prediction performance of LFSPRO.
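The Bayesian logic of a Mendelian carrier probability model can be sketched for a counselee and two parents. This sketch enumerates carrier states by brute force instead of using the Elston-Stewart algorithm, treats carrier status as binary (homozygotes ignored), assumes no de novo mutation, and uses hypothetical allele frequency and penetrance values. None of these numbers or function names come from LFSPRO.

```python
from itertools import product


def phenotype_likelihood(affected, carrier, pen_carrier, pen_noncarrier):
    """P(observed cancer status | carrier status) with fixed penetrances."""
    p = pen_carrier if carrier else pen_noncarrier
    return p if affected else 1.0 - p


def carrier_probability(
    counselee_affected,
    father_affected,
    mother_affected,
    allele_freq=0.0003,    # hypothetical value
    pen_carrier=0.7,       # hypothetical value
    pen_noncarrier=0.05,   # hypothetical value
):
    """Posterior probability that the counselee carries the mutation."""
    prior_parent = 1.0 - (1.0 - allele_freq) ** 2
    joint = {0: 0.0, 1: 0.0}
    for f, m, c in product((0, 1), repeat=3):
        p_f = prior_parent if f else 1.0 - prior_parent
        p_m = prior_parent if m else 1.0 - prior_parent
        # Each carrier parent transmits the mutation with probability 1/2.
        p_child_carrier = 1.0 - (1.0 - 0.5 * f) * (1.0 - 0.5 * m)
        p_c = p_child_carrier if c else 1.0 - p_child_carrier
        likelihood = (
            phenotype_likelihood(father_affected, f, pen_carrier, pen_noncarrier)
            * phenotype_likelihood(mother_affected, m, pen_carrier, pen_noncarrier)
            * phenotype_likelihood(counselee_affected, c, pen_carrier, pen_noncarrier)
        )
        joint[c] += p_f * p_m * p_c * likelihood
    return joint[1] / (joint[0] + joint[1])


p_no_family_history = carrier_probability(True, False, False)
p_affected_father = carrier_probability(True, True, False)
```

An affected parent raises the counselee's carrier probability because the family history is evidence about the parental genotypes, which in turn constrain the counselee's genotype through Mendelian transmission.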