Computer Simulation, Bioinformatics, and Statistical Analysis of Cancer Data and Processes Kellie J
Total Page:16
File Type:pdf, Size:1020Kb
Virginia Commonwealth University VCU Scholars Compass Biostatistics Publications Dept. of Biostatistics 2015 Computer Simulation, Bioinformatics, and Statistical Analysis of Cancer Data and Processes Kellie J. Archer Virginia Commonwealth University Kevin Dobbin University of Georgia Swati Biswas University of Texas at Dallas See next page for additional authors Follow this and additional works at: http://scholarscompass.vcu.edu/bios_pubs Part of the Medicine and Health Sciences Commons Copyright © 2015 the author(s), publisher and licensee Libertas Academica Ltd. This is an open-access article distributed under the terms of the Creative Commons CC-BY-NC 3.0 License. Downloaded from http://scholarscompass.vcu.edu/bios_pubs/37 This Article is brought to you for free and open access by the Dept. of Biostatistics at VCU Scholars Compass. It has been accepted for inclusion in Biostatistics Publications by an authorized administrator of VCU Scholars Compass. For more information, please contact [email protected]. Authors Kellie J. Archer, Kevin Dobbin, Swati Biswas, Roger S. Day, David C. Wheeler, and Hao Wu This article is available at VCU Scholars Compass: http://scholarscompass.vcu.edu/bios_pubs/37 Computer Simulation, Bioinformatics, and Statistical Analysis of Cancer Data and Processes §§Kellie J. Archer §§Roger S. Day Professor, Department of Biostatistics, Director of the Associate Professor of Biomedical Informatics, Associate Massey Cancer Center Biostatistics Shared Resource, Professor of Biostatistics, University of Pittsburgh, Virginia Commonwealth University, Richmond, VA, USA. Pittsburgh, PA, USA. §§Kevin Dobbin §§David C. Wheeler Associate Professor of Biostatistics, University of Georgia, Assistant Professor, Department of Biostatistics, Virginia Athens, GA, USA. Commonwealth University, Richmond, VA, USA. §§Swati Biswas §§Hao Wu Associate Professor, Department of Mathematical Assistant Professor, Department of Biostatistics and Sciences, University of Texas at Dallas, Richardson, Bioinformatics, Emory University, Atlanta, GA, USA. TX, USA. Supplement Aims and Scope Cancer Informatics represents a hybrid discipline encompass- §§ Spatio-Temporal Simulation Models ing the fields of oncology, computer science, bioinformat- §§ Parametric Validation of Simulation Models ics, statistics, computational biology, genomics, proteomics, §§ Simulation of Dynamic Phenomena in Cancer using metabolomics, pharmacology, and quantitative epidemiology. Highly Specialized Algorithms The common bond or challenge that unifies the various dis- §§ Hyper-High Performance and Biocomplexity Systems ciplines is the need to bring order to the massive amounts of Modelling of Cancer data generated by researchers and clinicians attempting to find §§ Sequence Alignment Methods for DNA-Seq, RNA- the underlying causes and effective means of treating cancer. Seq, miRNA-Seq, CHiP-Seq Experiments §§ Spectra Analysis The future cancer informatician will need to be well-versed §§ Generic Visualization Tools in each of these fields and have the appropriate background §§ Array-Comparative Genomic Hybridization to leverage the computational, clinical, and basic science Visualization resources necessary to understand their data and separate §§ Statistical Methods for Next Generation Sequencing signal from noise. Knowledge of and the communication Data among these specialty disciplines, acting in unison, will be §§ Predictive Modeling for High-Dimensional Data the key to success as we strive to find answers underlying the §§ Robust Feature Selection complex and often puzzling diseases known as cancer. §§ Integration and Analysis of Big Biomedical Data §§ Articles focus on computer simulation, bioinformatics, Meta-Data Imaging §§ and statistical analysis of cancer data and processes and Equivalent Cross-Relaxation Imaging §§ may include: Mathematical Modeling and Image Enhancement of §§ Multi-dimensional Simulation Models of Tumour MRI Cancer Data §§ Response Rapid Imaging Analysis of PET Cancer Scans §§ Simulating Tumour Growth Dynamics CANCER INFORMATICS 2015:14(S2) 247 Archer et al. ancer is a widely-recognized public health burden and a Zemmour et al. took an important look back at centrally complex disease. High-dimensional “Omics” research important breast cancer datasets. These datasets had previ- – research that uses high dimensional genomic, pro- ously been used to develop bioinformatic diagnostic tools that teomic, and/or metabolomic data – has often been motivated are currently in use. But these tools were developed when we C 1 by cancer applications . Important lessons have been learned knew less about the statistics of this type of data than we do in processing and analyzing such data. Further, additional today. They applied regularized methods that are becoming the sophisticated technologies in medical imaging, geographic gold standard in this area. They found that shorter and argu- information systems, and environmental exposure monitor- ably more robust gene lists with equivalent prediction value ing also give rise to high-dimensional data. Yet, as quickly as are the result. They also found that the overlap of the gene methods adapt to the new technologies, it seems that biotech- lists from different modern tools is much higher than with the nological developments produce even higher dimensional and previous tools, suggesting that the previous poor overlap may more complex datasets – making bioinformatics more and have caused unnecessary concerns. Other articles pertained to more critical to the processing and understanding of patient extending penalized methods for discrete response modeling. data. But, in the rush of data processing we must not lose For example, ordinal responses have an inherent ordering, but sight of biostatistics principles so that data are correctly inter- the distance between the responses cannot be quantitatively preted, signal separated from noise, and new discoveries effi- measured. Examples of ordinal variables include tumor grade ciently translated into clinical practice without unnecessary (T0 to T4), degree of spread to the lymph nodes (N0 to N3), delays or inefficiencies. This supplement includes articles by and cancer stage (I to IV), all of which are commonly reported various researchers in statistics, biostatistics and bioinformat- in cancer research studies. The ordinal generalized monotone ics who are developing methods for addressing a wide spec- incremental forward stagewise method (GMIFS) was previ- trum of scientific questions related to big data. Some of these ously described for fitting ordinal response models in the pres- main areas are summarized below. ence of a high-dimensional covariate space. Ferber and Archer As mentioned, the challenges facing today in cancer demonstrated how the GMIFS algorithm, via a forward con- informatics are multifaceted lying at the interface of bioinfor- tinuation ratio model, could be used to fit a discrete survival matics, genetic epidemiology, epigenetics, and risk prediction, time response model, applicable when survival is recorded on and several authors contributed in these areas. In particular, an ordinal scale such as short-, intermediate-, and long-term some papers focused on the approaches for detecting gene-gene survival. Gentry et al extended the ordinal GMIFS algorithm and gene-environment interactions. Talluri and Shete consid- to enable inclusion of covariates that should not be penalized ered an information theory approach for detecting epistasis, in the model fitting process, and applied the method to predict compared it with a standard logistic regression approach, and stage of breast cancer using methylation of high-throughput demonstrated utility by applying it to head and neck cancer CpG sites as penalized predictors with important clinical data. Liu and Xuan constructed a mutual SNP association covariates coerced into the model. Makowski and Archer network (SAN) based on information from various sources of extended the GMIFS method to the Poisson regression setting gene interaction such as protein-protein interaction and gene for modeling a count response in high-dimensional covariate co-expression, with the goal that SAN reflects the real func- spaces, and applied the method for predicting micronuclei fre- tional associations between genomic loci. Zhang and Biswas quency using gene expression features as predictors. focused on detecting interactions of rare haplotypes with envi- In order to understand the gene regulatory mechanism ronmental covariates with an enhanced version of Logistic in cancer, Wei P et al. proposed a novel statistical framework Bayesian LASSO and used it to uncover a rare haplotype to integrate diverse types of genomic data from The Cancer interaction with smoking for lung cancer. Zang et al. conside- Genome Atlas (TCGA) to decipher expression regulation. red lung cancer and explored the association of its sub-types Wei Y et al. tackled a similar problem and proposed joint with CDKN3 gene expression, and found that higher expres- models for integrative analysis of multiple types of omics data. sion of CDKN3 was associated with poorer survival outcomes The results from the analysis shed light on personalized medi- for lung adenocarcinoma but not for squamous cell carcinoma. cine. Statistical methods for differential expression analysis Sun and Li developed a pipeline to identify hemi-methylation, in both microarray and RNA-seq have been well-developed, that is, DNA methylation occurring in one strand only, and however,