Machine Learning and Complex Biological Data Chunming Xu1,2* and Scott A
Total Page:16
File Type:pdf, Size:1020Kb
Xu and Jackson Genome Biology (2019) 20:76 https://doi.org/10.1186/s13059-019-1689-0 EDITORIAL Open Access Machine learning and complex biological data Chunming Xu1,2* and Scott A. Jackson1* Abstract this reason, the integrated analysis of different data types has been attracting more attention. Integration Machine learning has demonstrated potential in of different data types should, in theory, lead to a analyzing large, complex biological data. In practice, more holistic understanding of complex biological however, biological information is required in addition phenomena, but this is difficult due to the challenges to machine learning for successful application. of heterogeneous data and the implicitly noisy nature of biological data [1]. Another challenge is data di- The revolution of biological techniques and mensionality: omics data are high resolution, or stated another way, highly dimensional. In biological studies, demands for new data mining methods the number of samples is often limited and much In order to more completely understand complex bio- fewer than the number of variables due to costs or logical phenomena, such as many human diseases or available sources (e.g., cancer samples, plant/animal quantitative traits in animals/plants, massive amounts and replicates); this is also referred to as the ‘curse of multiple types of ‘big’ data are generated from compli- dimensionality’, which may lead to data sparsity, mul- cated studies. In the not so distant past, data generation ticollinearity, multiple testing, and overfitting [2]. was the bottleneck, now it is data mining, or extracting useful biological insights from large, complicated datasets. In the past decade, technological advances in data gener- Machine learning versus statistics ation have advanced studies of complex biological phe- The boundary between machine learning and statistics is nomena. In particular, next generation sequencing (NGS) fuzzy. Some methods are common to both domains and technologies have allowed researchers to screen changes either can be used for prediction and inference. How- at varying biological scales, such as genome-wide genetic ever, machine learning and statistics have different foci, variation, gene expression and small RNA abundance, epi- prediction or inference [3]. In general, classic statistical genetic modifications, protein binding motifs, and methods rely on assumptions about the data-generating chromosome conformation in a high-throughput and systems. Statistics can provide explicit inferences cost-efficient manner (Fig. 1). The explosion of data, espe- through fitting a specified probability model when cially omics data (Fig. 1), challenges the long-standing enough data are collected from well-designed studies. methodologies for data analysis. Machine learning is concerned with the question of cre- Biological systems are complex. Most large-scale ation and application of algorithms that improve with studies focus only on one specific aspect of the bio- experience. Many machine learning methods can derive logical system; for example, genome-wide association models for pattern recognition, classification, and pre- studies (GWAS) focus on genetic variants associated diction from existing data and do not rely on stringent with measured phenotypes. However, complex bio- assumptions about the data-generating systems, which logical phenomena can involve many biological as- makes them more effective in some complicated applica- pects, both intrinsic and extrinsic (Fig. 1), and, thus, tions, as further described below, but less effective in cannot be fully explained using a single data type. For producing explicit models with biological significance, in some cases [3]. * Correspondence: [email protected]; [email protected] 1Center for Applied Genetic Technologies, Institute for Plant Breeding, The applications of machine learning in biology Genetics and Genomics, The University of Georgia, 111 Riverbend Rd, Athens, There are two primary types of machine learning GA 30602, USA Full list of author information is available at the end of the article methods: supervised learning and unsupervised learning. © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Xu and Jackson Genome Biology (2019) 20:76 Page 2 of 4 Fig. 1 Machine learning using complex biological data. High-throughput data generation techniques for different biological aspects are shown (left). ATAC-seq assay for transposase-accessible chromatin using sequencing, ChIP-seq chromatin immunoprecipitation sequencing, DNase-seq DNase I hypersensitive sites sequencing, GC-MS gas chromatography-mass spectrometry, LC-MS liquid chromatography–mass spectrometry, lncRNA-seq long non-coding RNA sequencing, NMR nuclear magnetic resonance, RNA-seq RNA sequencing, smRNA-seq small RNA sequencing, WES whole exome sequencing, WGBS whole-genome bisulfite sequencing, WGS whole genome sequencing, Hi-C chromatin conformation capture combined with deep sequencing, iTRAQ isobaric tags for relative and absolute quantification Supervised learning algorithms learn the relationship be- omics approach, such as DNase I hypersensitive sites tween a set of input variables and a designated (DNase-seq), formaldehyde-assisted isolation of regu- dependent variable or labels from training instances and latory elements with sequencing (FAIRE-seq), assay can subsequently be used to predict the outcomes of for transposase-accessible chromatin using sequencing new instances. Many sophisticated machine learning (ATAC-seq), and self-transcribing active regulatory re- methods are supervised, e.g., decision tree, support vec- gion sequencing (STARR-seq). Machine learning can tor machine, and neural network. Unsupervised learning be used to build models to predict regulatory ele- algorithms infer patterns from data without a dependent ments and non-coding variant effects de novo from a variable or known labels. Cluster and principle compo- DNA sequence [5] that can then be tested/validated nent analysis are two popular unsupervised learning for their contribution to gene regulation and ultim- methods used to find patterns in high dimensionality ately to observable traits/pathologies. data such as omics data. Deep learning is a subtype of In addition to the prediction of regulatory regions, re- machine learning originally inspired by neuroscience, es- cently, supervised learning showed considerable poten- sentially describing a class of large neural networks. tial for solving population and evolutionary genetics Deep learning has been applied in many fields, largely questions, such as the identification of regions under driven by the massive increases in both computational purifying selection or selective sweeps, as well as more power and big data. Deep learning can be both super- complicated spatiotemporal questions (reviewed in [6]). vised and unsupervised, has revolutionized fields such as Up to now, machine learning approaches have also image recognition, and shows promise for applications been used to predict transcript abundance [7], imput- in genomics, medicine, and healthcare. ation of missing SNPs and DNA methylation states [8, Machine learning has been used broadly in bio- 9],variant calling [10], disease diagnosis/classification, logical studies for prediction and discovery. With the and many different biological questions using datasets increasing availability of more and different types of from different biological aspects such as genomes, epi- omics data, the application of machine learning genomes, transcriptomes, and metabolomes. methods, especially deep learning approaches, has be- come more frequent. One area of opportunity for ma- Challenges and future outlooks chine learning approaches is in the prediction of The massive and rapid advancements in both bio- genomic features, particularly those that are hard to logical data generation and machine learning method- predict using current approaches such as regulatory ologies are promising for the analysis and discovery regions. Machine learning has been used to predict from complex biological data. However, there are sev- the sequence specificities of DNA- and RNA-binding eral hurdles. Firstly, interpretation of models derived proteins, enhancers, and other regulatory regions [4, from some sophisticated machine learning approaches 5] on data generated by one or multiple types of such as deep learning can be difficult if not Xu and Jackson Genome Biology (2019) 20:76 Page 3 of 4 impossible. In many cases, researchers are more inter- data from different sources and additional work on ested in the biological meaning of the predictive data production, sharing, and processing will be model than the predictive accuracy of the model and necessary. the ‘black box’ nature of the model can inhibit inter- Although improved machine learning methods and pretation. The information from the model may need the increasing number of available samples show great further processing and should be carefully interpreted