Tasks, Techniques, and Tools for Genomic Data Visualization
Sabrina Nusrat¹,∗, Theresa Harbig¹,∗, and Nils Gehlenborg¹
¹Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
∗Authors contributed equally to this work

[Figure 1 image not reproduced. Panels: Data Taxonomy (feature sets: type, interconnection); Visualization Taxonomy (coordinate system: layout, partition, abstraction, arrangement; tracks: encoding, alignment; view configurations: views, scales, foci); Task Taxonomy (search: lookup, locate, browse, explore; query: identify, compare, summarize).]

Figure 1: Data taxonomy, visualization taxonomy, and task taxonomy for genomic visualizations. The data taxonomy describes how different genomic data types can be encoded as feature sets. A genomic visualization contains one or multiple coordinate systems applying a specific layout, partition, abstraction and arrangement (in case of multiple axes) of sequence coordinates. Feature sets are encoded as tracks and placed on the coordinate systems. Multiple tracks can either be aligned by stacking or overlaying. A visualization can consist of one or multiple views, each containing a set of aligned tracks. Multiple views can show data on one or multiple scales and foci. The task taxonomy explains how different search and query tasks operate on genomic visualizations.

Abstract

Genomic data visualization is essential for interpretation and hypothesis generation as well as a valuable aid in communicating discoveries. Visual tools bridge the gap between algorithmic approaches and the cognitive skills of investigators. Addressing this need has become crucial in genomics, as biomedical research is increasingly data-driven and many studies lack well-defined hypotheses. A key challenge in data-driven research is to discover unexpected patterns and to formulate hypotheses in an unbiased manner in vast amounts of genomic and other associated data.
Over the past two decades, this has driven the development of numerous data visualization techniques and tools for visualizing genomic data. Based on a comprehensive literature survey, we propose taxonomies for data, visualization, and tasks involved in genomic data visualization. Furthermore, we provide a comprehensive review of published genomic visualization tools in the context of the proposed taxonomies.

arXiv:1905.02853v1 [q-bio.GN] 8 May 2019

1. Introduction

A rapidly growing understanding of how the genome and epigenome of an organism control molecular function and cellular processes has revolutionized research in biology and medicine. Driven by affordable high-throughput technologies that allow scientists and clinicians to reliably obtain high-quality sequence information from DNA and RNA molecules, generation and handling of genomic sequencing data are now routine aspects of many basic science and clinical research projects in biology and medicine.

While a large amount of genomic data is produced within individual small-scale projects, there are also numerous national and international efforts to generate genomic data at large scale. These kinds of projects include efforts to catalog genomic features across cell types and tissues (e.g. ENCODE Consortium [ENC12]), studies aimed at understanding fundamental principles of DNA architecture (e.g. 4D Nucleome Project [DBG∗17]), as well as disease-specific studies that aim to elucidate the molecular changes that cause diseases such as cancer (e.g. The Cancer Genome Atlas [CWC∗13], International Cancer Genome Consortium [The10]). Yet other projects aim to develop new approaches for early identification of genetic risk factors (e.g. BabySeq [HACB∗18]). Many of these projects not only generate genomic data for many samples but also produce dozens of different types of genomic data.

Visualization of genomic data is frequently employed in biomedical research to access knowledge within a genomic context, to communicate, and to explore datasets for hypothesis generation. As biomedical research is increasingly data-driven and many studies lack well-defined hypotheses [Wei10, Gol10], it is a key challenge to discover unexpected patterns and to formulate questions in an unbiased manner in vast amounts of genomic and other associated data.

Over the last two decades, hundreds of visualization tools for genomic data have been published. The large number of tools indicates the broad application of genomic data and shows that visualization of genomic data is a complex problem and an active research domain.

Several challenges in visualizing genomic data are directly connected to how genomes are organized. Genomes are collections of one or more chromosomes, which are individual molecules that encode information as a sequence of nucleotides. Although genomic information is stored in the form of a sequence, the function of the genome is influenced by and requires various types of long- and short-range interactions between non-adjacent regions of the sequence. This includes interactions within and between chromosomes. Patterns in genomic data can be found across many scales, ranging from the size of whole chromosomes, which can span hundreds of millions of nucleotides, down to individual nucleotides. Another important aspect of many genomes is the sparse distribution of many types of patterns along the genome sequence.

Furthermore, the questions that are addressed with genomic data are aimed at the understanding of complex biological systems where all components are highly interconnected and influence each other. For example, the regulation of gene activity is controlled by the presence or absence of particular regulatory proteins, chemical modification of parts of the DNA, and the 3D structure of chromosomes, all of which change depending on environmental and other factors. An abundance of proteins, chemical modifications, and 3D structures can be measured comprehensively and mapped to the genome. While this is a greatly simplified view, it shows the diversity and number of data types from multiple sources that often need to be integrated into a visualization in order to interpret genomic information.

The combination of long sequences, sparse distribution of patterns across multiple scales, interactions between distant parts of the sequence, and large numbers of diverse data types poses numerous visualization challenges. These require the design and development of specialized tools. Additionally, the number of features, the size of the datasets, and the diversity of data types all make tight integration of genomic data visualization tools with algorithmic tools a requirement for efficient analysis workflows. This further complicates the design of effective visualization tools for genomic data.

As the sequential organization is a key characteristic of genomic data, we limit the scope of this survey to visualizations that incorporate one or more genomic coordinate systems and present data in the order defined by the sequence of that coordinate system. This explicitly excludes many techniques that are based on reorderable matrices and node-link diagram approaches, such as visualization of gene expression levels as matrix-based, clustered heatmaps or visualization of gene regulatory networks as node-link diagrams with expression data overlaid onto the nodes. Since the presence of a genome sequence is required for inclusion in this survey, we also excluded tools for genome assembly, which is the process of defining the sequence of a reference genome for a given species.

Our survey makes two major contributions: In Section 4, we propose taxonomies for data, visualization, and tasks involved in genomic data visualization. In Sections 5 and 6, we provide a comprehensive review of published genomic visualization tools in the context of the proposed taxonomies. In addition, we discuss current challenges and research opportunities in genomic data visualization.

2. Biological Background

2.1. DNA Structure

Deoxyribonucleic acid (DNA) is a molecule contained in living cells carrying the genetic instructions to build every biological function and molecule of an organism. DNA is studied for numerous reasons, including analyzing cancerous DNA to find treatments, finding possible risk factors for certain diseases, and comparing DNA of different species to study them in the context of evolution.

DNA consists of two complementary strands coiled up in a double helix. Each strand is composed of smaller units called nucleotides, each consisting of a base (either Adenine (A), Cytosine (C), Guanine (G), or Thymine (T)), a sugar, and a phosphate group. Biological information is stored in the order of the different nucleotides. The two strands, called forward and reverse strands, are connected at the bases. They are called complementary because an A in the forward strand pairs with a T in the reverse strand and a G pairs with a C, and vice versa. Because either strand can therefore be derived from the other, often only one of the strands is considered when analyzing or visualizing genomic data. In prokaryotes (bacteria and archaea) DNA is organized in one single circular sequence, which is called a chromosome, while eukaryotic DNA is usually organized in multiple chromosomes (multiple sequences).

A gene is a sequence of nucleotides encoding a protein, a molecule which has a biological function in the organism such as catalyzing reactions (as an enzyme) or being a building block of a tissue. In order to build a protein using the information of a gene, the DNA has to be transcribed to mRNA and translated into a protein (see Figure 2). The process of transcribing genes into mRNA is called gene expression. The rate at which a gene is expressed (and translated into a protein) is not the same at all times but depends on many different regulatory factors, such as molecules called transcription factors. Certain sequences of the DNA molecule, called promoters, can initiate the transcription of a gene; they are located upstream of the gene sequence. Transcription factors can bind to