BIOTECH Basics Exploring and Understanding the ENCODE Project

BIOTECH Basics Exploring and Understanding the ENCODE Project What you need to know: Once upon a midnight dreary, quences for how cells function or interact while I pondered, weak and weary, with neighboring cells. • The Human Genome Project Over many a quaint and curious The ENCODE Project involved 442 re- identified the sequence of the ~ 3.2 volume of forgotten lore – searchers from 32 institutions around the billion letters in a human genome and the location of ~ 21,000 genes. In a similar way, the ENCODE Project world. All told, 1,649 experiments that looked The ENCODE Project seeks to seeks to help scientists make sense of hu- across the genome were performed. From understand the ways those genes man genomes – by understanding the bio- these, scientists identified the location of mil- are regulated in cells. logical language contained in the sequences lions of functional elements in our genome. • The location of millions of functional of letters in our DNA. The Human Genome The data show the genome as a very active elements has been identified, Project determined the locations of more place – a beehive of molecules docking at although the biological significance than ~21,000 genes among ~3.2 billion DNA specific DNA sequences. These molecules of many of these locations is nucleotides. However, these genes only ac- initiate transcription, regulate genes thou- uncertain. count for 1-2 percent of an entire genome. sands of bases away or control how the DNA • The ENCODE data is useful for This means the functional significance of is folded. Scientists have known these regu- biological and disease-related much of the remainder has been a mystery. latory processes exist, but the scope and research, helping correlate DNA regions associated with disease and The ENCODE Project is the genetic equiva- scale took many by surprise. Each cell type the functional significance of those lent of providing spacing and punctuation – a uses only a small subset of these regulators, regions. set of experiments to determine which pieces which in part is what differentiates those cells of DNA regulate the action and storyline of a – e.g. skin cells rely on a different combina- human genome. tion of genes and regulators than muscle stands for the ENCy- Some analyses identified regions where or liver cells. Choosing from a vast array of clopedia Of DNA El- ENCODE proteins known as transcription factors bind biological controls lets the timing and level of ements. Launched in 2003, the decade-long DNA and control gene transcription – the gene activation be finely tuned to the needs project has sought to catalog all of the regions process that copies the genetic message of a cell. in human genomes that control how genes into RNA. Other experiments searched for However, the biological significance of function and facilitate understanding of when DNA methylation – small molecules that at- many of these active elements is hotly debat- and in which cells they are used. Following up tach to the DNA sequence. Transcription fac- ed in the genomics community. For example, a series of major publications in 2007, findings tor binding and DNA methylation can greatly roughly half of our genome is composed of from the most recent phase of ENCODE were increase or completely silence the activity of repetitive sequences of DNA and fragments published in September 2012, providing new a corresponding gene - dramatically altering of inactive genes - leftovers from our evo- insights into how our genome is organized and the level of RNA it produces. Varying levels lutionary history. Although sometimes de- regulated. of RNA influence how much protein can be scribed in the popular press as “junk” DNA, As an analogy, think of a passage from a produced. This may have important conse- scientists avoid using this word as it is both story or poem. Imagine the text is missing punctuation, spacing and those visual cues that make sense of the language. All that is If you want to know more: present are the individual letters that make www.encodeproject.org up the words. As an example, we’ll consider As with many large-scale genomic projects, the data are publicly available, allowing scientists the first sentence from Edgar Allan Poe’s fa- in all fields to better understand how human genomes are organized and regulated. mous poem The Raven. www.nature.com/encode onceuponamidnightdrearywhileipondered- The results were simultaneously published in 30 different papers. In order to navigate through weakandwearyovermanyaquaintandcuri- the findings, the scientific journals created an online portal site that allows individuals to ousvolumeofforgottenlore explore one of 13 topics of interest, following “threads” that gather the relevant information Adding in other parts of language provides from each paper. This data are also available in a free app for iPhone and iPad users. meaning to the string of text – things like capi- http://www.hudsonalpha.org/more-encode tal letters, commas and spaces. From a string Still not sure you really understand what ENCODE is all about? Take a look at of seeming nonsense, familiar words appear. HudsonAlpha’s “Understanding ENCODE” video on our YouTube channel. 4 Reprinted by permission from Macmillan misleading and inaccurate. ENCODE’s re- Publishers Ltd: sults suggest that many of these regions are Nature 489:46, in fact actively bound by transcription factors copyright 2012. and exhibit other types of molecular activity. Setting aside the debate over what frac- If we incorporate this concept into the literary tion of the genome is functionally important, example mentioned earlier, the opening line the ENCODE Project represents a landmark referenced to the ENCODE data. At least of the poem would looks more like this: scientific achievement. It picks up where the 88 percent lie outside of protein-producing Once up, up, up, upon a midnight very, Human Genome Project left off, in terms of sections of the genome, many of which land very, very, very, dreary, while I, I, I, I, I identifying context and meaning for the ge- in the ENCODE-identified regulatory sites. If pondered, weak and w-, w-, w-, weary, nome. ENCODE has provided a deep and these findings are validated, they offer insight Genomic scientists don’t yet know what high-quality description of genomic activity into the process of disease progression and fraction of this binding and transcriptional ac- that will be useful throughout many biological may lead to new candidates for therapy. The tivity is important for the daily activities of a and disease research areas. The next step ability to correlate association with function is cell. For example, because DNA binding pro- is to figure out how the various players in this a major benefit of the ENCODE data. teins use short DNA sequences (just a hand- regulatory symphony interact. For example, While the ENCODE Project represents an ful of letters) as a binding target, the chance if a binding site is altered or deleted through important milestone, a great deal of work still occurrence of these docking sites is high and mutation, is there an effect on the body? Over remains to fully decipher the functional com- will likely occur many times in genomes that the last several years, researchers have ponents of our genome. Less than 150 of the are billions of letters in length, especially if used a technique called genome-wide asso- thousands of human cell types were exam- those target sequences are repeated over ciation study (see the Summer 2009 Biotech ined and only a fraction of known transcrip- and over again. These segments may in turn Basics for more details) to identify thousands tion factors were examined. A new phase of produce RNA and mimic the features of be- of DNA variants associated with disease. the project is just beginning to dig deeper into ing functional, even if they serve no purpose. The vast majority of these associations occur the regulatory process. Although further studies are needed, it is like- outside of known gene regions. The assump- - Neil Lamb, Ph.D. ly that many of the ENCODE-detected activi- tion has been that they occur in regulatory director of educational outreach ties are not biologically important or needed regions that control the activity of disease- HudsonAlpha Institute by our cells. They are essentially molecular related genes. In support of this hypothesis, for Biotechnology noise. all known associated regions were cross- THE SCIENCE OF PROGRESS 5.

BIOTECH Basics Exploring and Understanding the ENCODE Project

ENCODE Regions Identification of Higher-Order Functional Domains in the Human

Developing and Implementing an Institute-Wide Data Sharing Policy Stephanie OM Dyke and Tim JP Hubbard*

Estimating the Number of Unseen Variants in the Human Genome

The ENCODE Project

The ENCODE (Encyclopedia of S DNA Elements) Project

What Is a Gene, Post-ENCODE? History and Updated Definition

A User's Guide to the Encyclopedia of DNA Elements (ENCODE)

Current Advances of Epigenetics in Periodontology from ENCODE Project

DNA Copy Number Analysis of Fresh and Formalin-Fixed Specimens By

ENCODE Phase 2: Participants and Projects

National Human Genome Research Institute EA Feingold, PJ G

Multiple Genes Encode Nuclear Factor 1-Like Proteins That Bind To