Trends in Genetics
The GEP: Crowd-Sourcing Big Data Analysis with Undergraduates
--Manuscript Draft--

Article Type: Scientific Life
Keywords: undergraduate research, science education, bioinformatics, CURE (course-based undergraduate research experience), crowd-sourcing science
Corresponding Author: Sarah Elgin, Washington University, St Louis, MO, UNITED STATES
First Author: Sarah Elgin
Order of Authors: Sarah Elgin, Charles Hauser, Teresa M Holzen, Christopher J Jones, Adam Kleinschmit, Judith Leatherman
Abstract: The era of "big data" is also the era of abundant data, creating new opportunities for student/scientist research partnerships. By coordinating undergraduate efforts, the Genomics Education Partnership produces high quality annotated datasets/analyses that could not be generated otherwise, leading to scientific publications while providing many students with research experience.

Cover Letter

Department of Biology

Dear Ms. Navarro,

Thank you for the kind invitation to share news about the GEP with your readers. Please find attached the article, which is ready for review. As per prior agreement, we have kept the author list short, listing only the members of the writing committee. We have added a complete list of all those involved in the production of the published work of the GEP and in the final reading and approval of the manuscript as “contributing authors” at the end of the MS. We trust that all authors will be listed in the final database curation of the article citation.

Thanks again,
Chris Shaffer, Ph.D., for Sarah C. R. Elgin

Manuscript

The GEP: Crowd-Sourcing Big Data Analysis with Undergraduates

Authors: Sarah C R Elgin*, Charles Hauser, Teresa M Holzen, Christopher Jones, Adam Kleinschmit, Judith Leatherman, *Genomics Education Partnership.
*Correspondence: [email protected] (S.C.R. Elgin)

Abstract: The era of “big data” is also the era of abundant data, creating new opportunities for student/scientist research partnerships. By coordinating undergraduate efforts, the Genomics Education Partnership produces high quality annotated datasets/analyses that could not be generated otherwise, leading to scientific publications while providing many students with research experience.

Key words: undergraduate research, science education, bioinformatics, CURE (course-based undergraduate research experience), crowd-sourcing science

Text

Current technology has allowed massive amounts of data to be collected in many fields, including genomics, anatomy, ecology, and astronomy. Typically, after analysis to answer the motivating question, the data are put into publicly accessible storage. Many of these datasets still contain useful, unmined information, creating an opportunity for expanded investigations. We have developed a system for taking advantage of public genomic datasets: we build data analysis tools and provide them via the internet so that undergraduates can engage in research. This system of coordinating “massively parallel” undergraduate efforts can be broadly applied to other fields, providing benefits to the scientific community, the scientists directing the study, and the students themselves.

Launched in 2006, the Genomics Education Partnership (GEP: http://gep.wustl.edu) brings undergraduates into genomics research. The consortium currently includes over 100 faculty members from diverse schools (see Contributing Authors). Students have contributed by improving the underlying DNA sequence quality and manually annotating selected regions of several Drosophila genomes.
While helping students learn the basics of eukaryotic gene structure and genome organization, the process also introduces students to large genomics databases and bioinformatics tools, strengthens their appreciation of evolution, immerses them in scientific inquiry, encourages critical thinking, and leads some to pursue graduate work and/or bioinformatics careers. The improved DNA sequence and careful annotations served as the foundation for an analysis of the comparative evolution of megabase domains (a gene-rich heterochromatic domain vs. a euchromatic domain), giving high confidence in the findings [1].

Such student “crowd-sourcing” efforts are scientifically valuable. In our recent study comparing D. melanogaster with three other Drosophila species, GEP students working between 2007 and 2012 improved 3.8 Mb of DNA from D. mojavensis and D. grimshawi, closing 72 gaps and adding 44,468 bp of sequence. Students then annotated ~8 Mb of DNA, modeling 1619 isoforms of 878 genes across three species. Whereas 58% of the final gene models agreed with the GLEAN-R gene predictions, 42% did not. Careful analysis of the findings indicates that human reconciliation of conflicting data is currently superior in accuracy, albeit significantly slower. The resulting publication, which examines the repeat characteristics (e.g., transposon density) and evolution of the genes (e.g., gene size, codon bias, gene movement) in a heterochromatic domain, has 1,014 co-authors, including 940 undergraduates [1].

The GEP project management process is presented in Figure 1. For projects like this to be fruitful, the problem must be one that can be sub-divided, with each student (or small group) having specific responsibilities.
It is also important to provide students with a standard analysis protocol, as well as leading questions and/or tools that enable students to check their work. In the GEP, students working on different species of Drosophila aim to construct gene models that are best supported by the available evidence. That evidence includes sequence similarity to the annotated proteins of the well-annotated reference D. melanogaster; results from ab initio and extrinsic gene finders; and all available modENCODE RNA-Seq data for the species. This information and other custom data are provided to students through a local instance of the UCSC Genome Browser (Figure 2). Students must evaluate and reconcile multiple lines of potentially contradictory evidence to construct a gene model that they can defend and use in subsequent explorations. Large numbers of participants enable the GEP to replicate annotations, with experienced students (and occasionally staff) doing a final reconciliation of any conflicting results [2]. In our recent analysis of ~2.1 Mb of the D. biarmipes D element, GEP students produced 610 gene models, of which ~74% were congruent with the final reconciled gene models (W. Leung, Washington University in St. Louis, unpublished data).

GEP faculty embed this research challenge where appropriate in their curriculum, generally in the laboratory portion of a genetics or molecular biology course, in a dedicated genomics laboratory course, or through independent study. Such course-based undergraduate research experiences (CUREs or CREs) are more accessible to students who might not seek out a traditional apprentice-style research experience [3], thus promoting inclusive excellence. Courses also enable us to provide research experiences for many more students.
Each GEP faculty member decides on the preliminary training needed for their class, creating their own curriculum or selecting from a collection of shared materials on the GEP website. Faculty members coach students throughout the ongoing research and direct their subsequent explorations, which vary depending on the class learning objectives.

Assessment of pre- and post-course quiz performance shows that participating students increase their knowledge of eukaryotic genes and genomes and gain insight into and appreciation for the scientific process. In fact, GEP students and undergraduates who have spent a summer in a research lab exhibit similar responses to a survey on science learning/attitudes [4-5]. Survey comments indicate that most students appreciate the hands-on approach to learning about genes/genomes, and ~85% are enthusiastic about the opportunity to contribute to a genuine research project. Part of their motivation stems from the fact that their work has meaning beyond the classroom. Most students present and defend their work through a poster or oral presentation, often locally and occasionally at regional/national conferences.

Many research projects have been successfully integrated into a CURE format [6-7]. For example, the University of Texas at Austin has recently reported that engaging freshmen in a 3-semester CURE (https://cns.utexas.edu/fri) results in significantly higher retention in STEM and higher graduation rates [8]. Most of the science in the Texas program is based on projects led by faculty and centered on their own research interests. Developing a CURE for 10 to 40 students around the research of an individual local faculty member is a widespread approach, applicable across the STEM disciplines [6].
Other CUREs take advantage of remote operation of sophisticated instruments available through the national laboratories or other facilities, or analyze a local problem (e.g.