Trends in

The GEP: Crowd-Sourcing Big Data Analysis with Undergraduates --Manuscript Draft--

Manuscript Number:
Article Type: Scientific Life
Keywords: undergraduate research, science education, bioinformatics, CURE (course-based undergraduate research experience), crowd-sourcing science
Corresponding Author: Sarah Elgin, Washington University, St Louis, MO, UNITED STATES
First Author: Sarah Elgin
Order of Authors: Sarah Elgin; Charles Hauser; Teresa M Holzen; Christopher J Jones; Adam Kleinschmit; Judith Leatherman
Abstract: The era of "big data" is also the era of abundant data, creating new opportunities for student/scientist research partnerships. By coordinating undergraduate efforts, the Genomics Education Partnership produces high quality annotated datasets/analyses that could not be generated otherwise, leading to scientific publications while providing many students with research experience.

Cover Letter

Department of Biology

Dear Ms. Navarro,

Thank you for the kind invitation to share news about the GEP with your readers. Please find attached the article, which is ready for review. As per prior agreement, we have kept the author list short, listing only the members of the writing committee. We have added a complete list of all those involved in producing the published work of the GEP, and in the final reading and approval of the manuscript, as “contributing authors” at the end of the manuscript. We trust that all authors will be listed in the final database curation of the article citation.

Thanks again,

Chris Shaffer, Ph.D., for Sarah C. R. Elgin

Manuscript

The GEP: Crowd-Sourcing Big Data Analysis with Undergraduates

Authors: Sarah C R Elgin*, Charles Hauser, Teresa M Holzen, Christopher Jones, Adam Kleinschmit, Judith Leatherman, *Genomics Education Partnership.

*Correspondence: [email protected] (S.C.R. Elgin)

Abstract: The era of “big data” is also the era of abundant data, creating new opportunities for student/scientist research partnerships. By coordinating undergraduate efforts, the Genomics Education Partnership produces high quality annotated datasets/analyses that could not be generated otherwise, leading to scientific publications while providing many students with research experience.

Key words: undergraduate research, science education, bioinformatics, CURE (course-based undergraduate research experience), crowd-sourcing science

Text

Current technology has allowed massive amounts of data to be collected in many fields, including genomics, anatomy, ecology, and astronomy. Typically, after analysis to answer the motivating question, the data are put into publicly accessible storage. Many of these data still contain useful, unmined information, creating an opportunity for expanded investigations. We have developed one system for taking advantage of public genomic datasets, by developing data analysis tools and providing them via the internet to allow undergraduates to engage in research. This system of coordinating “massively parallel” undergraduate efforts can be broadly applied to other fields, providing benefits to the scientific community, the scientists directing the study, and the students themselves.

Launched in 2006, the Genomics Education Partnership (GEP: http://gep.wustl.edu) brings undergraduates into genomics research.
The consortium currently includes over 100 faculty members from diverse schools (see Contributing Authors). Students have contributed to improving the underlying DNA sequence quality and manually annotating selected regions of several genomes. While helping students learn the basics of eukaryotic gene structure and genome organization, the process also introduces students to large genomics databases and bioinformatics tools, strengthens their appreciation of evolution, immerses them in scientific inquiry, encourages critical thinking, and leads some to pursue graduate work and/or bioinformatics careers. The improved DNA sequence and careful annotations served as a foundation in an analysis of the comparative evolution of megabase domains (a gene-rich heterochromatic domain vs. a euchromatic domain), with high confidence in the findings [1].

Such student “crowd-sourcing” efforts are scientifically valuable. In our recent study comparing D. melanogaster with three other Drosophila species, GEP students working between 2007 and 2012 improved 3.8 Mb of DNA from D. mojavensis and D. grimshawi, closing 72 gaps and adding 44,468 bp of sequence. Students then annotated ~8 Mb of DNA, modeling 1619 isoforms of 878 genes across three species. Whereas 58% of the final gene models agreed with the GLEAN-R gene predictions, 42% did not. Careful analysis of the findings indicates that human reconciliation of conflicting data is currently superior for accuracy, albeit significantly slower. The resulting publication, which examines the repeat characteristics (e.g., transposon density) and evolution of the genes (e.g., gene size, codon bias, gene movement) in a heterochromatic domain, has 1,014 co-authors, including 940 undergraduates [1].

The GEP project management process is presented in Figure 1.
For projects like this to be fruitful, it is necessary that the problem be one that can be sub-divided, with each student (or small group) having specific responsibilities. It is also important to provide students with a standard analysis protocol, as well as leading questions and/or tools that enable students to check their work. In the GEP, students working on different species of Drosophila aim to construct gene models that are best supported by the available evidence. That evidence includes sequence similarity to the annotated proteins of the well-annotated reference D. melanogaster; results from ab initio and extrinsic gene finders; and all available modENCODE RNA-Seq data for the species. This information and other custom data are provided to students through a local instance of the UCSC Genome Browser (Figure 2). Students must evaluate and reconcile multiple lines of potentially contradictory evidence to construct a gene model that they can defend and use in subsequent explorations. Large numbers of participants enable the GEP to replicate annotations, with experienced students (and occasionally staff) doing a final reconciliation of any conflicting results [2]. In our recent analysis of ~2.1 Mb of the D. biarmipes D element, GEP students produced 610 gene models, ~74% in congruence with the final reconciled gene models (W. Leung, Washington University in St. Louis, unpublished data).

GEP faculty embed this research challenge where appropriate in their curriculum, generally in the laboratory portion of a genetics or molecular biology course, in a dedicated genomics laboratory course, or through independent study.
Such course-based undergraduate research experiences (CUREs or CREs) are more accessible for students who might not seek out a traditional apprentice-style research experience [3], thus promoting inclusive excellence. Courses also enable us to provide research experiences for many more students. Each GEP faculty member decides on the preliminary training needed for their class, creating their own curriculum or selecting from a collection of shared materials on the GEP website. Faculty members coach students throughout the ongoing research, and direct their subsequent explorations, which vary depending on the class learning objectives.

Assessment of pre- and post-course quiz performance shows that participating students increase their knowledge of eukaryotic genes and genomes and gain insight into, and appreciation for, the scientific process. In fact, GEP students and undergraduates who have spent a summer in a research lab exhibit similar responses to a survey on science learning/attitudes [4-5]. Survey comments indicate that most students appreciate the hands-on approach to learning about genes/genomes, and ~85% are enthusiastic about the opportunity to contribute to a genuine research project. Part of their motivation stems from the fact that their work has meaning beyond the classroom. Most students present and defend their work through a poster or oral presentation, often locally and occasionally at regional/national conferences.

Many research projects have been successfully integrated into a CURE format [6-7]. For example, the University of Texas at Austin has recently reported that engaging freshmen in a 3-semester CURE (https://cns.utexas.edu/fri) results in significantly higher retention in STEM, and higher graduation rates [8].
Most of the science being done in the Texas program is based on projects led by and centered around the faculty’s own research interests. Developing a CURE for 10 to 40 students around the research of an individual local faculty member is a widespread approach, applicable across the STEM disciplines [6]. Other CUREs take advantage of remote operation of sophisticated instruments available through the national laboratories or other facilities, or analyze a local problem (e.g., the operation of a LEED-certified building or the waste stream at the campus cafeteria). There are several national projects in addition to the GEP. Perhaps the largest is SEA-PHAGES, which involves students in plaque-purification and characterization of novel locally-isolated phage, followed by genome sequencing and annotation (http://seaphages.org). Investigations that benefit from collection and coordinated analysis of an array of data are especially good topics for a CURE.

Faculty participating in national research projects, such as the GEP, clearly benefit as well. The central organization sets up and maintains a website so that projects, curriculum and other resources can be shared among the whole group. Joint assessment, drawing on the large pool of students, is also carried out. Faculty attend webinars during the year and summer workshops that help them stay up-to-date in a rapidly changing field, develop new curriculum, and work on publications in the scientific and science education literatures. The project enables them to provide a research experience for a greater proportion of their students, an objective for many schools [9].

The diverse GEP membership allows us to assess the impact of different institutional characteristics (e.g., 2/4 year, public/private, large/small, selective/open, minority or Hispanic serving) on student performance. We find no significant correlation between institutional characteristics and student success (as judged by quiz scores and the science knowledge/attitudes survey). We do find a positive correlation between the amount of time spent on the GEP project and students achieving the full benefits of a research experience [2]. Students need time to master the tools and gain familiarity with the system; they can then begin to ask and address their own questions about the genes and genome under study.

Having a centrally organized national experiment like the GEP collaborative has been a win-win experience for us, the GEP faculty. In implementing this CURE, we have provided our students with rich learning experiences, while also generating useful scientific information that would be prohibitively expensive to generate by traditional means (i.e., locally with full-time research scientists). Bioinformatics is particularly well suited for a CURE, as infrastructure costs are low (computers with internet access being the only requirement), and 24/7 access can be provided with no safety concerns, a circumstance that lends itself to peer instruction. We believe our approach is applicable to many other studies utilizing comparative genomics in other species. Toward this end, we are working with members of the Galaxy Project (led by J Goecks, George Washington University) to develop G-OnRamp, a system that facilitates creation of a genome browser for any eukaryotic genome.

Genome annotation and analysis is just one of many studies that can benefit from careful collection of many data points by undergraduates (see [6] for many different examples). We suggest that STEM education reform efforts could be profoundly enhanced by establishing a suite of national experiments in a variety of disciplines, enabling many more faculty (especially those at PUIs with limited research resources) to engage in such a project. We anticipate that the development of G-OnRamp, together with our existing curriculum and tools, will facilitate the development of additional CURE projects in genomics. But the strategy is clearly applicable beyond genomics. We hope readers in many fields will think creatively about how their own research projects might benefit from educational involvement such as we describe. The solution to many data acquisition/data mining problems may be the students currently enrolled in undergraduate laboratories and classrooms across the country.

Acknowledgements: The GEP was originally supported by the Howard Hughes Medical Institute through a Professors grant to SCRE (#52007051) and is currently funded by NSF IUSE grant #1431407, with continuing support from Washington University in St Louis. The GEP-Galaxy project is funded by NIH BD2K grant 1R25GM119157.

Contributing authors

The full list of authors and affiliations is as follows:

Anna Allen, Howard University; Consuelo Alvarez, Longwood University; Sara Anderson, Minnesota State University Moorhead; Gaurav Arora, Gallaudet University; Cindy Arrigo, New Jersey City University; Andrew Arsham, Bemidji State University; Cheryl Bailey, Mount Mary University; Daron Barnard, Worcester State University; Ana Maria Barral, National University; Chris Bazinet, St.
John's University; Dale Beach, Longwood University; James E. J. Bedard, University of the Fraser Valley, BC; April Bednarski, Washington University in St. Louis; John Braverman, Saint Joseph's University; Jeremy Buhler, Washington University in St. Louis; Martin Burg, Grand Valley State University; Hui-Min Chung, University of West Florida; Paula Croonquist, Anoka-Ramsey Community College; Scott Danneman, Anoka-Ramsey Community College; Randall DeJong, Calvin College; Justin R. DiAngelo, Penn State Berks; Robert Drew, University of Massachusetts Dartmouth; Robert Drewell, Clark University; Chunguang Du, Montclair State University; Sondra Dubowsky, McLennan Community College; Todd Eckdahl, Missouri Western State University; Heather Eisler, University of the Cumberlands; Julia Emerson, Amherst College; Amy Frary, Mount Holyoke College; Donald Frohlich, University of St. Thomas (Houston); Thomas Giarla, Siena College; Anya Goodman, California Polytechnic State University San Luis Obispo; Shubha Govind, City College, CUNY; Elena Gracheva, Washington University in St. Louis; Adam Haberman, University of San Diego; Amy Hark, Muhlenberg College; Shan Hays, Western State Colorado University; Arlene Hoogewerf, Calvin College; Laura Hoopes, Pomona College; Carina Howell, Lock Haven University of Pennsylvania; Diana Johnson, George Washington University; M. Logan Johnson, Notre Dame College; Lisa Kadlec, Wilkes University; Marian Kaehler, Luther College; Jacob Kagey, University of Detroit Mercy; Jennifer Kennell, Vassar College; Cathy Silver Key, North Carolina Central University; Melissa Kleinschmit, Trinidad State Junior College; Nighat Kokan, Cardinal Stritch University; Olga Ruiz Kopp, Utah Valley University; Meg Laakso, Eastern University; Wilson Leung, Washington University in St.
Louis; David Lopatto, Grinnell College; Christy MacKinnon, University of the Incarnate Word; Mollie Manier, George Washington University; Elaine Mardis, Washington University Genome Institute; Juan C. Martinez-Cruzado, University of Puerto Rico at Mayaguez; Luis Matos, Eastern Washington University; Amie Jo McClellan, Bennington College; Gerard McNeil, York College - City University of New York; Evan Merkhofer, Mount Saint Mary College; Hemlata Mistry, Widener University; Elizabeth Mitchell, McLennan Community College; Nathan T. Mortimer, Illinois State University; John Mullican, Washburn University; Jennifer Leigh Myka, Gateway Community & Technical College; Alexis Nagengast, Widener University; Paul Overvoorde, Macalester College; Don Paetkau, Saint Mary's College - Indiana; Leocadia Paliulis, Bucknell University; Susan Parrish, McDaniel College; Celeste Peterson, Suffolk University; Jeff Poet, Missouri Western State University; Johanna M. Porter-Kelley, Winston-Salem State University; Mary Lai Preuss, Webster University; James Price, Utah Valley University; Nicholas Pullen, University of Northern Colorado; Laura Reed, University of Alabama Tuscaloosa; Nick Reeves, Mt. San Jacinto College, Menifee Valley Campus; Gloria Regisford, Prairie View A&M University; Catherine Reinke, Linfield College; Dennis Revie, California Lutheran University; Srebrenka Robic, Agnes Scott College; Jennifer A. Roecklein-Canfield, Simmons College; Ryan Rogers, Wentworth Institute of Technology; Anne Rosenwald, Georgetown University; Michael R. Rubin, University of Puerto Rico at Cayey; Takrima Sadikot, Washburn University; Jamie Sanford, Ohio Northern University; Maria Santisteban, University of North Carolina at Pembroke; Kenneth Saville, Albion College; Stephanie Schroeder, Webster University; Christopher Shaffer, Washington University in St.
Louis; Karim Sharif, Massasoit Community College; Mary Shaw, New Mexico Highlands University; Matthew Skerritt, Corning Community College; Diane Sklensky, Lane College; Chiyedza Small, Medgar Evers College, CUNY; Sheryl Smith, Arcadia University; Mary Smith, North Carolina Agricultural & Technical State University; Robert Snyder, State University of New York at Potsdam; Eric Spana, Duke University; Rebecca Spokony, Baruch College; Aparna Sreenivasan, California State University Monterey Bay; Joyce Stamm, University of Evansville; Justin Thackeray, Clark University; Jeffrey S. Thompson, Denison University; Chau-Ti Ting, National Taiwan University; Melanie Van Stry, Lane College; Leticia Vega, Barry University; Matthew Wawersik, College of William and Mary; Colette Witkowski, Missouri State University; Cindy Wolfe, Southwest Baptist University; Michael Wolyniak, Hampden-Sydney College; James Youngblom, California State University Stanislaus; Brian Yowler, Geneva College; Leming Zhou, University of Pittsburgh

References

1. Leung, W. et al. (2015) Drosophila Muller F elements maintain a distinct set of genomic properties over 40 million years of evolution. G3 (Bethesda) 5, 719-740.

2. Shaffer, C.D. et al. (2014) A course-based research experience: How benefits change with increased investment in instructional time. CBE Life Sci. Educ. 13, 111-130.

3. Bangera, G. and Brownell, S. (2014) Course-based undergraduate research experiences can make scientific research more inclusive. CBE Life Sci. Educ. 13, 602-606.

4. Lopatto, D. et al. (2008) Undergraduate research. Genomics Education Partnership. Science 322, 684-685.

5. Shaffer, C.D. et al. (2010) The Genomics Education Partnership: Successful integration of research into laboratory classes at a diverse group of undergraduate institutions. CBE Life Sci. Educ. 9, 55-69.

6. National Academies of Sciences, Engineering, and Medicine (2015) Integrating Discovery-Based Research into the Undergraduate Curriculum: Report of a Convocation. Washington, D.C.: National Academies Press.

7. Elgin, S.C.R. et al. (2016) Insights from a Convocation: Integrating discovery-based research into the undergraduate curriculum. CBE Life Sci. Educ. DOI: 10.1187/cbe.16-03-0118.

8. Rodenbusch, S.E. et al. (2016) Early engagement in course-based research increases graduation rates and completion of science, engineering, and mathematics degrees. CBE Life Sci. Educ. DOI: 10.1187/cbe.16-03-0117.

9. Lopatto, D. et al. (2014) A central support system can facilitate implementation and sustainability of a Classroom-based Undergraduate Research Experience (CURE) in genomics. CBE Life Sci. Educ. 13, 711-723.

Legends:

Figure 1: Flowchart of the GEP research process. The draft Drosophila genome assemblies and raw sequence data are obtained from NCBI. GEP staff at Washington University in St. Louis (WUSTL) analyze these assemblies to identify regions of interest (e.g., Muller F and D element scaffolds). These regions are partitioned into overlapping projects of the appropriate size [currently ~100 kb for sequence improvement and ~40 kb (2-7 genes) for annotation]. GEP faculty members claim the number of projects appropriate for their class. On completion, GEP students submit their projects (with a detailed report) to WUSTL. For quality control purposes, each project is completed by at least two groups working independently and then reconciled by experienced undergraduate students. These reconciled projects are then reassembled to create a large domain (~1-3 Mb) of high quality annotated sequence, which is then used in the final analyses and subsequent publications in the scientific literature.
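
The partitioning and quality-control steps described in the Figure 1 legend can be illustrated with a minimal Python sketch. This is not the GEP's actual software. The function names, the 0-based half-open coordinates, and the 5 kb overlap are all assumptions made for illustration; the legend gives only the approximate project sizes (~100 kb and ~40 kb), not the overlap.

```python
from typing import List, Sequence, Tuple

# An interval is (start, end), 0-based and half-open. This convention is an
# assumption for the sketch, not something stated in the manuscript.
Interval = Tuple[int, int]


def partition_region(length: int, project_size: int, overlap: int) -> List[Interval]:
    """Split a region of `length` bp into overlapping projects.

    Each project spans at most `project_size` bp and shares `overlap` bp
    with the next one. The overlap value is hypothetical.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if not 0 <= overlap < project_size:
        raise ValueError("overlap must be non-negative and smaller than project_size")

    step = project_size - overlap
    projects: List[Interval] = []
    start = 0
    while start < length:
        end = min(start + project_size, length)
        projects.append((start, end))
        if end == length:
            break  # the last project reaches the end of the region
        start += step
    return projects


def needs_reconciliation(model_a: Sequence[Interval], model_b: Sequence[Interval]) -> bool:
    """Return True when two independent annotations of one gene disagree.

    A gene model is represented here as a collection of exon intervals
    (a simplification); any coordinate difference flags the project for
    review by an experienced annotator.
    """
    return sorted(model_a) != sorted(model_b)


if __name__ == "__main__":
    # A 100 kb region cut into ~40 kb annotation projects with a 5 kb overlap.
    for project in partition_region(100_000, 40_000, 5_000):
        print(project)

    group_1 = [(100, 200), (300, 400)]
    group_2 = [(100, 200), (310, 400)]  # second exon starts 10 bp later
    print(needs_reconciliation(group_1, group_2))
```

In this sketch any disagreement between two groups simply flags the project; the reconciliation itself remains a human judgment, in keeping with the manuscript's observation that human reconciliation of conflicting data is currently more accurate than automated prediction.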

Figure 2: A GEP UCSC Genome Browser mirror view of the Mitf gene on the Drosophila erecta F element. The Genome Browser provides student annotators with a workspace where they can visualize all of the available computational and experimental evidence. The available evidence tracks include sequence similarity to D. melanogaster protein sequences, predictions from multiple gene finders, RNA-Seq read coverage and splice junction predictions from TopHat, whole genome alignments against other Drosophila species, and repeats identified by RepeatMasker and Tandem Repeats Finder (TRF). Note the discrepancies among the four computational gene predictions, the lack of RNA-Seq evidence for the first exon of isoform RC, and the small exon in isoforms RA and RB suggested by the RNA-Seq and TopHat tracks. In this case, the student annotators were able to resolve these contradictory lines of evidence and produce gene annotations for four different isoforms of the putative Mitf ortholog in D. erecta, as shown on the “Reconciled Gene Models” custom track.

[Figure 1 (flowchart). Starting point: public “draft” genomes. Sequence Improvement track: divide into overlapping projects (~100 kb); sequence and assembly improvement; optional wet bench experiment (PCR/sequencing of gaps); collect projects, compare and verify final consensus sequence. Annotation track: divide into overlapping projects (~40 kb, 2-7 genes); evidence-based coding regions and TSS annotations; collect projects, compare and verify student annotations; reassemble into high quality annotated sequence; investigate research question of interest; analyze and publish results.]

[Figure 2 (genome browser view, D. erecta F element, contig1). Tracks: Reconciled Gene Models (Mitf-RA, Mitf-RB, Mitf-RC, Mitf-RD); BLASTX Alignment to D. melanogaster Proteins (Mitf-PA, Mitf-PB, Mitf-PC, Mitf-PD); SGP, Geneid, Genscan, and Twinscan Gene Predictions; D. yakuba modENCODE RNA-Seq Alignment Summary; RNA-Seq Junctions predicted by TopHat using D. yakuba modENCODE RNA-Seq; Alignment Net (comparative genomics); Repeating Elements by RepeatMasker; Simple Tandem Repeats by TRF.]