HHMI workshop for Student/Scientist Partnerships

Program & Abstracts
June 14 - 16, 2012

Contents

Agenda

Abstracts

Mark Adams
    The Genomics Scholars Program: Encouraging a research path for under-represented minorities in community colleges
Charles F. Aquadro
    The Cornell Genetic Ancestry Project: Engaging Undergrads Through Participation
Lois Banta
    Metagenomic Analysis of Microbial Diversity in Winogradsky Columns
Vincent Buonaccorsi
    The Genome Consortium on Active Teaching using Next Generation Sequencing
Steven G. Cresawn
    Mycobacteriophages Genomics and Bioinformatics As a Positive Feedback Loop
E.A. Dinsdale
    Microbes, Metagenomes and Marine Mammals: Enabling the Next Generation of Students to Enter the Genomic Era
Sam Donovan
    Teaching and Learning Scientific Data Literacy Skills
David J. Dooling
    Making Sequence Analysis Accessible, or Even Invisible
Todd T. Eckdahl
    GCAT SynBio: Building a Community of Faculty Conducting Synthetic Biology Research with Undergraduate Students
Sean Eddy
    Thoughts on bioinformatics education
Sarah C. R. Elgin
    GEP: The Genomics Education Partnership
Alex Hartemink
    Course: Introduction to computational genomics
Ian Korf
    Teaching new bioinformatics programmers
Robert M. Kuhn
    Using the UCSC Genome Browser to Teach Fundamental Genetic Concepts -- Student-Produced Video Library
Carolyn Lawrence
    Structural Annotation of the Maize Genome: Tools for Community Contributions
Wilson Leung
    GEP: The Web Framework
Suzanna E. Lewis
    WebApollo: A Web-Based Sequence Annotation Editor For Distributed Community Annotation
Jennifer Mansfield
    Research in an undergraduate biology curriculum: investigating the functional genomics of chemosensation in the tobacco hornworm Manduca sexta
Juan C. Martínez-Cruzado
    Caribbean-Focused Genomics Research Projects with Tailored Courses for Developing Human and Computational Infrastructure in Puerto Rico
David Micklos
    DNA Subway: An Intuitive Interface to Introduce Genome Informatics
William R. Pearson
    Bioinformatics Theory and Practice - Striking a Balance
Mihaela Pertea
    Using Next-gen Sequencing Data to Explore the Human Genome
Antonis Rokas
    A Genomics Approach to Identifying the Factors Influencing Phylogenetic Accuracy
Anne Rosenwald
    Mining the Human Microbiome - An Opportunity for Student Research
Daniel A. Russell
    PhagesDB.org: A port in a storm of mycobacteriophage genomic information
Susan R. Singer
    Scaffolding Whole Transcriptome Analysis for Students
James Taylor
    Galaxy: an accessible, collaborative environment for reproducible research
Matthew Vaughn
    The iPlant Collaborative: Bringing Together High Performance Computing and Biology
Spencer Wells
    The Genographic Project as a model for student scientific engagement
Susan R. Wessler
    Transposing from the Research Laboratory to the Undergraduate Classroom

Participants

Logistics

Agenda

HHMI Bioinformatics Workshop for Student/Scientist Partnerships June 14-16, 2012 Howard Hughes Medical Institute Chevy Chase, MD

There is compelling evidence that the best approach to science education is to engage students in scientific research. One strategy for providing research experiences for significant numbers of students is to create student-scientist partnerships, and the general area of genomics/bioinformatics offers an important opportunity to promote such partnerships at the undergraduate level. There are already a number of research problems that have been addressed in this way, and we think the area has great potential.

The goals of the workshop are to identify research problems that are well suited for student-scientist partnerships, and to identify the tools that exist and the tools that are needed for the computing infrastructure, including the integration of research and education. The workshop will bring together a distinguished group of life scientists who are using genomics to address interesting questions, computational scientists devising informatics tools for data mining and analysis, and scientist-educators who are experienced in teaching undergraduates using a research format. We anticipate that the group will enunciate a number of interesting problems that can be adapted for partnership, identifying any computer infrastructure needs and possible strategies for meeting those needs.

Thursday, June 14

3:00 – 5:00 pm  Check in at HHMI Headquarters, Chevy Chase (Conference Center)
5:00 pm  Reception (Great Hall)
6:00 pm  Dinner (Dining Room)
7:15 pm  Session I: Overview – Where are we? Where do we want to go? (Small Auditorium)
    Chair: David Asai
    Welcome: Sean Carroll, HHMI
    Sarah Elgin, Washington University in St. Louis
    David Micklos, Dolan DNA Learning Center, Cold Spring Harbor Laboratory
    Matt Vaughn, Texas Advanced Computing Center
    Charge to the workshop – Sarah Elgin
8:30 pm  Social, with posters set up in the Atrium (The Pilot & Atrium)
    Poster presenters include:
    Lois Banta, Williams College
    Wilson Leung, Washington University in St Louis
    Dan Russell, University of Pittsburgh

Friday, June 15

7:30 am  Breakfast (Dining Room)


8:30 – 10:10 am  Session II: Visualizing Genomes (Small Auditorium)
    Chair: David Micklos
    Juan Carlos Martinez-Cruzado, University of Puerto Rico-Mayaguez
    Jennifer Mansfield, Barnard College
    Suzi Lewis, University of California at Berkeley
    Robert Kuhn, University of California at Santa Cruz
    James Taylor, Emory University
10:10 – 10:30 am  Coffee break (outside Small Auditorium)
10:30 – 11:50 am  Session III: Extracting Meaning from Sequence Data (Small Auditorium)
    Chair: Suzi Lewis
    Steve Cresawn, James Madison University
    David Dooling, Washington University in St Louis
    William Pearson, University of Virginia
    Susan Singer, Carleton College
12:00 pm  Lunch (Dining Room)
1:00 – 2:40 pm  Session IV: Transcriptomes; Metagenomics (Small Auditorium)
    Chair: Susan Singer
    Vince Buonaccorsi, Juniata College
    Mihaela Pertea, The Johns Hopkins University
    Sean Eddy, Janelia Farm Research Campus, HHMI
    Anne Rosenwald, Georgetown University
    Liz Dinsdale, San Diego State University
2:40 – 3:00 pm  Coffee break (outside Small Auditorium)
3:00 – 4:30 pm  Session V: Genomic Explorations (Small Auditorium)
    Chair: Sean Eddy
    Antonis Rokas, Vanderbilt University
    Sue Wessler, University of California – Riverside
    Todd Eckdahl, Missouri Western State University
    Lightning Talks:
    Carolyn Lawrence, USDA-ARS, Iowa State University
    Sam Donovan, University of Pittsburgh
    Alex Hartemink, Duke University
    Ian Korf, University of California at Davis
    Mark Adams, J. Craig Venter Institute
4:30 – 6:00 pm  Poster session (with open bar); suggest and sign up for working groups (Atrium)
6:00 pm  Dinner (Dining Room)
7:15 pm  Session VI: Bringing Genomics to the Public (Small Auditorium)
    Chair: Sean Carroll
    Spencer Wells, National Geographic Society
    Chip Aquadro, Cornell University
8:15 pm  Social, with posters available in the Atrium (The Pilot & Atrium)
    Posters removed at the end of the evening.


Saturday, June 16

7:30 – 8:30 am  Breakfast and room checkout (Dining Room)
8:30 – 9:30 am  Working Group Session I* (Conference rooms**)
9:30 – 9:45 am  Coffee (Great Hall)
9:45 – 10:45 am  Working Group Session II* (Conference rooms)
11:00 am – 12:15 pm  Reporting out from Working Groups (Small Auditorium)
12:15 pm  Boxed lunch and departure for airport (Conference Center)

*Most working group topics will be discussed in both Session I and Session II, so that everyone will have a chance to contribute to the small-group discussion on two topics.

**Conference rooms will be C122, C123, D115, D116, D124, and D125.

Abstracts

Mark Adams, Ramana Madupu, and Lisa McDonald J. Craig Venter Institute

The Genomics Scholars Program: Encouraging a research path for under-represented minorities in community colleges

The J. Craig Venter Institute (JCVI) conducts a broad range of educational activities targeting middle school through PhD students. The DiscoverGenomics! mobile lab program provides hands-on experience to middle-schoolers and is supported by a teacher professional development program. Over 20,000 students have participated in this program in the last six years. Summer and academic year internship programs provide research experience for motivated high-school and undergraduate students. In an effort to promote development of a career path in research for under-represented minorities, we have proposed a program to facilitate the transition from community college to four-year college using a combination of activities, including undergraduate research experience with mentoring and professional development. The program will incorporate multiple avenues of support for students through a multi-year research experience with mentors at JCVI and supplemental professional development. The research experience will be designed to encompass access to experimental designs that emphasize genomics, proteomics, and/or metabolomics datasets, thereby introducing students to both “wet lab” and “dry lab” activities. Professional development will include introduction to the responsible conduct of research, record-keeping and lab notebooks, scientific reading and writing, and presentation skills. Additionally, selected students will have the opportunity to participate in undergraduate minority research conferences, which will expose them to various aspects of research and programs. Collaborative contacts are in place and a grant application to support the program has been submitted.

Name of presenter appears in bold if more than one author.

Charles F. (“Chip”) Aquadro Director of the Center for Comparative and Population Genomics, and Professor, Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY

The Cornell Genetic Ancestry Project: Engaging Undergrads Through Participation

How do you engage a diverse university community from across the disciplines in learning about, and meaningfully discussing, the promise, challenges, risks, limitations and science behind genomics and genetic testing for ancestry and medicine? Virtually everybody is curious about their ancestry. Many have some idea of where their parents and perhaps grandparents lived or came from. But for many, knowledge of the family’s deeper biological ancestry quickly fades a few generations back. Genetics offers a bridge from their knowledge of their parents and grandparents to their deeper genetic ancestry, and a way to see how their own ancestry fits into the grand human diaspora within and then out of Africa within the last 130,000 years. The Cornell University Genetic Ancestry Project was launched in Spring 2011 to engage the Cornell community in such an educational exploration, focusing on engagement through participation. An important additional goal was to provide a fact-based discussion of the diverse social, legal and ethical implications raised by genetic testing, to foster respect for cultural diversity and viewpoints, all the while highlighting humanity’s underlying genetic similarity. I will describe how we pulled this off with 200 randomly chosen Cornell undergraduates and a partnership with Dr. Spencer Wells, Director of the National Geographic Society’s Genographic Project, and kept clear of IRB issues and New York State laws against genetic testing deemed “medically relevant” unless directed or prescribed by a physician.

The results and student reactions were such that this initial effort has been expanded with the development of a new course on Personal Genomics and Medicine for freshmen and sophomores from diverse disciplines across campus, being taught for the first time in Spring 2012 to 50 students. A new, more extensive ancestry event with 400 randomly chosen freshmen entering Cornell in fall 2012 is being planned.


Lois Banta¹, David Esteban², Doyle Ward³ and Bruce Birren³
¹ Department of Biology, Williams College, Williamstown, MA
² Vassar College, Poughkeepsie, NY
³ Genome Sequencing Center, Broad Institute, Cambridge, MA

Metagenomic Analysis of Microbial Diversity in Winogradsky Columns

One of the key features of metagenomic analyses is the ability to identify the vast majority of microbes in a community that cannot be cultivated. The Winogradsky column is one such complex community of metabolically interacting microorganisms, in which mud from a pond is incubated in the sunlight in a plexiglass cylinder with a cellulose source and additional sulfate to promote enrichment for microorganisms involved in the sulfur cycle. Over a period of months, microorganisms requiring a range of environmental conditions proliferate in stratified niches, with distinct populations participating in diverse metabolic activities. Bacterial growth is seen as changes in color from the original grey-brown mud to a rich palette of reds and greens, due primarily to phototrophic microbes. We have initiated a metagenomic project in which undergraduates investigate the microbial diversity in distinct layers of Winogradsky columns using high-throughput sequencing of 16S rDNA sequences. The sequences are then classified and analyzed using bioinformatics tools, and metabolic activities in the different regions of the columns are predicted from the species present. With the generous assistance of CoFactor Genomics, we have also obtained a transcriptome dataset for some columns, which will allow future students to explore not only “who is present” but “what they are doing” in each column layer.

By comparing the microbial populations in columns prepared using varying sources of mud and cellulose, students learn how different environmental conditions influence the diversity and metabolic activities of bacteria and how DNA sequence data can be used to infer biological activities occurring in specific environments. They also develop molecular biology laboratory skills, including DNA extraction, PCR and gel electrophoresis, and computational skills by analyzing the DNA sequence data using the Ribosomal Database Project (RDP), phylogenetics software and student-derived Perl scripts. This project has the potential for expansion to a wide range of educational institutions, with students generating samples from ponds in a variety of ecosystems and climates. Comparisons of 16S rDNA profiles among samples can be performed using the QIIME (“Quantitative Insights Into Microbial Ecology”) package of tools, but statistically meaningful analyses will require multiple replicates, and sequencing capacity and cost are severely limiting factors at this point. Generating sufficient high-quality genomic DNA and RNA has proven to be challenging in a course lab setting. Turn-around time for the sequencing also makes it almost impossible for students to analyze the sequence data from the samples they generated. Improved free assembly tools and ways to assess the quality of the assembly for the transcriptome dataset will facilitate future functional analyses.
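To make the idea of comparing 16S rDNA profiles among samples concrete, the sketch below computes a Bray-Curtis dissimilarity between two column layers from per-read taxon labels. This is an illustration only: it is not QIIME's or RDP's interface, the function names are hypothetical, and the genus labels in the usage example are made up rather than taken from the project's data.

```python
from collections import Counter


def relative_abundance(assignments):
    """Convert a list of per-read taxon labels into {taxon: fraction of reads}."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads were assigned")
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for no shared taxa."""
    taxa = set(profile_a) | set(profile_b)
    # Sum of the smaller abundance for every taxon seen in either sample.
    shared = sum(min(profile_a.get(t, 0.0), profile_b.get(t, 0.0)) for t in taxa)
    total = sum(profile_a.values()) + sum(profile_b.values())
    return 1.0 - 2.0 * shared / total


if __name__ == "__main__":
    # Hypothetical genus-level assignments for reads from two layers.
    top_layer = ["Chlorobium", "Chlorobium", "Chromatium", "Desulfovibrio"]
    bottom_layer = ["Desulfovibrio", "Desulfovibrio", "Clostridium", "Chromatium"]
    top = relative_abundance(top_layer)
    bottom = relative_abundance(bottom_layer)
    print(f"Bray-Curtis dissimilarity: {bray_curtis(top, bottom):.2f}")
```

A single pairwise number like this says nothing about statistical significance, which is why the replicate requirement noted above matters.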


Vincent Buonaccorsi Juniata College, Huntingdon PA

The Genome Consortium on Active Teaching using Next Generation Sequencing

Genomics and bioinformatics are dynamic fields that provide opportunities to form student-scientist partnerships at small liberal arts colleges. Empowering undergraduate faculty with access to state-of-the-art technology and with tools to implement curricular changes is a difficult and evolving challenge. This challenge has been successfully addressed in the last decade by the Genome Consortium on Active Teaching (GCAT), a grass-roots consortium of undergraduate educators. GCAT provided undergraduates access to microarray technology, and has impacted over 300 faculty and 24,000 undergraduates. A major driving factor that enticed a diverse group of faculty to adjust their teaching strategies was the academic freedom associated with integrating their own research questions into an active teaching approach. As Next-Gen sequencing approaches evolved and replaced microarray technology, a new network of educators (GCAT-SEEK) was formed in July 2011 to enable undergraduate access to Next-Generation sequencing and functional genomics using the GCAT organizational model. The consortium now involves over 100 faculty, postdocs, and students from over 80 institutions throughout the country. Major interest areas include genomics, transcriptomics, and metagenomics. GCAT-SEEK aims to engage students in inquiry-based learning that is grounded in the key concepts and competencies of modern biology, and is connected to learning objectives and assessments. In our first year we have identified several bottlenecks that make it difficult to seamlessly integrate next-generation sequencing into undergraduate courses and research experiences. The first major challenge is one of experimental design for the faculty member who is a novice with respect to the technology (e.g. picking the optimal sequencing platform for the project).

The second challenge is navigation through the myriad bioinformatic, computational, and statistical options to efficiently extract meaningful scientific value from the sequence data obtained. Since the fundamental mission of GCAT-SEEK is student training and learning, the third challenge that we identified was development of appropriate and effective pedagogical and assessment tools. While no single network member is an expert in all areas, we responded to these challenges by identifying key individuals with expertise in each area to provide support to other network members using the GCAT-SEEK listserv. Collectively, we are beginning to address the initial bottlenecks. The success and growth of GCAT-SEEK will require improved access to sequencing runs, more powerful computers with more storage space, cost-effective access to commercial software, time to develop pedagogical and assessment tools, and funds for faculty development workshops.


Steven G. Cresawn¹, Daniel A. Russell², and Graham F. Hatfull²
¹ Department of Biology, James Madison University, Harrisonburg, VA
² Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA

Mycobacteriophages Genomics and Bioinformatics As a Positive Feedback Loop

Computers, like centrifuges and pipettes, are essential tools for the modern molecular biologist. The variety of ways in which computers can be applied to scientific problems is essentially limited only by the imagination of the scientist. For many undergraduates, however, the scope of what they can imagine is limited by their lack of exposure to the toolbox of the computational biologist. Most undergraduates in biology lack instruction in database design, programming languages, algorithms, networking, and graphical interface design. Rapid technical advances and falling prices for DNA sequencing will continue to exacerbate the disconnect between traditional biology curricula and the skills needed to manage and analyze large datasets. The first initiative from the HHMI Science Education Alliance, Phage Hunters Advancing Genome Evolutionary Science (SEA-PHAGES), is a large and disseminated bacteriophage genomics program. To date approximately 1,470 students at 71 universities have isolated 2,087 Mycobacterium smegmatis bacteriophages. The genomes of 203 of these phages were sequenced, with the sequence data then returned to the students for annotation and analysis. The large scale and disseminated nature of the SEA-PHAGES program have created challenges for data organization and sharing, participant communication, and collaborative analysis. We have addressed these issues with new software tools designed and implemented with undergraduate students in mind. Our phagesdb.org site, where students can upload information about their phages, is a searchable data sharing platform. It also includes forum-style communication tools and integrates with our comparative genome analysis program, Phamerator.

The Phamerator software features a phage protein “phamily” assembly mechanism and intuitive data representations for whole genome comparisons and individual protein phamilies. While the SEA-PHAGES program introduces students to genomics, we are also using the Phamerator software itself to introduce students to software design. In an upper-level undergraduate Bioinformatics course, students design, implement, and demonstrate new Phamerator features that are made available to the SEA-PHAGES students and other Phamerator users at large. In this system, the genomic and bioinformatic components each drive the other forward.


E.A. Dinsdale¹, R. A. Edwards² and M. Houle Vaughn³
¹ Biology Department, San Diego State University, CA USA
² Computer Science, San Diego State University, CA USA
³ College of Education, San Diego State University, CA USA

Microbes, Metagenomes and Marine Mammals: Enabling the Next Generation of Students to Enter the Genomic Era

The genomic era is upon us, but our students are lagging behind. The sequencing of the human genome in 2001 opened up new opportunities in biology, human health and environmental research. To meet the challenges of the genomic era, undergraduate students need to learn how to operate the new sequencing technologies and analyze the resultant data. In Spring 2010, SDSU students started a new lab-based course that taught them to use a 454 Life Sciences FLX machine. The course has 24 students, including Biology, Cell and Molecular, Ecology and Computer Sciences majors. The course, taught over 15 weeks, gives every student the opportunity to conduct each step of the sequencing process, from DNA extraction, library preparation, emPCR, bead enrichment, and loading the plate to running the machine. The first machine run occurred in the second week of the course. The students met all of the manufacturer’s recommended key targets for the process and produced publication-quality data. At the end of the course (15 weeks) the students had sequenced 14 bacterial genomes, 14 metagenomes and approximately 5-8x coverage of the sea lion genome. The students started the annotation and comparative analysis of some of these sequences. The students not only learned how to sequence DNA, but were involved in generating data that was new to science. Our first paper using the data the students generated has been accepted for publication. Typical comments from the students included “This was my favorite course I’ve ever taken at SDSU” and “I feel like I’ve learned so much and this class has spawned my interest in genomics”. The students reported increased ability in key objectives of student-scientist research projects, including conducting research where no one knew the outcome and having responsibility for the design and completion of a project.

The students thoroughly enjoyed the course, were confident in running a sequencer and were inspired by the process to continue research in genomics, an area that they had previously disregarded.


Sam Donovan Department of Biological Sciences, University of Pittsburgh & BioQUEST Curriculum Consortium

Teaching and Learning Scientific Data Literacy Skills

The success of student-scientist partnerships depends, in large part, on the adoption of shared scientific practices by research teams. Establishing community norms around the collection, management, analysis, and ethics of scientific data is an essential step in the process of building a functional research partnership. The goals of this project include characterizing scientific data literacy skills and developing strategies for teaching and learning about working with scientific data. The products of this effort will support students’ participation in research and deepen their understanding of science by helping them develop a sophisticated working knowledge about scientific data. Traditionally, information about how to work with data has been communicated implicitly, using an apprenticeship model, or explicitly, by providing detailed research protocols. Educational resources for learning about working with data have generally focused on the collection and analysis of small datasets in teaching laboratories. However, participating in modern biological research typically involves a much more diverse set of data manipulations, including working with distributed datasets, curation, annotation, data-mining, and exploratory visualization. As more students become involved in student-scientist partnerships it is important to critically assess, and further develop, the strategies used to socialize them to the scientific norms for working with data. Furthermore, it will be valuable to explore the relationships between students’ understanding of science and their ability to effectively contribute to a shared research project.


David J. Dooling, Scott M. Smith, George M. Weinstock, Elaine R. Mardis, and Richard K. Wilson The Genome Institute at Washington University in St. Louis

Making Sequence Analysis Accessible, or Even Invisible

A significant impediment to making bioinformatics analysis accessible to students and researchers is the sheer scale of data that must be managed and analyzed. Fortunately, many of the core analysis pipelines, those that occur closest to sequence data generation, have stabilized and can reduce the size of data sets by orders of magnitude. Creating systems that perform these analyses in automated and reproducible ways would make many emerging areas of genomics research more accessible to students and smaller research labs by allowing them to avoid working with such large data sets. We have developed an integrated analysis information management system, called the Genome Modeling System (GMS), for managing sample data, analysis execution, and results visualization for genomics research projects. The GMS allows standard analysis pipelines to be configured and executed automatically as new data becomes available, allowing analysts to focus on interpretation of results rather than running scripts. When available, the system can be adopted by core facilities, enabling them to deliver more “usable” data to their clients. GMS will be distributed as a virtual machine image based on Ubuntu Linux, allowing an investigator to immediately begin working with the tools with minimal system administration and bioinformatics expertise. The virtual machine image includes popular genomics software, much of it never officially packaged for the Ubuntu platform, including BWA, VarScan, SAMtools, Picard, Bio::DB::Sam, BreakDancer, and MuSiC. All software is packaged using the native Ubuntu package management system and is served from The Genome Institute’s package repository, allowing facile, efficient upgrades as new versions of tools and the framework are released. Documentation and installation instructions for GMS are available at http://gmt.genome.wustl.edu/.


Todd T. Eckdahl¹, A. Malcolm Campbell², Laurie J. Heyer³, Jeffrey L. Poet⁴
¹ Department of Biology, Missouri Western State University, St. Joseph, MO 64507
² Department of Biology, Davidson College, Davidson, NC 28035
³ Department of Mathematics, Davidson College, Davidson, NC 28035
⁴ Department of Computer Science, Math and Physics, Missouri Western State University, St. Joseph, MO 64507

GCAT SynBio: Building a Community of Faculty Conducting Synthetic Biology Research with Undergraduate Students

The Genome Consortium for Active Teaching (GCAT) was founded with the purpose of making cutting-edge technology available to faculty members working with undergraduates in classroom and independent research settings at diverse institutions. GCAT provided access to DNA microarrays, a centralized microarray scanner, open-source software for data analysis, and faculty training workshops. It established an active community of faculty using microarrays for a variety of model organisms to address teaching goals and research questions of their own design. Building on this success, GCAT SynBio was established to support faculty interested in conducting synthetic biology research with undergraduate students. Synthetic biology can be defined as the use of engineering principles, molecular biology tools, and mathematical modeling to design and construct biological devices with applications in technology, medicine, energy, and the environment. GCAT SynBio has already conducted two faculty summer workshops and will offer another in each of the next three summers, supported by NSF and HHMI. The workshops will train 90 faculty members from 45 institutions, with priority given to those faculty teaching many underrepresented minorities. Training faculty has a multiplier effect: each faculty member will teach at least 100 students a year, which will result in 18,000 students benefiting directly from these workshops by the fall of 2014, and 9,000 students will benefit per year thereafter. The workshops introduce synthetic biology to two-member multidisciplinary faculty teams and provide them with opportunities to develop an action plan for undergraduate research.

GCAT SynBio has also established a collection of cloned biological parts, is funding the production of a starter collection of parts, and has developed software called GCAT-alog for management and distribution of parts. As the community of faculty conducting synthetic biology research with undergraduates grows, consideration should be given to the relationship between the research agenda of scientists at research institutions and the research agenda of faculty at PUIs, to the development of bioinformatics tools that facilitate synthetic biology research, and to the potential of genomics research to generate synthetic biology research proposals.


Sean Eddy Janelia Farm, HHMI

Thoughts on bioinformatics education From my perspective as a computational genome biologist whose primary training was in experimental biology, I’ll share some thoughts which I hope are relevant to the workshop’s Eddy, S. topic of undergraduate education. I often work with biologists who are intimidated by computational analysis of their large genomic datasets. Ironically, biologists are in fact very well trained for critical analysis of immensely complex datasets, because they have been trained to analyze immensely complex biological systems. Biologists learn a rigorous (even paranoid) discipline of designing creative positive and negative controls and independent lines of verification, because they must worry about being blindsided by unexpected effects and artifacts. In contrast, many classical statistical tests in data analysis were developed for testing single effects on small datasets, by rejecting a “null hypothesis”. In complex biological systems, simple null hypotheses are often sure to be false, though not necessarily for the reason we first think. However, biologists often lack the confidence to apply their discipline to computational data analysis, in part because they believe that large data analysis requires complex databases, mathematics, computer programming, and computer science beyond the skill set they can reasonably cope with learning. I’ll show examples of how my laboratory manages large datasets with rudimentary tools that anyone can learn, and how we analyze datasets with simple techniques directly analogous to experimental biology. I advocate for training undergraduates in just a few necessary tools -- the UNIX (MacOS/X or Linux) file system and command line, and script writing in either Perl or Python. 
I’ll also advocate for two fundamental techniques that my laboratory uses -- the use of subsampling to pull small random samples of data from a much larger dataset that we can actually look at by eye, and the use of simulations to create positive and negative control datasets. I will show examples from an area my lab works in (noncoding RNA analysis) where basic sampling and simulation experiments, suitable for undergraduate teaching examples, reveal interesting problems in key papers.
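The two techniques named above, subsampling and simulated controls, can be sketched in a few lines of Python. This is a minimal illustration and not the laboratory's own code: the function names, the stand-in records, and the example sequences are all invented here. Reservoir sampling is one common way to draw a small random sample from a file too large to load; shuffling a sequence is one simple way to build a negative control, and planting a known motif is one simple way to build a positive control.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Pick k records uniformly at random from an iterable of unknown length.

    Only k records are held in memory, so this also works on very large files.
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)      # fill the reservoir first
        else:
            j = rng.randint(0, i)      # randint is inclusive on both ends
            if j < k:
                sample[j] = record     # keep the new record with probability k / (i + 1)
    return sample


def shuffled_control(seq, seed=0):
    """Negative control: same residue composition, original order destroyed."""
    rng = random.Random(seed)
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def implanted_control(background, motif, position):
    """Positive control: a known motif planted at a known position."""
    return background[:position] + motif + background[position + len(motif):]


if __name__ == "__main__":
    # Hypothetical stand-in for a large dataset: 100,000 numbered records.
    big = (f"record_{n}" for n in range(100_000))
    for line in reservoir_sample(big, k=5):
        print(line)

    rna = "GGGAAAUCCCGGGAAAUCCC"  # invented sequence, for illustration only
    print(shuffled_control(rna))
    print(implanted_control("A" * 20, "GGGUCCC", 5))
```

An analysis method can then be run on all three inputs: it should find the implanted motif, find nothing in the shuffled sequence, and give results on the small sample that can be checked by eye.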

15 HHMI bioinformatics workshop for student/scientist partnerships

Sarah C R Elgin1, Wilson Leung1, Christopher D Shaffer1, Elaine Mardis1, David Lopatto2 1Washington University in St Louis 2Grinnell College

GEP: The Genomics Education Partnership

An effective method for teaching science is to engage students in doing science. With high-throughput technologies, particularly DNA sequencing, getting cheaper, genomics is not only becoming an important part of our curriculum, it also provides unprecedented opportunities for students to participate in original research. The Genomics Education Partnership (GEP) includes faculty from over 80 schools, mostly primarily undergraduate institutions. Using a versatile curriculum that has been adapted to many different class settings, GEP undergraduates undertake projects to improve draft-quality genomic sequences and/or participate in the annotation of these sequences. GEP undergraduates have improved more than 4 million bases of draft genomic sequence from several species of Drosophila and have produced hundreds of gene models for four different Drosophila species using evidence-based manual annotation. Comparing the Muller F element of D. melanogaster with that of D. virilis, D. mojavensis, and D. grimshawi, students have documented the movement of genes between euchromatic and heterochromatic domains, showing that such wanderer genes often adopt the characteristics found in their surrounding genomic landscape (gene size, codon bias). Students appreciate the opportunity to make an original contribution as members of a research team. Student assessment using the CURE survey and custom knowledge quizzes shows gains in student understanding of the research process and in understanding the organization of genes and genomes. While some students struggle with the burden of responsibility, the vast majority (~85%) of student comments (in an anonymous survey) are positive, and many are very enthusiastic. Participating faculty also report professional gains, increased access to genomics-related technology, and an overall positive experience.
A genomics research project can be used not only as the core of a laboratory course, but also within a lab for a broader course or as an independent research project. We find this approach to teaching and learning to be rewarding for both faculty and students. (See our website at http://gep.wustl.edu.) Supported by HHMI grant # 52005780 and NIH grant R01 GM068388.

Alex Hartemink Duke University, Departments of Computer Science, Statistical Science, and Biology

Course: Introduction to computational genomics

Course overview: A computational perspective on the exploration and analysis of genomic and genome-scale information. Provides an integrated introduction to genome biology, algorithm design and analysis, and probabilistic and statistical modeling. Topics include genome sequencing, genome sequence assembly, local and global sequence alignment, sequence database search, gene and motif finding, phylogenetic tree building, and gene expression analysis. Methods include dynamic programming, indexing and hashing, hidden Markov models, and elementary machine learning. Helps develop practical experience with handling, analyzing, and visualizing genomic data using the scripting language Perl.

Course prerequisites: The course will require students to program in Perl. Students coming in to the course should know how to program in some computer language, but it need not be Perl. Students should also have had some exposure to basic probability and molecular or cellular biology; however, the course has no formal course prerequisites, and significant background will be provided. Please speak to the instructor if you are unsure about your background.

Course issues/challenges:
• The course presumes the ability to program, which is a barrier for some, especially those coming from biology. On the other hand, it has no formal course prerequisites and does not presume a familiarity with algorithm design or genome biology because I cover those.
• I cannot find a suitable textbook, so I do not require one, though I do provide suggestions regarding optional textbooks that have at least some of the material.
• The course does not provide sufficient experience with visualizing large amounts of data. We currently focus on understanding and creating algorithms, but a larger visualization component would be “a good thing”.
• For computational reasons, most problems involve working with small amounts of data (albeit real data); it would be nice if there were easier ways to have students scale up their code to handle larger sets of data, but this seems possibly a second-order concern.
• I would love to change from Perl to Python, but have not yet had time to switch over the entire course infrastructure.

The bigger question: How can we scale up good instruction? It’s not just the tools, algorithms, data, and other resources that need scaling: it’s also the ability to clearly explain concepts across multiple disciplines that needs to be scaled up. How do we scale up good instruction? As just one example, can we imagine a useful version of “Khan Academy” for introductory computational biology? I would be willing to participate in teaching as part of something like that, but it seems non-obvious to me how to start.

Ian Korf and Keith Bradnam UC Davis Genome Center

Teaching new bioinformatics programmers

We have been teaching bioinformatics programming to students at the high school, undergraduate, graduate, postdoc, and PI levels. Our philosophy is that anyone can learn to program just as anyone can learn math or writing. Of course, some people learn much faster than others, and this is particularly true of programming. When we teach in a classroom environment, we therefore allow students to work at their own pace. Our curriculum teaches Unix and Perl because of their popularity in bioinformatics. We have written two texts, and one is freely available. See http://unixandperl.com. We will describe our course of study and provide some pointers on helping biologists become bioinformaticians.

Robert M Kuhn1, David Haussler1,2 1Center for Biomolecular Science and Engineering, University of California, Santa Cruz; Santa Cruz, CA 2Howard Hughes Medical Institute, University of California, Santa Cruz; Santa Cruz, CA

Using the UCSC Genome Browser to Teach Fundamental Genetic Concepts -- Student-Produced Video Library

The UCSC Genome Browser is a graphical display platform to provide access to the DNA sequence and annotations for a large number of organisms, including humans, important model organisms and many important vertebrates. As part of the display mechanism, many fundamental concepts of genetics and molecular biology are evident in an interactive, graphical format. For example, the basic intron-exon structure of most eukaryotic genes is evident at a glance in the alignment display of mRNAs and the gene prediction tracks of the main Browser graphic, as is the concept of untranslated regions of a transcript. Many other concepts in modern genetics are also easily viewed in the data displays: gene regulation predictions are supported by cross-species sequence homology and transcription factor binding data; evolutionary relationships among animals are shown in Conservation tracks; and disease can be associated with copy-number variants. Insofar as variation at several scales is the basis for genetic research, understanding variation at multiple scales is important. The various data tracks of human annotations in the Browser illustrate these concepts, including single nucleotide polymorphisms, large copy-number variants, segmental duplications and data from the HapMap projects. We propose that a small group of motivated students could make short, five-minute video clips teaching many of these concepts to other students using the UCSC Genome Browser. Using their own perspective as undergraduate students, and being versed in social media and a video culture, they are well suited to creating production values that will speak to others of their generation. The video producers also have the opportunity to explore problems of their own choosing when fashioning their content.
For example, the evolution of rhodopsin photoreceptors in primates is an interesting story that can be told using the comparative genomics features of the Genome Browser. Sequence conservation as a living footprint of evolutionary relationships is support for the “theory” of evolution that is completely independent of the fossil record, a compelling story that cannot be told to students often enough. That conserved non-coding regions actually lead to discovery of functionally significant regions in the genome is evidence of the testability of this controversial “theory.” In the process of producing videos such as these, the students themselves sharpen their own understanding, which is never challenged so much as when tasked to explain something, and they also learn to manipulate large datasets and familiarize themselves with the wide variety of data available.

Carolyn Lawrence1,2, Jon Duvick, and Volker Brendel3 1USDA-ARS Corn Insects and Crop Genetics Research Unit, Ames, IA 50011 USA 2Iowa State University Department of Genetics, Development and Cell Biology, Ames, IA 50011 USA 3Indiana University Department of Biology, Bloomington, IN 47405 USA

Structural Annotation of the Maize Genome: Tools for Community Contributions

The maize genome has been sequenced and gene models are predicted for the pseudomolecules (i.e., sequence representing ten chromosomes plus “chromosome 0”, which includes sequences that are not placed on the ten chromosomal scaffolds). Students and research scientists alike are able to evaluate and improve gene structures via yrGATE, a system developed for structural annotation of plant genomes via PlantGDB (for maize, see Community Annotation Central at ZmGDB; http://www.plantgdb.org/yrGATE/ZmGDB/CommunityCentral.pl). The ZmGDB system is also accessible via the MaizeGDB Genome Browser, where researchers can visualize the quality of gene structures alongside other sequence-indexed information including various assembly/genome features, diversity and expression data, and much more. To access ZmGDB from the MaizeGDB Genome Browser, visit http://gbrowse.maizegdb.org and click links from the “B73 RefGen_v2 Gene Models: Quality” track. Annotations contributed by the community are stored at ZmGDB and are made accessible via both the ZmGDB website and the MaizeGDB Genome Browser via DAS, the Distributed Annotation System.

Wilson Leung, Christopher D. Shaffer, Sarah C.R. Elgin Washington University in St. Louis

GEP: The Web Framework

Supported by a grant from the Howard Hughes Medical Institute, the Genomics Education Partnership (GEP) aims to provide undergraduate students the opportunity to participate in genomics research during the academic year. With faculty members from over 80 primarily undergraduate institutions, undergraduates participating in the GEP improve the draft sequence and create curated gene sets for the Muller F element and portions of the Muller D element from several Drosophila species. To facilitate collaboration among GEP faculty and students, we developed a web framework that consists of four primary components: the GEP web site for sharing curriculum materials; the GEP wiki and bulletin board for group discussions and technical assistance; the Project Management System for managing project claims and submissions; and web-based tools for sequence analysis. Students participating in annotation learn how to use common bioinformatics tools (e.g. BLAST, ClustalW, UCSC Genome Browser). To generate high-quality annotated domains, sequences improved by GEP students are loaded onto a GEP mirror of the UCSC Genome Browser with evidence tracks from sequence alignments, gene predictors, and RNA-Seq data (with predictions from TopHat and Cufflinks). Students construct gene models using these resources and verify their models using the Gene Model Checker. Faculty members collect student annotations at the end of the semester and submit projects through the Project Management System. Two independent groups complete each project (for quality control purposes) and the GEP staff reconciles the submissions to create the final gene set. While the GEP web framework generally works well, it currently does not allow GEP members to investigate other regions or species of interest. We seek to develop web-based pipelines that would enable typical biology faculty members to construct genome browsers and annotation projects for additional regions of interest.
The pipelines that are currently available either require considerable bioinformatics and UNIX expertise to set up (e.g. MAKER) or are optimized for certain genomes with limited configuration options (e.g. DNA Subway). We are currently designing a system with Galaxy that would allow domain experts to create workflows that can then be used by other GEP faculty members. Given that large ChIP-seq and ChIP-chip datasets (e.g. from modENCODE) are now publicly available, we would also like to incorporate the study of epigenetics into the GEP curriculum. For example, students could use R to perform K-means clustering to examine the combinatorial patterns of histone modifications for a group of genes.
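The kind of clustering exercise mentioned above can be sketched briefly. The abstract proposes R; the sketch below uses Python with only the standard library, purely for illustration. The gene rows, the two histone marks chosen as columns, and the signal values are all invented and are not GEP or modENCODE data, and the function is a plain implementation of Lloyd's algorithm rather than any particular package's routine.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means (Lloyd's algorithm). Returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)    # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(pt, centers[c]))
            for pt in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:                # leave an empty cluster's center in place
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


if __name__ == "__main__":
    # Invented, normalized signal per gene: (H3K4me3, H3K9me3).
    genes = ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF"]
    signal = [
        (0.90, 0.10), (0.80, 0.20), (0.85, 0.05),   # active-like pattern
        (0.10, 0.90), (0.20, 0.80), (0.05, 0.95),   # silent-like pattern
    ]
    labels, centers = kmeans(signal, k=2)
    for gene, label in zip(genes, labels):
        print(gene, "cluster", label)
```

With real data, each row would hold one gene's signal across many marks, and the choice of K would itself be a question for students to explore.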

Ed Lee1, Gregg Helt1, Nomi Harris1, Mitch Skinner1, Christopher Childers2, Justin Reese2, Monica C. Munoz-Torres2, Christine G. Elsik2, Ian Holmes3, Suzanna E. Lewis1 1Berkeley Bioinformatics Open-source Projects, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA 2Georgetown University, Washington DC, 20057, USA 3Department of Bioengineering, University of California at Berkeley, Berkeley, CA, 94720, USA

WebApollo: A Web-Based Sequence Annotation Editor For Distributed Community Annotation

Technical advances will continue to make sequencing faster and cheaper, but the generation of raw sequence data is analogous to converting raw ore into iron: something useful, like a car, still needs to be created from this basic commodity. In addition, not only is sequencing becoming less expensive, but it is also becoming more universal. It is no longer restricted to large sequencing centers, but is now a laboratory technique that individual researchers can take advantage of. To turn this basic commodity into something biologically meaningful requires that genomic annotation efforts adapt to keep pace. Thus the curation environment is shifting from a traditional centralized model, in which all curators for a given genome project share the same physical location, to a geographically dispersed community annotation model, which requires new tools to support community annotation efforts. WebApollo was designed to provide an easy-to-use, web-based environment that allows multiple distributed users to edit and share sequence annotations. WebApollo comprises three components: a web-based client, a server-side annotation editing engine, and a server-side service that provides the client with data from different sources, including databases at the University of California at Santa Cruz, Ensembl, and Chado. The web-based client is designed as an extension to JBrowse, a JavaScript-based genome browser that provides a fast, highly interactive interface for the visualization of genomic data. This JBrowse extension provides the gestures needed for editing annotations, such as dragging and dropping features to create new annotations of genes, transcripts and other genomic elements, dragging to change exon boundaries of existing annotations, and using context-specific menus to modify features. The extension also connects to the annotation-editing service and the data-providing services. The server-side annotation-editing engine is written in Java.
It handles all the necessary logic to edit and deal with the complexities of modifications in a biological context, where a single change can have multiple cascading effects (e.g., when splitting or merging transcripts). Edits are stored persistently in the server, allowing users to quickly recover their data in the event of unexpected browser or server crashes. The server provides synchronized updates over multiple browser instances, so that every edit is immediately visible to all users who are viewing or editing the same region. It offers multiple levels of user accessibility, allowing project owners to decide with whom to share their work, and whether to allow read-only or both read and write access. The server-side service that provides data to the client is built on top of Trellis, a Distributed Annotation System (DAS) server framework. It sends JBrowse-supported JavaScript Object Notation (JSON) data, rather than the more verbose DAS XML. We also developed a Trellis plugin to access data from the UCSC MySQL genome database, which provides easy access to that popular data source. All three components are open source and provided under the BSD License. An extensive group of curators and investigators from the bee genome research community, spread over academic institutions worldwide, are currently beta-testing WebApollo. Their curation efforts, findings and interactions will dramatically upgrade the quality of the annotation data for the genomes of honey bee (Apis mellifera), and two bumble bees (Bombus impatiens and Bombus terrestris), which will lead to a better understanding of the biology of these social insects. WebApollo will be publicly released in fall 2012. Public demo: http://icebox.lbl.gov:8080/ApolloWebDemo Project web page: http://gmod.org/wiki/WebApollo

Jennifer Mansfield, Brian Morton, John Glendinning, and Jessica Goldstein Barnard College, Columbia University, New York, NY

Research in an undergraduate biology curriculum: investigating the functional genomics of chemosensation in the tobacco hornworm Manduca sexta.

Animals detect environmental chemicals through their chemosensory systems, which include taste and olfaction. Chemosensory genomics provides an excellent area for undergraduate research because relevant gene families are characterized in model organisms, yet there is substantial variation in the complements and functions of chemosensory genes across species, allowing exploration of original research questions in genomics, molecular genetics, physiology and ecology. The Manduca functional genomics project, supported by the HHMI, introduces original research across the Barnard biology curriculum, and draws on different areas of faculty expertise. Students are introduced to Manduca in Introductory Biology Lab, where they conduct behavioral tests of larval gustatory preferences. Upper level students participate in various aspects of the project through modules in five different courses and through individual research in faculty labs. Student-led projects have included bioinformatic identification of chemosensory genes using our Manduca chemosensory transcriptome database, PCR-based gene cloning, development of an RNA interference protocol in sensory neurons, and use of quantitative RT-PCR to test hypotheses about chemosensory gene expression. Ongoing and future projects are aimed at investigating chemosensory gene function using behavioral and electrophysiological assays and RNAi. In the two years since the project was introduced, about 400 students have completed Introductory Biology Lab. Four students have conducted their senior thesis research on the project, and about 60 students have participated in upper level courses in the areas of genetics, genomics, and molecular biology. The project will soon be introduced into the Animal Physiology Lab.
Within these courses, previously taught concepts and techniques are now presented in the context of original research. The focus on one research area has facilitated deeper engagement with primary literature as well. Tools and information generated as the project progresses, and the recently sequenced Manduca genome, increasingly allow students to formulate and test their own hypotheses. Further engagement with the research is promoted by collaboration across courses and between students conducting individual research and those in courses. Students working on the genomics and molecular biology aspects of this project manage data from many sources, including their own results. It could be useful to have a web-based resource where students can store DNA and protein sequences and annotate them using information gathered from different sites (Genbank, genome browsers, Manducabase, Flybase, etc.) and from their own experiments. This would be especially helpful as students are beginning to collaborate across courses and semesters.

Juan C. Martínez-Cruzado1, Steven Massey2, Juan L. Rodríguez-Flores3, and Taras K. Oleksyk1 1Department of Biology, University of Puerto Rico at Mayagüez, Mayagüez, PR 2Department of Biology, University of Puerto Rico at Rio Piedras, Rio Piedras, PR 3Department of Genetic Medicine, Weill Cornell Medical College, New York, NY

Caribbean-Focused Genomics Research Projects with Tailored Courses for Developing Human and Computational Infrastructure in Puerto Rico

Our long-term goal is to build a genome analysis center for the Caribbean. Our strategy for accomplishing this goal is centered on the creation of undergraduate research courses with strong components on the use of bioinformatics tools to solve problems in genomics, and on the development of genomics research projects highly relevant to Puerto Rico. Our initiative has attracted much attention and received substantial support from the local community. We are in the process of developing shared computing resources, including a computer cluster at UPR-Rio Piedras, where we have performed several analyses on the recently sequenced Puerto Rican Parrot (Amazona vittata) genome. Scaffold matches to chicken and zebra finch chromosomes suggest extensive divergence and that a de novo assembly would be necessary to avoid alignment bias toward the reference species genome. A. vittata is a critically endangered species, and the only surviving native parrot species in the United States or its territories. Among our most urgent needs is cross-platform annotation software that would allow us to make genome data publicly available and enable students from other universities to contribute to the genome annotation. Under the Genomics Education Partnership (GEP) we developed a Genomic Annotation course that initially focused on the annotation of different Drosophila species. This semester, students began with Drosophila annotation training, and continued to map the 100 biggest scaffolds of the parrot genome to the reference chicken genome. Class participants created a database of the putative genes present in their scaffolds, as well as their ontology and evolutionary conservation.
Currently, they are annotating parrot scaffolds using chicken genes from Ensembl as reference, and locating positions within the scaffolds using MEGA. The GEP Gene Model Checker is used to test gene models even in the absence of Drosophila orthologs. We are also developing a series of courses on Local Genome Diversity Studies in which undergraduates learn to interview, collect the samples and perform molecular lab procedures. In the first year, they also begin participating in journal clubs focused on human genetics. Advanced students learn to incorporate the genotype information and population frequencies for SNPs from the 1000 Genomes Project (1KGP) and HapMap databases. Next year they will use bioinformatics tools to identify a gene of interest related to a particular phenotype and identify those SNPs that are particularly frequent in the 1KGP Puerto Rico population sample set or that may have a strong phenotypic impact.

David Micklos1, Anthony Biondo1, Cornel Ghiban1, Eun-sook Jeong1, Mohammed Khalfan1, Sheldon McKay1, Jason Williams1, and Uwe Hilgert2 1iPlant Collaborative, DNA Learning Center, Cold Spring Harbor Laboratory, Long Island, NY 2iPlant Collaborative, BIO5 Institute, University of Arizona, Tucson, AZ

DNA Subway: An Intuitive Interface to Introduce Genome Informatics

Genome analysis provides opportunities for undergraduate students to discover basic principles of molecular biology while embarking on research projects using available DNA and RNA sequence data. However, navigating the complex world of bioinformatics tools and sites can be daunting for beginners. DNA Subway provides an appealing way to introduce genome informatics to students. Using the visual metaphor of a subway map, this educational platform bundles research-grade bioinformatics tools and databases into intuitive workflows and presents them in an easy-to-use interface. Each of the four DNA Subway lines focuses on a different problem in genome analysis. The Red Line allows students to predict genes in up to 150 kb of DNA, add in RNA- or protein-based evidence, develop gene models, and make functional annotations. The Blue Line articulates with a complete set of lab materials for DNA barcoding, allowing students to merge biochemistry, bioinformatics, evolution, and ecology. DNA barcode sequences are automatically uploaded into the workflow, which includes both sequence analysis and tree-building tools. The Yellow Line allows students to search for transposons and gene families in sequenced genomes. The Green Line anticipates the widespread use of next generation sequence data in undergraduate institutions, allowing students to conduct transcriptome studies using RNA sequence (RNA-seq). Major obstacles that were overcome in developing DNA Subway included reformatting input/output data and ensuring smooth operation across all major web browsers. Several new web applications were developed to improve upon proprietary desktop applications - including an electropherogram viewer and a DNA barcode viewer for aligned DNA sequences.
A maize genome annotation project at Truman State University showed that undergraduate students transition easily from DNA Subway to genome browsers, providing evidence for the importance of educational interfaces in course-based research projects. The design and informatics expertise gained in developing the DNA Subway interface can inform efforts to make educational gateways in other areas of biological research. DNA Subway is part of the iPlant Collaborative, a five-year project to develop a national computer infrastructure that applies computational thinking to solve biological problems.

William R. Pearson Dept. of Biochemistry and Molecular Genetics, U. of Virginia

Bioinformatics Theory and Practice - Striking a Balance

The explosion of genome sequence data, the rapidly expanding size and diversity of biological databases, and the broad spectrum of computer programs and web sites available to analyze biological data have dramatically reduced the cost and difficulty of addressing novel and challenging biological questions. New, unanalyzed data is cheap and easily accessed, and powerful tools like BLAST make it possible to annotate a metagenome or explore complex evolutionary questions in weeks or months that earlier required a graduate career. However, the ease with which data can be obtained and analyzed encourages an analytical strategy of compute first - understand later. And the ready availability of computational tools and database resources based on sequence data can drive research questions towards the “light under the lamppost.” Computational biology courses must balance the students’ (and faculty’s) desire to “get some results” with the longer term goal of developing a basic understanding of the computational, statistical, and biological foundations on which the analyses depend. The problem is compounded by the interdisciplinary nature of Bioinformatics research; Computer Scientists may be unfamiliar with biological “improbabilities,” while Biologists may not understand the tradeoffs inherent in different computational approaches. Students at the lab bench learn quickly that unexpected experimental results usually result from technical errors. It is much more difficult to appreciate anomalies in large datasets, and high-profile mistakes are made by very sophisticated groups. Unfortunately, while there are excellent textbooks describing what Bioinformatics tools do, it is much less common to learn about what they cannot do, and how they may mislead.
Increased emphasis at the interface between biology and statistical approaches, and a better understanding of the tradeoffs in computational tools and database resources, should provide students with a stronger foundation in Bioinformatics and a better understanding of the ambiguities inherent in every experimental science.

Mihaela Pertea McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD

Using Next-gen Sequencing Data to Explore the Human Genome

The latest generation of DNA sequencing technology has spurred a tremendous increase in the amount of genome data available, opening new doors to student-scientist partnerships that can use sequencing to answer fundamental questions in biology and medicine. Today we can collect deep sequencing data from a human genome in less than a week, using just a single run of the latest high-throughput sequencer. The introduction of high-throughput sequencing technologies has also transformed the field of transcriptomics, through the use of RNA-seq to capture the complete set of mRNA transcripts in a cell. There are many new bioinformatics tools that students and faculty can use to analyze these new types of sequence data. For example, students can use Bowtie, a short-read aligner developed within our group, to map short sequence “reads” to the genome. They can then use the resulting alignments, in conjunction with other software tools, to identify genetic variations. Bowtie is also the engine behind TopHat, a program that aligns RNA-seq reads to a genome in order to identify exon-exon splice junctions. By using TopHat together with the transcriptome assembler Cufflinks, students can identify which transcripts are expressed, how many splice variants are present, and how much of each transcript is present in a cell. All these programs can be run on a standard desktop computer in a UNIX environment, but users without UNIX skills could also operate them through the web interface provided by the Galaxy project. Biology students in the 21st century must have the requisite computational skills to run these tools and many others, in order to sift through huge volumes of next-generation sequencing data and make meaningful biological discoveries.

Antonis Rokas Department of Biological Sciences, Vanderbilt University, Nashville, TN

A Genomics Approach to Identifying the Factors Influencing Phylogenetic Accuracy

Phylogenies are the foundation of comparative biology and their inference is essential to modern biology. However, phylogenetic reconstruction is fraught with challenges, such that the relationships among several clades in life’s history, as well as among elements in genomes, remain poorly understood; identifying the factors compromising accuracy in molecular phylogenetic reconstruction is critical for understanding the evolution of genes, phenotypes and lineages. The Saccharomyces-Candida yeast lineage represents a superb model to study these factors, due to the abundance of genomic data from several species and the exquisite knowledge about the function of elements in the yeast genome. Aided by an NSF CAREER award, we are using two dozen yeast genomes to identify the factors that influence the performance of genes and rare genomic changes (RGCs). To promote the understanding of phylogenetics and its importance for understanding the evolution and function of genomes, we are also developing a series of modules on key concepts in phylogenomics research that integrate our research program into the undergraduate classroom. Specifically, we are creating lecture-practical pairs focused around a central theme designed to illustrate the detrimental or beneficial effects of a variety of factors on phylogenetic accuracy. The thousands of orthologous gene sets and RGCs identified from the yeast genome data provide us with a rich data source from which we can carefully handpick data sets that best highlight the usefulness of novel molecular markers or that best showcase the effects of specific factors on phylogenetic accuracy. In the lecture component, the students lead classroom-wide discussions of the conceptual issues and seminal publications related to the central theme.
In the practical component, the students analyze a data set of known behavior both with and without accounting for the factor in question and discuss which of the two experimental results is correct and why. Once students have gained familiarity with both theory and practice, our entire data set could be further employed to design studies, through a student-scientist partnership, which test the effect of a factor genome-wide. Example lecture-practical modules that can be created include factors such as base composition, marker informativeness, introgression, natural selection, and alignment. Biological data sets demonstrating the effects of particular factors on phylogenetic inference are highly sought-after by course organizers and are likely not only to be very useful to teachers of phylogenomics courses but also to mitigate bioinformatics illiteracy in the molecular biology research community.

Anne Rosenwald1, Gaurav Arora1, Ramana Madupu2, Jennifer Roecklein-Canfield3, and Janet Russell4 1Department of Biology, Georgetown University, Washington, DC 2J. Craig Venter Institute, Rockville, MD 3Department of Chemistry, Simmons College, Boston, MA 4Center for New Design in Learning and Scholarship, Georgetown University, Washington, DC

Mining the Human Microbiome - An Opportunity for Student Research

The explosion of microbial sequence data available for study lends itself to student-scientist partnerships. We are promoting the use of these data, particularly from the Human Microbiome Project (HMP), in our NSF-funded project, Genome Solver, in collaboration with the J. Craig Venter Institute, one of the four sequencing centers involved in the HMP. The project has two aims: first, to conduct workshops for undergraduate faculty so that they can learn to use these data in their classrooms to perform novel research with their students; and second, to create an online community (genomesolver.org) where students and faculty from multiple disciplines can share ideas, research, curriculum modules, and pedagogical techniques for classroom use. Questions that can be approached using these data include proposing functions for novel genes based on sequence homology, looking for evidence of horizontal gene transfer, examining divergence of gene sequences within clades or phyla, comparing the same species found in two different environmental niches, and examining the diversity of species in a given environmental niche. Hypotheses generated by exploration of genomic data can in some instances be followed by wet lab experiments. Participation in such research addresses the need for students to experience authentic research, and also demonstrates to them the need for math and computer science skills in order to solve biological questions. While our efforts demonstrate that it is possible to function in a stand-alone mode using a variety of web-based sequence analysis tools, it would be advantageous to aggregate research efforts and data about individual genomes in a common framework.
Such a tool would be useful for other bacterial genomics efforts, such as that promoted by the American Society for Microbiology and the Joint Genome Institute (ASM-JGI), which has focused on environmental microbes rather than the human microbiome. Such a site for “one-stop shopping” would allow faculty and their students to focus more on biological questions, rather than spending classroom time learning a variety of different analysis tools. Another useful tool would be one that aggregates the metagenomic data acquired from 16S rDNA data sets. Having a single location for these data would be valuable not only for faculty and their students, but for the research community as well.

Daniel A. Russell and Graham F. Hatfull Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA

PhagesDB.org: A port in a storm of mycobacteriophage genomic information

In 2003, there were just 13 sequenced genomes of bacteriophages that infect mycobacterial hosts. Through the Phage Hunters Integrating Research and Education (PHIRE) program at the University of Pittsburgh, this number climbed to 60 genomes by 2009. Then, as a consequence of broader dissemination of this platform through the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science (SEA-PHAGES) initiative, as well as substantial advances in sequencing technologies, the number of sequenced mycobacteriophage genomes has skyrocketed, and stands at more than 350 today. Such a large data set presents both challenges and opportunities, not only due to the quantity of genomes, but because of the abundant yet dispersed community of mycobacteriophage researchers that desires to contribute to, access, and interpret the information. The SEA-PHAGES community alone includes nearly 2,000 undergraduate students and 200 faculty at over 60 institutions located in more than 30 states. PhagesDB.org was thus conceived as a way to centralize data storage and access, and to provide analytical tools to eager young researchers. PhagesDB.org currently allows users to enter data, run local BLAST searches, check cluster and subcluster information, view information generated by the comparative genomics program Phamerator, download data, map phage isolation locations with GPS coordinates, access protocols, compare phage micrographs, and discuss questions via our phorum. These capabilities are widely employed, with over 4,000 unique monthly users for each of the past six months. Many undergraduate users are first exposed to PhagesDB.org when inputting basic data about phages they have discovered, so they are immediately given a sense of both individual accomplishment and the scope of the larger mycobacteriophage research community of which they are now a part.
They can then use the tools on PhagesDB.org to unlock the genomic secrets of their phages, and a few students have even contributed directly to the capabilities of the site itself. As research questions are developed by SEA-PHAGES students and other phage researchers around the country, we will strive to add or create - with the assistance of undergraduates - new tools for the site. One thousand sequenced mycobacteriophage genomes are just around the corner, and only a large community of researchers, equipped with appropriate tools, will be able to make sense of this rich and exciting deluge of data.

Susan R. Singer1, Jodi Schwarz2, Benjamin J. Taylor3, Cathy Manduca4, Sean Fox4, Ellen Iverson4, Jeff J. Doyle5, Dan Ilut5, Andrew Farmer6, and Gregory D. May6 1Department of Biology, Carleton College, Northfield, MN 2Department of Biology, Vassar College, Poughkeepsie, NY 3Department of Entomology, University of Wisconsin, Madison, WI 4Science Education Resource Center, Carleton College, Northfield, MN 5Department of Plant Biology, Cornell University, Ithaca, NY 6National Center for Genome Resources, Santa Fe, NM 7USDA ARS, Iowa State University, Ames, IA

Scaffolding Whole Transcriptome Analysis for Genetics Students

Carleton and Vassar genetics students engage in research related to climate change and the biology of the prairie plant Chamaecrista and the anemone Aiptasia, respectively. Students begin at the whole organism level; move to transcriptome analysis scaffolded with a web-based ‘Genomics Explorer’ emphasizing strategy choice; design and conduct molecular experiments; and prepare final papers and presentations. Learning goals include: 1) Propose a hypothesis, based on journal articles, that can be tested using transcriptome data. 2) Frame meaningful questions that can be addressed using transcriptome data. 3) Modify the hypothesis based on data analysis. 4) Use multiple bioinformatics resources without losing focus on a biological question. 5) Distinguish information that can be extracted from transcriptome vs. genome sequences. 6) Leverage an investigation by using the whole transcriptome for analysis (i.e. pattern search). Our study unpacks factors maximizing student movement on the novice-to-expert continuum in using transcriptome data to ask and answer biological questions. Assessment tools include: ACT CAAP Science Reasoning test, ACT Science Reasoning test, Genetics Concepts Assessment, Motivated Strategies for Learning Questionnaire, Washington State Critical Thinking Rubric, Lopatto’s CURE and RISC instruments, ill-structured pre/post problems, electronic journal entries, web click-throughs, and classroom observers. Integrating a large data set visualization tool leveraged students’ ability to frame questions at the level of the whole transcriptome, rather than focusing on a single gene. Pathway analysis revealed that successful students used iterative approaches, integrated multiple strategies, thought aloud, engaged each other, made metacognitive use of the electronic journal, developed focused questions and incorporated presentation feedback to improve papers.
Successful students were able to frame a tractable problem, understand context and assumptions, develop their own perspective, make a case with evidence, integrate other perspectives, form conclusions, and communicate effectively. Students fell into two groups along the novice-expert continuum. ‘Mimics’ make progress by finding an analogous situation in the literature and mimicking it. They ask relational questions including, “How does what we are interested in relate to something doable?” Mimics need help with experimental design. ‘Innovators’ can both design experiments and extend conclusions, making independent progress. Innovators use a metacognitive approach in which they identify what they don’t understand, find a specific way to probe, draw conclusions, and iterate. They frame specific, well-posed questions for the instructor based on their experience, asking for help appropriately. Supported by NSF (DUE-0837375 and DEB-0746571).

James Taylor (for the Galaxy Team) Emory University and http://galaxyproject.org

Galaxy: an accessible, collaborative environment for reproducible research

Genomic research has advanced quickly, driven by rapidly changing data production technologies and experimental techniques. Sophisticated informatics tools are required to make sense of large genomic datasets; however, analysis best practices and underlying analysis tools are also changing very rapidly, and tools are rarely provided in a form that is immediately usable without informatics expertise. This creates a significant barrier to entry for anyone wanting to analyze genomic data. To address this, we have developed the Galaxy framework, which makes it easy to integrate existing tools into an analysis workspace, giving them uniform user interfaces that can be used with nothing more than a web browser. Galaxy provides an analysis environment with automatic provenance, a workflow system for building complex analyses, a visualization framework, and pervasive sharing and publishing features to facilitate collaboration. Education and training have always been a key feature of the Galaxy project, and we have had great success with a variety of training media, including video tutorials. The Galaxy Pages system allows users to author documents that are directly integrated with the analysis environment. This provides a novel way to publish computationally intensive analysis results, and it has also been used successfully to develop self-guided educational exercises. Here we will discuss the goals and features of Galaxy in the context of education, how we have used it in undergraduate education, and plans for the future.

Matthew Vaughn The iPlant Collaborative & Texas Advanced Computing Center, University of Texas at Austin

The iPlant Collaborative: Bringing Together High Performance Computing and Biology

“Supercomputers, petabytes, virtualization, bioinformatics” are words that conjure both exciting possibilities and endless technical annoyances: command lines, special network protocols, mysterious access policies, and more. The iPlant Collaborative, a 5-year project funded by the National Science Foundation, abstracts out these complexities to deliver high performance computing to a varied community of users that includes bioinformaticians, computer scientists, laboratory investigators, and educators. iPlant offers multiple modes of access: a community-extensible rich web client (Discovery Environment), an on-demand virtualization cluster (Project Atmosphere), and rich REST-based programmer interfaces (Agave). Direct command-line login (XSEDE Direct Access Program) enables sophisticated bioinformatics analyses on some of the most powerful computing and storage systems in the world. The data and workflow sharing abilities offered by the iPlant Cyberinfrastructure make it ideal for fostering collaborative relationships between scientists and their students. New applications can be rapidly developed in the Discovery Environment and securely shared among a select group of collaborators. Experimental data can be retrieved from remote sources and housed in the iPlant Data Store where it, along with its derived data products, can be shared with that same group. Virtual machines available through Atmosphere offer an alternative way to share data and experiments “in the cloud.” Across the infrastructure, users at all levels of technical sophistication are presented with consistent data and interfaces. This has led to iPlant acting as the nexus for some very interesting educational experiments.
In addition to our tutorial workshops featuring DNA Subway and the Discovery Environment, iPlant has 1) hosted synthesis workshops that focus on collaborative development of tools to address specific science questions, 2) enabled semester-long bioinformatics classes that lived natively in our cyberinfrastructure, and 3) identified examples of students who had learned enough about iPlant to share new computational solutions with their mentors.

Spencer Wells National Geographic Society

The Genographic Project as a model for student scientific engagement

The Genographic Project, launched in 2005, is the largest research project ever undertaken by the National Geographic Society. Work is being carried out worldwide by a consortium of researchers, using the tools of molecular genetics to answer questions about human demographic history. Uniquely among large scientific projects, direct public engagement is central to the goals of the Genographic Project via our public participation kits. With more than 450,000 public participants to date, many thousands of whom are K-12 and university students, this component of the project provides a compelling way to engage nonscientists in the scientific process. I will discuss some examples of our successful educational programs in the broader context of ‘citizen science’ and public engagement.

Susan R. Wessler, James Burnette The Genomics Institute and the Neil A Campbell Science Learning Laboratory, University of California, Riverside

Transposing from the Research Laboratory to the Undergraduate Classroom

Transposable elements (TEs) comprise the largest fraction of virtually all characterized eukaryotic genomes, where they often make up over 50% of total sequence. During the past 20 years my lab has pioneered the development of computational tools and strategies to identify and analyze TEs in plant genomes. Of particular interest is the tiny fraction of the TE complement that is actively transposing and contributing to gene and genome evolution. Once these active TE candidates are identified computationally, the project moves to the wet lab, where transposition must then be validated in the plant host or in a heterologous assay system. The goal of my HHMI Professor award was to replicate my research laboratory as an undergraduate classroom, including its three basic components: (1) a computer lab, (2) a wet lab and (3) a focus on plant transposable elements. Now in their 4th year, the Dynamic Genome (DG) courses are taught to freshmen in the Neil A. Campbell Science Learning Laboratory at UC Riverside. Transposable elements have proven to be an ideal focus for an early laboratory experience. First, although they are incredibly abundant, they are largely ignored by practicing scientists. This means that there is a tremendous opportunity for discovery and ownership - especially with the increasing pace of genome sequencing. Second, TE biology is easily grasped by students as TEs are simple (one gene) and have a single purpose - to increase their copy number in host genomes. Finally, whether in the genome or excising from a GFP reporter gene, they provide wonderfully dynamic examples of evolution in action - as they accumulate mutations in their own sequence or insert into host genes. Over the past four years the DG courses have been extremely successful, and curricula and research projects have been devised that provide unexpected synergy between my teaching and research programs.
In addition, venues to broaden the impact of the activities beyond the classroom have been established. These include the implementation of a 2-semester class with a local high school and the dissemination of our software pipeline (TARGeT) as part of the DNA Subway by the NSF-funded iPlant consortium.

Participants

Mark Adams, PhD; Scientific Director and Professor; J. Craig Venter Institute; San Diego, CA
Charles “Chip” Aquadro, PhD; Professor; Cornell University; Ithaca, NY
Lois Banta, PhD; Associate Professor; Williams College; Williamstown, MA
Vivien Bonazzi, PhD; Program Director, Computational Biology and Bioinformatics; National Human Genome Research Institute/National Institutes of Health (NHGRI/NIH); Bethesda, MD
Vincent P. Buonaccorsi, PhD; Associate Professor; Juniata College; Huntingdon, PA
Steven G. Cresawn, PhD; Assistant Professor; James Madison University; Harrisonburg, VA
Elizabeth A. Dinsdale, PhD; Assistant Professor; San Diego State University; San Diego, CA
Sam Donovan, PhD; Research Associate Professor; University of Pittsburgh; Pittsburgh, PA
David Dooling, PhD; Assistant Director, The Genome Institute; Washington University in St. Louis; St. Louis, MO
Todd T. Eckdahl, PhD; Professor and Chair; Missouri Western State University; St. Joseph, MO
Sean Eddy, PhD; Group Leader, Janelia Farm Research Campus
Sarah C. R. Elgin, PhD; Professor; Washington University in St. Louis; St. Louis, MO
Alexander J. Hartemink, PhD; Associate Professor; Duke University; Durham, NC
Jose Herrara; Program Director; National Science Foundation; Arlington, VA
Ian Korf, PhD; Associate Professor; University of California at Davis
Robert Kuhn, PhD; Associate Director, UCSC Genome Browser; University of California at Santa Cruz
Carolyn J. Lawrence, PhD; Research Geneticist; USDA-ARS, University of Iowa; Ames, IA
Mary Lee S. Ledbetter; Program Director; National Science Foundation; Arlington, VA
Wilson Leung; Chief Technical/Teaching Assistant; Washington University in St. Louis; St. Louis, MO
Suzanna E. Lewis, PhD; University of California at Berkeley
Jennifer Mansfield, PhD; Assistant Professor; Barnard College; New York, NY
Juan Carlos Martinez-Cruzado, PhD; Professor; University of Puerto Rico–Mayaguez
Melissa McCartney, PhD; Editorial Fellow; Science Magazine; Washington, DC
David A. Micklos, PhD; Executive Director of the Dolan DNA Learning Center; Cold Spring Harbor Laboratory; Cold Spring Harbor, NY
William R. Pearson, PhD; Professor; University of Virginia; Charlottesville, VA
Mihaela Pertea, PhD; Assistant Professor; Johns Hopkins University; Baltimore, MD
Antonis Rokas, PhD; Assistant Professor; Vanderbilt University; Nashville, TN
Anne G. Rosenwald, PhD; Assistant Professor; Georgetown University
Daniel A. Russell; University of Pittsburgh; Pittsburgh, PA
Susan R. Singer, PhD; Professor; Carleton College; Northfield, MN
James Taylor, PhD; Assistant Professor; Emory University; Atlanta, GA
Matthew W. Vaughn, PhD; Manager, Life Sciences Computing Group; Research Associate, Computational Biology; iPLANT, Texas Advanced Computing Center
Spencer Wells, PhD; Geneticist; National Geographic; Washington, DC
Susan R. Wessler, PhD; Professor; University of California–Riverside

HHMI Staff

Jack E. Dixon, PhD; Vice President and Chief Scientific Officer
Sean Carroll, PhD; Vice President, Department of Science Education
David J. Asai, PhD; Director, Precollege and Undergraduate Science Education Program
Cheryl Bailey, PhD; Senior Program Officer
Cynthia Bauerle, PhD; Senior Program Officer
Andrew Quon; Program Officer
Patricia Soochan; Program Officer
Melvina Lewis; Program Assistant

Logistics

[Floor plans of the Howard Hughes Medical Institute: Grounds & Walking Path and 1st Floor; 2nd Floor; Lower Level & Parking Garage. Legible labels include the Front Entrance, Pilot Atrium, Waterfall Corridor, Central Space, Great Hall, Dining Room, Conference Center, Auditorium, Conference Rooms C122 and C123, D115 and D116, and D124 and D125, and the East and South Garage Entrances. The Small Auditorium on the 2nd floor is reached by the stairway in the Central Space (access the Central Space via the Waterfall Corridor). The Fitness Center is on the lower level (take the elevator or stairs located in the Conference Center).]