Main application
1. Face page (title, name, affiliation)
2. BC Application for promotion form
3. CV
4. Research statement
5. Teaching statement
6. Service statement

Appendix A. Research materials
1. Copies of publications
2. Copies of abstracts
3. Copy of cover image
4. Invited talk posters (Pers. Gen and CHI)
5. Book references
6. Article references
7. News items

Appendix B. Professional activities / materials
1. SAB memberships
2. Interviews
3. Keynote speaker flier
4. News items

Appendix C. Teaching materials
1. Research statement
2. BC Course syllabi
3. BC Course evaluations
4. Outside teaching descriptions
5. Outside teaching images (e.g. CSHL courses poster)

Appendix D. Service materials
1. Service statement
2. Service materials

Tenure application

Gabor T. Marth, D.Sc. Department of Biology Boston College

October 5, 2008

Boston College “Application for Promotion” Form

Curriculum Vitae

Gabor T. Marth D.Sc. – Curriculum vitae 10/04/2008

Gabor T. Marth, D.Sc.

I. PERSONAL DATA

Academic title Assistant Professor of Biology

Office address Department of Biology, Boston College Room 415 Higgins Hall, 140 Commonwealth Avenue, Chestnut Hill, MA 02467 Tel: 617.552-3571 (office) 617.552-0397 (lab) Fax: 617.552-2011 Email: [email protected] Web site: http://bioinformatics.bc.edu/marthlab

Home address 218 South St. Unit 8. Waltham, MA 02453 Tel: 781.894-3594

Birth November 12, 1964. Budapest, Hungary.

Education history 1994. D.Sc. (Doctor of Science) degree in Systems Science and Mathematics. Department of Systems Science and Mathematics, School of Engineering, Washington University, St. Louis, Missouri. 1987. B.S. and M.S. degrees in Electrical Engineering. Department of Control Engineering. Technical University of Budapest, Budapest, Hungary.

Professional appointments 2006 – present. Director, Bioinformatics Program. Department of Biology, Boston College, Chestnut Hill, MA. 2003 – present. Assistant professor. Department of Biology, Boston College, Chestnut Hill, MA.

2000 – 2003. Staff scientist. Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD. 1994 – 2000. Post-doctoral research associate, Genome Sequencing Center, Department of Genetics, Washington University School of Medicine, St. Louis, MO. 1992 – 1994. Co-manager. Center for Robotics and Automation, Department of Systems Science and Mathematics, Washington University, St. Louis, MO. 1987 – 1988. Doctoral fellowship. Hungarian Academy of Sciences, Budapest, Hungary. 1987. Internship. Institute for Heavy Ions Research, Darmstadt, Germany.

II. RESEARCH

Publications (as Assistant Professor at Boston College) 1. (***) Douglas R. Smith, Aaron R. Quinlan, Heather E. Peckham, Kathryn Makowsky, Wei Tao, Betty Woolf, Lei Shen, William F. Donahue, Nadeem Tusneem, Michael P. Stromberg, Donald A. Stewart, Lu Zheng, Swati S. Ranade, Jason B. Warner, Clarence C. Lee, Brittney E. Coleman, Zheng Zhang, Stephen F. McLaughlin, Joel A. Malek, Jon M. Sorenson, Alan P. Blanchard, Jarrod Chapman, David Hillman, Feng Chen, Daniel S. Rokhsar, Kevin J. McKernan, Thomas W. Jeffries, Gabor T. Marth, and Paul M. Richardson. RAPID WHOLE-GENOME MUTATIONAL PROFILING USING NEXT-GENERATION SEQUENCING TECHNOLOGIES. Genome Research. 2008;18:1638-42. Epub 2008 Sep 4. 2. (***) Huang W, Marth G. EAGLEVIEW: A GENOME ASSEMBLY VIEWER FOR NEXT-GENERATION SEQUENCING TECHNOLOGIES. Genome Research. 2008;18:1538-43. Epub 2008 Jun 11. 3. (***) Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylie T, Schedl T, Wilson RK, Mardis ER. WHOLE-GENOME SEQUENCING AND VARIANT DISCOVERY IN C. ELEGANS. Nature Methods. 2008;5:183-8. 4. (***) Quinlan AR, Stewart DA, Strömberg MP, Marth GT. PYROBAYES: AN IMPROVED BASE CALLER FOR SNP DISCOVERY IN PYROSEQUENCES. Nature Methods. 2008;5:179-81. 5. (***) Quinlan AR, Marth GT. PRIMER-SITE SNPS MASK MUTATIONS. Nature Methods. 2007;4:192. 6. (***) Indap AR, Marth GT, Struble CA, Tonellato P, Olivier M. ANALYSIS OF CONCORDANCE OF DIFFERENT HAPLOTYPE BLOCK PARTITIONING ALGORITHMS. BMC Bioinformatics. 2005;6:303. 7. (***) Marth GT, Czabarka E, Murvai J, Sherry ST. THE ALLELE FREQUENCY SPECTRUM IN GENOME-WIDE HUMAN VARIATION DATA REVEALS SIGNALS OF DIFFERENTIAL DEMOGRAPHIC HISTORY IN LARGE WORLD POPULATIONS. Genetics. 2004;166:351-372.

8. Marth GT, Schuler G, Yeh R, Davenport R, Agarwala R, Church D, et al. SEQUENCE VARIATIONS IN THE PUBLIC HUMAN GENOME DATA REFLECT A BOTTLENECKED POPULATION HISTORY. Proceedings of the National Academy of Sciences of the USA (PNAS) 2003;100:376-381. 9. Ghebranious N, Vaske D, Yu A, Zhao C, Marth GT, Weber JL. STRP SCREENING SETS FOR THE HUMAN GENOME AT 5 CM DENSITY. BMC Genomics 2003;4:1-10. 10. Weber JL, David D, Heil J, Fan Y, Zhao C, Marth GT. HUMAN DIALLELIC INSERTION/DELETION POLYMORPHISMS. American Journal of Human Genetics 2002;71:854-862. 11. Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth GT, et al. A MAP OF HUMAN GENOME SEQUENCE VARIATION CONTAINING 1.42 MILLION SINGLE NUCLEOTIDE POLYMORPHISMS. Nature 2001;409:928-933. 12. Marth GT, Yeh R, Minton M, Donaldson R, Li Q, Duan S, et al. SINGLE-NUCLEOTIDE POLYMORPHISMS IN THE PUBLIC DOMAIN: HOW USEFUL ARE THEY? Nature Genetics 2001;27:371-372. 13. Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z, Zakeri H, et al. A GENERAL APPROACH TO SINGLE-NUCLEOTIDE POLYMORPHISM DISCOVERY. Nature Genetics 1999;23:452-456. 14. Dear S, Durbin R, Hillier L, Marth GT, Thierry-Mieg J, Mott R. SEQUENCE ASSEMBLY WITH CAFTOOLS. Genome Research 1998;8:260-267. (***) indicates publication as Assistant Professor at Boston College

Book chapters 1. Marth GT. COMPUTATIONAL SNP DISCOVERY IN DNA SEQUENCE DATA. Methods Mol Biol (Humana Press, ed. P.Y. Kwok) 2003;212:85-110 2. Vieux E, Marth GT, Kwok PY. SNP DISCOVERY AND PCR-BASED ASSAY DESIGN: FROM IN SILICO DATA TO THE LABORATORY EXPERIMENT. In Bioinformatics for Geneticists. Wiley & Sons, ed. Barnes M & Gray I.C. 2003; 203-215

Talks at national and international conferences 1. (***) INFORMATICS CHALLENGES FOR SEQUENCING 1000S OF INDIVIDUAL GENOMES. Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the conference on Intelligent Systems for Molecular Biology, Toronto, ON, Canada. July 2008. 2. (***) DATA ANALYSIS METHODS FOR NEXT-GENERATION SEQUENCING TECHNOLOGIES. Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Epigenomics & Sequencing Meeting, Harvard University, Boston, MA. July 2008. Invited speaker. 3. (***) BASE QUALITY AND READ QUALITY IN 1000G DATA. Gabor T. Marth and Derek Barnett, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Laboratory 1000 Genomes Project Meeting, Cold Spring Harbor, NY. May 2008. 4. (***) NEXT-GENERATION SEQUENCING – THE INFORMATICS ANGLE. Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Advances in Genome Biology and Technology Meeting, Marco Island, FL. February 2008. Invited speaker. 5. (***) INFORMATICS FOR NEXT-GENERATION SEQUENCE ANALYSIS – SNP CALLING. Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Pacific Symposium on Biocomputing (special session on Computational Challenges of Next-Generation Sequencing Applications). January 2007. 6. (***) SOFTWARE TOOLS FOR POLYMORPHISM DISCOVERY IN NEXT-GENERATION SEQUENCER DATA. Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cambridge Healthtech Institute Conference on “Exploring Next-generation sequencing”, Providence, RI. October 2007. Invited speaker. 7. (***) POLYMORPHISM DISCOVERY IN NEXT-GENERATION SEQUENCER DATA. Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting, Cold Spring Harbor, NY. May 2007. 8. (***) SNP DISCOVERY IN WHOLE-GENOME LIGHT-SHOTGUN 454 PYROSEQUENCES. Gabor T. Marth and Aaron Quinlan, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Advances in Genome Biology and Technology Meeting, Marco Island, FL. February 2007. 9. (***) A COALESCENT BASED MARKER SELECTION TOOL FOR ASSOCIATION STUDIES. Eric Tsung and Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA (Presentation delivered by co-author, lab postdoc Dr. Eric Tsung). Presented at the American Society of Human Genetics meeting, Salt Lake City, UT. October 2005. 10. (***) A COALESCENT COMPUTATIONAL PLATFORM FOR ASSOCIATION DATA ANALYSIS. Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA.
Presented at the Genome studies and the HapMap Conference, Oxford, UK. March 2005. 11. FROM SNPS TO A HUMAN HAPLOTYPE MAP – DESIGN CONSIDERATIONS FOR A GENERAL RESOURCE. Gabor T. Marth, National Center for Biotechnology Information, NIH, Bethesda, MD. Presented at the Cold Spring Harbor Genome Sequencing Meeting, Cold Spring Harbor, NY. May 2003. 12. OPTIMAL MARKER SAMPLING STRATEGIES FOR A GENERAL HUMAN HAPLOTYPE MAP. Gabor T. Marth, National Center for Biotechnology Information, NIH, Bethesda, MD. Human Variation Meeting, Reykjavik, Iceland. October 2002. 13. MODELS OF GENOME-WIDE OBSERVED DISTRIBUTIONS OF SNPS PERMIT SYSTEMATIC DESCRIPTION OF HUMAN HAPLOTYPE STRUCTURE. Gabor T. Marth, National Center for Biotechnology Information, NIH, Bethesda, MD. Presented at the Cold Spring Harbor Genome Sequencing Meeting, Cold Spring Harbor, NY. May 2002.

14. UNIFIED ANALYSIS OF SNP DENSITY AND ALLELE FREQUENCY SPECTRA IN GENOME-WIDE VARIATION DATA. Gabor T. Marth, National Center for Biotechnology Information, NIH, Bethesda, MD. Presented at the Cold Spring Harbor Genetic Variation Meeting, Cold Spring Harbor, NY. September 2001. 15. INFERENCES ON POPULATION HISTORY AND LINKAGE DISEQUILIBRIUM FROM GENOME-WIDE DNA VARIATION DATA. Gabor T. Marth, National Center for Biotechnology Information, NIH, Bethesda, MD. Human Variation Meeting, Stockholm, Sweden. October 2001. Invited speaker. 16. SNPS IN OVERLAPPING GENOME SEQUENCE: INFERENCES ON THE GENOMIC STRUCTURE OF VARIATION AND POPULATION HISTORY. Gabor T. Marth, National Center for Biotechnology Information, NIH, Bethesda, MD. Presented at the Cold Spring Harbor Genome Sequencing Meeting, Cold Spring Harbor, NY. May 2001. 17. TOWARDS A POLYMORPHIC MARKER MAP OF THE HUMAN GENOME: SNP DISCOVERY IN OVERLAPS OF LARGE-INSERT GENOMIC CLONES. Gabor T. Marth, National Center for Biotechnology Information, NIH, Bethesda, MD. Presented at the Cold Spring Harbor Genome Sequencing Meeting, Cold Spring Harbor, NY. May 2000. 18. POLYBAYES: A GENERAL APPROACH TO SNP DISCOVERY. Gabor T. Marth, Genome Sequencing Center, Washington University School of Medicine, St. Louis, MO. Presented at the Cold Spring Harbor Genome Sequencing Meeting, Cold Spring Harbor, NY. May 1999. (***) indicates talk as Assistant Professor at Boston College

Invitations to speak at future conferences 1. Invited speaker at the Cold Spring Harbor Laboratory Conference on Personal Genomes, Cold Spring Harbor, NY. To be held in October 2008. 2. Invited speaker at the ICREA Conference on Next-Generation Sequencing, Barcelona, Spain. To be held in October 2009.

Invitations to deliver conference keynote presentations 1. Invited conference opener keynote speaker at the Cambridge Healthtech Institute Conference on “Exploring Next-generation sequencing”, Providence, RI. To be held in September 2008.

Conference poster presentations 1. (***) INFORMATICS TOOLS FOR HUMAN GENOME RESEQUENCING. Gabor T. Marth, Michael Stromberg, Chip Stewart, Weichun Huang, Aaron Quinlan. Department of Biology, Boston College, Chestnut Hill, MA 02467. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2008. 2. (***) MOSAIK: A REFERENCE-GUIDED ASSEMBLER FOR NEXT-GENERATION SEQUENCING PLATFORMS. Michael P. Stromberg and Gabor T. Marth. Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2008. 3. (***) RARE ALLELE DISCOVERY AND FREQUENCY ESTIMATION USING CURRENT SEQUENCING TECHNOLOGIES. Aaron Quinlan and Gabor Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2008. 4. (***) SPANNER: A TOOL FOR STRUCTURAL VARIATION DISCOVERY FROM PAIRED-END READS. Donald Stewart and Gabor T. Marth. Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2008. 5. (***) A BENCHMARKING PLATFORM FOR NEXT-GENERATION SEQUENCE READ ASSEMBLY AND POLYMORPHISM DISCOVERY TOOLS. Weichun Huang and Gabor T. Marth. Department of Biology, Boston College, Chestnut Hill, MA, USA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2008. 6. (***) AN INFORMATICS PIPELINE FOR MIRNA PROFILING USING NEXT-GENERATION SEQUENCING PLATFORMS. Michele Busby (1), Michael Stromberg (1), Martin Kurtev (2), Jesse Gray (2), Michael Greenberg (2), Gabor Marth (1). (1) Department of Biology, Boston College, Chestnut Hill, MA, USA; (2) Neurobiology Program, Children's Hospital, and Department of Neurobiology, Harvard Medical School, Boston, MA 02115, USA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2008. 7. (***) WHOLE-GENOME, COMPLETE MUTATIONAL PROFILING USING NEXT-GENERATION SHORT-READ SEQUENCING. Derek Barnett (1), Chip Stewart (1), Marc-Jan Gubbels (1), Paul Fox (2), Tim Schedl (2). (1) Biology Department, Boston College, Chestnut Hill, MA 02467. (2) Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2008. 8.
(***) SNP-SNIFFER: A SNP DISCOVERY PROGRAM FOR 454 SEQUENCES. Weichun Huang and Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2007. 9. (***) A TOOL FOR INTEGRATED GENOME STRUCTURAL ANALYSIS. Donald Stewart and Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2007. 10. (***) QUANTIFICATION AND MODELING OF ASCERTAINMENT BIASES IN NON-PAIR-WISE SNP DISCOVERY SCHEMES. Eric Tsung and Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2007. 11. (***) MOSAIK: A REFERENCE GENOME GUIDED ASSEMBLY PROGRAM FOR NEXT-GENERATION SEQUENCER DATA. Michael Stromberg and Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2007. 12. (***) WHOLE-GENOME POLYMORPHISM DISCOVERY WITH LIGHT-SHOTGUN 454 PYROSEQUENCES. Aaron Quinlan and Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2007. 13. (***) SOFTWARE TOOLS FOR ASSESSING GENETIC SEQUENCE VARIATIONS IN NEW SUPER-HIGH THROUGHPUT SEQUENCING MACHINE DATA. Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Pacific Symposium on Biocomputing. Hawaii, HI. January 2007. 14. (***) POPULATION GENETIC METHODS TO ASSESS NATURAL SAMPLE TO SAMPLE VARIANCE OF MARKER ALLELE FREQUENCIES FOR GENETIC ASSOCIATION TESTING. Eric Tsung and Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2006. 15. (***) AUTOMATIC SOMATIC MUTATION DETECTION. Michael Stromberg and Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2006. 16. (***) BAYESIAN POLYMORPHISM DETECTION IN DIPLOID DNA SEQUENCES. Aaron Quinlan and Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting. Cold Spring Harbor, NY. May 2006. 17. (***) POLYMORPHISM DETECTION IN DIPLOID RE-SEQUENCING DATA. Aaron Quinlan and Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Pacific Symposium on Biocomputing. Wailea, HI. January 2006. 18. (***) ACCURATE POLYMORPHISM DETECTION IN DNA RESEQUENCING DATA. Aaron Quinlan and Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the American Society of Human Genetics meeting, Salt Lake City, UT. October 2005. 19.
(***) A COMPUTATIONAL TOOL FOR HAPMAP-BASED MARKER PRIORITIZATION FOR CLINICAL ASSOCIATION STUDIES. Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Cold Spring Harbor Biology of Genomes Meeting, Cold Spring Harbor, NY. May 2005. 20. (***) HAPLOTYPE MAP SNP SELECTION STRATEGY AND ITS CONSEQUENCES FOR VARIATION PROPERTIES. Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the Pacific Symposium on Biocomputing. Kohala, HI. January 2004. 21. (***) DOUBLE-HIT SNP SELECTION STRATEGY – EFFECT FOR ALLELE FREQUENCY AND VARIATION PROPERTIES. Gabor T. Marth, Department of Biology, Boston College, Chestnut Hill, MA. Presented at the International Human Variation Meeting, Chantilly, VA. November 2003. 22. INFERENCES ON HUMAN HAPLOTYPE STRUCTURE BASED ON NEW POPULATION-GENETIC MODELS OF SNP DENSITY AND ALLELE FREQUENCY. Gabor T. Marth and Stephen Sherry, National Center for Biotechnology Information, NIH, Bethesda, MD. Presented at the Pacific Symposium on Biocomputing. Kauai, HI. January 2002. 23. THE STRUCTURE OF SNPS IN OVERLAPPING REGIONS OF HUMAN GENOME SEQUENCE. Gabor T. Marth and Stephen Sherry, National Center for Biotechnology Information, NIH, Bethesda, MD. Presented at the Pacific Symposium on Biocomputing. Honolulu, HI. January 2001. 24. CONSTRUCTION OF A DENSE POLYMORPHIC MARKER MAP OF THE HUMAN GENOME: SINGLE-NUCLEOTIDE POLYMORPHISM DISCOVERY IN OVERLAPPING GENOMIC CLONE SEQUENCES. Gabor Marth (1), Ruth Davenport (2), Raymond Yeh (1), and Pui-Yan Kwok (2). (1) Genome Sequencing Center, Dept. of Genetics, and (2) Dept. of Dermatology, Washington University, St. Louis, MO. Presented at the Pacific Symposium on Biocomputing. Honolulu, HI. January 2000. (***) indicates presentation as Assistant Professor at Boston College

Seminar talks (not at conferences) 1. (***) THE INFORMATICS OF POLYMORPHISM DISCOVERY IN NEXT-GENERATION RE-SEQUENCING DATA. Presented at the McKusick-Nathans Institute of Genetic Medicine, School of Medicine, Baltimore, MD. November 2007. 2. (***) POLYMORPHISM DISCOVERY IN NEXT-GENERATION RE-SEQUENCING DATA. Illumina Expert Panel on Next-Generation Sequencing. Washington, DC. November 2007. 3. (***) SOFTWARE FOR NEXT-GENERATION SEQUENCER DATA. Presented at Helicos BioSciences. Cambridge, MA. August 2007. 4. (***) SOFTWARE FOR NEXT-GENERATION SEQUENCER DATA. Presented at the Broad Institute of MIT and Harvard University, Cambridge, MA. June 2007. 5. (***) THE INFORMATICS OF GENETIC SEQUENCE VARIATIONS. Presented at , Medford, MA. October 2006. 6. (***) SOFTWARE TOOLS FOR GENETIC SEQUENCE VARIATIONS. Presented at the Rensselaer Polytechnic Institute, Troy, NY. October 2006. 7. (***) SOFTWARE TOOLS FOR NEW, SUPER-HIGH THROUGHPUT DNA SEQUENCERS. Presented at the Washington University School of Medicine, St. Louis, MO. August 2006. 8. (***) SNPS AND HAPLOTYPES – THE INFORMATICS OF GENETIC VARIATIONS. Presented at the Debrecen University School of Medicine, Hungary. May 2006. 9. (***) SOFTWARE TOOLS FOR MEDICALLY IMPORTANT SEQUENCE VARIATIONS. Presented at Pfizer Global Research, Groton, CT. April 2006. 10. (***) INFORMATICS FOR GENETIC SEQUENCE VARIATIONS. Presented at UC Davis, Davis, CA. June 2006.

11. (***) THE INFORMATICS OF SNPS AND HAPLOTYPES. Presented at the Boston College Physics Department Seminar Series. Newton, MA. November 2005. 12. (***) A COALESCENT COMPUTATIONAL PLATFORM TO PREDICT STRENGTH OF ASSOCIATION FOR CLINICAL SAMPLES. Presented at the MIT Bioinformatics Seminar Series, Cambridge, MA. April 2005. 13. (***) THE UTILITY OF THE HAPMAP REFERENCE SAMPLES FOR CLINICAL POPULATIONS. Presented at the Vancouver Bioinformatics Seminar Series, University of British Columbia, Vancouver, BC, Canada. September 2004. 14. (***) GENOME VARIATION INFORMATICS: SNP DISCOVERY, DEMOGRAPHIC INFERENCE AND HUMAN HAPLOTYPE STRUCTURE. Presented at the MIT Bioinformatics Seminar Series, Cambridge, MA. November 2003. 15. (***) GENOME VARIATION INFORMATICS: SNP DISCOVERY, DEMOGRAPHIC INFERENCE AND HUMAN HAPLOTYPE STRUCTURE. Presented at the Children's Hospital Informatics Group, Boston, MA. October 2003. 16. THE STRUCTURE OF HUMAN VARIATIONS IN OVERLAPPING GENOME SEQUENCE. Presented at the Uniformed Services University for the Health Sciences, Bethesda, MD. February 2001. 17. DNA POLYMORPHISM DATA IN OVERLAPPING GENOME SEQUENCE. Presented at the Marshfield Medical and Research Foundation, Marshfield, WI. February 2001.

Invitations for future seminar talks 1. Illumina Expert Panel Meeting on Next-Generation Sequencing. Boston, MA. To be held in October 2008. 2. "Next-Generation Sequencing Technologies" seminar series, University of Michigan, Ann Arbor, MI. To be held in October 2008.

Computer software packages created 1. EAGLEVIEW – a graphical viewer program for the visualization of assembled next-generation sequences (the software was developed for the studies in Huang and Marth, Genome Research 2008). Web: http://bioinformatics.bc.edu/marthlab/EagleView 2. SPANNER – a computer program for the discovery of structural polymorphisms (larger chromosomal deletions and insertions, copy number variations, and translocations) from next-generation sequencing read data (software in beta stage). 3. GIGABAYES – a computer program for the discovery of single-nucleotide polymorphisms (SNPs) and short insertion-deletion polymorphisms (INDELs) in next-generation sequencer data (the software was developed for the studies in Hillier et al. Nature Methods 2008). Web: http://bioinformatics.bc.edu/marthlab/PbShort 4. MOSAIK – a reference sequence guided DNA read mapper / aligner / assembler program for next-generation sequencer reads (the software was developed for the studies in Hillier et al. Nature Methods 2008). Web: http://bioinformatics.bc.edu/marthlab/Mosaik 5. PYROBAYES – a base caller program for Roche / 454 pyrosequencing DNA reads (the software was developed for the studies in Quinlan et al. Nature Methods 2008). Web: http://bioinformatics.bc.edu/marthlab/PyroBayes 6. SPECTRA – a suite of computer programs for calculating the Allele Frequency Spectrum and Marker Density Distribution in large single-nucleotide polymorphism (SNP) collections (the software was developed for the studies in Marth et al. PNAS 2004). 7. POLYBAYES – a computer program for single-nucleotide polymorphism (SNP) discovery, implementing a suite of algorithms for multiple sequence alignment, paralogous sequence fragment filtering, and a Bayesian-statistical SNP detection algorithm (the software was developed for the studies in Marth et al. Nature Genetics 1999). Web: http://genome.wustl.edu/tools/software/polybayes.cgi Distribution: http://www.ibridgenetwork.org/wustl/polybayes_software 8. FINISH – a computer program to aid the manual completion (finishing) of large-insert genomic DNA clones in shotgun genome sequencing projects.

Grant funding (including extra- and intra-mural)
1. Title: Software tools for next-generation sequencer data
   Agency: National Institutes of Health / National Human Genome Research Institute
   Research grant type: R01
   Role: Principal Investigator
   Funding dates: 09/2008 - 08/2012
   Direct cost: $1,567,200 / 4 yrs
   Total cost: $2,270,736 / 4 yrs

2. Title: A general Bayesian polymorphism discovery tool
   Agency: National Institutes of Health / National Human Genome Research Institute
   Research grant type: R01
   Role: Principal Investigator
   Funding dates: 09/2005 - 07/2010
   Direct cost: $1,175,000 / 5 yrs
   Total cost: $1,671,775 / 5 yrs

3. Title: Mutational profiling by high-throughput sequencing
   Agency: Boston College
   Research grant type: Research Incentive Grant (RIG)
   Role: Principal Investigator (with co-PI Marc-Jan Gubbels)
   Funding dates: 06/2008 - 05/2009
   Direct cost: $15,000 / 1 yr
   Total cost: $15,000 / 1 yr

Laboratory staffing (past and present)

Name                  Dates                Role                 Current post
Damien Chonka         09/2004 - 06/2006    undergraduate res.   U. Chicago
Aykut Unsal           09/2004 - 06/2006    undergraduate res.   UMDNJ
Anthony Nguyen        06/2006 - 06/2007    undergraduate res.   N.A.
Dr. Eric Tsung        02/2004 - 03/2008    postdoc              Appl. BioSys.
Aaron Quinlan         03/2005 - 08/2008    Ph.D. student        U. Virginia
Michael Stromberg     03/2006 - present    Ph.D. student        N.A.
Dr. Donald Stewart    09/2006 - present    postdoc              N.A.
Dr. Weichun Huang     10/2006 - present    postdoc              N.A.
Michele Busby         01/2007 - present    Ph.D. student        N.A.
Derek Barnett         03/2007 - present    Ph.D. student        N.A.

Scientific project contributions 2007 – present. International 1000 Genomes Project. Member, Project Data Analysis Group. Member, Data Release Group. Chair, Data Simulation Subgroup. My laboratory performs fundamental data analysis for the Project: data quality assessment, read mapping, polymorphism discovery, data release coordination, and special analyses to guide study design. 2007 – present. Next-Generation Sequencing Data Assembly Format International Working Group (http://assembly.bc.edu). Organizer and Host. My laboratory organizes and runs the working group, and hosts its teleconferences and web site. 2002 – 2005. International HapMap Project. Data analysis and project advisory role. I participated in advisory meetings and contributed specific analyses to help guide overall study design for the Project.

2001. The SNP Consortium (TSC). Discovery of 150,000 single-nucleotide polymorphisms contributed to the Consortium. As a representative of the National Center for Biotechnology Information (NCBI) at the NIH, I contributed ~150 thousand single-nucleotide polymorphisms (SNPs) discovered in our studies to the construction of the first dense polymorphic marker map of the human genome. 1995 – 2000. Human Genome Project. Informatics tool development and genome data analysis at the Genome Sequencing Center, Department of Genetics, Washington University School of Medicine. I developed a number of computer tools primarily aimed at automating sequence completion (finishing) of large-insert DNA clone sequences in shotgun sequencing projects.

III. PROFESSIONAL ACTIVITIES

Membership on grant review panels 2007. Ad hoc. National Human Genome Research Institute, National Institutes of Health. Development and Application of New Technologies to Targeted Genome-wide Resequencing in Well-Phenotyped Populations special emphasis panel. Washington, DC. 2007. Ad hoc. National Institute of Mental Health, National Institutes of Health. Methods of Statistical Analysis of DNA Sequence Data special emphasis panel. Bethesda, MD. 2003. Ad hoc. National Cancer Institute and the National Human Genome Research Institute, National Institutes of Health. Mammalian Gene Collection (MGC) Project special emphasis panel. Frederick, MD. 2004. Ad hoc. Center for Scientific Review, National Institutes of Health. Genome Study Section. Bethesda, MD.

Membership on scientific advisory boards 2008 – present. Member, Scientific Advisory Committee, Ontario Institute for Cancer Research (International Cancer Genome Consortium), Toronto, ON, Canada. 2008 – present. Member, Scientific Advisory Board, EdgeServe Biosciences, Gaithersburg, MD. 2001 – present. Member, Scientific Advisory Board, Marshfield Medical and Research Foundation Personalized Medicine Program, Marshfield, WI. 2003 – present. Member, Scientific and Technical Advisory Board, Omicia, Inc. Oakland, CA

Participation in advisory meetings 2007. National Human Genome Research Institute, National Institutes of Health & Wellcome Trust, UK – 1000 Genomes Project planning meeting, Cambridge, UK. Role: to advise agencies funding the 1000 Genomes Project on informatics aspects of its master study design. 2007. National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health. Advisory meeting on Next-Generation DNA Sequencer Data Formats, Rockville, MD. Role: to advise NCBI on DNA sequencing data formats for storing next-generation sequencer data. 2005. National Cancer Institute – National Human Genome Research Institute, National Institutes of Health. Human Cancer Genome Project – planning advisory meeting, Washington, DC. Role: to advise the two NIH funding agencies on the informatics tools and infrastructure needed for the Human Cancer Genome Project. 2002. National Human Genome Research Institute (NHGRI), National Institutes of Health. The International HapMap Project advisory meeting, Washington, DC. Role: to advise the funding agency (NHGRI) on theoretical population genetic aspects of the HapMap Project study design.

Consultancy 2002 – present. Omicia Inc., Oakland, CA. Role: help with the design of genomic data mining systems; advise management on informatics systems product development; general scientific counseling on medical genomic applications. 2008 – present. Monsanto, Inc. St. Louis, MO. Role: advise informatics research / development team on the use of next-generation sequencing technologies; help with informatics tool selection and genomic data analysis pipeline development.

Journal reviewer assignments / associate editor duty 2005. Associate editor (ad hoc): PLoS Computational Biology. 2000 – present. Journal reviewer: Nature, Nature Genetics, Nature Methods, Bioinformatics, Biotechniques, Genetics, Genome Research, Genomics, Human Molecular Genetics, Human Mutation, Nucleic Acids Research, Proceedings of the National Academy of Sciences (PNAS), Trends in Genetics.

Conference organization 2009. Conference session organizer: “Computational Challenges in Next-Generation Sequence Analysis”. (Co-organizer with Francisco de la Vega). Pacific Symposium on Biocomputing, Kohala, HI. January 2009. 2008. Conference session organizer: “Sequencing thousands of human genomes - perspectives, challenges, and analysis”. (Co-organizer with Francisco de la Vega). Conference on Intelligent Systems for Molecular Biology, Toronto, ON, Canada. July 2008.

Conference session chair duty 2008. Informatics session. Epigenomics & Sequencing Meeting, Harvard University, Boston, MA. July 2008

2007. Bioinformatics tools session. Cambridge Healthtech Institute Conference on “Exploring Next-generation sequencing”, Providence, RI. October 2007.

IV. TEACHING ACTIVITIES

Courses taught at Boston College 2007 Fall. BI616 Graduate Bioinformatics. Graduate core course. Enrollment: 13. Credit hours: 2. Course requirements: 5 homeworks, in-class activity, term project, final presentation. 2007 Fall. BI420 Introduction to Bioinformatics. Undergraduate course. Enrollment: 24. Credit hours: 3. Course requirements: 6 homeworks, 2 midterm examinations, in-class activity, final presentation and presentation materials. 2006 Fall. BI616 Graduate Bioinformatics. Graduate core course. Enrollment: 8. Credit hours: 2. Course requirements: 5 homeworks, in-class activity, term project, final presentation. 2006 Fall. BI420 Introduction to Bioinformatics. Undergraduate course. Enrollment: 26. Credit hours: 3. Course requirements: 6 homeworks, 2 midterm examinations, in-class activity, final presentation and presentation materials. 2005 Fall. BI616 Graduate Bioinformatics. Graduate core course. Enrollment: 8. Credit hours: 2. Course requirements: 5 homeworks, in-class activity, term project, final presentation. 2005 Fall. BI420 Introduction to Bioinformatics (co-taught with Prof. Peter Clote). Undergraduate course. Enrollment: 26. Credit hours: 3. Course requirements: 6 homeworks, 2 midterm examinations, in-class activity, final presentation and presentation materials. 2005 Spring. BI820 Quantitative and Computational Problems in Genomics. Graduate seminar course. Enrollment: 5. Credit hours: 2. Course requirements: 1 major and 2 minor student presentations, in-class activity. 2004 Fall. BI420 Introduction to Bioinformatics (co-taught with Prof. Peter Clote). Undergraduate course. Enrollment: 19. Credit hours: 3. Course requirements: 6 homeworks, 2 midterm examinations, in-class activity, final presentation and presentation materials. 2004 Spring. BI820 Quantitative and Computational Problems in Genomics. Graduate seminar course. Enrollment: 18. Credit hours: 2. Course requirements: 1 major and 2 minor student presentations, in-class activity.

Curricular / teaching infrastructure development at Boston College 2003 – present. Undergraduate Bioinformatics minor curriculum development. Together with Prof. Peter Clote, I led the development of the curriculum for the new Bioinformatics minor (replacing the earlier Bioinformatics concentration). 2003 – present. Graduate Bioinformatics curriculum. Together with Prof. Jeffrey Chuang, I led the development of the Graduate Bioinformatics curriculum.


2004. Design of computer laptop teaching laboratory (Higgins 425). I initiated the conversion of Higgins 425 into a computer laptop teaching laboratory. Hardware includes 26 IBM/PC laptops and a dedicated UNIX computer server connected to the laptops via the BC network. This lab now serves both our undergraduate and graduate Bioinformatics introductory courses, as well as other Bioinformatics electives.

Guest teaching at Boston College 2004 Guest lecturer in the Undergraduate Genetics Course (two sessions, instructor Prof. Clare O’Connor) 2006 Guest lecturer in the Graduate Evolution Course (instructor Prof. Jeffrey Chuang)

Undergraduate students mentored 2004 – 2006. Damien Croteau-Chonka 2004 – 2006. Aykut Unsal 2006 – 2007. Anthony Nguyen

Graduate students mentored 2005 – 2008. Aaron Quinlan (Ph.D. received, 2008) 2006 – present. Michael Stromberg (Ph.D. in progress) 2007 – present. Michele Busby (Ph.D. in progress) 2007 – present. Derek Barnett (Ph.D. in progress)

Postdoctoral researchers mentored 2008. Dr. Aaron Quinlan 2006 – present. Dr. Donald Stewart 2006 – present. Dr. Weichun Huang

Membership on graduate comprehensive exam committees 2008. Tristan Lubinski 2008. Deb Ritter 2008. Ozge Ceyhan 2007. Laura Shelton 2007. Keith Eidell 2006. Maria Gumina 2005. Karie Heinecke


Membership on graduate thesis committees 2007 – present. Keith Eidell (Ph.D.) 2007 – 2008. Aliz Axmann (MS) 2007 – 2008. Howard Chen (Ph.D.) 2006 – present. Matthew Hoffman (Ph.D.) 2006 – present. Michael Stromberg (Ph.D.) 2006. Mark Audeh (MS) 2005 – present. Nevine Shalaby (Ph.D.) 2005 – 2008. Aaron Quinlan (Ph.D.) 2004. Melanie Grandy (MS)

Extramural teaching activities
2008. Instructor, Harvard University Department of Genetics nano-course on next-generation DNA sequencing. Role: Instructor. This is a course of two half-day sessions that introduces new sequencing technologies to the Genetics students at the Harvard Medical School. Co-taught with Drs. Chad Nusbaum and Mark Daly at the Broad Institute of MIT and Harvard.
2007 – present. Instructor, Cold Spring Harbor Laboratory Course on Revolutionary Sequencing Technologies. Role: Instructor / Bioinformatics. This is a two-week, intensive sequencing course at the pre-eminent Genetics teaching center in the US. I deliver a number of lectures and, together with members of my laboratory acting as teaching assistants, lead a number of data analysis sessions. My laboratory is also responsible for the Bioinformatics software setup for the course.
2006. Instructor, Medical Genomics Course, Debrecen University of Medicine, Debrecen, Hungary. Role: Instructor. This is a semester-long course teaching Medical Genetics to medical students. I delivered six lectures over a one-week period.
1999 – present. Instructor, Cold Spring Harbor Laboratory Genome Informatics course. Role: Instructor. This is a two-week, intensive Bioinformatics programming course for medical and biological researchers, postdocs, and graduate students. Each year I deliver a two-hour lecture and a two-hour practical data analysis laboratory session.
2001. Lecturer, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health. Graduate course on Bioinformatics. Role: Guest lecturer. I delivered one two-hour lecture to NIH researchers on Bioinformatics.
2000 – present. Instructor, Canadian Genetic Diseases Network Genome Informatics course. Role: Instructor. This is a 3-7 day intensive course in which we teach informatics and computer analysis techniques to academic and commercial researchers. I deliver a two-hour lecture and a two-hour practical data analysis lab session.


1992 – 1995. Instructor and Teaching Assistant, Engineering and Mathematics courses at the Washington University School of Engineering. Role: Instructor and Teaching Assistant. As a graduate student, I served as course instructor for two classes (Numerical Methods and Process Control Laboratory), and as TA for a number of courses in the Department of Systems Science and Mathematics (e.g. Signal Processing, Digital Signal Processing).

V. DEPARTMENTAL AND UNIVERSITY SERVICE Departmental committee / faculty search committee memberships 2007. Bioinformatics Faculty Search Committee 2005. Bioinformatics Faculty Search Committee 2004. Co-chair, Biology Research Retreat Organizing Committee, Inaugural Biology Research Retreat on October 2, 2004. 2003 – 2007. Graduate Program Committee 2003 – 2007. Graduate Admissions Subcommittee of the Graduate Program Committee 2003 – present. Bioinformatics Ph.D. Curriculum Development Committee 2003 – present. BC Biology Computing Committee

Other departmental service 2008. Organizer and faculty host, 2008 Landmark Seminar in Biology, Boston College (3-day seminar series, speaker Prof. Aravinda Chakravarti, Director, Center for Complex Disease Genomics, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD) 2006 – present. Director, Bioinformatics Program, Boston College Biology Department 2006 – present. Manager of Bioinformatics UNIX System Administrator 2006 – 2008. Manager of Bioinformatics Programmer

University committee memberships 2008. Search Committee, UNIX System Administrator for Natural Sciences. 2008 – present. Chair, Data Resources Policy Committee, Academic Technology Advisory Board 2006 – present. Beckman Scholar mentor 2006. Member, Biology department chair selection faculty polling committee (at behest of the deans of A&S). 2005 – 2007. Member, Boston College High-Powered Computing Advisory Group 2004. Member, McCarthy Prize Selection Committee


University-wide advisory meetings 2008. Biology Department representative at the Boston College Alumni Investor Meeting on BC intellectual property and technology commercialization 2005. Biology Department representative at the External Review Committee Meeting on Boston College Galvanizing Initiatives in the Natural Sciences

Gabor T. Marth – Research statement 08/30/2008

Research statement

I am an engineer by training: I hold an undergraduate degree in Electrical Engineering from the Technical University of Budapest, and the Doctor of Science degree in Systems Engineering from Washington University in St. Louis. As a bioinformatician, I have been at my most productive developing software tools that automate data processing and data analysis, and ultimately enable and enrich biological research. In my first position as a bioinformatician, at the Washington University genome center, I developed computer software that aided the sequencing of the C. elegans and the human genomes. Building on my knowledge of genetic data analysis and my sequencing software skills, I then developed a statistically rigorous Bayesian framework, POLYBAYES, for the detection of single-nucleotide polymorphisms (SNPs), the most common form of human genetic variation. This framework was used in a number of important SNP discovery projects, and has been cited as a textbook example of the successful application of Bayesian mathematics in the field of Bioinformatics. The POLYBAYES work is my single most important early career achievement. After assuming my post as a staff scientist at the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH), I led a number of genomic data discovery projects. Most importantly, we used the overlapping parts of the DNA sequences of the large-insert bacterial artificial chromosome (BAC) clones, sequenced by the Human Genome Project, for the discovery of hundreds of thousands of human genetic variations. These genetic variations played a major part in the construction of the first high-density polymorphism map of the human genome assembled by The SNP Consortium, an international group of laboratories organized for this purpose.
My colleagues and I used this large, genome-wide human SNP collection for population genetic analyses, specifically to quantify the demographic history of effective population size in large world populations. These studies were among the first to show in nuclear SNP variation data that both European and Asian populations went through a large genetic bottleneck in the recent evolutionary past, an event not seen in African polymorphism data. Various mathematical population-genetic modeling studies that I conducted at the NCBI and in the first year of my appointment as a junior faculty member at Boston College have contributed to the design of the International HapMap Project, an ambitious undertaking aimed at describing human polymorphism structure at a fine resolution. Over the past few years several new, high-throughput DNA sequencing technologies have matured to the point where they produce orders of magnitude more bases per run than capillary sequencers (the technology that was used to decode the first human genome). The vast amount of data produced by the new machines will enable, in principle, cheap sequencing of an arbitrary number of individual human genomes, as well as the full-genome mutational analysis of biologically important model organisms and genetically inaccessible pathogens. The experience and skills in genomics, sequencing software development, and data mining that I acquired while working in the Human Genome Project made it a natural choice to focus my laboratory’s resources on the essential informatics problems associated with the emerging DNA sequencing technologies. We developed a complete suite of computer tools for the analysis of next-generation sequencing machine data, including software for (1) base calling (i.e. converting the raw signals produced by the sequencing machines into actual DNA base sequence); (2) read mapping (i.e. 
finding the true location of the random DNA fragments sequenced by the machines within the human genome, much like placing pieces of a jigsaw puzzle relative to the picture on the box); (3) genetic variation discovery (i.e. detecting sites in the human genome where the genetic code varies across different individuals); and (4) sequence data visualization (i.e. building graphical viewer programs where scientists can see the primary sequence data, the locations of genetic variations, and the genome annotations that provide context in which to evaluate the biological significance of those genetic variations). I was able to secure two sizeable R01 research grants from the National Institutes of Health (NIH) to support my laboratory’s work. My group published a number of our tools in respectable journals and several more manuscripts are in preparation. We also published the results of groundbreaking data discovery

projects that we carried out in collaboration with experimental researchers at major genome centers. We are currently participating in large collaborative data mining projects, most notably in the analysis of individual human resequencing data from the 1000 Genomes Project (1000G) funded by the NIH and the Wellcome Trust (the major funding source for genomic research in the United Kingdom).

1. Next-generation sequencing

As of today, four different next-generation sequencing platforms are commercially available: the 454 platform from Roche, producing medium-length reads; and the Solexa sequencer from Illumina, the SOLiD system from Applied Biosystems, and the Heliscope system from Helicos, each producing short-read sequences. Other technologies are expected soon (e.g. from Pacific Biosciences). Because of its vast throughput and lower cost, next-generation sequencing has become the de facto method for whole-genome organismal and human resequencing aimed at finding SNP variation. In addition, these new technologies have been applied to structural variation discovery; studying protein-DNA interactions and DNA methylation; small RNA sequencing; and gene expression profiling. It is safe to say that these sequencing technologies will permeate many other areas of high-throughput biology in the near future. Although the vast throughput of next-generation sequencers is a boon to the genomics community and enables exciting new applications, informatics analysis poses serious analytical challenges not met by existing software; some apply to all technologies, while others are instrument-specific.

1.1. Major informatics challenges. The tremendous data volumes produced by the new sequencers place high demands on data storage, data manipulation, and computational processing. Processing the billions of reads required by individual human resequencing necessitates efficient new data formats and high-performance read mapping and variant calling algorithms. Accurate mapping of short sequencing reads is confounded by the repetitive sequences that are abundantly present in complex genomes. Distinguishing rare alleles from sequencing errors in deep read alignments from thousands of individuals requires new SNP calling algorithms. 
Because the number and size distribution of structural variations (SVs) in the human genome are largely unknown, and because candidate SV predictions are often difficult to validate independently, it is challenging to train discovery algorithms for this important type of genetic variation. Visualization of the primary sequence data requires efficient, highly indexed data formats for speed, and intelligent, application-specific viewer options.

1.2. Technology-specific problems. Sequencing error characteristics are highly technology-dependent, e.g. 454 read errors are dominated by base under- or over-calls in homopolymer-rich sequences, whereas the vast majority of Solexa base errors are substitutions. The base quality values produced by the native, manufacturer-supplied software are notoriously inaccurate. Because one must rely on the base quality values to distinguish between true allelic differences and sequencing errors, this makes calling rare SNPs especially challenging. Read length and sequencing fragment length also vary across technologies, requiring considerable flexibility from read mapping and structural variation detection algorithms.

2. Current software pipeline for next-generation genome resequencing

My laboratory seeks to develop efficient and accurate software for next-generation sequence analysis. Our initial focus was on developing a complete software pipeline for the most immediate application of the new sequencing technologies: genome resequencing for SNP and short-INDEL polymorphism discovery. Our development was driven by the following three data analysis projects: (1) SNP discovery in light-shotgun 454 pyrosequencing reads from 10 Drosophila melanogaster isolates aligned to the reference fruit-fly genome; (2) SNP and short-INDEL discovery between a wild Caenorhabditis elegans isolate and the reference worm strain; and (3) mutational discovery between

a phenotypically important mutant strain of the yeast Pichia stipitis (this strain converts xylose to ethanol at a substantially elevated efficiency) and its parent strain. These projects gave rise to our base calling, read mapping, SNP and short-INDEL allele calling software, as well as a sequence assembly viewer program.

2.1. PYROBAYES – a base caller program for 454 pyrosequences. When attempting to use our single-read coverage, light-shotgun 454 sequences for allele calling against the reference genome sequence, we observed that the base quality values assigned by the native 454 software were much lower than the actual accuracy of the bases, resulting in a large number of missed SNPs in the data. To mitigate this effect we wrote our own base caller, PYROBAYES, which produces much more accurate base quality values, and enables accurate SNP discovery even in single-read coverage.

2.2. MOSAIK – a general reference sequence guided read mapping program. This program aligns traditional 500-1,000 bp ABI / capillary reads, 100-250 bp 454 reads, 25-50 bp Solexa or SOLiD short-reads (single-end or paired-end), and 20-50 bp Heliscope short-reads. Most importantly, MOSAIK (i) produces gapped read alignments to allow short-INDEL discovery; (ii) deals with reads that map to multiple genome locations; and (iii) produces an assembly (base-wise multiple alignment of all reads) including reads from different technologies. MOSAIK can achieve alignment speeds up to 40,000 reads per second in a single process.

2.3. GIGABAYES – a Bayesian SNP caller for next-generation sequences. We have completely re-engineered and extended our original Bayesian SNP caller, POLYBAYES. The program was rewritten in C++ for speed. 
A number of new features were added to enhance heterozygote allele calling in diploid genomes, to use inheritance information in both-parents-and-a-child trio sequencing data, and to maximize the sensitivity of rare allele discovery in deep human resequencing datasets. We used earlier versions of MOSAIK and GIGABAYES for our C. elegans allele calling study in single-end Solexa sequences. In this study we demonstrated that our polymorphism discovery pipeline was able to detect SNPs at a very low false positive rate. We also demonstrated that short (one or two bp) INDEL polymorphisms can be detected at similarly low false positive rates.

2.4. EAGLEVIEW – a next-generation sequence assembly viewer. This viewer is compatible with ABI/capillary sequences (so experimental SNP validation data can be viewed together with the next-generation primary reads in which the SNP was discovered), Illumina short-read sequences, and 454 pyrosequences. EAGLEVIEW has the ability to display annotation tracks in registration with the sequencing reads. This allows one to e.g. assess SNP locations in the context of gene annotations. EAGLEVIEW has a “packed” read view, and extensive zooming capability up to a “bird’s-eye view”, to use display real estate effectively.

2.5. Software distribution. We are distributing our software free of charge for academic and not-for-profit use (URL: http://bioinformatics.bc.edu/marthlab/software). We have sent our software to over 300 laboratories worldwide.

3. Present research focus: 1000 Genomes Project data analysis

The main driver behind our current algorithmic development is the individual human genome resequencing data generated by the 1000 Genomes (1000G) Project: my laboratory actively participates in the work of its Analysis Group (URL: http://www.1000genomes.org).

3.1. Read accuracy and base quality values. We are developing a pipeline (termed ERRORICITY) to evaluate base error rates and base quality value accuracy for sequencing reads. We applied this to 454 and Solexa datasets from the 1000G Project. Our analyses indicate that the Solexa base calling pipeline generally over-estimates base quality values (i.e. the actual base accuracy is much lower than implied by the assigned quality value). Such calibration analyses are crucial because well-calibrated quality values are required for accurate allele calling (especially for rare SNPs).
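The calibration idea in Section 3.1 can be illustrated with a minimal Python sketch. It is not the ERRORICITY pipeline; the function name, the input layout, and the toy data are hypothetical, and the sketch assumes that quality values are Phred-scaled, i.e. Q = -10 log10(P_error). For each assigned quality value it tabulates the observed error rate among aligned bases and converts it back to an empirical quality value, so that over- or under-estimation can be read off directly.

```python
import math
from collections import defaultdict

def empirical_quality(observations):
    """Compare assigned base quality values with observed error rates.

    observations: iterable of (assigned_q, is_error) pairs, one per aligned
    base (hypothetical input layout, assumed for this illustration).
    Returns {assigned_q: (n_bases, n_errors, empirical_q)}.
    """
    counts = defaultdict(lambda: [0, 0])  # assigned_q -> [bases, errors]
    for assigned_q, is_error in observations:
        counts[assigned_q][0] += 1
        counts[assigned_q][1] += int(is_error)

    table = {}
    for q, (n, errors) in sorted(counts.items()):
        # Add-one smoothing keeps the estimate finite when no errors are seen.
        p_error = (errors + 1) / (n + 2)
        # Phred scaling (assumed): Q = -10 * log10(P_error).
        table[q] = (n, errors, -10.0 * math.log10(p_error))
    return table

# Toy data, invented for illustration: 1,000 bases assigned Q30,
# of which 10 (1%) disagree with the reference.
toy = [(30, i < 10) for i in range(1000)]
calibration = empirical_quality(toy)
```

In this toy case the empirical quality for bases assigned Q30 comes out below 20, which is the signature of an over-estimated quality value.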


3.2. Read simulations. Although the gold standard for assessing the performance of Bioinformatics software is appropriate experimental validation, there are many instances where simulated datasets are the only way to definitively evaluate accuracy. One example is assessing read mapping accuracy: only in simulated datasets does one know the “true” genome position of origin of short-read fragments. As part of the 1000G Analysis Group effort, my lab is responsible for writing a read simulator program (ART) which produces simulated reads according to realistic, sequencing technology-specific read length, fragment-size distribution, and error profile, for all technologies contributing sequence data to the Project. Both our read accuracy analysis and read simulation efforts enjoy strong support from the four main sequencing machine manufacturers (Roche, Illumina, AB, and Helicos), including providing testing datasets and technology consultation.

3.3. Analyses guiding 1000G sequence collection strategy. Using our tools we perform tests directed at practical questions about 1000G study design. For example, given a fixed amount of total available sequence coverage, what is the best compromise between the number of individuals sequenced and the per-individual read coverage? We used our read simulation program ART, read mapper MOSAIK, and SNP caller GIGABAYES, to compare alternatives for trading off sample size vs. per-individual read coverage. When tabulating false positive calls and missed rare SNP rates we found that a minimum of 4X read coverage per individual will be required for effective rare SNP detection in a large cohort (e.g. 400 individuals).

3.4. Read mapping and variation discovery of the 1000G project data. My laboratory participates in the construction of the analysis pipeline for the official 1000G public data releases. Specifically, we will map the sequencing reads, construct read assemblies, and perform SNP and short-INDEL calling. 
Because of unique features of our software we believe that our pipeline will add substantial value to the analysis performed simultaneously by other software pipelines. For example, our MOSAIK program is the only fast aligner that produces the gapped alignments necessary for INDEL calling. Our SNP caller GIGABAYES is currently the only software that is fitted for next-generation reads, applies a statistically rigorous (i.e. non-heuristic) allele detection algorithm, and is capable of allele calling in deep resequencing data from thousands of individuals.
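The study-design question raised in Section 3.3 (more individuals at lower coverage, or fewer individuals at higher coverage) can be sketched with a deliberately simplified calculation. This is not the ART / MOSAIK / GIGABAYES simulation described above, and it is not meant to reproduce its results. The assumptions are all illustrative: read depth at a site is Poisson-distributed, a rare allele is carried by a single heterozygous individual so that each read shows it with probability 1/2, sequencing errors are ignored, and the allele counts as detectable when at least `min_reads` reads show it. All function and parameter names are hypothetical.

```python
import math

def p_detect(depth, min_reads=2):
    """Probability that a heterozygous carrier shows the variant allele
    in at least `min_reads` reads, under the simplified Poisson model."""
    lam = depth / 2.0  # expected number of reads carrying the variant allele
    p_fewer = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                  for i in range(min_reads))
    return 1.0 - p_fewer

def tradeoff(total_coverage, sample_sizes, min_reads=2):
    """For a fixed total coverage budget, list (sample size,
    per-individual depth, per-carrier detection probability)."""
    rows = []
    for n in sample_sizes:
        depth = total_coverage / n
        rows.append((n, depth, p_detect(depth, min_reads)))
    return rows

# Hypothetical budget of 1,600X total coverage split three ways.
for n, depth, p in tradeoff(1600, [200, 400, 800]):
    print(f"{n} individuals at {depth:.0f}X: P(detect in carrier) = {p:.3f}")
```

A larger sample is more likely to contain a carrier of a rare allele, but each carrier is then sequenced less deeply; the full simulation-based analysis additionally accounts for sequencing error, mapping error, and false positive calls, which this sketch leaves out.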

4. Research plans

As our individual genome resequencing informatics pipeline matures, we are taking our tool development into other exciting application areas. We are developing a structural variation discovery program, SPANNER; we are building an informatics pipeline for whole-genome mutational profiling of pathogens and model organisms; and we are adapting our tools for the analysis of coding and non-coding transcribed sequences. In order to bring the results of next-generation sequencing projects to the biologist we are planning to build an interactive visual data analysis tool. We anticipate further multi-fold speedup of sequence production from existing technologies and others that will be coming on-line in the near future. We will be seeking out algorithmic and hardware solutions that provide the storage and processing power needed for orders of magnitude more data than is available today.

4.1. Structural variation discovery. Structural variations (SVs) are important genetic features that are abundantly present in every human genome. Next-generation sequencing enables structural discovery at a much higher resolution than possible with tiling arrays or SNP-chips, and can find events (e.g. translocations) that are not detectable with probe-based approaches. We are currently developing an SV-detection program, SPANNER. This effort is driven partly by the multi-individual paired-end Solexa data from the 1000G Project and partly by a large collaboration with researchers at Johns Hopkins University for SV detection in SOLiD reads collected from autism as well as colorectal cancer samples. SPANNER aims to detect all basic SV event types: deletions, insertions, amplifications, inversions, and translocations. Event boundaries are detected on the basis of discordant paired-end map

locations and orientations. Copy number is estimated from the observed depth of read coverage averaged in an appropriate window. Complementing the SPANNER algorithm, we will build a viewer application displaying read-pair mapping distances and highlighting discordantly mapped pairs; and depth of read coverage together with estimated copy number. Although there are many groups worldwide working on SV discovery, we are not aware of any next-generation sequence based, publicly available SV detection computer tools or viewer applications. Paired with our MOSAIK read aligner, and with extensive input and experimental validation from our collaborators, we anticipate SPANNER to become a powerful computer program for SV discovery.

4.2. Whole-genome mutational profiling. Forward genetics is an essential tool for understanding gene function in model organisms. However, molecular genetic mapping methods are time-consuming and laborious, and cannot be used for organisms that are genetically inaccessible (e.g. eukaryotic parasites). High-throughput, next-generation sequence based whole-genome mutational profiling will greatly accelerate genetic mapping. We participate in collaborative mutational profiling projects in the yeast Pichia stipitis and in the eukaryotic parasite Toxoplasma gondii. These studies indicate that exhaustive detection of point mutations induced by chemical mutagenesis will likely require 10-15X sequence coverage for haploid genomes. On the other hand, our power calculations suggest that finding insertional mutants will likely be possible at much lower (0.2X) read coverage, but will require fragment-end read-pairs. Our methods for detecting chemically induced point mutations will be based on SNP calling algorithms. 
Insertional mutations will be detected with an SV detection-like approach, from fragment-end read pairs where one read maps to the known insertion sequence, and the other read maps to the genome location of the insertion. The main emphasis will be on accuracy, i.e. on ensuring that essentially all mutations are found (close to zero false negative rate) and very few false positive candidates are called. We anticipate that this project will answer fundamental questions about chemical mutagenesis, e.g. the number of actual point mutations introduced as a function of dosage, and what fraction of the mutations disrupt genes. We will also learn how to use the positions of the mutations to find the gene or genes responsible for the mutant phenotype.

4.3. Transcriptome analysis: novel transcript discovery and quantification of expression. Next-generation short-read technologies can produce over 100 million reads per run. When applied to transcriptome sequencing, this throughput can provide accurate measurement of gene expression over a large dynamic range. We have started building informatics pipelines for the quantification of known RNA transcripts, both coding and non-coding; and for novel transcript discovery. The main informatics challenges for small non-coding RNA sequencing are linker removal, often in the last few, low-quality bases of the reads; and the precise genome alignment of short RNA reads in the presence of repetitive sequence motifs. Our pipeline will include additional informatics programs that add value to the analysis (e.g. for miRNAs, detecting hairpins and identifying potential miRNA* sequences). We will also develop an informatics pipeline for mRNA analysis. Expression profiling will include both SAGE/CAGE-like counting methods, and shotgun sequencing of expressed gene sequences. 
Estimation of expression levels based on shotgun read coverage on transcripts has to account for the portion of the transcript where reads of the given read length can be uniquely mapped. The estimation is also subject to representational biases introduced by the sequence fragmentation protocol that produces the shotgun fragments. We are developing informatics methods to account for biases from these and other sources. Novel transcript discovery will take advantage of either paired-end read data or longer (50+ bp) sequencing reads in order to confidently link pairs of exons and align sequencing reads across splice junctions. Our informatics methods will be developed in collaboration with researchers generating data and performing validation experiments at the Boston Children’s Hospital and at the Dana Farber Cancer Institute.
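The mappability correction mentioned in Section 4.3 can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not our pipeline: a read start position counts as uniquely mappable when the read-length substring beginning there occurs exactly once across all transcripts, only the forward strand and exact matches are considered, and fragmentation bias is ignored. The function names, the toy sequences, and the read counts are all hypothetical.

```python
from collections import Counter

def mappable_starts(transcripts, read_length):
    """Count, per transcript, the read start positions whose
    read-length substring is unique across all transcripts."""
    kmer_counts = Counter()
    for seq in transcripts.values():
        for i in range(len(seq) - read_length + 1):
            kmer_counts[seq[i:i + read_length]] += 1

    unique = {}
    for name, seq in transcripts.items():
        unique[name] = sum(
            1 for i in range(len(seq) - read_length + 1)
            if kmer_counts[seq[i:i + read_length]] == 1
        )
    return unique

def normalized_expression(read_counts, mappable, total_reads):
    """Uniquely mapped reads per 1,000 mappable start positions per
    million mapped reads; None when no position is mappable."""
    result = {}
    for name, count in read_counts.items():
        m = mappable.get(name, 0)
        result[name] = None if m == 0 else count * 1e3 * 1e6 / (m * total_reads)
    return result

# Toy transcripts sharing their first eight bases (invented data).
toy_transcripts = {"tA": "ACGTACGTTTGA", "tB": "ACGTACGTCCAG"}
mappable = mappable_starts(toy_transcripts, read_length=4)
expression = normalized_expression({"tA": 8, "tB": 2}, mappable, total_reads=10)
```

Dividing by the number of uniquely mappable positions rather than by full transcript length avoids under-estimating transcripts that share sequence with others, since reads from the shared portion cannot be assigned uniquely.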


4.4. Developing an interactive workbench for data visualization, data validation, and hypothesis generation for next-generation sequencing data. One of the greatest challenges of next-generation sequencing projects is to reduce the tremendous primary data volume (often hundreds of GB) to a manageable size, and to present study results with the right amount of detail. The second challenge is to present the primary sequence data in the context in which they can be best interpreted. We experience the first challenge in our own software development and data analysis projects (e.g. when inspecting candidate SNP calls in 1,600X read coverage, or looking at 600,000 copies of an abundantly expressed miRNA transcript). We plan to develop a family of interactive data visualization tools tailored to the logic and needs of specific sequencing applications. For example, for SNP allele calling, one might choose a hierarchical viewing strategy in which the primary view shows only the most likely genotype of every sequenced individual but hides the sequencing reads. Additional details will then be available at consecutive, deeper view levels. For miRNA alignments, one would only show a single representative of each unique transcript, and indicate with other visual cues the number of reads from each transcript. For displaying reads relevant for SV discovery, one would e.g. only show read pairs that are discordantly mapped and can therefore indicate an SV event. To address the second challenge, we will display the primary sequence data in the context of other relevant biological information, including gene annotation and gene models, gene expression data, or predicted transcription factor binding sites. Annotation tracks are already present in our prototype assembly viewer program EAGLEVIEW. A compact and focused sequence view together with intuitive presentation of the sequence context should be valuable for scientific hypothesis generation based on the sequencing results.

4.5. Scale up to the needs of routine personal resequencing. Despite the fast pace with which informatics methods are adapting to next-generation sequence analysis, data storage, read mapping, and variation detection are still formidable tasks even for a single individual genome. The 1000G Project is on track to sequence a thousand or so individuals. With sequencing costs falling, and throughput rapidly increasing, individual resequencing for personal genetic analyses is soon to come. Tools that can keep up with the anticipated increase of several orders of magnitude in data volume will require a new round of informatics development. We will pursue three avenues for this development. First, we will help develop efficient, highly indexed data file formats that support fast, parallelized data processing. Second, we will test advanced data structures and algorithms (e.g. cuckoo hash tables, disjoint hashing, approximate string matching) to further improve the efficiency and speed of our read mapping algorithms. Third, we will collaborate with companies specializing in hardware implementations of bioinformatics algorithms that can result in orders of magnitude speedup for our software.

Gabor T. Marth – Teaching statement 09/14/2008

Teaching statement

1. Teaching philosophy in Bioinformatics

I was hired to the Boston College Biology Department as the first junior faculty member in Bioinformatics. In addition to establishing a nationally and internationally recognized research program in this field, my charge was to help establish and provide direction for a strong Bioinformatics teaching curriculum. Prior to coming to Boston College, while at Washington University in St. Louis, and then at the National Institutes of Health, I participated each year in a Bioinformatics course for Geneticists at the Cold Spring Harbor Laboratory, a pre-eminent research and conference center on Long Island. I also participated in shorter Bioinformatics and Genomics courses at various locations in Canada, organized by the Canadian Genetic Diseases Network. In these strictly hands-on courses I witnessed the transformation of many Biologists with absolutely no prior computer experience into capable Bioinformatics programmers. These individuals were hugely enabled by their newfound computer skills, and their ability to process biological information in an intelligent, automated fashion greatly enriched their research. Based on this experience, I have become convinced that both our undergraduate and graduate students are best served by a teaching curriculum with a heavy focus on hands-on computer work. I was able to infuse UNIX and PERL computer programming into our current introductory Bioinformatics courses for both student bodies, as discussed below.

2. Course development and teaching

Over the past five years I developed from scratch and taught three Bioinformatics courses: a graduate seminar, an undergraduate elective, and a graduate core class.

BI820 Quantitative and computational problems in genomics. I developed this class as my first teaching assignment in the Fall of 2003, and taught it in the Spring semesters of 2004 and 2005. The web site with course materials and computer presentations is available at the following URL: http://bioinformatics.bc.edu/~marth/BI820/. This seminar class covered three essential topics in computational medical genetics:

I. Genetic single-nucleotide polymorphism discovery and characterization in DNA sequence data. This section of the course taught the students how to discover, in genetic sequence data, the genetic variants that make individuals' genetic material unique.

II. Polymorphism structure, function, and ancestral inference. This section discussed the sources of genetic variations and the molecular / ancestral processes (e.g. recombination, genealogy) that have propagated them to the present. Methods of ancestral inference were discussed, including studies by the instructor and others to uncover important features of pre-historic human evolution.

III. Advanced topics in population, statistical, and medical genetics. This section introduced statistical and medical genetics. It described what is known about human genetic diseases and the allelic structure of functional polymorphisms. It also described methods to find specific genetic variants that cause human diseases: genetic linkage analysis and case-control association mapping. Finally, forensic applications such as DNA identification were discussed.

This seminar class was based on a number of instructor lectures introducing each major topic, followed by a number of student presentations based on key publications in the computational and medical genetics field.
Each section also contained computer sessions in which the instructor demonstrated the relevant computer programs used by the research community for data discovery and analysis. This course was based on my own research work during my first two years at Boston College. As a new assistant professor with little experience in curriculum development, I spent an inordinate amount of time preparing for this course. But the experience of preparing lectures and selecting relevant publications for student reading material and presentations made future course preparations much easier. I have received much positive feedback from the students. Although this seminar course was last taught in 2005, it often comes up in conversation, and students who took the class fondly demonstrate bits and pieces of the computational genetics knowledge they learned years ago.

BI420 Introduction to Bioinformatics. This is a foundational upper-level undergraduate elective. I developed half of the class materials in 2004, when I co-taught this course with Professor Peter Clote. I developed the other half afterwards, and taught the class from 2005 to 2008 based entirely on my own materials. The 2006 course materials are available at: http://bioinformatics.bc.edu/~marth/BI420/ and the 2007 course materials are available at: http://bioinformatics.bc.edu/marthlab/BI420/. In 2008 we moved the course materials to the Boston College Blackboard Vista system. This course is organized into six separate sections, as follows.

I. Introduction to Bioinformatics. Aspects of cellular and genome organization that lend themselves to computational scientific research.

II. Genome sequencing and data mining. Genome sequence generation, functional and structural sequence annotations, genetic variation discovery, gene expression analysis, proteomics, storage and retrieval of Biological data.

III. Classical Bioinformatics methods. Sequence alignments, phylogenetic analysis, and data classification.

IV. Computational Genomics. Evolutionary genomics, population genomics, medical genomics.

V. Practical Bioinformatics. Basic UNIX computer skills, programming in the PERL computer language, using and building Biological databases.

VI. Final Presentations. Short, individual student presentations on a relevant bioinformatic topic.

In this class we spend a sizeable amount of time on classical bioinformatic methods. However, I feel that the main focus should be on the current growth areas of computational biology research. Specifically, we cover in greatest detail genetic variation research, genomic data mining, computational genomics, and medical genomics. Many of these topics overlap substantially with my current or past research areas.
For example, I am able to teach from personal experience about genetic sequencing, the Human Genome Project, large international genetic variation discovery projects, disease gene hunting, and most recently, the exciting informatics research on next-generation DNA sequencing technologies. The Teaching Assistants and other members of my research laboratory often insist that I include slides from their newest research results in the appropriate presentations. I enjoy teaching this class immensely, for a number of reasons. First, the vast majority of the students in this class are extremely interested, engaged, and very smart. Second, a large fraction of the topics are very close to my heart and to my own research interests. Third, I have been fortunate to receive exemplary help from Teaching Assistants, which I believe makes this class a rewarding and extremely efficient learning experience for the students. Fourth, I particularly enjoy teaching the practical computer programming section of the class, as I get to watch the majority of the students develop real computer skills in just 3-4 weeks. I believe these skills will serve them very well in their future studies and at their eventual workplace, whether at a research institution or in a professional job.

BI820 Graduate Bioinformatics. This is a class that I first taught in the Fall semester of 2005 as a graduate “seminar”. In reality, it was a trial run for our new Graduate Bioinformatics Core course, which is required for all Biology graduate students and is usually taken in their second year. I taught this course in the Fall semesters of 2005 – 2008. The course materials are posted at the following URL: http://bioinformatics.bc.edu/marthlab/BI616/. As I described earlier, I believe that the most important skill that experimental Biology students can learn from us Bioinformaticians is the ability to process biological data, even at a basic level. In most areas of Biology, computers have become an indispensable research tool. A researcher who can take full advantage of electronic data by manipulating data from public databases and integrating these with his/her own data points will gain a strong competitive edge over others. For these reasons, I convinced the Biology faculty to (a) add Bioinformatics as a mandatory component of the graduate core curriculum; and (b) make this course a practical Bioinformatics programming course. The course covers the following topics:

I. UNIX programming basics. In this section of the course students familiarize themselves with the UNIX environment, and learn how to execute commands, how to accomplish simple file manipulation tasks, and how to use productivity tools in the UNIX environment.

II. PERL programming. In this section students learn how to write computer programs in a simple yet powerful programming language, one that has become the mainstay of Bioinformatics research. We start at the basics, and work our way up to complex data and program control structures.

III. Bioinformatics-specific programming. In this section the students apply the basic programming skills they learned to typical, real-life Bioinformatics problems: how to parse biological data files, and how to run bioinformatic programs from within a program and capture and parse their output. We learn how to build and manipulate databases from PERL programs, and how to write web server applications that allow users to run bioinformatic processing programs through the web.

IV. Term projects.
The students complete a complex term project, typically involving building a biological database and accessing and manipulating its contents through a web server application they also write. This is also a very rewarding class to teach. The majority of students taking this class have no prior computer experience. In each class, I introduce commands, data structures, and control structures, or solve programming tasks of increasing complexity. Each student works on his/her own assigned laptop computer. The Teaching Assistant and I help with problems. After every student masters the material, we move on to the next step. After a few classes of struggle and concentrated effort, EVERY student (so far) has become conversant with UNIX and PERL. The final presentations usually require a large time commitment from both the students and the teaching staff. In the class with the final presentations each student demonstrates his/her web server application; this is a festive occasion. This is an intellectually very demanding class to teach and involves a tremendous amount of interaction with the students. But the rewards are equally great: there have been many instances where students who have taken this class went on to build their own databases and web servers to aid their laboratory’s research. In other instances, the programming and data analysis skills obtained in this class led to data analysis publications that would not otherwise have been possible. These examples make me feel that this class gives the students tangible skills that can clearly advance their scientific or professional careers.

3. Infrastructure development for Bioinformatics classes

As both the current undergraduate and graduate Bioinformatics classes include substantial hands-on computer programming, I needed to build up a hardware infrastructure and recruit computer support for these classes. Following a master plan that I put together, the Biology Department, with the help of a grant from the Boston College Academic Technology Fund, purchased 26 laptop computers (25 for students, and one instructor machine). These computers are located in the Biology Department Computer Teaching Laboratory (Higgins Room 425). A UNIX environment and appropriate productivity and programming tools have been installed on these machines. Additionally, the Biology Department purchased a computer server (named BIOCLASS), housed in the Biology Department server room and maintained by the Biology Department UNIX system administrator.

The laptop computers are used in both the undergraduate and the graduate Bioinformatics classes. In off semesters and in the Summer, a smaller number (5-6) of laptops are left out, fully installed, to allow graduate students to use them for programming if necessary.

4. Curriculum development and the future of Bioinformatics teaching

With an undergraduate Bioinformatics foundation course and a graduate Bioinformatics core course in place, I feel that the Bioinformatics teaching mission has a solid foundation. There are several additional Bioinformatics elective courses, taught by other Bioinformatics faculty members, which provide more specialized coursework. Based on the foundation courses, as well as additional courses offered by the Mathematics and the Computer Science Departments, we are currently establishing a Bioinformatics minor. This minor should be attractive to both Biology and Computer Science undergraduates. Given that Bioinformatics is a strong growth area, this minor should be useful preparation for graduate-level research or for finding professional opportunities. The number of Bioinformatics graduate students is steadily increasing. This requires that we provide a comprehensive, state-of-the-art teaching curriculum for our graduate students. Currently, the low number of Bioinformatics faculty limits the number of graduate-level classes we can offer. We are working on extending our options by considering courses not only from other departments at Boston College (e.g. Math and Computer Science) but also at other area universities. The hiring of additional Bioinformatics faculty will also help serve our Bioinformatics graduate students.

Service statement

Appendix A. Research materials

Copies of publications

Downloaded from genome.cshlp.org on September 11, 2008 - Published by Cold Spring Harbor Laboratory Press

Methods Rapid whole-genome mutational profiling using next-generation sequencing technologies

Douglas R. Smith,1,7,9 Aaron R. Quinlan,2,7 Heather E. Peckham,3,7 Kathryn Makowsky,1 Wei Tao,1 Betty Woolf,1 Lei Shen,1 William F. Donahue,1 Nadeem Tusneem,1 Michael P. Stromberg,2 Donald A. Stewart,2 Lu Zhang,2 Swati S. Ranade,3 Jason B. Warner,3 Clarence C. Lee,3 Brittney E. Coleman,3 Zheng Zhang,3,4 Stephen F. McLaughlin,3 Joel A. Malek,3 Jon M. Sorenson,3,4 Alan P. Blanchard,3 Jarrod Chapman,5 David Hillman,5 Feng Chen,5 Daniel S. Rokhsar,5 Kevin J. McKernan,3 Thomas W. Jeffries,6 Gabor T. Marth,2,9 and Paul M. Richardson5,8,9

1 Agencourt Bioscience Corporation, Beverly, Massachusetts 01915, USA
2 Boston College Biology Department, Higgins Hall, Chestnut Hill, Massachusetts 02467, USA
3 Applied Biosystems, Beverly, Massachusetts 01915, USA
4 Applied Biosystems, Foster City, California 94404, USA
5 US Department of Energy Joint Genome Institute, Walnut Creek, California 94598, USA
6 Institute for Microbial and Biochemical Technology, US Forest Products Laboratory, Madison, Wisconsin 53726, USA

Forward genetic mutational studies, adaptive evolution, and phenotypic screening are powerful tools for creating new variant organisms with desirable traits. However, mutations generated in the process cannot be easily identified with traditional genetic tools. We show that new high-throughput, massively parallel sequencing technologies can completely and accurately characterize a mutant genome relative to a previously sequenced parental (reference) strain. We studied a mutant strain of Pichia stipitis, a yeast capable of converting xylose to ethanol. This unusually efficient mutant strain was developed through repeated rounds of chemical mutagenesis, strain selection, transformation, and genetic manipulation over a period of seven years. We resequenced this strain on three different sequencing platforms. Surprisingly, we found fewer than a dozen mutations in open reading frames. All three sequencing technologies were able to identify each single nucleotide mutation given at least 10–15-fold nominal sequence coverage. Our results show that detecting mutations in evolved and engineered organisms is rapid and cost-effective at the whole-genome level using new sequencing technologies. Identification of specific mutations in strains with altered phenotypes will add insight into specific gene functions and guide further metabolic engineering efforts. [Supplemental material is available online at www.genome.org. Complete data sets are available at the NCBI Short Read Archive under accession no. SRA 001158 (ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead).]

Pichia stipitis (Pignal) is a haploid yeast related to endosymbionts of beetles that degrade rotting wood (Suh et al. 2003). It is an important organism for bioenergy production from lignocellulosic materials because of its high capacity to ferment xylose and cellobiose to ethanol (Parekh et al. 1988). We previously sequenced the reference strain, Pichia stipitis CBS-6054, resulting in a completely characterized genome of eight chromosomes totaling 15.4 Mb of sequence (Jeffries et al. 2007). This strain has been subjected to chemical mutagenesis, phenotypic selection, genetic engineering, and adaptive evolution in order to develop strains improved for ethanol production. Chemical mutagenesis and selection resulted in small improvements in ethanol production attributable in part to carbon catabolite derepression (Supplemental Fig. 1; Methods). Disruption of CYC1 (cytochrome c, isoform 1) to create strain Shi21 increased the specific ethanol production rate by 50% and the ethanol yield by 10%; however, the nature of additional mutational events leading to this phenotype was uncharacterized.

Traditional methods for identifying mutations are labor- and time-intensive, so we tested the ability of next-generation sequencing technologies to determine the differences in this improved strain’s entire genome relative to the reference strain. We generated high-coverage, whole-genome data sets using single fragment end reads from three next-generation sequencing platforms: 454 Life Sciences (Roche) (∼225-bp reads), Illumina (formerly Solexa sequencing) (32-bp reads), and Applied Biosystems SOLiD (35-bp reads) (Schuster 2008). We assessed these data to determine the effect of sequence coverage (i.e., data set size) on the accuracy of mutation detection, and to compare the efficiency of the three platforms for this application.

7 These authors contributed equally to this work.
8 Present address: Progentech Limited, 5885 Hollis St., Suite 155, Emeryville, CA 94608, USA.
9 Corresponding authors. E-mail [email protected]; fax (510) 655-5840. E-mail [email protected]; fax (978) 867-2601. E-mail [email protected]; fax (617) 552-2011.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.077776.108. Freely available online through the Genome Research Open Access option.
18:000–000 ©2008 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/08; www.genome.org

Results

Genomic DNA from P. stipitis (Shi21) was sequenced using the three advanced sequencing platforms according to specifications of the manufacturers (Methods). Low-quality sequence reads from the 454 Life Sciences and Illumina technologies were excluded by manufacturer quality control filters prior to analysis. Since the Applied Biosystems SOLiD sequencing technology does not exclude low-quality reads prior to data analysis, we instead discarded all SOLiD reads that had too many mismatches when they were mapped (Methods) to the Pichia reference genome. We processed the sequence reads from each technology with the manufacturer-supplied base-calling software. We additionally recalled the 454 pyrosequences with the Pyrobayes (Quinlan et al. 2008) program because it produces a lower number of substitution errors and more accurate base quality values than the native base-calling program (Methods).

We first identified and masked (i.e., excluded from the genome sequence) all repetitive elements within the P. stipitis genome (Jeffries et al. 2007) that would interfere with unique read alignments, including short genomic repeats as well as nuclear mitochondrial DNAs (NUMTs), which are sequences of mitochondrial origin that were inserted into the nuclear genome (Methods; Supplemental Table 1) (Richly and Leister 2004). Due to the nature of the unpaired short reads produced by these methods, this repeat masking prevented shorter SOLiD and Illumina reads from mapping to 6.8% of the genome and prevented the medium-length 454 FLX reads from mapping to 5.3% of the genome (Supplemental Methods). The total number of aligned reads passing alignment quality filters and the corresponding aligned read coverage are shown in Table 1. Alignment of reads from each technology to the repeat-masked reference sequence resulted in 11–175× coverage of the genome depending on the type of platform and number of runs (Table 1; Supplemental Table 2).

Table 1. Sequencing and mutation discovery statistics

Sequencing            Total no.       Total sequence      Average sequence    False-positive   False-negative
technology            of reads^a      (bp, in millions)   coverage from       (spurious)       (missed)
                                                          aligned reads^b     mutations        mutations
454 FLX (2 runs)      887,123         199.35              10.78×              1                0
454 FLX (1.5 runs)    669,783         150.64              8.15×               6                1
454 FLX (1 run)       459,563         103.38              5.62×               1 17
Illumina (7 lanes)    25,818,266      826.18              44.24×              0                0
Illumina (3 lanes)    11,281,705      361.01              19.40×              0                0
Illumina (2 lanes)    7,548,407       241.55              13.00×              2                0
Illumina (1 lane)     3,674,253       117.58              6.32×               2                2
AB (2 flow cells)     228,191,758     7,986.71            175.09×             0                0
AB (30×)              39,111,512^c    1,368.90            30.01×              0                0
AB (20×)              26,065,653^c    912.30              20.00×              0                0
AB (10×)              13,045,859^c    456.61              10.01×              0                0
AB (8×)               10,426,261^c    364.92              8.00×               0                4
AB (6×)               7,819,696^c     273.69              6.00×               0                5

The overall sequence throughput and aligned coverage is shown for each sequencing technology used in the study. We also report the number of spurious and missed mutations observed from each experiment.
a For the 454 and Illumina technologies, the total number of reads reflects the number of reads that remained after manufacturer quality controls. The Applied Biosystems (AB) read totals reflect all reads produced by the sequencing run.
b The coverage produced by those reads in the second column that passed the mapping filters we used for each technology (Methods).
c Estimated number of reads based on in silico subsampling of coverage.

When mapping the Illumina, 454, and Applied Biosystems reads to the masked reference sequence, we allowed one, two, and three mismatches, respectively (Methods). The Illumina and 454 reads were mapped to the reference sequence with the MOSAIK program (Methods). At the time of this analysis, MOSAIK was unable to align reads from the Applied Biosystems SOLiD technology because of the dinucleotide encoding (also termed “color-space” alignments) that this technology uses (Valouev et al. 2008). Therefore, we mapped the Applied Biosystems SOLiD reads to the Pichia genome with the Applied Biosystems SOLiD Alignment Tool. Despite the algorithmic differences owing to color-space alignments, MOSAIK and the SOLiD Alignment Tool use a similar hash-based method to find potential genomic alignment locations for each sequence read.

Figure 1. Distribution of genome sequence coverage. The distribution of sequence coverage across the unmasked portion of the genome is shown for each technology. Here we represent comparable mean coverage levels for Illumina (red line, 13.00× mean genome coverage), 454 FLX (blue line, 10.78× mean genome coverage), and Applied Biosystems SOLiD (black line, 10.00× mean genome coverage) technologies. For each, we compare the observed coverage distribution to the expected Poisson coverage distribution (dotted lines of the same color for each technology).

The distribution of sequence coverage across the Pichia genome was similar for each of the sequencing technologies (Fig. 1). The observed coverage distributions are substantially dispersed as compared to the expected Poisson distributions (Fig. 1, dotted lines), indicating that there are regions of the Pichia genome that are more facile to sequence than others. The causes and dynamics of these biases are beyond the scope of this study but are an important consideration for genome resequencing studies.

Multiple read alignments from the 454 and Illumina platforms were screened for mutations using GIGABAYES, a new version of the POLYBAYES (Marth et al. 1999) SNP discovery program (Methods). Color-space alignments of the SOLiD data were similarly screened using software supplied by Applied Biosystems. The 17 candidate mutations discovered among the three sequencing technologies were resequenced in CBS-6054 and in each of the four derived strains with a capillary sequencing machine and were all confirmed (Table 2). Three of the changes were found to be errors in the reference sequence, as the alternate base is present in the validation traces not only from all sequenced mutants but also from the parent strain. This implies an error rate of 3 nt in the 15-Mb Pichia reference genome, far exceeding the established standards for genome finishing (1 error/10 kb). Given that the mutations were discovered in very deep data sets and independently confirmed by four different sequencing methods, it is unlikely that we missed any additional mutations in the unmasked fraction (∼95%) of the Shi21 mutant genome. We therefore believe that the remaining 14 mutations comprise the complete set of single nucleotide variants between the mutant and the parent (i.e., reference) Pichia strains.

Table 2. Summary of discovered point mutations relative to the Pichia reference genome

Color coding indicates in which strain each mutation first appeared relative to the parent, CBS-6054. Orange, FPL-061 (rapid growth on L-xylose in the presence of the respiration inhibitors); yellow, FPL-DX26 (2-deoxyglucose resistance); green, FPL-UC7 (FOA resistance); blue, Shi21 (CYC1:ura3 targeted gene disruption).

Since the Pichia genome is haploid during vegetative growth, all mutations are expected to be homozygous. An apparent heterozygous change at position 358,358 on chromosome 8 is a result of the intentional gene disruption of CYC1 with a URA3 selection cassette, which resulted in a URA3 duplication. This apparent variation represents a paralogous difference between the two copies of a duplicated gene and thus cannot be considered a true point mutation. We screened for small (1–2 bp) INDEL polymorphisms with GIGABAYES, but none were found, which is not surprising considering that the alkylating agents (Methods) used in mutagenesis principally induce base substitutions. However, because we strictly limited the number of mismatches allowed during read mapping (Methods), it is theoretically possible that longer (>2 bp) INDEL mutations were missed. Additionally, we are currently investigating the use of paired-end sequence data to identify and resolve structural variations as well as larger insertions and deletions.

A primary focus of this study was to evaluate the utility of next-generation sequencing technologies for mutational profiling. We therefore compared the capabilities of the three platforms for the identification of the 14 confirmed point mutations in the Pichia mutant. Each of the three sequencing technologies correctly identified all 14 variations with essentially no false positives when all available reads generated on the platform were used (Table 1; Fig. 2). The complete Illumina and Applied Biosystems alignments afforded perfect accuracy: All 14 mutations were found and no false-positive predictions were made. A single false-positive prediction was found in the complete 454 FLX data (which produced lower overall coverage than the other platforms) and was most likely the result of a PCR error during sequence library construction (data not shown). The accuracy we observed is encouraging given that low false discovery (i.e., the fraction of erroneously identified mutations) and false negative (i.e., the fraction of true mutations that were missed) rates are critical considerations for the application of these technologies to rapid forward genetic mutational profiling. These results show that all three technologies are suitable for highly accurate mutation screening (Supplemental Fig. 2).

An important consideration for the cost of such experiments is the depth of sequence coverage required to achieve a desired sensitivity and specificity. To determine how the error rate changes as fewer reads are used, we selected subsets of reads of varying size (corresponding to likely use cases for each platform) from each of the three full data sets and subjected the resulting lower-coverage assemblies to our mutation discovery analysis. As shown in Table 1, a combined missed mutation (false negative [FN]) and erroneously called mutation (false positive [FP]) error rate of 50% is achieved with 1.5 454 FLX machine runs (8.15-fold aligned read coverage; six FP and one FN errors), a single lane of Illumina reads (6.32-fold aligned read coverage; two FP and two FN errors), and sixfold coverage of Applied Biosystems SOLiD reads (zero FP and five FN errors). The increased number of false positives observed with the lower 454 FLX coverage is the result of local homopolymer misalignments that arise when a nucleotide overcall (that is, calling too many nucleotides) is followed by a nucleotide undercall (that is, calling too few bases), or vice versa. Deeper coverage mitigates such alignment artifacts (Quinlan et al. 2008). The fact that the Applied Biosystems SOLiD technology produced zero false positives is a result of the “di-base encoding” which facilitates the segregation of sequencing error from true mutations (Valouev et al. 2008). It is important to note that we may have missed additional mutations in the Shi21 strain because we masked between 5.3% and 6.8% of the genome. Given the constraints of plate configurations and run conditions on the different platforms, we find that a minimum of 10–15-fold genome coverage is required for the desired error rate.
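The error counts quoted above can be tallied with a few lines of Python. This is only an illustrative sketch: it uses the false-positive and false-negative counts that the text states explicitly, and it assumes that the combined error rate is defined as (FP + FN) divided by the 14 confirmed mutations, a denominator the article does not spell out. The names `TRUE_MUTATIONS`, `subsamples`, and `combined_error_rate` are hypothetical.

```python
# Rows: (data set, aligned coverage, false positives, false negatives).
# Only data sets whose FP/FN counts are stated in the running text are listed.
TRUE_MUTATIONS = 14  # confirmed point mutations in the Shi21 strain

subsamples = [
    ("454 FLX (2 runs)", 10.78, 1, 0),
    ("454 FLX (1.5 runs)", 8.15, 6, 1),
    ("Illumina (7 lanes)", 44.24, 0, 0),
    ("Illumina (1 lane)", 6.32, 2, 2),
    ("AB SOLiD (2 flow cells)", 175.09, 0, 0),
    ("AB SOLiD (6x)", 6.00, 0, 5),
]


def combined_error_rate(fp, fn, n_true=TRUE_MUTATIONS):
    """Assumed definition: (FP + FN) divided by the number of true mutations."""
    return (fp + fn) / n_true


for name, coverage, fp, fn in subsamples:
    rate = combined_error_rate(fp, fn)
    print(f"{name:24s} {coverage:7.2f}x  FP={fp} FN={fn}  combined={rate:.0%}")
```

Under this assumed definition the three reduced-coverage data sets come out at or below 50%, while the full data sets above roughly tenfold coverage show at most one error, which is in line with the 10–15-fold recommendation.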

Smith et al.

Figure 2. The effect of sequence coverage on mutation discovery accuracy. The total number of mutation discovery errors is shown for the three sequencing technologies at various levels of aligned sequence coverage. (Blue circles) 454 FLX; (red circles) Illumina; (black circles) Applied Biosystems SOLiD.

Discussion

All three next-generation sequencing platforms correctly identified nucleotide variations between the reference and mutant strains given sufficient coverage. The fraction of mutations in open reading frames (78%) was slightly higher than the average gene density (56%) (Jeffries et al. 2007). In the absence of selection, about two-thirds of the base changes should have resulted in silent mutations at the amino acid level, due to redundancy in the genetic code. Surprisingly, all mutations retained in open reading frames resulted in amino acid changes, indicating high selective pressure and little or no neutral drift (Table 2). Further characterization of the identified mutational events through physiological and genetic studies will be necessary to determine how they affect cell phenotype.

Overall, our results demonstrate that the new sequencing technologies tested are well suited for mutational analysis of novel yeast strains derived from multistep mutagenesis procedures. For most applications, 10–15-fold redundant genome coverage will allow for accurate and cost-effective mutational profiling. Deeper coverage is likely necessary for similar experiments in diploid organisms (e.g., ENU mutagenesis in mouse), as the discovery of heterozygous loci requires that both alleles be sampled from high-quality reads. The approach is expected to be equally suitable for the analysis of bacterial, fungal, and other organisms derived by directed evolution and natural variation, especially as sequencing costs and throughput continue to improve for all of these technologies. Thus, this approach could help accelerate the development of novel organisms for bioenergy and biotechnology applications as well as facilitate traditional forward and reverse genetic screens.

Methods

Derivation of the mutagenized Shi21 strain

The derivation of the Shi21 strain of P. stipitis is thoroughly described by Shi et al. (1999).

Sequencing

Chromosomal DNA from P. stipitis Shi21 was prepared by standard methods (Burke et al. 2000). For 454 sequencing, a library was prepared and sequenced using manufacturer-supplied protocols and reagents, as follows. Five micrograms of DNA was sheared to an average size of 480 bp. Adaptors were ligated, and the correct products were selected using 454 library immobilization beads. The single-stranded DNA library was quantified using the Invitrogen Ribogreen assay, and 32 emulsion PCR reactions were prepared with a ratio of two molecules per DNA capture bead. After amplification, the emulsions were broken and enriched, resulting in a total of 3.92 million beads containing amplified library fragments. The beads were sequenced in two full 454 FLX sequencing runs, each loaded with 1.8 million beads, yielding a total of ∼199 Mb of sequence data.

For Illumina sequencing, 3 µg of genomic DNA was fragmented below 800 bp using a nebulizer. Fragments were end-repaired with T4 DNA polymerase. A single dA was added to the ends using Klenow fragment and dATP. Fragments were then ligated with adaptors provided by the manufacturer. Adaptor-ligated fragments were separated from unligated adaptors by running an agarose gel and cutting a band corresponding to ∼150–300 bp and purified using a spin column. The fragment library containing adaptors was subjected to 18 rounds of PCR using primers supplied by Illumina. This amplified library was then loaded onto the cluster generation station for single molecule bridge amplification on slides containing attached primers. The slide with amplified clusters was then subjected to step-wise sequencing using four-color labeled nucleotides on the Illumina 1G sequencing system for 32 cycles. A total of 25,818,266 reads were obtained after quality filtering, yielding ∼826 Mb of sequence data.

For SOLiD sequencing, five micrograms of DNA was sheared and size-selected to an average size of 100 bp. P1 and P2 adaptors were ligated and amplified for 15 cycles; 0.2 pg/µL of double-stranded library was added to the emulsion with 950 million beads according to manufacturers’ instructions. Twenty-nine percent of the beads were P2 positive (contained amplified library fragments) before enrichment and 91% of the beads were P2 positive after enrichment, yielding 277 million beads deposited on two slides; 228 million of these beads fell within the imaged area and were detected in sequencing, yielding 2.7 Gb of aligned 35-mer sequence.

For confirmation sequencing, PCR products were generated from genomic DNA of each strain using M13-tailed primer pairs, the products were sequenced on ABI3730xl instruments, variants were identified using PolyPhred, and confirmed using consed (Stephens et al. 2006). Complete data sets are available at the NCBI Short Read Archive under accession no. SRA 001158 (ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead).

Illumina and 454 sequence alignment

We used our general reference sequence-guided alignment and assembly tool, MOSAIK, to process the Illumina and 454 data sets. MOSAIK (Michael Stromberg, Boston University) uses a hashing scheme to seed full Smith-Waterman gapped alignments against the concatenated P. stipitis genome. The resulting pairwise alignments are then consolidated into a multiple sequence alignment (assembly) and saved as an ACE assembly file. These assemblies can be viewed by programs such as consed (Gordon et al. 1998). To correct for 454 indel alignment errors, the Smith-Waterman scoring algorithm has been augmented to use an alternate gap open penalty when a homopolymer region is detected. For both the Illumina and the 454 reads, we required that

4 Genome Research www.genome.org Downloaded from genome.cshlp.org on September 11, 2008 - Published by Cold Spring Harbor Laboratory Press

at least 95% of each read align to the reference sequence. In order to ensure that we only aligned high-quality reads from each technology, we also required that the reads from each technology had few sequence differences (i.e., mismatches, insertions, or deletions) relative to the reference genome sequence. We allowed at most one sequence difference in the Illumina reads and two sequence differences in the longer 454 reads.

SOLiD sequence alignment

The Applied Biosystems SOLiD alignment tool translates the reference sequence to di-base encoding (“color-space”) and aligns the reads in color space. The program guarantees finding all alignments between a read and the reference sequence with up to M mismatches (a user-specified parameter). Applied Biosystems SOLiD reads were mapped to the Pichia genome allowing up to three mismatches for each read. The alignment tool uses multiple spaced seeds (discontinuous word patterns) to achieve a rapid running time.

Acknowledgments

Author contributions to this work are as follows: D.R.S., project initiation, design and coordination, mutation analysis. K.M., 454 FLX sequencing. W.T. and L.S., initial 454 data analysis and comparison with SOLiD data. B.W., Sanger sequencing confirmation and analysis. W.F.D., SOLiD library construction. N.T., R&D manager. H.E.P., development of SOLiD consensus algorithm, mutation detection, manuscript preparation. S.S.R., library development. J.B.W., C.C.L. and B.E.C., SOLiD emulsion and sequencing. Z.Z., SOLiD alignment algorithm. S.F.M., J.A.M., and J.M.S., development of SOLiD consensus algorithm. A.P.B., development of 2-base encoding. K.J.M., analysis and manuscript preparation. D.H. and F.C., 454 and Illumina sample prep and data generation. J.C. and D.S.R., initial Illumina data analysis. P.M.R., experimental design, coordination, manuscript preparation. A.R.Q., mutation detection, sequence mapping, data analysis, and manuscript preparation. M.P.S., read mapping. D.A.S., structural variation discovery. L.Z., read mapping. G.T.M., mutation detection and manuscript preparation.

References

Burke, D., Dawson, D., and Stearns, T., eds. 2000. Methods in yeast genetics. Cold Spring Harbor Laboratory course manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
Gordon, D., Abajian, C., and Green, P. 1998. Consed: A graphical tool for sequence finishing. Genome Res. 8: 195–202.
Jeffries, T.W., Grigoriev, I.V., Grimwood, J., Laplaza, J.M., Aerts, A., Salamov, A., Schmutz, J., Lindquist, E., Dehal, P., Shapiro, H., et al. 2007. Genome sequence of the lignocellulose-bioconverting and xylose-fermenting yeast Pichia stipitis. Nat. Biotechnol. 25: 319–326.
Marth, G.T., Korf, I., Yandell, M.D., Yeh, R.T., Gu, Z., Zakeri, H., Stitziel, N.O., Hillier, L., Kwok, P.Y., and Gish, W.R. 1999. A general approach to single-nucleotide polymorphism discovery. Nat. Genet. 23: 452–456.
Parekh, S.R., Parekh, R.S., and Wayman, M. 1988. Fermentation of xylose and cellobiose by Pichia stipitis and Brettanomyces clausenii. Appl. Biochem. Biotechnol. 18: 325–338.
Quinlan, A.R., Stewart, D.A., Stromberg, M.P., and Marth, G.T. 2008. Pyrobayes: An improved base caller for SNP discovery in pyrosequences. Nat. Methods 5: 179–181.
Richly, E. and Leister, D. 2004. NUMTs in sequenced eukaryotic genomes. Mol. Biol. Evol. 21: 1081–1084.
Schuster, S.C. 2008. Next-generation sequencing transforms today’s biology. Nat. Methods 5: 16–18.
Shi, N.Q., Davis, B., Sherman, F., Cruz, J., and Jeffries, T.W. 1999. Disruption of the cytochrome c gene in xylose-utilizing yeast Pichia stipitis leads to higher ethanol production. Yeast 15: 1021–1030.
Stephens, M., Sloan, J.S., Robertson, P.D., Scheet, P., and Nickerson, D.A. 2006. Automating sequence-based detection and genotyping of SNPs from diploid samples. Nat. Genet. 38: 375–381.
Suh, S.O., Marshall, C.J., McHugh, J.V., and Blackwell, M. 2003. Wood ingestion by passalid beetles in the presence of xylose-fermenting gut yeasts. Mol. Ecol. 12: 3137–3145.
Valouev, A., Ichikawa, J., Tonthat, T., Stuart, J., Ranade, S., Peckham, H., Zeng, K., Malek, J.A., Costa, G., McKernan, K., et al. 2008. A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Res. 1051–1063.

Received February 22, 2008; accepted in revised form July 10, 2008.
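The read-filtering criteria stated in the Methods for the Illumina and 454 alignments (at least 95% of each read aligned to the reference; at most one sequence difference for Illumina reads and two for 454 reads) can be illustrated with a short sketch. This is not MOSAIK code: the `AlignedRead` record, its field names, and the `passes_filter` function are hypothetical names introduced only for illustration, and how MOSAIK counts aligned bases internally is an assumption.

```python
from dataclasses import dataclass

# Thresholds taken from the Methods text; the constant names are hypothetical.
MIN_ALIGNED_FRACTION = 0.95
MAX_DIFFERENCES = {"illumina": 1, "454": 2}


@dataclass
class AlignedRead:
    """Hypothetical summary of one read-to-reference alignment."""
    platform: str        # "illumina" or "454"
    read_length: int     # total number of bases in the read
    aligned_length: int  # number of read bases covered by the alignment
    differences: int     # mismatches + insertions + deletions vs. the reference


def passes_filter(read: AlignedRead) -> bool:
    """Return True if the read meets both criteria described in the Methods."""
    aligned_fraction = read.aligned_length / read.read_length
    if aligned_fraction < MIN_ALIGNED_FRACTION:
        return False
    return read.differences <= MAX_DIFFERENCES[read.platform]


if __name__ == "__main__":
    reads = [
        AlignedRead("illumina", 32, 32, 1),  # kept: fully aligned, 1 difference
        AlignedRead("illumina", 32, 32, 2),  # dropped: too many differences
        AlignedRead("454", 250, 240, 2),     # kept: 96% aligned, 2 differences
        AlignedRead("454", 250, 230, 0),     # dropped: only 92% aligned
    ]
    for r in reads:
        print(r.platform, r.read_length, passes_filter(r))
```

Keeping the per-platform limits in a lookup table mirrors the text's point that the longer 454 reads are allowed one more difference than the 32-base Illumina reads.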

Genome Research 5 www.genome.org Downloaded from genome.cshlp.org on August 19, 2008 - Published by Cold Spring Harbor Laboratory Press

Resource EagleView: A genome assembly viewer for next-generation sequencing technologies

Weichun Huang and Gabor Marth1 Department of Biology, Boston College, Chestnut Hill, Massachusetts 02467, USA

The emergence of high-throughput next-generation sequencing technologies (e.g., 454 Life Sciences [Roche], Illumina sequencing [formerly Solexa sequencing]) has dramatically sped up whole-genome de novo sequencing and resequencing. While the low cost of these sequencing technologies provides an unparalleled opportunity for genome-wide polymorphism discovery, the analysis of the new data types and huge data volume poses formidable informatics challenges for base calling, read alignment and genome assembly, polymorphism detection, as well as data visualization. We introduce a new data integration and visualization tool EagleView to facilitate data analyses, visual validation, and hypothesis generation. EagleView can handle a large genome assembly of millions of reads. It supports a compact assembly view, multiple navigation modes, and a pinpoint view of technology-specific trace information. Moreover, EagleView supports viewing coassembly of mixed-type reads from different technologies and supports integrating genome feature annotations into genome assemblies. EagleView has been used in our own lab and by over 100 research labs worldwide for next-generation sequence analyses. The EagleView software is freely available for not-for-profit use at http://bioinformatics.bc.edu/marthlab/EagleView. [Supplemental material is available online at www.genome.org.]

In the past three years, the emergence of massively parallel sequencing technologies has dramatically reduced time and costs for whole-genome sequencing. For example, the current 454 Life Sciences (Roche) GS FLX system, which can produce 100 million bases per run in less than eight hours, is hundreds of times faster and over 10 times cheaper than the conventional Sanger capillary sequencing. The Illumina sequencing (formerly Solexa sequencing) technology is able to generate over one billion bases of high-quality DNA sequence per run at less than 1% of the cost of capillary sequencing. Such technological advances will soon make it possible to sequence individual human genomes within a short timeframe and at an affordable price. The emergence of new, even faster technologies (e.g., Pacific Biosciences’ technology) has the potential to make the 1000-dollar human genome a reality. The new sequencing technologies make possible comprehensive genetic and epigenetic variation analysis (Barski et al. 2007; Mikkelsen et al. 2007), regulatory element identification (Robertson et al. 2007), structural variation discovery (Swaminathan et al. 2007), and transcriptome quantification (Ng et al. 2006). The huge volume of new sequencing data, the relatively shorter read lengths, and the different error models of new sequencing technologies, however, present us with difficult informatics challenges.

One of the main challenges is data visualization. Visualization is an essential requirement for many data analyses including but not limited to the following tasks. (1) Uncovering errors in sequence read mapping, alignment, and assembly. Erroneous read mapping to paralogous regions, as well as local alignment and assembly errors, lead to false single nucleotide polymorphism (SNP) calls. Visual inspection can reveal these errors. (2) Software development and testing for downstream analysis. The development of assembly algorithms and polymorphism discovery tools requires rigorous software testing, which is greatly facilitated by the display of base discrepancies, machine signals, and base quality values. (3) Data validation. Experimental data validation often requires that we view additional sequences collected for verification together with the primary assembly data. (4) Data interpretation and hypothesis generation. The interpretation of candidate polymorphism sites (e.g., SNP) in a genomic context requires integration of genome annotation data (e.g., gene structure) into the assembly view. This integration in turn facilitates hypothesis generation for follow-up experimentation.

To fulfill these functions the visualization tool must be able to handle large genome assemblies of millions of reads, display mixed-type sequence reads with trace signals simultaneously, and display complex genome annotations. Existing assembly viewers such as consed (Gordon et al. 1998) and Hawkeye (Schatz et al. 2007) were designed for genome assemblies of Sanger capillary sequence reads and do not yet have effective support for next-generation sequence reads. For example, consed does not offer a compact assembly view and has very limited support for annotations (only displays colored read or consensus tags). Loading large assemblies into consed requires a large amount of memory not typically available to most users. Hawkeye has similar memory limitations for large genome assemblies and has no support for viewing genome feature annotations. Neither tool supports viewing technology-specific trace signals for assemblies of mixed-type reads from different sequencing technologies. Inclusion of the above features was the main design consideration for our new assembly viewer, EagleView, for supporting genome assemblies of next-generation sequencing technologies.

1Corresponding author. E-mail [email protected]; fax (617) 552-2011. Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.076067.108.

Genome Research 18:000–000 ©2008 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/08; www.genome.org

Results

EagleView is a user-friendly viewer with a single-window GUI. Its feature set was specifically designed for visualization of large genome assemblies of next-generation sequence reads (see Fig. 1; Table 1). In order to utilize screen space effectively, EagleView offers a compact assembly view (i.e., reads are optimally placed in multiple lines, each having multiple reads) and displays technology-specific trace signals using a pinpoint view. It can display assemblies of mixed-type reads with the appropriate trace information. Importantly, EagleView has extensive support for displaying genome annotation tracks as well as user-defined sequence features. It allows navigation by genome location (padded or unpadded), read id, annotation feature, or any user-defined coordinate map. It also supports zooming, and customizable fonts and colors. EagleView comes with detailed documentation and is distributed as binary installation packages for the three major operating systems (Windows, Linux, and Mac). The software is available at the authors’ websites.

Figure 1. Illustration of EagleView features. The EagleView shown in the figure is the version for Microsoft Windows. All features except the mouse-tip window shown in the figure are also available for both Linux and Mac versions. The upper part shows a genome assembly of 454 sequence reads; the lower part displays an assembly of Illumina reads.

Table 1. EagleView feature list

Feature category: Features
View: Compact view of assembly with zooming capability; Pinpoint view of base quality; Pinpoint view of technology-specific sequence trace; Pinpoint view of read id and strand
Navigation: Navigation by both unpadded and padded positions; Navigation by genomic features or user-defined locations; Navigation by read id and contig id
Efficiency: Fast and memory efficient; Supporting large genome assemblies of millions of reads
Data integration: Genome features (e.g., gene, exon, intron); Polymorphism data (e.g., SNP); 454 flowgram trace; Illumina four color raw signals
Operating systems: Supporting both 32-bit and 64-bit versions of operating systems including Windows, Linux, and Mac OSX
Others: Distinct mark for discrepancy sites; Customizable font and color for viewer; Printing capability; Data preparation tools included

Computational efficiency

A typical genome assembly of next-generation sequencing technologies may contain hundreds of millions of reads, reaching an assembly file size of many gigabytes. Regions in the assembly may have hundred- or thousand-fold coverage. Computational efficiency, especially memory usage, is therefore a critical issue. We compared EagleView’s computational requirements to consed (version 16.0) and Hawkeye (version 2.0.4) on a data set that was possible to load with all three programs (see Methods). This data set consists of nearly seven million 32-base reads from genome resequencing of the K-12 strain of Escherichia coli by the Illumina sequencing technology. We found that the CPU time used by EagleView during loading the assembly was not significantly different from the two other programs. However, EagleView required less than one fourth of the memory used by consed or Hawkeye (see Table 2). We also tested the three tools on two larger genome assemblies of C. elegans chromosomes consisting of over 14 and 19 million 32-base Illumina reads, respectively. Consed and Hawkeye were unable to load either one of these two assemblies on our 24 GB RAM Linux server, whereas EagleView successfully opened both.

Table 2. Efficiency comparison

Tool        Version   CPU time (min:sec)   Memory usage
consed      16.0      4:16                 15.06 GB
Hawkeye     2.0.4     6:09                 14.23 GB
EagleView   1.6       4:08                 3.36 GB

The genome assembly for the assessment is of length 4,661,217 bases and consists of 6,872,388 Illumina 32-base reads. The assessment was based on 64-bit Linux versions of all three tools. Testing took place on a 64-bit Linux server with 24-GB memory.

Genome assembly inspection, software development, and debugging

An important application of the viewer is sequence assembly inspection. In de novo assemblies one looks, e.g., for erroneous joins between contigs based on spurious read overlaps, areas of low sequence coverage, or regions covered by only low-quality reads or reads from only one strand. In reference sequence guided assemblies, one looks for erroneously mapped reads representing duplicated, paralogous or repetitive genome regions, and local misalignments due to sequencing errors, typically because of consecutive insertion/deletion errors. Identification of such mapping, alignment, and assembly errors helps software development because it can pinpoint algorithmic weaknesses. For example, we used EagleView to identify local misalignments of 454 reads where different base insertion errors within three reads were aligned as a base substitution (e.g., Supplemental Fig. 1). We used these examples to develop a 454-specific scoring scheme in our alignment program MOSAIK. We also used examples of erroneously mapped reads to improve the mapping accuracy of MOSAIK. EagleView has several key features that help assembly inspection. First, the zoomed-out view allows users to scroll through the entire assembly and scan for regions of assembly errors. The ability to zoom in allows users to closely inspect such regions. EagleView marks bases in red within reads that are discrepant relative to the genome reference or contig consensus sequence. Regions with a high number of such discrepancies provide a visual cue for possible assembly errors. The features supporting the inspection of base quality values and underlying machine signals allow users to distinguish between true discrepancies and base calling errors.

Validation of candidate polymorphisms

Manual checking of candidate polymorphisms in resequencing data is important because current computational polymorphism discovery tools for the new sequencing technologies are still in an early developmental stage. Mismatches between erroneously aligned reads and the reference genome representing paralogous differences between duplicated genome regions give rise to false candidate polymorphisms. Similarly, if a read is locally misaligned, the misaligned base is often called by the polymorphism discovery software as a candidate polymorphism. EagleView allows users to manually check and identify such falsely called candidates.

In addition to the manual inspection of the primary data, validation of, e.g., candidate polymorphisms often involves the collection of additional sequence data by a different sequencing technology. The inspection of such experimental validation data, together with the primary sequence reads used in the discovery process, requires that we can combine reads from multiple different sequencing platforms in a single assembly view. EagleView supports the assembly view of mixed-type reads of next-generation sequencing technologies. The capability allows one to view coassemblies of, e.g., 454 and Illumina reads, and inspect trace signals and compare sequence reads between technologies at candidate polymorphisms.

Data interpretation and hypothesis generation

Often what users want to know after candidate polymorphisms are extracted is whether these candidates fall within genes, exons, splice sites, or regulatory regions. This information is essential to assess the potential significance of a given variant, to point to genes that may be phenotypically important, and thus guide further experimentation. An essential feature of EagleView in this regard is the extensive support for integrating genome feature annotations together with the primary assembly data (Supplemental Fig. 3). It supports the importation of annotations of various classes, the display of specific feature id (e.g., gene name and exon ID), as well as the definition of user-defined features (e.g., candidate SNP sites). Additionally, EagleView supports navigation by feature map positions. This is useful, e.g., to rapidly scroll through every candidate polymorphism site, or to inspect every exon in a given genome region.

Application examples

1. We have used EagleView for studying the sequencing error profile of 454 pyrosequencing technology and its implications



Figure 2. The genome assembly of human chromosome Y with real SNP position map. (A) A single SNP site with ID rs1053790, heterozygote frequency (HF) 0.26, and dbSNP validation status (VS) 2 (at least one sub-SNP in cluster has frequency data submitted). (B) A region with high density of SNPs. At the SNP site at position 57,440,427, a sequencing error and an alignment error potentially contribute to the wrong genotype call A/C/G, where the true genotype is C/G. (C) A deletion under the SNP site is due to an alignment error.

for improving sequence read alignment/assembly tools. The data for the study contain two runs of Helicobacter pylori genome resequencing reads generated from the 454 GS20 sequencing machine. We used the Smith-Waterman-based ACANA tool (Huang et al. 2006) to estimate the error profile by aligning a random sample of 10,000 reads from the entire data set (610,000 reads) to the H. pylori reference genome (1,700,000 bp). We found that the average error rate is markedly increased along the length of homopolymers, with the overcall error rate increasing more rapidly (Supplemental Fig. 4). Such a 454 sequencing error pattern potentially causes many alignment/assembly errors if the alignment algorithm does not take the error profile into consideration. Using EagleView, we examined and identified different types of alignment errors resulting from consecutive insertions and deletions in the 454 sequences (Supplemental Fig. 1), and used this information to improve our alignment algorithm for 454 sequences.

2. We used EagleView to manually inspect SNP candidates that we identified computationally between the Bristol and Pasadena strains of Caenorhabditis elegans in a large-scale genome resequencing study using the Illumina sequencers (Hillier et al. 2008). In SNP discovery, false SNP calls can result from alignment errors, sequencing errors, gene paralogs, or from defects in the SNP detection algorithm. Manual inspection using EagleView allowed us to identify the exact problems and helped us improve our assembly and SNP detection algorithms. We also used EagleView’s bird’s-eye-view feature to quickly scroll through and spot-check the C. elegans genome assembly (Supplemental Fig. 2; Hillier et al. 2008).

3. We used EagleView to examine human polymorphism data in the context of gene annotations. This type of analysis is gaining importance as large, comprehensive human genome resequencing projects (e.g., the international 1000 Genomes Project) are gearing up. To facilitate the comparison, e.g., between SNPs discovered in the 1000 Genomes data and known genetic variants, we have constructed MAP files (i.e., feature annotation files in the format required by EagleView) from SNPs contained in dbSNP (build no. 128) and in the HapMap project (release no. 22). In addition, we constructed MAP files from the known human transcripts including mRNA and EST from the NCBI genome annotation (build no. 36) to enable visual inspection of genetic polymorphisms within the genome context. All these MAP files are available at the EagleView Web site. To demonstrate analyses that will be typical for whole-genome, multi-individual, human resequencing data, we performed the following two experiments. In the first experiment, we generated 20-fold simulated Illumina read coverage of the human Y chromosome, and aligned the reads with our reference guided alignment program MOSAIK. We then used EagleView to examine assembly errors and sequencing errors in the read data that would lead to false positive SNP candidates or wrong genotype calls (Fig. 2). In the second experiment, we used EagleView to inspect real human polymorphism sites identified by new Illumina sequencing data from the 1000 Genomes Project (http://www.1000genomes.org). We used EagleView’s feature navigation function to find and compare polymorphism map differences near gene regions among four subpopulations: Yoruba (YRI), Japanese (JPT), Chinese (CHB), and European (CEU) (Supplemental Fig. 5). We also examined discrepancies of SNP genotypes between the HapMap project and the new assembly data from the 1000 Genomes Project (Fig. 3).

Figure 3. Two examples of SNP genotype discrepancies between the HapMap project and the new assembly data from the 1000 Genomes Project. In the figure, the HapMap SNP ID and genotypes are shown in white bold font. In the left panel, the assembly shows a rare allele C in the position not reported in the HapMap. In the right panel, the assembly shows a deletion SNP but HapMap reports that it is an A/T SNP. The deletion SNP is likely to be true as Illumina sequencing technology has a very low rate of insertion/deletion sequencing errors.

Discussion

We have been using the early version of EagleView successfully in our data mining projects and for the development of our sequence analysis tools. We realize that additional features will be necessary. Efforts are underway to standardize next-generation read (http://sourceforge.net/projects/srf/) and assembly (http://assembly.bc.edu) formats. The new binary formats, combined with effective indexing of the assembled reads, will enable substantial reduction in memory usage and loading time. Future versions of EagleView will support these formats. We will also support the GFF3 annotation format in addition to our proprietary MAP file format. We will expand our data integration capabilities by including the visualization of, e.g., microarray-based gene expression levels. We will include trace views for the newest sequencing technologies (e.g., Applied Biosystems’ SOLiD and Helicos’ tSMS). We will support the visualization of paired-end reads, and customizable coloring schemes for identifying reads from a given technology, and/or reads representing the same DNA template. Finally, we will integrate analysis tools, e.g., our polymorphism discovery tool, into the viewer application.

In summary, EagleView is the first visualization tool specifically designed for next-generation sequencing technologies. EagleView has already proved to be an essential tool in our development of informatics software for genome assembly and polymorphism discovery. We expect that it will also be useful in the many other applications of next-generation sequencing technologies. The EagleView software is available at no charge for not-for-profit use.

Methods

Efficiency test

We tested the efficiency of three tools, consed (ver. 16.0), Hawkeye (ver. 2.0.4), and EagleView (ver. 1.6), on a 64-bit Linux server with 24-GB memory. The 64-bit version of each tool was used for this test. The genome assembly file used for this test is a reference-based genome assembly of E. coli K-12 genome by Illumina sequencing technology from our collaborators at the Washington University Genome Sequencing Center (WUGSC). The assembly contains a reference genome of length 4,661,217 bases and 6,872,388 Illumina 32-base reads. This data set was selected because the assembly could be loaded on our 24-GB memory Linux server by all three programs. In the test, both consed and EagleView loaded the assembly file in the ACE format, while Hawkeye loaded the assembly file in its native bank format converted from the ACE assembly file. The CPU time and memory usage for each tool were measured after it loaded and displayed Contig view. Two larger testing assemblies were subsets of the whole-genome resequencing study of C. elegans of which the primary sequencing data were also from WUGSC (Hillier et al. 2008). The two larger assemblies contain 14,562,818 and 19,566,095 Illumina 32-base reads, respectively. All assembly files are available at the EagleView Web site.

Data file formats

EagleView reads a genome assembly file in the standard ACE format, a tag-based format commonly used by genome assembly programs (a detailed description of the ACE format is available at http://bozeman.mbt.washington.edu/consed/consed.html). EagleView uses three optional, auxiliary data files: READS, EGL, and MAP files (see Table 3). The READS and EGL files are paired together for storing base qualities and technology-specific trace signals of sequence reads. The READS file contains all read data while the EGL file is just the indexes of the contig start locations in the corresponding READS file. EagleView automatically loads base quality and trace information if both the READS and the EGL files are present in the same directory as the ACE assembly file. The MAP file is for storing location mapping information of genome features, such as genes, exons, or SNPs. If present, the MAP file is also loaded automatically. All three optional files are in tab-delimited text formats (detailed format descriptions are provided in the EagleView documentation).

Table 3. EagleView data files

File type                                    File extension
Genome assembly file                         ACE
Base quality/flow trace data file            READS
Contig address index file for READS file     EGL
Genome feature map file                      MAP

Files are identified by the file extension.

Utility tools

EagleView comes with three data conversion tools to prepare the optional data files. EagleIndexFasta converts FASTA files containing base quality and read trace information to the corresponding READS and EGL files. EagleIndexSff and EagleIndexSffM, both specific to 454 reads, extract base quality and flow signal information from the 454 binary SFF files and convert it into the READS and EGL formats. EagleIndexSff converts from a single SFF file; EagleIndexSffM can convert from multiple SFF files. Detailed usage is described in EagleView’s documentation.

Acknowledgments

We thank Dr. Elaine R. Mardis at WUGSC for providing 454 and Illumina sequence data for our software testing. We also thank all EagleView beta testers for their helpful feedback. This research was supported by a grant to G.M. (no. R01 HG003698) from the National Human Genome Research Institute, National Institutes of Health.

References

Barski, A., Cuddapah, S., Cui, K., Roh, T.Y., Schones, D.E., Wang, Z., Wei, G., Chepelev, I., and Zhao, K. 2007. High-resolution profiling of histone methylations in the human genome. Cell 129: 823–837.
Gordon, D., Abajian, C., and Green, P. 1998. Consed: A graphical tool for sequence finishing. Genome Res. 8: 195–202.
Hillier, L.W., Marth, G.T., Quinlan, A.R., Dooling, D., Fewell, G., Barnett, D., Fox, P., Glasscock, J.I., Hickenbotham, M., Huang, W., et al. 2008. Whole-genome sequencing and variant discovery in C. elegans. Nat. Methods 5: 183–188.
Huang, W., Umbach, D.M., and Li, L. 2006. Accurate anchoring alignment of divergent sequences. Bioinformatics 22: 29–34.
Mikkelsen, T.S., Ku, M., Jaffe, D.B., Issac, B., Lieberman, E., Giannoukos, G., Alvarez, P., Brockman, W., Kim, T.K., Koche, R.P., et al. 2007. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448: 553–560.
Ng, P., Tan, J.J., Ooi, H.S., Lee, Y.L., Chiu, K.P., Fullwood, M.J., Srinivasan, K.G., Perbost, C., Du, L., Sung, W.K., et al. 2006. Multiplex sequencing of paired-end ditags (MS-PET): A strategy for the ultra-high-throughput analysis of transcriptomes and genomes. Nucleic Acids Res. 34: e84. doi: 10.1093/nar/gkl444.
Robertson, G., Hirst, M., Bainbridge, M., Bilenky, M., Zhao, Y., Zeng, T., Euskirchen, G., Bernier, B., Varhol, R., Delaney, A., et al. 2007. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods 4: 651–657.
Schatz, M.C., Phillippy, A.M., Shneiderman, B., and Salzberg, S.L. 2007. Hawkeye: An interactive visual analytics tool for genome assemblies. Genome Biol. 8: R34. doi: 10.1186/gb-2007-8-3-r34.
Swaminathan, K., Varala, K., and Hudson, M.E. 2007. Global repeat discovery and estimation of genomic copy number in a large, complex genome using a high-throughput 454 sequence survey. BMC Genomics 8: 132. doi: 10.1186/1471-2164-8-132.

Received January 6, 2008; accepted in revised form June 5, 2008.
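The compact assembly view described in the Results (reads placed on multiple lines, each line holding multiple reads) can be illustrated with a small layout sketch. The paper does not describe EagleView's actual placement algorithm, so the following is an assumption: a standard greedy interval-packing approach in which each read, taken in order of start position, reuses the display row that became free earliest. The function name `pack_reads` and the `gap` parameter are hypothetical.

```python
import heapq
from typing import List, Tuple


def pack_reads(reads: List[Tuple[int, int]], gap: int = 1) -> List[int]:
    """Assign each read to a display row so that reads on a row do not overlap.

    reads: (start, end) reference coordinates, end inclusive.
    gap:   minimum number of blank columns between two reads on the same row.
    Returns the row index of each read, in the same order as the input.
    """
    # Visit reads from left to right by start coordinate.
    order = sorted(range(len(reads)), key=lambda i: reads[i][0])
    rows = [0] * len(reads)
    open_rows: list = []  # min-heap of (end coordinate of last read, row index)
    n_rows = 0

    for i in order:
        start, end = reads[i]
        if open_rows and open_rows[0][0] + gap < start:
            # The row that ends earliest has room: reuse it.
            _, row = heapq.heappop(open_rows)
        else:
            # Every existing row is still occupied at this position.
            row = n_rows
            n_rows += 1
        rows[i] = row
        heapq.heappush(open_rows, (end, row))
    return rows


if __name__ == "__main__":
    # Four overlapping 32-base reads fit on two rows instead of four.
    example = [(0, 31), (10, 41), (33, 64), (43, 74)]
    print(pack_reads(example))
```

With one read per row, the four example reads would need four lines; packing them by earliest free row needs only two, which is the kind of screen-space saving the compact view aims for.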

ARTICLES

Whole-genome sequencing and variant discovery in C. elegans

LaDeana W Hillier1,3, Gabor T Marth2,3, Aaron R Quinlan2, David Dooling1, Ginger Fewell1, Derek Barnett2, Paul Fox1, Jarret I Glasscock1, Matthew Hickenbotham1, Weichun Huang2, Vincent J Magrini1, Ryan J Richt1, Sacha N Sander1, Donald A Stewart2, Michael Stromberg2, Eric F Tsung2, Todd Wylie1, Tim Schedl1, Richard K Wilson1 & Elaine R Mardis1

© 2008 Nature Publishing Group http://www.nature.com/naturemethods

Massively parallel sequencing instruments enable rapid and inexpensive DNA sequence data production. Because these instruments are new, their data require characterization with respect to accuracy and utility. To address this, we sequenced a Caenorhabditis elegans N2 Bristol strain isolate using the Solexa Sequence Analyzer, and compared the reads to the reference genome to characterize the data and to evaluate coverage and representation. Massively parallel sequencing facilitates strain-to-reference comparison for genome-wide sequence variant discovery. Owing to the short-read-length sequences produced, we developed a revised approach to determine the regions of the genome to which short reads could be uniquely mapped. We then aligned Solexa reads from C. elegans strain CB4858 to the reference, and screened for single-nucleotide polymorphisms (SNPs) and small indels. This study demonstrates the utility of massively parallel short read sequencing for whole genome resequencing and for accurate discovery of genome-wide polymorphisms.

In 1998 the decoding of the first animal genome sequence, that of C. elegans, was published1. C. elegans was first suggested as a model organism in the 1960s by Sydney Brenner, and subsequent work produced a physical map of its genome2. As a result, the C. elegans genome sequencing project formed the cornerstone of efforts ultimately aimed at decoding the human genome3,4. The entire C. elegans biology community has benefited enormously from the availability of the genome sequence and the ever-improving genome annotation5, and from the comparative power of the availability of sequenced genomes for C. elegans' relatives such as C. briggsae6.

The emerging availability of massively parallel sequencing instrumentation provides the capability to resequence genomes in a fraction of the time, effort and expense than ever before. Compared to capillary sequencing, these instruments produce relatively short-read-length sequences that require characterization, including read error profiles and base call accuracy (which we refer to as base quality) values. Furthermore, the general utility of short read sequences, coverage models for resequencing and approaches for read mapping to reference genomes requires investigation. To address these, we sequenced an isolate of the C. elegans N2 Bristol strain using the Solexa Sequence Analyzer (Illumina Inc.). Our analyses of these sequences included (i) an elucidation of the Solexa read error model, (ii) an evaluation of sequence differences between the two isolates and (iii) identification and investigation of representational biases in Solexa data. We revealed possible sequencing errors in the C. elegans reference genome, and putative variants that had occurred in our passaged N2 Bristol strain.

Massively parallel sequencing can be applied to strain-to-reference comparisons that reveal genome-wide sequence differences, either for evolutionary studies or for discovering genetic variation that may explain phenotypic variation. Implementing this application requires a new approach that assesses the fraction of a genome to which short read sequences can be uniquely mapped because they are more susceptible to multiple placements than are longer capillary instrument–derived sequences. Computational identification and markup of these 'microrepeats' is therefore an important precursor to accurate short-read analysis, and must allow for mismatches resulting from sequencing errors or polymorphisms. We aligned Solexa sequence reads from the C. elegans strain CB4858 (originally isolated in Pasadena, California, USA)7 to the microrepeat masked N2 Bristol reference sequence, and identified SNPs and small indels with a modified PolyBayes8 version. Orthologous validation yielded a high validation rate.

RESULTS

Experimental design
In this study we explored two applications of Solexa sequencing: (i) genome resequencing and (ii) genome-wide polymorphism discovery. For the first application, we sequenced an isolate of the C. elegans N2 Bristol strain at a high coverage depth with single-end reads and at a much lower coverage depth with paired-end reads. Using the reference genome as our alignment target, we determined

1Washington University School of Medicine, Department of Genetics and Genome Sequencing Center, 4444 Forest Park Blvd., St. Louis, Missouri 63108, USA. 2Boston College, Department of Biology, 140 Commonwealth Ave., Chestnut Hill, Massachusetts 02467, USA. 3These authors contributed equally to this work. Correspondence should be addressed to E.R.M. ([email protected]). RECEIVED 19 SEPTEMBER 2007; ACCEPTED 21 DECEMBER 2007; PUBLISHED ONLINE 20 JANUARY 2008; DOI:10.1038/NMETH.1179

NATURE METHODS | VOL.5 NO.2 | FEBRUARY 2008 | 183 ARTICLES

an accuracy estimate and an error model for Solexa reads. We next aligned all reads possible using a tiered approach (Fig. 1), identified sequence differences between the two isolates and evaluated both representational bias and copy-number detection.

We developed a genome-wide polymorphism discovery approach by first sequencing C. elegans strain CB4858, using Solexa single-end reads of about ninefold coverage. To decrease the possibility of erroneous variant detection because of paralogous read placements, we identified and masked 'microrepeat' regions in the genome based on a 32-bp read length. We then aligned CB4858 reads to the reference genome using Mosaik and applied a modified PolyBayes version to detect variants. Our predicted polymorphic sites were validated by PCR amplification and Sanger sequencing at a high rate.

[Figure 1: flowchart. Box labels as extracted: 85,498,844 Solexa N2 Bristol reads; exact match to C. elegans reference genome, catalog positions, 64,636,331 (75.55%) reads; BLAT to E. coli or to Solexa primers, 1,525,093 (1.78%) reads and 894,698 (1.05%) reads; no match to C. elegans, E. coli or Solexa primers, 18,442,727 reads; quality trim reads (keeping only reads ≥20 bp at ≥Q25), 6,699,337 reads; exact match to C. elegans, 4,855,984 reads; no match to C. elegans, 1,843,353 reads; phrap-based assembly; BLAT-based alignments to identify non-exact matches to the C. elegans genome.]

Figure 1 | N2 Bristol Solexa read analysis. The diagram shows the processing steps used to evaluate Solexa single-end reads from the N2 Bristol isolate. The majority of reads mapped exactly to the reference genome.

Resequencing a C. elegans N2 Bristol strain isolate
We used the single-end C. elegans N2 Bristol reads to evaluate the overall accuracy and quality of Solexa pipeline passed reads. Table 1 provides several metrics of our Solexa single-end read dataset, including Eland alignment results to the ws170 release of the C. elegans reference genome9 (http://www.wormbase.org). Based on the Eland metrics, we estimated 20-fold coverage of N2 Bristol for the quality passed and aligned single-end reads.

Table 1 | Solexa run metrics for N2 Bristol and CB4858 single-end reads

Genome | Number of runs | Total bases generated | Total passed bases | Percentage passed bases | Percentage aligning (ws170) | Percentage error (alignment-based)
N2 Bristol | 3.5 | 4.06 Gb | 2.67 Gb | 66% | 79% | 0.6%
CB4858 | 1.5 | 2.52 Gb | 1.35 Gb | 54% | 67% | 0.52%

Solexa run metrics obtained for the combined 30 and 32 bp single-end reads from both the N2 Bristol and CB4858 isolates. Results, including the total number of bases generated, the total number of passed (for example, high quality) bases, and the percentage of aligning reads were obtained from the output of the Solexa-provided data analysis pipeline. The Eland-generated error rate is reported, based on the reference genome alignments of Solexa passed reads.

We performed read alignment with EagleDiscoverer, and subsequent error analysis revealed that 57.2% of the uniquely mapping single-end reads contained zero mismatches and 79.9% had 0 or 1 mismatch. We determined the full distribution of mismatches for the Solexa N2 Bristol reads (Fig. 2), and the position-specific dependency of Solexa base calls at phred qualities of 25 and 30 (Fig. 3), which illustrates that the base accuracy for a given base quality value depends upon that base's position in the sequence read.

To determine Solexa N2 Bristol single-end read coverage, we first devised an iterative read-alignment strategy for these reads (Fig. 1 and Supplementary Methods online). We then determined the average coverage of the genome to be 19.2 (s.d. = 9.0; Supplementary Fig. 1 online). We examined the genome for over-represented regions and found ~1.7% of the genome had >40-fold coverage in unique 32-mers. We expected ribosomal DNA genes to have higher than average coverage because the reference represents these as single copies but they actually exist in multiple copies. By examining unique 32-mers within rDNA segments, we found evidence of excess coverage (>100 for the chromosome 1 rDNA unique 32-mers). This finding lends credibility to the use of read coverage as a quantitative metric of region-specific copy number. Based on our analysis of regions with higher than average coverage, combined with the assembly and analysis of unmapped reads (see Supplementary Methods), we estimate a maximum of ~0.5 Mb of repetitive sequence is missing from the C. elegans reference genome.

To determine the genome coverage of the sequence reads we proceeded as follows. After aligning exactly matching 32-bp reads, exactly matching quality-trimmed reads (that is, reads with at least 20 consecutive base pairs of quality score ≥20) and quality-trimmed reads with 2 or more mismatches to the reference genome, we found that 99.9% of the unique C. elegans genome was covered, mostly in large spans; the longest was 194 kb on chromosome V. Of the regions left uncovered by Solexa reads, there were 9,492 gaps comprising 95,913 bp. These coverage gaps ranged in size from 1 base to over 1,000 bases; 77.9% were 1–9 bp and another 36.8% were 10–50 bp. If we only consider 32-mer exact mapping reads, the largest coverage gap was a 4,601-bp region on chromosome X (5907815–5912415). Notably, the entire 4,600-bp region is bounded by a transposon (TC5A#DNA/Tc4) in the reference sequence and is completely contained within a single fosmid clone (H05L03) that extends into an overlapping yeast artificial chromosome (YAC; Y23B4A).

There were a total of 1,728 zero-base-pair gaps discovered: areas of the genome where two adjacent reference genome 32-mers were covered by reads that aligned exactly to each 32-mer but across


which no exactly matching spanning read exists. Of these, 1,564 (90%) had single read representation of the flanking 32-mer, consistent with the notion that these regions are under-represented by Solexa reads. Further investigation of these coverage gaps revealed that (i) they are located primarily in noncoding sequence (2% are in exons), (ii) only a few regions could be explained by hairpin formation (see Supplementary Data and Supplementary Table 1 online) and (iii) the A+T content in these regions is substantially higher than the genome average (85% versus 65%, respectively; Supplementary Fig. 2 online). Furthermore, this A+T bias more likely occurs during amplification than during sequencing (see Supplementary Data). We identified 125 zero-base-pair gaps with a non-identical spanning Solexa read, suggesting an insertion in the Solexa-sequenced strain. We resequenced 22 of these and validated them as true differences between the two N2 Bristol isolates.

[Figure 2: histogram of the percentage of reads (y axis) by number of mismatches (x axis): 0 mismatches, 57.20%; 1, 22.73%; 2, 11.24%; 3, 5.87%; 4, 2.96%.]

Figure 2 | Accuracy distribution of N2 Bristol Solexa single-end reads. As described in the text, after alignment of N2 Bristol Solexa reads to the reference genome sequence using EagleDiscoverer and tabulating any differences between the two sequences, we determined that ~80% of the reads exhibited 0 or 1 mismatch when uniquely aligned to the reference genome.

Genomic alignment of nonexact match reads (that is, N2 Bristol single-end reads without an exact match to C. elegans; Supplementary Methods) allowed us to determine differences between the two N2 Bristol isolates and to identify possible errors in the reference sequence as these reads are highly similar but contain inserted or deleted bases that preclude an exact match. Here three or more Solexa reads were required to predict a reference error, to reduce contributions from Solexa base calling errors. Such alignments putatively identified 2,981 insertions, deletions and indels of 1–20 bp. Of these, 2,082 occurred at positions also having exactly matching Solexa reads, thus confirming the reference sequence and indicating an allelic polymorphism between the two N2 Bristol isolates. By contrast, 235 of the indels occurred in regions with no perfectly aligning Solexa read, suggesting a possible error in the C. elegans reference genome, and indicating a potential indel error rate of 1 in 373 kb. We detected 56 different putative deletion events, for which a Solexa read spanned one or more bases in the reference genome, aligning immediately on either side. Alternatively, these could be insertions in the reference genome. Lastly, 53 different putative insertions were suggested (by 502 reads) for which a Solexa read had extra bases relative to the reference. These also could be deletions in the reference.

[Figure 3: plot of calculated base quality (y axis) against base position in the read (x axis, positions 1–32), shown for phred base qualities 25 and 30.]

Figure 3 | Position dependency of base calling accuracy for N2 Bristol Solexa single-end reads. The calculated base quality is shown as a distribution at phred base quality 25 or at base quality 30.

We identified 1,396 nonrepetitive, uncovered regions with at least one read having an unaligned or mismatched base, suggesting a Solexa base-calling error, a polymorphism in the Solexa-sequenced N2 Bristol isolate or a substitution error in the reference genome. Of these, 1,011 were covered by more than one read, and 544 were covered by more than two reads. These suggest a maximum substitution error rate in the C. elegans reference sequence of 1 in 99 kb. We included a limited number of these putative errors in our validation efforts, described below.

We were able to produce and analyze limited numbers of paired-end reads for C. elegans N2 Bristol, providing an average coverage of 0.84 with a mean physical coverage (measured by the span of matching paired ends) of 3.08. Because paired-end reads are used to evaluate structural variation based on deviations in end read distance from expectation10, we determined that 37,352 read pairs had a mapped distance >3 s.d. from the 218-bp average (Supplementary Data; 36,209 were <104 bp and 1,143 were >332 bp). If we required more than one read pair placement to confirm an event, only 5,908 pairs remained (5,670 were <104 bp). Notably, these 5,670 read pairs spaced ~100 bp closer than expected support our estimate that ~0.5 Mb of the C. elegans repetitive genome is missing from the reference genome. The vast majority of multiple read pairs that confirmed a structural variation event were within introns and/or were annotated as repetitive. Many such read pairs confirmed regions already annotated as "difficult to sequence"; about 1.5 times as many fell in genomic regions sequenced from YACs or plasmids (both clone types were used to sequence regions unclonable in cosmids for the reference genome sequencing). For example, on chromosome III, a complex tandem repeat annotated as "restriction digest data indicate 3 kb is missing from the assembly of this region" was identified by 238 Solexa paired ends placed at >3 s.d. apart (50% were 332–400 base pairs apart), further substantiating the initial suspicion of misassembly. Hence, paired-end data enhance the utility of Solexa reads, providing an important tool for identifying putative structural variation.

Polymorphism discovery in C. elegans strain CB4858
We sequenced an isolate of the CB4858 strain using the Solexa technology to produce ~ninefold coverage in single-end reads. This strain was selected because previous work had suggested


a polymorphism rate of 1:1,600 (ref. 11). The Solexa analysis pipeline produced metrics of our Solexa single-end read data for CB4858 (Table 1).

As a precursor to variant discovery in CB4858, we identified regions of the reference genome with a high potential for ambiguous read alignment, based on the Solexa 32-bp read length. First, we identified all unique 32-mers in the reference sequence, but as our error rate analysis (Fig. 2) indicated a drop-off in the error rate beyond 2 errors per read, we defined a repetitive 32-mer as one that appears in the genome more than once, allowing 0–2 mismatched bases (substitutions, insertions or deletions). We called these 'microrepeats' to distinguish them from repeats marked by the RepeatMasker program12, which masks 14.5% of the bases in the genome. The fraction of the genome comprising perfect and near-perfect microrepeats totaled 19.8%. We illustrate the relationship between RepeatMasker-masked bases and microrepeat bases identified by our methods as a Venn diagram (Fig. 4). Although there is a substantial overlap (11.11%) between the regions masked by both methods, 8.7% of the genome that we identified as microrepeats was not masked by RepeatMasker. Conversely, 3.4% of the genome was masked by RepeatMasker only, indicating that some fraction of C. elegans repeat elements can be uniquely sequenced with 32-bp reads. Taken together, RepeatMasker repeats and microrepeats cover 23.2% of the genome.

[Figure 4: Venn diagram. Genome: 100,281,244 bp; microrepeats: 19.8%; RepeatMasker: 14.5%; RepeatMasker (RM) only: 3.4%.]

Figure 4 | Repetitive content in C. elegans. Venn diagram depicting the fraction of bases in the genome covered by microrepeats and by RepeatMasker, and the overlapping set.

Once we aligned CB4858 Solexa reads to the conservatively masked C. elegans genome, we applied our combined repeat masking to filter the alignments, identified high-quality sequence differences with PolyBayes, and finalized a set of 45,539 SNPs and 7,353 single-base-pair indels. This yields a rate of one SNP per 1,629.81 bp and one indel per 9,894.99 bp. Hence the pair-wise nucleotide diversity (theta) between the CB4858 and the N2 Bristol strains is 6.136 × 10^-4, in good agreement with the ~1:1,500 rate posited in a previous description of CB4858 (ref. 11). As 37,856,444 CB4858 Solexa reads yielded a total number of 45,539 SNPs, the 'read-per-SNP' yield was 831. All confirmed CB4858 sequence variants are available in Wormbase.

We orthologously validated roughly 1,000 candidate SNPs and indels by PCR-directed capillary sequencing to gauge the performance of our Mosaik-PolyBayes approach. After sequencing and evaluation, we determined a SNP validation rate of 96.3% (438/455) and an 89.0% conversion rate (438/492) for candidates identified by PolyBayes (Table 2). We sequenced 239 of our putative single-base indels, finding they validated (93.8%) and converted (87.7%) at practically the same rates as SNPs. Both insertions and deletions predicted in the reference genome sequence were represented (insertions: 2,948 or 47.1%, and deletions: 3,316 or 52.9%). Many of the indels were variable numbers of bases in mono-nucleotide repeats, for example, 5 versus 4 adenosines. Although mononucleotide runs are typically very difficult areas for indel detection, our high validation rate indicates that Solexa reads resolve base numbers in these runs very well. Microrepeat masking has a marked impact on accurate SNP discovery by eliminating putative SNPs and indels resulting from paralogous read mapping (Table 2).

We estimated false negative rates for PolyBayes by running PolyPhred13–15 (version 5.0) on the validation trace data. This algorithm indicated PolyBayes had missed 26 SNPs, for a false negative rate of 3.75%.

To determine the chromosomal distribution of CB4858 polymorphisms, we placed CB4858 SNPs and indels along the six C. elegans chromosomes, and identified both chromosome-wide and chromosomal position–specific differences (Supplementary Data and Supplementary Fig. 3 online). Our data confirmed an earlier study in C. elegans16 suggesting that nonsynonymous substitution rates are higher in the first and second codon positions than in the third (Supplementary Fig. 4 online). Furthermore, over half of CB4858 SNPs positioned in exons putatively introduce an amino acid change.
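The SNP density and read-per-SNP figures quoted in this section follow from simple ratios. The short Python sketch below reproduces that arithmetic from the published numbers only. It assumes, as the text implies, that theta is taken as the reciprocal of the bp-per-SNP spacing; the function and constant names are illustrative and do not come from the authors' software.

```python
# Figures quoted in the text (not recomputed from raw sequence data).
BP_PER_SNP = 1629.81        # one SNP per 1,629.81 bp
CB4858_READS = 37_856_444   # Solexa reads used for SNP discovery
SNP_COUNT = 45_539          # finalized SNPs


def pairwise_diversity(bp_per_snp: float) -> float:
    """Per-base SNP density, taken here as the reciprocal of the SNP spacing."""
    return 1.0 / bp_per_snp


def reads_per_snp(reads: int, snps: int) -> float:
    """Average number of reads sequenced for each SNP discovered."""
    return reads / snps


theta = pairwise_diversity(BP_PER_SNP)
read_yield = reads_per_snp(CB4858_READS, SNP_COUNT)

print(f"theta = {theta:.3e} differences per bp")
print(f"reads per SNP = {read_yield:.0f}")
```

Rounded to the precision used in the text, the two results agree with the quoted 6.136 × 10^-4 and 831.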

Table 2 | PolyBayes SNP and indel validation data

Mask type applied | Assay type | Submitted to validation | Assay successful | Sequencing successful | SNP candidate confirmed | Validation rate (%) | Conversion rate (%)
Known repeats | SNP | 598 | 582 | 557 | 482 | 86.5 | 80.6
Exact microrepeats | SNP | 579 | 559 | 518 | 475 | 91.7 | 82.0
Near-exact microrepeats (2 or fewer mismatches) | SNP | 492 | 482 | 458 | 438 | 96.3 | 89.0
Known repeats | Indel | 239 | 228 | 222 | 202 | 91.0 | 84.5
Exact microrepeats | Indel | 232 | 223 | 217 | 201 | 92.6 | 86.6
Near-exact microrepeats (2 or fewer mismatches) | Indel | 220 | 213 | 208 | 193 | 93.8 | 87.7

Validation and conversion rates for PolyBayes-selected SNPs and single base indel candidates. Successive application of masking filters, as described in the text, reduced the number of paralogous placements and identified high confidence putative variant sites.
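The two rate columns of Table 2 can be checked with a few lines of Python. The definitions used here are an assumption inferred from the numbers, since the text does not spell them out: validation rate is taken as confirmed candidates divided by successfully sequenced candidates, and conversion rate as confirmed candidates divided by candidates submitted. The helper name `percent` is illustrative.

```python
def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(100.0 * numerator / denominator, 1)


# 'Known repeats' SNP row of Table 2: 598 submitted, 557 sequenced, 482 confirmed.
known_validation = percent(482, 557)
known_conversion = percent(482, 598)

# Near-exact microrepeat SNP rates, using the fractions quoted in the text
# (438/455 validated, 438/492 converted).
near_exact_validation = percent(438, 455)
near_exact_conversion = percent(438, 492)

print(known_validation, known_conversion)
print(near_exact_validation, near_exact_conversion)
```

Under these assumed definitions the sketch reproduces the published 86.5/80.6 and 96.3/89.0 pairs, which also shows how each successive mask raises both rates.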


DISCUSSION
Massively parallel sequencing approaches hold great promise for genome-wide discovery of sequence variation, when comparing different isolates or strains to reference genomes. It is apparent that short-read technologies must initially be characterized with respect to their quality and accuracy, providing a baseline for devising analytical methods. Dramatically shorter read lengths also increase the coverage level needed for adequate depth and breadth of reads to predict variation with high confidence, when compared to capillary sequencing reads. Although these short reads presently are too short for de novo assembly, producing regional assemblies of resequencing reads, followed by reference genome alignment, apparently has merit for detecting insertions and deletions, and should be pursued in future resequencing efforts.

Paired-end reads clearly increase the power to properly interpret problematic areas of the genome, including collapsed or misassembled repeats, and to detect structural variations. As genomes increase in size and complexity, paired ends will also be more efficiently placed than single-end reads, as only one end of each read pair needs a unique genome placement to properly place most reads, given that a precise paired-end read distance has been achieved in library construction.

Solexa reads provide a rapid vehicle for genome-wide SNP and small indel discovery, once additional masking of 'microrepeat' sequences is achieved. Aside from SNP or indel discovery, whole-genome resequencing also can be used after random mutagenesis to identify and characterize each mutagenized base. Our results establish the utility of short-read-length massively parallel sequencing for the accurate discovery of both single-nucleotide and small insertion-deletion polymorphisms, and establish a framework for human genome resequencing toward similar discovery aims.

METHODS
Determining Solexa single-end read accuracy. To isolate sequencing errors from simple alignment errors, we used a version of the Smith-Waterman–based global alignment algorithm that reports all optimal and suboptimal alignments above a prespecified alignment score (EagleDiscover; W.H., unpublished data). Although time-intensive, this algorithm identifies all alignable positions in the C. elegans genome for a 32-bp read. Here we generated three random sample sets of 20,000 Solexa N2 Bristol single-end reads and aligned each read set to the unmasked reference genome, allowing up to 4 mismatches (substitution, insertion or deletion). For further consideration of accuracy, we kept only reads that aligned at a single locus in the genome. For each of the three read sets we tabulated the number of sequence differences between each read and the reference sequence, and combined the results to make a histogram of reads (Fig. 2). Then we evaluated unique alignments to calculate the observed error rate at each base position for a given Solexa base quality score. We converted these rates to phred scores (Solexa base qualities are expressed as a probability of each of the 4 bases being the correct call rather than as a single phred-like probability of correctness) and graphed the dependence of observed base quality on base position (Fig. 3).

Alignment and analysis of Solexa single-end reads. We compared Solexa N2 Bristol reads to the reference genome to identify sequence variants, to analyze coverage and to evaluate representational bias. These alignments consisted of a combination of exact hash-match based comparisons, followed by BLAST-like alignment tool (BLAT)-based comparisons. Our methods are detailed below, and are presented in a flowchart format (Fig. 1 and Supplementary Data).

Paired-end read evaluation. We mapped paired-end reads (for example, a 25–35 bp read from each end of a ~200-bp genomic fragment) from the N2 Bristol isolate to the C. elegans genome using the exact hash-match based method described above. After read mapping of individual paired ends, we determined final placements by asking that the 'forward' and 'reverse' read of the pair match on the same chromosome and within 1,000 bases of each other.

Mosaik alignment of CB4858 Solexa reads. We identified both perfect microrepeats and microrepeats with up to two mismatches (substitutions, deletions or insertions) to encompass the possibility of sequencing errors (nucleotide misincorporation or base calling) in the reads or of polymorphism in the genomes being compared. Custom scripts then produced a microrepeat-masked reference genome.

We next aligned the Solexa CB4858 single-end reads to the microrepeat-masked C. elegans reference genome with our Mosaik program. Mosaik consists of two parts: the aligner (aligns each read to the reference genome separately in a pair-wise fashion) and the assembler (pads the individual reads and the reference genome sequence so that every aligned base within each read remains in register in the resulting multiple read alignment). The details of Mosaik processing are described in Supplementary Methods. The resulting multiple read alignments were then reported either in ACE17 or in binary formats used by downstream analysis software.

SNP and indel discovery in strain CB4858. Starting with the multiple read alignments produced by the Mosaik aligner and assembler, we analyzed the resulting alignments using a version of PolyBayes8 that was completely reengineered to enable efficient analyses of millions of aligned short-read sequences. The program evaluated each aligned base and its base quality value at each position, to indicate putative SNPs and small (1–3 bp) putative indels, and their corresponding SNP probability value (PSNP). Base quality values were converted to base probabilities corresponding to every one of the four possible nucleotides (and to the probability that the nucleotide in question was an actual insertion error in the sequence). Using a Bayesian formulation8, a PSNP (or indel probability value, as appropriate) was calculated as the likelihood that multiple different alleles are present between the reference genome sequence and the reads aligned at that position. If the probability value exceeds a prespecified threshold, the SNP or indel candidate is reported in the output. For the collection of bases contributed by such reads, a single 'consensus' base call and its base quality value are computed. The corresponding base probabilities are then used in the Bayesian PSNP calculation. In this study, we used a PSNP cutoff value of 0.7 to define a high-certainty SNP or small indel site. Validated CB4858 SNPs and indels were assigned Wormbase accession numbers pas1–pas50906.
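The conversion of observed error rates to phred scores mentioned in the Methods uses the standard phred definition, Q = -10 log10(p). The Python sketch below illustrates that conversion only; the mismatch counts in the example are invented for illustration, and the function names are not taken from the authors' pipeline.

```python
import math


def observed_error_rate(mismatches: int, bases: int) -> float:
    """Fraction of aligned bases at one read position that disagree with the reference."""
    if bases <= 0:
        raise ValueError("bases must be positive")
    return mismatches / bases


def error_rate_to_phred(error_rate: float) -> float:
    """Standard phred scale: Q = -10 * log10(p)."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)


def phred_to_error_rate(quality: float) -> float:
    """Inverse of the phred scale: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


# Hypothetical counts: 5 mismatches among 5,000 uniquely aligned bases.
rate = observed_error_rate(5, 5000)
print(f"observed Q = {error_rate_to_phred(rate):.1f}")
```

Applied per base position and per assigned quality value, this gives the kind of "calculated base quality" plotted in Figure 3.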

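The exact-match part of the microrepeat idea described in the Methods can be sketched in a few lines of Python. This is a simplified illustration, not the authors' custom scripts: it flags only k-mers that recur exactly on the forward strand, whereas the paper also allows up to two mismatches, and it uses a toy sequence with k = 3 instead of the 32-bp read length. The function names are hypothetical.

```python
from collections import defaultdict
from typing import List


def repeated_kmer_mask(sequence: str, k: int) -> List[bool]:
    """Mark every base covered by a k-mer that occurs more than once (exact matches only)."""
    starts_by_kmer = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        starts_by_kmer[sequence[start:start + k]].append(start)

    mask = [False] * len(sequence)
    for starts in starts_by_kmer.values():
        if len(starts) > 1:  # k-mer is not unique in the sequence
            for start in starts:
                for pos in range(start, start + k):
                    mask[pos] = True
    return mask


def apply_mask(sequence: str, mask: List[bool]) -> str:
    """Replace masked bases with 'N', as in a masked reference FASTA."""
    return "".join("N" if masked else base for base, masked in zip(sequence, mask))


toy_reference = "ACGTTTACGA"  # 'ACG' occurs at positions 0 and 6
toy_mask = repeated_kmer_mask(toy_reference, k=3)
print(apply_mask(toy_reference, toy_mask))
print(f"masked fraction = {sum(toy_mask) / len(toy_mask):.1f}")
```

A dictionary of k-mer start positions keeps the sketch short; a genome-scale implementation would need a more memory-efficient index and a mismatch-tolerant search.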

Software availability. The combined microrepeat plus RepeatMasker masked genome sequence annotations and FASTA files are available at http://bioinformatics.bc.edu/microrepeats/elegans/. Mosaik and the updated version of PolyBayes are now in beta release and available for users who wish to participate in software testing (http://bioinformatics.bc.edu/marthlab/Beta_Release). After the testing period, both programs will be released for public use, free of charge for academic users.

Additional methods. Details of Solexa library construction and sequencing, data analysis of primary sequence data and its alignment to the C. elegans reference genome (both single and paired-end reads) as well as detailed descriptions of Mosaik and PolyBayes analysis of CB4858 read data and its validation are available in Supplementary Methods.

Note: Supplementary information is available on the Nature Methods website.

ACKNOWLEDGMENTS
We acknowledge National Human Genome Research Institute funding (HG003079-04 to R.K.W. and HG003698 to G.T.M.). We thank K. Hall and D. Bentley of Illumina, Inc. for generously producing the paired-end read data described in the manuscript, M. Wendl for careful reading of the manuscript and T. Bieri for submitting the CB4858 variants to Wormbase.

AUTHOR CONTRIBUTIONS
L.W.H., N2 Bristol read, coverage, variant and gap analyses; G.T.M., CB4858 SNP discovery and N2 Bristol error profile analysis; A.R.Q., CB4858 SNP discovery and validation analysis; D.D., Solexa analysis pipeline; G.F., validation assay design and analysis; D.B., Solexa base quality value analysis; P.F., preparation of N2 Bristol and CB4858 DNA; J.I.G., N2 Bristol read analysis; M.H., Solexa libraries and sequencing; W.H., microrepeat analysis; V.J.M., Solexa libraries and sequencing; R.J.R., N2 Bristol analysis; S.N.S., validation assays; D.A.S., microrepeat masking of C. elegans; M.S., Mosaik adaptation; E.F.T., microrepeat finding; T.W., N2 Bristol analysis; T.S., C. elegans strain selection; R.K.W., project origination; E.R.M., project coordination and manuscript preparation.

Published online at http://www.nature.com/naturemethods/
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions

1. C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018 (1998).
2. Waterston, R. et al. The genome of the nematode Caenorhabditis elegans. Cold Spring Harb. Symp. Quant. Biol. 58, 367–376 (1993).
3. Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
4. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
5. Harris, T.W. et al. WormBase: a multi-species resource for nematode biology and genomics. Nucleic Acids Res. 32, D411–D417 (2004).
6. Stein, L.D. et al. The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol. 1, e45 (2003).
7. Hodgkin, J. & Doniach, T. Natural variation and copulatory plug formation in Caenorhabditis elegans. Genetics 146, 149–164 (1997).
8. Marth, G.T. et al. A general approach to single-nucleotide polymorphism discovery. Nat. Genet. 23, 452–456 (1999).
9. Bieri, T. et al. WormBase: new content and better access. Nucleic Acids Res. 35, D506–D510 (2007).
10. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat. Genet. 37, 727–732 (2005).
11. Denver, D.R., Morris, K. & Thomas, W.K. Phylogenetics in Caenorhabditis elegans: an analysis of divergence and outcrossing. Mol. Biol. Evol. 20, 393–400 (2003).
12. Smit, A.F. The origin of interspersed repeats in the human genome. Curr. Opin. Genet. Dev. 6, 743–748 (1996).
13. Bhangale, T.R., Stephens, M. & Nickerson, D.A. Automating resequencing-based detection of insertion-deletion polymorphisms. Nat. Genet. 38, 1457–1462 (2006).
14. Stephens, M., Sloan, J.S., Robertson, P.D., Scheet, P. & Nickerson, D.A. Automating sequence-based detection and genotyping of SNPs from diploid samples. Nat. Genet. 38, 375–381 (2006).
15. Nickerson, D.A., Kolker, N., Taylor, S.L. & Rieder, M.J. Sequence-based detection of single nucleotide polymorphisms. Methods Mol. Biol. 175, 29–35 (2001).
16. Koch, R., van Luenen, H.G., van der Horst, M., Thijssen, K.L. & Plasterk, R.H. Single nucleotide polymorphisms in wild isolates of Caenorhabditis elegans. Genome Res. 10, 1690–1696 (2000).
17. Gordon, D., Abajian, C. & Green, P. Consed: a graphical tool for sequence finishing. Genome Res. 8, 195–202 (1998).

188 | VOL.5 NO.2 | FEBRUARY 2008 | NATURE METHODS BRIEF COMMUNICATIONS

Pyrobayes: an improved base caller for SNP discovery in pyrosequences

Aaron R Quinlan, Donald A Stewart, Michael P Strömberg & Gábor T Marth

Department of Biology, Boston College, 140 Commonwealth Avenue, Chestnut Hill, Massachusetts 02467, USA. Correspondence should be addressed to G.T.M. ([email protected]).
RECEIVED 3 OCTOBER 2007; ACCEPTED 4 DECEMBER 2007; PUBLISHED ONLINE 13 JANUARY 2008; DOI:10.1038/NMETH.1172
NATURE METHODS | VOL.5 NO.2 | FEBRUARY 2008 | 179–181

Previously reported applications of the 454 Life Sciences pyrosequencing technology have relied on deep sequence coverage for accurate polymorphism discovery because of frequent insertion and deletion sequence errors. Here we report a new base calling program, Pyrobayes, for pyrosequencing reads. Pyrobayes permits accurate single-nucleotide polymorphism (SNP) calling in resequencing applications, even in shallow read coverage, primarily because it produces more confident base calls than the native base calling program.

The sequencing reads produced by the 454 Life Sciences pyrosequencers are the result of cyclical nucleotide tests in which ideally all nucleotides within a homopolymer (for example, AAA) are incorporated in a single test, and the light intensity signal observed in each cycle is proportional to the actual number of incorporated nucleotides (ref. 1). In reality, the signal for a fixed number of incorporated bases varies substantially, and there is usually a nonzero signal even when no base is incorporated (Supplementary Fig. 1a online). This makes accurate base calling difficult and leads to nucleotide over-calls and under-calls that manifest as insertion and deletion errors (refs. 2–4). Such errors often lead to misalignments that artificially inflate sequencing error estimates and cause the assignment of lower estimates of the base calls' accuracy (which we refer to as base quality) than warranted by their true accuracy (Fig. 1).

Figure 1 | Comparison of the error profiles of Pyrobayes and the native 454 base caller. (a) Illustration of the effects of calling too few or too many bases on the alignment of a read (gray) to the reference sequence (black). Top, too few thymines were called, resulting in two spurious mismatches (arrows) by misaligning the correctly called cytosine and the inserted guanine in the 454 read. Middle, the correct number of thymines was called, resulting in the correct read alignment of the single insertion error (red) in the 454 read. Bottom, too many thymines were called, resulting in the correct read alignment of the two base insertion errors (red) in the 454 read. (b) Base error rates for Pyrobayes and the native 454 base caller. The relative contribution of each error type based on Pyrobayes calls is shown in the pie chart.
[Figure 1b plot: error rate (y-axis, 0.0000 to 0.0040) for total error rate, insertion rate, deletion rate and substitution rate, Pyrobayes versus 454 base caller. Pie chart: insertions 72%, deletions 24%, substitutions 4%.]

Accurate base qualities are crucial for resequencing applications in which true allelic variation must be distinguished from sequencing error. Reliable SNP calls can only be made if the base error rate for the called allele is substantially lower than the expected polymorphism rate. For example, in human studies for which the average pairwise polymorphism rate is on the order of 1 in 1,000 bp, no SNP call should be made from a single allele with a base quality lower than 30 (1 in 1,000 bp error rate). However, if most base calls in resequencing reads are well above such a threshold, SNPs can be detected with high confidence even in single-read coverage. Unfortunately, we found that the majority of the base qualities assigned by the native 454 base caller (version 1.0.52) were not sufficiently high for SNP calling in low-coverage conditions, as only 24% of the native 454 base calls were above 30 (Fig. 2a). However, we found that 454 reads can be called accurately, but the base qualities assigned by the native base caller underestimate the actual base accuracy (Fig. 2b). We developed a new base calling program, Pyrobayes, to produce more accurate (higher) base qualities and hence make more high-quality base calls in 454 pyrosequences.

Our base caller first determines the most likely number of incorporated bases from the measured incorporation signal for each nucleotide test. Our Bayesian strategy (Supplementary Methods and Supplementary Fig. 1 online) requires 'data likelihoods', that is, the distribution of observed nucleotide incorporation signals for every possible homopolymer length. We estimated these by collecting shotgun resequencing data with the 454 Life Sciences GS20 instrument from a finished mouse bacterial artificial chromosome (BAC) clone and extrapolating to higher homopolymer lengths for which few or no examples could be found (Supplementary Fig. 1a and Supplementary Fig. 2 online). For 'prior probabilities', we used the relative frequency of homopolymer lengths tabulated from several different reference genome sequences. We found that these frequencies were consistently different from the theoretical expectation that they are proportional to 1/4^n, where n is the homopolymer length (Supplementary Fig. 1b). In the software we used a single distribution because the frequencies are very similar across all eukaryotic genomes we considered. Using data likelihoods and prior distributions, we determined the 'Bayesian posterior probability' of the correct number of bases given the measured incorporation signal (Supplementary Fig. 1c). The called base sequence was produced by concatenating the most likely number of bases for every consecutive incorporation test. The base quality assigned to each base is the probability that the base in question is not an over-call. We found it also useful to call one extra base, as long as the presence of that base is above a minimum probability (see below).

We compared the Pyrobayes and native base calling accuracy in 299,654 reads from the inbred reference (iso-1) strain of Drosophila melanogaster (Supplementary Methods). The overall base accuracy (Fig. 1b) was quite high for both Pyrobayes and the native base caller (99.60% versus 99.61%). Notably, 96% of all sequencing errors were insertions or deletions. The Pyrobayes insertion error rate was higher (0.29% versus 0.24%), but its deletion rate was lower (0.09% versus 0.10%). Most importantly for SNP discovery, the Pyrobayes substitution error rate was 60% lower (0.017% versus 0.042%) than that of the native base caller. A large fraction (74%) of the base calling errors was shared between the two methods. Characteristically, 86% of the errors solely made by Pyrobayes were insertions whereas 82% of the unique 454 base caller errors were deletions or substitutions (Supplementary Fig. 3). The Pyrobayes base qualities corresponded substantially better to the actual base accuracy than the native base qualities (Fig. 2b), and therefore our base qualities were typically higher (Fig. 2c). For example, 56% of the Pyrobayes base calls were assigned base qualities of 30 or higher, as compared to 24% of the native base calls (Fig. 2a,c). Additionally, Pyrobayes produced base qualities up to 50, whereas the highest native base quality was 38.

Figure 2 | Comparison of the base qualities assigned by Pyrobayes and the native 454 base caller. (a) The cumulative distribution of base qualities assigned by each program. (b) Comparison between assigned base quality and the base quality calculated from measured base accuracy. A value of 50 was assigned when no errors were found. (c) The distribution of base calls according to base quality.
[Figure 2 plots, Pyrobayes versus 454 base caller, x-axis called base quality (0 to 50) in all panels: (a) fraction of base calls at or above a given quality; (b) actual base quality; (c) fraction of calls.]

We investigated the effect of our higher overall base qualities on SNP detection. First, we searched for single-base-pair differences between the 454-sequenced iso-1 reads and the iso-1 reference sequence. We expected few true polymorphisms as these sequences were from the same inbred D. melanogaster strain, and the overall accuracy of the D. melanogaster genome reference sequence is very high. Therefore, SNPs discovered in this comparison estimate the false positive SNP rate. This rate was 1.22/10,000 bp using the native base calls, but only 0.97/10,000 bp using the Pyrobayes base calls. It is important to consider that the false SNP discovery rate depends on the polymorphism rate in the resequenced organism. For example, in D. melanogaster, where the pairwise polymorphism rate is ~1/200 bp (ref. 5), our results corresponded to a false SNP discovery rate of 1.9%.

To estimate SNP calling error rates directly, we also sequenced an inbred D. melanogaster isolate from Malawi with a single 454 run. In the alignments of the 454 reads base called with Pyrobayes we found 1,118 SNP candidates at or above the Polybayes SNP probability (ref. 6) cutoff value of 0.7. The validation rate for these candidates was 93% (1,036 of 1,118). The corresponding 7% false positive SNP rate observed in this experiment is a composite effect of false SNP calls, emulsion PCR errors before 454 sequencing and the usual artifacts associated with capillary sequence validation experiments (ref. 7). We also estimated that we missed 14.8% of the SNPs (Supplementary Methods). We repeated the SNP discovery experiment in the alignments processed with the native 454 base caller: the false positive rates were identical, but twice as many (30.0%) SNPs were missed.

The primary cause of spurious substitution errors in 454 reads is the erroneous alignment of a base under-call followed by an over-call (or vice versa) as a base substitution (Fig. 1a and Supplementary Fig. 3d). Our alignment algorithm, Mosaik (Supplementary Methods), uses gap penalties that properly align reads in such situations. Additionally, we found that calling more bases in homopolymer runs often also improves the alignment (Fig. 1a). Eliminating spurious base errors resulting from alignment artifacts leads to assignment of higher base qualities. Higher base qualities increase SNP calling sensitivity.

The cost of tending toward calling more bases in homopolymer runs is a slightly increased insertion rate (Fig. 1b) even though the extra called bases are typically assigned very low base qualities. This is a logical choice for SNP discovery applications. However, it is not yet clear what effect such extra called bases will have for de novo sequence assembly of 454 reads.

A natural, although undesirable consequence of having to determine homopolymer length from a single incorporation signal is that the likelihood of over-calling error increases with every consecutive nucleotide. Accordingly, the first called base in a homopolymer run is assigned the highest base quality, and the last called base, the lowest (Supplementary Fig. 4a online). This introduces an unintended directionality for the base qualities in the sequence alignment (Supplementary Fig. 4b). Clearly, it is not possible for the base calling program to resolve this ambiguity within the standard base quality framework defined by the Phred (refs. 8,9) base calling program. Consequently, one must rely on alignment and SNP calling software to account for this phenomenon.

We also evaluated base calling accuracy on the new 454 Life Sciences FLX sequencing machine model using two sequencing runs from the K12 strain of Escherichia coli and found that both base callers underestimate the FLX base accuracy (Supplementary Fig. 5 online). The primary reason for this is that the overall error rate of the FLX machine (0.12%) was much lower than that of the GS20 (0.40%). Although the fact that the Pyrobayes base qualities were much closer to the actual accuracy suggests that our calibration procedure is robust, there is clearly a need to recalibrate our method for the FLX and future models.

The increased accuracy of our base qualities will likely permit more sensitive biological studies using the 454 machines. Although our data only illustrate this directly for low-coverage, survey-type applications, statistical fluctuations (ref. 10) will result in regions of shallow read depth even in deeper nominal coverage. The ability to call SNPs in such regions without a substantial loss of accuracy will permit more complete analyses of whole-genome alignments. Pyrobayes can process a single sequencing run in under 2 min. Pyrobayes and Mosaik are freely available for nonprofit use at http://bioinformatics.bc.edu/marthlab/Software.

Note: Supplementary information is available on the Nature Methods website.

ACKNOWLEDGMENTS
This work was supported by a grant from the US National Human Genome Research Institute (R01 HG003698) to G.T.M. We thank E. Mardis and the 454 production group at the Washington University Genome Sequencing Center for generating the sequence data used in this work, and A. Clark at Cornell University for providing access to the D. melanogaster reads.

AUTHOR CONTRIBUTIONS
A.R.Q., software and algorithm development and data analysis; D.A.S., data fitting and parameter estimation for Bayesian data likelihoods; M.P.S., alignment algorithm development. A.R.Q. and G.T.M. designed the experiment and wrote the manuscript.

Published online at http://www.nature.com/naturemethods/
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions

1. Margulies, M. et al. Nature 437, 376–380 (2005).
2. Girard, A., Sachidanandam, R., Hannon, G.J. & Carmell, M.A. Nature 442, 199–202 (2006).
3. Thomas, R.K. et al. Nat. Med. 12, 852–855 (2006).
4. Velicer, G.J. et al. Proc. Natl. Acad. Sci. USA 103, 8107–8112 (2006).
5. Hoskins, R.A. et al. Genome Res. 11, 1100–1113 (2001).
6. Marth, G.T. et al. Nat. Genet. 23, 452–456 (1999).
7. Quinlan, A.R. & Marth, G.T. Nat. Methods 4, 192 (2007).
8. Ewing, B. & Green, P. Genome Res. 8, 186–194 (1998).
9. Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Genome Res. 8, 175–185 (1998).
10. Lander, E.S. & Waterman, M.S. Genomics 2, 231–239 (1988).
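The Bayesian call described in the Pyrobayes communication above (pick the most likely homopolymer length for each nucleotide test, then score each called base by the probability that it is not an over-call) can be sketched roughly as follows. This is a minimal illustration and not the published implementation: the Gaussian signal model, its `sd_base` and `sd_slope` parameters, the prior values and the Q50 cap are all assumptions chosen for the example, whereas the paper fits its data likelihoods and priors empirically.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution; stands in for the fitted data likelihood."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def posterior_lengths(signal, priors, sd_base=0.15, sd_slope=0.03):
    """Posterior P(n | signal) for homopolymer lengths n = 0 .. len(priors) - 1.

    Assumed model: a signal for length n is normal with mean n and a spread
    that grows with n (sd_base and sd_slope are hypothetical values).
    """
    weights = [
        prior * gaussian_pdf(signal, float(n), sd_base + sd_slope * n)
        for n, prior in enumerate(priors)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def call_homopolymer(signal, priors, max_quality=50):
    """Return (called length, Phred-scaled quality for each called base).

    The k-th called base is scored with P(true length >= k | signal),
    i.e. the probability that this base is not an over-call.
    """
    post = posterior_lengths(signal, priors)
    n_call = max(range(len(post)), key=post.__getitem__)
    floor = 10 ** (-max_quality / 10)  # assumed cap on reported quality
    qualities = []
    for k in range(1, n_call + 1):
        p_overcall = max(1.0 - sum(post[k:]), floor)
        qualities.append(round(-10 * math.log10(p_overcall)))
    return n_call, qualities


# Hypothetical prior over lengths 0..5 (not taken from the paper).
EXAMPLE_PRIORS = [0.60, 0.30, 0.07, 0.02, 0.007, 0.003]
length, quals = call_homopolymer(2.6, EXAMPLE_PRIORS)
```

With an ambiguous signal between two lengths, the sketch calls the longer run but gives its last base a low quality, which mirrors the directionality of base qualities within a homopolymer that the text describes.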

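The quality threshold and false discovery figures quoted in the Pyrobayes text can be checked with a few lines of arithmetic. The sketch below uses the standard Phred relation Q = -10 log10(p); the false discovery formula (false calls divided by false plus true calls per base pair) is an assumption about how the 1.9% figure was obtained, and the function names are illustrative.

```python
import math


def phred_to_error(quality):
    """Error probability implied by a Phred-scaled base quality."""
    return 10 ** (-quality / 10)


def error_to_phred(p_error):
    """Phred-scaled quality implied by an error probability."""
    return -10 * math.log10(p_error)


def false_discovery_rate(false_calls_per_bp, true_snps_per_bp):
    """Assumed definition: share of all SNP calls that are false."""
    return false_calls_per_bp / (false_calls_per_bp + true_snps_per_bp)


# Figures quoted in the text.
q30_error = phred_to_error(30)                          # 1 error in 1,000 bp
fdr_pyrobayes = false_discovery_rate(0.97 / 10_000, 1 / 200)
fdr_native = false_discovery_rate(1.22 / 10_000, 1 / 200)
validation_rate = 1036 / 1118                           # capillary validation
```

Under this definition the Pyrobayes figure rounds to the 1.9% reported in the text, and the validation count rounds to 93%.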
CORRESPONDENCE

Primer-site SNPs mask mutations

To the editor: Sanger-sequencing of diploid DNA from PCR amplicons is the standard resequencing method for targeted mutation detection (ref. 1). Heterozygosity at polymorphisms within the PCR primer sites causes preferential amplification of the matched chromosome (ref. 2) and suppresses the signal from an alternate allele on the mismatched chromosome. Analysis of ten resequenced Encyclopedia of DNA Elements (ENCODE) regions (refs. 3,4) shows that this phenomenon is responsible for a large fraction of missed mutations, most of which can be recovered by doubling PCR amplicon coverage.

The resequencing traces produced by the HapMap project from ten ENCODE regions, together with the chip-based genotypes for many of the discovered single-nucleotide polymorphisms (SNPs) therein (ref. 3), are a useful reference for assessing the missed heterozygote rate (MHR) of trace-based mutation detection software. While testing our SNP-discovery program POLYBAYES (ref. 5) on this data set, we uncovered many SNPs with gross discrepancies between the chip-based genotypes and the traces of the same individuals. Specifically, these individuals were heterozygous according to the chip-based genotypes yet only one of the two SNP alleles was detectable in the traces (Supplementary Fig. 1 online). Often, the same SNP was also sequenced from a second, overlapping amplicon, where the traces confirmed the heterozygous genotype. We noticed that a disproportionate fraction of the discrepant traces were from amplicons whose primers overlapped other SNPs (primerSNPs). To ensure that these discrepancies were not artifacts of our new software, we confirmed the results with another SNP discovery program, POLYPHRED (ref. 6).

Of all amplicons studied, 15.4% had primerSNPs, and in these amplicons, the missed heterozygote rate (the uncalled heterozygotes in the traces as a fraction of all heterozygous genotypes) was 22.2% (Supplementary Data online and Supplementary Methods online). This is more than twice the 9.1% overall MHR in the entire data set and nearly four times the 6.1% rate in amplicons without a primerSNP. Moreover, amplicons with primerSNPs account for half of all missed heterozygotes. One plausible explanation is that the missed heterozygous individuals are also heterozygous at a primerSNP. If so, the PCR primer would preferentially anneal to and amplify the chromosome with the perfectly matched allele. In the resulting sequence trace, the signal for the allele on the preferentially amplified chromosome will dominate the alternate allele (Fig. 1a). Indeed, the MHR was 58.4% in individuals heterozygous at a primerSNP, nearly ten times the 6.1% rate in individuals homozygous at the same primerSNP. Furthermore, if our explanation is correct, the single SNP allele present in the trace will be on the same chromosome as the perfectly matched primerSNP allele. Using phased haplotype data from the HapMap project, together with the primer sequence, we were able to correctly predict the identity of the missed SNP allele 93.1% of the time. These findings suggest that primerSNP heterozygosity is a major cause of missed heterozygotes.

A typical focus of medical resequencing projects is the detection of rare SNPs or individual mutations, normally present as a single heterozygote within the resequenced cohort. In amplicons with primerSNPs, such variants are missed 25% of the time (as compared to the 14.5% rate in all amplicons). For such rare SNPs, when the single heterozygote is missed, the SNP is not discovered. For common SNPs, missing a fraction of the heterozygotes results in the reduction of heterozygous genotype frequencies from Hardy-Weinberg expectations and can force quality-control procedures (ref. 3) to discard the SNP (Supplementary Fig. 2 online).

Many resequencing protocols already have provisions for avoiding primerSNPs, especially toward the 3′ end of the primer sequence where imperfect annealing owing to allelic mismatch is assumed to inhibit polymerase binding. We found that SNPs anywhere within the primer can cause missed heterozygotes and therefore must be avoided (Supplementary Fig. 3 online). Unfortunately, previously unknown SNPs cannot be avoided. We find, however, that their impact can be dramatically reduced by sequencing from more than a single amplicon. In amplicons with a primerSNP, sequencing from one additional amplicon reduces the MHR fivefold, from 20.9% to 3.7% (Fig. 1b), and from a third amplicon, to below 1%. Sequencing from increased amplicon coverage also reduces the overall MHR, and mitigates other sequencing errors and software limitations.

All ENCODE data used in this analysis including the resequencing traces, anchored assemblies and POLYPHRED candidate SNP mark-ups are available online (http://bioinformatics.bc.edu/mathlab/pcrbias).

Figure 1 | Heterozygosity at a primerSNP. (a) For an individual heterozygous at a primerSNP (top), the chromosome matching the primer sequence is amplified more efficiently than the chromosome containing the mismatched allele. Sequencing results in reduced color intensity for the allele on the mismatched chromosome. For an individual homozygous at a primerSNP (bottom), amplification and allele-specific color intensities are balanced. (b) The MHR decreases with increasing amplicon coverage both in amplicons with primerSNPs and in all amplicons.
[Figure 1a diagram: primer sequence TTCGAAACACGCAA. Heterozygous primer site (TTCGAAACACGCAA / TTCGAAACGCGCAA) with heterozygous amplicon SNP (AGAAA / AGCAA): chromosomes are differentially amplified and alleles are not equally amplified. Homozygous primer site (TTCGAAACACGCAA on both chromosomes): alleles are amplified equally. Figure 1b plot: missed heterozygote rate (y-axis, 0.00 to 0.22) versus depth of amplicon coverage (1, 2, 3), for amplicons with primerSNP and for all amplicons.]

Note: Supplementary information is available on the Nature Methods website.

Aaron R Quinlan & Gabor T Marth
Department of Biology, Boston College, Chestnut Hill, Massachusetts 02467, USA.
e-mail: [email protected]

COMPETING INTERESTS STATEMENT
The authors declare that they have no competing financial interests.

1. Sjoblom, T. et al. Science 314, 268–274 (2006).
2. Ikegawa, S. et al. Hum. Genet. 110, 606–608 (2002).
3. The International HapMap Consortium. Nature 437, 1299–1320 (2005).
4. The ENCODE Project Consortium. Science 306, 636–640 (2004).
5. Marth, G.T. et al. Nat. Genet. 23, 452–456 (1999).
6. Stephens, M. et al. Nat. Genet. 38, 375–381 (2006).
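The correspondence above notes that missed heterozygotes pull genotype frequencies away from Hardy-Weinberg expectations, which can cause a quality-control filter to discard a real SNP. A minimal sketch of that effect follows. The cohort size of 90, the allele frequency of 0.5 and the even split of missed heterozygotes between the two homozygous classes are assumptions made for illustration, and the chi-square statistic is the generic one-degree-of-freedom test, not necessarily the filter used by the HapMap project.

```python
def hwe_expected(p, n):
    """Expected genotype counts (AA, AB, BB) under Hardy-Weinberg equilibrium."""
    q = 1.0 - p
    return (p * p * n, 2 * p * q * n, q * q * n)


def observed_with_missed_hets(p, n, mhr):
    """Genotype counts when a fraction mhr of heterozygotes is called homozygous.

    Assumption: missed heterozygotes are split evenly between AA and BB.
    """
    aa, ab, bb = hwe_expected(p, n)
    missed = ab * mhr
    return (aa + missed / 2, ab - missed, bb + missed / 2)


def hwe_chi_square(observed):
    """Chi-square statistic comparing observed counts with HWE expectations."""
    n = sum(observed)
    p = (2 * observed[0] + observed[1]) / (2 * n)  # allele frequency from the data
    expected = hwe_expected(p, n)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# MHR values quoted in the text: amplicons with and without a primerSNP.
chi2_primer_snp = hwe_chi_square(observed_with_missed_hets(0.5, 90, 0.222))
chi2_no_primer_snp = hwe_chi_square(observed_with_missed_hets(0.5, 90, 0.061))
```

In this toy setting the 22.2% rate pushes the statistic past the usual 3.84 cutoff (5% level, one degree of freedom), while the 6.1% rate does not.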

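As a rough plausibility check on the reported drop in MHR with additional amplicons, one can assume that a heterozygote is missed only if every amplicon covering it misses it, and that amplicons fail independently. This independence model is an assumption introduced here and is not stated in the correspondence.

```python
def mhr_with_coverage(single_amplicon_mhr, n_amplicons):
    """MHR if amplicons miss a heterozygote independently (assumed model)."""
    return single_amplicon_mhr ** n_amplicons


# Single-amplicon MHR in amplicons with a primerSNP, as quoted in the text.
predicted = {k: mhr_with_coverage(0.209, k) for k in (1, 2, 3)}
```

The model gives about 4.4% for two amplicons and about 0.9% for three, in the same range as the reported 3.7% and "below 1%".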
192 | VOL.4 NO.3 | MARCH 2007 | NATURE METHODS

BMC Bioinformatics (BioMed Central)

Research article (Open Access)

Analysis of concordance of different haplotype block partitioning algorithms

Amit R Indap*1, Gabor T Marth2, Craig A Struble3, Peter Tonellato1 and Michael Olivier1

Address: 1Human and Molecular Genetics Center, Medical College of Wisconsin, Milwaukee, USA, 2Department of Biology, Boston College, Chestnut Hill, USA and 3Department of Mathematics, Statistics, and Computer Science, Marquette University, Milwaukee, USA Email: Amit R Indap* - [email protected]; Gabor T Marth - [email protected]; Craig A Struble - [email protected]; Peter Tonellato - [email protected]; Michael Olivier - [email protected] * Corresponding author

Published: 15 December 2005 Received: 24 June 2005 Accepted: 15 December 2005 BMC Bioinformatics 2005, 6:303 doi:10.1186/1471-2105-6-303 This article is available from: http://www.biomedcentral.com/1471-2105/6/303 © 2005 Indap et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Different classes of haplotype block algorithms exist and the ideal dataset to assess their performance would be to comprehensively re-sequence a large genomic region in a large population. Such data sets are expensive to collect. Alternatively, we performed coalescent simulations to generate haplotypes with a high marker density and compared block partitioning results from diversity based, LD based, and information theoretic algorithms under different values of SNP density and allele frequency.

Results: We simulated 1000 haplotypes using the standard coalescent for three world populations – European, African American, and East Asian – and applied three classes of block partitioning algorithms – diversity based, LD based, and information theoretic. We assessed algorithm differences in number, size, and coverage of blocks inferred under different conditions of SNP density, allele frequency, and sample size. Each algorithm inferred blocks differing in number, size, and coverage under different density and allele frequency conditions. Different partitions had few if any matching block boundaries. However they still overlapped and a high percentage of total chromosomal region was common to all methods. This percentage was generally higher with a higher density of SNPs and when rarer markers were included.

Conclusion: A gold standard definition of a haplotype block is difficult to achieve, but collecting haplotypes covered with a high density of SNPs, partitioning them with a variety of block algorithms, and identifying regions common to all methods may be the best way to identify genomic regions that harbor SNP variants that cause disease.

Background

Single Nucleotide Polymorphisms (SNPs) are single base pair differences between individuals in a population. The recent completion of the Human Genome Project has helped facilitate the discovery of millions of SNPs and their use in genetic association studies for human disease [1]. Association studies work on the premise that SNP genotypes are correlated with a disease phenotype. Individual SNPs are genotyped and the allele frequencies are compared between groups of affected and unaffected individuals. SNPs that are tested for association either must be the causative allele or be in linkage disequilibrium (LD) with the causative allele. LD is the non-random association of alleles between adjacent loci [2]. SNPs that are in LD with the causative allele serve as a proxy and the association with the disease phenotype is maintained.

Numerous studies have shown that the human genome contains regions of high LD with low haplotype diversity [3-6]. These regions are called haplotype blocks. The existence of haplotype blocks reduces the number of SNPs required in association studies by identifying and typing only the subset of tag SNPs which uniquely identify common haplotypes present in a block. The frequencies of these haplotypes can be compared in groups of affected and unaffected individuals [7].

Haplotype blocks are defined computationally by various algorithms and can be classified into three categories: diversity based [3,8], LD-based [6], and information-theoretic [9]. Patil et al. [3] used a diversity based greedy algorithm to partition Chromosome 21 into haplotype blocks in a sample of 20 re-sequenced chromosomes. Their algorithm considers all blocks of consecutive SNPs of one SNP or larger, and defines a haplotype block boundary where at least 80% of observed haplotypes within a block are represented more than once in their sample of chromosomes. Overlapping block boundaries were eliminated by choosing the block with the maximum ratio of SNPs in the block to the number of SNPs required to discriminate all haplotypes represented in the block. The process was repeated until the entire length of the chromosome was partitioned into haplotype blocks. Zhang et al. [8] subsequently provided a dynamic programming implementation for this approach in their software HapBlock [10].

Gabriel et al. [6] used an LD-based algorithm to define haplotype blocks in a worldwide sample of chromosomes from Africa, Asia, and Europe. The authors computed confidence bounds of the value of D', a standard measurement of LD [11], and defined pairs of SNPs to be in strong LD (little evidence of recombination) if the one-sided 95% D' confidence bound is between 0.7 and 0.98. The authors defined a haplotype block if at least 95% of pairwise SNP comparisons in a region show little evidence of recombination based upon their D' confidence bounds. The program Haploview [12] implements this method of Gabriel et al.

Anderson and Novembre [9] use the Minimum Description Length (MDL) principle for defining haplotype blocks, which incorporates LD decay between blocks and haplotype diversity within blocks [9]. The MDL principle is an application of information theory to statistical modeling which searches for patterns in data [13]. The description length of a data set is a function of the length with which data can be encoded in binary digits, or bits [9]. The best set of block boundaries defined by Anderson and Novembre's method is the set of block boundaries that has the shortest description length for a set of SNP genotypes that span a genomic region. The authors use a dynamic programming algorithm they call the iterative dynamic programming algorithm (IDP) and a faster, but approximate, dynamic programming algorithm called iterative approximate dynamic programming algorithm (IADP) to find the minimum description length for a set of haplotypes. Their method is implemented in the program MDBlocks [9].

Previous studies on the empirical performance of block partitioning methods have focused on data sets with differing minor allele frequency cutoffs. The studies of Daly et al. [5], Patil et al. [3], and Gabriel et al. [6] used minor allele frequency cutoffs of 5%, 10%, and 20%, respectively. Schulze et al. [14] assessed the effects of varying the minor allele frequency cutoff on the number of blocks and tag SNPs inferred by the LD based method of Gabriel et al. [6] and diversity based method of Zhang et al. [8]. As rarer SNPs were removed and the allele frequency cutoff raised, the number of blocks inferred decreased for both methods, showing that the block structure is highly influenced by the allele frequency of SNPs used in their analysis.

Ke et al. [15] studied the impact of SNP density on block boundaries from three different partitioning algorithms: the previously discussed LD approach of Gabriel et al., the four-gamete test [16], and a D' threshold approach of Phillips et al. [17]. The authors' study genotyped over 5000 SNPs in a 10 Mb region of chromosome 20 in four different populations: CEPH families, U.K. Caucasians, African Americans, and East Asians. Block boundaries of the algorithms were assessed with differing marker densities starting at 2 kb and going to 10 kb. Their results show that longer blocks at sparser densities are broken into smaller blocks as more SNPs are added in. Other studies describing the LD block structure of the human genome also used varying marker densities. The study by Phillips et al. [17] on chromosome 19 used an average marker density of one SNP per 17.65 kb with a median value of 5.5 kb. Gabriel et al. [6] used an average density of one SNP every 2 kb. Daly et al. [5] used a density of one marker approximately every 5 kb. Patil et al. [3] used a higher density of SNPs with one SNP every 1.3 kb. This study was also the only one that completely re-sequenced the entire chromosome for all 20 samples.
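As a rough illustration of the D' statistic that the LD-based block definitions above rely on, the sketch below computes |D'| for one pair of biallelic SNPs from phased haplotypes. It covers only the point estimate; the confidence bounds used by Gabriel et al. are not reproduced here, and the function name and the 0/1 allele coding are assumptions made for this example.

```python
def d_prime(haplotypes, i, j):
    """|D'| between SNP columns i and j of phased 0/1 haplotypes.

    D = P(AB) - P(A)P(B), scaled by the largest value D could take
    given the allele frequencies at the two sites.
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return 0.0  # at least one site is monomorphic in this sample
    return abs(d) / d_max


# Toy haplotype sets over two SNPs (hypothetical data).
complete_ld = [(0, 0), (0, 0), (1, 1), (1, 1)]
three_gametes = [(0, 0), (0, 0), (0, 1), (1, 1)]
all_four_gametes = [(0, 0), (0, 1), (1, 0), (1, 1)]
```

A pair with fewer than four observed gametes always gives |D'| = 1, which is why block methods pair D' with confidence bounds or frequency cutoffs rather than using the raw value alone.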

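The four-gamete test mentioned above can be sketched in a few lines: if all four allele combinations are seen at a pair of biallelic SNPs, a recombination event (or recurrent mutation) must have occurred between them. The greedy left-to-right partition below is one simple way to turn that test into blocks; it is an illustrative assumption and not necessarily the procedure used in the cited studies.

```python
def four_gametes_present(haplotypes, i, j):
    """True if all four allele combinations occur at SNP columns i and j."""
    return len({(h[i], h[j]) for h in haplotypes}) == 4


def four_gamete_blocks(haplotypes):
    """Greedy partition into blocks of (first SNP index, last SNP index).

    A block is closed as soon as the next SNP shows four gametes with
    any SNP already in the current block.
    """
    n_snps = len(haplotypes[0])
    blocks, start = [], 0
    for k in range(1, n_snps):
        if any(four_gametes_present(haplotypes, i, k) for i in range(start, k)):
            blocks.append((start, k - 1))
            start = k
    blocks.append((start, n_snps - 1))
    return blocks


# Hypothetical sample: SNPs 0 and 1 are perfectly correlated, SNP 2 is not.
toy_haplotypes = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
toy_blocks = four_gamete_blocks(toy_haplotypes)
```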

Figure 1. European population partitions. European population partitions for all 1000 haplotypes are shown. Block algorithms are abbreviated as HB (HapBlock), GA (Gabriel's method), MD (MDBlocks). The first two tracks show resulting block partitions from HapBlock and Gabriel's method using all SNPs. Next set of three tracks display resulting block partitions using all SNPs with a 10% MAF. The shared blocks track shows chromosomal regions common to all three block partitions using SNPs with a 10% MAF. The last two tracks show SNP positions of all SNPs and SNPs with at least a 10% MAF.
[Figure 1 tracks: All SNPs (HB, GA); MAF >10% (HB, GA, MD); Shared Blocks; SNP location.]

The ideal data set to fully assess the performance of block partitioning algorithms would be a comprehensively re-sequenced large genomic region in a large number of independent chromosomes. Unfortunately, such data are not available at this time. Only a limited number of samples have been re-sequenced extensively. In addition to the study by Patil et al., as of June 2005 the SeattleSNPs [18] data set has re-sequenced 234 human genes in 24 African-American and 23 European CEPH samples spanning a total of 4868 kb of sequence. The ENCODE project [19] intends to re-sequence five 500 kb genomic regions in the 48 individuals of the HapMap Consortium data set [20].

Therefore, to fully assess the performance of block partitioning algorithms we generated three populations consisting of 1000 haplotypes using the coalescent, a stochastic technique that simulates the genetic history of a sample of chromosomes [11]. Haplotypes representing a 200 kb chromosomal region for three world populations (European, African American, and East Asian) were simulated using an implementation of the coalescent that uses a population-specific demographic history. The population specific profiles we used were previously published in Marth et al. [21], where the authors derive a closed mathematical formula for computing the allele frequency spectrum for a specified demographic profile. The demographic profiles for each of the populations were derived by computing allele frequency spectra predicted by Marth's equation for numerous demographic scenarios and testing the fit between it and the observed spectra from the SNP Consortium data set [1] for each respective population.

In the study presented here, we partitioned our coalescent-derived haplotypes into blocks using the three algorithms described above (diversity based, LD based, information theoretic). We assessed algorithm differences in number, size, and coverage of blocks under different values of marker density, allele frequency, and sample size on the performance of block partitioning algorithms. Our results show a great divergence in haplotype blocks predicted by each method, and support the notion that it may be advisable to use multiple algorithms in parallel to comprehensively account for all haplotype blocks in the human genome.

Results

Data simulation and block partitioning
One thousand haplotypes representing a 200 kb region were generated via the standard coalescent with population specific demographic profiles for three world populations: European, African American, and East Asian. All datasets were analyzed with the three block partitioning algorithms described in the Methods section.

In addition to the complete dataset of 1000 haplotypes, 1000 bootstrap sub-sample replicates of 24 or 96 haplotypes were sampled and filtered for different SNP density (all markers, one marker approximately every 1 kb, one marker approximately every 5 kb) and minor allele frequency (MAF) cutoff values (0.1%, 5%, and 10%). Each bootstrap replicate was partitioned using three methods (HapBlock, Gabriel's method, and MDBlocks). Computer memory constraints prevented MDBlocks from partitioning all 1000 chromosomes using all SNPs for each coalescent-derived population. For the same reason we were only able to analyze 200 bootstrap subsamples of 24 or 96 chromosomes with MDBlocks. More details on coalescent simulations and bootstrap sampling are given in the Methods section of the paper.

European population partitions using all chromosomes
All 1000 European chromosomes were analyzed with HapBlock and Gabriel's method. There were 1349 polymorphic sites with an average SNP density of one SNP per


Table 2: Descriptive block statistics for all 1000 European, African American, and East Asian haplotypes using all SNPs and all SNPs with a MAF of at least 10%

European haplotypes

Method     MAF    Number of blocks   Mean bp/block   Mean SNPs/block   % coverage
HapBlock   0.1%   180                951.63          7.49              85.6%
HapBlock   10%    48                 3363.60         7.64              80.7%
Gabriel    0.1%   39                 4414.28         31.41             86.1%
Gabriel    10%    33                 4837.52         10.84             79.8%
MDBlocks   0.1%   -                  -               -                 -
MDBlocks   10%    18                 9348.78         20.39             84.4%

African American haplotypes

Method     MAF    Number of blocks   Mean bp/block   Mean SNPs/block   % coverage
HapBlock   0.1%   232                733.52          7.13              85%
HapBlock   10%    61                 2499.93         6.26              76.2%
Gabriel    0.1%   40                 4433.62         37.85             88.6%
Gabriel    10%    36                 4416.75         10.25             80%
MDBlocks   0.1%   -                  -               -                 -
MDBlocks   10%    18                 10004.44        21.22             90%

East Asian haplotypes

Method     MAF    Number of blocks   Mean bp/block   Mean SNPs/block   % coverage
HapBlock   0.1%   208                831.83          7.93              86.5%
HapBlock   10%    57                 2596.73         5.84              74%
Gabriel    0.1%   41                 3966.02         24.12             81.3%
Gabriel    10%    38                 3437.13         8.34              65.3%
MDBlocks   0.1%   -                  -               -                 -
MDBlocks   10%    18                 10006.11        18.50             90%
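The columns of Table 2 are linked by simple arithmetic: the number of blocks times the mean bp/block, divided by the region length, gives the coverage. For the European HapBlock row at 0.1% MAF, 180 × 951.63 / 200,000 is approximately 85.6%. The following Python sketch shows how such statistics could be derived from a partition, assuming blocks are supplied as half-open (start, end) base-pair intervals. The function name and the interval convention are hypothetical and are not taken from any of the three programs.

```python
def block_stats(blocks, region_length):
    """Summarise a block partition.

    blocks: list of non-overlapping (start, end) half-open bp intervals
            (an assumed representation for this sketch).
    region_length: total length of the analysed region in bp.
    Returns (number of blocks, mean bp per block, fractional coverage).
    """
    lengths = [end - start for start, end in blocks]
    n_blocks = len(lengths)
    mean_bp = sum(lengths) / n_blocks if n_blocks else 0.0
    coverage = sum(lengths) / region_length   # sum of block lengths / region length
    return n_blocks, mean_bp, coverage


# Toy partition of a 10 kb region (illustrative values only).
n_blocks, mean_bp, coverage = block_stats([(0, 1000), (1500, 4000), (4000, 9000)], 10_000)
```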

147 bp. Figure 1 displays the resulting block partitions using all SNPs from the two methods, with the HapBlock partition denoted as HB and Gabriel's method denoted as GA. Table 2 displays descriptive statistics for the HapBlock and Gabriel's method population partitions. No matching block boundaries existed between HapBlock and Gabriel's method. HapBlock inferred a larger number of blocks of smaller physical length than Gabriel's method, but 74% of the sequence was common to blocks inferred by both methods. Both algorithms gave similar values of coverage, which is defined as the sum of the physical haplotype block lengths in base pairs divided by the total length of the region [22], with values of 85.6% for HapBlock and 86.1% for Gabriel's method, respectively.

When analyzing all chromosomes using only SNPs with a MAF of 10% or greater, the total number of markers was reduced to 367 with an average of one SNP every 540 bp. Table 2 also shows descriptive statistics using only SNPs with a MAF of 10% or higher. The number of inferred blocks for HapBlock dropped dramatically from 180 to 48. For Gabriel's method the change was not as large, with 33 blocks inferred. MDBlocks inferred 18 blocks which had the largest physical size. HapBlock, Gabriel's method, and MDBlocks covered 80.7%, 79.8%, and 84.4% of the 200 kb region in blocks. HapBlock again inferred a greater number of blocks of smaller size when compared to the other two methods. Of each possible pair of partitions, only Gabriel's method and MDBlocks contained one set of matching boundaries. Still, a large fraction of sequence, 57%, was common to all three partitions. Table 1 shows the percentage of total sequence common to all population block partitions with this population and condition, as well as other populations examined in this study. Figure 1 shows the population partitions for all three methods using only SNPs with a MAF of at least 10%, and block regions common to all three algorithms.


Table 1: Population partition block overlaps. The table shows the percentage of total sequence common to all three partitions inferred from each algorithm (HapBlock, Gabriel's method, and MDBlocks) for each population studied.

Density   MAF    European   African American   East Asian
all       5%     61%        53%                44%
all       10%    57%        60%                46%
1 kb      0.1%   43%        21%                30%
1 kb      5%     60%        14%                22%
1 kb      10%    53%        17%                26%
5 kb      0.1%   13%        3%                 10%
5 kb      5%     29%        0%                 13%
5 kb      10%    33%        3%                 12%
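The overlap percentages in Table 1 measure how much of the region lies inside a block in every partition at once. One way to compute such a figure is sketched below in Python, assuming each partition is a sorted list of non-overlapping half-open (start, end) intervals in base pairs. The function names and the toy coordinates are hypothetical, and this is not the code used in the study.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping half-open intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def percent_common(partitions, region_length):
    """Percentage of the region covered by blocks in every partition."""
    common = partitions[0]
    for partition in partitions[1:]:
        common = intersect(common, partition)
    return 100.0 * sum(end - start for start, end in common) / region_length


# Toy partitions of a 10 kb region (illustrative values only).
hb = [(0, 1000), (2000, 3000)]
ga = [(500, 2500)]
md = [(0, 10_000)]
shared = percent_common([hb, ga, md], 10_000)
```

Because the measure is based on shared sequence rather than on identical boundaries, two partitions can score highly even when none of their block boundaries coincide, which is the situation the text reports.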

Next, we compared the population partitions of HapBlock and Gabriel's method using all markers vs. all markers with a MAF of at least 10%. For HapBlock, 46% of the blocks inferred with SNPs with the higher MAF were broken up with the addition of rarer markers; however, 70.8% of the chromosome is common to both partitions. For Gabriel's method 73% of the sequence is common to partitions resulting from the two differing allele frequency conditions. Only 3% of Gabriel's method blocks were broken into smaller blocks with the addition of rarer SNPs.

African American population partitions using all chromosomes
A total of 1653 polymorphic sites with an average density of one SNP every 119 bp defined the 1000 haplotypes in our sample. Table 2 contains descriptive statistics for HapBlock and Gabriel's method partitions using all SNPs. HapBlock identified 232 blocks while Gabriel's method identified 40. Figure 2 displays the HapBlock and Gabriel's method population partitions using all SNPs. Gabriel's method resulted in a slightly larger sequence coverage of 88.6% compared with 85% for HapBlock. HapBlock identified a larger number of blocks of smaller size; however, 76% of the sequence was common to both partitions with no exact matching boundaries between them.

Using only SNPs with a frequency of at least 10% resulted in a total of 382 markers with an average spacing of one SNP every 521 bp. Table 2 also displays descriptive statistics for these block partitions. The number of blocks inferred by HapBlock dropped sharply to 61. For Gabriel's method the difference was smaller with a total of 36 blocks inferred. MDBlocks inferred the smallest number of blocks with 18, but had the largest average size. Percent coverage dropped for HapBlock and Gabriel's method to 76.2% and 80%, respectively. MDBlocks still included 90% of the region in blocks. When comparing all the partitions, 60% of the 200 kb region was common to blocks inferred by all three methods (see Table 1). HapBlock and Gabriel's method shared two matching boundaries, and HapBlock and MDBlocks shared one matching boundary. Figure 2 displays all three block partitions and shared block regions between each partition. Comparing the HapBlock and Gabriel partitions with the full marker set to the corresponding partition of the same method with rarer SNPs filtered out shows that there were common regions identified in both. For HapBlock 67% of the 200 kb region was common to blocks for both conditions. For

[Figure 2: block partition tracks. All SNPs: HB, GA. MAF >10%: HB, GA, MD. Shared Blocks. SNP location.]

Figure 2. African American population partitions. African American population partitions for all 1000 haplotypes are shown. Block algorithms are abbreviated as HB (HapBlock), GA (Gabriel's method), MD (MDBlocks). The first two tracks show resulting block partitions from HapBlock and Gabriel's method using all SNPs. The next set of three tracks displays resulting block partitions using all SNPs with a 10% MAF. The shared blocks track shows chromosomal regions common to all three block partitions using SNPs with a 10% MAF. The last two tracks show SNP positions of all SNPs and SNPs with at least a 10% MAF.


[Figure 3: block partition tracks. All SNPs: HB, GA. MAF >10%: HB, GA, MD. Shared Blocks. SNP location.]

Figure 3. East Asian population partitions. East Asian population partitions for all 1000 haplotypes are shown. Block algorithms are abbreviated as HB (HapBlock), GA (Gabriel's method), MD (MDBlocks). The first two tracks show resulting block partitions from HapBlock and Gabriel's method using all SNPs. The next set of three tracks displays resulting block partitions using all SNPs with a 10% MAF. The shared blocks track shows chromosomal regions common to all three block partitions using SNPs with a 10% MAF. The last two tracks show SNP positions of all SNPs and SNPs with at least a 10% MAF.

Gabriel's method 74% of the sequence is included in both partitions.

East Asian population partitions using all chromosomes
A total of 1649 SNPs with an average spacing of one SNP every 120 bp defined the 1000 Asian haplotypes. Table 2 shows descriptive statistics for HapBlock and Gabriel partitions. HapBlock identified 208 blocks, Gabriel's method inferred 41, and 70% of the chromosome was common to block regions inferred by both methods. No matching boundaries existed between the two partitions. Figure 3 shows HapBlock and Gabriel's method block partitions. The HapBlock partition inferred a larger number of blocks of smaller size. Coverage values for HapBlock and Gabriel's method were 86.5% and 81.3%, respectively.

Removing rarer SNPs and using only markers with a MAF of 10% or higher left 333 markers. There was a sharp drop in the number of blocks inferred by HapBlock with 57 blocks compared to 208 when using the full marker set. Gabriel's method inferred 38 blocks. MDBlocks inferred the fewest with 18. None of the partitions shared the same set of SNPs for a block boundary. Table 2 shows descriptive statistics for the resulting block partitions. The amount of sequence coverage drops for two of the methods: 74% for HapBlock and 65.3% for Gabriel's method. Coverage for MDBlocks remains at 90%. The shared block regions between all three methods shown in Figure 3 account for 46% of the chromosomal region.

Population partitions at other conditions
Descriptive statistics for population partitions of each method at other density and MAF conditions are shown in additional file 7.

Bootstrap partitions using all markers with frequency of ≥10%
To assess variation in block structure on more realistic sample sizes (i.e. sample sizes that are being obtained by re-sequencing) we bootstrap subsampled 96 or 24 chromosomes from our original set 1000 times. Figure 4 shows the block partitions resulting from HapBlock for the first 50 individual bootstrap subsamples of size 96 using all SNPs with a MAF of at least 10%. It was clearly evident that the block structure varied between the bootstrap subsamples and the population partition. To find SNPs that were consistently inferred together in blocks above a threshold frequency across all bootstrap subsamples we defined consensus block partitions for HapBlock for threshold values from 100 to 50 percent. (For more details on consensus blocks see Methods.) As the threshold for defining a consensus block is lowered, the physical length of a block increases monotonically and blocks defined at higher thresholds are combined. Table 3 shows the percentage of chromosomal region common to both the population partition and consensus block partitions of HapBlock using only SNPs with a MAF of at least 10%.

Gabriel's method and MDBlocks partitions also showed within population variation in block structure. (See additional files 1 and 2.) Table 3 also contains the percentage of total sequence common between the population partitions of Gabriel's method and MDBlocks, and each consensus block definition. Similar to the HapBlock results, as the threshold for defining a consensus block is lowered, the amount of block regions common to both partitions increased. Of the three methods, MDBlocks consensus blocks had the greatest amount of total sequence in common with the population partition. Table 4 shows the percentage of total sequence common to all three consensus block definitions at each threshold value. Figure 5 displays consensus blocks from each algorithm defined at an 80% threshold, and block regions common to all three


[Figure 4: tracks, top to bottom: Population Partition; Consensus blocks at thresholds 100, 90, 80, 70, 60, 50; Bootstrap subsamples of 96 individuals.]

Figure 4. European HapBlock consensus and bootstrap block partitions. European HapBlock consensus and bootstrap partitions using all SNPs with at least a 10% MAF are shown. The first track shows the population partition using all 1000 chromosomes followed by consensus blocks defined at thresholds of 100-50% from bootstrap samples of size 96. The next set of tracks are the first 50 individual bootstrap HapBlock partitions.
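The bootstrap procedure behind Figure 4 draws a subsample of haplotypes and then filters the markers by minor allele frequency and density before partitioning. The Python sketch below illustrates that preparation step under several assumptions that the text does not settle: haplotypes are 0/1 strings, sampling is done with replacement, and density thinning is a simple greedy pass that keeps a site only if it lies at least min_spacing bp from the previously kept site. All function names are hypothetical.

```python
import random


def bootstrap_subsample(haplotypes, size, rng):
    """Draw `size` haplotypes with replacement (an assumption of this sketch)."""
    return [rng.choice(haplotypes) for _ in range(size)]


def minor_allele_freq(sample, site):
    """Minor allele frequency at one column of a 0/1-encoded sample."""
    freq = sum(h[site] == "1" for h in sample) / len(sample)
    return min(freq, 1 - freq)


def filter_sites(sample, positions, maf_cutoff, min_spacing=0):
    """Return indices of sites to keep.

    Monomorphic sites are always dropped, sites below maf_cutoff are dropped,
    and the remainder is thinned greedily so that kept sites are at least
    min_spacing bp apart (the thinning rule is assumed, not from the paper).
    """
    kept, last_pos = [], None
    for site, pos in enumerate(positions):
        maf = minor_allele_freq(sample, site)
        if maf == 0 or maf < maf_cutoff:
            continue
        if last_pos is not None and pos - last_pos < min_spacing:
            continue
        kept.append(site)
        last_pos = pos
    return kept


# Toy data: four haplotypes, three SNPs at the given bp positions.
toy = ["110", "100", "100", "000"]
toy_positions = [100, 600, 5000]
kept_all = filter_sites(toy, toy_positions, maf_cutoff=0.10)
kept_thinned = filter_sites(toy, toy_positions, maf_cutoff=0.10, min_spacing=1000)
replicate = bootstrap_subsample(toy, 24, random.Random(0))
```

Dropping monomorphic sites inside the filter mirrors the remark in the Methods that such sites had to be removed from each subsample before partitioning.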

consensus blocks. While these common block regions cover only 40% of the 200 kb region in blocks, it was encouraging to find that our consensus block partitions overlapped.

Figures 6 and 7 show the average number of blocks and base pairs per block of each partitioning algorithm tested for European haplotype bootstrap subsample sizes of 24 and 96 chromosomes for other density and MAF conditions. There is an inverse relationship between the number of blocks inferred and their average size in base pairs per block as SNP density increased. Coverage generally increased with an increased density of SNPs (see supplementary Figure 3).

Similar patterns for African American and East Asian bootstraps were found. Variation in block structure between bootstrap samples existed. The same pattern of an inverse relationship between the number of blocks and their average size as SNP density increased remained. As the threshold for a consensus block is lowered, the percentage of sequence common between the population block partition increased monotonically. Also, as the threshold is lowered, there is a greater percentage of total sequence common to all consensus blocks defined from each method. (Data not shown).

Table 3: European population and consensus block overlap. Percentage of total sequence common to each method's consensus blocks defined from bootstrap subsamples of 96 chromosomes and population partition using all SNPs with at least a 10% MAF.

Consensus threshold   HapBlock   Gabriel's method   MDBlocks
100                   20%        17%                31%
90                    55%        45%                66%
80                    72%        53%                76%
70                    77%        57%                78%
60                    78%        61%                81%
50                    79%        65%                81%

Discussion
We generated three populations of haplotypes via coalescent simulations to assess the performance of three block partitioning algorithms under different marker density and allele frequency conditions. Each of the block algorithms employed in this study partitions a genomic region into haplotype blocks using vastly different approaches.


Table 4: Consensus block overlap. Percentage of total sequence common to all three European consensus blocks defined from bootstrap subsamples of 96 European haplotypes using all SNPs with at least 10% MAF.

Consensus threshold   % common sequence
100                   9%
90                    30%
80                    40%
70                    45%
60                    51%
50                    57%

In addition to the three algorithms described here, there are other definitions for haplotype blocks not examined [16,17]. Despite all these algorithms, there is no widely accepted definition of how to best define haplotype blocks [23].

The descriptive statistics of each population block partition using all 1000 chromosomes clearly show that results are different in number, size, and coverage of inferred blocks, particularly with a higher density of markers. HapBlock generally inferred the largest number of blocks of smallest size and MDBlocks inferred the fewest number of blocks of largest size. While there are few exact matching block boundaries between different partitions, there is a large amount of common block regions between them. Increasing the density of markers had a more dramatic effect on the percent coverage for Gabriel's method than the other two methods due to the fact that LD patterns are sensitive to marker density and can change with the addition of more markers [15]. The amount of coverage, in turn, influences the percentage of total sequence common to all partitions since there is a greater chance of overlap between them.

To assess within population variation in block structure we bootstrap subsampled haplotypes of sizes 24 or 96 chromosomes. The descriptive statistics of the bootstrap partitions indicate that the number of inferred blocks increases as a higher density of markers is used. Also, the average number of base pairs per block decreases with a higher density of markers, hence there is an inverse relationship between the number of blocks inferred and their physical size. Removing rarer SNPs does not necessarily decrease the number of blocks inferred for each of the methods when conditioning on a density value. This result is in contrast to the results of Schulze et al. [14], who found that removing rarer SNPs decreased the number of blocks inferred by HapBlock and Gabriel's method. This may be a result of the stochastic nature of the coalescent. To get a clearer picture of the effect of allele frequency on the performance of block partitioning algorithms, it may require the simulation of many more genealogies.

Our consensus block definitions attempt to identify SNPs consistently inferred together in blocks across all bootstrap replicates. The amount of common block regions between the population partition and consensus block definitions from bootstrap samples depends heavily on the threshold to define a consensus block, as well as the percent coverage of the bootstrap and population partitions. If a significant proportion of the chromosome is inferred in blocks in both the population and consensus definitions, there is a greater chance of finding common block regions. However, as discussed earlier, this attribute is influenced by SNP density and allele frequency of markers.

For Gabriel's method, the number, size, and coverage of inferred blocks varied dramatically between the bootstrap samples and population partitions. The average number of blocks inferred from Gabriel's method for bootstrap samples of size 24 and 96 of European haplotypes, using all markers with a MAF of at least 10%, was 18.06 and 30.76, respectively. On average 20.1% and 68.1% of the 200 kb region were inferred in blocks. These numbers differ from the population partition numbers of 33 blocks and 79.8%. These disparate numbers illustrate the effect sample size has on estimating confidence bounds of D'. This also explains the fact that in certain bootstrap samples, Gabriel's method failed to infer any blocks. The percent overlap between the consensus blocks defined from bootstrap samples of size 24 and the population partition never exceeds 16%, even at the most liberal consensus

[Figure 5: tracks, top to bottom: Shared Blocks; Consensus blocks at threshold 80 for MD, HB, and GA; SNP location.]

Figure 5. Common European consensus block regions. Overlapping consensus block regions from each consensus block defined from MDBlocks (MD), HapBlock (HB), and Gabriel's method (GA). Consensus blocks shown from each method are defined at a threshold of 80% using all SNPs with at least a 10% MAF. The SNP positions are shown in the last track.
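To make the idea of thresholded consensus blocks concrete, here is a Python sketch of one possible rule. It is a simplification invented for this example and is not the formal definition given in the paper's Methods: two adjacent SNPs are joined when they fall in the same block in at least a threshold fraction of bootstrap partitions, and a consensus block is a maximal run of joined SNPs. Blocks are assumed to be (first, last) inclusive SNP-index pairs over a common SNP set, and the function name is hypothetical.

```python
def consensus_blocks(partitions, n_snps, threshold):
    """Consensus blocks under a simplified adjacency rule (see lead-in).

    partitions: list of bootstrap partitions, each a list of (first, last)
                inclusive SNP-index blocks.
    threshold:  fraction of partitions (0..1) that must place two adjacent
                SNPs in the same block for them to be joined.
    """
    # together[k] counts partitions in which SNP k and SNP k+1 share a block.
    together = [0] * (n_snps - 1)
    for blocks in partitions:
        for first, last in blocks:
            for k in range(first, last):
                together[k] += 1

    needed = threshold * len(partitions)
    consensus, start = [], None
    for k, count in enumerate(together):
        if count >= needed:
            if start is None:
                start = k
        elif start is not None:
            consensus.append((start, k))      # run ends at SNP k
            start = None
    if start is not None:
        consensus.append((start, n_snps - 1))
    return consensus


# Four toy bootstrap partitions over four SNPs (illustrative values only).
toy_partitions = [[(0, 3)], [(0, 1), (2, 3)], [(0, 3)], [(0, 2)]]
strict = consensus_blocks(toy_partitions, 4, 1.0)
relaxed = consensus_blocks(toy_partitions, 4, 0.75)
```

Under this rule, lowering the threshold can only join more adjacent SNPs, so consensus blocks grow and merge as the threshold drops, which matches the monotonic behaviour described in the Results.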


threshold of 50%. For consensus blocks defined from bootstrap sample sizes of 96, the percent intersection with the population partition is 53% even at the fairly high threshold of 80%. For HapBlock and MDBlocks, differences between average coverage values from the bootstrap experiments to the population partition are not as large, and they showed a larger percentage of sequence intersection between the consensus blocks.

Comparing the consensus blocks for one method to the population partition of the same method addresses the block structure variation within a particular algorithm. To find common block regions in bootstrap subsamples from differing algorithms, we found the overlapping boundaries between consensus block regions from each algorithm. For certain density, MAF conditions, and consensus block thresholds, there was very low or non-existent overlap. These numbers can be severely reduced if a particular method fails to infer a large number of blocks covering a significant portion of sequence, as was the case for Gabriel's method at the sparsest marker density of 5 kb. When using all SNPs with a 10% MAF for European haplotypes, the percentage overlap between all consensus blocks ranges from 9% to 57% depending on the consensus threshold. Finding block regions common to all three methods is an encouraging sign because each algorithm takes a different approach to the block partitioning problem. If the haplotype block paradigm is an accurate description of underlying LD patterns of the human genome, different algorithms should find common block regions since the three methods base their algorithms on various attributes of the paradigm.

Rather than searching for exact matching boundaries using the Schwartz concordance test statistic as a measure of block concordance [24], we chose to compute the percentage of common block regions between two different block partitions as our metric of concordance. While using the block concordance test statistic is a valid approach, the method cannot assess the significance of block boundaries which may differ by a few SNPs, but still have a significant degree of overlap between block regions. There was only one matching boundary between each possible pair of partitions using all 1000 European haplotypes using all SNPs with a MAF of 10%. However, 57% of the 200 kb region was common to blocks defined from all three methods.

In our analysis we focused on the number, size, and coverage of haplotype blocks inferred by three different algorithms. We do not discuss tag SNP identification because we view it as a separate problem. However, it should be pointed out that the dynamic programming approach of HapBlock is closely tied to tag SNPs because it defines blocks which minimize the number of SNPs needed to distinguish common haplotypes within a block. Recently, a method formulated by Halldorsson et al. [25] selects tag SNPs without requiring a haplotype block definition. Also, the tag SNP algorithm LDselect [26] chooses tag SNPs independent of chosen haplotype block boundaries.

Another point to address in our study design is that the simulated haplotypes used were derived from a single realization of a coalescent simulation, hence our study does not address genetic sampling [27]. Since we bootstrap subsampled 96 or 24 individuals from a population of 1000, we fix the genetic history of our data set and focus on the effect of statistical sampling on the performance of block partitioning algorithms used in this study. We also chose not to vary recombination rate or incorporate recombination hotspots in our simulations since we only analyzed a 200 kb region. Due to these limitations, we did not compare populations to each other. Rather, we examined the trends seen in each population and used coalescent simulations with three different population histories to ensure that the results from the three block partitioning algorithms were not due to the coalescent parameters chosen. The recent study of Ding et al. [28] addresses the effects of population genetic parameters, such as the mutation and recombination rate, on the diversity and LD based algorithms discussed here for multiple realizations of coalescent genealogies.

Conclusion
In summary, our results show that for the population partitions using all 1000 chromosomes, there is a varied range of number, size, and coverage of blocks between the different methods. The percentage of total sequence common to all three partitioning algorithms ranges from 3% to 61% depending on the population and is generally higher using a high density of SNPs with a wide range of MAF. Bootstrap sampling of haplotypes from the population shows there is within population variation in block structure for all three methods. Our consensus block definition attempts to define blocks based on sets of SNPs consistently found together in blocks across all bootstrap replicates. Using a higher density of markers there is an increased percentage of total sequence in common with consensus blocks and population partitions. The percentage of common block regions between consensus blocks defined from all three methods is influenced by the percent coverage of individual partitions, which itself is influenced by the density and allele frequency of markers that comprise the haplotypes to be partitioned. It is evident that each algorithm gave a different picture of haplotype block structure at differing density and MAF values and few, if any, exactly matching block boundaries existed. An open question that remains is how best to merge or integrate block definitions from different algorithms. For


Figure 6. Average number of blocks inferred, European bootstrap subsamples. Figure 6 shows 3-d bar plots of the average number of blocks inferred for HapBlock, Gabriel's method, and MDBlocks partitions on European bootstrap replicates of sizes 96 and 24 at each SNP density and MAF condition.

Figure 7. Average bp/block, European bootstrap subsamples. Figure 7 shows 3-d bar plots of the average number of base pairs per block inferred for HapBlock, Gabriel's method, and MDBlocks partitions on European bootstrap replicates of sizes 96 and 24 at each SNP density and MAF condition.

empirical studies, it is advisable to subject collected data to a variety of block algorithms and identify common block regions. If distinct partitioning algorithms show a large portion of overlap in inferred block regions, then these genomic regions can be investigated further to identify genetic variants causing disease.

Methods

Coalescent simulations
Haplotypes representing a 200 kb chromosomal region for three world populations, European, African American, and East Asian, were generated using an implementation of the standard coalescent with uniform recombination, which uses a population-specific demographic history. The demographic profiles for each of the three populations considered in this study were as determined by Marth et al. from The SNP Consortium genotype data [21]. These profiles are characterized by 3 effective population size epochs. For example, for the European population, we used an ancestral effective population size of 10,000 individuals, followed by a bottleneck phase of an effective population size of 2,000 lasting 500 generations, then an expansion to an effective population size of 20,000 starting 3,000 generations ago. The number of mutations occurring along a branch of the genealogy (lineage) is Poisson distributed, with a mean proportional to the branch length. The value of the (constant) mutation rate µ in the simulations was 2.5 × 10^-8. Since the effective population size is different within each of the three epochs of the demographic profiles, there is not a single value of θ, the scaled mutation rate. The equivalent values of θ that take into account the fluctuating effective population size are, for the European population: 9.85 × 10^-4; for African American and East Asian populations: 1.29 × 10^-3 and 1.03 × 10^-3, respectively. For each population, the haplotypes we used to examine the partitioning algorithms were drawn from a single realization of the coalescent.

The coalescent simulation software was implemented in Perl and run on a Sun Blade 1000 with dual 750 MHz Ultra Sparc III processors and 4.5 GB of RAM. To validate the correctness of the program 200 genealogies of 41 individuals for a 200 kb region were simulated. The average frequency spectrum was tabulated from these 200 simulations and plotted against the predicted spectra from Marth's mathematical formula. The results of the observed and predicted spectra for each population are shown in supplementary figures 4, 5, and 6.

Haplotype block partitioning algorithms
Three categories of block partitioning algorithms were used in the study: diversity based, LD based, and information theoretic. The software programs that implement each method are described below. All three programs were run on a Sun Blade 1000 with dual 750 MHz Ultra Sparc III processors and 4.5 GB of RAM.

HapBlock
HapBlock v2.1 is a diversity based algorithm that minimizes the number of SNPs that distinguish at least α percent of common haplotypes [8]. A haplotype block comprised of at least one SNP is defined if the number of common haplotypes represents at least α percent of all the observed haplotypes. A haplotype can be designated common

Page 10 of 13 (page number not for citation purposes) BMC Bioinformatics 2005, 6:303 http://www.biomedcentral.com/1471-2105/6/303

either by its frequency or the number of times represented in the set of observed haplotypes. We chose to designate a haplotype as common if it had a frequency of at least 10%. Hence, for our study we set α and β to 0.80 and 0.10, respectively. The program is available for download here: http://www.cmb.usc.edu/msms/HapBlock/.

Gabriel's Method
Gabriel's method [6] is implemented in the software Haploview v3.11 [12]. Gabriel's method defines pairs of SNPs to be in strong LD if the one-sided 95% D' confidence bound is between 0.7 and 0.98. The method defines a block if 95% of pairwise SNP comparisons are in strong LD. For our study Haploview was executed in command line mode to obtain partitions from Gabriel's method. Executing Gabriel's method on certain bootstrap samples generated a software error. Corresponding with the author of Haploview, we were not able to identify the cause of the error (Jeffery Barrett, personal communication). But for all bootstrap samples of haplotypes analyzed, this error was encountered less than 1% of the time. Haploview is available for download here: http://www.broad.mit.edu/mpg/haploview/index.php.

MDBlocks
MDBlocks v1.0 uses the Minimum Description Length (MDL) principle for defining blocks [9]. It considers the set of all possible block boundaries and finds the one with the minimum description length using two versions of a dynamic programming algorithm. The first is called the iterative dynamic programming algorithm (IDP). The second is a faster, but approximate method called the iterative approximate dynamic programming algorithm (IADP). Due to the number and size of haplotypes analyzed, we used the IADP option. MDBlocks ran out of computer memory when attempting to partition all 1000 haplotypes using all SNPs for each population when using the IADP algorithm. MDBlocks is available for download here: http://ib.berkeley.edu/labs/slatkin/eriq/software/mdb_web/.

Bootstrap subsampling
To assess variation in the number and size of blocks inferred by the three partitioning algorithms used in the study under differing values of sample size, SNP density, and MAF cutoffs, 1000 bootstrap subsamples of sizes 24 or 96 were drawn with replacement from the population. A true bootstrap sample is one that is the same size as the original sample (1000). Since we are making smaller samples of 96 or 24 chromosomes, it is more properly called a bootstrap subsample. Initially each bootstrap subsample contained the full set of SNPs, and was progressively filtered for each possible pair of SNP density and MAF cutoff values. Hence, the same set of individuals that make up a particular bootstrap subsample can be compared at differing density and MAF conditions. If a subsample contained a monomorphic site, it was removed prior to the initial filtering of density and allele frequency conditions. Since monomorphic sites do not contain any information, in an information theoretic sense, and information theory forms the basis for MDBlocks, the program would crash. Removing monomorphic SNPs in our bootstrapping routine solved this problem (Eric C. Anderson, personal communication).

Consensus block definition
To identify SNPs that are consistently inferred together in the same block across all bootstrap subsamples, we introduce the idea of a consensus block. Let the collection P = p1, ..., p1000 be the collection of bootstrap partitions resulting from a particular method. Let S be the set of SNPs that comprise the haplotypes. For each SNPi and SNPi+1, we calculate how often they are assigned to the same block across all bootstrap samples. We call this the neighbor probability. We define a consensus block as a collection of consecutive SNPs whose neighbor probability is greater than or equal to some threshold percentage t, for t = 100, 90, 80, 70, 60, 50. Consensus blocks were defined for each of the density and MAF conditions for bootstrap subsamples of sizes 24 and 96. As described in the previous section, if a bootstrap subsample initially contained a monomorphic site, it was removed. However, this leads to the situation that not all bootstrap replicates may contain the same SNPs. To calculate consensus blocks, we then take the union of all markers used across all bootstrap replicates and proceed to calculate neighbor probability. If a particular SNP was not used in a particular bootstrap, and is a member of the union set of SNPs, its block assignment was treated as missing data and imputed in the following way. If the adjacent markers to the left and to the right of the missing marker were assigned to the same block number, then the missing SNP in question was assigned to the same block.

Data storage
All data regarding coalescent-derived haplotypes (SNP positions, allele frequencies, etc), block partitions (number of blocks inferred, block boundaries, etc), and consensus block definitions were stored in tables in a MySQL v3.23 database.

Visualization of block partitions
Block partitions were visualized in the UCSC Genome Browser [29].

Block partition intersection
Finding the common regions between different block partitions was achieved by executing the appropriate MySQL


query on tables holding information for block partition boundaries. For a subset of density and MAF conditions (all SNPs, all SNPs with a 10% MAF) correctness of the database query was verified by using the sequence intersection feature of the UCSC Table Browser [30].

Authors' contributions
MO conceived the original experimental question. GTM provided the source code for coalescent simulations. ARI, PT, and MO formulated the idea of consensus blocks. CAS offered helpful advice on data analysis and implementation. ARI executed the block partitioning programs, collected and analyzed the data, and wrote the paper with editorial comments and modifications from MO, GTM, CAS, and PT.

Additional material

Additional File 1
Supplementary Figure 1 shows Gabriel's method consensus and bootstrap partitions using all SNPs with at least a 10% MAF for European haplotypes. The first track shows the population partition using all 1000 chromosomes followed by consensus blocks defined at thresholds of 100-50% from bootstrap samples of size 96. The next set of tracks are the first 50 individual bootstrap Gabriel's method partitions.
[http://www.biomedcentral.com/content/supplementary/1471-2105-6-303-S1.ppt]

Additional File 2
Supplementary Figure 2 shows MDBlocks consensus and bootstrap partitions using all SNPs with at least a 10% MAF for European haplotypes. The first track shows the population partition using all 1000 chromosomes followed by consensus blocks defined at thresholds of 100-50% from bootstrap samples of size 96. The next set of tracks are the first 50 individual bootstrap MDBlocks partitions.
[http://www.biomedcentral.com/content/supplementary/1471-2105-6-303-S2.ppt]

Additional File 3
Supplementary Figure 3 shows 3-d bar plots of the average coverage of HapBlock, Gabriel's method, and MDBlocks partitions on European bootstrap replicates of sizes 96 and 24 at each SNP density and MAF condition.
[http://www.biomedcentral.com/content/supplementary/1471-2105-6-303-S3.ppt]

Additional File 4
Supplementary Figure 4 shows the validated European allele frequency spectrum (AFS). The average folded AFS from 200 coalescent genealogies of 41 individuals is plotted in green. The predicted AFS from Marth's mathematical formula is shown in red.
[http://www.biomedcentral.com/content/supplementary/1471-2105-6-303-S4.ppt]

Additional File 5
Supplementary Figure 5 shows the validated African American allele frequency spectrum (AFS). The average folded AFS from 200 coalescent genealogies of 41 individuals is plotted in green. The predicted AFS from Marth's mathematical formula is shown in red.
[http://www.biomedcentral.com/content/supplementary/1471-2105-6-303-S5.ppt]

Additional File 6
Supplementary Figure 6 shows the validated East Asian allele frequency spectrum (AFS). The average folded AFS from 200 coalescent genealogies of 41 individuals is plotted in green. The predicted AFS from Marth's mathematical formula is shown in red.
[http://www.biomedcentral.com/content/supplementary/1471-2105-6-303-S6.ppt]

Additional File 7
Supplementary file 7 is an Excel sheet containing descriptive statistics for population partitions using all 1000 haplotypes for European, African American, and East Asian populations for all SNP density and MAF conditions.
[http://www.biomedcentral.com/content/supplementary/1471-2105-6-303-S7.xls]

Acknowledgements
We thank Vasant Marar and Meeta Oberoi for excellent computer support.

References
1. Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry S: A map of genome sequence information containing 1.42 million single nucleotide polymorphisms. Nature 2001, 409:187-196.
2. Weiss KM, Clark AG: Linkage disequilibrium and mapping of complex human traits. Trends in Genetics 2002, 18:19-24.
3. Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, McDonough DP, Nguyen BTN, Norris MC, Sheehan JB, Shen N, Stern D, Stokowski RP, Thomas DJ, Trulson MO, Vyas KR, Frazer KA, Fodor SPA, Cox DR: Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21. Science 2001, 294:1719-1723.
4. Olivier M, Bustos VI, Levy MR, Smick GA, Moreno I, Bushard JM, Almendras AA, Sheppard K, Zierten DL, Aggarwal A, Carlson CS, Foster BD, Vo N, Kelly L, Liu X, Cox DR: Complex High-Resolution Linkage Disequilibrium and Haplotype Patterns of Single-Nucleotide Polymorphisms in 2.5 Mb of Sequence on Human Chromosome 21. Genomics 2001, 78:64-72.
5. Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES: High-resolution haplotype structure in the human genome. Nature Genetics 2001, 29:229-232.
6. Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The Structure of Haplotype Blocks in the Human Genome. Science 2002, 296:2225-2229.
7. Hirschorn JN, Daly MJ: Genome-wide Association Studies for Common Diseases and Complex Traits. Nature Reviews Genetics 2005, 6:95-108.
8. Zhang K, Deng M, Chen T, Waterman MS, Sun F: A dynamic programming algorithm for haplotype block partitioning. PNAS 2002, 99:7335-7339.


9. Anderson EC, Novembre J: Finding Haplotype Block Boundaries by Using the Minimum-Description-Length Principle. American Journal of Human Genetics 2003, 73:336-354.
10. Zhang K, Qin Z, Chen T, Liu JS, Waterman MS, Sun F: HapBlock: haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms. Bioinformatics 2005, 21:131-134.
11. Nordborg M, Tavare S: Linkage disequilibrium: What history has to tell us. Trends in Genetics 2002, 18:83-90.
12. Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 2005, 21:263-265.
13. Hansen MH, Yu B: Model Selection and the Principle of Minimum Description Length. Journal of the American Statistical Association 2001, 96:746-774.
14. Schulze TG, Zhang K, Chen YS, Akula N, Sun F, McMahon FJ: Defining haplotype blocks and tag single-nucleotide polymorphisms in the human genome. Human Molecular Genetics 2004, 13:335-342.
15. Ke X, Hunt S, Tapper W, Lawrence R, Stavrides G, Ghori J, Whittaker P, Collins A, Morris AP, Bentley D, Cardon LR, Deloukas P: The impact of SNP density on fine-scale patterns of linkage disequilibrium. Human Molecular Genetics 2004, 13:577-588.
16. Wang N, Akey JM, Zhang K, Chakraborty R, Jin L: Distribution of Recombination Crossovers and the Origin of Haplotype Blocks: The Interplay of Population History, Recombination, and Mutation. American Journal of Human Genetics 2002, 71:1227-1234.
17. Phillips M, Lawrence R, Sachidanandam R, Morris A, Balding D, Donaldson M, Studebaker J, Ankener W, Alfisi S, Kuo FS, Camisa A, Pazorov V, Scott K, Carey B, Faith J, Katari G, Bhatti H, Cyr J, Derohannessian V, Elosua C, Forman A, Grecco N, Hock C, Kuebler J, Lathrop J, Mockler M, Nachtman E, Restine S, Varde S, Hozza M, Gelfand C, Broxholme J, Abecasis G, Boyce-Jacino M, Cardon L: Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nature Genetics 2003, 33:382-387.
18. Carlson CS, Eberle MA, Rieder MJ, Smith JD, Kruglyak L, Nickerson DA: Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nature Genetics 2003, 33:518-521.
19. Consortium TEP: The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004, 306:636-640.
20. International HapMap Consortium T: The International HapMap Project. Nature 2003, 426:789-796.
21. Marth G, Schuler G, Yeh R, Davenport R, Agarwala R, Church D, Wheelan S, Baker J, Ward M, Kholodov M, Phan L, Czabarka E, Murvai J, Cutler D, Wooding S, Rogers A, Chakravarti A, Harpending HC, Kwok PY, Sherry ST: Sequence variations in the public genome data reflect a bottlenecked population history. PNAS 2003, 100:376-381.
22. Pritchard JK, Wall J: Assessing the Performance of the Haplotype Block Model of Linkage Disequilibrium. American Journal of Human Genetics 2003, 73:502-515.
23. Bafna V, Halldorsson BV, Schwartz R, Clark AG, Istrail S: Haplotypes and Informative SNP Selection Algorithms: Don't Block Out Information. RECOMB 2003:19-27.
24. Schwartz R, Halldorsson BV, Bafna V, Clark AG, Istrail S: Robustness of Inference of Haplotype Block Structure. Journal of Computational Biology 2003, 10:13-19.
25. Halldorsson BV, Bafna V, Lippert R, Schwartz R, Vega FMDL, Clark AG, Istrail S: Optimal Haplotype Block-Free Selection of Tagging SNPs for Genome-Wide Association Studies. Genome Research 2004, 14:1633-1640.
26. Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a Maximally Informative Set of Single-Nucleotide Polymorphisms for Association Analyses Using Linkage Disequilibrium. The American Journal of Human Genetics 2004, 74:106-120.
27. Weir BS: Genetic Data Analysis II. Sinauer Associates; 1996.
28. Ding K, Zhou K, Zhang J, Knight J, Zhang X, Shen Y: The Effect of Haplotype-Block Definitions on Inference of Haplotype-Block Structure and htSNPs Selection. Molecular Biology and Evolution 2005, 22:148-159.
29. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The Human Genome Browser at UCSC. Genome Research 2002, 12:996-1006.
30. Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC Table Browser data retrieval tool. Nucleic Acids Research 2004, 32(Suppl 1):D493-D496.
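The consensus block definition given in the Methods (neighbor probability of adjacent SNPs across bootstrap partitions, thresholded at t) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which stored partitions in MySQL tables. The function names, the list-of-labels input layout, and the use of `None` for a SNP outside any block are assumptions made for illustration, and the missing-data imputation step is omitted.

```python
def neighbor_probability(partitions):
    """Fraction of bootstrap partitions placing SNP s and SNP s+1 in one block.

    partitions[b][s] is the block label of SNP s in bootstrap partition b,
    or None when the SNP is in no block (an assumed encoding).
    """
    n_snps = len(partitions[0])
    probs = []
    for s in range(n_snps - 1):
        same = sum(1 for p in partitions
                   if p[s] is not None and p[s] == p[s + 1])
        probs.append(same / len(partitions))
    return probs


def consensus_blocks(probs, threshold):
    """Runs of consecutive SNPs whose neighbor probability is >= threshold.

    Returns (first_snp, last_snp) index pairs, both inclusive.
    """
    blocks, start = [], None
    for s, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = s          # pair (s, s+1) opens a new run
        elif start is not None:
            blocks.append((start, s))  # run ended at SNP s
            start = None
    if start is not None:
        blocks.append((start, len(probs)))
    return blocks


# Four hypothetical bootstrap partitions of six SNPs.
example = [
    [1, 1, 1, None, 2, 2],
    [1, 1, 2, 2, 3, 3],
    [1, 1, 1, 1, 2, 2],
    [1, 1, 1, None, 2, 2],
]
example_probs = neighbor_probability(example)
```

Lowering the threshold can only merge or extend blocks, which is why the supplementary tracks show consensus blocks growing as t goes from 100% to 50%.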


Copyright © 2004 by the Genetics Society of America

The Allele Frequency Spectrum in Genome-Wide Human Variation Data Reveals Signals of Differential Demographic History in Three Large World Populations

Gabor T. Marth,1 Eva Czabarka, Janos Murvai and Stephen T. Sherry National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894 Manuscript received April 15, 2003 Accepted for publication September 4, 2003

ABSTRACT We have studied a genome-wide set of single-nucleotide polymorphism (SNP) allele frequency measures for African-American, East Asian, and European-American samples. For this analysis we derived a simple, closed mathematical formulation for the spectrum of expected allele frequencies when the sampled populations have experienced nonstationary demographic histories. The direct calculation generates the spectrum orders of magnitude faster than coalescent simulations do and allows us to generate spectra for a large number of alternative histories on a multidimensional parameter grid. Model-fitting experiments using this grid reveal significant population-specific differences among the demographic histories that best describe the observed allele frequency spectra. European and Asian spectra show a bottleneck-shaped history: a reduction of effective population size in the past followed by a recent phase of size recovery. In contrast, the African-American spectrum shows a history of moderate but uninterrupted population expansion. These differences are expected to have profound consequences for the design of medical association studies. The analytical methods developed for this study, i.e., a closed mathematical formulation for the allele frequency spectrum, correcting the ascertainment bias introduced by shallow SNP sampling, and dealing with variable sample sizes provide a general framework for the analysis of public variation data.
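The closed formulation mentioned in the abstract is stated as Equation 1 in the Methods, with normalization and folding given by Equations 2 and 3. The following is a minimal Python sketch of those three equations, not the authors' C implementation. The function names are hypothetical, the standard-library `decimal` module stands in for the "high-accuracy numeric libraries with settable numeric precision" the paper mentions, and the epoch ordering (epoch 1 is the most recent) is an assumption read off the first term of Equation 1.

```python
from decimal import Decimal, localcontext
from math import comb


def expected_counts(n, sizes, durations, mu, precision=60):
    """E(Psi_i), i = 1..n-1, under Equation 1.

    sizes[m] is N_{m+1}, most recent epoch first (assumed ordering);
    durations[m] is T_{m+1}. The oldest epoch needs no duration.
    """
    with localcontext() as ctx:
        ctx.prec = precision  # guards the alternating sums against cancellation
        mu = Decimal(str(mu))
        N = [Decimal(s) for s in sizes]
        # Normalized epoch boundary times tau*_m = sum_{l<=m} T_l / (2 N_l).
        taus, acc = [], Decimal(0)
        for N_l, T_l in zip(N[:-1], durations):
            acc += Decimal(T_l) / (2 * N_l)
            taus.append(acc)

        def inner_sum(k, tau):
            # sum_{j=k}^{n} exp(-C(j,2) tau) * prod_{l != j, k <= l <= n}
            #   l(l-1) / (l(l-1) - j(j-1))
            total = Decimal(0)
            for j in range(k, n + 1):
                prod = Decimal(1)
                for l in range(k, n + 1):
                    if l != j:
                        prod *= Decimal(l * (l - 1)) / Decimal(l * (l - 1) - j * (j - 1))
                total += (-Decimal(comb(j, 2)) * tau).exp() * prod
            return total

        counts = [4 * mu * N[0] / i for i in range(1, n)]
        for m, tau in enumerate(taus):
            s_k = {k: inner_sum(k, tau) for k in range(2, n + 1)}
            for i in range(1, n):
                weighted = sum((comb(n - k, i - 1) * s_k[k]
                                for k in range(2, n + 1)), Decimal(0))
                counts[i - 1] += (4 * mu * (N[m + 1] - N[m]) / i
                                  / comb(n - 1, i) * weighted)
        return counts


def spectrum(counts):
    """Equation 2: normalize the expectations to P_n(i), i = 1..n-1."""
    total = sum(counts, Decimal(0))
    return [c / total for c in counts]


def folded(p):
    """Equation 3 as printed: P_n(i) + P_n(n - i) for i <= n/2."""
    n = len(p) + 1
    return [p[i - 1] + p[n - i - 1] for i in range(1, n // 2 + 1)]
```

With a single epoch the sketch reduces to the stationary result attributed to Fu (1995), 4μN/i, and a zero-length recent epoch reproduces the older epoch's stationary spectrum, which gives two cheap consistency checks on the summation.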

Genetics 166: 351–372 (January 2004)

1 Corresponding author: Department of Biology, Boston College, 140 Commonwealth Ave., Chestnut Hill, MA 02467. E-mail: [email protected]

THE analysis of statistical distributions of genetic variations has a rich history in classical population genetic studies (Crow and Kimura 1970), and recent genome-scale data collection projects have positioned the field to apply, challenge, and improve traditional theory by examining data from thousands of loci simultaneously. The two most frequently studied distributions of nucleotide sequence variation are the marker density (MD), or mismatch distribution (Li 1977; Rogers and Harpending 1992; i.e., the distribution of the number of polymorphic sites observed when a collection of sequences of a given length are compared), and the allele frequency spectrum (AFS; Ewens 1972; i.e., the distribution of diallelic polymorphic sites according to the number of chromosomes that carry a given allele within a sample). The latter distribution is immediately applicable to the genotype data produced by projects that are characterizing a large subset of currently available single-nucleotide polymorphisms (SNPs) with measures of individual allele counts (genotypes) for three ethnic populations (http://snp.cshl.org/allele_frequency_project/). In addition to data availability, the AFS has other, analytical advantages over MD data, most notably its independence from the effects of recombination or mutation rate heterogeneity as we show below.

Modeling the distribution of allele frequency: Prior study of the AFS has been restricted to properties of summary statistics such as Tajima's D (Tajima 1989), or the proportion of rare- to medium-frequency alleles (Fu and Li 1993). There has been very little analysis of the general shape of observed spectral distributions. The analytical shape of the AFS, under a stationary history of constant effective population size, was derived by Fu (1995) who showed that, within n samples, the expected number of mutations of size i is inversely proportional to i. Important properties of the coalescent process under deterministically changing population size have been derived in publications of Griffiths and Tavare (1994a,b) and Tavare et al. (1997). These results show that, for the purposes of genealogy, varying population size can be treated by appropriate scaling of the coalescent time. Applying these results to obtain a formula for the allele frequency spectrum is not trivial, however, because mutations occur in nonscaled time. More recently, Wooding and Rogers (2002) derived a method called the matrix coalescent that overcomes these difficulties and calculates the AFS under arbitrarily changing population size histories. Their approach solves the problem for the general case, but leads to an involved computational procedure requiring numerical matrix inversion. In this study, we have taken a different approach. By extending Fu's result from a stationary population history to a more general shape, a profile of demographic history characterized by an arbitrary number of epochs such that the effective population size is constant within each epoch, we have arrived at a very simple, easily computable formula for the AFS. The price we pay is the lack of generality of arbitrary shapes. In many practical situations, however, these shapes can be approximated by a piecewise constant effective size profile. The advantage is a formulation that permits very rapid generation of AFS under a large number of competing histories for accurate data fitting and hypothesis testing.

This result is applicable when the sites under consideration are selected randomly and the number of successfully genotyped samples is identical at each site. For the data set we are considering both of these assumptions are violated. First, the sites in question were selected for the population allele frequency characterization of a large subset of SNPs from a genome-wide map (Sachidanandam et al. 2001) of SNPs discovered by computational means, in large mining efforts in the public (Altshuler et al. 2000; Mullikin et al. 2000; Lander et al. 2001; Marth et al. 2003) and private (Venter et al. 2001) domains, numbering millions of sites. Common in these efforts is that SNP discovery was carried out in samples of a small number of chromosomes (two or three). The samples used in the discovery phase were different from the samples used in the consequent genotype characterization experiments, and they represented an unknown mixture of ethnicities. Second, because of genotyping failures, the number of successful genotypes varies from site to site, raising the question of how to compare allele counts across these sites. In this work, we propose methods to deal with these practical problems. The resulting suite of tools enables us to analyze the shape of the AFS observed in the data directly and to evaluate competing scenarios of demographic history on the basis of how well they fit the observations.

Demographic history: The reconstruction of human demographic history is of direct biological and anthropological interest. Additionally, the history of effective population size has a profound effect on important quantities such as the extent of linkage disequilibrium and is therefore important for medical association studies. There have been many attempts for demographic inference from contemporary molecular data representing different molecular mutation systems such as mitochondrial DNA polymorphisms (Di Rienzo and Wilson 1991; Rogers and Harpending 1992; Sherry et al. 1994; Ingman et al. 2000), microsatellites (Di Rienzo et al. 1998; Kimmel et al. 1998; Reich and Goldstein 1998; Relethford and Jorde 1999; Gonser et al. 2000; Zhivotovsky et al. 2000), and, more recently, SNPs in nuclear DNA (Harding et al. 1997; Clark et al. 1998; Cargill et al. 1999; Zhao et al. 2000; Reich et al. 2001; Sachidanandam et al. 2001; Yu et al. 2001). For both global samples of human diversity, or specific subpopulations, practically all possible simple shapes of population history have been proposed: constant effective size (stationary history), growth relative to an ancestral effective size (population expansion), size reduction (collapse), and bottleneck (a phase of size reduction followed by a phase of growth or recovery); see Figure 1. These claims as well as the underlying data have been reviewed by various authors (Harpending and Rogers 2000; Wall and Przeworski 2000; Jorde et al. 2001; Rogers 2001; Ptak and Przeworski 2002; Tishkoff and Williams 2002). It is generally agreed that variation patterns in mitochondrial DNA show rapid expansion of effective size in all human populations. Results in microsatellite data are less unanimous about which populations experienced expansion or what the magnitude and starting time of such demographic events were. Recent studies of SNP data sets in nuclear DNA propose the possibility of a population collapse to explain reduced haplotype diversity (Clark et al. 1998; Reich et al. 2001, 2002; Gabriel et al. 2002), especially in samples of European ancestry, a hypothesis consistent with our observations in the current data set.

METHODS

Allele frequency spectrum under stepwise constant effective population size: We show that, for a population evolving under the Wright-Fisher model, and under selective neutrality, the expectation for the number of mutations $\Psi_i$ of size i, within a sample of n chromosomes under a demographic history of multi-epoch, piecewise constant effective population size is

$$
E(\Psi_i) = \frac{4\mu N_1}{i} + \sum_{m=1}^{M-1} \frac{4\mu (N_{m+1} - N_m)}{i} \binom{n-1}{i}^{-1} \sum_{k=2}^{n} \binom{n-k}{i-1} \sum_{j=k}^{n} e^{-\binom{j}{2}\tau^*_m} \prod_{\substack{l:\, l \neq j; \\ k \le l \le n}} \frac{l(l-1)}{l(l-1) - j(j-1)}, \tag{1}
$$

where $\mu$ is the (constant) per-locus mutation rate, $N_m$ is the effective population size in epoch m, $T_m$ is the corresponding epoch duration, and $\tau^*_m = \sum_{l=1}^{m} T_l / 2N_l$, the normalized epoch boundary time. A detailed derivation of this result is given in the appendix. The normalized distribution of these expectations according to the frequency is the allele frequency spectrum:

$$
P_n(i) = \Pr(\text{a given segregating site is size } i \text{ in } n \text{ samples}) = \frac{E(\Psi_i)}{\sum_{j=1}^{n-1} E(\Psi_j)}, \quad i = 1, \ldots, n-1. \tag{2}
$$

It is sometimes useful to consider the "full" allele frequency spectrum, $P^{\text{full}}_n(i)$, considering sizes 0 and n, i.e., when all samples carry the ancestral or the derived allele, respectively. We have verified the accuracy of the complete allele frequency spectrum derived from this formulation by coalescent simulations (supplemental Figure S1 at http://www.genetics.org/supplemental/). Three important properties of the allele frequency spectrum are clear from Equation 1. First, the expectation for a given frequency is linear under simultaneous scaling of all effective population sizes and epoch durations (i.e., as long as $T_m$ and $N_m$ are multiplied by the same constant for each m), hence the relative frequency spectrum remains unchanged. This fact can be exploited to reduce the number of parameters that characterizes a given demographic model under consideration. Second, the expected number of mutations of a given size for more than one nucleotide site is simply the sum of the individual expectations, without regard to any possible correlation among the site genealogy of proximal sites. Therefore, our results for the expected number of segregating sites as well as the allele frequency spectrum are also valid for polymorphisms at a single locus of arbitrary sequence length, without regard to possible recombination within the locus, or for polymorphisms collected from throughout the genome. This latter consideration allows us to apply the theoretical expectations derived here for the data set examined, without regard to the amount and structure of linkage between the sites represented within the set. Third, the allele frequency spectrum is independent of the actual value of the per-nucleotide, per-generation mutation rate, as long as this rate is uniform for every site considered.

Minor allele frequency spectrum (folded spectrum): In situations where allele frequency is determined experimentally by counting the two alternative alleles within a sample of n chromosomes, it is uncertain which of the two alleles is the mutant allele. In such situations, instead of the true frequency, we work with the frequency of the less frequent (or minor) allele (Fu 1995). The distribution of minor allele frequency is described by the folded spectrum defined as

$$
\tilde{P}_n(i) = P_n(i) + P_n(n - i), \quad i:\ i \le \frac{n}{2}. \tag{3}
$$

By this definition, if n is even, $\tilde{P}_n(n/2) = 2P_n(n/2)$, i.e., twice the value we would expect to measure, leading to a "doubling effect." This fact needs to be taken into account during the interpretation of measured data. Because in many data sets available for analysis the ancestral allelic state is currently unknown, the folded spectrum is important in practice.

Numerical calculation of the allele frequency spectrum: Frequency spectrum calculations were implemented in the C programming language. Some care must be taken when calculating the expected spectrum, because computing Equation 1 requires the evaluation of alternating sums, a source of numeric instability when the individual terms are close in value. Instability can be avoided by accurate calculation of each term. The higher the sample size, the more accurately each term has to be evaluated. We do not have a systematic way to predict the accuracy requirement as a function of sample size, hence we determined the accuracy requirement for a given sample size by trial and error. In our implementation, we have used high-accuracy numeric libraries with settable numeric precision. Our experience has been that, up to a sample size n = 100, a numeric precision of 100 decimal places was sufficient for our calculations. Evaluation of the allele frequency spectrum for a sample size of 1000 required a numerical precision of ~500 decimal places.

Correcting ascertainment bias: To describe the situation where polymorphic sites discovered in a set of samples are genotyped in a second, independently drawn set of samples for frequency characterization we divide the two independent groups of samples into a "discovery" group consisting of k samples and a "genotyping" group consisting of n samples. The discovery process is modeled by considering only those sites within the n + k samples that are polymorphic (i.e., are of size between 1 and k − 1) within the discovery group of depth k and discarding those sites that are monomorphic in this group, as these sites would not be considered for subsequent genotyping. The conditional probability, $P_{n|k}(i)$, that a site is of size i within the n genotyping samples given that it is polymorphic in the k discovery samples is:

$$
\begin{aligned}
P_{n|k}(i) &= \Pr(\text{size } i \text{ in } n \text{ samples} \mid \text{size between } 1 \text{ and } k-1 \text{ in } k \text{ samples}) \\
&= \frac{\Pr(\text{size } i \text{ in } n \text{ samples AND size between } 1 \text{ and } k-1 \text{ in } k \text{ samples})}{\Pr(\text{size between } 1 \text{ and } k-1 \text{ in } k \text{ samples})} \\
&= \frac{\sum_{l=1}^{k-1} \Pr(\text{size } i+l \text{ in } n+k \text{ samples AND size } l \text{ in } k \text{ samples})}{\Pr(\text{size between } 1 \text{ and } k-1 \text{ in } k \text{ samples})} \\
&= \frac{\sum_{l=1}^{k-1} \Pr(\text{size } l \text{ in } k \text{ samples} \mid \text{size } l+i \text{ in } n+k \text{ samples}) \cdot \Pr(\text{size } l+i \text{ in } n+k \text{ samples})}{\Pr(\text{size between } 1 \text{ and } k-1 \text{ in } k \text{ samples})} \\
&= \frac{1}{\sum_{l=1}^{k-1} P^{\text{full}}_k(l)} \sum_{l=1}^{k-1} \frac{\binom{k}{l}\binom{n}{i}}{\binom{n+k}{l+i}} P^{\text{full}}_{n+k}(i+l) \\
&= \frac{\sum_{l=1}^{n+k-1} P^{\text{full}}_{n+k}(l)}{\sum_{l=1}^{k-1} P^{\text{full}}_k(l)} \sum_{l=1}^{k-1} \frac{\binom{k}{l}\binom{n}{i}}{\binom{n+k}{l+i}} P_{n+k}(i+l) \\
&= C \sum_{l=1}^{k-1} \frac{\binom{k}{l}\binom{n}{i}}{\binom{n+k}{l+i}} P_{n+k}(i+l).
\end{aligned} \tag{4}
$$

It is possible that a site that appears polymorphic within the k discovery samples is monomorphic within the n genotyping samples. As a result, the conditional probabilities $P_{n|k}(0)$ and $P_{n|k}(n)$ are typically nonzero, and one has to renormalize after the transformation to get the AFS. It is easy to verify that Equation 4 is also valid for calculating the folded conditional spectrum $\tilde{P}_{n|k}(i)$, as defined in Equation 3, provided that both folded spectra $\tilde{P}_k(i)$ and $\tilde{P}_{n+k}(i)$ are available. This property makes it possible to account for the ascertainment bias when only the folded allele frequency distributions are available. For the sake of completeness, we include the conditional spectrum for the important special case, k = 2, i.e., ascertainment within a pair of chromosomes:

$$
P_{n|2}(i) = \frac{2\sum_{k=1}^{n+1} P^{\text{full}}_{n+2}(k)}{P^{\text{full}}_2(1)} \cdot \frac{(i+1)(n+1-i)}{(n+1)(n+2)} \cdot P_{n+2}(i+1) = C\,(i+1)(n+1-i)\,P_{n+2}(i+1). \tag{5}
$$

It is easy to show that under a stationary history the spectrum is a linear function of i, and the folded spectrum is constant (Figure 2a).

We point out that our method of ascertainment bias correction improves on an earlier method based on using the measured discrete allele frequency as an estimator for the overall allele frequency within the population (Sherry et al. 1997; see supplemental Figure S2 at http://www.genetics.org/supplemental/).

Reduction of allele frequency counts to equivalent counts at a lower sample size: Often allele frequency data are the result of genotyping a target number, $n_t$, of individuals at a collection of polymorphic sites. Because of genotyping failures, however, the actual number of genotypes available at different locations is smaller and often varies from site to site. At sites where an identical number, n, of successfully determined chromosomal allelic states are available we denote the distribution of allele counts by $C_n(i)$ and the corresponding probability distribution obtained by normalizing these counts by $P_n(i)$. Sites with different numbers of successful genotypes are not directly comparable. To enable joint analysis of allele counts observed at all sites genotyped in the experiment, we have devised a procedure that, given an observed distribution of allele frequencies among samples, produces an equivalent distribution at a lower sample size, m. This is achieved by, first, considering all possible choices of m subsamples selected from the total n available samples, in such a way that each choice is equally likely and, second, requiring that the total number of observations remains the same. Under these assumptions, the "equivalent" allele counts, $C_m(i)$, for m subsamples are

$$
C_m(i) = E(C_m(i)) = \sum_{j=i}^{n-m+i} \frac{\binom{m}{i}\binom{n-m}{j-i}}{\binom{n}{j}} C_n(j), \quad i = 0, \ldots, m, \tag{6}
$$

number of relative counts as compared to the original observations. To obtain the AFS, one omits sizes 0 and m in Equation 7 and renormalizes. It is easy to verify that the equivalence reduction also works for the folded allele frequency distribution.

We point out that our reduction procedure is not equivalent to frequency binning, a procedure sometimes employed to compare allele counts available at different samples sizes. Aggregating discrete allele frequency data on the basis of a nominal allele frequency c/n, the ratio of allele counts and the sample size, results in data distortion stemming from two sources. First, for a given sample, the inherent base frequency is $f_n = n^{-1}$. In general, only window sizes that are integer multiples of $f_n$ will preserve the uniform appropriation of allele sizes into frequency bins. This may be impossible if multiple sample sizes are present in the data. Second, sites with identical nominal allele frequencies but different sample sizes are not equivalent; e.g., a site with a minor allele count of 1 in 3 samples is clearly not equivalent to a site with a minor allele count of 10 in 30 samples. Distortions from both sources are most pronounced at lower sample sizes. Our equivalence reduction procedure is a technique of data aggregation that is free of such distortions. This point is further illustrated in supplemental Figure S3 at http://www.genetics.org/supplemental/, where we compared the AFS resulting from simple binning of all available data for the European samples to the AFS we obtain by the equivalence data reduction procedure presented here.

Coalescent simulations and tabulation of linkage disequilibrium: We used coalescent simulations to verify the accuracy of our allele frequency spectrum calculations (supplemental Figure S1), to tabulate measures of linkage disequilibrium, and to tabulate distributions of mutation age. To perform these simulations, we have implemented a widely used, direct coalescent algorithm (Hudson 1991). The simulation software was first implemented in Perl for rapid coding and error checking and then reimplemented in C++ for increased computational speed. To verify the direct formula, we have run coalescent simulations under a variety of population history scenarios, tabulated the allele frequency spectra,

m nϪm and compared them to the computed predictions. To nϪmϩi ΂ ΃΂ Ϫ ΃ i j i verify the conditional spectrum calculations, we have simu- ϭ ͚ full ϭ Pm(i) ΂n΃ P n (j), i 0,...,m. (7) ϩ jϭi j lated n k chromosomes within a common genealogy, designated k samples as the discovery group, and n sam- Note that this procedure does not allow one to gener- ples as the genotyping, or frequency measurement, ate a higher sample size distribution on the basis of a group. Of all the sites that were polymorphic within lower sample size distribution. Also note that, even if the n ϩ k samples, we discarded those sites that were the higher sample size distribution was a relative allele monomorphic within the k discovery samples and kept frequency spectrum, the resulting lower sample size dis- the remaining sites. We then tabulated the allele fre- tribution will contain nonzero terms for size 0 and for quency counts at these sites among the n genotyping size m. Clearly, the first case is the result of the possibility samples. that the omission of n Ϫ m chromosomes left us with 0 Expectations for the extent of linkage disequilibrium mutant alleles, and the second is that only mutant alleles were generated according to a previously published remained. This results in a slight reduction of the total method (Kruglyak 1999). For each population, we Demographic Inference From SNP Data 355 used the best-fitting three-epoch model for the coales- in the past) parameter at 10,000, for each model class. cent simulations, with samples size n ϭ 100. Marker We have generated the unbiased allele frequency spec- allele frequencies were restricted to the range between tra by direct calculation using Equation 1, for a sample 0.25n and 0.75n. For each value of recombination frac- size of m ϩ 2, where m ϭ 41 is the (common) sample tion, we tabulated r2, a commonly used measure of link- size after data reduction, and k ϭ 2 is the discovery age disequilibrium defined as size. 
We then computed the conditional spectrum using Equation 4. Finally, we folded the spectrum using the (p Ϫ p · p )2 r 2 ϭ AB A B , (8) definition given in Equation 3. To quantify the degree pA · pa · pB · pb of fit between a given model and the observations we have used the likelihood of the observed data condi- where A and a denote the mutant and the ancestral tioned on the model: alleles at the first marker location, and B and b are the Ϫ alternative alleles at the second marker location. The c m 1 ϭ ci P(data|model) ΂ ΃ ͟ pi . (9) quantities pA, pa, pB, and pb are the corresponding allele c1,...,cmϪ1 iϭ1 frequency measurements, and pAB is the measured fre- quency of the haplotype defined by the combination of For generating the likelihood surface for the Euro- ␹2 allele A at the first marker position and B at the second pean bottleneck size vs. duration we used the metric marker position. Finally, marker age was tabulated by defined as registering the time of occurrence for each of the muta- mϪ1 Ϫ 2 (ci c · pi) tions during the simulations. ␹2 ϭ ͚ . (10) ϭ c · p Model fitting to observed allele frequency spectra: The i 1 i primary objective of the fitting experiments is to deter- In the above notations, ci is the observed number of mine the distribution of the posterior probability of the sites of size i, c is the number of total sites, pi is the predicted model parameters given the observed data: P(model| (relative) probability of size i, and m is the common sample data). With the help of our closed formula for the direct size to which all observations were reduced using the equiv- calculation of the AFS we were able to generate the alence data reduction procedure outlined earlier. expected AFS for a complete, high-resolution, multidi- Comparison between models with different epoch num- mensional grid overlaid on the parameter space that bers: Models within the same structure (same epoch num- we intended to explore. 
This direct approach yielded ber) could be directly compared on the basis of any of the likelihood distribution, P(data|model), computed the three goodness-of-fit metrics discussed above. Models at each grid point. Given that there is no sensible way with different numbers of epochs were compared using to assign an “informed” prior distribution to the model methods of normal hypothesis testing for nested models parameters, the distribution of the likelihood function (Ott 1991), on the basis of the likelihood of the data is equivalent to the posterior distribution and can be given each of the two models compared. The quantity ␭ ϭ used in ranking competing parameters. We point out 2 ln( ) 2 ln(P(data|model1)/P(data|model2)) is as- that an alternative method of achieving the same goal ymptotically ␹2 distributed, with degrees of freedom is to use a Markov-chain Monte Carlo (MCMC) tech- equal to the difference in the number of parameters nique to obtain the posterior distribution (Griffiths characterizing the models (i.e., adding one extra epoch and Tavare 1994a; Kuhner et al. 1995). We opted for increases the number of parameters by two). The larger the direct method because it was simple but computa- this quantity, the more significant the improvement that tionally feasible, by its nature avoided the convergence was achieved by the introduction of the extra epoch. If issues usually associated with MCMC, and allowed us to the quantity is small, the improvement in data fit does evaluate the likelihood function at every grid point, for not warrant the introduction of the extra parameters. each of the three population-specific AFS analyzed. Stepwise constant models of one, two, and three ep- ochs were considered. 
For each model class defined by RESULTS the number of epochs, a vector of parameters describing Modeling allele frequency: We considered a diploid the model was considered, including the effective popu- population whose demographic history was described lation size and the duration of the epoch (expressed in by a series of epochs such that the effective population terms of generations). We have sampled each effective size was stepwise constant within each epoch (e.g., Figure size parameter, Ni, between 1000 and 150,000 in steps 1) and showed that the expected number of samples of 1000 up to 30,000 and in steps of 5000 beyond 30,000, carrying a mutant allele can be described by a closed, and each epoch duration parameter, Ti, between 100 easily computable mathematical formulation (see and 50,000 in steps of 100 up to 10,000 and in steps of methods). We derived a method for incorporating the 500 beyond 10,000. Because of the scaling equivalence same frequency ascertainment bias into AFS models that of the relative distribution discussed earlier, we fixed was introduced into real data by the sampling strategies the ancestral size (the effective size of the epoch farthest used during SNP discovery and for revealing the strate- 356 G. T. Marth et al.

the attempted sample sizes are different. In such cases one selects a target sample size and applies the reduc- tion procedure to transform allele counts observed at higher sample sizes to the equivalent counts at this lower target sample size. It is then possible to fit the resulting single AFS containing the contribution of all available data instead of fitting multiple, often sparse spectra, one for each sample size present in the data. Minor allele frequency spectra observed in samples representing different world populations show differen- tial demographic histories: The SNP Consortium (http:// snp.cshl.org), an organization formed primarily for the discovery of a large set of human SNPs, has made well Figure 1.—Example of a three-epoch, piecewise constant, over 1 million polymorphic sites available in the public bottleneck-shaped population history profile. The ancestral domain (Sachidanandam et al. 2001). Most of these effective population size (N ) is followed by an instant reduc- 3 SNPs were discovered by comparing sequencing read frag- tion of effective size (N2). The duration of this epoch is T2 generations. This is followed by a stepwise increase of effective ments from multi-ethnic, anonymous, whole-genome population size to N1, T1 generations before the present. shotgun subclone libraries to the public genome refer- ence sequence (Sachidanandam et al. 2001); i.e., the vast majority of the SNPs were found in a discovery size gies’s consequent effect on SNP population frequency of two chromosomes (k ϭ 2). Quasi-random subsets of (methods). We illustrate the effect of this bias under these candidate sites were then selected for frequency different values of ascertainment sample size (Figure characterization in samples representing European- 2a). 
As expected, the bias toward sample enrichment American, African-American, and East Asian populations for common polymorphisms is strongest when SNPs are (for sample identifiers see http://snp.cshl.org/allele_ discovered in a pair of chromosomes, and it gradually frequency_project/panels.shtml). In this study, we disappears as discovery sample size increases. Under a chose the largest data set of allele frequency counts stationary population history, the folded spectrum un- resulting from genotypes provided by Orchid Biosci- der ascertainment in two chromosomes is a constant ences, of 42 individuals (84 chromosomes) drawn from function of frequency (methods), and deviations from each of the three populations (http://snp.cshl.org/ a horizontal line signal a nonstationary history that is allele_frequency_project/). Experimental results were easy to detect and interpret. In Figure 2b, we contrast reported for 33,538 sites. For a significant fraction of the ascertainment bias-corrected, minor allele fre- the sites genotyping was unsuccessful for one or more quency spectra for notable, competing scenarios of de- of the populations attempted. In some other cases, al- mographic history. When a population expands, an in- though genotyping was successful, all samples carried creasing number of chromosomes simultaneously incur the same allele and hence the site could not be con- new mutations, which results in an overabundance of firmed as polymorphic. For the purpose of our study, rare alleles in the spectrum. Conversely, a population we restricted our attention to those sites where (1) geno- collapse is a rapid loss of chromosomes, and the alleles typing from each of the three sample groups was success- present at high frequency are more likely to be carried ful (genotyping for a given population was considered by surviving chromosomes than are their rare counter- successful if genotype data were obtained for at least parts. 
For that reason a collapse generates an overrepre- half the population samples, i.e., 21 individuals, even sentation of common alleles. Finally, AFS under a bottle- if only one of the alternative alleles was seen in that neck history (a reduction of effective size followed by population) and (2) the site was polymorphic within at a phase of recovery) carries the signature of both the least one of the three population samples. Of the total phase of collapse (a valley at intermediate frequencies) 21,407 sites that were successfully genotyped in all three and that of growth (elevated signal at low frequencies). populations the European samples were polymorphic We report a procedure to transform allele counts at 18,660 sites, the African samples at 20,587 sites, and at a given sample size to a lower, target sample size the Asian samples at 17,369 sites. At a given site, the (methods). Using this equivalence sample size reduction total number of alleles counted varied between 42 (the procedure, allele count observations at all sites can be minimum number possible, in case only 21 diploid indi- reduced to the equivalent counts at a lower, “common viduals were successfully genotyped within a popula- denominator” sample size, as illustrated in Figure 3. tion) and 84, the maximum possible if all 42 individuals This procedure is useful for analyzing allele counts at within a population sample were successfully genotyped. sites where the number of available genotypes is variable To use all the data available, we have applied our equiva- either because a fraction of attempted genotyping ex- lence sample size reduction procedure (methods)to periments failed or when merging data sets in which convert the allele count data to a common denominator Demographic Inference From SNP Data 357

Figure 2. Ascertainment bias. (a) Folded spectra under stationary history, at various values of “discovery sample” size $k$ (methods). (b) Allele frequency spectra predicted under competing scenarios of population history (conditioned on pairwise ascertainment, $k = 2$). Equilibrium history, $N_1 = 10,000$; expansion, $N_1 = 20,000$, $T_1 = 3000$, $N_2 = 10,000$; collapse, $N_1 = 2000$, $T_1 = 500$, $N_2 = 10,000$; bottleneck history, $N_1 = 20,000$, $T_1 = 3000$, $N_2 = 2000$, $T_2 = 500$, $N_3 = 10,000$. (a and b) Sample size $n = 41$.
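The flat folded spectrum in Figure 2a for a discovery size of two can be checked with a few lines of exact arithmetic. The sketch below is an illustration only, not the authors' C implementation: it assumes the standard neutral expectation that the unascertained spectrum at sample size n + 2 is proportional to 1/j under a stationary history, and it applies the pairwise-ascertainment weight (i + 1)(n + 1 - i) described in methods. The function names are hypothetical.

```python
from fractions import Fraction


def pairwise_ascertained_spectrum(n):
    """Spectrum at sizes i = 1..n-1, conditioned on discovery in k = 2 chromosomes.

    Assumption (not from the source code): stationary history, so the
    unascertained spectrum at sample size n + 2 is proportional to 1/j.
    """
    raw = {}
    for i in range(1, n):  # sizes 0 and n are monomorphic and are omitted
        unascertained = Fraction(1, i + 1)  # proportional to P_{n+2}(i + 1)
        raw[i] = (i + 1) * (n + 1 - i) * unascertained
    total = sum(raw.values())
    return {i: w / total for i, w in raw.items()}  # renormalize to the AFS


def fold(spectrum, n):
    """Fold an unfolded spectrum: add size i and size n - i (n assumed odd)."""
    return {i: spectrum[i] + spectrum[n - i] for i in range(1, (n - 1) // 2 + 1)}


n = 41
unfolded = pairwise_ascertained_spectrum(n)
folded = fold(unfolded, n)
# The unfolded spectrum falls linearly in i; every folded class has equal mass.
print(len(folded), len(set(folded.values())))
```

Using `Fraction` keeps the comparison exact, so the set of folded values collapses to a single element rather than to several nearly equal floats.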

sample size. Because the identity of the ancestral and the mutant allele was not known, we used the allele counts of the less frequent (or minor) allele, giving rise to a folded spectrum (methods). To avoid the “doubling” effect associated with folding the allele frequency spectrum when the sample size is an even number, as described in methods and in particular by Equation 3, we chose the common denominator sample size as $m = 41$, i.e., the first odd number below the (even) sample size 42. The unfolded spectrum hence lies between 1 and 40 (sizes 0 and 41 indicate monomorphisms). Accordingly, the folded spectrum lies between minor allele sizes 1 and 20, for each of the three population-specific sample groups (Figure 4, first column). The allele frequency data used in our analysis are available through our web site: www.ncbi.nlm.nih.gov/IEB/Research/GVWG/AFS-2003/.

To assess the signals of population history within these observed distributions, we generated allele frequency spectra as predicted under competing scenarios of population history of varying complexity: stationary history (one epoch), expansion or collapse (two epoch), and all possible shapes of three-epoch histories (methods). For a given set of model parameters, we generated the corresponding theoretically predicted, ascertainment bias-corrected minor allele frequency spectrum and evaluated the degree of fit between the prediction and the observations (methods). For each population-specific data set and for each model structure (number of epochs), we determined the best-fitting model parame-

Figure 3. Sample size reduction. Folded, normalized allele frequency distribution for each sample size ($n = 42, \ldots, 84$) present in the European allele count data (gray) is shown. The allele frequency spectra obtained using the equivalence sample size reduction technique (methods) are also shown for various equivalence sample sizes ($m = 21$, 31, and 41; green).
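The equivalence reduction shown in Figure 3 amounts to averaging over all equally likely subsamples of m chromosomes, which is a hypergeometric reweighting of the observed counts. The following is a minimal sketch under that reading; the function name and the list layout (index j holds the number of sites with j mutant alleles) are assumptions made for illustration. The weight is written in the usual hypergeometric form, which is algebraically equal to the ratio of binomial coefficients used in methods.

```python
from math import comb


def reduce_counts(counts_n, m):
    """Reduce allele counts at sample size n to expected counts at size m <= n.

    counts_n[j] is the number of sites with j mutant alleles among n
    chromosomes (j = 0..n). Each site is spread over sizes i = 0..m, so the
    total number of sites is preserved.
    """
    n = len(counts_n) - 1
    if not 0 < m <= n:
        raise ValueError("target sample size must satisfy 0 < m <= n")
    reduced = [0.0] * (m + 1)
    for j, c in enumerate(counts_n):
        if c == 0:
            continue
        # i mutants can survive only if enough mutant and ancestral alleles exist
        for i in range(max(0, m - (n - j)), min(m, j) + 1):
            # P(i mutants in a random subsample of m | j mutants among n)
            reduced[i] += c * comb(j, i) * comb(n - j, m - i) / comb(n, m)
    return reduced


# Three sites, each with 1 mutant allele among 3 chromosomes, reduced to m = 2
print(reduce_counts([0, 3, 0, 0], 2))
```

To pool sites genotyped at different sample sizes, one would call the function once per observed sample size with the same target m and add the resulting lists element by element; sizes 0 and m are then dropped and the remainder renormalized to obtain the AFS.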

ters and the corresponding measures of goodness of fit. By definition of the likelihood function used for data fitting, the best-fitting model parameters are the maximum-likelihood parameter estimates for that model class (Table 1).

The normalized observed allele frequency distributions for each population group and the corresponding best-performing distributions within each model class are shown in Figure 4. In all three population-specific spectra, stationary history is a poor descriptor of the data, both by visual inspection and by examination of the fit values in Table 1. The best-fitting two-epoch model for all three spectra is that of expansion (Table 1). In the European (Figure 4a) and in the Asian (Figure 4b) samples the best-fitting three-epoch model is one of a bottleneck-shaped history. In the European data, the curve fit produced by the bottleneck profile is a very significant improvement over that produced by histories of expansion. In the Asian data, the improvement is still significant but to a lesser degree. The best-fitting three-epoch models in African-American data (Figure 4c) represent a two-step population increase of moderate size.

In addition to the best-fitting models, a range of parameter values produced comparably good fit to the observations. We have examined parameter sets that produced likelihood values that were at least 90% of the value obtained for the best-fitting three-epoch parameter set. Analysis of these “close to optimal” parameter values in the European data shows that both the size ($N$, effective number of individuals) and duration ($T$, generations) of the recovery phase were within a narrow range ($N_1$ = 19,000–21,000, $T_1$ = 2700–3000). Parameters of the bottleneck phase were in a wider range ($N_2$ = 1000–4000 and $T_2$ = 200–1300), with several alternative pairs available: longer but less severe bottlenecks or shorter, more severe bottlenecks. Given the potential interest in a possible bottleneck in the history of European populations, we further investigated the strength of the bottleneck signal by fixing the recovery size and duration parameters ($N_1$ = 20,000, $T_1$ = 3000) and varying the bottleneck size $N_2$ and duration $T_2$ in fine increments (20). For each parameter combination, we evaluated the goodness of fit to the European spectrum as measured by the $\chi^2$ statistic and reported the resulting probability surface in Figure 5. The best-fitting parameter combinations (ones not rejected by the $\chi^2$ test even at the 99.8% level) lie on a slightly curved line between the following pairs: effective size of 1040 during the bottleneck for 240 generations and effective size 2320 for 560 generations. The most likely model, at this resolution, is a bottleneck effective size of 1560 for 360 generations. These values and the ratio of effective population size and bottleneck duration being nearly constant in a large region are in good agreement with previous reports (Reich et al. 2001). In the Asian data (Figure 4b), all parameters including those characterizing the bottleneck phase were within a tight range: $N_2$ = 3000–5000, $T_2$ = 600–1000, $N_1$ = 24,000–26,000, and $T_1$ = 3000–

Figure 4. Model fitting to folded AFS observed in population-specific genotype data reduced to common sample size, $m = 41$. (a) European spectrum. (b) Asian spectrum. (c) African-American spectrum. First column, observed allele frequency spectrum (black), best-fitting three-epoch theoretical model prediction (green), and prediction under stationary effective size (red); second column, breakdown of mutations according to age within each frequency class of the best-fitting model spectra [color bands correspond to a range of 1000 generations (e.g., black band, 1–1000 generations; red band, 1001–2000 generations)]; third column, distribution of mutation times (generations in the past) at each frequency, based on 1 million simulation replicates. Notched box: 25%, median, 75%. Whiskers: min/max values. Open square: mean value. Open circle: 5%, 95% values.

3200. Similarly narrow ranges were observed for the African-American data (Figure 4c): $N_2$ = 16,000, $T_2$ = 13,000–15,000, $N_1$ = 26,000–30,000, and $T_1$ = 2000–2600.

DISCUSSION

Significance of the allele frequency analysis methods presented here: Equation 1 (methods) provides a simple and rapid way to generate expected distributions of allele frequency under stepwise constant models of effective population size history. This procedure is orders of magnitude faster than tabulating simulation replicates, especially for large sample sizes, permitting fast generation of model spectra to explore large parameter spaces at high resolution. The method of ascertainment bias calculation we have presented permits the interpretation of allele frequency spectra measured at polymor-

TABLE 1

Results of fitting multi-epoch models of allele frequency spectrum to population-specific observed allele frequency data

| Model structure | Model parameters | Resulting pairwise $\theta$ (units of $10^{-4}$) | $\ln P(\mathrm{data} \mid \mathrm{model})$ | Improvement over lower-epoch model |
|---|---|---|---|---|
| a. European data | | | | |
| One epoch | $N_1$ = 10,000 | 8.00 | −55.98 | n/a |
| Two epoch | $N_2$ = 10,000; $N_1$ = 140,000 ($T_1$ = 2,000) | 8.74 | −38.11 | $2\ln\lambda$ = 35.74; $P < 10^{-4}$; highly significant |
| Three epoch | $N_3$ = 10,000; $N_2$ = 2,000 ($T_2$ = 500); $N_1$ = 20,000 ($T_1$ = 3,000) | 7.88 | −23.72 | $2\ln\lambda$ = 28.78; $P < 10^{-4}$; highly significant |
| b. Asian data | | | | |
| One epoch | $N_1$ = 10,000 | 8.00 | −74.26 | n/a |
| Two epoch | $N_2$ = 10,000; $N_1$ = 50,000 ($T_1$ = 2,000) | 8.63 | −31.95 | $2\ln\lambda$ = 84.62; $P < 10^{-4}$; highly significant |
| Three epoch | $N_3$ = 10,000; $N_2$ = 3,000 ($T_2$ = 600); $N_1$ = 25,000 ($T_1$ = 3,200) | 8.24 | −26.39 | $2\ln\lambda$ = 11.12; $P$ = 0.0039; significant |
| c. African-American data | | | | |
| One epoch | $N_1$ = 10,000 | 8.00 | −197.86 | n/a |
| Two epoch | $N_2$ = 10,000; $N_1$ = 18,000 ($T_1$ = 7,500) | 9.20 | −28.69 | $2\ln\lambda$ = 338.34; $P < 10^{-4}$; highly significant |
| Three epoch | $N_3$ = 10,000; $N_2$ = 16,000 ($T_2$ = 15,000); $N_1$ = 26,000 ($T_1$ = 2,400) | 10.29 | −26.72 | $2\ln\lambda$ = 3.94; $P$ = 0.1395; not significant |
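The last column of Table 1 follows from the log-likelihood column by the nested-model test described in methods. As a hedged illustration (the function name is hypothetical, and this is not the authors' code): adding one epoch adds two parameters, and for two degrees of freedom the chi-square tail probability has the closed form exp(-x/2), so the check needs no statistics library.

```python
from math import exp


def likelihood_ratio_test(lnl_simple, lnl_complex):
    """Return (2 ln lambda, P value) for adding one epoch (two extra parameters).

    Uses the closed-form chi-square tail for 2 degrees of freedom: exp(-x / 2).
    """
    stat = 2.0 * (lnl_complex - lnl_simple)
    return stat, exp(-stat / 2.0)


# ln P(data|model) for the two-epoch and three-epoch fits, copied from Table 1
table1 = {
    "European": (-38.11, -23.72),
    "Asian": (-31.95, -26.39),
    "African-American": (-28.69, -26.72),
}
for population, (two_epoch, three_epoch) in table1.items():
    stat, p = likelihood_ratio_test(two_epoch, three_epoch)
    print(f"{population}: 2 ln(lambda) = {stat:.2f}, P = {p:.4f}")
```

Run on the tabulated values, this reproduces the 2 ln λ entries and, to within the rounding of the published log-likelihoods, the P values for the three-epoch rows.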

phic sites selected from existing variation resources. Our procedure of equivalence sample size reduction enables the analysis of realistic data sets with genotyping failures. All three of the above procedures are firmly rooted within the coalescent framework. Model calculations directly correspond to experimentally observable quantities, without referencing directly unobservable quantities such as the overall population frequency of alleles. The data-fitting methodology is conceptually simple and allows direct comparison of the degree of fit between each of the three population samples examined, at each grid point (parameter combination).

Differential population histories in the three sample sets: On the basis of the goodness of fit between models and observations (Table 1), a history of stationary population size can be confidently rejected for all three sets of samples. Introduction of even very simple dynamics into the history has dramatically improved data fit. There were large differences among the allele frequency spectra observed in the three populations (Figure 4 and Table 1). Clearly, the shapes of the European and the Asian spectra are closer to each other than either is to the shapes of the African spectra. On the basis of the three-epoch models, both the European and the Asian data are best explained by bottleneck-shaped histories, whereas the best-fitting third-order model for the African-American data is a continued expansion. The results of hierarchical model testing (methods) in Table 1 show that the inclusion of the third epoch did not significantly improve the fit to the African-American data. However, the bottleneck history is a dramatic improvement over the best-fitting two-epoch growth models in both the European and Asian data. Considering the range of models that produced close to optimal fit values, but using a fixed, 20-year generation time, the European bottleneck represented a 2.5- to 10-fold decline in population size, lasting 200–1300 generations [4–26 thousand years (KY)]. This was followed by a phase of 5- to 20-fold population expansion, starting 2700–4300 generations (54–86 KY) ago. The Asian bottleneck rep-

Figure 5. Bottleneck size and duration in the European samples. The probability surface of the effective size and the duration of a bottleneck are shown. Size of the ancestral epoch is fixed at $N_3 = 10,000$, size of the present epoch is fixed at 20,000, and the duration of the present epoch is fixed at $T_1 = 3000$. Parameter regions indicated by shading fall into the same bin of significance. Note that the P values indicated are the direct $\chi^2$ probabilities (i.e., 1 minus the tail probability).
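Each point on a surface like the one in Figure 5 is a goodness-of-fit value computed from observed site counts and a model spectrum. The sketch below shows the two metrics defined in methods, the multinomial likelihood (in log form) and the chi-square statistic. It is an assumption-laden illustration rather than the authors' implementation: the function names are hypothetical, and `lgamma` is used so that the non-integer counts produced by sample size reduction are also accepted.

```python
from math import lgamma, log


def log_likelihood(counts, probs):
    """Multinomial log-likelihood of observed site counts given model probabilities.

    counts[i] and probs[i] refer to the same frequency class; probs sum to 1.
    """
    total = sum(counts)
    # log of the multinomial coefficient c! / (c_1! ... c_{m-1}!)
    value = lgamma(total + 1) - sum(lgamma(c + 1) for c in counts)
    # log of the product of p_i ** c_i, skipping empty classes
    value += sum(c * log(p) for c, p in zip(counts, probs) if c > 0)
    return value


def chi_square(counts, probs):
    """Pearson chi-square between observed counts and expected counts c * p_i."""
    total = sum(counts)
    return sum((c - total * p) ** 2 / (total * p) for c, p in zip(counts, probs))


# Toy example with three frequency classes (values are made up for illustration)
observed = [60, 20, 20]
model = [0.5, 0.3, 0.2]
print(chi_square(observed, model), log_likelihood(observed, model))
```

Scanning a parameter grid then reduces to generating one model spectrum per grid point and recording either value, with the maximum log-likelihood (or minimum chi-square) marking the best-fitting combination.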

resented a 2- to 3-fold decline for 600–1000 generations (12–20 KY), followed by 5- to 8-fold growth starting 3000–4200 generations (60–84 KY) ago. The best-fitting models for the African-American data represent uninterrupted growth of effective population size, with the expansion clearly starting earlier than is evident in our European or the Asian data.

Earlier mitochondrial and microsatellite studies report data that are predominantly consistent with expansion-type histories of effective population size. The main evidence that points to expansion is negative values of Tajima’s D and an excess of low-frequency alleles. The start of such expansion is estimated between 30 and 130 KYA (Harpending and Rogers 2000). Nuclear data, especially in samples of non-African origin, seem to show a different pattern, an excess of common variants (Hey 1997; Clark et al. 1998; Reich et al. 2001, 2002). Simulation results have suggested that a bottleneck-shaped history of effective population size consisting of a phase of collapse followed by a recent phase of size recovery can reconcile this seeming contradiction between observations from different mutation systems (Fay and Wu 1999; Hey and Harris 1999). These studies characterize bottleneck-shaped histories by a size expansion ratio (in our notation $N_1/N_2$) and a bottleneck severity index (in our notation $T_2/N_2$) and consider moderate bottlenecks where the expansion ratio is 20 and the severity index is in the range of 0.25 and 4.0. Our own estimates (expansion ratio 5–20 for Europeans, 5–8 for Asians, and severity index of ∼0.2 for both populations) are in general agreement with these values and signify bottlenecks on the less severe end of the spectrum. Our estimates for the start of the recovery phase (54–86 KYA for Europeans, 60–84 KYA for Asians) are well within the range of the mitochondrial and microsatellite estimates. The fact that our best-fitting two-epoch models indicate expansion-type histories for all three populations we examined is also consistent with conclusions from mitochondrial and microsatellite data. A valuable reality check of an inferred demographic model is its implied pairwise nucleotide diversity value, $\theta$. Although our data-fitting analysis of the relative spectrum does not provide absolute estimates for $\theta$, these values can be obtained on the basis of the best-fitting models by fixing the ancestral size $N_3$ and mutation rate $\mu$. For each of the three populations, we use a common ancestral effective size of 10,000 and common mutation rate of $2 \times 10^{-8}$ [a value that lies between recent, prominent estimates for average per-nucleotide, per-generation human mutation rate (Nachman and Crowell 2000; Kondrashov 2003)]. This leads to an estimate of $\theta = 7.88 \times 10^{-4}$ for the European model, in good agreement with previously reported values for other genome-wide data sets (Sachidanandam et al. 2001; Venter et al. 2001; Marth et al. 2003). The prediction from the Asian data is slightly higher, $8.24 \times 10^{-4}$. The pairwise $\theta$ predicted by the best-fitting model for the African-American data is $10.29 \times 10^{-4}$, significantly higher than that observed within the European and Asian samples, and in agreement with the general consensus that nucleotide diversity is higher in sub-Saharan samples than in non-African data (Relethford and Jorde 1999; Przeworski et al. 2000; Jorde et al. 2001; Tishkoff and Williams 2002). All three estimates are well within realistic values, lending further credence to the validity of our model parameters.

A bottleneck-shaped history was also our best-fitting three-epoch model structure for MD distributions observed in overlap fragments of public genome clone data (Marth et al. 2003). However, the parameter estimates are significantly different between these two studies. Our estimates from MD data indicated a less severe bottleneck of nearly identical duration and a shorter phase of recovery of more modest size as compared to the AFS in the European samples. Multiple factors may contribute to these differences. First, the DNA samples for the two studies came from different donors. Second, some fraction of the large-insert clones sequenced for the construction of the public genome reference sequence originate from libraries that are not of European origin [although there appears to be an overrepresentation of European sequences (Weber et al. 2002), presumably due to the origin of a single bacterial artificial chromosome library with the largest contribution]. If indeed an appreciable fraction of the data represents sub-Saharan DNA, the resultant MD in these mixed data could indicate a less severe bottleneck than would have been evident in a distribution containing only European data.

To understand the consequences of the differential histories that best describe the three population-specific data sets, we have partitioned the corresponding frequency spectra according to the age of the mutations (methods) that gave rise to the polymorphisms (Figure 4, second column). According to these tabulations, 35.9% of the European polymorphisms originated in <10,000 generations, as did a similar fraction, 34.9%, in the Asian model. In contrast, only 29.6% of the African mutations are younger than 10,000 generations. This indicates that the bottleneck events that explain the European and Asian data have eliminated a large fraction of the polymorphisms that predated these events, and a larger fraction of current polymorphisms are of a more recent origin as compared to the African data. This effect is most visible at the common end of the spectrum: only a negligible fraction of the common African SNPs are young, but an appreciable fraction of common European and Asian SNPs have originated <10,000 generations ago and have drifted to high population frequency. Finally, the third column of Figure 4 shows the average age of SNPs at given frequencies, confirming that SNPs at a higher frequency are expected to be older than SNPs at lower frequencies. Also, in each frequency class, the expected age of African SNPs is substantially higher than that of European or Asian SNPs, corroborating earlier observations noting the more ancient origins of African SNPs.

The differential demographic histories of the three populations examined also have important consequences for the extent of allelic association in the human genome, when the different populations are considered. To illustrate this point, we have carried out coalescent simulations, taking into account the individual best-fitting histories, and tabulated the average extent of linkage disequilibrium (LD) between markers separated by different values of recombination fraction (for a fixed value of per-nucleotide, per-generation recombination rate, the recombination fraction translates into physical distance), as shown in Figure 6. Similar demographic histories distilled from the Asian and European samples result in similar values of LD at a given marker distance. LD is predicted to decay more rapidly (roughly twice as fast) for the best-fitting demographic history for the African-American samples, in agreement with previous reports (Reich et al. 2001). Differences in the extent of allelic association within the genome are expected to have profound consequences for medical association studies.

Caveats and open problems: Clearly, our multi-epoch, stepwise models of demographic history represent simplified versions of the “true” demographic past. Nevertheless, our three-epoch models go beyond the majority of previous studies that explore even simpler models of past population dynamics such as expansion vs. collapse or are restricted to the rejection of stationary effective size on the basis of summary statistics. Consideration of the third-order dynamics in this study allowed us to reveal a phase of bottleneck in the history characterizing the European and the Asian samples, permitting reconciliation of the signals of recent population growth apparent in mitochondrial and microsatellite data with realistic, observed values of nucleotide diversity.

Although the signal of differential history is undeniable in the data, the effect is confounded by the fact that the discovery and genotyping data sets were not drawn from a single population. SNP discovery was performed in shotgun sequences from ethnically diverse libraries (with ethnic association of individual reads unknown) aligned to the public genome reference sequence (Sachidanandam et al. 2001), presumably representing a mixture of ethnicities, with a bias toward clones from European donors (Weber et al. 2002). Polymorphic sites generated by this effort were then selected for genotyping in ethnically well-defined samples. It has

Figure 6. The average extent of linkage disequilibrium, as predicted by the best-fitting, three-epoch demographic models for the three population samples. Values of $r^2$ and the corresponding values of recombination fraction are shown for each of the three populations. On the right-hand side, we have indicated the equivalent physical distances assuming a genome average per-nucleotide, per-generation recombination rate, $r = 10^{-8}$ (methods).

been previously noted that collections of samples from multiple ethnicities contain a surplus of rare SNPs when measured in the same mixed collection (Ptak and Przeworski 2002). However, it is unclear what the allele frequency of the same SNPs is when measured separately, within subpopulations. If the ethnicity of the discovery and the genotyping samples were known, one could estimate the effect of the ascertainment bias with models of population subdivision using coalescent simulation (Pluzhnikov et al. 2002). The effect of ascertainment bias between ethnically mismatched or undefined samples is the subject of future investigation.

Additionally, internal population substructure can also distort the frequency spectrum (Przeworski 2002; Ptak and Przeworski 2002). Unfortunately, the little amount of information that was available concerning sample origin did not permit incorporation of this effect into our models in a meaningful fashion. Specifically, we did not take into account in our models the effects of recent admixture in the African-American samples. Although the AFS in these samples are best modeled by population growth, it carries a slight but noticeable dip at medium minor allele frequencies, a feature present in a more pronounced form in both the European (Figure 4a) and the Asian (Figure 4b) spectra. This potentially signifies the contribution of European ancestral lineages on the background of African lineages (Rybicki et al. 2002) in the AFS signal.

We must also acknowledge that the current shape of human variation structure is the result of a combination of neutral and nonneutral (selective) forces. The current state of the art in recognizing the effects of selection in variation data has been reviewed recently (Bamshad and Wooding 2003). Positive selection resulting in genetic hitchhiking can mimic the effects of population expansion in that it gives rise to an excess of low-frequency alleles (Kaplan et al. 1989; Braverman et al. 1995). Recent efforts have been aimed at detecting loci that exhibit signatures of positive selection (Cargill et al. 1999; Sunyaev et al. 2000; Akey et al. 2002; Payseur et al. 2002). However, the exact proportion of genes that have been targets of strong positive selection within our evolutionary past is unclear (Bamshad and Wooding 2003). It is also unclear, in general, how far the effects of hitchhiking extend beyond the locus under selection (Wiehe 1998). Given that only a few percent of the human genome represents coding DNA, and that not all genes are expected to be targets of positive selection, we speculate that the distortion due to selective forces on the AFS in our data set of >20,000 randomly selected genomic loci is small when compared to the global effects of drift modulated by long-term demography.

Conclusion: The allele frequency spectrum is an excellent data source for modeling demographic history because of its independence of the effects of recombination and local, or sequence composition-specific variations of mutation rates and because the experimental determination of the allele frequency spectrum requires measurement of allelic states only at single-nucleotide positions, instead of sequencing of long stretches of contiguous DNA. The emergence of population-specific genotype sets on the genome scale provides sufficient data for the direct comparison of model-predicted and observed spectra with great resolution. This permits us to improve on previous conclusions drawn on the strength of summary statistics, on the basis of data from a handful of loci. Recent advances in allele frequency modeling should provide us with exciting, new tools to explore our demographic past and explain human haplotype structure. Accurate reconstruction of the history of world populations should also help us to detect and interpret differences that must be taken into account during the development of general resources for medical use such as the recently initiated human Haplotype Map Project (Cardon and Abecasis 2003; Clark 2003; Wall and Pritchard 2003).

The authors are indebted to Andrew Clark for useful comments on the manuscript. We also thank Ravi Sachidanandam for kindly providing earlier versions of the allele frequency data set analyzed in this study.

LITERATURE CITED

Akey, J. M., G. Zhang, K. Zhang, L. Jin and M. D. Shriver, 2002 Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 12: 1805–1814.
Altshuler, D., V. J. Pollara, C. R. Cowles, W. J. Van Etten, J. Baldwin et al., 2000 An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407: 513–516.
Bamshad, M., and S. P. Wooding, 2003 Signatures of natural selection in the human genome. Nat. Rev. Genet. 4: 99–111.
Braverman, J. M., R. R. Hudson, N. L. Kaplan, C. H. Langley and W. Stephan, 1995 The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics 140: 783–796.
Cardon, L. R., and G. R. Abecasis, 2003 Using haplotype blocks to map human complex trait loci. Trends Genet. 19: 135–140.
Cargill, M., D. Altshuler, J. Ireland, P. Sklar, K. Ardlie et al., 1999 Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22: 231–238.
Clark, A. G., 2003 Finding genes underlying risk of complex disease by linkage disequilibrium mapping. Curr. Opin. Genet. Dev. 13: 296–302.
Clark, A. G., K. M. Weiss, D. A. Nickerson, S. L. Taylor, A. Buchanan et al., 1998 Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet. 63: 595–612.
Crow, J. F., and M. Kimura, 1970 An Introduction to Population Genetic Theory. Harper & Row, New York.
Di Rienzo, A., and A. C. Wilson, 1991 Branching pattern in the evolutionary tree for human mitochondrial DNA. Proc. Natl. Acad. Sci. USA 88: 1597–1601.
Di Rienzo, A., P. Donnelly, C. Toomajian, B. Sisk, A. Hill et al., 1998 Heterogeneity of microsatellite mutations within and between loci, and implications for human demographic histories. Genetics 148: 1269–1284.
Ewens, W. J., 1972 The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87–112.
Fay, J. C., and C.-I Wu, 1999 A human population bottleneck can account for the discordance between patterns of mitochondrial versus nuclear DNA variation. Mol. Biol. Evol. 16: 1003–1005.
Fu, Y. X., 1995 Statistical properties of segregating sites. Theor. Popul. Biol. 48: 172–197.
Fu, Y. X., and W. H. Li, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693–709.
Gabriel, S. B., S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy et al., 2002 The structure of haplotype blocks in the human genome. Science 296: 2225–2229.
Gonser, R., P. Donnelly, G. Nicholson and A. Di Rienzo, 2000 Microsatellite mutations and inferences about human demography. Genetics 154: 1793–1807.
Griffiths, R. C., and S. Tavare, 1994a Simulating probability distributions in the coalescent. Theor. Popul. Biol. 46: 131–159.
Griffiths, R. C., and S. Tavare, 1994b Sampling theory for neutral alleles in a varying environment. Philos. Trans. R. Soc. Lond. B Biol. Sci. 344: 403–410.
Harding, R. M., S. M. Fullerton, R. C. Griffiths, J. Bond, M. J. Cox et al., 1997 Archaic African and Asian lineages in the genetic ancestry of modern humans. Am. J. Hum. Genet. 60: 772–789.
Harpending, H., and A. Rogers, 2000 Genetic perspectives on human origins and differentiation. Annu. Rev. Genomics Hum. Genet. 1: 361–385.
Hey, J., 1997 Mitochondrial and nuclear genes present conflicting portraits of human origins. Mol. Biol. Evol. 14: 166–172.
Hey, J., and E. Harris, 1999 Population bottlenecks and patterns of human polymorphism. Mol. Biol. Evol. 16: 1423–1426.
Hudson, R. R., 1991 Gene genealogies and the coalescent process, pp. 1–44 in Oxford Surveys in Evolutionary Biology, edited by D. Futuyama and J. Antonovics. Oxford University Press, London/New York/Oxford.
Ingman, M., H. Kaessmann, S. Paabo and U. Gyllensten, 2000 Mitochondrial genome variation and the origin of modern humans. Nature 408: 708–713.
Jorde, L. B., W. S. Watkins and M. J. Bamshad, 2001 Population genomics: a bridge from evolutionary history to genetic medicine. Hum. Mol. Genet. 10: 2199–2207.
Kaplan, N. L., R. R. Hudson and C. H. Langley, 1989 The "hitchhiking effect" revisited. Genetics 123: 887–899.
Kimmel, M., R. Chakraborty, J. P. King, M. Bamshad, W. S. Watkins et al., 1998 Signatures of population expansion in microsatellite repeat data. Genetics 148: 1921–1930.
Kondrashov, A. S., 2003 Direct estimates of human per nucleotide mutation rates at 20 loci causing Mendelian diseases. Hum. Mutat. 21: 12–27.
Kruglyak, L., 1999 Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22: 139–144.
Kuhner, M. K., J. Yamato and J. Felsenstein, 1995 Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics 140: 1421–1430.
Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody et al., 2001 Initial sequencing and analysis of the human genome. Nature 409: 860–921.
Li, W. H., 1977 Distribution of nucleotide differences between two randomly chosen cistrons in a finite population. Genetics 85: 331–337.
Marth, G., G. Schuler, R. Yeh, R. Davenport, R. Agarwala et al., 2003 Sequence variations in the public human genome data reflect a bottlenecked population history. Proc. Natl. Acad. Sci. USA 100: 376–381.
Mullikin, J. C., S. E. Hunt, C. G. Cole, B. J. Mortimore, C. M. Rice et al., 2000 An SNP map of human chromosome 22. Nature 407: 516–520.
Nachman, M. W., and S. L. Crowell, 2000 Estimate of the mutation rate per nucleotide in humans. Genetics 156: 297–304.
Ott, J., 1991 Analysis of Human Genetic Linkage. Johns Hopkins University Press, Baltimore.
Payseur, B. A., A. D. Cutter and M. W. Nachman, 2002 Searching for evidence of positive selection in the human genome using patterns of microsatellite variability. Mol. Biol. Evol. 19: 1143–1153.
Pluzhnikov, A., A. Di Rienzo and R. R. Hudson, 2002 Inferences about human demography based on multilocus analyses of noncoding sequences. Genetics 161: 1209–1218.
Przeworski, M., 2002 The signature of positive selection at randomly chosen loci. Genetics 160: 1179–1189.
Przeworski, M., R. R. Hudson and A. Di Rienzo, 2000 Adjusting the focus on human variation. Trends Genet. 16: 296–302.
Ptak, S. E., and M. Przeworski, 2002 Evidence for population growth in humans is confounded by fine-scale population structure. Trends Genet. 18: 559–563.
Reich, D. E., and D. B. Goldstein, 1998 Genetic evidence for a Paleolithic human population expansion in Africa. Proc. Natl. Acad. Sci. USA 95: 8119–8123.
Reich, D. E., M. Cargill, S. Bolk, J. Ireland, P. C. Sabeti et al., 2001 Linkage disequilibrium in the human genome. Nature 411: 199–204.
Reich, D. E., S. F. Schaffner, M. J. Daly, G. McVean, J. C. Mullikin et al., 2002 Human genome sequence variation and the influence of gene history, mutation and recombination. Nat. Genet. 32: 135–142.
Relethford, J. H., and L. B. Jorde, 1999 Genetic evidence for larger African population size during recent human evolution. Am. J. Phys. Anthropol. 108: 251–260.

Rogers, A. R., 2001 Order emerging from chaos in human evolutionary genetics. Proc. Natl. Acad. Sci. USA 98: 779–780.
Rogers, A. R., and H. Harpending, 1992 Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9: 552–569.
Rybicki, B. A., S. K. Iyengar, T. Harris, R. Liptak, R. C. Elston et al., 2002 The distribution of long range admixture linkage disequilibrium in an African-American population. Hum. Hered. 53: 187–196.
Sachidanandam, R., D. Weissman, S. C. Schmidt, J. M. Kakol, L. D. Stein et al., 2001 A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928–933.
Sherry, S. T., A. R. Rogers, H. Harpending, H. Soodyall, T. Jenkins et al., 1994 Mismatch distributions of mtDNA reveal recent human population expansions. Hum. Biol. 66: 761–775.
Sherry, S. T., H. C. Harpending, M. A. Batzer and M. Stoneking, 1997 Alu evolution in human populations: using the coalescent to estimate effective population size. Genetics 147: 1977–1982.
Sunyaev, S. R., W. C. Lathe III, V. E. Ramensky and P. Bork, 2000 SNP frequencies in human genes an excess of rare alleles and differing modes of selection. Trends Genet. 16: 335–337.
Tajima, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585–595.
Tavare, S., D. J. Balding, R. C. Griffiths and P. Donnelly, 1997 Inferring coalescence times from DNA sequence data. Genetics 145: 505–518.
Tishkoff, S. A., and S. M. Williams, 2002 Genetic analysis of African populations: human evolution and complex disease. Nat. Rev. Genet. 3: 611–621.
Venter, J. C., M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural et al., 2001 The sequence of the human genome. Science 291: 1304–1351.
Wall, J. D., and J. K. Pritchard, 2003 Haplotype blocks and linkage disequilibrium in the human genome. Nat. Rev. Genet. 4: 587–597.
Wall, J. D., and M. Przeworski, 2000 When did the human population size start increasing? Genetics 155: 1865–1874.
Weber, J. L., D. David, J. Heil, Y. Fan, C. Zhao et al., 2002 Human diallelic insertion/deletion polymorphisms. Am. J. Hum. Genet. 71: 854–862.
Wiehe, T., 1998 The effect of selective sweeps on the variance of the allele distribution of a linked multiallele locus: hitchhiking of microsatellites. Theor. Popul. Biol. 53: 272–283.
Wooding, S., and A. Rogers, 2002 The matrix coalescent and an application to human single-nucleotide polymorphisms. Genetics 161: 1641–1650.
Yu, N., Z. Zhao, Y. X. Fu, N. Sambuughin, M. Ramsay et al., 2001 Global patterns of human DNA sequence variation in a 10-kb region on chromosome 1. Mol. Biol. Evol. 18: 214–222.
Zhao, Z., L. Jin, Y. X. Fu, M. Ramsay, T. Jenkins et al., 2000 Worldwide DNA sequence variation in a 10-kilobase noncoding region on human chromosome 22. Proc. Natl. Acad. Sci. USA 97: 11354–11358.
Zhivotovsky, L. A., L. Bennett, A. M. Bowcock and M. W. Feldman, 2000 Human population expansion and microsatellite variation. Mol. Biol. Evol. 17: 757–767.

Communicating editor: L. Excoffier

APPENDIX: THE EXPECTED NUMBER OF SEGREGATING SITES IN A SAMPLE DRAWN FROM A POPULATION CHARACTERIZED BY A PIECEWISE CONSTANT, MULTI-EPOCH HISTORY OF EFFECTIVE SIZE

Model: We consider a population of a given organism evolving under the Wright-Fisher model and under selective neutrality. Let us select a specific site in the genome of the organism. Furthermore, let us randomly draw n DNA samples from this population. Without regard to recombination, the samples possess a unique tree-shaped genealogy at the selected site (the site genealogy). Such a genealogy can be described within the framework of the coalescent: starting with n samples in the present and, through a series of coalescent events (pairs of samples finding their common ancestors), this number reduces to 1, the most recent common ancestor (MRCA), or the root of the genealogy at that site (site root). At a given time, the process is said to be in state j, if at that time the current number of samples is j. This process is Markovian, in that the length of time until the next coalescent event depends only on the current state and is independent of the previous states.

Due to molecular mutation processes, the nucleotide observed at the site under consideration might be different in different individuals. Let us assume that, at any given site, only two possible nucleotides are observed (diallelic variations). Accordingly, an individual carries either the allele that was present in the site root (also known as the ancestral allele) or a mutant or derived allele. Let us further assume that the mutant allele is the result of a single mutation event (infinite-sites assumption) within an ancestral sample of the site genealogy. Under this assumption, the number of samples that carry the derived allele is identical to the number of descendants of that ancestor within the site genealogy.
Conversely, the derived allele is found in exactly i samples if and only if the ancestor in which the mutation occurred gave rise to i descendants. Under the further assumption of a constant-rate mutation process (Hudson 1991), the likelihood that a given mutation is of size i is related to the number of ancestral nodes with i descendants within the site genealogy and to the “life span” of these ancestors. As Fu shows in a seminal work (Fu 1995), this likelihood can be expressed with the length of time the site genealogy spends in state k, i.e., while the number of ancestor samples within the genealogy is exactly k. Under the further assumption of constant effective population size N, Fu then derives an explicit formula for the expected length of time in state k, leading to a simple result for the expected number of mutations of a given size within n samples (Fu 1995). Our final goal is to extend this result from constant to merely piecewise constant population size. To this end, we use a standard continuous approximation according to which the probability density function of the length of time t spent in state k within the genealogy is exponential under a constant population size, and for a diploid population,

$$\frac{\binom{k}{2}}{2N}\, e^{-\binom{k}{2}\, t/2N}.$$

Using this approximation, we derive the expectation for the length of time spent in state $k$, under piecewise constant population history of an arbitrary number of epochs. Under the assumption of a constant-rate mutation process, this allows us to compute the expectation for the number of mutations of size $i$, denoted by $\Psi_i$, observed at a single site, at sites having identical site genealogies (DNA without recombination), or at a collection of sites with completely independent site genealogies. Because the distributions are identical for every site, the result is also valid for a collection of sites.

Conventions and useful identities: We use the convention that the value of an empty product is 1 and the value of an empty sum is 0. The probability density function of a random variable $X$ is denoted by $f_X$ and its cumulative distribution function by $F_X$. The variable $X$ conditioned on the event $Y$ is denoted by $X|Y$. Next, we briefly state four lemmas to aid further derivations. In the following we assume that the $a_i$ are different.

Lemma 1. For every value of $x$, for each $1 \le l \le n$,

$$\sum_{i=l}^{n} \prod_{\substack{m:\, m \ne i \\ l \le m \le n}} \frac{a_m - x}{a_m - a_i} = 1. \tag{A1}$$

Proof. Let

$$f(x) := -1 + \sum_{j=l}^{n} \prod_{\substack{i:\, i \ne j \\ l \le i \le n}} \frac{a_i - x}{a_i - a_j};$$

we need to show that $f(x) \equiv 0$. For $r: l \le r \le n$ we have that

$$f(a_r) = -1 + \prod_{\substack{i:\, i \ne r \\ l \le i \le n}} \frac{a_i - a_r}{a_i - a_r} = 0.$$

Since $f(x)$ is of degree at most $n - l$ and it has at least $n - l + 1$ different zeros, necessarily $f(x) \equiv 0$. Q.E.D.
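As a quick numerical sanity check of Lemma 1 (not part of the original proof), the following minimal Python sketch evaluates the left-hand side of (A1) with exact rational arithmetic. The function name `lagrange_sum`, the range l = 2, n = 7, and the trial values of x are illustrative assumptions; the a_j are taken to be the binomial coefficients j(j-1)/2 used later in this appendix.

```python
from fractions import Fraction


def lagrange_sum(a, x):
    """Left-hand side of (A1): sum over i of prod_{m != i} (a[m]-x)/(a[m]-a[i]).

    `a` is a list of pairwise distinct integers, `x` an integer.
    """
    total = Fraction(0)
    for i, a_i in enumerate(a):
        term = Fraction(1)
        for m, a_m in enumerate(a):
            if m != i:
                term *= Fraction(a_m - x, a_m - a_i)
        total += term
    return total


# Illustrative choice (an assumption): a_j = j(j-1)/2 for j = l..n.
l, n = 2, 7
a = [j * (j - 1) // 2 for j in range(l, n + 1)]

# The lemma claims the sum is 1 for every x, so each entry should equal 1.
values = [lagrange_sum(a, x) for x in (0, 1, 5, -3)]
print(values)
```

Exact fractions are used here so that the identity can be compared with `==` rather than with a floating-point tolerance.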

Lemma 2. For $k, i: 1 \le k < i \le n$ we have

$$\sum_{j=k}^{i} \frac{a_i}{a_j} \left( \prod_{l:\, k < l \le j} \frac{a_l}{a_l - a_k} \right) \left( \prod_{m:\, j \le m < i} \frac{a_m}{a_m - a_i} \right) = 0. \tag{A2}$$

Proof. Let

$$\beta_{k,i} := \sum_{j=k}^{i} \frac{a_i}{a_j} \left( \prod_{l:\, k < l \le j} \frac{a_l}{a_l - a_k} \right) \left( \prod_{m:\, j \le m < i} \frac{a_m}{a_m - a_i} \right).$$

Then $\beta_{k,k+1} = 0$, and for $i > k + 1$

$$\beta_{k,i} = \sum_{j=k}^{i} \frac{a_i}{a_j} \left( \prod_{l:\, k < l \le j} \frac{a_l}{a_l - a_k} \right) \left( \prod_{m:\, j \le m < i} \frac{a_m}{a_m - a_i} \right) = \left( \prod_{l:\, k < l \le i} \frac{a_l}{a_l - a_k} \right) \alpha_{k,i},$$

where

$$\begin{aligned}
\alpha_{k,i} &= 1 + \sum_{j=k}^{i-1} \left( \frac{a_i}{a_j} \cdot \frac{a_j}{a_j - a_i} \cdot \frac{a_i - a_k}{a_i} \right) \left( \prod_{m:\, j < m < i} \frac{a_m - a_k}{a_m} \right) \left( \prod_{m:\, j < m < i} \frac{a_m}{a_m - a_i} \right) \\
&= 1 + \sum_{j=k}^{i-1} \frac{a_i - a_k}{a_j - a_i} \prod_{m:\, j < m < i} \frac{a_m - a_k}{a_m - a_i} \\
&= \frac{a_{i-1} - a_k}{a_{i-1} - a_i} + \sum_{j=k}^{i-2} \left( -1 + \frac{a_j - a_k}{a_j - a_i} \right) \prod_{m:\, j < m < i} \frac{a_m - a_k}{a_m - a_i} \\
&= -\sum_{j=k}^{i-2} \prod_{m:\, j < m < i} \frac{a_m - a_k}{a_m - a_i} + \sum_{j=k+1}^{i-1} \frac{a_j - a_k}{a_j - a_i} \prod_{m:\, j < m < i} \frac{a_m - a_k}{a_m - a_i} \\
&= -\sum_{j=k}^{i-2} \prod_{m:\, j < m < i} \frac{a_m - a_k}{a_m - a_i} + \sum_{j=k+1}^{i-1} \prod_{m:\, j \le m < i} \frac{a_m - a_k}{a_m - a_i} = 0.
\end{aligned}$$

Q.E.D.

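Lemma 3. For $s < k < i \le n$:

$$\sum_{j=k}^{i} \frac{a_i}{a_j} \left( \prod_{\substack{l:\, l \ne k \\ s+1 \le l \le j}} \frac{a_l}{a_l - a_k} \right) \left( \prod_{\substack{m:\, m \ne i \\ j \le m \le n}} \frac{a_m}{a_m - a_i} \right) = 0. \tag{A3}$$

Proof. From Lemma 2,

$$\begin{aligned}
0 &= \left( \prod_{q:\, s+1 \le q < k} \frac{a_q}{a_q - a_k} \right) \left( \prod_{r:\, i < r \le n} \frac{a_r}{a_r - a_i} \right) \sum_{j=k}^{i} \frac{a_i}{a_j} \left( \prod_{l:\, k < l \le j} \frac{a_l}{a_l - a_k} \right) \left( \prod_{m:\, j \le m < i} \frac{a_m}{a_m - a_i} \right) \\
&= \sum_{j=k}^{i} \frac{a_i}{a_j} \left( \prod_{\substack{l:\, l \ne k \\ s+1 \le l \le j}} \frac{a_l}{a_l - a_k} \right) \left( \prod_{\substack{m:\, m \ne i \\ j \le m \le n}} \frac{a_m}{a_m - a_i} \right).
\end{aligned}$$

Q.E.D.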

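Lemma 4.

$$\frac{1}{a_s} = \sum_{j=s}^{n} \frac{1}{a_j} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} - \sum_{j=s+1}^{n} \frac{1}{a_j} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}.$$

Proof. Using Lemma 1,

$$\begin{aligned}
\frac{1}{a_s} &= \sum_{j=s}^{n} \frac{1}{a_s} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} \\
&= \frac{1}{a_s} \prod_{i:\, s+1 \le i \le n} \frac{a_i}{a_i - a_s} + \sum_{j=s+1}^{n} \frac{1}{a_j} \left( 1 - \frac{a_s - a_j}{a_s} \right) \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} \\
&= \sum_{j=s}^{n} \frac{1}{a_j} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} - \sum_{j=s+1}^{n} \frac{1}{a_j} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}.
\end{aligned}$$

Q.E.D.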

Constant effective population size: First, we consider a demographic history characterized by a single, constant population size $N_1$. We introduce the notations $a_j = \binom{j}{2}$ and $a_j^{(1)} = a_j/2N_1$. The length of time spent in state $j$ (after which the number of samples reduces from $j$ to $j - 1$) is denoted by $T_{j,j-1}$. The random variables $T_{j,j-1}$ and $T_{i,i-1}$ are independent for $i \ne j$. The density function of $T_{j,j-1}$ is $f_{T_{j,j-1}}(t) = a_j^{(1)} e^{-a_j^{(1)} t}$, according to our model assumptions. The length of time from the present, when the number of samples is $n$, to the instant when the number of samples reduces to $s$, is denoted by $T^{\{1\}}_{n,s}$. Clearly $T^{\{1\}}_{n,s} = \sum_{j=s+1}^{n} T_{j,j-1}$. The probability that, at time $t$, the genealogy is in state $s$ is $P(T^{\{1\}}_{n,s} \le t < T^{\{1\}}_{n,s-1})$. Since $T^{\{1\}}_{n,l} = T^{\{1\}}_{n,l+1} + T_{l+1,l}$, for $l: 1 \le l < n$ we can use the following convolution: $f_{T^{\{1\}}_{n,l}}(t) = \int_0^t f_{T^{\{1\}}_{n,l+1}}(t - x)\, f_{T_{l+1,l}}(x)\, dx$. Using these notations, the following are true:

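Theorem 1. For $s: 1 \le s < n$:

$$f_{T^{\{1\}}_{n,s}}(t) = \sum_{j=s+1}^{n} a_j^{(1)} e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}, \tag{A4}$$

$$F_{T^{\{1\}}_{n,s}}(t) = 1 - \sum_{j=s+1}^{n} e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}, \tag{A5}$$

$$E\left(T^{\{1\}}_{n,s}\right) = \sum_{j=s+1}^{n} \frac{1}{a_j^{(1)}} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} = 2N_1 \sum_{j=s+1}^{n} \frac{1}{a_j} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}. \tag{A6}$$

For $s: 2 \le s < n$:

$$P\left(T^{\{1\}}_{n,s} \le t < T^{\{1\}}_{n,s-1}\right) = \sum_{j=s}^{n} \frac{a_j}{a_s}\, e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} = \frac{f_{T^{\{1\}}_{n,s-1}}(t)}{a_s^{(1)}}, \tag{A7}$$

$$E\left(T_{s,s-1}\right) = \frac{1}{a_s^{(1)}}. \tag{A8}$$

For $i: 1 \le i < n$:

$$E(\Psi_i) = \frac{4 N_1 \mu}{i}. \tag{A9}$$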

Proof. First we show Equations A4 and A5 by downward induction on $s$. These equations are clearly valid for $s = n - 1$. Assume they are valid for $s: s > k$. Then

$$\begin{aligned}
f_{T^{\{1\}}_{n,k}}(t) &= \int_0^t f_{T^{\{1\}}_{n,k+1}}(t - x)\, f_{T_{k+1,k}}(x)\, dx \\
&= \sum_{j=k+2}^{n} \left[ a_{k+1}^{(1)} a_j^{(1)} e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ k+2 \le i \le n}} \frac{a_i}{a_i - a_j} \int_0^t e^{(a_j^{(1)} - a_{k+1}^{(1)}) x}\, dx \right] \\
&= \sum_{j=k+2}^{n} \left[ a_j^{(1)} e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ k+1 \le i \le n}} \frac{a_i}{a_i - a_j} \left( 1 - e^{(a_j^{(1)} - a_{k+1}^{(1)}) t} \right) \right] \\
&= \sum_{j=k+2}^{n} a_j^{(1)} e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ k+1 \le i \le n}} \frac{a_i}{a_i - a_j} - e^{-a_{k+1}^{(1)} t} \sum_{j=k+2}^{n} a_j^{(1)} \prod_{\substack{i:\, i \ne j \\ k+1 \le i \le n}} \frac{a_i}{a_i - a_j}.
\end{aligned}$$

For Equation A4 we need to show that

$$-\sum_{j=k+2}^{n} a_j \prod_{\substack{i:\, i \ne j \\ k+1 \le i \le n}} \frac{a_i}{a_i - a_j} = a_{k+1} \prod_{k+2 \le i \le n} \frac{a_i}{a_i - a_{k+1}}.$$

This is equivalent to

$$1 = -\frac{\sum_{j=k+2}^{n} a_j \prod_{\substack{i:\, i \ne j \\ k+1 \le i \le n}} a_i/(a_i - a_j)}{a_{k+1} \prod_{k+2 \le i \le n} a_i/(a_i - a_{k+1})} = \sum_{j=k+2}^{n} \prod_{\substack{i:\, i \ne j \\ k+2 \le i \le n}} \frac{a_i - a_{k+1}}{a_i - a_j},$$

which follows from Lemma 1. Using Lemma 1 with $l = s + 1$ and $x = 0$, we get

$$\begin{aligned}
F_{T^{\{1\}}_{n,s}}(t) &= P\left(T^{\{1\}}_{n,s} \le t\right) = \int_0^t f_{T^{\{1\}}_{n,s}}(x)\, dx = \sum_{j=s+1}^{n} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} \int_0^t a_j^{(1)} e^{-a_j^{(1)} x}\, dx \\
&= \sum_{j=s+1}^{n} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} \left( 1 - e^{-a_j^{(1)} t} \right) \\
&= \sum_{j=s+1}^{n} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} - \sum_{j=s+1}^{n} e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} \\
&= 1 - \sum_{j=s+1}^{n} e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}.
\end{aligned}$$

This completes the proof of Equations A4 and A5. For (A7), note that $P(T^{\{1\}}_{n,s} > t) = 1 - F_{T^{\{1\}}_{n,s}}(t)$ and $P(T^{\{1\}}_{n,s} \le t < T^{\{1\}}_{n,s-1}) = P(T^{\{1\}}_{n,s-1} > t) - P(T^{\{1\}}_{n,s} > t)$. Then

$$\begin{aligned}
P\left(T^{\{1\}}_{n,s} \le t < T^{\{1\}}_{n,s-1}\right) &= \sum_{j=s}^{n} e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} - \sum_{j=s+1}^{n} e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} \\
&= e^{-a_s^{(1)} t} \prod_{s+1 \le i \le n} \frac{a_i}{a_i - a_s} + \sum_{j=s+1}^{n} \left( 1 - \frac{a_s - a_j}{a_s} \right) e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} \\
&= \frac{a_s}{a_s}\, e^{-a_s^{(1)} t} \prod_{\substack{i:\, i \ne s \\ s \le i \le n}} \frac{a_i}{a_i - a_s} + \sum_{j=s+1}^{n} \frac{a_j}{a_s}\, e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} \\
&= \sum_{j=s}^{n} \frac{a_j}{a_s}\, e^{-a_j^{(1)} t} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} = \frac{f_{T^{\{1\}}_{n,s-1}}(t)}{a_s^{(1)}}.
\end{aligned}$$

For (A6), since $T^{\{1\}}_{n,s} \ge 0$,

$$\begin{aligned}
E\left(T^{\{1\}}_{n,s}\right) &= \int_0^\infty P\left(T^{\{1\}}_{n,s} \ge x\right) dx = \int_0^\infty \sum_{j=s+1}^{n} e^{-a_j^{(1)} x} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}\, dx \\
&= \sum_{j=s+1}^{n} \frac{1}{a_j^{(1)}} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} \int_0^\infty a_j^{(1)} e^{-a_j^{(1)} x}\, dx = \sum_{j=s+1}^{n} \frac{1}{a_j^{(1)}} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}.
\end{aligned}$$

Equation A8 can be easily obtained from $f_{T_{s,s-1}}(t)$. Finally, Equation A9 follows from Equation A8, by the argument presented by Fu (1995) to derive Equation 22. Q.E.D.

Piecewise constant effective population size: Consider a demographic history of $M$ distinct epochs indexed by $1, 2, \ldots, M$, where the ancestral epoch is numbered $M$. For epoch $i$, the constant effective population size is $N_i$, and the duration of this epoch is $T_i$; in particular, $T_M = \infty$. We define $a_k^{(i)} = \binom{k}{2}/2N_i$. We introduce $\tau_i = \sum_{j=1}^{i} T_j$, the time from the present back until the end of the $i$th epoch (so $\tau_0 = 0$ and $\tau_M = \infty$). At a given time $t$, the index of the current epoch is denoted by $m(t)$, in formula $m(t) = \min\{k : \tau_k \ge t\}$. In particular, $m(\tau_i) = i$, and $\tau_{m(t)-1} < t \le \tau_{m(t)}$. We also introduce a "normalized" time $t^*$:

$$t^* = \frac{t - \tau_{m(t)-1}}{2N_{m(t)}} + \sum_{i=1}^{m(t)-1} \frac{T_i}{2N_i}.$$
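The normalized time simply measures elapsed time in units of 2N generations, using whichever N applies in each epoch that has been crossed. A minimal Python sketch of this definition follows; the function name `normalized_time`, the argument layout, and the three-epoch numbers in the usage lines are illustrative assumptions, not values from the paper.

```python
def normalized_time(t, sizes, durations):
    """Normalized time t* for a piecewise constant history.

    sizes:     [N_1, ..., N_M], effective sizes from the present backward.
    durations: [T_1, ..., T_{M-1}]; the ancestral epoch M is infinite.
    """
    t_star = 0.0   # normalized time accumulated over completed epochs
    start = 0.0    # tau_{m-1}, the start of the epoch being examined
    for N, T in zip(sizes, durations):
        if t <= start + T:                 # t falls inside this epoch
            return t_star + (t - start) / (2 * N)
        t_star += T / (2 * N)              # the whole epoch lies before t
        start += T
    # t lies in the ancestral (infinite) epoch
    return t_star + (t - start) / (2 * sizes[-1])


# Hypothetical bottleneck-shaped history (assumed numbers):
sizes = [10000, 2000, 10000]
durations = [3000, 500]
print(normalized_time(3000, sizes, durations))
print(normalized_time(4500, sizes, durations))
```

Time passes faster on the normalized scale during the small-population epoch, which is why a short bottleneck can account for a large share of the coalescent events.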

The proof is based on induction on the number of epochs. To facilitate this, we consider two kinds of partial models with smaller numbers of epochs, as follows:

1. The first model has a single epoch, with effective population size $N_i$. The random variable $T^{\{i\}}_{n,j}$ denotes the time from the present (state $n$) to the beginning of state $j$, under the parameters of the first model.
2. The second model is a truncated version of the original $M$-epoch model: it consists of $i$ epochs, with parameters that are identical to the parameters of the first $i$ epochs of the original model, except $T_i = \infty$; i.e., the $i$th epoch of the original model becomes the ancestral epoch of the truncated model. The random variable $T^{[i]}_{n,j}$ denotes the time from the present (state $n$) to reach state $j$, under the parameters of the second model.

Note that the two types of models coincide when $i = 1$. The following are true:

Theorem 2. For $s: 1 \le s < n$:

$$f_{T^{[M]}_{n,s}}(t) = f_{T^{[m(t)]}_{n,s}}(t) \quad \text{and} \quad F_{T^{[M]}_{n,s}}(t) = F_{T^{[m(t)]}_{n,s}}(t), \tag{A10}$$

$$f_{T^{[M]}_{n,s}}(t) = \frac{1}{2N_{m(t)}} \sum_{j=s+1}^{n} a_j e^{-a_j t^*} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}, \tag{A11}$$

$$F_{T^{[M]}_{n,s}}(t) = 1 - \sum_{j=s+1}^{n} e^{-a_j t^*} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}, \tag{A12}$$

$$\begin{aligned}
E\left(T^{[M]}_{n,s}\right) &= \sum_{j=s+1}^{n} \frac{1}{a_j^{(1)}} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} + \sum_{m=1}^{M-1} \sum_{j=s+1}^{n} e^{-\sum_{l=1}^{m} a_j^{(l)} T_l} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} \left( \frac{1}{a_j^{(m+1)}} - \frac{1}{a_j^{(m)}} \right) \\
&= 2N_1 \sum_{j=s+1}^{n} \frac{1}{a_j} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} + \sum_{m=1}^{M-1} 2(N_{m+1} - N_m) \sum_{j=s+1}^{n} e^{-a_j \tau_m^*}\, \frac{1}{a_j} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}.
\end{aligned} \tag{A13}$$

For $s: 2 \le s < n$:

$$P\left(T^{[M]}_{n,s} \le t < T^{[M]}_{n,s-1}\right) = \frac{f_{T^{[M]}_{n,s-1}}(t)}{a_s^{(m(t))}} = \frac{f_{T^{[m(t)]}_{n,s-1}}(t)}{a_s^{(m(t))}}, \tag{A14}$$

$$\begin{aligned}
E\left(T_{s,s-1}\right) &= \frac{1}{a_s^{(1)}} + \sum_{m=1}^{M-1} \sum_{j=s}^{n} e^{-\sum_{l=1}^{m} a_j^{(l)} T_l} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} \left( \frac{1}{a_s^{(m+1)}} - \frac{1}{a_s^{(m)}} \right) \\
&= \frac{2}{a_s} \left[ N_1 + \sum_{m=1}^{M-1} (N_{m+1} - N_m) \sum_{j=s}^{n} e^{-a_j \tau_m^*} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} \right].
\end{aligned} \tag{A15}$$

For $i: 1 \le i < n$:

$$E(\Psi_i) = \frac{4\mu}{i} \left[ N_1 + \sum_{m=1}^{M-1} \frac{N_{m+1} - N_m}{\binom{n-1}{i}} \sum_{k=2}^{n-i+1} \binom{n-k}{i-1} \sum_{j=k}^{n} e^{-j(j-1)\tau_m^*/2} \prod_{\substack{l:\, l \ne j \\ k \le l \le n}} \frac{l(l-1)}{l(l-1) - j(j-1)} \right]. \tag{A16}$$
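To make the final result concrete, here is a minimal Python sketch that evaluates the expected number of mutations of size i as we read Equation A16. It is an illustration only: the function name `expected_mutations`, the argument layout, and all numerical values are assumptions, and plain floating point is used, so the alternating products may lose precision for large n.

```python
import math
from math import comb


def expected_mutations(i, n, mu, sizes, durations):
    """E(Psi_i) for n samples under a piecewise constant history (A16 as read here).

    sizes:     [N_1, ..., N_M], effective sizes from the present backward.
    durations: [T_1, ..., T_{M-1}]; the ancestral epoch M is infinite.
    mu:        per-site, per-generation mutation rate.
    """
    # Normalized epoch boundaries tau*_m = sum_{l<=m} T_l / (2 N_l), m = 1..M-1.
    tau_star, acc = [], 0.0
    for N, T in zip(sizes, durations):
        acc += T / (2 * N)
        tau_star.append(acc)

    bracket = float(sizes[0])
    for m, ts in enumerate(tau_star):
        inner = 0.0
        for k in range(2, n - i + 2):          # k = 2 .. n-i+1
            s_k = 0.0
            for j in range(k, n + 1):
                prod = 1.0
                for l in range(k, n + 1):
                    if l != j:
                        prod *= l * (l - 1) / (l * (l - 1) - j * (j - 1))
                s_k += math.exp(-j * (j - 1) * ts / 2) * prod
            inner += comb(n - k, i - 1) * s_k
        bracket += (sizes[m + 1] - sizes[m]) * inner / comb(n - 1, i)
    return 4 * mu / i * bracket


# Hypothetical usage: a two-epoch expansion, n = 8 samples (assumed numbers).
spectrum = [expected_mutations(i, 8, 1e-8, [10000, 2000], [3000]) for i in range(1, 8)]
print(spectrum)
```

Two limiting cases give a useful check on the reading of the formula: with equal sizes in every epoch the result reduces to the constant-size value 4Nμ/i of Equation A9, and as the first epoch shrinks to zero duration the result tends to the constant-size value for N_2.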

Proof: (A12) and (A14) are consequences of (A11):

$$\begin{aligned}
F_{T^{[M]}_{n,s}}(t) &= 1 - \int_t^\infty f_{T^{[M]}_{n,s}}(x)\, dx = 1 - \sum_{j=s+1}^{n} e^{-a_j^{(M)}(-\tau_{M-1}) - \sum_{l=1}^{M-1} a_j^{(l)} T_l} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} \int_t^\infty a_j^{(M)} e^{-a_j^{(M)} x}\, dx \\
&= 1 - \sum_{j=s+1}^{n} e^{-a_j^{(M)}(t - \tau_{M-1}) - \sum_{l=1}^{M-1} a_j^{(l)} T_l} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}.
\end{aligned}$$

$$\begin{aligned}
P\left(T^{[M]}_{n,s} \le t < T^{[M]}_{n,s-1}\right) &= F_{T^{[M]}_{n,s}}(t) - F_{T^{[M]}_{n,s-1}}(t) \\
&= \sum_{j=s}^{n} e^{-a_j^{(M)}(t - \tau_{M-1}) - \sum_{l=1}^{M-1} a_j^{(l)} T_l} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} - \sum_{j=s+1}^{n} e^{-a_j^{(M)}(t - \tau_{M-1}) - \sum_{l=1}^{M-1} a_j^{(l)} T_l} \prod_{\substack{i:\, i \ne j \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} \\
&= e^{-a_s^{(M)}(t - \tau_{M-1}) - \sum_{l=1}^{M-1} a_s^{(l)} T_l} \prod_{\substack{i:\, i \ne s \\ s \le i \le n}} \frac{a_i}{a_i - a_s} + \sum_{j=s+1}^{n} \left( 1 - \frac{a_s - a_j}{a_s} \right) e^{-a_j^{(M)}(t - \tau_{M-1}) - \sum_{l=1}^{M-1} a_j^{(l)} T_l} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} \\
&= \sum_{j=s}^{n} \frac{a_j}{a_s}\, e^{-a_j^{(M)}(t - \tau_{M-1}) - \sum_{l=1}^{M-1} a_j^{(l)} T_l} \prod_{\substack{i:\, i \ne j \\ s \le i \le n}} \frac{a_i}{a_i - a_j} = \frac{f_{T^{[M]}_{n,s-1}}(t)}{a_s^{(M)}} = \frac{f_{T^{[m(t)]}_{n,s-1}}(t)}{a_s^{(m(t))}}.
\end{aligned}$$

We prove (A10) and (A11) by induction on the number of epochs $M$. The statements are true for $M = 1$ by Theorem 1. For $M > 1$ assume that the statements are true if the number of epochs is less than $M$. Clearly,

\[
\bigl\{ T^{[M]}_{n,j} = t \bigr\} = \bigl\{ T^{[M]}_{n,j} \le \tau_{M-1} \text{ and } T^{[M]}_{n,j} = t \bigr\} \cup \bigcup_{i=j+1}^{n} \bigl\{ T^{[M]}_{n,i} \le \tau_{M-1} < T^{[M]}_{n,i-1} \text{ and } T^{[M]}_{n,j} = t \bigr\}.
\]

The right side is a union of disjoint events; therefore (using density functions of conditioned variables) we have

\[
f^{[M]}_{T_{n,j}}(t) = P\bigl(T^{[M]}_{n,j} \le \tau_{M-1}\bigr)\, f_{T^{[M]}_{n,j} \mid T^{[M]}_{n,j} \le \tau_{M-1}}(t) + \sum_{i=j+1}^{n} P\bigl(T^{[M]}_{n,i} \le \tau_{M-1} < T^{[M]}_{n,i-1}\bigr)\, f_{T^{[M]}_{n,j} \mid T^{[M]}_{n,i} \le \tau_{M-1} < T^{[M]}_{n,i-1}}(t).
\]

Clearly

\[
f_{T^{[M]}_{n,j} \mid T^{[M]}_{n,j} \le \tau_{M-1}}(t) =
\begin{cases}
f^{[M-1]}_{T_{n,j}}(t) \big/ P\bigl(T^{[M]}_{n,j} \le \tau_{M-1}\bigr), & t \le \tau_{M-1}, \\
0, & t > \tau_{M-1},
\end{cases}
\]

and for each $i > j$

\[
f_{T^{[M]}_{n,j} \mid T^{[M]}_{n,i} \le \tau_{M-1} < T^{[M]}_{n,i-1}}(t) =
\begin{cases}
0, & t \le \tau_{M-1}, \\
f^{\{M\}}_{i,j}(t - \tau_{M-1}), & t > \tau_{M-1}.
\end{cases}
\]

Therefore for $t \le \tau_{M-1}$ we have $f^{[M]}_{T_{n,j}}(t) = f^{[M-1]}_{T_{n,j}}(t)$ and $F^{[M]}_{T_{n,j}}(t) = F^{[M-1]}_{T_{n,j}}(t)$, so using the induction hypothesis, for $t \le \tau_{M-1}$, Equations A10, and consequently A11, hold. In particular,

\[
P\bigl(T^{[M]}_{n,s} \le \tau_{M-1} < T^{[M]}_{n,s-1}\bigr) = \frac{f^{[M-1]}_{T_{n,s-1}}(\tau_{M-1})}{a_s^{(M-1)}} = \sum_{j=s}^{n} \frac{a_j}{a_s}\, e^{-\sum_{l=1}^{M-1} a_j^{(l)} T_l} \prod_{\substack{i:\, i \neq j; \\ s \le i \le n}} \frac{a_i}{a_i - a_j}.
\]

If $t > \tau_{M-1}$, i.e., $m(t) = M$, then (A10) and (A11) follow from Lemma 3:

\[
\begin{aligned}
f^{[M]}_{T_{n,i}}(t)
&= \sum_{s=i+1}^{n} \left( \sum_{m=s}^{n} \frac{a_m}{a_s}\, e^{-\sum_{l=1}^{M-1} a_m^{(l)} T_l} \prod_{\substack{p:\, p \neq m; \\ s \le p \le n}} \frac{a_p}{a_p - a_m} \right) \left( \sum_{j=i+1}^{s} a_j^{(M)} e^{-a_j^{(M)} (t - \tau_{M-1})} \prod_{\substack{q:\, q \neq j; \\ i+1 \le q \le s}} \frac{a_q}{a_q - a_j} \right) \\
&= \sum_{j=i+1}^{n} a_j^{(M)} e^{-a_j^{(M)} (t - \tau_{M-1})} \sum_{m=j}^{n} e^{-\sum_{l=1}^{M-1} a_m^{(l)} T_l} \sum_{s=j}^{m} \frac{a_m}{a_s} \prod_{\substack{q:\, q \neq j; \\ i+1 \le q \le s}} \frac{a_q}{a_q - a_j} \prod_{\substack{p:\, p \neq m; \\ s \le p \le n}} \frac{a_p}{a_p - a_m} \\
&= \sum_{j=i+1}^{n} a_j^{(M)} e^{-a_j^{(M)} (t - \tau_{M-1}) - \sum_{l=1}^{M-1} a_j^{(l)} T_l} \prod_{\substack{q:\, q \neq j; \\ i+1 \le q \le n}} \frac{a_q}{a_q - a_j}.
\end{aligned}
\]

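We get Equation A13 in a way similar to the proof of Equation A8 (the last epoch is unbounded, i.e., $T_M = \infty$):

\[
\begin{aligned}
E\bigl(T^{[M]}_{n,s}\bigr)
&= \int_{0}^{\infty} \bigl( 1 - F^{[M]}_{T_{n,s}}(t) \bigr)\, dt
 = \int_{0}^{\infty} \sum_{j=s+1}^{n} e^{-a_j^{(m(t))} (t - \tau_{m(t)-1}) - \sum_{l=1}^{m(t)-1} a_j^{(l)} T_l} \prod_{\substack{i:\, i \neq j; \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}\, dt \\
&= \sum_{m=1}^{M} \int_{0}^{T_m} \sum_{j=s+1}^{n} e^{-a_j^{(m)} t - \sum_{l=1}^{m-1} a_j^{(l)} T_l} \prod_{\substack{i:\, i \neq j; \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}\, dt \\
&= \sum_{j=s+1}^{n} \frac{1}{a_j^{(M)}}\, e^{-\sum_{l=1}^{M-1} a_j^{(l)} T_l} \prod_{\substack{i:\, i \neq j; \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}
 + \sum_{m=1}^{M-1} \sum_{j=s+1}^{n} \frac{1}{a_j^{(m)}} \bigl( 1 - e^{-a_j^{(m)} T_m} \bigr) e^{-\sum_{l=1}^{m-1} a_j^{(l)} T_l} \prod_{\substack{i:\, i \neq j; \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} \\
&= \sum_{m=1}^{M} \sum_{j=s+1}^{n} \frac{1}{a_j^{(m)}}\, e^{-\sum_{l=1}^{m-1} a_j^{(l)} T_l} \prod_{\substack{i:\, i \neq j; \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}
 - \sum_{m=1}^{M-1} \sum_{j=s+1}^{n} \frac{1}{a_j^{(m)}}\, e^{-\sum_{l=1}^{m} a_j^{(l)} T_l} \prod_{\substack{i:\, i \neq j; \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} \\
&= \sum_{j=s+1}^{n} \frac{1}{a_j^{(1)}} \prod_{\substack{i:\, i \neq j; \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j}
 + \sum_{m=1}^{M-1} \sum_{j=s+1}^{n} e^{-\sum_{l=1}^{m} a_j^{(l)} T_l} \prod_{\substack{i:\, i \neq j; \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} \left( \frac{1}{a_j^{(m+1)}} - \frac{1}{a_j^{(m)}} \right).
\end{aligned}
\]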

Using Lemma 4,

\[
\begin{aligned}
E(T_{s,s-1}) &= E\bigl(T^{[M]}_{n,s-1}\bigr) - E\bigl(T^{[M]}_{n,s}\bigr) \\
&= \frac{1}{a_s^{(1)}} + \sum_{m=1}^{M-1} \sum_{j=s}^{n} e^{-\sum_{l=1}^{m} a_j^{(l)} T_l} \prod_{\substack{i:\, i \neq j; \\ s \le i \le n}} \frac{a_i}{a_i - a_j} \left( \frac{1}{a_j^{(m+1)}} - \frac{1}{a_j^{(m)}} \right)
 - \sum_{m=1}^{M-1} \sum_{j=s+1}^{n} e^{-\sum_{l=1}^{m} a_j^{(l)} T_l} \prod_{\substack{i:\, i \neq j; \\ s+1 \le i \le n}} \frac{a_i}{a_i - a_j} \left( \frac{1}{a_j^{(m+1)}} - \frac{1}{a_j^{(m)}} \right) \\
&= \frac{1}{a_s^{(1)}} + \sum_{m=1}^{M-1} e^{-\sum_{l=1}^{m} a_s^{(l)} T_l} \prod_{i=s+1}^{n} \frac{a_i}{a_i - a_s} \left( \frac{1}{a_s^{(m+1)}} - \frac{1}{a_s^{(m)}} \right) \\
&\quad + \sum_{m=1}^{M-1} \sum_{j=s+1}^{n} e^{-\sum_{l=1}^{m} a_j^{(l)} T_l} \prod_{\substack{i:\, i \neq j; \\ s \le i \le n}} \frac{a_i}{a_i - a_j} \left( \frac{1}{a_j^{(m+1)}} - \frac{1}{a_j^{(m)}} \right) \left( 1 - \frac{a_s - a_j}{a_s} \right) \\
&= \frac{1}{a_s^{(1)}} + \sum_{m=1}^{M-1} \sum_{j=s}^{n} e^{-\sum_{l=1}^{m} a_j^{(l)} T_l} \prod_{\substack{i:\, i \neq j; \\ s \le i \le n}} \frac{a_i}{a_i - a_j} \left( \frac{1}{a_s^{(m+1)}} - \frac{1}{a_s^{(m)}} \right).
\end{aligned}
\]

This gives Equation A15. Finally, using manipulations identical to those used by Fu (1995) we derive Equation A16:

\[
\begin{aligned}
\frac{E(\Psi_i)}{4\mu}
&= \sum_{k=2}^{n} \frac{\binom{n-k}{i-1}}{i \binom{n-1}{i}} \left[ N_1 + \sum_{m=1}^{M-1} (N_{m+1} - N_m) \sum_{j=k}^{n} e^{-a_j \tau^{*}_m} \prod_{\substack{s:\, s \neq j; \\ k \le s \le n}} \frac{a_s}{a_s - a_j} \right] \\
&= \frac{N_1}{i \binom{n-1}{i}} \sum_{k=2}^{n} \binom{n-k}{i-1} + \sum_{m=1}^{M-1} \frac{N_{m+1} - N_m}{i \binom{n-1}{i}} \sum_{k=2}^{n} \binom{n-k}{i-1} \sum_{j=k}^{n} e^{-a_j \tau^{*}_m} \prod_{\substack{s:\, s \neq j; \\ k \le s \le n}} \frac{a_s}{a_s - a_j} \\
&= \frac{N_1}{i} + \sum_{m=1}^{M-1} \frac{N_{m+1} - N_m}{i} \binom{n-1}{i}^{-1} \sum_{k=2}^{n} \binom{n-k}{i-1} \sum_{j=k}^{n} e^{-a_j \tau^{*}_m} \prod_{\substack{s:\, s \neq j; \\ k \le s \le n}} \frac{a_s}{a_s - a_j},
\end{aligned}
\]

where $\tau^{*}_m = \sum_{l=1}^{m} (T_l / 2N_l)$. This completes the proof. Q.E.D.

Sequence variations in the public human genome data reflect a bottlenecked population history

Gabor Marth*†, Greg Schuler*, Raymond Yeh‡, Ruth Davenport§, Richa Agarwala*, Deanna Church*, Sarah Wheelan*¶, Jonathan Baker*, Ming Ward*, Michael Kholodov*, Lon Phan*, Eva Czabarka*, Janos Murvai*, David Cutler‖, Stephen Wooding**, Alan Rogers**, Aravinda Chakravarti‖, Henry C. Harpending**, Pui-Yan Kwok†,††, and Stephen T. Sherry*†

*National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894; ‡Department of Genetics, Washington University School of Medicine, St. Louis, MO 63130; §Division of Internal Medicine, Washington University School of Medicine, St. Louis, MO 63130; ¶Department of Molecular Biology and Genetics and ‖McKusick–Nathans Institute of Genetic Medicine, The Johns Hopkins University School of Medicine, Baltimore, MD 21205; **Department of Anthropology, University of Utah, Salt Lake City, UT 84112; and ††Cardiovascular Research Institute and Department of Dermatology, University of California, San Francisco, CA 94143

Contributed by Henry C. Harpending, November 5, 2002

Single-nucleotide polymorphisms (SNPs) constitute the great majority of variations in the human genome, and as heritable variable landmarks they are useful markers for disease mapping and resolving population structure. Redundant coverage in overlaps of large-insert genomic clones, sequenced as part of the Human Genome Project, comprises a quarter of the genome, and it is representative in terms of base compositional and functional sequence features. We mined these regions to produce 500,000 high-confidence SNP candidates as a uniform resource for describing nucleotide diversity and its regional variation within the genome. Distributions of marker density observed at different overlap length scales under a model of recombination and population size change show that the history of the population represented by the public genome sequence is one of collapse followed by a recent phase of mild size recovery. The inferred times of collapse and recovery are Upper Paleolithic, in agreement with archaeological evidence of the initial modern human colonization of Europe.

376–381 | PNAS | January 7, 2003 | vol. 100 | no. 1 | www.pnas.org/cgi/doi/10.1073/pnas.222673099

Abbreviations: SNPs, single-nucleotide polymorphisms; BAC, bacterial artificial chromosome.

Data deposition: SNPs discovered in this study are available from the dbSNP web site (http://ncbi.nlm.nih.gov/SNP), under the "KWOK" submitter handle (accession nos. ss1566252–ss2075206).

†To whom correspondence and requests for materials may be addressed. E-mail: [email protected], [email protected], or [email protected].

Information on the demographic history of a species is imprinted in the distribution of sequence variations in its genome. The completion of a draft sequence for the human genome provides a useful substrate for both the detection of sequence variants and a study of their distribution. To date, the number of publicly available single-nucleotide polymorphisms (SNPs) well exceeds two million (dbSNP build 105). The main data sources for computational SNP discovery have been expressed sequence tags (ESTs) (1, 2), genomic restriction fragments (3), sequences aligned to genome both from the ends of bacterial artificial chromosomes (BACs) and from random shotgun sequences of clone sequence, and overlapping regions of genomic clone sequences themselves (4, 5). Generally, SNPs from these data were detected in surveys of a few chromosomes, an ascertainment strategy that biases allele frequency patterns toward common variations (6), and thus these data are expected to fall into a range that is unlikely to contain the majority of clinically important mutations (7, 8). Under the "common disease, common allele" hypothesis, however, these common variants may be of special importance. In either case, to assess the potential utility of these data for inferences of gene function or population history, one must first understand its overall structure and distribution in the genome. Statistical power in such analyses requires a large amount of data, ascertained under uniform, well-characterized conditions. Clone overlaps and their derived variations are well suited for this task, as long (up to 100 kb) regions of redundant sequence coverage distributed in roughly even intervals (5), covering nearly a quarter of the genome. The fact that regions in a wide range of overlap length are available makes this set especially suited for studying the effects of recombination and demographic size fluctuation on the spatial (density) distribution of genomic sequence variations. To this end, we built a set of reagents (pairwise sequence alignments and their corresponding sets of variation) by analyzing the overlapping regions of large-insert clones sequenced as part of the human genome project. These data provided marker density observations grouped by overlap fragment length. Extending previous methods (9, 10), we implemented simulation and numerical techniques to estimate population genetic parameters that best describe these observed data. We report results indicating that both the effects of recombination and substantial changes in effective population size are required to fit models of neutral sequence evolution to observed marker densities.

Methods

Overlap Detection, SNP Discovery, and Tabulation of Observed Marker Density. The initial data consisted of genomic clones of either finished or draft quality that were part of the September 5, 2000, genome data freeze. Regions of known human repeats and low complexity sequence were masked with REPEATMASKER (Arian Smit, http://repeatmasker.genome.washington.edu). Candidate sequence overlaps were determined by a fast initial similarity search with MEGABLAST (11), followed by pairwise alignment with the dynamic programming algorithm CROSS_MATCH (Phil Green, www.phrap.org). Draft quality sequence is often composed of unordered fragments; hence an overlap between two such clones is broken up into a set of partial overlap fragments. Overlaps were retained for further analysis if: (i) both clones resided on the same chromosome, as could be determined by physical mapping; and (ii) total overlap length was >6 kb, counting only overlap fragments longer than 3 kb in the total. Overlap fragments were analyzed with the POLYBAYES SNP-discovery program (12). An observed mismatch was called a candidate SNP if the corresponding POLYBAYES probability value was at least 0.80, and there were no discrepancies in the five base pairs immediately flanking either side. To avoid false positive predictions caused by the erroneous alignment of divergent copies of segmental duplications (sequence paralogy) we have excluded overlap fragments with >1 SNP per 400 nucleotides. This censorship procedure, necessary to maintain a high quality for the candidate set, also removes overlap fragments in which the inherent polymorphism rate was genuinely high. The resulting bias was estimated in subsequent analysis. An additional bias was introduced when regions of low-quality sequences were analyzed. These regions cannot be effectively evaluated for SNPs, as sequence differences are more likely to represent sequencing errors than true polymorphisms. We rectified this bias by adjusting the overlap interval to include only the high-quality portions of the overlaps [i.e., where the base quality value (13) was >35 in both sequences]. This procedure discarded ≈5% of the total overlap length.
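The retention and censoring rules described in this subsection can be summarized in a few lines of code. The following is a minimal sketch, not the pipeline actually used: the `Mismatch` and `Fragment` records and all function names are hypothetical, the POLYBAYES probability is taken as a precomputed input, and the density censor is assumed to apply to the called candidate SNPs.

```python
from dataclasses import dataclass, field
from typing import List

MIN_FRAGMENT_LEN = 3_000     # only fragments longer than 3 kb count toward the total
MIN_TOTAL_OVERLAP = 6_000    # total overlap length must exceed 6 kb
MIN_SNP_PROBABILITY = 0.80   # POLYBAYES probability threshold
CENSOR_WINDOW = 400          # more than 1 SNP per 400 nucleotides is censored
MIN_BASE_QUALITY = 35        # base quality must exceed 35 in both sequences


@dataclass
class Mismatch:
    probability: float        # SNP probability reported by the caller (assumed given)
    flank_discrepancies: int  # discrepancies in the 5 bp flanking either side


@dataclass
class Fragment:
    length: int
    mismatches: List[Mismatch] = field(default_factory=list)


def retain_overlap(fragments: List[Fragment], same_chromosome: bool) -> bool:
    """Criteria (i) and (ii): same chromosome and more than 6 kb of long fragments."""
    counted = sum(f.length for f in fragments if f.length > MIN_FRAGMENT_LEN)
    return same_chromosome and counted > MIN_TOTAL_OVERLAP


def call_snps(fragment: Fragment) -> List[Mismatch]:
    """Keep mismatches with probability >= 0.80 and clean 5-bp flanks."""
    return [m for m in fragment.mismatches
            if m.probability >= MIN_SNP_PROBABILITY and m.flank_discrepancies == 0]


def is_censored(fragment: Fragment) -> bool:
    """True when the fragment carries more than 1 candidate SNP per 400 nucleotides."""
    return len(call_snps(fragment)) * CENSOR_WINDOW > fragment.length


def is_high_quality(quality_a: int, quality_b: int) -> bool:
    """A column counts toward overlap length only if both base qualities exceed 35."""
    return quality_a > MIN_BASE_QUALITY and quality_b > MIN_BASE_QUALITY
```

The integer comparison in `is_censored` avoids floating-point division when testing the 1-per-400 threshold.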

Integration with the Public Genome Assembly. To ensure an unbiased evaluation of the density distributions with respect to the reference genome sequence, we included only those portions of our overlaps that were also present in the genome assembly based on the September 5, 2000, data (14). We evaluated repeat content in the genome, as well as in the clone overlap regions by using REPEATMASKER. We used custom software to compute G+C nucleotide and CpG dinucleotide content.
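The custom software is not described further. As a rough illustration of the two composition measures, the sketch below computes the G+C fraction over unambiguous bases and the CpG fraction over adjacent unambiguous base pairs. The function name and these exact definitions (in particular the denominators) are assumptions, not the authors' specification.

```python
def gc_and_cpg_content(sequence: str):
    """Return (G+C fraction, CpG dinucleotide fraction) for a DNA string.

    Ambiguous or masked symbols (anything other than A, C, G, T after
    upper-casing) are skipped; a dinucleotide is counted only when both
    of its bases are unambiguous.
    """
    seq = sequence.upper()
    valid = set("ACGT")

    bases = [b for b in seq if b in valid]
    gc_fraction = (
        sum(1 for b in bases if b in "GC") / len(bases) if bases else 0.0
    )

    pairs = [(x, y) for x, y in zip(seq, seq[1:]) if x in valid and y in valid]
    cpg_fraction = (
        sum(1 for x, y in pairs if x == "C" and y == "G") / len(pairs)
        if pairs else 0.0
    )
    return gc_fraction, cpg_fraction
```

Comparing these two numbers between the overlap set and the whole assembly is one simple way to check that the overlaps are compositionally representative.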

SNP Validation and Estimation of Allele Frequency. The experimental methods and conditions used to assess the candidate SNPs were fully described previously (15).

Modeling Marker Density Distributions. Mismatch distributions describe the likelihood of observing $k$ ($k = 0, 1, 2, \ldots$) polymorphic sites (mismatches) when $n$ sample sequences of a given length, $L$, are compared ($n = 2$ in this study). Traditionally, the opposing effects of meiotic recombination and co-ancestry have been studied under two simple, yet extreme, scenarios (Fig. 1a). A simple (first-order) model that ignores any structure imposed by demographic history and assumes complete independence between the genealogies of neighboring sites because of recombination (infinite recombination model) predicts a Poisson mismatch distribution driven solely by the mutation rate (16). Conversely, a first-order model that accounts for genealogical structure only through static demographic history and ignores recombination (zero-recombination model) predicts a geometric distribution of mutational differences (17).

A detailed demographic history described by the time evolution of effective population size, $N_e$, profoundly affects the distribution of polymorphic sites shared between individuals. In particular, a large increase of the effective population size yields an overabundance of new lineages that increase the likelihood that random sequence pairs will harbor one or more mutational differences (9) (Fig. 1b). Alternatively, a sharp decrease in effective population size raises the likelihood of relatedness between random pairwise DNA samples, resulting in the opposite effect: an overrepresentation of sequence identity (zero difference) as seen in Fig. 1c. Both models represent a second-order changing population size dynamics characterized by different effective population sizes in each of two epochs: an ancestral size $N_2$, followed by a size change to $N_1$, happening $T_1$ generations ago. It is possible to go to higher-order models by increasing the number of epochs in a population history. An example third-order (three-epoch) model is the "bottleneck" dynamics (i.e., a collapse followed by a phase of recent population recovery) depicted in Fig. 1d.

Fig. 1. Marker density distributions predicted under competing population-genetic models (for 10-kb pairwise aligned length, censored at 25 SNPs per alignment). (a) First-order, stationary history. (b) Second-order, expansion history. (c) Second-order, collapse history. (d) Third-order, "bottleneck"-shaped history. r indicates the per nucleotide, per generation recombination rate.

While the density distribution can be computed explicitly for zero-recombination models even with high-order population dynamics (9), no explicit formulas are available for models with realistic levels of recombination. In these cases, we are reliant on numerical simulations that use the coalescent process with recombination (18). Implementing this technique with custom software, we were able to study the counterparts of the previous (zero-recombination) models with realistic levels of recombination (Fig. 1). Models used a uniform genome-averaged mutation rate, $\mu = 2.0 \times 10^{-8}$ per site per generation [obtained as a compromise between prominent estimates (19, 20)] and a uniform genome-averaged recombination rate of $r = 1.0 \times 10^{-9}$ per nucleotide per generation [obtained from recombination frequencies measured across the genome (21)] when appropriate. We note that an alternative estimate of $\mu = 1.0 \times 10^{-8}$, although less conventional, is perhaps more plausible, as it accounts for a larger ancestral anthropoid population size and older separation times (20, 22). The latter rate estimate will yield a human effective size estimate of 20,000 rather than 10,000 (below), and double the time estimates for demographic events. Initial simulations were run with 100,000 replicates per parameter set, and refinements for the best-performing parameter sets were reevaluated with one million replicates.

Table 1. Performance of population genetic models of various complexities for fitting marker density data observed in interindividual BAC overlap fragments

Model structure | Recombination rate (r) | Best-fitting model parameters | Model log likelihood | Improvement because of inclusion of recombination (df = 1) | Improvement because of inclusion of extra epoch (df = 2)
Free recombination | $\infty$ | $N = 9{,}200$ | -13,576.89 | n/a | n/a
One-epoch | 0 | $N_1 = 12{,}000$ | -626.11 | n/a | n/a
One-epoch | $10^{-8}$ | $N_1 = 10{,}300$ | -566.25 | $2\ln\lambda = 119.72$ ($P < 1 \times 10^{-7}$) | n/a
Two-epoch | 0 | $N_2 = 13{,}200$; $T_1 = 200$; $N_1 = 2{,}000$ | -559.75 | n/a | $2\ln\lambda = 132.72$ ($P < 1 \times 10^{-7}$)
Two-epoch | $10^{-8}$ | $N_2 = 11{,}000$; $T_1 = 700$; $N_1 = 4{,}000$ | -466.82 | $2\ln\lambda = 185.86$ ($P < 1 \times 10^{-7}$) | $2\ln\lambda = 198.86$ ($P < 1 \times 10^{-7}$)
Three-epoch | 0 | $N_3 = 9{,}000$; $T_2 = 7{,}000$; $N_2 = 50{,}000$; $T_1 = 7{,}000$; $N_1 = 9{,}000$ | -469.64 | n/a | $2\ln\lambda = 180.22$ ($P < 1 \times 10^{-7}$)
Three-epoch | $10^{-8}$ | $N_3 = 11{,}000$; $T_2 = 400$; $N_2 = 5{,}000$; $T_1 = 1{,}200$; $N_1 = 6{,}000$ | -463.07 | $2\ln\lambda = 13.14$ ($P < 2.89 \times 10^{-4}$) | $2\ln\lambda = 7.5$ ($P < 0.023$)

For each model, we report the population parameter values within a given model structure that produced the best fit to the observations. We also report the corresponding log likelihood, $\ln P(\text{data} \mid \text{model})$. The penultimate column reports the statistical significance of model improvement attributable to the introduction of a genome average recombination rate into our models (adding one extra model parameter). The final column reports the significance of the model improvement attributable to the introduction of an additional epoch (two extra model parameters). Significance of the improvement was evaluated with statistical hypothesis testing for nested model structures (see Methods).

Model Parameterization. A given model is specified by a recombination rate ($r = 0$ or $1.0 \times 10^{-9}$) and a vector of population sizes ($N_i$) and epoch durations ($T_i$) determined by the model's order (for examples of such parameter sets see Table 1). Values of $N_i$ (ranging from 1,000 to 100,000) were sampled in units of 100 for numerical calculations and for one-epoch model simulations, and in units of 1,000 for higher-epoch model simulations. Values of $T_i$ (ranging from 100 to 10,000) were sampled in units of 100 for all cases. Predicted marker density distributions were generated for each length scale analyzed ($L$ = 4, 6, 8, 10, 12, 14, and 16 kb) for each parameter set considered from the multidimensional parameter space defined above.

Model Evaluation. Parameter sets within a fixed model structure were compared by computing, for each competing model, a degree of fit between the observed ($o$) marker density distribution and the probability distribution predicted by each of the models ($m$), using the log likelihood of the data given the model in question. Because observations between different overlap fragment length classes, as well as observations for each of the number of differences, $k$, were independent, this likelihood is described by a multinomial distribution:

\[
P(o \mid m) = \prod_{L} \binom{O_L}{O_{L,0}, \ldots, O_{L,C_L}} \prod_{k=0}^{C_L} m_{L,k}^{\,O_{L,k}},
\]

where $O_L$ is the number of overlap fragments in class $L$, $O_{L,k}$ is the number of fragments with $k$ differences, $C_L$ is the maximum number of differences permitted by the censorship procedure for length $L$, and $m_{L,k}$ is the marker density probability predicted by the model, at length class $L$, for $k$ differences. In evaluating an alternative goodness of fit for a given model, we used the $\chi^2$ metric (see Discussion):

\[
\chi^2 = \sum_{L} \sum_{k=0}^{C_L} \frac{(O_{L,k} - O_L m_{L,k})^2}{O_L m_{L,k}}.
\]

Using either of the above metrics requires the model-predicted probability distributions to be calculated very accurately, especially in mismatch categories with low predicted probabilities. In those cases where the distribution can be calculated only with simulations, accuracy is constrained by the practical upper limit on the number of simulation replicates. To avoid numerical instability, we restricted fit testing to the first $K_L$ categories within each length class such that categories $k = 0, 1, \ldots, K_L$ contain 95% of all fragments for that class.

Model Comparison. The performances of different model structures were compared based on the maximum likelihood parameter estimates for each model structure. Standard tools of normal hypothesis testing could be used (23) when two nested models were compared, by calculating the likelihood ratio, $\lambda$, between the less restricted and more restricted model. The quantity $2\ln(\lambda)$ is expected to be asymptotically $\chi^2$ distributed with degrees of freedom equal to the difference in the number of parameters. Increasing the number of epochs by one adds two parameters to the model (the effective population size, and the duration of the new epoch). Considering recombination adds one extra parameter to the model structure. The less restricted model (the one with more free parameters) was accepted when the $\chi^2$ value yielded $P \le 0.05$ in a one-tailed test.
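The fit measures and the nested-model test described under Model Evaluation and Model Comparison are straightforward to compute once observed counts and predicted probabilities are tabulated. The sketch below is illustrative only: the dictionary layout (length class mapped to a list indexed by $k$) and the function names are assumptions, and the $\chi^2$ tail probability is implemented in closed form only for the two cases that arise here, df = 1 (recombination) and df = 2 (extra epoch).

```python
from math import erfc, exp, lgamma, log, sqrt


def log_likelihood(observed, predicted):
    """Multinomial ln P(o | m), summed over independent length classes.

    observed[L][k]  -- number of fragments in class L with k differences
    predicted[L][k] -- model probability m_{L,k}
    """
    total = 0.0
    for length_class, counts in observed.items():
        probs = predicted[length_class]
        # log of the multinomial coefficient O_L! / prod_k O_{L,k}!
        total += lgamma(sum(counts) + 1) - sum(lgamma(c + 1) for c in counts)
        total += sum(c * log(p) for c, p in zip(counts, probs) if c > 0)
    return total


def chi_square(observed, predicted):
    """Sum over classes and categories of (O_{L,k} - O_L m_{L,k})^2 / (O_L m_{L,k})."""
    total = 0.0
    for length_class, counts in observed.items():
        n_fragments = sum(counts)
        for c, p in zip(counts, predicted[length_class]):
            expected = n_fragments * p
            total += (c - expected) ** 2 / expected
    return total


def likelihood_ratio_test(lnl_restricted, lnl_full, df):
    """Return (2 ln lambda, P) for nested models; only df = 1 or 2 supported."""
    statistic = max(2.0 * (lnl_full - lnl_restricted), 0.0)
    if df == 1:
        p_value = erfc(sqrt(statistic / 2.0))   # chi-square survival, 1 df
    elif df == 2:
        p_value = exp(-statistic / 2.0)         # chi-square survival, 2 df
    else:
        raise ValueError("this sketch handles df = 1 or df = 2 only")
    return statistic, p_value
```

As a check against Table 1, `likelihood_ratio_test(-466.82, -463.07, df=2)` gives a statistic of 7.5, and `likelihood_ratio_test(-469.64, -463.07, df=1)` gives 13.14; the corresponding tail probabilities agree with the tabulated bounds.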

Monte Carlo Testing of Data Fit. To analyze the behavior of our models in the face of increasing amounts of observed data we generated random subsets of the observed data set of given fractions. At each fraction, we generated 1,000 subsets. For each subset, and for a given model, we determined whether the fit between the predicted marker density distribution and the observation subset could be rejected as statistically insignificant by the described $\chi^2$ test. We calculated the proportion of subsets for which the model prediction could not be rejected, and tabulated these proportions for each of the data fractions analyzed.

Results

Data Collection and Assessment. We analyzed 25,901 genomic clones consisting of 7,122 finished and 18,779 draft sequences for which PHRAP (13) base-quality scores were available. We identified 21,020 clone overlaps (Methods) comprising 124,356 overlap fragments (see Methods). The total pairwise length of these overlaps was 1,105 megabases. Using the POLYBAYES SNP discovery tool (12), we detected and submitted to dbSNP (24) 507,152 candidate SNPs (Methods). When the data were restricted to overlaps also present in the genome assembly (Methods), the number of overlaps reduced to 18,074 overlaps (average length of 51.1 ± 35.5 kb) containing 399,067 candidate SNPs. Measures of G+C, CpG, and repeat content in the overlap set were generally equivalent to average genome values, indicating it to be representative of the complete genome assembly (see Supporting Text, which is published as supporting information on the PNAS web site, www.pnas.org). To evaluate the quality of candidate SNPs, we tested for segregation in human populations and evaluated the sequence data for intrinsic error (see supporting information on the PNAS web site for details). Verification experiments show that the computational SNP predictions from the BAC overlap sequences are high quality, and the majority of SNPs are informative in one or more world populations (15).

Estimation of Genomewide Nucleotide Diversity. By comparing the number of SNPs to the total length of pairwise overlaps analyzed, we estimated the overall value of pairwise nucleotide diversity, $\hat{\theta}$, for our complete dataset as $5.047 \times 10^{-4}$ per nucleotide. This value, however, is biased by the inclusion of overlaps derived from the same source chromosome from a single individual. For the remainder of this study, we thus used only clone overlaps derived from interindividual libraries where both clone sequences were of draft quality. There were 3,174 such overlaps (18,152 overlap fragments, total overlap length 144 megabases). The nucleotide diversity value observed in this set ($\hat{\theta} = 7.571 \times 10^{-4}$) is in excellent agreement with the value observed for shotgun reads aligned to the human genome (25). Our value for $\hat{\theta}$ indicates an expectation of one SNP in every 1,321 bp of paired sequence ($n = 2$). This average value, however, must be treated with caution, as the actual number of SNPs in an overlap of a given length is highly variable at all length scales examined (Fig. 2a). Our value for $\hat{\theta}$ corresponds to an effective size estimate of $N_e = 9{,}464$, if the mutation rate is $2.0 \times 10^{-8}$. $N_e$ should be doubled if the mutation rate is $1.0 \times 10^{-8}$, putting it in line with the figure of 17,500 estimated from Alu diversity in the human genome (6).

Fig. 2. Comparison of model predictions to observed marker density data. (a) Marker density distributions observed in the interindividual overlap fragment data (ocher) and corresponding distributions predicted by our overall best-fitting, three-epoch bottleneck model (gray), at each analyzed length. (b) Predictions under the best-performing parameter set for each model structure studied, compared with observed (ocher) data (pairwise overlap length, 10 kb; censorship at 25 SNPs per alignment). r indicates models with recombination.

Of Two Extreme Models, Zero Recombination vs. Full Recombination, Zero Recombination Provides the Closest Fit to Our Data. Based on the multikilobase scale marker density distributions in the overlap data, the "zero-recombination" model provides a clearly superior fit (Table 1) compared with the "infinite recombination" model, demonstrating that the inheritance of markers in close proximity is strongly correlated, and is consistent with extensive linkage disequilibrium observed in humans (26). This finding is an improvement over our previous study which, on the basis of marker density distribution measured in short read fragments aligned to genome sequence, did not carry sufficient power to distinguish between these competing models (25).

Examination of Second-Order Demographic Dynamics (Two-Epoch Model) Shows Population Collapse as the Dominant Effect. Second-order models provided a superior fit compared with all stationary histories tested. The best-fitting parameters describe a severe, 2- to 7-fold, collapse of population size several hundred generations ago (Table 1). This result is consistent whether we used the zero-recombination models or a genome average recombination value. Additionally, within the class of all second-order models tested, models with a realistic recombination value fit significantly better than those that disregarded recombination effects.
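To make the collapse signal concrete, the pairwise ($n = 2$) mismatch distribution under a zero-recombination two-epoch history has a closed form. The sketch below is not the authors' code and ignores censoring; it assumes that two lineages coalesce at rate $1/(2N)$ per generation and that, given a coalescence time $t$, the number of differences over $L$ sites is Poisson with mean $2\mu L t$. The function names are hypothetical.

```python
from math import exp


def _poisson_cdf(k, x):
    """P(Poisson(x) <= k), accumulated term by term to avoid large factorials."""
    term = total = exp(-x)
    for i in range(1, k + 1):
        term *= x / i
        total += term
    return total


def two_epoch_mismatch(k, length, mu, n_recent, n_ancestral, t_change):
    """P(k differences) for two sequences, zero recombination, two epochs.

    n_recent    -- effective size N_1 from the present back to t_change
    n_ancestral -- effective size N_2 before t_change
    t_change    -- T_1, generations since the size change
    """
    c = 2.0 * mu * length                  # mutation rate on the two-lineage tree
    lam1 = 1.0 / (2.0 * n_recent)          # coalescence rate, recent epoch
    lam2 = 1.0 / (2.0 * n_ancestral)       # coalescence rate, ancestral epoch

    # Coalescence inside the recent epoch (0 <= t < T_1)
    recent = (lam1 / (c + lam1)) * (c / (c + lam1)) ** k
    recent *= 1.0 - _poisson_cdf(k, (c + lam1) * t_change)

    # Survival past T_1, then coalescence in the ancestral epoch
    ancient = (lam2 / (c + lam2)) * (c / (c + lam2)) ** k
    ancient *= exp((lam2 - lam1) * t_change) * _poisson_cdf(k, (c + lam2) * t_change)

    return recent + ancient
```

With equal sizes in both epochs the expression reduces to the geometric distribution of the stationary zero-recombination model, with success probability $1/(1 + 4N\mu L)$. Using the zero-recombination two-epoch parameters of Table 1 ($N_2 = 13{,}200$, $T_1 = 200$, $N_1 = 2{,}000$) at $L$ = 10 kb raises the probability of zero differences above that of a stationary population of size $N_2$, which is the overrepresentation of sequence identity expected after a collapse.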

Marth et al. PNAS ͉ January 7, 2003 ͉ vol. 100 ͉ no. 1 ͉ 379 tocene (9). Mismatch distributions from the hypervariable re- gions of the human mitochondrion exhibit a wavelike shape that has been interpreted as the sign of this expansion. However, limitations on the number of loci available for population genetic analysis have restricted a more detailed demographic inference (9). Our third-order analysis indicates that the dominant effect in our data is a collapse ca. 40,000 years ago (1,600 generations), consistent with the timing of the initial appearance of anatom- ically modern humans in Europe. To which population do our results refer? The ethnic composition of the DNA donors of the public human genome is not described, but genotyping of diallelelic, insertion-deletion type polymorphisms mined from the same BAC overlaps (27) suggests that the majority of these sequences represent donors of European origin. Similar patterns resulting in reduction of diversity and extension of linkage disequilibrium in European samples (26, 28–31), and reports of Fig. 3. Observed and predicted pairwise nucleotide diversity values at each long invariable regions in the human genome (32) have been overlap fragment length. Predicted values were based on the best-fitting published. If our results indeed describe European chromo- bottleneck (three-epoch) model. Details of the censoring process (censored) somes, then our estimated time of collapse is in good agreement and correction for censorship (uncensored) are described in the text. with expansion time estimates from mitochondrial mismatch distributions (9). How do we reconcile the signature of a population collapse Third-Order Models Show a Bottleneck History. No third-order seen in our data with the obvious recent explosive increase in model that disregarded recombination could produce a fit population size? 
The recovery visible in our data is a very modest superior to that of the best-fitting second-order (collapse) increase of effective population size during the Upper Paleolithic model with recombination (Table 1). However, the third-order (Table 1). This finding suggests that recent population growth is models with recombination did produce an improved fit (see not yet deeply imprinted in the nuclear marker density distri- Fig. 2a) with the best-fitting parameters representing a ‘‘bot- bution, presumably because of the low average nuclear mutation tleneck’’-shaped population history. A visual representation of rate. the best-performing parameter sets within each model class is shown in Fig. 2b. These parameter values, together with the Although our best, three-epoch, model produces a visually convincing fit to the observed data (Fig. 2a), application of a quantitative description of the fit values are given in Table 1. ␹2 The third-order model structure with bottleneck parameters general test reveals that the fit can be rejected at the 5% (or (Table 1) is our best description of the population history even at the 1%) level, and the same is true for each of the other imprinted in the BAC overlap variation data. While all sets model structures (data not shown). Does this mean that we have were qualitatively similar, the best-fitting parameter combi- to discard these models as inadequate descriptions of the ob- nation was slightly different for each overlap fragment length served data? The models are cartoon-like, and the marker (data not shown). The overall optimum thus represents a density observed in the BAC overlap data was shaped by many compromise among the best-fitting parameter sets. We com- unconsidered effects. If our models are not perfect, it is natural pared the predicted censored nucleotide diversity values pre- dicted by the optimal model to the observed values at each length scale analyzed (Fig. 3). 
The fit is better at shorter sequence length than at longer lengths, as the majority of data available at shorter lengths were weighted heavier during the determination of a global optimum.

Unbiased Estimates of Genomewide Nucleotide Diversity. The direct measurement of pairwise nucleotide diversity is confounded with the effects of the censorship procedure (Methods). Using our best-fitting model, we projected the shape of the maker density curve beyond the censorship limit, and estimated the unbiased value of pairwise nucleotide diversity intrinsic to the overlap dataset as ˆ␪ ϭ 8.25 ϫ 10Ϫ4 per site per generation or one substitution-like polymorphism per 1,212 nucleotides. Assuming a genome average mutation rate of ␮ ϭ 2.0 ϫ 10Ϫ8 or 1.0 ϫ 10Ϫ8 per site per generation, the corresponding long-term ϭ effective population size is Ne 10,313 or 20,626, respectively. Discussion The dataset considered here, by virtue of its global nature, is expected to be robust against selection-induced distortions at individual loci and serves as a proper reagent to test theory describing the distribution of the number of mismatches in pairwise comparisons observed in a large number of different Fig. 4. Model assessment based on the amount of data required for rejection genomic regions. by the general ␹2 test. For each model, at each data fraction, we have plotted Evidence from both archeological and genetic sources sug- the percentage of successful trials (random data subsets for which the model gests that modern human populations are the product of an cannot be rejected by the ␹2 test at the 5% level). r indicates models with episode of explosive population growth beginning in the Pleis- recombination.

380 | www.pnas.org/cgi/doi/10.1073/pnas.222673099 Marth et al.

to ask how well they perform in an "absolute" sense, instead of relative terms, compared with each other. In all cases, model rejection is based on statistical significance, which in turn is always defined in the context of the test data at hand. Therefore, it is possible that a model that could not be rejected on the basis of a given dataset later proves inaccurate (rejected) when additional testing data become available. This consideration provides an alternative way to evaluate model accuracy, by examining how much data are necessary for the rejection of each of the competing models. The better the model, presumably, the larger the dataset that is required for its rejection as inaccurate. Accordingly, we have performed a computational experiment to examine how a measure of data fit (χ²) between our observations and best-fitting models decays as more and more of the original data is considered. Results are shown in Fig. 4. Our best-fitting model (three-epoch history with uniform recombination) can be rejected only 50% of the time when subsets containing at least 15% (≈2,300 overlap fragments) of the original observed data are considered. We anticipate that evaluations of this sort will become increasingly useful in the analysis of genome-scale data (over-powered experiments) where numerical models will fail traditional significance tests when genome-sized datasets are considered.

What can we do to improve our models? We know that mutation rates are not uniform in the nuclear genome (19). There is also evidence for recombination hotspots (33). The existence of hotspots implies that, at least to some degree, recombination favors certain regions of the genome, a departure from the uniform distribution that we have assumed in our models. We also know that population history is far more complex than we can capture in our cartoon-like models invoking instantaneous stepwise changes of effective size. The same history may or may not be true for all chromosomal regions within the genome. There is also a large corpus of literature discussing nonneutral effects such as selective sweeps (34). It is desirable to refine our models by considering these effects. To confirm the generality of our results it will be necessary to evaluate similar data from non-European samples, analyze other characteristic distributions of SNPs such as the allele frequency spectrum, and contrast our observations to data collected in molecular systems with alternative mutational mechanisms such as diallelic insertions/deletions, short tandem repeat polymorphisms (STRPs), and mitochondrial polymorphisms.

The amount of heterogeneity observed in the BAC overlaps should be a warning that average genome measures of nucleotide diversity should be used with caution. On the other hand, our computational experiments demonstrate that even relatively simple models of random drift are adequate to predict the range of variability in our data, suggesting that drift is an important (if not the most important) component of the resultant of forces that shape the regional distribution of human variability. It is striking, for example, that purely neutral forces can account for the fact that ≈10% of our 16-kb overlaps did not contain a single sequence variant (Fig. 2a). Observations such as this will require us to rethink our expectations when evaluating variation structure and its possible significance in specific genomic loci. The mature, fully annotated human reference sequence, together with an increase in well-characterized SNP markers, should afford us a high-resolution view to provide context for interpreting regional variation data, improving existing models of population history, and resolving the selective forces of genome evolution.

We thank James Weber for providing detailed recombination frequency data, Matt Minton and Rachel Donaldson for technical assistance, Stephen Altschul, Alexey Kondrashov, Raymond Miller, Ravi Sachidanandan, and John Spouge for useful discussion, and Andrew Clark for many useful comments on the manuscript. The work was supported in part by National Human Genome Research Institute Grant HG01720 (to P.-Y.K.).
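The subsampling experiment described in the discussion above (Fig. 4) can be illustrated with a toy sketch. It is not the authors' procedure: the categories, the toy data, the function names, and the hard-coded critical value are all assumptions made for illustration. The idea is to draw random subsets of increasing size and record how often a model survives a χ² goodness-of-fit test at the 5% level.

```python
import random
from collections import Counter

# 5% critical value of the chi-square distribution with 3 degrees of freedom
# (4 categories minus 1). Hard-coded so the sketch needs no external library.
CHI2_CRIT_5PCT_DF3 = 7.815


def chi_square_stat(observed_counts, model_probs):
    """Pearson chi-square statistic of observed counts against model probabilities."""
    n = sum(observed_counts)
    return sum((obs - n * p) ** 2 / (n * p)
               for obs, p in zip(observed_counts, model_probs))


def acceptance_rate(data, model_probs, fraction, trials=200, seed=0):
    """Fraction of random subsets (without replacement) that do NOT reject the model."""
    rng = random.Random(seed)
    k = max(1, int(fraction * len(data)))
    accepted = 0
    for _ in range(trials):
        counts = Counter(rng.sample(data, k))
        observed = [counts.get(c, 0) for c in range(len(model_probs))]
        if chi_square_stat(observed, model_probs) <= CHI2_CRIT_5PCT_DF3:
            accepted += 1
    return accepted / trials


# Toy data: category = hypothetical number of variants per fragment (0, 1, 2, 3+).
data = [0] * 2000 + [1] * 1500 + [2] * 1000 + [3] * 500
slightly_off_model = [0.42, 0.29, 0.19, 0.10]  # close to, but not equal to, the data

for fraction in (0.01, 0.05, 0.25, 1.0):
    rate = acceptance_rate(data, slightly_off_model, fraction)
    print(f"fraction {fraction:>4}: model not rejected in {rate:.0%} of trials")
```

A model that is only slightly wrong tends to survive on small subsets and fail on large ones, which is the behavior the authors use to rank competing models.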

1. Clifford, R., Edmonson, M., Hu, Y., Nguyen, C., Scherpbier, T. & Buetow, K. H. (2000) Genome Res. 10, 1259–1265.
2. Irizarry, K., Kustanovich, V., Li, C., Brown, N., Nelson, S., Wong, W. & Lee, C. J. (2000) Nat. Genet. 26, 233–236.
3. Altshuler, D., Pollara, V. J., Cowles, C. R., Van Etten, W. J., Baldwin, J., Linton, L. & Lander, E. S. (2000) Nature 407, 513–516.
4. Mullikin, J. C., Hunt, S. E., Cole, C. G., Mortimore, B. J., Rice, C. M., Burton, J., Matthews, L. H., Pavitt, R., Plumb, R. W., Sims, S. K., et al. (2000) Nature 407, 516–520.
5. Taillon-Miller, P., Gu, Z., Li, Q., Hillier, L. & Kwok, P. Y. (1998) Genome Res. 8, 748–754.
6. Sherry, S. T., Harpending, H. C., Batzer, M. A. & Stoneking, M. (1997) Genetics 147, 1977–1982.
7. Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., Shaw, N., Lane, C. R., Lim, E. P., Kalyanaraman, N., et al. (1999) Nat. Genet. 22, 231–238.
8. Sunyaev, S. R., Lathe, W. C., III, Ramensky, V. E. & Bork, P. (2000) Trends Genet. 16, 335–337.
9. Harpending, H. & Rogers, A. (2000) Annu. Rev. Genomics Hum. Genet. 1, 361–385.
10. Hudson, R. R. (2002) Bioinformatics 18, 337–338.
11. Zhang, Z., Schwartz, S., Wagner, L. & Miller, W. (2000) J. Comput. Biol. 7, 203–214.
12. Marth, G. T., Korf, I., Yandell, M. D., Yeh, R. T., Gu, Z., Zakeri, H., Stitziel, N. O., Hillier, L., Kwok, P. Y. & Gish, W. R. (1999) Nat. Genet. 23, 452–456.
13. Gordon, D., Abajian, C. & Green, P. (1998) Genome Res. 8, 195–202.
14. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Nature 409, 860–921.
15. Marth, G., Yeh, R., Minton, M., Donaldson, R., Li, Q., Duan, S., Davenport, R., Miller, R. D. & Kwok, P. Y. (2001) Nat. Genet. 27, 371–372.
16. Kimura, M. (1968) Nature 217, 624–626.
17. Watterson, G. A. (1975) Theor. Popul. Biol. 7, 256–276.
18. Hudson, R. R. (1990) in Oxford Surveys in Evolutionary Biology, eds. Futuyama, D. J. & Antonovics, J. (Oxford Univ. Press, Oxford), Vol. 7, pp. 1–44.
19. Kondrashov, A. S. (2003) Hum. Mutat., in press.
20. Nachman, M. W. & Crowell, S. L. (2000) Genetics 156, 297–304.
21. Yu, A., Zhao, C., Fan, Y., Jang, W., Mungall, A. J., Deloukas, P., Olsen, A., Doggett, N. A., Ghebranious, N., Broman, K. W., et al. (2001) Nature 409, 951–953.
22. Brunet, M., Guy, F., Pilbeam, D., Mackaye, H. T., Likius, A., Ahounta, D., Beauvilain, A., Blondel, C., Bocherens, H., Boisserie, J. R., et al. (2002) Nature 418, 145–151.
23. Ott, J. (1991) Analysis of Human Genetic Linkage (John Hopkins Univ. Press, Baltimore), 2nd Ed.
24. Sherry, S. T., Ward, M. & Sirotkin, K. (1999) Genome Res. 9, 677–679.
25. Sachidanandam, R., Weissman, D., Schmidt, S. C., Kakol, J. M., Stein, L. D., Marth, G., Sherry, S., Mullikin, J. C., Mortimore, B. J., Willey, D. L., et al. (2001) Nature 409, 928–933.
26. Reich, D. E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P. C., Richter, D. J., Lavery, T., Kouyoumjian, R., Farhadian, S. F., Ward, R., et al. (2001) Nature 411, 199–204.
27. Weber, J. L., David, D., Heil, J., Fan, Y., Zhao, C. & Marth, G. T. (2002) Am. J. Hum. Genet. 71, 854–862.
28. Kimmel, M., Chakraborty, R., King, J. P., Bamshad, M., Watkins, W. S. & Jorde, L. B. (1998) Genetics 148, 1921–1930.
29. Pereira, L., Dupanloup, I., Rosser, Z. H., Jobling, M. A. & Barbujani, G. (2001) Mol. Biol. Evol. 18, 1259–1271.
30. Goldstein, D. B. & Weale, M. E. (2001) Curr. Biol. 11, R576–R579.
31. Gabriel, S. B., Schaffner, S. F., Nguyen, H., Moore, J. M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., et al. (2002) Science 296, 2225–2229.
32. Miller, R. D., Taillon-Miller, P. & Kwok, P. Y. (2001) Genomics 71, 78–88.
33. Jeffreys, A. J., Kauppi, L. & Neumann, R. (2001) Nat. Genet. 29, 217–222.
34. Nachman, M. W. (2001) Trends Genet. 17, 481–485.

Marth et al. PNAS | January 7, 2003 | vol. 100 | no. 1 | 381

BMC Genomics
BioMed Central

Research article (Open Access)
STRP Screening Sets for the human genome at 5 cM density
Nader Ghebranious4, David Vaske2, Adong Yu1, Chengfeng Zhao1, Gabor Marth3 and James L Weber*1

Address: 1 Center for Medical Genetics, Marshfield Clinic Research Foundation, Marshfield, WI 54449, USA; 2 Pioneer Hi-Bred International, Johnston, IA USA; 3 National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD USA; and 4 Molecular Diagnostic Genotyping Laboratory, Marshfield Clinic Research Foundation, Marshfield, WI 54449, USA
* Corresponding author: James L Weber

Published: 24 February 2003
Received: 10 December 2002
Accepted: 24 February 2003
BMC Genomics 2003, 4:6
This article is available from: http://www.biomedcentral.com/1471-2164/4/6
© 2003 Ghebranious et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.

Abstract
Background: Short tandem repeat polymorphisms (STRPs) are powerful tools for gene mapping and other applications. A STRP genome scan at 10 cM density is usually adequate for mapping single gene disorders. However, mapping studies involving genetically complex disorders, and especially association (linkage disequilibrium) studies, often require higher STRP density.
Results: We report the development of two separate 10 cM human STRP Screening Sets (Sets 12 and 52) which span all chromosomes. When combined, the two Sets contain a total of 782 STRPs, with average STRP spacing of 4.8 cM, average heterozygosity of 0.72, and total sex-average coverage of 3535 cM. The current Sets consist almost entirely of STRPs based on tri- and tetranucleotide repeats. We also report correction of primer sequences for many STRPs used in previous Screening Sets. Detailed information for the new Screening Sets is available from our web site: http://research.marshfieldclinic.org/genetics.
Conclusion: Our new human STRP Screening Sets will improve the quality and cost effectiveness of genotyping for gene mapping and other applications.

Background
Since their discovery in 1988, multiallelic short tandem repeat polymorphisms (STRPs) (also called microsatellites or simple sequence length polymorphisms (SSLPs)) have been the polymorphisms of choice for linkage mapping and many other genetic studies.

Although there are hundreds of thousands of reasonably informative STRPs in the human genome [1,2], only a small fraction are optimal for genotyping and genome scans. Optimal properties of an STRP include: high heterozygosity, strong and specific PCR amplification, capability to be amplified simultaneously with other STRPs (multiplexed), sharp bands on gels, easy and accurate scoring of allele sizes, relatively low mutation rate, and appropriate position along the genetic map.

We have performed human genome polymorphism scans in our lab since 1989 [3]. Our first human Screening Set of STRPs developed in 1992 had an average STRP spacing of ~20 cM, no sex chromosome STRPs, and consisted almost entirely of dinucleotide repeat STRPs identified at Marshfield. Each subsequent Screening Set from our lab improved on the previous version by adding STRPs, by using more accurate genetic maps to make STRP spacing more uniform and to eliminate large gaps, and especially


by replacing relatively low quality STRPs with superior ones. Typing better STRPs leads to higher data quality through fewer missing genotypes and fewer incorrect allele calls. Typing optimal STRPs also leads to lower genotyping costs by providing more information, by reducing the need for duplicate genotyping, by permitting the use of shorter gels (with lower resolving power but shorter run times), and by increasing the efficiency of allele calling.

We have replaced nearly all of the dinucleotide repeat STRPs in our Screening Sets with tri- and tetranucleotide STRPs. Although dinucleotide STRPs are abundant and meet many of the criteria for optimal STRPs, they are also in our hands more difficult to score accurately because of substantial strand slippage during PCR [4]. We also find that dinucleotide STRPs are more difficult to PCR multiplex than tri- or tetranucleotide STRPs.

Similarly, we have eliminated nearly all of the STRPs with frequent (> 2%) "non-integer" alleles. Non-integer alleles are defined as alleles whose length differences from the most frequent alleles are not integer multiples of the repeat length. For example, an allele of 221 bp (PCR product length) would be a non-integer allele for a tetranucleotide STRP with frequent alleles of 230, 226, 222, and 214 bp. Non-integer alleles are not typing artifacts, as they have been observed in many labs and have been confirmed by sequencing of individual alleles [5] (http://www.cstl.nist.gov/biotech/strbase). Non-integer alleles probably exist somewhere in the human population for all or nearly all STRPs, but a significant fraction of STRPs do not have frequent non-integer alleles.

We have also excluded or repaired STRPs with weak or null alleles. In at least most cases, weak and null alleles appear to be due to substitution polymorphisms within the primer annealing sites [6]. They can be repaired by sliding the offending PCR primer to a nearby position along the chromosome.

For most applications of genome polymorphism scans, higher STRP densities are preferable. This is particularly important for gene mapping by association. While analysts have predicted that very high polymorphism densities will be required for association mapping in mixed or outbred populations [see for example reference [7]], promising results have been obtained using genome scans of 600–1200 STRPs in isolated populations where levels of linkage disequilibrium are particularly high [8–10]. In this manuscript we describe the development of two new 10 cM human STRP Screening Sets (Sets 12 and 52) which when combined provide average STRP spacing of 4.8 cM.

Results
Building new human Screening Sets
Over about the last decade we have produced at Marshfield twelve separate, but related 10 cM Screening Sets of STRPs for the human genome (see Table 1 and http://research.marshfieldclinic.org/genetics). For each of these Sets, the lowest quality STRPs in the previous Set were replaced with superior ones. Of the most recent collections, Sets 6, 7, 10, and 11 were major overhauls, with 21–52% of the STRPs replaced (Table 1). Sets 5 and 8 were described in the literature [11,12]. Beginning particularly with Set 6, many of the dinucleotide STRPs were replaced with tri- and tetranucleotide STRPs from the Cooperative Human Linkage Center (CHLC) [13]. CHLC STRPs still comprise 81% and 55% of our current Sets (Sets 12 and 52, respectively). Starting in about 2001, the availability of the human genomic draft sequence greatly expanded the number of STRPs from which to choose. Sets 12 and 52 contain 15% and 44%, respectively, newly derived STRPs from the genomic sequence.
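The definition of a non-integer allele given above lends itself to a mechanical check. The sketch below is only an illustration: the function name is hypothetical, and it assumes that an allele counts as non-integer when its length differs from every frequent allele by something other than a whole number of repeat units.

```python
def is_non_integer_allele(allele_bp, frequent_alleles_bp, repeat_len):
    """True if allele_bp is out of phase with every frequent allele.

    Lengths are PCR product lengths in bp; repeat_len is the repeat unit
    length (4 for a tetranucleotide STRP).
    """
    return all((allele_bp - ref) % repeat_len != 0 for ref in frequent_alleles_bp)


# The example from the text: frequent alleles of a tetranucleotide STRP.
frequent = [230, 226, 222, 214]
print(is_non_integer_allele(221, frequent, 4))  # 221 bp is off by one base
print(is_non_integer_allele(218, frequent, 4))  # 218 bp is in phase with the others
```

In practice measured gel sizes carry error, so a real implementation would need a tolerance; that refinement is omitted here.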

Table 1: History of Marshfield 10 cM STRP Screening Sets

Set   Year   Number of STRPs   Number of Dinucleotide STRPs (Fraction)   Number of STRPs Shared with Previous Set (Fraction)
1     1992   231               211 (0.91)
2     1993   366               347 (0.95)                                131 (0.36)
3     1993   319               226 (0.71)                                176 (0.55)
4     1994   347               243 (0.70)                                274 (0.79)
5     1994   363               191 (0.53)                                265 (0.73)
6     1995   391               55 (0.14)                                 186 (0.48)
7     1995   390               47 (0.12)                                 297 (0.76)
8     1996   387               43 (0.11)                                 377 (0.97)
9     1997   387               44 (0.11)                                 378 (0.98)
10    1999   405               49 (0.12)                                 313 (0.77)
11    2001   410               2 (0.01)                                  324 (0.79)
12    2002   408               3 (0.01)                                  405 (0.99)


We began the construction of Sets 12 and 52 with specific goals. For Set 12 (and the very similar Set 11), we intended to replace all or nearly all dinucleotide repeat polymorphisms and to eliminate other problematic STRPs such as those with frequent non-integer alleles. For Set 52 (and the preceding Set 51), we aimed to identify one or two high quality tri- or tetranucleotide STRPs at approximately the midpoint between each pair of STRPs within Set 12. Set 52 is therefore the second in a new series of 10 cM Screening Sets. For both Sets, we needed to identify new, high quality STRPs within specific, relatively small (~1 mb) chromosomal segments.

Altogether, we screened 2262 STRPs for possible inclusion within the Screening Sets. Of these, 1103 were STRPs developed within the CHLC or at Utah [14]. The remaining 1159 were developed from human genomic sequences. Most of these (961) were identified by searching for tri- or tetranucleotide STRs with ≥ 7 or 8 uninterrupted repeats within the sequence assembly available from the University of California – Santa Cruz web site, December 2000 version http://genome.cse.ucsc.edu. Others (198) were identified by examining a collection of ~1 gb of overlapping BAC sequences [15] for the presence of variable tri- and tetranucleotide STRs. We focused efforts on AAT and AGAT repeats because these sequences are known to be abundant and to yield useful polymorphisms [13,16].

New PCR primers selected from the sequences flanking the tandem repeats were tested by amplification with ten individual DNA samples and one DNA pool using incorporation of a nucleotide tagged with a fluorescent dye (see Methods). PCR primers labelled with a fluorescent dye at the 5' end were then synthesized for those STRPs which displayed ≥ 4 alleles in the first screen. These were combined with existing CHLC and Utah STRPs, and were used to screen 12 individuals and one pool. All donors of DNA samples used in these first two screens had Northern European ancestry. STRPs which passed these first two hurdles were then used in genome scans within the Mammalian Genotyping Service (see Marshfield web site) using hundreds of DNA samples from various geographical locations.

Only 11% of the 2262 genomic STR sequences that were screened were included within Sets 12 and 52. The great majority of excluded STRPs were rejected because of limited numbers of alleles (low informativeness). About 9% were rejected because of the presence of frequent non-integer alleles. We found that use of candidate genomic sequences with larger numbers of uninterrupted tandem repeats and use of overlapping BAC sequences with alleles which differed by two or more repeats led to higher rates of STRP inclusion into the Screening Sets. Information on all of the STRPs that we found to be polymorphic can be obtained from the comprehensive list of indel polymorphisms on the Marshfield web site.

We also improved the amplification efficiency of Screening Set STRPs. Most of the human STRPs developed in the early and mid 90s were based on relatively crude, single-pass sequencing of genomic DNA subclones. Comparison of the PCR primer sequences for Set 10 STRPs with the new public genomic sequences revealed that a surprisingly high 25% of the STRPs had mismatches in at least one of the primers (an example is shown in Figure 1). Nearly all of the mismatches were near the middle or 5' ends of the primers. New primers designed using the public genomic sequences were then tested side by side with old primers. At 55°C annealing temperature and no PCR multiplexing, few differences were observed between the old and new primer pairs, but under more stringent conditions (60°C annealing temperatures), 79 STRPs were found to amplify better with the new primer pairs (two examples are shown in Figure 2).

STRPs alleles are usually identified and labelled as the length of the PCR product as measured on denaturing polyacrylamide gels. Only in a handful of cases have the full spectrum of STRP alleles been sequenced. Therefore, STRP alleles are referenced to allele sizes for standard DNA templates (we use the parents of CEPH family 1331 available from the NIGMS Human Genetic Cell Repository). Allele sizes will also, of course, often change if the PCR primer sequences for a polymorphism are altered. To avoid null and weak alleles, to prevent the formation of doublet bands during PCR [17,18], and to achieve optimal PCR product length, we have modified original primer sequences for a substantial fraction of our Screening Set STRPs. We have used several different letters following the STRP name to indicate changes in PCR primers (see Marshfield web site). As two examples for STRPs on chromosome 1 in Set 12: GATA26G09N indicates that one of the original primers for GATA26G09 was changed to correct a sequencing error without change in allele sizes, and GGAA3A07Z indicates that one of the primers for GGAA3A07 was shifted along the chromosome resulting in different allele sizes. Current PCR primer sequences for all Screening Set STRPs are listed on the Marshfield web site along with allele sizes for individuals 133101 and 133102.

Genetic Map Positions
Initially, new STRPs were selected and incorporated into our Screening Sets based on physical distances obtained from the December 2000 UC-Santa Cruz draft sequence assembly. However, we soon found that the draft assembly contained many errors [see for example reference [19]] and resulted therefore in many STRPs being in the wrong map positions. To correct these mistakes, we utilized the


Figure 1 Correction of PCR Primer Sequences using Genomic Sequence Assemblies. The original single pass sequence for GATA87E02 is aligned with the sequences from several BACs containing overlapping genomic DNA. The original reverse PCR primer mismatched the BAC sequences near its 3' end. Note that because the great majority of the public human genomic sequence was generated from BAC libraries prepared from just a few donors, it is possible that two or even all three of the BAC sequences shown in the figure came from the same chromosome.

Figure 2 Comparison of PCR Amplification using Original and Corrected PCR Primer Sequences. Shown are electrophoretic separations of DNA fragments from unrelated individuals amplified at 60°C annealing temperature. Fragments obtained with the corrected PCR primer pairs are indicated by the N suffixes after the STRP names.


most recent (June 2002) sequence assembly in addition to linkage analysis using three large Sets of families. In all cases except one (4ptel04), the linkage results matched the June 2002 assembly in terms of STRP order (we assumed the linkage results were correct for 4ptel04). Our confidence in STRP order is therefore high.

With one exception on chromosome 6p (see below), genetic map positions for the Screening Set STRPs were taken from the most recent Marshfield map [20] or by interpolation using the Marshfield map and the genetic and physical map positions described in the previous paragraph. Although the new Iceland genetic map [19] is higher resolution than the Marshfield map, a large fraction (62%) of the Screening Set 12 and 52 STRPs were not typed in the Iceland families. We did, however, check STRP order for all STRPs that were typed in the Iceland families and found no disagreements with the Marshfield map, except for two close (~1 mb apart), adjacent STRPs on chromosome 6p, ATA50C05 and ATC4D09, where the linkage results, the Iceland map and the June 2002 sequence assembly all disagreed with the Marshfield map.

Characterization of Sets 12 and 52
Numbers of STRPs, heterozygosity values, and sex-average genetic map properties for Screening Sets 12, 52, and 12 plus 52 combined, broken down by chromosome, are displayed in Table 2. Of the 39 total X chromosome STRPs in the combined Sets, 3 (GATA2A12, GGAT3F08, and GATA42G01) are in the pter pseudoautosomal region, and 1 (SDF1) is in the qter pseudoautosomal region. The 9 Y chromosome STRPs are all male-specific. Also, two small, tightly-spaced clusters of STRPs are included in Set 12 (six STRPs near the centromere of chromosome 11 and three STRPs on the short arm of chromosome 1) for the purpose of gauging linkage disequilibrium.
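The text above says only that some genetic map positions were obtained "by interpolation" between mapped markers. One common way to do this, assumed here purely for illustration, is linear interpolation of the genetic position (cM) from the physical position (Mb) of the two flanking mapped markers. The function name and the example coordinates are hypothetical.

```python
def interpolate_cm(pos_mb, left, right):
    """Linearly interpolate a genetic position (cM) from a physical position (Mb).

    left and right are (physical_mb, genetic_cm) pairs for the flanking
    markers that already have positions on the genetic map.
    """
    (left_mb, left_cm), (right_mb, right_cm) = left, right
    if left_mb == right_mb or not (left_mb <= pos_mb <= right_mb):
        raise ValueError("position must lie between two distinct flanking markers")
    fraction = (pos_mb - left_mb) / (right_mb - left_mb)
    return left_cm + fraction * (right_cm - left_cm)


# Hypothetical marker halfway between flanking markers at 10 Mb / 20 cM
# and 20 Mb / 30 cM.
print(interpolate_cm(15.0, (10.0, 20.0), (20.0, 30.0)))
```

Linear interpolation treats recombination as uniform between the flanking markers, which is a simplification given the rate variation discussed elsewhere in the article.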

Table 2: General Properties of Sets 12 and 52 by Chromosome.

        Numbers of STRPs       Average Heterozygosity   Total Distance Covered (cM)   Average Spacing (cM)
Chr.    Set 12  Set 52  Both   Set 12  Set 52  Both     Set 12   Set 52   Both        Set 12  Set 52  Both
1       31      36      67     0.76    0.68    0.72     274.6    261.1    282.9       9.2     7.5     4.3
2       28      28      56     0.78    0.66    0.72     266.2    215.7    266.2       9.9     8.0     4.8
3       25      24      49     0.75    0.73    0.74     222.5    207.1    222.5       9.3     9.0     4.6
4       23      16      39     0.76    0.69    0.73     208.0    198.6    208.0       9.5     13.2    5.5
5       20      25      45     0.78    0.66    0.71     196.6    190.3    197.5       10.3    7.9     4.5
6       23      18      41     0.75    0.69    0.72     192.4    188.0    192.4       8.7     11.1    4.8
7       21      20      41     0.76    0.68    0.72     178.6    157.3    178.6       8.9     8.3     4.5
8       19      12      31     0.74    0.67    0.72     159.7    141.5    159.7       8.9     12.9    5.3
9       18      17      35     0.74    0.67    0.71     158.2    151.8    158.2       9.3     9.5     4.7
10      20      18      38     0.75    0.66    0.71     163.1    152.3    163.1       8.6     9.0     4.4
11      20      12      32     0.78    0.65    0.73     145.5    127.7    145.5       7.7     11.6    4.7
12      17      19      36     0.79    0.72    0.75     161.5    151.4    165.8       10.1    8.4     4.7
13      12      12      24     0.75    0.69    0.72     98.9     94.3     102.0       9.0     8.6     4.4
14      14      14      28     0.75    0.69    0.72     122.5    109.4    122.5       9.4     8.4     4.5
15      13      9       22     0.76    0.71    0.74     114.8    81.5     114.8       9.6     10.2    5.5
16      15      13      28     0.75    0.67    0.71     127.7    114.5    127.7       9.1     9.5     4.7
17      13      13      26     0.76    0.65    0.71     118.5    103.6    118.5       9.9     8.6     4.7
18      13      14      27     0.78    0.69    0.74     113.2    121.0    121.0       9.4     9.3     4.7
19      10      8       18     0.77    0.65    0.72     91.2     86.6     91.2        10.1    12.4    5.4
20      11      12      23     0.78    0.69    0.73     98.4     85.1     98.4        9.8     7.7     4.5
21      6       5       11     0.82    0.62    0.73     44.8     49.6     54.8        9.0     12.4    5.5
22      8       9       17     0.75    0.71    0.73     59.4     44.6     59.4        8.5     5.6     3.7
X       22      17      39     0.70    0.62    0.66     184.0    149.1    184.0       8.8     9.3     4.8
Y       6       3       9      0.66    0.41    0.58

        Total STRPs            Average Heterozygosity   Total Coverage (cM)           Average Spacing (cM)
Total   408     374     782    0.76    0.67    0.72     3500     3182     3535        9.3     9.5     4.8

Genetic distances for the autosomes are sex-average, and for the X chromosome are female (except for pseudoautosomal regions).
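The per-chromosome "Average Spacing" figures in Table 2 appear to be consistent with dividing the distance covered by the number of intervals between STRPs (one fewer than the number of STRPs). The article does not state this formula, so the sketch below is an inference checked against a few rows of the table; the function name is hypothetical.

```python
def average_spacing_cm(total_distance_cm, n_strps):
    """Average gap between adjacent STRPs: distance covered / number of intervals."""
    if n_strps < 2:
        raise ValueError("need at least two STRPs to define a spacing")
    return total_distance_cm / (n_strps - 1)


# (distance covered in cM, number of STRPs, spacing reported in Table 2)
table2_rows = {
    "chr 1, Set 12": (274.6, 31, 9.2),
    "chr 4, Set 52": (198.6, 16, 13.2),
    "chr 22, both Sets": (59.4, 17, 3.7),
}

for label, (distance, n, reported) in table2_rows.items():
    computed = round(average_spacing_cm(distance, n), 1)
    print(f"{label}: computed {computed} cM, reported {reported} cM")
```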


Table 3: Breakdown of Screening Set STRPs by Repeat Length

            Total STRPs   Dinucleotide   Trinucleotide   Tetranucleotide   Pentanucleotide
Set 12      408           3              82              318               5
Set 52      374           0              98              267               9
Both Sets   782           3              180             585               14

Table 4: Breakdown of Screening Set STRPs by Repeat Sequence.

            Total STRPs   AAAT   AAGG   AAT   AATG   AGAT   Other
Set 12      408           13     27     78    4      265    21
Set 52      374           22     11     85    6      218    32
Both Sets   782           35     38     163   10     483    53

Repeat sequences are listed in their alphabetically minimal forms.

Set 12 STRPs with overall average heterozygosity of 76% are more informative than Set 52 STRPs with overall average heterozygosity of 67%. At least part of this difference may simply be a reflection of the populations used to deduce these values (see Methods). As shown in Table 2, X and especially Y chromosome STRPs had lower average informativeness than autosomal STRPs.

The average, sex-average STRP spacing of the combined Sets was 4.8 cM. The maximum gaps are 18.4, 37.5, and 15.5 cM for Set 12, Set 52 and Sets 12 and 52 combined, respectively. There were 13 gaps ≥ 15 cM in Set 12, 51 such gaps in Set 52, and 29 gaps ≥ 10 cM in Sets 12 and 52 combined. Set 12 STRPs were generally closer to telomeres than Set 52 STRPs, resulting in greater total chromosomal coverage.

A summary of repeat length in the Screening Set STRPs is presented in Table 3. Only 3 dinucleotide STRPs remain in Set 12. Fourteen pentanucleotide STRPs were also included in the combined Sets.

Breakdown of the Screening Set STRPs by repeat type is shown in Table 4. STRPs with AGAT and AAT repeats together accounted for 83% of the STRPs in the combined Sets. Note that because of permutation and the complementary strand there are several names for each repeat type. As just one example, AGAT repeats can also be presented as GATA, ATAG, TAGA, ATCT, TATC, CTAT, and TCTA repeats. Following the suggestion of Jin et al. [21] we have chosen the alphabetically minimal name.

We found that AAT repeats in particular, have a relatively low level of non-integer alleles. For example, within Set 10, 11.1% of GGAA and 10.2% of AGAT STRPs had frequent non-integer alleles, compared to only 1.8% of AAT STRPs. Because of high rates of non-integer alleles, STRPs with purines on one strand and pyrimidines on the other (eg AAGG) were avoided even though they are reasonably abundant and often especially informative [13].

Association of Screening Set STRPs with interspersed repeat elements (IREs) is shown in Table 5. STRPs were considered to be associated with IREs if the IRE fell in the 50 bp flanking the STR on either side (total of 100 bp of flanking sequence). Although total numbers for some of the STR types are relatively small, it appears that each type of STR has its own particular signature of IRE association. For example, AAAT STRs are very often (86%) associated with Alu elements, consistent with the hypothesis that most of these repeats evolved from the polyA tail of Alus [22]. An unexpectedly large fraction of AGAT STRs (16%) were found to be associated with LTRs. The results in Table 5 may generally provide clues about the evolution of STRs.

Discussion
Development of human STRP Screening Sets has paralleled advances in construction of genetic and physical maps. Except in regions with long inversion polymorphisms [23], it should soon be possible to specify STRP order within Screening Sets with near certainty. However, because of individual and even possibly population differences in recombination rates [24–26], it may never be possible to specify genetic distances between STRPs with high precision.
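The naming convention described above, choosing the alphabetically minimal name among all permutations of a repeat unit on either strand, can be sketched in a few lines. This is an illustrative implementation rather than the authors' code, and the function names are hypothetical.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotations(seq):
    """All cyclic permutations of a repeat unit."""
    return [seq[i:] + seq[:i] for i in range(len(seq))]


def canonical_repeat(unit):
    """Alphabetically minimal name over both strands and all rotations."""
    unit = unit.upper()
    return min(rotations(unit) + rotations(reverse_complement(unit)))


# Every name listed in the text for this repeat type maps to the same label.
for name in ("GATA", "ATAG", "TAGA", "ATCT", "TATC", "CTAT", "TCTA"):
    print(name, "->", canonical_repeat(name))
```

Grouping STRPs by this canonical name is what allows a single column per repeat type in a table such as Table 4.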


Table 5: Association of Interspersed Repetitive Elements with Selected STR Types.

Number of STRPs Associated with Indicated Repeats (Fraction STRPs)

Repeat           Total STRPs   AAAT        AAGG        AAT         AATG       AGAT         Other
ALU              189 (0.24)    31 (0.89)   9 (0.24)    64 (0.39)   2 (0.20)   63 (0.13)    20 (0.38)
L1               96 (0.12)     2 (0.06)    5 (0.13)    42 (0.26)   0 (0.00)   40 (0.08)    7 (0.13)
LTR              119 (0.15)    4 (0.11)    3 (0.08)    14 (0.09)   0 (0.00)   97 (0.20)    1 (0.02)
L2               16 (0.02)     1 (0.03)    1 (0.03)    4 (0.02)    4 (0.40)   4 (0.01)     2 (0.04)
MER              27 (0.03)     0 (0.00)    0 (0.00)    5 (0.03)    0 (0.00)   20 (0.04)    2 (0.04)
MIR              22 (0.03)     1 (0.03)    0 (0.00)    8 (0.05)    0 (0.00)   8 (0.02)     5 (0.10)
Other            6 (0.01)      0 (0.00)    0 (0.00)    2 (0.01)    0 (0.00)   3 (0.01)     1 (0.02)
Not Associated   352 (0.45)    3 (0.09)    23 (0.61)   44 (0.27)   5 (0.50)   259 (0.54)   18 (0.35)
TOTAL TESTED     781           35          38          163         10         483          52

STRPs were screened using Repeat Masker for IREs that are within 50 bp in either direction of the short tandem repeats (excluding the tandem repeats). Sums of the numbers in the columns do not match the totals because some sequences had two different interspersed repeats within the 100 bp.
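The association criterion in the table note above, an interspersed repeat falling within 50 bp on either side of the tandem repeat and excluding the tandem repeat itself, amounts to an interval-overlap test. The sketch below assumes half-open coordinates on a single sequence; the function name and the example coordinates are hypothetical and do not reflect Repeat Masker's output format.

```python
def is_ire_associated(str_start, str_end, ire_start, ire_end, flank=50):
    """True if an interspersed repeat overlaps either flank of a tandem repeat.

    All intervals are half-open [start, end) in bp. The tandem repeat itself
    is excluded; only the `flank` bp on each side are examined.
    """
    def overlaps(a_start, a_end, b_start, b_end):
        return a_start < b_end and b_start < a_end

    left_flank = (str_start - flank, str_start)
    right_flank = (str_end, str_end + flank)
    return (overlaps(*left_flank, ire_start, ire_end)
            or overlaps(*right_flank, ire_start, ire_end))


# Hypothetical tandem repeat at [1000, 1040).
print(is_ire_associated(1000, 1040, 900, 960))    # reaches into the left flank
print(is_ire_associated(1000, 1040, 1100, 1200))  # starts beyond the right flank
```

Because one STRP can have different elements in its two flanks, per-element counts computed this way need not sum to the total number of STRPs, as the table note points out.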

Screening Set 10 STRPs have been typed in the ~1000 members of the Human Diversity Panel [27]. We hope to also type the new Set 12 and 52 STRPs through this Panel in the relatively near future. It is beneficial to have a global perspective on informativeness and allele frequencies for each Screening Set STRP. Although the Screening Set STRPs were initially screened using European DNA samples, we have found that in almost all cases, they are highly or modestly informative in other human populations. Consistent with previous results [28–30], average heterozygosities for the Screening Set STRPs in Sub-Saharan Africans are the highest [27]. Isolated European populations such as Sardinian villagers, and Old Order Amish in the U.S. have only slightly diminished heterozygosities. Although some Screening Set STRPs have reduced informativeness in East Asian populations such as Han Chinese, average heterozygosities in East Asians are also only slightly diminished compared to Europeans. So far, only Native American and some Oceanic populations have average heterozygosities for Screening Set STRPs that are substantially reduced. With enough effort, it probably would be possible to develop Screening Set STRPs with higher average heterozygosities for populations such as Native Americans, but the practicality of such an undertaking is uncertain.

Similarly, it would also be helpful to carry out extensive sequencing of at least the frequent alleles for each Screening Set STRP. This would eliminate the need to approximate allele sizes. However, this would also be a large and expensive project, and may have to wait until sequencing costs drop so that many human genomes from around the world can be sequenced.

Although nearly all Screening Set STRPs are at least modestly polymorphic in all human populations examined to date, this does not guarantee that they will be free of frequent non-integer alleles or weak or null alleles in some populations. For example, we have observed apparent null alleles for some STRPs in Chinese that were not present in Europeans (eg GATA29A01 on chromosome 6 and GGAA20G04 on chromosome 2). We have also observed non-integer alleles in Sub-Saharan Africans that have not been seen at appreciable frequency in other populations (eg GATA104 on chromosome 7 and GATA11A06 on chromosome 18).

Despite having much higher mutation rates than diallelic polymorphisms, there is abundant evidence that highly informative STRPs of the type found within Screening Sets are generally powerful markers for detection of linkage disequilibrium [eg [31,32]]. It is unclear, however, whether dinucleotide or tetranucleotide STRPs are superior in this regard. Experimental evidence seems to favour higher average mutation rates for tetranucleotide STRPs [33], while theoretical results favour higher average rates for dinucleotide STRPs [34]. Analysis of STRPs typed in CEPH reference families for construction of human genetic maps revealed that the fraction of dinucleotide/dinucleotide STRP pairs < 200 kb apart with linkage disequilibrium at p < 0.01 was 18.7%, whereas the fraction for dinucleotide/tetranucleotide pairs was 7.9% and for dinucleotide/trinucleotide pairs was 22.1% (Broman K, Weber J unpublished results).

Many of the Set 12 and 52 STRPs are superior to the thirteen STRPs used routinely in forensic DNA testing in the U.S. http://www.cstl.nist.gov/biotech/strbase/fbicore.htm. Several of the thirteen forensic STRPs have frequent non-integer alleles. Several are not especially informative. Five of the thirteen forensic STRPs are currently included within Set 12. This occurred by chance rather than design. If genome polymorphism scans for either research or clinical purposes become widespread, then overlap between our Screening Sets and forensic Sets will have to be carefully considered.

Although our newest Screening Sets are substantial improvements over previous versions, they are still not perfect. Some STRPs have lower informativeness than desired, and some large gaps in coverage remain. The Set 12 STRPs are generally superior to Set 52 polymorphisms because Set 52 is new. There has not yet been a chance to make many replacements.

We will continue to make improvements in our human STRP Screening Sets and to post upgrades on the Marshfield web site. But are there limits to the quality of STRP Sets? The answer is undoubtedly yes. There are only approximately 65,000 modestly to highly informative tri- and tetranucleotide STRPs in the human gene pool [1,2]. Within some ~1 mb regions of the genome, we have already exhausted all likely tri- and tetranucleotide STRP candidates. Only a small fraction (11%) of the new STRPs we screened from the genomic sequence were selected for the new Sets. It is quite conceivable that over the next decade or two we will characterize all human STRPs that have reasonable informativeness. Resequencing different human genomes will undoubtedly contribute much to this effort.

Quite a few investigators have speculated that diallelic polymorphisms such as SNPs or diallelic indels will supplant STRPs in human Screening Sets. Our position continues to be that this question will likely be ultimately determined by typing costs [4]. STRPs provide much more information than diallelic polymorphisms, so diallelic typing costs would need to drop well below those for STRPs. This might happen, but it hasn't yet, and it's not clear that it ever will. There may also be advantages to including both high and low mutation rate polymorphisms within Screening Sets (ie STRPs and diallelics) [35]. In any case, we believe that our STRP Screening Sets will continue to be highly valuable and widely used for many years.

Conclusions

The development of Screening Sets 12 and 52 will improve gene mapping in general, and specifically genome scans where a relatively high STRP density is required. Complete information on all of our Screening Sets is freely available from the Marshfield web site http://research.marshfieldclinic.org/genetics along with lists of over 200,000 candidate and confirmed human indel polymorphisms, both multi- and diallelic. We plan to continue to improve our human STRP Screening Sets until we have exhausted all available STRPs at specific chromosomal sites.

Materials and Methods

Identification of candidate polymorphisms

Two different approaches were used to search for new polymorphisms. One approach was to use overlapping BAC genomic sequences to select polymorphisms that varied by more than two repeats [15]. The other approach was to browse the genome for STRs using the December 12, 2000 version of the genomic sequence at University of California – Santa Cruz http://genome.ucsc.edu/ [36].

Once a sequence containing the desired polymorphism was selected (usually 400–700 bp in length), it was run through the Repeat Masker program http://ftp.genome.washington.edu/cgi-bin/RepeatMasker in order to avoid selecting PCR primers within Alu, L1, or other repeats. The Primer 3 program http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi was used to select PCR primers. Candidate sequences which did not permit the placement of at least one PCR primer within unique sequence (ie outside of a repeat identified by Repeat Masker) were not tested further. In cases where one PCR primer was located within a repeat, the primer from within the unique sequence was tagged with a fluorescent dye.

Sequence Alignments

All of the 406 single read STRP sequences from Set 10 were Blasted against genomic sequences from the public labs. For nearly all STRPs, we identified 1 to 3 BACs that showed high homology (Blast criteria were score (bits) > 200, expect (E) value < e-50, and ratio of matched bases to STRP sequence length >85%). Two different multiple alignment programs were then used to align the single read and the genomic sequences: "multalin" http://protein.toulouse.inra.fr/multalin/multalin.html and "clustalw" http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html.

Screening of candidate polymorphisms

For initial screening of the PCR primers, we incorporated a dye-labelled nucleotide with a two-step PCR protocol. Briefly, the first step contained 10 mM Tris-HCl (pH 8.3), 50 mM KCl, 1.5 mM MgCl2, 0.001% gelatin, 250 µM each dNTP, 4.7 µM of the forward and reverse primers, 0.15 units of Taq polymerase (Roche) in a total 5 µl reaction volume. The second reaction had the same components and volume as in the first step, except that the forward primer was present at 6.2 µM and R6G dUTP (Applied Biosystems) at 0.5 µM with no reverse primer. About 0.5 µl of step 1 PCR product was used as a DNA template for step 2 PCR. Each PCR step initiated with a 95°C soak for 4 min, followed by 30 and 25 cycles for steps 1 and 2, respectively, consisting of 95°C for 40 sec, 55°C for 75 sec, 72°C for 40 sec, and a final extension of 7 min at 72°C. An equal volume of loading solution composed of EDTA (10 mM) and Orange G dye (13.6 mM) (Sigma) dissolved in formamide was added to the reaction following PCR, and 0.6 µl of the product was fractionated on denaturing acrylamide gels (6.0% acrylamide, 7.7 M urea, 89 mM Tris, 89 mM borate, 2.5 mM EDTA, pH 8.3).

For use of fluorescent-labelled primers, 45 ng of template DNA is dried in the wells of 96 well polypropylene plates. PCR amplifications were carried out in a 4 µl volume containing 10 mM Tris-HCl (pH 8.3), 50 mM KCl, 1.5 mM MgCl2, 0.001% gelatin, 100 µM each dNTP, 0.075 µM of fluorescent-labelled forward and unlabeled reverse primer, and 0.12 units of Taq polymerase. PCR amplification was carried out for 27 cycles with the same times and temperatures as listed above.

Genetic map positions for new STRPs

Genetic distances for the new STRPs were obtained by typing the STRPs in several projects with large numbers of European families. The CRIMAP program was used to order the STRPs and to deduce genetic distances. In order to fit new STRPs into the Marshfield map [20], approximate genetic values were obtained by extrapolations using the new sex-average genetic distances and the Marshfield map genetic distances for two flanking, older STRPs. In rare instances, when no neighbouring STRPs with known Marshfield map distances were available, the genetic distances were extrapolated from physical distances from the UC-Santa Cruz sequence assembly, June 2002 version. For the X-chromosome analysis, female genetic distances were used in place of sex-average genetic distances.

Heterozygosity values were determined by typing STRPs in two different population groups. For Cooperative Human Linkage Center (CHLC) and Utah STRPs in Set 12, heterozygosity values were deduced by typing the STRPs through several populations of different ethnic groups (African, Asian and European), whereas for newly developed STRPs in Set 12 and all the STRPs within Set 52 (newly developed, CHLC, and Utah STRPs), a European population was used. Heterozygosity estimates of the Set 10 (and many Set 12) STRPs are also available from genotyping of the Human Diversity Panel [see Marshfield web site and reference [27]].

Authors' contributions

NG led the building of Screening Sets 11, 12, 51 and 52 and drafted the manuscript. DV led the building of Set 10. AY worked out conditions for initial screening of new STRPs. CZ carried out ePCR and STR computer searches. GM identified candidate polymorphisms from overlapping BAC sequences. JLW conceived the study and coordinated all efforts. All authors read and approved the final manuscript.

Acknowledgements

We thank Jan Wood, Heather Pagenkopf, Jocelyn Schroeder, Thao Le, Robert Kuntz, Vani Natarajan, Jessica Kayhart, Jennifer Kislow, Matt Williamson and Kate Buehler for expert laboratory assistance. This work was supported through NHLBI Contract HV48141 for the Mammalian Genotyping Service.

References

1. Zhao C, Heil J and Weber JL A genome-wide portrait of short tandem repeats. Am J Hum Genet 1999, 65(supplement):A102
2. Tóth G, Gáspári Z and Jurka J Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res 2000, 10:967-981
3. Wijmenga C, Frants RR, Brouwer OF, Moerer P, Weber JL and Padberg GW Location of the fascioscapulohumeral muscular dystrophy gene on chromosome 4. Lancet 1990, 336:651-653
4. Weber JL and Broman KW Genotyping for human whole-genome scans: past, present, and future. Adv Genet 2001, 42:77-96
5. Brinkmann B, Klintschar M, Neuhuber F, Hühne J and Rolf B Mutation in human microsatellites: Influence of the structure and length of the tandem repeat. Am J Hum Genet 1998, 62:1408-1415
6. Callen DF, Thompson AD, Shen Y, Phillips HA, Richards RI, Mulley JC and Sutherland GR Incidence and origin of "null" alleles in the (AC)n microsatellite markers. Am J Hum Genet 1993, 52:922-927
7. Kruglyak L Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat Genet 1999, 22:139-144
8. Simonic I, Gericke GS, Ott J and Weber JL Identification of genetic STRPs associated with Gilles de la Tourette Syndrome in an Afrikaner population. Am J Hum Genet 1998, 63:839-846
9. Ober C, Abney M and McPeek MS The genetic dissection of complex traits in a founder population. Am J Hum Genet 2001, 69:1068-1079
10. Ophoff RA, Escamilla MA, Service SK, Spesny M, Meshi DB, Poon W, Molina J, Fournier E, Gallegos A and Mathews C Genomewide linkage disequilibrium mapping of severe bipolar disorder in a population isolate. Am J Hum Genet 2002, 71:565-574
11. Dubovsky J., Sheffield VC, Duyk GM and Weber JL Sets of short tandem repeat polymorphisms for efficient linkage Screening of the human genome. Hum Mol Genet 1995, 4:449-452
12. Yuan B, Vaske D, Weber JL, Beck J and Sheffield VC Improved Set of short tandem repeat polymorphisms for screening the human genome. Am J Hum Genet 1997, 60:459-460
13. Sheffield VC, Weber JL, Buetow KH, Murray JC, Even DA, Wiles K, Gastier JM, Pulido JC, Yandava C and Sunden SL A collection of tri- and tetranucleotide repeat STRPs used to generate high quality, high resolution human genome-wide linkage maps. Hum Mol Genet 1995, 4:1837-1844
14. Utah Marker Development Group A collection of ordered tetranucleotide repeat markers from the human genome. Am J Hum Genet 1995, 57:619-628
15. Weber JL, David D, Heil J, Fan Y, Zhao C and Marth G Human Am J Hum Genet 2002, 71:854-862
16. Gastier JM, Pulido JC, Brody T, Sheffield VC, Weber JL, Buetow KH, Murray JC, Hudson TJ and Duyk GM Survey of trinucleotide repeats in the human genome: assessment of their utility as genetic markers. Hum Mol Genet 1995, 4:1829-1836
17. Brownstein MJ, Carpten JD and Smith JR Modulation of non-templated nucleotide addition by Taq DNA polymerase: primer modifications that facilitate genotyping. Biotechniques 1996, 20:1004-1010
18. Magnuson VL, Ally DS, Nyland SJ, Karanjawala ZE, Rayman JB, Knapp JI, Lowe AL, Ghosh S and Collins FS Substrate nucleotide-determined non-templated addition of adenine by Taq polymerase: implications for PCR-based genotyping and cloning. Biotechniques 1996, 21:700-709
19. Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B and Masson G A high-resolution recombination map of the human genome. Nat Genet 2002, 31:241-247
20. Broman KW, Murray JC, Sheffield VC, White RL and Weber JL Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet 1998, 63:861-869
21. Jin L, Zhong Y and Chakraborty R The exact numbers of possible microsatellite motifs. Am J Hum Genet 1994, 55:582-583
22. Beckmann JS and Weber JL Survey of human and rat microsatellites. Genomics 1992, 12:627-631
23. Giglio S, Calvari V, Gregato G, Gimelli G, Camanini S, Giorda R, Ragusa A, Guerneri S, Selicorni A and Stumm M Heterozygous submicroscopic inversions involving olfactory receptor-gene clusters mediate the recurrent t(4;8)(p16;p23) translocation. Am J Hum Genet 2002, 71:276-285
24. Weber JL The Iceland Map. Nature Genet 2002, 31:225-226
25. Cullen M, Perfetto SP, Klitz W, Nelson G and Carrington M High-resolution patterns of meiotic recombination across the human major histocompatibility complex. Am J Hum Genet 2002, 71:759-776
26. Lynn A, Koehler KE, Judis L, Chan ER, Cherry JP, Schwartz S, Seftel A, Hunt PA and Hassold TJ Covariation of synaptonemal complex length and mammalian meiotic exchange rates. Science 2002, 296:2222-2225
27. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA and Feldman MW Genetic structure of human populations. Science 2002, 298:2381-2385
28. Bowcock AM, Ruiz-Linares A, Tomfohrde J, Minch E, Kidd JR and Cavalli-Sforza LL High resolution of human evolutionary trees with polymorphic microsatellites. Nature 1994, 368:455-457
29. Deka R, Jin L, Shriver MD, Yu LM, DeCroo S, Hundrieser J, Bunker CH, Ferrell RE and Chakraborty R Population genetics of dinucleotide (dC-dA)n (dG-dT)n polymorphisms in world populations. Am J Hum Genet 1995, 56:461-474
30. Calafell F, Shuster A, Speed WC, Kidd JR and Kidd KK Short tandem repeat evolution in humans. Eur J Hum Genet 1998, 6:38-49
31. Huttley GA, Smith MW, Carrington M and O'Brien SJ A scan for linkage disequilibrium across the human genome. Genetics 1999, 152:1711-1722
32. Varilo T, Paunio T, Parker A, Perola M, Meyer J, Terwilliger JD and Peltonen L The interval of linkage disequilibrium detected with microsatellite and SNP markers in chromosomes of Finnish populations with different histories. Hum Mol Genet 2003, 12:51-59
33. Weber JL and Wong C Mutation in short tandem repeat polymorphisms. Hum Mol Genet 1993, 2:1123-1128
34. Chakraborty R, Kimmel M, Stivers DN, Davison LJ and Deka R Relative mutation rates at di-, tri-, and tetranucleotide microsatellite loci. Proc Natl Acad Sci USA 1997, 94:1041-1046
35. de Kniff P Messages through bottlenecks: on the combined use of slow and fast evolving polymorphic markers on the human Y chromosome. Am J Hum Genet 2000, 67:1055-1061
36. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM and Haussler D The human genome browser at UCSC. Genome Res 2002, 12:996-1006

BMC Genomics 2003, 4 http://www.biomedcentral.com/1471-2164/4/6


7

Computational SNP Discovery in DNA Sequence Data

Gabor T. Marth

From: Methods in Molecular Biology, vol. 212: Single Nucleotide Polymorphisms: Methods and Protocols. Edited by: P-Y. Kwok © Humana Press Inc., Totowa, NJ

1. Introduction

Both the quantity and the distribution of variations in DNA sequence are the product of fundamental biological forces: random genetic drift, demography, population history, recombination, spatial heterogeneity of mutation rates, and various forms of selection. In humans, single base-pair substitution-type sequence variations occur with a frequency of approximately 1 in 1.3 kb when two arbitrary sequences are compared (1). This frequency increases with larger sample size (2), i.e., we expect to see, on average, more single nucleotide polymorphisms (SNPs) when a higher number of individual chromosomes are examined (3,4).

SNPs currently in the public repository (5) were discovered in DNA sequence data of diverse sources, some already present in sequence databases, but the majority of the data were generated specifically for the purpose of SNP discovery. Nearly 100,000 SNPs in transcribed regions were found by analyzing clusters of expressed sequence tags (ESTs) (6–8), or by aligning ESTs to the human reference sequence (9). The three major sources of genomic SNPs were sequences from restricted genome representation libraries (10), random shotgun reads aligned to genome sequence (1), and the overlapping sections of the large-insert (mainly bacterial artificial chromosome, or BAC) clones sequenced for the construction of the human reference genome (11–13). Most of these SNPs were detected in pairwise comparisons where one of the two samples was a genomic clone sequence. Theory predicts (14), and experiments confirm, that shallow sampling results in an overrepresentation of common variations: these common SNPs tend to be ancient variations, often present in all or most human populations (15) and expected to be valuable for detecting statistical association (16). For the same reason, many rare polymorphisms with rare phenotypic effects are likely to be absent from this set. The current collection of SNPs forms a dense, genome-wide polymorphism map (1) intended as a starting point for regional variation studies. An exhaustive survey of polymorphisms in a given region of interest is likely to require significantly higher sample sizes. Even so, the isolation of rare phenotypic mutations may only be possible by the cross-comparison between large samples of affected patients and those of controls.

Computational SNP discovery, in a general sense, refers to the process of compiling and organizing DNA sequences that represent orthologous regions in samples of multiple individuals, followed by the identification of polymorphic sequence locations. The first step typically involves a similarity search with the Basic Local Alignment Search Tool (BLAST) (17) to compile groups of sequences that originate from the region under examination. This is followed by the construction of a base-wise multiple alignment to determine the precise, base-to-base correspondence of residues present in each of the samples in a group. Finally, each position of the multiple alignment is scanned for nucleotide mismatches.
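The final step described above, scanning each position of a multiple alignment for mismatches, can be illustrated with a minimal Python sketch. It assumes the alignment is already available as equal-length strings with "-" for gaps; the function name candidate_columns and the toy rows are hypothetical, and no attempt is made here to separate true polymorphisms from sequencing errors.

```python
from collections import Counter

def candidate_columns(alignment):
    """Return (column index, base counts) for every alignment column
    in which more than one distinct nucleotide is observed.

    Gaps and ambiguity codes are ignored, so a column counts as a
    mismatch only when at least two of A, C, G, T are present.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    hits = []
    for col in range(length):
        counts = Counter(
            row[col].upper() for row in alignment if row[col].upper() in "ACGT"
        )
        if len(counts) > 1:
            hits.append((col, dict(counts)))
    return hits

# Toy alignment (hypothetical data): three reads, one mismatch column.
rows = [
    "ACGTTAGC",
    "ACGTCAGC",
    "ACG-TAGC",
]
print(candidate_columns(rows))
```

Every column reported this way is only a candidate: deciding whether it reflects a polymorphism or an error requires the quality information discussed in the rest of the chapter.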
Some of the most serious difficulties of sequence organization stem from the repetitive nature of the DNA observed in many organisms. It is well known that nearly half of the human genome is made up of high copy-number repetitive elements (18,19). In addition, many intra- and interchromosomal duplications exist, a large number of them yet uncharacterized. Similar to members of multigene families, these duplicated (paralogous) genomic regions may exhibit extremely high levels of sequence similarity (18), sometimes over 99.5%, and can extend over hundreds of kilobases. Failure to distinguish between sequences from different copies of duplicated regions results in false SNP predictions that represent paralogous sequence differences rather than true polymorphisms.

The construction of correct base-wise multiple alignments is a difficult problem because of its computational complexity. Sequences under consideration are generally of different lengths, rendering global sequence alignment algorithms such as CLUSTALW (20) rarely applicable. Expressed sequences (ESTs or more or less complete gene sequences) require local alignment techniques that are unperturbed by exon-intron punctuation and alternatively spliced sequence variants.

Once a multiple alignment is constructed, nucleotide differences among individual sequences can be analyzed. Owing to the presence of sequencing errors, not every nucleotide position with mismatches automatically implies a polymorphic site. Although it is impossible to decide which is the case with certainty, the success of SNP detection ultimately depends on how well one is able to discriminate true polymorphisms from likely sequencing errors. This is usually accomplished by statistical considerations that take advantage of measures of sequence accuracy (21,22) accompanying the analyzed sequences. The result, ideally, is a set of candidate SNPs, each with an associated SNP score that indicates the confidence of the prediction. Accurate confidence values can be extremely useful for the experimentalist in selecting which SNPs to use in a study or for further characterization, and enable one to use the highest number of candidates within the bounds of an acceptable false positive rate.

2. Materials

Sequences used in SNP analysis come from diverse sources. From the viewpoint of sequence accuracy, they can be categorized as either single-pass sequence reads or consensus sequences that result from multipass, redundant sequencing of the same underlying DNA. The overall sequencing error rate of single-pass sequences is in the 1% range (21–23), an order of magnitude higher than the average polymorphism rate (roughly 0.1%). The error rate is typically much higher at the beginning and the end of a read (21,22). Clusters of sequencing errors are also common; the location of these is highly dependent on specific base combinations, as well as the sequencing chemistry used. For detecting sequence variations, even marginally accurate data can be useful as long as regions of low accuracy nucleotides can be avoided. The most widely used base-calling program, PHRED (21,22), associates a base quality value with each called nucleotide. This base quality value, Q, is related to the likelihood that the nucleotide in question was determined erroneously: Q = -10 log10(P_error). Although different sequencing chemistries pose different challenges to base calling, tests involving large data sets have demonstrated that the quality value produced by PHRED is a very good approximation of actual base-calling error rates (21,22). Using base quality values, mismatches between low-quality nucleotides can be discarded as likely sequencing errors. Because consensus sequences are the product of multiple sequence reads, they are generally of higher accuracy. Exceptions to this rule are regions where the underlying read coverage is low, and/or regions where all underlying reads are of very low quality. Recognizing this problem, sequence assemblers (computer programs that create consensus sequences) also provide base quality values for the consensus sequence by combining quality scores of the underlying reads (24,25). The following subsections describe the most commonly used sequence sources in SNP discovery.
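The relationship between a base quality value and the base-calling error probability can be made concrete with a short Python sketch. The conversion functions follow the formula Q = -10 log10(P_error) given above; the helper mismatch_is_trustworthy and its threshold of 20 are illustrative assumptions, not values taken from the chapter.

```python
import math

def phred_to_error(q):
    """Error probability implied by a base quality value Q."""
    return 10 ** (-q / 10)

def error_to_phred(p_error):
    """Base quality value corresponding to an error probability."""
    return -10 * math.log10(p_error)

def mismatch_is_trustworthy(q_a, q_b, min_q=20):
    """Hypothetical filter: keep a mismatch only if both aligned bases
    reach a minimum quality (min_q = 20 is an assumed cutoff)."""
    return q_a >= min_q and q_b >= min_q

# Q = 20 corresponds to a 1% error probability, Q = 30 to 0.1%.
for q in (10, 20, 30, 40):
    print(q, phred_to_error(q))
```

With this scale, a single-pass read with an overall error rate near 1% averages roughly Q = 20, which is why mismatches between lower-quality bases are usually set aside as likely sequencing errors.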

2.1. STS Sequences

Sequence-tagged site (STS) sequences, amplified and sequenced in multiple individuals, were used in the first large-scale efforts to catalog variations at the genome scale (26). One of the main advantages of this strategy was that PCR primers, optimized during STS development, were readily available for use. If starting material for the amplification is genomic DNA, these sequences represent the superposition of both copies of a chromosome within an individual. As a result, the sequence may contain nucleotide ambiguities that correspond to heterozygous positions in the individual. Base-calling algorithms trained for homozygous reads will assign a low base quality value to whichever nucleotide is called, rendering base quality value-based SNP detection algorithms ineffective for these reads. Specialized algorithms (31) have been designed to deal with heterozygote detection, as discussed next.

2.2. EST Sequences

Expressed Sequence Tag (EST) Reads represent the richest source of SNPs in transcribed regions (6–8,27,28) to date. The majority of ESTs are single-pass reads, often from tissue-specific cDNA libraries (29,30). Because a single EST read may contain several exons, special care must be taken when these reads are aligned to genomic sequences. An additional difficulty is the alignment of ESTs representing alternative splice-variants of a single gene.

2.3. Small Insert Clone Sequences

2.3.1. Sequences from Reduced Representation Libraries

Size-Selected Restriction Fragments recognized by specific restriction enzymes are quasirandomly distributed in genomic DNA. The average distance between neighboring restriction sites (restriction fragment length) is a function of the length of the recognition sequence. A reduced, quasirandom representation of the genome can be achieved by first constructing a library of cloned restriction fragments, followed by size-selection to exclude fragments outside a desired length range. The number of different fragments (complexity) present in the library can be precalculated for any given length range. Inversely, library complexity can be controlled by appropriate selection of the upper and lower size limits (10).
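How library complexity might be precalculated from the recognition-site length and the size limits can be sketched in Python under a deliberately idealized model. The assumptions, none of which come from the chapter, are: all four bases are equally frequent and independent, so a site of length n occurs about once every 4^n bases, and fragment lengths are approximately exponentially distributed around that mean. Real genomes deviate from this, so the numbers are rough estimates only; both function names are hypothetical.

```python
import math

def expected_fragment_length(site_length):
    """Mean spacing between sites for an n-base recognition sequence,
    assuming equal, independent base frequencies (idealized)."""
    return 4 ** site_length

def library_complexity(genome_size, site_length, min_len, max_len):
    """Approximate number of distinct fragments whose length falls in
    [min_len, max_len], using an exponential length distribution."""
    mean = expected_fragment_length(site_length)
    n_fragments = genome_size / mean
    fraction = math.exp(-min_len / mean) - math.exp(-max_len / mean)
    return n_fragments * fraction

# Assumed inputs: a 3e9-base genome, a 6-base cutter, 450 to 550 bp window.
print(round(library_complexity(3e9, 6, 450, 550)))
```

Widening or narrowing the size window changes the estimate in the expected direction, which is the sense in which complexity can be tuned by choosing the upper and lower size limits.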

2.3.2. Sequences from Random Genomic Shotgun Libraries Random Genomic Subclone Reads are sequenced from DNA libraries with a quasirandom, short-insert subclone representation of the entire genome (whole-genome shotgun libraries). Because these reads deliver a random sampling of the whole genome, they are well-suited for genome-wide SNP discovery (1,12).

2.4. Large-Insert Genomic Clone Consensus Sequences

Recent large-scale, genome-wide SNP discovery projects (1,11–13,32) take advantage of the public human reference sequence built as a tiling path through partially overlapping, large-insert genomic clones (18,23). The sequence of these clones was determined with a local shotgun strategy. By cloning random fragments into a suitable sequencing vector, a subclone library is created for each clone. This library is then extensively sequenced until reaching a desired, three- to tenfold, quasirandom read coverage. The DNA sequence of the large-insert clone is reconstructed by assembling the shotgun reads with computer programs (24). At this stage, there are still several gaps in the sequence, although overall accuracy is high (approximately 99.9%). Gap closure and cleanup of regions of low-quality sequence require considerable manual effort (23), known as "finishing." Finished or "base-perfect" sequence is assumed to be at least 99.99% accurate (18).

2.5. Assembled Whole-Genome Shotgun Read Consensus Sequences

Similar in nature to genomic clone sequences, these consensus sequences are the result of assembling a large number of genome-wide shotgun reads, possibly from libraries representing multiple individuals. Over two million human SNP candidates were discovered in the private sector by the analysis of multi-individual reads that provided the raw material for the construction of a human genome reference sequence produced by the whole-genome sequence assembly method (19).

3. Methods 3.1. Published Methods of SNP Discovery

Methods of SNP mining have gone through a rapid evolution during the past few years. The first approaches relied on visual comparison of sequence traces from multiple individuals (33). Although manual comparison of a small number of sequence traces is feasible, standard accuracy criteria are hard to establish, and this method does not scale well for multiple sequence traces and many polymorphic locations. The efficiency of visual inspection is increased when it is performed in the context of a multiple sequence alignment (27,34,35), aided by computer programs that are capable of displaying the alignments and provide tools for simultaneous viewing of sequence traces at a given locus of the multiple alignment (36). Computer-aided prefiltering followed by manual examination of sequence traces (11,32) was used in the analysis of overlapping regions of genomic clone sequences to detect candidate SNPs as sequence differences between reads representing the two overlapping clones. These early methods were instrumental in demonstrating the value of extant sequences, sequenced as part of the Human Genome Project, for the discovery of DNA sequence variations. Although visual inspection remains an integral part of software testing and tuning, demands for fast and reliable SNP detection in large data sets have necessitated the development of automated, computational methods of SNP discovery.

The first generation of these methods was designed to enable mining the public EST database (37), and relied, in part, on tools previously developed to aid the automation of DNA sequencing (23). SNP detection was performed by software implementing heuristic considerations. Picoult-Newberg et al. (27) used the genome fragment assembler PHRAP to cluster and multiply align ESTs from 19 cDNA libraries. The use of the genome assembler implied that alternatively spliced ESTs were not necessarily included in a single cluster. There was no attempt to distinguish between closely related members of gene families (paralogs). SNP detection was carried out through the successive application of several filters to discard SNP candidates in low-quality regions, followed by manual review. Mainly as the result of conservative heuristics, this method found only a small number of SNP candidates, 850, in the several hundred thousand sequences analyzed. Buetow et al. (6) used UNIGENE (38), a collection of precomputed EST clusters, as a starting point. ESTs within each cluster were multiply aligned with PHRAP (24). Identification of paralogous subgroups within clusters was done by constructing phylogenetic trees of all cluster members and analyzing the resulting tree topology. Again, SNP candidates were identified by heuristic methods to distinguish between true sequence differences and sequencing errors. This method yielded over 3,000 high-confidence candidates in 8,000 UNIGENE clusters that contained at least 10 sequence members. Unfortunately, the great majority of clusters contained significantly fewer sequences and could not be effectively analyzed with these methods.

The development of a second generation of tools was prompted by the needs of genome-scale projects of SNP discovery. The large amount of data generated by The SNP Consortium (TSC) (1) has spurred the development of several SNP discovery tools. In the initial phase, the TSC employed a molecular strategy called restricted genome representation (RRS), which involves the sequencing of size-selected restriction fragment libraries from multiple individuals (10). For example, the full digestion by a given restriction enzyme may produce 20,000 genomic fragments in the 450–550-bp length range. After digestion of the genomic DNA of each of the 24 individuals, followed by size-selection, the restriction fragment libraries are pooled. When a collection of such random fragments is sequenced to appreciable redundancy (say, 60,000–80,000 reads), the sequence of many of the fragments will be available from more than one individual.
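The redundancy argument in the example above can be checked with a small Poisson calculation in Python. This is a sketch under the assumption that reads are drawn uniformly at random from the pooled fragments, so the number of reads per fragment is approximately Poisson with mean (number of reads) / (number of fragments). It counts reads per fragment, which is only an upper bound on how often a fragment is seen in more than one individual; the function names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k reads when the mean depth is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def prob_depth_at_least(k, lam):
    """Probability that a fragment is covered by k or more reads."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))

fragments = 20_000  # library complexity from the example in the text
for reads in (60_000, 80_000):
    lam = reads / fragments
    print(reads, lam, prob_depth_at_least(2, lam))
```

The same function gives the upper tail of the expected depth distribution, so a cluster far deeper than the Poisson model predicts can be flagged for closer inspection.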
These redundant sequences are a suitable substrate for SNP analysis. The analysis of data of this type is similar to that of EST sequences. First, one must cluster the sequence reads to delineate groups of identical fragments. To avoid grouping sequences based on similarity between the known human repeats they contain, the reads are screened and repetitive sequences are masked (39). Pairs of similar sequences are determined by a full pair-wise similarity search between all reads from a given library. Pairs are merged into groups (cliques) by single-linkage, transitive clustering. Some groups may still be composed of sequences that represent low-copy repeats (paralogous regions) not present in the REPEATMASKER repeat-sequence library. One of the strategies to identify these potential paralogs is to compare cluster depth (the number of sequences in the group) to expectations obtained from Poisson sampling with the given redundancy (10). Groups that survive these filtering steps are analyzed for SNPs. One of the methods used, the Neighborhood Quality Standard (NQS) (10), establishes a quality standard for each of the aligned nucleotides within each sequence, taking into account the base quality value of the nucleotide in question as well as the quality of the neighboring nucleotides. Instead of the full multiple alignment, the detection of SNPs was based on the analysis of all possible read pairs within a given group: mismatches between pairs of aligned nucleotides meeting the NQS were extracted as SNP candidates.

As the initial, draft sequencing of the human genome neared completion, it became possible to switch to a more accurate, more efficient strategy. Because the majority of the genome was available as genome reference sequence (18), sequencing of whole-genome, random subclone libraries would provide sequence coverage that could be compared to the reference sequence. This reduced the time and cost associated with the creation of reduced representation subclone libraries (10,18). The informatics problems associated with this strategy were also reduced in complexity. It was now possible to use a single similarity search to place the fragments on the genome reference.
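The pair-wise NQS comparison described above can be sketched in Python as follows. This is an illustration only: the threshold values, the flank width, and the function names are placeholders chosen for the example and are not taken from this chapter, and the two reads are assumed to be already aligned without gaps.

```python
from typing import List, Sequence, Tuple


def meets_nqs(quals: Sequence[int], i: int, min_center: int = 20,
              min_flank: int = 15, flank: int = 5) -> bool:
    """True if position i and its neighbours pass the quality standard.

    min_center, min_flank and flank are illustrative placeholder values.
    """
    if i - flank < 0 or i + flank >= len(quals):
        return False  # not enough neighbouring bases to judge the position
    if quals[i] < min_center:
        return False
    neighbours = list(quals[i - flank:i]) + list(quals[i + 1:i + flank + 1])
    return all(q >= min_flank for q in neighbours)


def nqs_candidates(seq_a: str, qual_a: Sequence[int],
                   seq_b: str, qual_b: Sequence[int]) -> List[Tuple[int, str, str]]:
    """Return (position, base_a, base_b) for mismatches where both reads pass."""
    candidates = []
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b and meets_nqs(qual_a, i) and meets_nqs(qual_b, i):
            candidates.append((i, a, b))
    return candidates


read_a = "ACGTACGTACGTA"
read_b = "ACGTACATACGTA"  # differs from read_a at position 6
high_quality = [30] * len(read_a)
print(nqs_candidates(read_a, high_quality, read_b, high_quality))
```

A mismatch is reported only when both reads are of good quality at the site and around it, which is what lets this approach work on read pairs without building a full multiple alignment.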
By the same procedure, it was also possible to ascertain alternative (paralogous) locations. This is the strategy employed by the algorithm SSAHASNP (40), which combines a fast search algorithm for matching short sequence fragments against the genome with a SNP detection algorithm that uses the NQS (10) to find SNP candidates in pair-wise comparisons of sequence fragments against the genome. As a fast tool capable of efficient processing of large data sets, SSAHASNP was used in the discovery of a large fraction of SNPs in the TSC data (1).

As we can see from the previous discussion, the molecular substrates involved in different projects of sequence-based SNP discovery represent data of varied types and sequence sources. The result is a multitude of different scenarios in terms of alignment depth, what the individual sequences represent, overall sequence accuracy, and so on. The methods of SNP discovery we have discussed so far are generally quite successful in operating within the specific sequence context for which they were developed. There was, however, a growing need for general tools of SNP discovery (41) that are able to analyze sequences in shallow or deep coverage, and sequences from different sources simultaneously, without human review, and to assign a realistic measure of confidence to the SNP candidates, without regard to the source and overall accuracy of these sequences. To achieve the flexibility this required, it was necessary to develop mathematically rigorous, statistical methods of SNP detection. Here we will describe POLYBAYES (9), one of the first general-purpose SNP analysis tools available for use today.

POLYBAYES is composed of three parts, each independent of the others: an anchored multiple alignment algorithm, a paralog discrimination algorithm, and the SNP detection algorithm. The anchored alignment algorithm assumes the availability of a genomic reference sequence (such as the Genome Assembly [18] for the Human Genome).
Short sequence fragments are organized by aligning them to the reference sequence. This algorithm works well in the case of cDNA (EST) sequences even in the presence of alternative splicing, as individual exons are aligned while leaving gaps for the introns or spliced-out exons (see Fig. 1). The paralog discrimination algorithm examines the alignment of the fragment to the genomic reference, and decides, on the basis of the sequence quality information, whether the number of discrepancies observed in the alignment is statistically consistent with the number expected from polymorphisms plus sequencing errors. If the number of observed discrepancies greatly exceeds the number expected, the sequence fragment is flagged as a likely paralog, and is discarded from further analysis (see Fig. 2).

Fig. 1. Alignment of EST reads to genomic anchor sequence (viewed in the CONSED sequence viewer-editor program). ESTs in this alignment represent two alternative splice variants, both correctly aligned to the genome sequence.

Fig. 2. Example of a paralogous EST sequence (marked with blue bar) in alignment with sequences likely to originate from the given genomic locus. The paralog is detected and tagged automatically by the software.

The SNP detection algorithm employed by POLYBAYES calculates the probability that discrepancies at the analyzed location represent true sequence variation as opposed to sequencing error. As a Bayesian algorithm, it combines a priori (prior) knowledge about the sequence context with the specific, observed data represented by the sequences under examination. Typically, such prior knowledge includes an approximate average polymorphism rate in the region, and the expected ratio between transitions and transversions. Additional information may include the number of different individuals represented by the sequences within the alignment, or the degree of their relatedness. Often, multiple sequence reads (e.g., forward-reverse read pairs) may originate from a single DNA clone template; in such cases, any mismatch between these reads is a priori identified as a sequencing error. The role of sequence accuracy, as expressed by the base quality values in the individual sequences, is quite intuitive: a mismatch between nucleotides of low accuracy is more likely the result of sequencing error than of true variation. On the other hand, if a mismatch occurs between nucleotides with high base quality values, the likelihood of a true polymorphism is higher. Alignment depth (the number of sequences contributing to the site under examination) is similarly important: a candidate A/G polymorphism between only two sequences is less convincing than one where, say, 30 sequences contribute an A and another 30 sequences contribute a G residue to the alignment slice. Finally, the effect of base compositional biases may be significant in extremely A/T- or G/C-rich organisms, and is taken into account in the computations. The algorithm can be summarized as follows. At a given slice of N aligned nucleotide sequences, each sequence can represent one of the four DNA nucleotides, giving rise to a total of 4^N possible permutations within the slice.
The POLYBAYES algorithm calculates the Bayesian posterior probability for all 4^N possible permutations, taking into account the prior expectations, the base quality values, local base composition, and the alignment depth. The sum of the probabilities for all polymorphic permutations (i.e., permutations in which not all N sequences are in agreement) is the probability that the sequences at the given location harbor a SNP. Because the algorithm does not depend on the source of the quality values (whether generated by a base caller such as PHRED, or by a fragment assembly program such as PHRAP), it is possible to objectively and simultaneously evaluate all available data present in the alignment, without regard to sequence source or restrictions on data quality. For each site of the alignment, the algorithm outputs the probability that the site is polymorphic. These probability values were shown to accurately estimate the validation rate of candidate SNPs in various mining applications (1,9,15). This is desirable because realistic estimates of the true positive rate allow one to use the highest number of SNP candidates within an acceptable false positive rate. The POLYBAYES software is compatible with the PHRED/PHRAP/CONSED file structure, is capable of analyzing multiple alignments created with PHRAP, and the output, including markup information such as paralog tags and candidate SNP sites, is directly viewable within CONSED (Figs. 2 and 3).

An alternative statistical formulation (8) developed to analyze EST clusters produces a log-odds (LOD) score to rank SNP candidates based on sequence accuracy, the quality of the alignment, prior polymorphism rate, and by evaluating adherence to the rules of Mendelian segregation of alleles within individual cDNA libraries.

There are two additional cases of practical importance that the algorithms described earlier were not designed to work with directly. In many situations, the DNA template that is available for analysis is double-stranded, genomic DNA of an individual, or sometimes a pool of multiple individuals. The first is the case when a known region is assayed from the genomic DNA of multiple individuals (34,35), giving rise to sequence traces that contain heterozygous nucleotides. An example of a multi-individual DNA pool is one constructed to obtain population-specific estimates of allele frequency of known polymorphisms (42). PCR products obtained from such starting material represent more than a single, unique strand of DNA.
When these products are sequenced, polymorphic locations between different strands of DNA appear as base ambiguities in the sequence trace (Fig. 4). The automation of heterozygote detection motivated the development of POLYPHRED (31), a computer program (43) that examines numerical characteristics of sequence traces such as drop in peak height, ratio of a second peak under the primary peak, and overall sequence quality in the neighborhood of the analyzed nucleotide position. POLYPHRED integrates seamlessly with the University of Washington PHRED/PHRAP/CONSED genome analysis software package. Although POLYPHRED and other specialized, heuristic approaches have been tested for allele frequency estimation in pooled sequencing, reliable computer algorithms for frequency estimation are not yet available.

Fig. 3. Candidate SNP site. The SNP (alleles A/G) is evident within members of one of the two alternatively spliced forms of ESTs aligned to the genomic anchor sequence at this location. The tag above, generated automatically by the detection software POLYBAYES, shows the most likely allele combination at the site, together with the probability of that variation.

Fig. 4. Heterozygote detection with the POLYPHRED program. Multiple alignment with the site of an SNP marked up with POLYPHRED (left). Sequence traces of a homozygous A/A, a heterozygous A/G, and a homozygous G/G individual (right).

Another topic of practical importance is the detection of short insertions and deletions (INDELs). Polymorphisms of this type are also commonly referred to as DIPs (deletion-insertion polymorphisms). The main difficulty of detecting DIPs is the fact that current, base-wise measures of sequence accuracy provide no direct estimates of insertion or deletion type sequencing errors. The base quality value accompanying a given nucleotide expresses the likelihood that the nucleotide was called in error, but it is not possible to separate the likelihood of substitution-type sequencing error from the likelihood that a nonexistent nucleotide was artifactually inserted by the base caller. Similarly, there is no direct measure of the likelihood that between two called, neighboring nucleotides there are additional bases in the sequencing template that were erroneously omitted and therefore represent deletion-type errors. In the absence of sequencing error estimates, it is difficult to formulate rigorous models of insertion-deletion type polymorphisms. A heuristic approach employed by POLYBAYES for DIP detection is based on the assumptions that a higher base quality value corresponds to a decreased chance that the called nucleotide is, in fact, an artifactual insertion, and that the likelihood of deleted nucleotides between two high-quality called bases is low. Taking into account the base quality value of the nucleotides neighboring a candidate deletion, as well as the base quality values of the corresponding candidate insertion in another aligned sequence, a heuristic DIP likelihood is calculated. This likelihood was used to detect DIPs in overlapping regions of large-insert clones of the Human Genome Assembly. The validation rate for DIPs that were at least two base pairs long was about 70%; the validation rate for single base-pair insertions-deletions was significantly lower, especially for base-number differences in mono-nucleotide runs.
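The two assumptions behind the DIP heuristic can be illustrated with a small Python sketch. The scoring formula below is invented for this example and is not the calculation POLYBAYES performs; it only shows how base quality values on both sides of a candidate DIP might be combined, using the base-call error probability as a stand-in for the unavailable insertion and deletion error rates.

```python
from typing import Sequence


def phred_to_error(q: float) -> float:
    """Convert a PHRED-scaled base quality value to an error probability."""
    return 10 ** (-q / 10)


def dip_score(inserted_quals: Sequence[int],
              left_flank_qual: int, right_flank_qual: int) -> float:
    """Hypothetical heuristic score in [0, 1] for a candidate DIP.

    inserted_quals: qualities of the extra bases in the longer read.
    left_flank_qual, right_flank_qual: qualities of the two bases that
    border the gap in the shorter read.
    """
    # Assumption 1: high-quality extra bases are unlikely to be artifacts.
    p_insertion_real = 1.0
    for q in inserted_quals:
        p_insertion_real *= 1.0 - phred_to_error(q)
    # Assumption 2: bases are unlikely to be missing between two
    # high-quality neighbouring calls.
    p_nothing_omitted = ((1.0 - phred_to_error(left_flank_qual))
                         * (1.0 - phred_to_error(right_flank_qual)))
    return p_insertion_real * p_nothing_omitted


print(dip_score([40, 40], 40, 40))  # well supported on both sides
print(dip_score([10, 10], 40, 40))  # weakly supported inserted bases
```

A score of this kind rewards candidates that are well supported in both reads, but it cannot account for homopolymer runs, where the text reports the lowest validation rates.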

3.2. Computational Aspects of SNP Discovery

The majority of software packages for automated SNP discovery were developed to run under the UNIX operating system. Part of the reason for this is the availability of powerful and flexible programming tools that UNIX provides for the software developer. In addition, many of the SNP discovery tools available today were written in a way that enables their integration into existing genome analysis packages such as the PHRED/PHRAP/CONSED system, developed at the University of Washington under UNIX. Hardware requirements for SNP mining depend greatly on the scope of the task tackled. Searching for SNPs in specific, short (up to 100–150 kb) regions of the genome, in up to a few hundred sequences, is well within the capabilities of a conventional UNIX workstation (or a computer running the user-friendly LINUX operating system that can be installed on a personal computer with relative ease). Genome-wide SNP mining projects typically require server-class machines, and access to several hundred gigabytes of data storage, especially if intermediate steps of the mining procedure are tracked and results are recorded in a database.

Unfortunately, there is no official standard data exchange format for sequence multiple alignments, or SNP markup information. Many of the SNP discovery tools currently in use expect input and produce output in file formats specific to the program. In these cases, data translation between different tools is achieved via custom scripts. The closest to a de facto standard is the PHRED/PHRAP/CONSED (24) file structure and software architecture developed at the University of Washington that is widely used in sequencing laboratories worldwide. Given that several of the main SNP analysis tools, including POLYPHRED and POLYBAYES, were built to integrate within this structure, it is worthwhile to briefly summarize the University of Washington package standards for representing SNP information.

The main directory of the file architecture contains four subdirectories in which all relevant data is organized. Sequence traces reside in the subdirectory chromat_dir. When the base-calling algorithm PHRED interprets a trace, it creates a sequence analysis file in the PHD format, and writes it into the subdirectory phd_dir. In addition to header information such as sequence name, read chemistry, and template identifier, the PHD format file contains three important pieces of information for each called base: the called DNA residue, the corresponding base quality value describing the accuracy of the call, and the position of the called nucleotide relative to the sequence trace. The PHD file may also contain permanent additional sequence information or tags attached to sections of the read (such as the region of an annotated repeat, or cloning vector sequence). The prerequisite for using POLYPHRED is the presence of an additional trace analysis file that contains detailed information about the trace at the location of the called nucleotide. This file is the POLY format trace analysis file, located in the subdirectory poly_dir. Finally, all downstream analysis files are kept in the fourth subdirectory, edit_dir. Perhaps the most commonly used file in this directory is the ACE format sequence assembly, or multiple alignment file.
This file format was designed as an interchange format between the PHRAP sequence assembly program and the CONSED sequence editor. ACE files are versioned, and sequence edits performed within CONSED are saved as consecutive versions. The SNP detection program POLYPHRED takes an ACE format multiple alignment file, and adds markup information regarding the location of heterozygous trace positions. These tags are visible when the alignment is viewed with CONSED, enabling rapid manual review.

POLYBAYES operates in one of two modes. The first mode is the analysis of a pre-existing multiple alignment, supplied in the ACE format. In this case, the anchored multiple alignment step is bypassed, and an ACE format output file is created that contains the results of paralog identification and SNP detection, again, as tags viewable from within CONSED. In the second mode of operation one utilizes the anchored alignment capability of POLYBAYES. In this case, one starts out with FASTA format files representing the DNA sequence and the accompanying base quality values for the genomic anchor sequence, as well as the cluster member sequences (for a description of the FASTA format see URL: http://www.ncbi.nlm.nih.gov/BLAST/fasta.html). CROSS_MATCH (24), a pair-wise, dynamic programming alignment algorithm, is run between each member sequence and the anchor. The sequences, together with the pair-wise alignments, are supplied to POLYBAYES. The program multiply aligns the member sequences, performs the paralog filtering and the SNP detection step, and produces a new ACE format output file for the viewing of the anchored multiple alignment and SNP analysis results.
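As an illustration of the per-base content of a PHD file and of the FASTA sequence and quality inputs used in the second mode, the Python sketch below reads the base calls from a PHD-style text and writes them out as a FASTA sequence record and a matching quality record. It is a simplified stand-in for this kind of conversion, not a replacement for the utilities shipped with the package; the sample read and the function names are made up, and the parser assumes one "base quality position" triple per line between BEGIN_DNA and END_DNA.

```python
from typing import List, Tuple

BaseCall = Tuple[str, int, int]  # (called base, quality value, trace position)

SAMPLE_PHD = """BEGIN_SEQUENCE est_read_1
BEGIN_COMMENT
CHEM: term
END_COMMENT
BEGIN_DNA
a 32 12
c 40 25
g 9 37
t 28 50
END_DNA
END_SEQUENCE
"""


def parse_phd(text: str) -> Tuple[str, List[BaseCall]]:
    """Return the read name and its base calls from PHD-style text."""
    name = ""
    calls: List[BaseCall] = []
    in_dna = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("BEGIN_SEQUENCE"):
            name = line.split(maxsplit=1)[1]
        elif line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            base, qual, pos = line.split()[:3]
            calls.append((base, int(qual), int(pos)))
    return name, calls


def to_fasta_and_qual(name: str, calls: List[BaseCall]) -> Tuple[str, str]:
    """Build a FASTA sequence record and a parallel quality record."""
    seq = "".join(base for base, _, _ in calls)
    quals = " ".join(str(qual) for _, qual, _ in calls)
    return f">{name}\n{seq}\n", f">{name}\n{quals}\n"


read_name, base_calls = parse_phd(SAMPLE_PHD)
seq_record, qual_record = to_fasta_and_qual(read_name, base_calls)
print(seq_record, qual_record, sep="")
```

Keeping the sequence and quality records under the same header is what allows a downstream tool to pair each base with its quality value.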

3.3. SNP Discovery Protocol

Given the diversity of sequence data that can be used to detect polymorphic sites within an organism, it is impossible to prescribe a single protocol that works in every situation. In general, the mining procedure will contain the following steps: data organization, the creation of a base-wise multiple alignment, filtering of paralogous sequences (or cluster refinement), followed by the detection of SNPs in slices of the multiple alignment. In this final section of the chapter, we will give two different examples that typify the usual steps of SNP mining. The majority of mining applications can be successfully completed by customizing and combining these steps.

3.3.1. SNP Discovery in EST Sequences

In the first scenario, in a screen against a cDNA library one pulls out a clone sequence that contains a gene of interest. The cDNA is an already sequenced clone, and the corresponding EST is in the public database dbEST (37) (URL: http://www.ncbi.nlm.nih.gov/dbEST). The goal is to explore single base-pair variations within the gene. The first step towards this goal is to find all SNPs in those transcribed sequences of the gene that are available in public sequence databases. One proceeds as follows:

1. Find the location in the human genome of the gene from which the EST was expressed. Go to the NCBI (National Center for Biotechnology Information) web site (URL: http://www.ncbi.nlm.nih.gov) and follow the Map Viewer link. Use the search facility on this page to find the genomic location of the EST, pre-computed by the NCBI. Perform the search using the accession number of the EST. Make sure that you set the “Display Settings” to include the “GenBank” view. Click on the genome clone accession that overlaps the EST, and download the sequence in FASTA format. This sequence will act as the genomic anchor sequence for the ESTs to be analyzed.
2. Find all other ESTs in dbEST with significant sequence similarity to the original EST sequence. Perform the similarity search from the NCBI BLAST website (URL: http://www.ncbi.nlm.nih.gov/BLAST). Choose the “Standard nucleotide-nucleotide BLAST” option. Type the accession number of the EST in the “Search” field. Choose “est_human” as the database to search against. Once the search is done, format the output as “Simple text,” and parse out the accession list of ESTs from the list of hitting sequences (see Note 1).
3. Retrieve EST sequence traces. In the near future, EST trace retrieval will be possible from the trace repository (URL: http://www.ncbi.nlm.nih.gov/Traces) that is under construction at the NCBI. Currently, EST sequence traces can be downloaded from the Washington University ftp site (URL: ftp://genome.wustl.edu/pub/gsc1/est) for ESTs produced there. Searching is done via the local EST names. Download all ESTs for which traces can be found at this site (see Note 2).

4. Process the sequence traces with the PHRED base-calling program. Invoke PHRED with the command line parameters that produce files necessary for downstream processing in the University of Washington PHRED/PHRAP/CONSED architecture (URL: http://www.phrap.org). Make sure that PHD format sequence files are created in the “phd_dir” subdirectory, by specifying the location of this directory with the “-pd” option. Use the utility program PHD2FASTA (provided with CONSED) to produce a FASTA format file of the DNA sequences of the ESTs (“-os” option). Also, produce a FASTA format file for the accompanying base quality values (“-oq” option), and one for the list of base positions that specify the location of each called nucleotide relative to the sequence trace (“-ob” option). The DNA sequence of the ESTs will be used in the next step, as the members of the cluster (group) of expressed sequences to analyze for polymorphic sites.
5. Create a multiple alignment of the EST sequences with the anchored alignment algorithm implemented within POLYBAYES (instructions at the POLYBAYES web site, URL: http://genome.wustl.edu/gsc/polybayes). As the anchor sequence, use the genomic clone sequence from step 1. Use the CROSS_MATCH dynamic alignment program to compute the initial pair-wise alignments between each of the ESTs and the genomic anchor sequence (CROSS_MATCH is distributed as part of the PHRAP software package [24]). As cluster member sequences, use the ESTs obtained in steps 2–4. Figure 1 shows a section of a sample multiple alignment, viewed with the CONSED (36) sequence viewer-editor program. Observe that, in this case, the ESTs are divided into two groups of alternative splice forms.
6. Identify likely paralogous sequences with the built-in paralog-filtering feature of POLYBAYES. This feature is invoked by the “-filterParalogs” command line option (additional relevant arguments are explained in the online documentation available at the POLYBAYES web site). Figure 2 shows a different section of the multiple alignment produced in the previous step. Observe that there are several high-quality mismatches between the genomic anchor sequence and the EST marked with the blue tag. This sequence is considered a sequence paralog, and is automatically tagged by the filtering algorithm. The paralogous sequence is removed from consideration in any further analysis.
7. Scan the multiple alignment for polymorphic sites. At each site, the slice of the alignment composed of nucleotides contributed by every sequence that was locally aligned is examined for mismatches. The Bayesian SNP detection algorithm calculates the probability that such mismatches are the result of true polymorphism as opposed to sequencing error. Likely polymorphic sites are recorded as SNP candidates. The SNP detection feature is enabled with the “-screenSnps” option (additional parameters, such as setting prior polymorphism rates or the SNP probability threshold, and enabling pre-screening steps, are explained in the online documentation). Figure 3 shows the site of a SNP candidate in the multiple alignment in the previous example. This SNP is found within members of one alternatively spliced group of EST sequences, and is automatically tagged by the SNP detection algorithm implemented within POLYBAYES (see Note 3).

A similar procedure is applicable for a wide range of scenarios where sequence fragments (e.g., ESTs, random genomic shotgun reads, BAC-end reads, sequenced restriction fragments, etc.) are organized with the help of a genome reference sequence, and compared against each other and/or to the reference sequence in search of polymorphic sites.
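The logic of the paralog filter and of the Bayesian slice calculation (steps 6 and 7) can be illustrated with the following Python sketch. It is a simplified illustration and not the POLYBAYES implementation: it assumes uniform base composition, spreads the prior polymorphism rate equally over all polymorphic permutations, ignores the transition/transversion ratio, and uses a Poisson tail test with arbitrary default thresholds for the paralog check. All function and parameter names are hypothetical.

```python
import math
from itertools import product
from typing import Sequence

BASES = "ACGT"


def phred_to_error(q: float) -> float:
    """Convert a PHRED-scaled base quality value to an error probability."""
    return 10 ** (-q / 10)


def looks_paralogous(quals: Sequence[int], n_discrepancies: int,
                     poly_rate: float = 0.001, alpha: float = 0.001) -> bool:
    """Flag a read with far more mismatches to the anchor than expected.

    Expected count = summed base error probabilities + poly_rate * length.
    The Poisson test and both default values are assumptions of this sketch.
    """
    expected = sum(phred_to_error(q) for q in quals) + poly_rate * len(quals)
    p_fewer = sum(math.exp(-expected) * expected ** k / math.factorial(k)
                  for k in range(n_discrepancies))
    return (1.0 - p_fewer) < alpha  # P(at least n_discrepancies) is tiny


def snp_probability(calls: str, quals: Sequence[int],
                    poly_rate: float = 0.001) -> float:
    """Posterior probability that a slice of N >= 2 base calls is polymorphic."""
    n = len(calls)
    n_polymorphic = 4 ** n - 4  # all permutations except AAAA.., CCCC.., etc.
    total = 0.0
    polymorphic = 0.0
    for perm in product(BASES, repeat=n):  # enumerate all 4^N permutations
        is_mono = len(set(perm)) == 1
        prior = (1 - poly_rate) / 4 if is_mono else poly_rate / n_polymorphic
        likelihood = 1.0
        for true_base, called, q in zip(perm, calls, quals):
            err = phred_to_error(q)
            likelihood *= (1 - err) if true_base == called else err / 3
        total += prior * likelihood
        if not is_mono:
            polymorphic += prior * likelihood
    return polymorphic / total


print(looks_paralogous([30] * 400, n_discrepancies=8))
print(snp_probability("AG", [40, 40]), snp_probability("AG", [10, 10]))
```

Because the enumeration grows as 4^N, this brute-force form is practical only for shallow slices; it is meant to show how base quality and depth shift the posterior, not how a production tool would compute it.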

3.3.2. SNP Discovery in PCR Product Sequences

The second scenario is a genotyping application. The goal is to assay a set of individuals for the presence of polymorphic sites in a small region of interest (such as an exon of a gene). A primer pair is available to amplify the region from genomic DNA. The region is amplified from each individual, and the amplicon is sequenced. Whenever an individual is heterozygous at a given site, the sequence shows an ambiguous (heterozygous) peak. Use POLYPHRED, a software package specifically developed for heterozygote detection, to identify heterozygous positions within sequence traces. The procedure is as follows:

1. Process the sequence traces, each representing the double-stranded, genomic DNA of a single individual, with the PHRED base-calling program. This time, in addition to the trace files and the PHD format sequence files central to the CONSED file structure, also create POLY format trace analysis files. This is done by invoking PHRED with the “-dd” command line option to specify the location (the “poly_dir” subdirectory within the CONSED structure) where these files are to be written. At the end of this step, a POLY file is present for each of the sequence traces, containing detailed numeric information about the trace characteristics at the position of each called nucleotide.
2. Create a multiple alignment of the sequences representing each of the genotyped individuals. Use the PHRAP fragment assembly program (24) for this purpose. To enable further analysis of the multiple alignment, invoke PHRAP with the “-new_ace” command line option. This will cause the program to produce an ACE format output file that is suitable for direct analysis by the POLYPHRED program. The ACE format output file can also be directly loaded into the viewer-editor program CONSED for visual review of the multiple alignment.
3. Run POLYPHRED on the multiple alignment to detect polymorphic sites. Using the “-ace” option, specify the ACE format PHRAP output file created in the previous step when invoking POLYPHRED. The program analyzes the multiple alignment and tags the sites of candidate SNPs, as identified by likely heterozygous peaks within sequence traces. Figure 4 shows a section of a multiple alignment containing the site of a SNP, together with examples of sequence traces representing individuals homozygous for each of the two alleles, and a heterozygote.
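To give a feel for the kind of trace characteristics involved in heterozygote detection, the Python sketch below flags a trace position as a heterozygote candidate from three numbers: the primary peak height, the height of a second peak under it, and the typical peak height and quality in the neighborhood. The data structure and every threshold are invented for illustration; they are not the rules or the POLY file fields used by POLYPHRED.

```python
from typing import NamedTuple


class TracePosition(NamedTuple):
    """Hypothetical summary of one called base in a sequence trace."""
    primary_peak: float         # height of the called base's peak
    secondary_peak: float       # height of the largest peak under it
    neighbourhood_peak: float   # typical primary peak height nearby
    neighbourhood_quality: int  # typical base quality value nearby


def is_heterozygote_candidate(pos: TracePosition,
                              min_secondary_ratio: float = 0.3,
                              max_primary_ratio: float = 0.75,
                              min_quality: int = 20) -> bool:
    """Apply three illustrative tests; all thresholds are assumptions."""
    if pos.primary_peak <= 0 or pos.neighbourhood_peak <= 0:
        return False
    # A second allele shows up as a sizeable peak under the primary peak.
    second_peak_present = (pos.secondary_peak / pos.primary_peak
                           >= min_secondary_ratio)
    # Two alleles share the signal, so the primary peak drops in height.
    primary_peak_dropped = (pos.primary_peak / pos.neighbourhood_peak
                            <= max_primary_ratio)
    # Ambiguity in a poor-quality stretch of trace is not informative.
    good_context = pos.neighbourhood_quality >= min_quality
    return second_peak_present and primary_peak_dropped and good_context


homozygous = TracePosition(1000.0, 40.0, 1000.0, 35)
heterozygous = TracePosition(550.0, 480.0, 1000.0, 35)
print(is_heterozygote_candidate(homozygous),
      is_heterozygote_candidate(heterozygous))
```

Positions flagged in this way would still be reviewed in the context of the multiple alignment, as in step 3, since a single trace cannot distinguish a heterozygote from a local sequencing artifact.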

4. Notes

1. To facilitate the retrieval of the corresponding sequence traces, make a list of local EST read names available in the header information for each EST.
2. The following URL contains detailed instructions: http://genome.wustl.edu/est/est_search/ftp_guide.html
3. Additional information is provided in the output files produced by the program (for more detail, see the online documentation).

References

1. Sachidanandam, R., Weissman, D., Schmidt, S. C., Kakol, J. M., Stein, L. D., Marth, G., et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933.
2. Watterson, G. A. (1975) On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7, 256–276.
3. Halushka, M. K., Fan, J. B., Bentley, K., Hsie, L., Shen, N., Weder, A., et al. (1999) Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet. 22, 239–247.
4. Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., et al. (1999) Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22, 231–238.
5. Sherry, S. T., Ward, M. H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M., and Sirotkin, K. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311.
6. Buetow, K. H., Edmonson, M. N., and Cassidy, A. B. (1999) Reliable identification of large numbers of candidate SNPs from public EST data. Nat. Genet. 21, 323–325.
7. Buetow, K. H., Edmonson, M., MacDonald, R., Clifford, R., Yip, P., Kelley, J., et al. (2001) High-throughput development and characterization of a genomewide collection of gene-based single nucleotide polymorphism markers by chip-based matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Proc. Natl. Acad. Sci. USA 98, 581–584.
8. Irizarry, K., Kustanovich, V., Li, C., Brown, N., Nelson, S., Wong, W., and Lee, C. J. (2000) Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nat. Genet. 26, 233–236.
9. Marth, G. T., Korf, I., Yandell, M. D., Yeh, R. T., Gu, Z., Zakeri, H., et al. (1999) A general approach to single-nucleotide polymorphism discovery. Nat. Genet. 23, 452–456.
10. Altshuler, D., Pollara, V. J., Cowles, C. R., Van Etten, W. J., Baldwin, J., Linton, L., and Lander, E. S. (2000) An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407, 513–516.
11. Taillon-Miller, P., Gu, Z., Li, Q., Hillier, L., and Kwok, P. Y. (1998) Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res. 8, 748–754.
12. Mullikin, J. C., Hunt, S. E., Cole, C. G., Mortimore, B. J., Rice, C. M., Burton, J., et al. (2000) An SNP map of human chromosome 22. Nature 407, 516–520.

13. Marth, G. T. S., G., Yeh, R., Davenport, R., Agarwala, R., Church, D., Wheelan, S., et al. The structure of single-nucleotide variation in overlapping regions of human genome sequence. In preparation.
14. Fu, Y. X. (1995) Statistical properties of segregating sites. Theor. Popul. Biol. 48, 172–197.
15. Marth, G., Yeh, R., Minton, M., Donaldson, R., Li, Q., Duan, S., et al. (2001) Single-nucleotide polymorphisms in the public domain: how useful are they? Nat. Genet. 27, 371–372.
16. Reich, D. E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P. C., Richter, D. J., et al. (2001) Linkage disequilibrium in the human genome. Nature 411, 199–204.
17. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
18. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921.
19. Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., et al. (2001) The sequence of the human genome. Science 291, 1304–1351.
20. Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680.
21. Ewing, B. and Green, P. (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194.
22. Ewing, B., Hillier, L., Wendl, M. C., and Green, P. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185.
23. The Sanger Centre and the Washington University Genome Sequencing Center (1998) Toward a complete human genome sequence. Genome Res. 8, 1097–1108.
24. Green, P. http://www.phrap.org
25. Myers, E. W., Sutton, G. G., Delcher, A. L., Dew, I. M., Fasulo, D. P., Flanigan, M. J., et al. (2000) A whole-genome assembly of Drosophila. Science 287, 2196–2204.
26. Wang, D. G., Fan, J. B., Siao, C. J., Berno, A., Young, P., Sapolsky, R., et al. (1998) Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 280, 1077–1082.

27. Picoult-Newberg, L., Ideker, T. E., Pohl, M. G., Taylor, S. L., Donaldson, M. A., Nickerson, D. A., and Boyce-Jacino, M. (1999) Mining SNPs from EST databases. Genome Res. 9, 167–174.
28. Garg, K., Green, P., and Nickerson, D. A. (1999) Identification of candidate coding region single nucleotide polymorphisms in 165 human genes using assembled expressed sequence tags. Genome Res. 9, 1087–1092.
29. Hillier, L. D., Lennon, G., Becker, M., Bonaldo, M. F., Chiapelli, B., Chissoe, S., et al. (1996) Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6, 807–828.
30. Adams, M. D., Soares, M. B., Kerlavage, A. R., Fields, C., and Venter, J. C. (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat. Genet. 4, 373–380.
31. Nickerson, D. A., Tobe, V. O., and Taylor, S. L. (1997) PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res. 25, 2745–2751.
32. Dawson, E., Chen, Y., Hunt, S., Smink, L. J., Hunt, A., Rice, K., et al. (2001) A SNP resource for human chromosome 22: extracting dense clusters of SNPs from the genomic sequence. Genome Res. 11, 170–178.
33. Kwok, P.-Y., Carlson, C., Yager, T. D., Ankener, W., and Nickerson, D. A. (1994) Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products. Genomics 23, 138–144.
34. Nickerson, D. A., Taylor, S. L., Weiss, K. M., Clark, A. G., Hutchinson, R. G., Stengard, J., et al. (1998) DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nat. Genet. 19, 233–240.
35. Nickerson, D. A., Taylor, S. L., Fullerton, S. M., Weiss, K. M., Clark, A. G., Stengard, J. H., et al. (2000) Sequence diversity and large-scale typing of SNPs in the human apolipoprotein E gene. Genome Res. 10, 1532–1545.
36. Gordon, D., Abajian, C., and Green, P. (1998) Consed: a graphical tool for sequence finishing. Genome Res. 8, 195–202.
37. Boguski, M. S., Lowe, T. M., and Tolstoshev, C. M. (1993) dbEST: database for “expressed sequence tags”. Nat. Genet. 4, 332–333.
38. Wheeler, D. L., Church, D. M., Lash, A. E., Leipe, D. D., Madden, T. L., Pontius, J. U., et al. (2001) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 29, 11–16.

39. Smit, A. F. A. and Green, P., http://ftp.genome.washington.edu/RM/RepeatMasker.html
40. Ning, Z., Cox, A. J., and Mullikin, J. C. (2001) SSAHA: A fast search method for large DNA databases. Genome Res. 11, 1725–1729.
41. Collins, F. S., Patrinos, A., Jordan, E., Chakravarti, A., Gesteland, R., and Walters, L. (1998) New goals for the U.S. Human Genome Project: 1998–2003. Science 282, 682–689.
42. Kwok, P.-Y. (2000) Approaches to allele frequency determination. Pharmacogenomics 1, 231–235.
43. Nickerson, D. A., http://droog.mbt.washington.edu/PolyPhred.html

Am. J. Hum. Genet. 71:854–862, 2002

Human Diallelic Insertion/Deletion Polymorphisms

James L. Weber,1 Donna David,1 Jeremy Heil,1,* Ying Fan,1 Chengfeng Zhao,1 and Gabor Marth2

1Center for Medical Genetics, Marshfield Medical Research Foundation, Marshfield, WI; and 2National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD

We report the identification and characterization of 2,000 human diallelic insertion/deletion polymorphisms (indels) distributed throughout the human genome. Candidate indels were identified by comparison of overlapping genomic or cDNA sequences. Average confirmation rate for indels with a ≥2-nt allele-length difference was 58%, but the confirmation rate for indels with a 1-nt length difference was only 14%. The vast majority of the human diallelic indels were monomorphic in chimpanzees and gorillas. The ratio of deletion:insertion mutations was 4.1. Allele frequencies for the indels were measured in Europeans, Africans, Japanese, and Native Americans. New alleles were generally lower in frequency than old alleles. This tendency was most pronounced for the Africans, who are likely to be closest among the four groups to the original modern human population. Diallelic indels comprise ∼8% of all human polymorphisms. Their abundance and ease of analysis make them useful for many applications.
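As a quick arithmetic check on two of the headline figures above, the sketch below recomputes the deletion:insertion ratio from the event counts the article reports in table 3, and the ∼8% share from the ∼20% indel fraction and the ∼60:40 STRP:diallelic split quoted in the Discussion. The variable names are illustrative only, and treating the 8% as the product of those two rounded figures is an assumption about how the number arises, not something the article states.

```python
# Event counts as reported in table 3 of the article.
mutation_events = {
    "deletion": 1348,
    "insertion": 331,
    "other": 161,
    "no_amplification": 160,
}

total_indels = sum(mutation_events.values())
deletion_insertion_ratio = mutation_events["deletion"] / mutation_events["insertion"]

# Rounded figures quoted in the Discussion (approximate in the article itself).
indel_fraction_of_polymorphisms = 0.20  # indels as a share of all polymorphisms
diallelic_fraction_of_indels = 0.40     # diallelic part of the ~60:40 split

# Assumption: the ~8% figure is the product of the two rounded shares.
diallelic_share = indel_fraction_of_polymorphisms * diallelic_fraction_of_indels

print(f"total indels: {total_indels}")
print(f"deletion:insertion ratio: {deletion_insertion_ratio:.1f}")
print(f"diallelic indels as share of polymorphisms: {diallelic_share:.0%}")
```

Under these assumptions the counts sum to the 2,000 indels studied, and both rounded figures agree with the values quoted in the abstract.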

Received May 13, 2002; accepted for publication July 9, 2002; electronically published September 4, 2002.
Address for correspondence and reprints: Dr. James Weber, Center for Medical Genetics, Marshfield Medical Research Foundation, 1000 North Oak Avenue, Marshfield, WI 54449. E-mail: [email protected]fldclin.edu
* Present affiliation: Celera, Rockville, MD.
© 2002 by The American Society of Human Genetics. All rights reserved. 0002-9297/2002/7104-0014$15.00

Introduction

Nearly all genetics research makes use of DNA sequence variants. Despite this, we know surprisingly little about the numbers and types of variants within human populations. DNA polymorphisms are usually defined as naturally occurring variants for which the most common allele has a frequency of no more than 99% (Gelehrter and Collins 1990).

The vast majority of human DNA polymorphisms can be split into two groups: those based on nucleotide substitutions (commonly called “SNPs”) and those based on insertion or deletion of one or more nucleotides (indels). Indels can in turn be divided into those with multiple alleles (multiallelic) and those with only two alleles (diallelic). Nearly all of the multiallelic indels are based on tandem repeats, mostly STRs. STRPs (also called “microsatellites”) have been the predominant type of polymorphism used in human genetic studies since about 1990. More recently, millions of candidate SNPs have been identified and are beginning to be applied (International SNP Map Working Group 2001).

In contrast, diallelic indels have received very little attention. Diallelic indels vary greatly in length difference between alleles. In rare cases, the length difference between alleles can be tens or even hundreds of kilobase pairs (Lupski et al. 1996). Some diallelic indels differ by the insertion of a retroposon, such as an Alu or L1 element (Watkins et al. 2001). However, by far the largest group of diallelic indels are those with allele-length differences of relatively few nucleotides. The most recently published broad surveys of short indels (covering 80 polymorphisms) were by Krawczak and Cooper (Cooper and Krawczak 1991; Krawczak and Cooper 1991). In the present article, we report basic properties of human diallelic indels determined by the analysis of 2,000 polymorphisms.

Material and Methods

For the Unigene clusters and the SNP Consortium cliques (see the “Results” section), sequences were aligned using the Fragment Assembly System (Genetics Computer Group). Because the single-pass Unigene cDNA sequences were of relatively low quality, candidate indels were considered only if the cluster contained ≥4 reads and if the minor allele appeared in at least two reads. All BAC-end/BAC overlaps and 14% of the BAC/BAC overlaps were aligned using a customized version of BLAST that distributed jobs nightly to idle laboratory computers. Candidates were considered from alignments of ≥90% overall identity. Short BAC-end sequences (Zhao et al. 2000) were paired with full BAC sequences from the Sanger, MIT, and Baylor sequencing centers. Sanger Institute public Acedb files were parsed to identify pairs of overlapping BACs (see the Sanger Institute Web site). The remaining 86% of BAC/BAC overlap candidates were obtained by collection of large insert clone (predominantly BAC) sequences plus associated PHRAP nucleotide-quality values from the public genome sequencing centers (International Human Genome Sequencing Consortium 2001). Overlaps between pairs of clones (∼1.1 Gb of sequence; G. Marth, G. Schuler, R. Yeh, R. Davenport, R. Agarwala, D. Church, S. Wheelan, J. Baker, M. Ward, M. Kholodov, L. Phan, H. Harpending, A. Chakravarti, P.-Y. Kwok, and S. Sherry, unpublished data) were detected primarily with a BLAST similarity search. Putative overlaps were filtered using stringent criteria, to avoid overlaps that represent duplicated segments of the genome. For each overlapping clone pair, a precise, nucleotide-wise alignment was produced using the CROSS_MATCH banded Smith-Waterman dynamic-programming alignment algorithm. These alignments were analyzed with a modified version of POLYBAYES SNP-discovery software (Marth et al. 1999). For substitutions, POLYBAYES computes an SNP confidence value by using the PHRED or PHRAP nucleotide-quality values of the sequences aligned at the candidate polymorphic site. Because nucleotide-quality values do not directly provide information on the likelihood of deleted nucleotides, a similar confidence value cannot be computed for candidate indels. Instead, an experimental, heuristic algorithm was used on the basis of the logic that high-quality, well-resolved nucleotides are unlikely to represent artifactual insertions or deletions due to sequencing error. Accordingly, a high confidence value was assigned to a candidate if (1) insertion nucleotides were of high sequence quality and (2) nucleotides flanking the site of polymorphism in both the long and short alleles were of high sequence quality.

Candidate indels were further screened by manual inspection. To avoid multiallelic polymorphisms, we excluded sequences if, at the site of polymorphism, the long allele contained more than five uninterrupted, tandem mononucleotide repeats (e.g., (A)6) or more than three uninterrupted, tandem repeats with 2–6-nt repeat lengths (e.g., (AC)4 or (AAAG)5). Candidates were also rejected if they contained >10 unknown nucleotides within the PCR product, if the PCR product fell entirely within an interspersed repetitive element, or if the PCR product contained >10 uninterrupted STRs outside the site of polymorphism (this last criterion was instituted after the first 751 indels). Some putative indels from the Unigene clusters were also manually rejected because of a relatively high level of mismatch among the aligned sequences, indicating low sequence quality. PCR primers were selected using Primer3 software. All PCR primers were outside the putative polymorphic regions.

PCR amplifications were performed in 96-well microtiter plates in 4-μl volumes with the following final concentrations: 10 mM Tris (pH 8.3); 50 mM KCl; 1.5 mM MgCl2; 0.001% gelatin; 0.12 U Taq DNA polymerase (Sigma D1806); 100 μM each dCTP, dGTP, and dTTP; 1.25 μM dATP; 0.28 μCi of α-33P-dATP (12,500 Ci/mmol, 10 mCi/ml; Amersham); 0.6 pmol each primer; and 32 ng of genomic DNA template. Samples were cycled 27 times through steps of 30 s at 94°C, 75 s at 55°C, and 30 s at 72°C, followed by a final 6 min at 72°C. An equal volume of loading buffer that contained 0.3% xylene cyanol, 0.3% bromophenol blue, 10 mM EDTA (pH 8.0), and 90% (v/v) formamide was added to each amplified product. Samples were denatured at 95°C for 10 min and were resolved on 6.5% polyacrylamide gels that contained 7.7 M urea. After electrophoresis, gels were transferred to Whatman 3 MM chromatography paper and were dried. Amplified products were visualized on autoradiographs after exposure for 6 h–30 d.

DNA templates for testing of candidate indels included several individual human DNA samples, pools of human DNA, and chimpanzee and/or gorilla DNA. Individual DNA samples included CEPH family DNA and Polymorphism Discovery Resource (PDR) samples 1–8. PDR samples are from a mix of American donors of European (42%), African (24%), Asian (24%), and Native American (10%) ancestries (see the Coriell Cell Repositories DNA Polymorphism Discovery Resource Web site) (Collins et al. 1998). Five DNA pools were prepared using equal amounts of DNA from 21 Africans (12 Mbuti and 9 Biaka Pygmies), 25 Japanese, 25 Native Americans (14 Karitiana and 11 Rondonian Surui, both from the Amazon), 100 Europeans, and 44 PDR samples (1–44). The African, Japanese, and Native American samples were kindly provided by Ken Kidd (see the ALFRED Web site). The European samples were obtained from unidentified blood samples from 100 consecutive Marshfield Clinic patients. On the basis of a recent Marshfield-area population genetics survey, ∼99% of Marshfield Clinic patients are of European ancestry, and nearly all of these are of northern or central European ancestry. The PDR pool was not used for the first 333 indels.

Allele frequencies were estimated from the DNA pools by scanning of exposed phosphorscreens with a Storm 860 Imaging System (Molecular Dynamics). Frequencies were averaged from two independent PCR amplifications of each pool. Frequencies were not obtained for ∼10% of the indels because of weak bands or interfering nonspecific PCR products. To gauge the accuracy of our method for measurement of allele frequencies, we amplified three of the indels by using DNA from 25 individuals separately and also using a pool that contained equal amounts of DNA from those 25 individuals. The three indels had frequencies for the most common allele (in these 25 individuals) of 0.50, 0.74, and 0.98. Measured differences in allele frequencies between the individual genotypes and the pool ranged from 0.004 to 0.037 and averaged 0.018. From this test, we conclude that the great majority of our

allele-frequency estimates made using the DNA pools are within 0.05 of the true allele frequencies.

Results

Identification and Confirmation

We identified candidate diallelic indels by comparing overlapping human genomic or cDNA sequences. The majority of the 2,000 confirmed indels were derived from BAC/BAC overlaps, although substantial numbers also came from Unigene cDNA sequence assemblies, BAC-end/BAC overlaps, and SNP Consortium sequence cliques (table 1). We confirmed candidates by PCR amplification of short (70–220 bp) DNA fragments that encompassed the putative polymorphism, followed by electrophoresis on denaturing polyacrylamide gels. PCR templates for each indel included at least nine individual DNA samples, pools of DNA from different human populations, and at least one great ape sample (for most indels, a single chimpanzee sample). Altogether, we screened a minimum of 360 human chromosomes for each candidate. Criteria for confirmation of the polymorphisms were the presence of no more than two alleles of the expected PCR-product length and the presence of at least one homozygote among the individual DNA samples and/or substantial variation in allele frequency among the different population pools.

Table 1. Sequence Sources and Confirmation Rates

Source                     No. (%)         Confirmation Rate(a) (%)
Unigene clusters           176 (8.8)       40.1
BAC end/BAC overlaps       254 (12.7)      40.5
BAC/BAC overlaps           1,477 (73.8)    65.6
SNP Consortium cliques     93 (4.7)        69.4

(a) Of the PCR primer pairs that supported successful amplification, the fraction that led to confirmed polymorphisms. Data in this table cover only the 2,000 indels with a ≥2-nt length difference between alleles.

Of the total 3,721 primer pairs tested, 92.7% supported amplification of DNA of the expected length. Of the primer pairs with successful PCR, 58.0% led to confirmed polymorphisms. The Unigene and BAC-end/BAC overlap sequence sources had the lowest confirmation rates; the BAC/BAC and SNP Consortium sources had the highest rates (table 1). The great majority of unconfirmed candidates gave a PCR product of only a single length in all individuals and pools. Approximately 3% of the primer pairs that supported amplification yielded both long and short alleles in approximately equal amounts in all individuals and pools. We believe that most of the aligned sequences in these cases are paralogs, that is, nearly identical sequences from two or more distinct genomic locations with long alleles at one locus and short alleles at a second.

As shown in table 2, confirmation rates increased dramatically as the length difference between alleles increased from 1 to 4 nt. Above 4 nt, confirmation rates slowly drifted downward. All of the 2,000 indels described in the present article have ≥2 nt between alleles. The confirmation rate for 1-nt length differences was so low that we abandoned efforts on this group early in the project. Most (85%) of the 1-nt-length-difference candidates had mononucleotide tandem repeats of ≥2 nt (for example, (A)3) in the long allele. These candidates had a confirmation rate of only 11%. For the remaining 15% of candidates without mononucleotide runs, the confirmation rate was 31%.

Table 2. Confirmation Rates by Allele-Length Difference

Allele-Length Difference (in nt)    Primer Pairs Tested(a)    Confirmation Rate (%)
1                                   343                       14.3
2                                   1,037                     46.9
3                                   719                       60.8
4                                   710                       69.6
5                                   273                       66.7
6                                   154                       61.7
≥7                                  558                       54.8

(a) Numbers include only those primer pairs that supported successful PCR amplification.

Evolution

To study evolution of the indels and to determine ancestral state, we amplified DNA from chimpanzees and gorillas. We attempted to amplify the first 100 indels in a set of six individual gorilla samples and/or in pools of two to five chimpanzee DNAs. Of the 87 indels that were amplified successfully, only three showed any evidence of length polymorphism in the ape DNA, and in none of these cases did both ape alleles match both human alleles in length. For a second group of 100 indels, we amplified DNA from four unrelated chimpanzees. Only one of these indels displayed length variation in the chimpanzees, and the alleles in that case were also different than they were in humans. For the remaining 1,800 indels, the single chimpanzee DNA sample tested appeared to carry both human alleles in only two cases. Therefore, only very rarely are the human length variations shared with chimpanzees or gorillas. Our data indicate that nearly all of the 2,000 indels arose since the divergence of the human/chimpanzee/gorilla common ancestors. The monomorphic alleles in chimpanzees and gorillas very likely represent the ancestral states of these sequences.

Typing of the ape DNA therefore allowed us to split the 2,000 indels into four mutation groups (table 3). Deletion or insertion mutations were indicated when the chimpanzee allele matched exactly in length, respectively, the long or short human allele. We also observed

a significant number of cases (∼8%) in which the chimpanzee allele was different in length from either human allele (see “Other” in table 3). The ratio of deletions:insertions was 4.1. Classification of the polymorphisms as either deletions or insertions offered a convenient way to compare and contrast these two types of mutations.

Table 3. Mutation Events Leading to Diallelic Indels

Event                   No. (%)
Deletion                1,348 (67.4)
Insertion               331 (16.6)
Other(a)                161 (8.0)
No amplification(b)     160 (8.0)

(a) The amplified chimpanzee/gorilla “allele” had a different length than either human allele.
(b) When chimpanzee or gorilla DNA was used as template, the human PCR primers did not amplify any DNA fragments close in length to the human PCR products.

We also compared allele lengths for indels that were amplified using both gorilla and chimpanzee DNA. Of the 65 indels for which we had data from both apes and for which the gorilla allele matched either the long or short human allele, chimpanzee alleles were the same length as the gorilla allele in 61 cases. This provides further support that the ape DNA represents the ancestral state. We also examined a group of 70 indels for which either the chimpanzee (most cases) or gorilla DNA was different in length from either human allele (taken from the “Other” category in table 3). In 30 cases (43%), the allele from the second ape species matched one of the human alleles in length. These cases can easily be explained by an independent deletion or insertion event within the PCR product in the chimpanzee or gorilla ancestral line after divergence from the common human/chimpanzee/gorilla ancestor. In 32 cases (46%), both chimpanzee and gorilla alleles were the same length but differed in length from either human allele. These cases can most easily be explained by two separate indel mutations in the human ancestral line after divergence from the common ancestor. The first mutation became fixed in the human line, and the second led to the current, observed polymorphism. This interpretation is consistent with the lower nucleotide diversity observed in humans compared to chimpanzees or gorillas (Kaessmann et al. 2001). In eight cases (11%), the chimpanzee and gorilla alleles differed in length from each other, as well as from both human alleles. These cases can be explained by the occurrence of independent indel mutations in both chimpanzee and gorilla lines after divergence. In six of these last eight cases, a relatively highly mutable mononucleotide run of 6–15 nt was present within the PCR product (but not at the site of the human polymorphism).

Distribution of length differences between long and short alleles for the 2,000 indels is shown in table 4. There were approximately equal numbers of indels with 2-, 3-, or 4-nt length differences, and these three groups comprised 71% of the total. Beyond 4-nt length differences, the numbers of indels dropped off with increasing length difference. There were 10 indels with a ≥30-nt length difference; the greatest length difference was 55 nt. Indels with greater length differences are increasingly more difficult to detect, because the length difference adversely affects sequence-alignment algorithms. Insertions had a modest dearth of 2-nt length differences and an excess of 4-nt length differences compared to deletions, but, overall, the two groups did not differ greatly.

Table 4. Allele-Length–Difference Distributions

Allele-Length Difference (in nt)    All Indels No. (%)    Deletions No. (%)    Insertions No. (%)
2                                   486 (24.3)            350 (26.0)           61 (18.4)
3                                   437 (21.8)            301 (22.3)           73 (22.1)
4                                   494 (24.7)            310 (23.0)           87 (26.3)
5                                   182 (9.1)             123 (9.1)            29 (8.8)
6                                   95 (4.8)              66 (4.9)             19 (5.7)
7                                   51 (2.6)              36 (2.7)             10 (3.0)
8                                   48 (2.4)              26 (1.9)             16 (4.8)
9                                   24 (1.2)              14 (1.0)             6 (1.8)
10                                  30 (1.5)              21 (1.6)             4 (1.2)
11                                  27 (1.4)              17 (1.3)             5 (1.5)
12                                  15 (.8)               10 (.7)              2 (.6)
13                                  15 (.8)               14 (1.0)             0 (.0)
14                                  10 (.5)               6 (.4)               2 (.6)
15                                  6 (.3)                6 (.4)               0 (.0)
≥16                                 80 (4.0)              48 (3.6)             17 (5.1)

The 2,000 diallelic indels mapped to all 24 chromosomes. The distribution of indels among the chromosomes, however, was biased. For example, 45% of the indels mapped to chromosomes 5, 6, 7, and 22, whereas only 2.8% of the indels mapped to chromosomes 4 and 8. We believe that this bias can be largely or entirely explained by differential availability among the chromosomes of overlapping sequences at the time of indel development.

Allele Frequencies

Using DNA pools from different human populations, we measured allele frequencies for ∼90% of the diallelic indels. Distributions of long-allele frequencies in five populations plus an average of the populations are displayed in figure 1. As expected from comparison (in most cases) of only two overlapping sequences, the indels generally had high informativeness. Fifty-one percent of the indels had population-average frequencies of the minor allele that were in the range of 30%–50%, and only 8% had population-average minor-allele frequencies of 0–10%. Note that, for indels that arose by DNA insertion, long-allele frequencies were shifted toward low values. For indels that arose by DNA deletion, the shift was toward high values (see also table 5). This trend was most pronounced for the Africans and was least pronounced for the Native Americans. The European and “mixed” population (PDR and population average) distributions were hump-shaped, with relatively few indels at extreme frequencies, whereas the Native American distributions were bowl- or U-shaped, with the greatest number of indels at frequency extremes. Among the “unmixed” populations (Africans, Europeans, Japanese, and Native Americans) for both deletions and insertions, Africans had the greatest mean long-allele–frequency deviations from 0.50, and Native Americans had the highest SDs (table 5). The average long-allele frequencies of all indels combined are >0.50 because of the predominance of deletions.

We next considered indels that were informative or uninformative in only one or two populations (table 6). Europeans, Africans, and Europeans/Africans combined had by far the largest number of indels informative in only those populations. Africans, Native Americans, and Japanese/Native Americans combined had the largest numbers of uninformative indels.

As a final comparison of the populations, we plotted long-allele frequencies from one population against the others in pairs. Linear correlation coefficients for these plots are shown in table 7. Not unexpectedly, Africans were the most divergent population. Among the unmixed populations, Europeans/Japanese and Japanese/Native Americans had the highest correlations. Correlations with the PDR pool were generally high, with the European/PDR value being highest of all.

Complete information for all 2,000 indels is available at the dbSNP (Sherry et al. 2000) and Marshfield Web sites. Tables with indel data, including population allele frequencies, can be downloaded from the Marshfield Web site.

Discussion

The overall confirmation rate for the 2,000 diallelic indels was 58%. When the highest-quality candidate polymorphisms were utilized (accounting for PHRED or PHRAP nucleotide-quality values), this rate climbed to ∼70% (table 1). Even so, the rate for the indels was lower than

Figure 1. Long-allele–frequency distributions. Distributions are shown for the five indicated populations plus a population average. Gray bars indicate the deletions, and black bars indicate the insertions.
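A histogram of the kind described in this caption could be tabulated roughly as follows. This is a minimal sketch, not the authors' procedure: the function name, the choice of ten equal-width bins, and the sample frequencies are all hypothetical, and the article does not state the bin width used in the figure.

```python
def bin_frequencies(frequencies, n_bins=10):
    """Count allele frequencies (each in [0, 1]) in n_bins equal-width bins.

    A frequency of exactly 1.0 is placed in the last bin.
    """
    counts = [0] * n_bins
    for freq in frequencies:
        if not 0.0 <= freq <= 1.0:
            raise ValueError(f"frequency out of range: {freq}")
        index = min(int(freq * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical long-allele frequencies for one population, keyed by mutation
# type (made-up values for illustration; not data from the article).
example_long_allele_freqs = {
    "deletion": [0.95, 0.85, 0.65, 0.55, 1.0],
    "insertion": [0.05, 0.15, 0.35, 0.45],
}

# One histogram per mutation type, mirroring the gray/black bars in the figure.
histograms = {
    mutation_type: bin_frequencies(freqs)
    for mutation_type, freqs in example_long_allele_freqs.items()
}

for mutation_type, counts in histograms.items():
    print(mutation_type, counts)
```

Keeping deletions and insertions in separate histograms reflects the caption's two bar colors; any real analysis would substitute the measured pool frequencies for the made-up lists.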

published confirmation rates for SNPs of ∼83% (International SNP Map Working Group 2001; Marth et al. 2001). Possible reasons for this difference include the lack of sequence-quality values for missing nucleotides (see the “Material and Methods” section), increased rates of indel-sequencing errors compared to substitutions, and increased artifacts that occurred during Escherichia coli subcloning.

Table 5. Long-Allele Frequencies

Mean ± SD frequency (no. of indels) in:

Indel set     Africans             Europeans            Japanese             Native Americans     PDR                  Population Average
All indels    .60 ± .27 (1,802)    .53 ± .25 (1,805)    .53 ± .29 (1,795)    .53 ± .32 (1,804)    .54 ± .24 (1,519)    .55 ± .24 (1,806)
Deletions     .67 ± .23 (1,212)    .56 ± .24 (1,213)    .57 ± .28 (1,207)    .56 ± .31 (1,213)    .58 ± .22 (1,023)    .59 ± .22 (1,214)
Insertions    .34 ± .25 (310)      .40 ± .25 (311)      .40 ± .28 (309)      .40 ± .31 (310)      .38 ± .23 (254)      .39 ± .23 (311)

Confirmation rates for candidate indels with 1- and, to a lesser degree, 2-nt allele-length differences were especially low (table 2). This is important because indels with 1-nt allele-length differences are most abundant of all (Antonarakis et al. 2000; Berger et al. 2001; Halangoda et al. 2001; Wicks et al. 2001; also see below). The confirmation rate improved somewhat (to 31%) for indels with 1-nt allele-length differences that did not contain runs of mononucleotides at the site of polymorphism.

We observed a ratio of deletion:insertion mutation events leading to the indels of 4.1 (table 3). This value agrees reasonably well with the ratio of 2.7 taken from the Human Gene Mutation Database and with somatic mutation studies of the lacI (ratio 3.7) and p53 (ratio 3.4) genes (Halangoda et al. 2001). Support for our categorization of the indels as either insertions or deletions is justified by large differences in allele-frequency distributions between the two groups (fig. 1) and by large differences between the two groups in mechanism of mutation (J. L. Weber, R. Boudreau, and D. David, unpublished data).

The finding that nearly all human diallelic indels are apparently monomorphic in chimpanzees and gorillas closely matches results reported previously for SNPs (Hacia et al. 1999). The average lifetimes for both types of polymorphisms appear to be significantly shorter than the ∼6 million years since the common human/chimpanzee ancestor (Clark 1997; Miller et al. 2001).

When considering allele frequencies for the indels, it is important to recognize the ascertainment bias in informativeness due to the nature of the overlapping sequences used to identify the indels. The great majority of the sequence overlaps that we used contained only two sequences. With only two sequences, the probability of detecting the polymorphism is equal to the heterozygosity. In addition, there is a likely population bias in the identification of the polymorphisms. The ancestries of the DNA donors for the bulk of the public human genome sequencing were not reported (International Human Genome Sequencing Consortium 2001). Our data indicate, however, that, of the four populations that we studied, Europeans are closest to the major DNA donors for sequencing. Support for this conclusion comes from the relatively large numbers of indels informative in only the Europeans (or Europeans/Africans) and from the relatively small number uninformative in only the Europeans (table 6). Europeans also had the fewest number of indels with minor-allele frequency <10% or <2%, the highest average heterozygosity at 37% (Japanese and Africans each had 33%, and Native Americans had 30%), and hump-shaped distributions for long-allele frequencies (fig. 1).

Table 6. Numbers of Indels Informative or Uninformative in Single Populations or Pairs of Populations

                    Alone(a)    Africans    Europeans    Japanese    Native Americans
Alone(b)            -           24          7            1           0
Africans            36          -           19           2           1
Europeans           2           1           -            5           1
Japanese            3           1           1            -           1
Native Americans    38          4           2            20          -

NOTE. Numbers of indels that are informative only in each population or each pair of populations are listed in the upper right half of the array (italic); numbers of indels that are uninformative only in each population or each pair of populations are listed in the lower left half of the array (boldface). In this table, informative indels are defined as having allele (long or short allele) frequencies ≥20.0% in the population or pair of populations under consideration and frequencies of the same allele ≤5.0% in the other two or three populations. Uninformative indels are defined as having allele frequencies ≤5.0% in the population or pair of populations under consideration and frequencies ≥20.0% in the other populations.
(a) Numbers of uninformative indels in the individual populations.
(b) Numbers of informative indels in the individual populations.

Even with the limitations described above, some population genetics and evolutionary conclusions can be drawn from our data. Of the four populations studied, African Pygmy new-allele–frequency distributions are clearly closest to the shape expected for neutral alleles in a population of constant size (Fu 1995; Subrahmanyan et al. 2001). The Africans have the greatest bias toward low frequencies for the new alleles (fig. 1 and table 5). Watkins et al. (2001) obtained very similar results for polymorphisms based on insertion of Alu elements and for a relatively small group of SNPs. The Africans appear

Table 7
Correlation Coefficients among Populations Studied

                    Europeans   Japanese   Native Americans   PDR
Africans            .32         .30        .22                .48
Europeans                       .58        .49                .85
Japanese                                   .58                .75
Native Americans                                              .64

NOTE.—Linear correlation coefficients are given for plots of long-allele frequencies between the population pairs.

to be most similar to the original modern human population and to have undergone no severe population bottlenecks (Tishkoff et al. 2000; Jorde et al. 2001).

In contrast, Europeans, Japanese, and Native Americans all appear to have undergone at least one relatively severe population bottleneck. These populations have less bias toward low frequencies for new alleles and have higher SDs for average allele frequencies than the Africans. The Native Americans probably passed through more than one severe bottleneck, since they show bowl-shaped long-allele–frequency distributions (fig. 1). As a population passes through a bottleneck, allele frequencies tend to change rapidly; rare neutral alleles usually drop in frequency, but some increase dramatically. The 36 indels uninformative in only the Africans (table 6) may be examples of the latter case.

Our results are also consistent with many evolutionary trees that have been drawn for modern human populations. Africans are clearly the most distant group compared to the other three (table 7). Of the three "out of Africa" populations, the Japanese and Native Americans are clearly the most closely related pair, as demonstrated by their relatively high correlation coefficient (table 7) and also by the relatively large numbers of indels that are uninformative in only the Native Americans/Japanese (table 6).

Nearly all diallelic polymorphisms, even those with high average informativeness, will have low informativeness in some human populations. As an example, of the 909 indels with population-average long-allele frequencies between 30% and 70%, 176 (19%) had a minor-allele frequency of ≤10% in at least one of the four populations that we studied. Care will have to be taken to ensure that diallelic polymorphisms chosen for generic screening sets have reasonable informativeness in all major world populations.

It is important to keep in mind that, although substitutions (i.e., SNPs) are the most abundant class of human polymorphisms, indels are also quite common. (Although the term "SNP" has occasionally been used to cover indels with a 1-nt allele-length difference, we recommend that this term be restricted to substitutions.) As shown in table 8, indels comprise ∼20% of all human DNA polymorphisms. The numbers in table 8 that are from the Human Gene Mutation Database may be biased toward indels because this catalog contains mostly mutations that severely disrupt gene function. The estimates from the overlapping BACs for the whole genome and specifically for chromosome 22 are less biased and probably are a more accurate reflection of the true situation. The fraction of polymorphisms that are indels in humans is consistent with numbers from three model organisms (table 8). In the many species with more genetic diversity than humans, average spacing between indels will be impressively low.

Table 8
Breakdown of DNA Polymorphisms by Type

Species                          Indels   Substitutions   Reference
Arabidopsis thaliana             37%      63%             Arabidopsis Genome Initiative 2000
Caenorhabditis elegans           25%      75%             Wicks et al. 2001
Drosophila melanogaster          16%      84%             Berger et al. 2001
Homo sapiens:
  Human Gene Mutation Database   30%      70%             Antonarakis et al. 2000
  Overlapping BACs               21%      79%             G. Marth, G. Schuler, R. Yeh, R. Davenport, R. Agarwala, D. Church, S. Wheelan, J. Baker, M. Ward, M. Kholodov, L. Phan, H. Harpending, A. Chakravarti, P.-Y. Kwok, and S. Sherry, unpublished data
  Chromosome 22                  18%      82%             Dawson et al. 2000

Human indel candidates from the ∼1.1 Gb of overlapping BAC sequences (see the "Material and Methods" section) can be further divided into an ∼60:40 ratio of multiallelic STRPs and diallelic indels, respectively. Division of the indels into these two groups is based on the rules listed in the "Material and Methods" section and is therefore somewhat arbitrary. Many sequences categorized as multiallelic will likely turn out to be diallelic, and at least a few of the sequences categorized as diallelic will likely have more than two alleles. Repeat lengths for STRP candidates followed expected patterns (Tóth et al. 2000). Mononucleotide repeats were most abundant (73% of total), followed by dinucleotide repeats (18%), tetranucleotide repeats (6%) and trinucleotide repeats (2%). For diallelic candidates, most had 1-nt length differences between alleles (76%). The distribution of candidates with ≥2-nt allele-length differences was very close to table 4.

Indels can be easily genotyped using just PCR and gel electrophoresis. Diallelic indels can also be genotyped using the various methods developed for SNPs. The significant difference in sequence between many of the diallelic indels allows these polymorphisms to be efficiently analyzed in a highly automated fashion by allele-specific PCR (J. L. Weber, J. Che, A. Yu, N. Ghebranious, and M. Doktycz, unpublished data). We recommend indels for most genetic studies.

Weber et al.: Diallelic Indels 861

Acknowledgments

This work was supported by grant HL62681 and contract HV48141 from the National Heart, Lung, and Blood Institute. We thank Drs. Ken Kidd (Yale), Gay Reinartz (Milwaukee Zoo), and Oliver Ryder (San Diego Zoo) for providing human, bonobo, and gorilla DNA samples, respectively. Jayme Opolka, Ryan Boudreau, Jianhong Che, Patti Franckowiak, Kelly Gebert, Jennifer Imm, Fay Jahr, Jessica Kayhart, Obrad Kokanovic, Melissa Krall, Sarah Merz, Keith Pulvermacher, Bryndon Schank, Ann Solatycki, Dan Tomaszewski, and Maggie Yin provided excellent laboratory technical assistance. We also thank Andrew Clark for helpful comments.

Electronic-Database Information

URLs for data presented herein are as follows:

ALFRED, http://alfred.med.yale.edu/alfred/
Center for Medical Genetics, Marshfield Medical Research Foundation, http://research.marshfieldclinic.org/genetics/
Coriell Cell Repositories DNA Polymorphism Discovery Resource, http://locus.umdnj.edu/nigms/pdr.html
dbSNP Home Page, http://www.ncbi.nlm.nih.gov/SNP/
Human Gene Mutation Database, http://www.hgmd.org/
Primer3 Software Distribution, http://www-genome.wi.mit.edu/genome_software/other/primer3.html
Sanger Institute, The, http://www.sanger.ac.uk/HGP/

References

Antonarakis SE, Krawczak M, Cooper DN (2000) Disease-causing mutations in the human genome. Eur J Pediatr 159 Suppl 3:S173–S178
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815
Berger J, Suzuki T, Senti K-A, Stubbs J, Schaffner G, Dickson BJ (2001) Genetic mapping with SNP markers in Drosophila. Nat Genet 29:475–481
Clark AG (1997) Neutral behavior of shared polymorphism. Proc Natl Acad Sci USA 94:7730–7734
Collins FS, Brooks LD, Chakravarti A (1998) A DNA Polymorphism Discovery Resource for research on human genetic variation. Genome Res 8:1229–1231
Cooper DN, Krawczak M (1991) Mechanisms of insertional mutagenesis in human genes causing genetic disease. Hum Genet 87:409–415
Dawson E, Chen Y, Hunt S, Smink LJ, Hunt A, Rice K, Livingston S, Bumpstead S, Bruskiewich R, Sham P, Ganske R, Adams M, Kawasaki K, Shimizu N, Minoshima S, Roe B, Bentley D, Dunham I (2001) A SNP resource for human chromosome 22: extracting dense clusters of SNPs from the genomic sequence. Genome Res 11:170–178
Fu Y-X (1995) Statistical properties of segregating sites. Theor Popul Biol 48:172–197
Gelehrter TD, Collins FS (1990) Principles of medical genetics. Williams & Wilkins, Baltimore, p 55
Hacia JG, Fan JB, Ryder O, Jin L, Edgemon K, Ghandour G, Mayer RA, Sun B, Hsie L, Robbins CM, Brody LC, Wang D, Lander ES, Lipshutz R, Fodor SP, Collins FS (1999) Determination of ancestral alleles for human single-nucleotide polymorphisms using high-density oligonucleotide arrays. Nat Genet 22:164–167
Halangoda A, Still JG, Hill KA, Sommer SS (2001) Spontaneous microdeletions and microinsertions in a transgenic mouse mutation detection system: analysis of age, tissue, and sequence specificity. Environ Mol Mutagen 37:311–313
International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921
International SNP Map Working Group (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409:928–933
Jorde LB, Watkins WS, Bamshad MJ (2001) Population genomics: a bridge from evolutionary history to genetic medicine. Hum Mol Genet 10:2199–2207
Kaessmann H, Wiebe V, Weiss G, Pääbo S (2001) Great ape DNA sequences reveal a reduced diversity and an expansion in humans. Nat Genet 27:155–156
Krawczak M, Cooper DN (1991) Gene deletions causing human genetic disease: mechanisms of mutagenesis and the role of the local DNA sequence environment. Hum Genet 86:425–441
Lupski JR, Roth JR, Weinstock GM (1996) Chromosomal duplications in bacteria, fruit flies, and humans. Am J Hum Genet 58:21–27
Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z, Zakeri H, Stitziel NO, Hillier L, Kwok P-Y, Gish WR (1999) A general approach to single-nucleotide polymorphism discovery. Nat Genet 23:452–456
Marth G, Yeh R, Minton M, Donaldson R, Li Q, Duan S, Davenport R, Miller RD, Kwok P-Y (2001) Single-nucleotide polymorphisms in the public domain: how useful are they? Nat Genet 27:371–372
Miller RD, Taillon-Miller P, Kwok P-Y (2001) Regions of low single-nucleotide polymorphism incidence in human and orangutan Xq: deserts and recent coalescences. Genomics 71:78–88
Sherry ST, Ward M, Sirotkin K (2000) Use of molecular variation in the NCBI dbSNP database. Hum Mutat 15:68–75
Subrahmanyan L, Eberle MA, Clark AG, Kruglyak L, Nickerson DA (2001) Sequence variation and linkage disequilibrium in the human T-cell receptor β (TCRB) locus. Am J Hum Genet 69:381–395
Tishkoff SA, Pakstis AJ, Stoneking M, Kidd JR, Destro-Bisol G, Sanjantila A, Lu RB, Deinard AS, Sirugo G, Jenkins T, Kidd KK, Clark AG (2000) Short tandem-repeat polymorphism/Alu haplotype variation at the PLAT locus: implications for modern human origins. Am J Hum Genet 67:901–925
Tóth G, Gáspári Z, Jurka J (2000) Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res 10:967–981
Watkins WS, Ricker CE, Bamshad MJ, Carroll ML, Nguyen SV, Batzer MA, Harpending HC, Rogers AR, Jorde LB (2001) Patterns of ancestral human diversity: an analysis of Alu-insertion and restriction-site polymorphisms. Am J Hum Genet 68:738–752
Wicks SR, Yeh RT, Gish WR, Waterston RH, Plasterk RHA (2001) Rapid gene mapping in Caenorhabditis elegans using a high density polymorphism map. Nat Genet 28:160–164
Zhao S, Malek J, Mahairas G, Fu L, Nierman W, Venter JC, Adams MD (2000) Human BAC ends quality assessment and sequence analyses. Genomics 63:321–332

© 2001 Nature Publishing Group http://genetics.nature.com brief communications

Single-nucleotide polymorphisms in the public domain: how useful are they?

There is a concerted effort by a number of public and private groups to identify a large set of human single-nucleotide polymorphisms1,2 (SNPs). As of March 2001, 2.84 million SNPs have been deposited in the public database, dbSNP, at the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/SNP/). The 2.84 million SNPs can be grouped into 1.65 million non-redundant SNPs. As part of the International SNP Map Working Group, we recently published a high-density SNP map of the human genome consisting of 1.42 million SNPs (ref. 3). In addition, numerous SNPs are maintained in proprietary databases. Our survey of more than 1,200 SNPs indicates that more than 80% of TSC and Washington University candidate SNPs are polymorphic and that approximately 50% of the candidate SNPs from these two sources are common SNPs (with minor allele frequency of ≥20%) in any given population.

Most of the SNPs in the public domain came from three groups: the SNP Consortium4 (TSC), the Sanger Centre in the United Kingdom, and Washington University5. The SNPs found in dbSNP are mostly 'candidate' SNPs found by computer data-mining procedures and have not been characterized. In other words, the SNPs in dbSNP are mostly variants found when DNA sequences from a handful of clones were compared by a computer algorithm6. They are basically only annotations of the human genome sequence. By our estimate, less than 15% of the SNPs in the database have been proven to be polymorphic in any population. Even fewer have genotyping assays developed for them. We carried out two pilot studies to determine how well the candidate SNPs in dbSNP would fare if they were to be developed into genetic markers.

In the first study, 528 radiation hybrid mapped sequence-tagged sites (STSs) containing candidate SNPs from the TSC set were tested by a pooled DNA sequencing approach to determine the allele frequencies of the SNPs in 3 ethnic groups7 (Caucasians, Chinese and Africans). Each DNA pool contains equal amounts of DNA from 30 individuals. Preparative PCR (30 µl reactions) were carried out in 96-well microtiter plates and the excess PCR primers and deoxynucleotides were removed by passing the crude PCR products through a size-exclusion resin in 96-well format (Edge Biosystems). An aliquot of the PCR product was used in the sequencing reaction (also done in 96-well format) and the sequencing reaction products were purified by a size-exclusion resin (Princeton Separations). We found that 28 STSs failed PCR and sequencing (5.3%; Table 1). In the 500 STSs that amplified well, we found 539 candidate SNPs. Complete sequencing data were obtained for 502 candidate SNPs (93%). The remaining 7% of candidate SNPs only had partial data (that is, one or more of the pool sequences were missing). Of the characterized SNPs, 87 (17%) were monomorphic (that is, only 1 of the 2 predicted alleles was found in all 3 population samples), and 30 SNPs (6%) had minor allele frequencies below 20% in all 3 populations. In contrast, 135 SNPs (27%) were common SNPs, with minor allele frequencies greater than or equal to 20% in all 3 population samples; 263 SNPs (52%) were common in 2 or more populations; and 385 (77%) were common SNPs in at least 1 population.

In a second study, STSs were developed for 897 candidate SNPs generated by comparing the consensus genomic sequences of two overlapping BAC clones. The STSs were developed using the Primer3 program8. In all, 133 STSs failed PCR and sequencing (14.8%), leaving 774 candidate SNPs found in 764 STSs. Similar to the results obtained in the TSC pilot project, 130 candidate SNPs (16.8%) were monomorphic. We found 55 SNPs (7.1%) to have a minor allele frequency of less than 20% in all 3 population samples; 208 (26.9%) to be common SNPs in all 3 population samples; 420 (54.3%) to be common SNPs in 2 or more populations; and 589 (76.1%) to be common SNPs in at least 1 population.

In both studies, between 52% and 54% of the characterized SNPs turn out to be common SNPs for each population pool. In other words, about half of the candidates are common SNPs in the Caucasians, and so forth. Moreover, between 30% and 34% of the characterized SNPs are not detected in each population pool. Our results show that if a researcher uses the publicly available candidate SNPs for a study in a population, there is only a 66–70% chance that the SNPs have appreciable minor allele frequency and a 50-50 chance that the SNPs are common in that population.

Table 1 • Allele frequencies of SNPs found in dbSNP

                                       TSC SNPs       Overlap SNPs
Total characterized                    502            774
SNPs not detected (a)                  87 (17.3%)     130 (16.8%)
Uncommon SNPs (b)                      30 (6.0%)      55 (7.1%)
Common SNPs in ≥1 population (c)       385 (76.7%)    589 (76.1%)
Common SNPs in ≥2 populations (c)      263 (52.4%)    420 (54.3%)
Common SNPs in all 3 populations (c)   135 (27.0%)    208 (26.9%)

(a) Only one of the two predicted alleles found in all three populations. (b) Minor allele frequency appreciable but < 20% in all 3 populations. (c) A SNP is considered 'common' when the minor allele frequency is ≥20%.

Although pooled sequencing for allele frequency estimation is a validated method9, and our recent study showed that the individual genotype data (over 300 individuals typed for each marker) corresponded well with the pooled sequencing data10, this approach yields only a rough estimate of the allele frequencies. Because of uncertainties in the accuracies of DNA pooling and sequencing data quality issues, one cannot detect rare alleles (< 5% in the pooled sample) and the allele frequency estimates can be off by about 5% (ref. 9). Here we are only trying to determine if a candidate SNP has appreciable minor allele frequency and if it is a common SNP in a population. Both questions can be answered with confidence in this rough estimation approach.

There is also a real concern that the candidate SNPs are not real polymorphisms but duplicated regions of the genome with near-identical sequences. This is a legitimate concern, except that the candidate SNPs from the TSC we used were uniquely mapped by radiation hybrid mapping and the overlap SNPs were from clone sequences with extensive alignment in the vicinity. With the increasing complete human genome sequence as reference, most of the false-positive SNPs due to paralogous sequences have already been screened out as the SNPs are mapped. Moreover, the false-positive SNPs in duplicated regions show the tell-tale sign of having 50% allele frequencies for both alleles in all populations. The only way to test for false-positive SNPs due to duplications is to check for mendelian inheritance of the alleles or assay the candidate SNP against a duplicated haploid genome such as the complete hydatidiform mole5. Based on the general experience that only approximately 5% of candidate SNPs that passed the computer filters for repetitive elements are due to low-copy duplications, global testing of candidate SNPs for duplications is not warranted.

Because a significant fraction of the SNPs in the public domain are found in repetitive regions, there is no guarantee that all SNPs can be amplified uniquely from the genome. Despite these limitations, the publicly available candidate SNPs from TSC and Washington University are likely to be useful to any researcher looking for SNPs in the public domain if they are selected judiciously. To make the marker set even more useful to the genome research community, our group at Washington University and several other groups will characterize more than 100,000 candidate SNPs by the end of 2001. With PCR assays designed for the SNPs and the allele frequencies of these SNPs determined, the average researcher can use these SNPs with a high degree of confidence that they are useful in their own populations.

Acknowledgments
We thank E.P.H. Yap for the Asian samples; M. Boyce-Jacino for the Caucasian and African American samples; R. Sachidanandam and L. Stein for TSC sequence information; and S. Sherry and E.H. Lai for discussion. This work is funded in part by grants from the National Human Genome Research Institute (HG01720) and the SNP Consortium.

Received 7 November 2000; accepted 1 March 2001.

Gabor Marth1, Raymond Yeh3, Matthew Minton2, Rachel Donaldson2, Qun Li2, Shenghui Duan2, Ruth Davenport2, Raymond D. Miller2 & Pui-Yan Kwok2,3
1National Center for Biotechnology Information, Bethesda, Maryland, USA. 2Division of Dermatology and 3Department of Genetics, Washington University, St. Louis, Missouri, USA. Correspondence should be addressed to P.-Y.K. (e-mail: [email protected]).

1. Collins, F.S., Guyer, M.S. & Chakravarti, A. Science 278, 1580–1581 (1997).
2. Marshall, E. Science 284, 406–407 (1999).
3. The International SNP Map Working Group Nature 409, 928–933 (2001).
4. Altshuler, D. et al. Nature 407, 513–516 (2000).
5. Taillon-Miller, P., Gu, Z., Li, Q., Hillier, L. & Kwok, P.-Y. Genome Res. 8, 748–754 (1998).
6. Marth, G.T. et al. Nature Genet. 23, 452–456 (1999).
7. Taillon-Miller, P. & Kwok, P.-Y. Genome Res. 9, 499–505 (1999).
8. Rozen, S. & Skaletsky, H. Methods Mol. Biol 132, 365–386 (2000).
9. Kwok, P.-Y., Carlson, C., Yager, T., Ankener, W. & Nickerson, D.A. Genomics 23, 138–144 (1994).
10. Taillon-Miller, P. et al. Nature Genet. 25, 324–328 (2000).

nature genetics • volume 27 • april 2001 371
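The percentages reported in Table 1 of this communication can be re-derived from the raw counts, which is a quick way to check a transcription of the table. The sketch below is illustrative only: the dictionary keys and function names are invented here, and the counts are those printed in the communication.

```python
# Counts as printed in Table 1 of the communication (TSC and overlap SNP sets).
TABLE_1 = {
    "TSC SNPs": {
        "total": 502, "not_detected": 87, "uncommon": 30,
        "common_ge1": 385, "common_ge2": 263, "common_all3": 135,
    },
    "Overlap SNPs": {
        "total": 774, "not_detected": 130, "uncommon": 55,
        "common_ge1": 589, "common_ge2": 420, "common_all3": 208,
    },
}


def pct(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`, to one decimal place."""
    return round(100.0 * count / total, 1)


def summarize(counts: dict) -> dict:
    """Convert every category count into a percentage of the characterized total."""
    total = counts["total"]
    return {name: pct(n, total) for name, n in counts.items() if name != "total"}


if __name__ == "__main__":
    for source, counts in TABLE_1.items():
        print(source, summarize(counts))
    # PCR/sequencing failure rates quoted in the text: 28 of 528 and 133 of 897 STSs.
    print(pct(28, 528), pct(133, 897))
```

The three mutually exclusive categories (not detected, uncommon, common in at least one population) add up to the characterized total in both columns. The recomputed percentages agree with the printed ones to within about 0.1 percentage point; the 135 of 502 entry computes to 26.9%, whereas the table prints 27.0%, presumably a rounding difference.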

Genetic linkage of childhood atopic dermatitis to psoriasis susceptibility loci

We have carried out a genome screen for atopic dermatitis (AD) and have identified linkage to AD on chromosomes 1q21, 17q25 and 20p. These regions correspond closely with known psoriasis loci, as does a previously identified AD locus on chromosome 3q21. The results indicate that AD is influenced by genes with general effects on dermal inflammation and immunity.

AD (also known as eczema) commonly begins in infancy and early childhood, and is typified by itchy, inflamed skin. It affects 10–20% of children in Western societies and shows a strong familial aggregation1,2. Eighty percent of cases of AD have elevations of the total serum IgE concentration3, and atopic mechanisms dominate current understanding of the pathogenesis of the disease4.

We examined 148 nuclear families recruited through children with active AD (see Web Methods). The families contained 383 children and 213 sibling pairs; 254 children had physician-diagnosed AD, 153 had asthma and 139 had both. Children with AD were aged 6.9±4.4 years and 124 were male. The age of onset of disease was less than 2 years in 90% of children (geometric mean 1.5 y). We found that 51.5% of children had moderate disease and 28.6% had severe disease. The serum IgE concentration was much higher in children with AD and asthma together (geometric mean 880 IU/l; 95% CI 637–1,230 IU/l) than in children with asthma alone (mean 91; 95% CI 23–361 IU/l) or with AD alone (mean 171; 95% CI 106–277 IU/l).

We typed 385 microsatellite markers with an average marker spacing of 8.9 cM and an average information content greater than 65%. We tested four phenotypic models for linkage by non-parametric sib-pair methods. These were ADao (affected subjects only), ADau (affected and unaffected subjects given equal weighting), asthmaau (affected and unaffected subjects given equal weighting) and the total serum IgE analysed as a quantitative trait. We had insufficient subjects with asthma to analyse only affected sibpairs.

At the P < 0.001 level, we identified linkage to AD on chromosomes 1q21 and 17q25, and linkage to asthma on 20p (Table 1). Linkage of chromosome 20p to children with both AD and asthma (χ2=10.9, P=0.0005) was not greatly different than that to children with asthma alone, indicating that the combination of AD and asthma may correspond to a genetic subtype of disease. The total serum IgE concentration was linked to chromosome 16q–tel. Weaker evidence for linkage was seen between the total serum IgE and D5S2115 (P=0.004) within the chromosome 5 cytokine cluster,

Table 1 • Results of linkage analysis from genome screen

                         ADao                   ADau               Asthmaau           IgE
Marker    Location (a)   χ2 (LR) (b)   P (c)    χ2 (LR)   P        χ2 (LR)   P        χ2 (LR)   P
D1S252    155.1          4.74          0.015    7.54      0.003    –         –        3.45      0.03
D1S498    160.7          4.00          0.02     10.95     0.0005   –         –        3.04      0.04
D1S484    173.9          –             –        5.34      0.01     –         –        –         –
D16S520   123.3          –             –        –         –        –         –        10.25     0.0007
D17S784   117.7          11.04         0.0004   5.38      0.01     –         –        –         –
D17S928   128.7          8.23          0.002    4.78      0.015    –         –        –         –
D20S889   11.0           –             –        –         –        3.86      0.02     –         –
D20S115   20.9           –             –        –         –        10.63     0.0005   –         –
D20S186   33.2           –             –        –         –        6.67      0.01     –         –

Linkages with P < 0.001 are shown, together with flanking markers with P < 0.05. (a) Position in cM from top of chromosome linkage group. (b) Likelihood ratio χ2. (c) Single marker significance, unadjusted for genome-wide scan.

articles

A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms

The International SNP Map Working Group*

* A full list of authors appears at the end of this paper.

We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exons (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.
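The headline figures in this abstract (one SNP every 1.9 kb and about 60,000 exonic SNPs) can be reproduced from counts the paper reports in its Table 1 totals and its RefSeq analysis. The sketch below is only a worked check of that arithmetic; the constant and function names are invented for illustration, and the rounding of "about two exonic SNPs per gene" to exactly two follows the paper's own extrapolation.

```python
# Figures as reported in the paper (Table 1 totals and the RefSeq analysis).
ASSEMBLY_BP = 2_710_164_000   # assembled genome sequence, 5 September 2000
ALL_SNPS = 1_419_190          # SNPs mapped to unique locations
TSC_SNPS = 887_450            # subset contributed by The SNP Consortium
REFSEQ_BP = 15_696_674        # total length of the RefSeq mRNAs analysed
REFSEQ_SNPS = 14_534          # SNPs mapping within those mRNAs
REFSEQ_MRNAS = 7_000          # number of RefSeq mRNAs
HUMAN_GENES = 30_000          # approximate gene count assumed in the paper


def kb_per_snp(length_bp: int, snps: int) -> float:
    """Average spacing between SNPs in kilobases, to two decimal places."""
    return round(length_bp / snps / 1000.0, 2)


def exonic_snp_estimate(snps: int, mrnas: int, genes: int) -> int:
    """Extrapolate exonic SNPs genome-wide from a whole-number per-gene average."""
    per_gene = round(snps / mrnas)  # 14,534 / 7,000 is roughly 2.08, i.e. about two
    return per_gene * genes


if __name__ == "__main__":
    print("all SNPs, kb per SNP:", kb_per_snp(ASSEMBLY_BP, ALL_SNPS))
    print("TSC SNPs, kb per SNP:", kb_per_snp(ASSEMBLY_BP, TSC_SNPS))
    print("exons, kb per SNP:", kb_per_snp(REFSEQ_BP, REFSEQ_SNPS))
    print("estimated exonic SNPs:", exonic_snp_estimate(REFSEQ_SNPS, REFSEQ_MRNAS, HUMAN_GENES))
```

Using the unrounded per-gene average (about 2.08) instead of two would give an estimate a few per cent higher, so the 60,000 figure is best read as an order-of-magnitude extrapolation.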

Inherited differences in DNA sequence contribute to phenotypic variation, influencing an individual's anthropometric characteristics, risk of disease and response to the environment. A central goal of genetics is to pinpoint the DNA variants that contribute most significantly to population variation in each trait. Genome-wide linkage analysis and positional cloning have identified hundreds of genes for human diseases1 (http://ncbi.nlm.nih.gov/OMIM), but nearly all are rare conditions in which mutation of a single gene is necessary and sufficient to cause disease. For common diseases, genome-wide linkage studies have had limited success, consistent with a more complex genetic architecture. If each locus contributes modestly to disease aetiology, more powerful methods will be required.

One promising approach is systematically to explore the limited set of common gene variants for association with disease2–4. In the human population most variant sites are rare, but the small number of common polymorphisms explain the bulk of heterozygosity3 (see also refs 5–11). Moreover, human genetic diversity appears to be limited not only at the level of individual polymorphisms, but also in the specific combinations of alleles (haplotypes) observed at closely linked sites8,11–14. As these common variants are responsible for most heterozygosity in the population, it will be important to assess their potential impact on phenotypic trait variation.

If limited haplotype diversity is general, it should be practical to define common haplotypes using a dense set of polymorphic markers, and to evaluate each haplotype for association with disease. Such haplotype-based association studies offer a significant advantage: genomic regions can be tested for association without requiring the discovery of the functional variants. The required density of markers will depend on the complexity of the local haplotype structure, and the distance over which these haplotypes extend, neither of which is yet well defined.

Current estimates (refs 13–17) indicate that a very dense marker map (30,000–1,000,000 variants) would be required to perform haplotype-based association studies. Most human sequence variation is attributable to SNPs, with the rest attributable to insertions or deletions of one or more bases, repeat length polymorphisms and rearrangements. SNPs occur (on average) every 1,000–2,000 bases when two human chromosomes are compared5,6,9,18–20, and are thus present at sufficient density for comprehensive haplotype analysis. SNPs are binary, and thus well suited to automated, high-throughput genotyping. Finally, in contrast to more mutable markers, such as microsatellites21, SNPs have a low rate of recurrent mutation, making them stable indicators of human history. We have constructed a SNP map of the human genome with sufficient density to study human haplotype structure, enabling future study of human medical and population genetics.

Identification and characteristics of SNPs

The map contains all SNPs that were publicly available in November 2000. Over 95% were discovered by The SNP Consortium (TSC) and the public Human Genome Project (HGP). TSC contributed 1,023,950 candidate SNPs (http://snp.cshl.org) identified by shotgun sequencing of genomic fragments drawn from a complete (45% of data) or reduced (55% of data) representation of the human genome18,22. Individual contributions were: Whitehead Institute, 589,209 SNPs from 2.57 million (M) passing reads; Sanger Centre, 262,279 SNPs from 1.16M passing reads; Washington University, 172,462 SNPs from 1.69M passing reads. TSC SNPs were discovered using a publicly available panel of 24 ethnically diverse individuals23. Reads were aligned to one another and to the available genome sequence, followed by detection of single base differences using one of two validated algorithms: Polybayes24 and the neighbourhood quality standard (NQS18,22).

An additional 971,077 candidate SNPs were identified as sequence differences in regions of overlap between large-insert clones (bacterial artificial chromosomes (BACs) or P1-derived artificial chromosomes (PACs)) sequenced by the HGP. Two groups (NCBI/Washington University (556,694 SNPs): G.B., P.Y.K. and S.S.; and The Sanger Centre (630,147 SNPs): J.C.M. and D.R.B.) independently analysed these overlaps using the two detection algorithms. This approach contributes dense clusters of SNPs throughout the genome. The remaining 5% of SNPs were discovered in gene-based studies, either by automated detection of single base differences in clusters of overlapping expressed sequence tags24–28 or by targeted resequencing efforts (see ftp://ncbi.nlm.nih.gov/snp/human/submit_format/*/*publicat.rep.gz).

It is critical that candidate SNPs have a high likelihood of representing true polymorphisms when examined in population studies. Although many methods and contributors are represented on the map (see above), most SNPs (> 95%) were contributed by two large-scale efforts that uniformly applied automated methods.

928 © 2001 Macmillan Magazines Ltd NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com articles

Random samples of these SNPs have been evaluated by confirmation in the original DNA samples (where possible) to rule out false positives, and in independent population samples to determine allele frequency. The TSC centres and two outside laboratories (Orchid and Cold Spring Harbor Laboratory) successfully genotyped 1,585 TSC SNPs in the 24 DNA samples used for discovery (http://snp.cshl.org); having surveyed all chromosomes in which each SNP could have been identified, any non-polymorphic candidates must represent false positives. In these tests, 1,500 SNPs (95%) were polymorphic, 67 (4%) non-polymorphic (false positives) and 18 (1%) uniformly heterozygous (previously unrecognized repeats). These high validation rates were observed separately for subsets of SNPs discovered by reduced representation shotgun and genomic alignment, and for subsets identified with Polybayes and the NQS. Thus, these algorithms appear to generate few false positive SNPs. The small number (1%) of uniformly 'heterozygous' candidate SNPs shows that the methods also exclude nearly all low-copy repeats.

The allele frequencies of a set of SNPs have been evaluated29 in independent populations using pooled resequencing. Samples of TSC (n = 502) and overlap SNPs (n = 774) were studied in population samples of European, African American and Chinese descent, revealing 82% to be polymorphic in at least one ethnic group at frequencies above the detection threshold of pooled resequencing (~10%). The remaining 18% presumably represent SNPs with a frequency less than 10% in the populations surveyed and false positives. Furthermore, 77% of SNPs had a minor allele frequency of more than 20% in at least one population, and 27% had an allele frequency higher than 20% in all three ethnic groups. TSC and overlap SNPs had similar distributions across the populations, showing that they are comparable in quality and frequency. The high proportion of SNPs with significant population frequency is expected after SNP discovery in two or a few chromosomes, given standard assumptions about human population history18,29,30.

Description of the SNP map

We mapped the sequence flanking each SNP by alignment to the genomic sequence of large-insert clones in Genbank. These alignments were converted into chromosomal coordinates according to the publicly available genome assemblies of July and September 2000 (http://genome.ucsc.edu). Candidate SNPs were included in the final map only if they mapped to a single location in the genome assembly. Integrated displays of SNPs, genes and other features are available at the ENSEMBL (http://www.ensembl.org), NCBI (National Center for Biotechnology Information; http://www.ncbi.nlm.nih.gov), UCSC (University of California at Santa Cruz; http://genome.ucsc.edu) and TSC (http://snp.cshl.org) websites.

The nonredundant SNP total of 1,433,393 is fewer than the sum of individual submissions (2,067,476) because some SNPs (mainly in regions of BAC overlap) were discovered by more than one effort. Of these, 1,419,190 mapped to unique locations in the 2.7 gigabases (Gb) of assembled human genome sequence, providing an average density of one SNP every 1.91 kb. TSC SNPs, which are more evenly distributed than those from clone overlaps, were found on average every 3.05 kb. SNP density (Table 1) is relatively constant across the autosomes. To characterize the distribution of SNPs, we examined 366,192 SNPs that fell within finished sequence. Most of the genome contains SNPs at high density (Fig. 1): 90% of contiguous 20-kb windows contain one or more SNPs, as do 63% of 5-kb windows and 28% of 1-kb windows. Only 4% of genome sequence falls in gaps between SNPs of > 80 kb, and some of these gaps are covered by SNPs that are discovered but not yet mapped owing to gaps in the genome assembly.

To evaluate the density of SNPs in regions within and surrounding genes, we used the September 2000 release of RefSeq31. In total, 14,534 SNPs map to within these 7,000 carefully annotated, nonredundant messenger RNAs, equivalent to about two exonic SNPs per gene (coding and untranslated regions). Extrapolating two exonic SNPs per gene to the approximately 30,000 human genes32, we estimate there to be 60,000 exonic SNPs in this collection. The density of SNPs in exons (one SNP per 1.08 kb; Table 1) is higher than in the genome as a whole, owing to the contribution of efforts targeted to exonic regions.

We also assessed the distribution of SNPs in the genomic locus surrounding each of the RefSeq mRNAs. We assigned the RefSeq exons to their genomic locations, restricting analysis to the 2,960 RefSeq mRNAs mapping onto finished sequence. As we cannot define the extent of the noncoding (regulatory) regions of each gene, we arbitrarily defined each 'gene locus' as extending from 10 kb upstream of the start of the first exon to the end of the last exon. By this definition, 93% of gene loci contain at least one SNP, and 98% are within 5 kb of the nearest SNP; also, 59% of gene loci contained five or more SNPs, and 39% ten or more. Of 24,953

Table 1 SNP distribution by chromosome

Chromosome   Length (bp)      All SNPs    All kb per SNP   TSC SNPs   TSC kb per SNP
1            214,066,000      129,931     1.65             75,166     2.85
2            222,889,000      103,664     2.15             76,985     2.90
3            186,938,000      93,140      2.01             63,669     2.94
4            169,035,000      84,426      2.00             65,719     2.57
5            170,954,000      117,882     1.45             63,545     2.69
6            165,022,000      96,317      1.71             53,797     3.07
7            149,414,000      71,752      2.08             42,327     3.53
8            125,148,000      57,834      2.16             42,653     2.93
9            107,440,000      62,013      1.73             43,020     2.50
10           127,894,000      61,298      2.09             42,466     3.01
11           129,193,000      84,663      1.53             47,621     2.71
12           125,198,000      59,245      2.11             38,136     3.28
13           93,711,000       53,093      1.77             35,745     2.62
14           89,344,000       44,112      2.03             29,746     3.00
15           73,467,000       37,814      1.94             26,524     2.77
16           74,037,000       38,735      1.91             23,328     3.17
17           73,367,000       34,621      2.12             19,396     3.78
18           73,078,000       45,135      1.62             27,028     2.70
19           56,044,000       25,676      2.18             11,185     5.01
20           63,317,000       29,478      2.15             17,051     3.71
21           33,824,000       20,916      1.62             9,103      3.72
22           33,786,000       28,410      1.19             11,056     3.06
X            131,245,000      34,842      3.77             20,400     6.43
Y            21,753,000       4,193       5.19             1,784      12.19
RefSeq       15,696,674       14,534      1.08
Totals       2,710,164,000    1,419,190   1.91             887,450    3.05

Length (bp) is from the public Genome Assembly of 5 September 2000. Density of SNPs on each chromosome is influenced by the amount of available genome sequence included in the Genome Assembly, depth of overlap coverage from TSC reads and clone overlaps, and the underlying heterozygosity (Table 2). Data are presented for the entire dataset (All SNPs) and for those from the SNP consortium (TSC SNPs), as the latter are more evenly spaced than those from clone overlaps.

[Figure 1 plot: y axis, per cent of windows with one or more SNPs (0 to 100); x axis, size of windows of genome sequence (kb): 1, 2, 5, 10, 15, 20, 40, 80.]

Figure 1 Distribution of SNP coverage across intervals of finished sequence. Windows of defined size (in chromosome coordinates) were examined for whether they contained one or more SNPs. Analysis was restricted to the 900 Mb of available finished sequence.
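The two summary measures used above, kb per SNP (Table 1) and the per cent of windows containing one or more SNPs (Fig. 1), can be illustrated with a minimal Python sketch. The function names and the 0-based coordinate convention are assumptions made for illustration; this is not the code used to build the map.

```python
def kb_per_snp(length_bp, n_snps):
    """Average spacing between SNPs, in kilobases (as in the 'kb per SNP' columns)."""
    return length_bp / n_snps / 1000.0


def fraction_of_windows_with_snp(snp_positions, sequence_length, window_bp):
    """Fraction of contiguous, non-overlapping windows that hold at least one SNP.

    snp_positions are assumed to be 0-based coordinates on one stretch of
    finished sequence; a trailing partial window is ignored.
    """
    n_windows = sequence_length // window_bp
    if n_windows == 0:
        return 0.0
    occupied = {pos // window_bp for pos in snp_positions
                if pos // window_bp < n_windows}
    return len(occupied) / n_windows


# Values taken from the Totals and chromosome 1 rows of Table 1.
genome_density = kb_per_snp(2_710_164_000, 1_419_190)
chr1_density = kb_per_snp(214_066_000, 129_931)

# Toy example (hypothetical positions) for the window calculation.
toy_positions = [100, 1_500, 1_600, 9_000]
toy_1kb = fraction_of_windows_with_snp(toy_positions, 10_000, 1_000)
toy_5kb = fraction_of_windows_with_snp(toy_positions, 10_000, 5_000)
```

With the Table 1 inputs, the density function gives about 1.91 kb per SNP for the genome and about 1.65 for chromosome 1. In the toy example, larger windows are more likely to contain a SNP, which is the trend that Fig. 1 reports.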

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com © 2001 Macmillan Magazines Ltd 929 articles

exons, 85% were within 5 kb of the nearest SNP. Thus, most exons should be close enough to at least one SNP for haplotype-based association studies, where the functional variant may be some distance from the SNPs used in the study.

The density of SNPs obtained at any given location depends upon the methods of SNP discovery contributing at each position (TSC, BAC overlap or targeted), the availability of genome sequence for SNP discovery and mapping, and the rate of nucleotide diversity. Of these, only nucleotide diversity is a fundamental characteristic of the region and population studied. To chart the landscape of human genome sequence polymorphism, we performed a genome-wide analysis of nucleotide diversity.

Analysis of nucleotide diversity

Describing the underlying pattern of nucleotide diversity required a polymorphism survey performed at high density, in a single, defined population sample, and analysed with a uniform set of tools. We reanalysed 4.5M passing sequence reads generated by TSC using genomic alignment using the NQS (see Methods). This set contained 1.2 billion aligned bases and 920,752 heterozygous positions. We measured nucleotide sequence variation using the normalized measure of heterozygosity (π), representing the likelihood that a nucleotide position will be heterozygous when compared across two chromosomes selected randomly from a population. π also estimates the population genetic parameter Θ = 4Neμ in a model in which sites evolve neutrally, with mutation rate μ, in a constant-sized population of effective size Ne. For the human genome, π was 7.51 × 10⁻⁴, or one SNP for every 1,331 bp surveyed in two chromosomes drawn from the NIH diversity panel. This value agrees with smaller surveys of human genome variation18–20.

We next examined the heterozygosity of individual chromosomes (Table 2). The autosomes were quite similar to one another, with 20 of 22 within 10% of the genome-wide average for autosomes (7.65 × 10⁻⁴). Two had more extreme values: chromosome 21 (π = 5.19 × 10⁻⁴) and chromosome 15 (π = 8.79 × 10⁻⁴). Whether these observations are due to statistical fluctuations or methodological issues, or are biologically meaningful, will require investigation. The most striking difference in heterozygosity is the lower diversity of the sex chromosomes. The lower rate of polymorphism on the X chromosome may be explained by both a lower effective population size (Ne) and lower mutation rate (μ) in Θ = 4Neμ. Because the X chromosome is hemizygous in males, the effective population size is three-quarters of that of the autosomes. In addition, μ is higher in male than in female meiosis, with μmale/μfemale ≈ 1.7/1.0 (ref. 33). As the X chromosome undergoes male meiosis only 1/3 of the time, the overall rate of mutation in the X chromosome is expected to be 91% that of the autosomes (μX = 1.23/1.35 = 0.91). Thus, the diversity of the X chromosome is predicted to be 69% that of the autosomes. The observed heterozygosity of the X chromosome was 4.69 × 10⁻⁴, or 61% of the average value of the autosomes. Thus, the population genetic considerations described above could largely explain the lower heterozygosity on the X chromosome. It is possible that strong selection on the X chromosome (owing to hemizygosity in males) or other factors might partially explain this observation.

The Y chromosome has the lowest observed heterozygosity of any chromosome. It is divided into two regions: a pseudoautosomal region at either telomeric end that recombines with the X chromosome and is highly heterozygous34, and the non-recombining Y (NRY). The genome assembly used for this analysis contains only the NRY, which shows very little diversity: 348 SNPs in 2,304,916 bases (π = 1.51 × 10⁻⁴). These values agree reasonably with previous estimates for NRY35,36. The lower diversity of NRY is influenced by a smaller effective population size (25% that of the autosomes), counterbalanced by the higher mutation rate of male meiosis (μY = 1.7/1.35 = 1.26 × that of the autosomes). These factors predict that the Y chromosome would have a diversity 31% that of the autosomes, as compared to the observed 20%. Other influences might include selection against deleterious alleles, patterns of male dispersal35 and a correlation of diversity with recombination rate19.

To look at diversity on a finer scale, we divided each chromosome into contiguous 200,000-bp bins according to the public Genome Assembly of 5 September 2000. The distribution of heterozygosity among these bins ranges from zero (12 bins, each with zero SNPs over an average of 24,720 bp examined) to 60 × 10⁻⁴ (357 SNPs in a bin surveying 58,755 bp). Although 95% of bins display nucleotide diversity values between 2.0 × 10⁻⁴ and 15.8 × 10⁻⁴, the pattern is variable (Fig. 2a, b; see also Supplementary Information). One measure of the spread in the data is the coefficient of variation (CV), the ratio of the standard deviation (σ) to the mean (μ) of the heterozygosity π of each individual read. For the observed data, the CV (σobserved/μobserved) was 1.93, considerably larger than would be expected if every base had uniform diversity, corresponding to a Poisson sampling process (σPoisson/μPoisson = 1.73). It was expected that the observed distribution would be much more variable than a Poisson process, because both biochemical and evolutionary forces cause diversity to be nonuniform across the genome. Biological

Table 2 Nucleotide diversity by chromosome

Chromosome   Heterozygous positions   High-quality bp examined   π (× 10⁻⁴)
1            71,483                   92,639,616                 7.72
2            81,860                   111,060,861                7.37
3            61,190                   81,359,748                 7.52
4            59,922                   74,162,156                 8.08
5            56,344                   77,924,663                 7.23
6            53,864                   72,380,717                 7.44
7            52,010                   68,527,550                 7.59
8            44,477                   57,476,056                 7.74
9            41,329                   50,834,047                 8.13
10           43,040                   52,184,561                 8.25
11           47,477                   56,680,783                 8.38
12           38,607                   51,160,578                 7.55
13           35,250                   43,915,606                 8.03
14           35,083                   47,425,180                 7.40
15           27,847                   31,682,199                 8.79
16           22,994                   27,736,356                 8.29
17           21,247                   27,124,496                 7.83
18           24,711                   30,357,102                 8.14
19           11,499                   15,060,544                 7.64
20           22,726                   31,795,754                 7.15
21           26,160                   50,367,158                 5.19
22           17,469                   20,478,378                 8.53
X            23,818                   50,809,568                 4.69
Y            348                      2,304,916                  1.51
Total        920,752                  1,225,448,590              7.51

Heterozygosity (π) of each chromosome. The data were filtered to remove repetitive sequences and heterozygosity calculated as described in the methods. Heterozygous positions and high-quality bases examined were counted separately for each pairwise comparison of read to genome, and then summed over each chromosome.

Table 3 Coefficients of variation for the observed data and the Poisson and coalescent models

SNPs per read   Observed       Poisson        Coalescent
0               8,796 ± 43     8,256 ± 52     8,767 ± 50
1               2,247 ± 44     3,040 ± 49     2,332 ± 46
2               668 ± 24       617 ± 24       663 ± 26
3               214 ± 14       99 ± 9         200 ± 15
4               102 ± 10       16 ± 4         66 ± 9
σ/μ             1.94 ± 0.02    1.72 ± 0.02    1.96 ± 0.03

Observed distribution of heterozygosity and comparison to expectation under Poisson and coalescent population genetic models. The autosomes were divided into 200,000-bp bins according to chromosome coordinates and one read randomly selected from each bin. This procedure was chosen to minimize the correlation in gene history of nearby regions, under the simplifying assumption that reads 200,000 bp apart and selected from unrelated individuals will have uncorrelated genealogies. Correlation of gene history does not influence the expected mean value of the CV, but does affect its variance. The random selection of reads and generation of expected distributions were repeated 100 times: presented are the mean and standard deviation of the number of reads in which 0, 1, 2, 3 or 4 SNPs were observed or predicted under each scenario. The Poisson model reports the number of such reads expected to display 0–4 SNPs under Poisson sampling of each read with a heterozygosity adjusted for length and GC content (Fig. 2c). Even in this reduced data set, the Poisson model can be rejected at P < 10⁻⁹⁹. The coalescent simulation38 assumed a constant-sized population of effective size 10,000 and free recombination among reads. For each read, μ was scaled according to its length and GC content (Fig. 2c). Each sampled read was assigned a coalescent history from a simulated distribution and the number of SNPs predicted. The coefficient of variation of the estimate of heterozygosity is presented, with the mean and standard deviation of the 100 sampling runs shown.
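The arithmetic behind the heterozygosity values and the sex-chromosome predictions can be checked with a short Python sketch. It assumes only the relations stated in the text (π as heterozygous positions per high-quality base, Θ = 4Neμ, and a male-to-female mutation rate ratio of 1.7). The relative effective population sizes of three-quarters for X and one-quarter for Y are the standard values and are the ones that reproduce the 69% and 31% predictions. The function names are hypothetical, and the CV helper works on raw SNP counts per read, so it only approximates the published CV of heterozygosity, which also reflects read length.

```python
from math import sqrt


def heterozygosity(het_positions, bases_examined):
    """Normalized heterozygosity: heterozygous positions per high-quality base."""
    return het_positions / bases_examined


def relative_mutation_rate(male_meiosis_fraction, male_to_female=1.7):
    """Mutation rate relative to the autosomes, which spend half their time in males."""
    chromosome = male_meiosis_fraction * male_to_female + (1 - male_meiosis_fraction)
    autosome = 0.5 * male_to_female + 0.5
    return chromosome / autosome


def predicted_relative_diversity(relative_ne, male_meiosis_fraction):
    """Diversity relative to the autosomes, using theta = 4 * Ne * mu."""
    return relative_ne * relative_mutation_rate(male_meiosis_fraction)


def cv_from_counts(reads_by_snp_count):
    """Coefficient of variation (sd / mean) of SNPs per read from a count histogram."""
    n_reads = sum(reads_by_snp_count.values())
    mean = sum(k * c for k, c in reads_by_snp_count.items()) / n_reads
    variance = sum(c * (k - mean) ** 2 for k, c in reads_by_snp_count.items()) / n_reads
    return sqrt(variance) / mean


# Totals row of Table 2.
pi_genome = heterozygosity(920_752, 1_225_448_590)

# X passes through male meiosis 1/3 of the time, Y always.
x_predicted = predicted_relative_diversity(0.75, 1 / 3)
y_predicted = predicted_relative_diversity(0.25, 1.0)

# 'Observed' column of Table 3 (mean number of reads with 0 to 4 SNPs).
observed_cv = cv_from_counts({0: 8796, 1: 2247, 2: 668, 3: 214, 4: 102})
```

Under these assumptions the sketch gives π of about 7.51 × 10⁻⁴ (one SNP per roughly 1,331 bp), predicted relative diversities of about 0.69 for X and 0.31 for Y, and a count-based CV of about 1.94 for the observed column. The Poisson and coalescent columns are not expected to match their published CVs as closely with this simplified count-based calculation.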

factors may include rates of mutation and recombination at each locus. For example, heterozygosity is correlated with the GC content for each read (Fig. 2c), reflecting, at least in part, the high frequency of CpG to TpG mutations arising from deamination of 5-methylcytosine. Population genetic forces are likely to be even more important: each locus has its own history, with samples at some loci tracing back to a recent common ancestor, and other loci describing more ancient genealogies. The time to the most recent common ancestor at a particular stretch of DNA is variable, and represents the opportunity for sequence divergence; thus, the expected pattern of heterozygosity is more heterogeneous than if every locus shared the same history37,38.

To assess whether gene history would account for the observed variation in heterozygosity, we compared the observed CV to that expected under a standard coalescent population genetic model. For each read, we adjusted μ on the basis of its per cent GC and length, and simulated genealogical histories under the assumption of a constant-sized population with Ne = 10,000. The CV determined under this model (σconstant-size/μconstant-size = 1.96) is a close match to the observed data. To estimate standard deviations around these estimates of the CV, it was necessary to consider that tightly linked regions may display correlated histories, and thus are nonindependent. We sampled subsets of the data chosen to minimize correlation among reads (see Methods), providing estimates of the mean and standard deviation of CV for the observed and simulated data (Table 3). These results indicate that the observed pattern of genome-wide heterozygosity is broadly consistent with predictions of this standard population genetic model (for comparison, see an analysis of variation in heterozygosity in the mouse genome)39. However, much work will be required to assess additional factors

[Figure 2 plots. Panel a: per cent of all bins (y axis) against heterozygosity (× 10⁻⁴, x axis). Panel b: SNPs per 10,000 bases (y axis) against position on Chromosome 6 in megabases (x axis). Panel c: heterozygosity (× 10⁻⁴, y axis) against per cent GC content of reads, in deciles (x axis).]

Figure 2 Distribution of heterozygosity. a, The genome was divided into contiguous bins of 200,000 bp based on chromosome coordinates, and the number of high-quality bases examined and heterozygosity calculated for each. A histogram was generated of the distribution of heterozygosity values across all such bins. b, Heterozygosity was calculated across contiguous 200,000-bp bins on Chromosome 6. The blue lines represent the values within which 95% of regions fall: 2.0 × 10⁻⁴–15.8 × 10⁻⁴. Red, bins falling outside this range. The extended region of unusually high heterozygosity centred at 34 Mb corresponds to the HLA. c, Correlation of nucleotide diversity with GC content of each read (autosomes only). The GC content and heterozygosity of reads from the heterozygosity analysis was calculated after sorting of reads by GC content and separation into 10 bins of equal size. Each bin contains ~150 Mb of aligned, high-quality sequence.
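The per-bin calculation behind panels a and b can be sketched in Python as follows. This is an illustrative sketch only: the input layout (one tuple per read giving a chromosome coordinate, the heterozygous positions found and the high-quality bases examined) and the function name are assumptions, and the 10,000-base minimum is the sampling threshold described for the heterozygosity analysis.

```python
from collections import defaultdict

BIN_SIZE = 200_000   # contiguous bins in chromosome coordinates
MIN_BASES = 10_000   # bins with fewer high-quality bases are not examined


def bin_heterozygosity(reads, bin_size=BIN_SIZE, min_bases=MIN_BASES):
    """Return {bin_index: heterozygosity} for adequately sampled bins.

    reads is an iterable of (coordinate, heterozygous_positions,
    high_quality_bases) tuples for one chromosome (hypothetical layout).
    """
    het = defaultdict(int)
    bases = defaultdict(int)
    for coordinate, n_het, n_bases in reads:
        index = coordinate // bin_size
        het[index] += n_het
        bases[index] += n_bases
    # Keep only bins with enough examined bases to avoid sampling noise.
    return {index: het[index] / bases[index]
            for index in bases if bases[index] >= min_bases}


# Toy input: two reads fall in bin 0, one poorly sampled read in bin 1.
toy_reads = [(50_000, 3, 6_000), (150_000, 5, 6_000), (250_000, 1, 4_000)]
toy_bins = bin_heterozygosity(toy_reads)
```

In the toy input, bin 0 pools 8 heterozygous positions over 12,000 bases, while bin 1 is dropped because only 4,000 bases were examined. A histogram of the returned values corresponds to panel a, and the values in coordinate order for one chromosome correspond to panel b.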

that could influence this distribution: biological factors such as variation in mutation and recombination rates, historical forces such as bottlenecks40,41, expansions or admixture of differentiated populations, evolutionary selection, and methodological artefacts.

Regions of low diversity were more prevalent on the sex chromosomes. Whereas only 2.5% of 200,000-bp bins across the genome had π < 2.0 × 10⁻⁴, 15% of bins on the X chromosome42 and 89% on the Y chromosome (NRY) had these levels of diversity. Regions of low diversity may be explained by the smaller effective population size of the sex chromosomes and the variable underlying distribution of heterozygosity. Strong selection acting on the sex chromosomes in males might also have a role, but this hypothesis requires further testing. Regions of high heterozygosity were also observed. One was found on chromosome 6 (Fig. 2b, centred on 34 Mb), and was confirmed to represent the HLA locus, which has high nucleotide diversity owing to balancing selection43. Other regions of varying size were observed on this and other chromosomes (Fig. 2c and Supplementary Information). Some of these highly diverse regions might have also experienced balancing selection, but there are other possible explanations: for example, sampling fluctuations of the coalescent distribution, regions with high rates of mutation and/or recombination, unrecognized duplications in the human genome and sequencing of a rare haplotype by the HGP (to which the TSC reads were compared).

Given the unfinished state of publicly available sequence data and genome assembly, it will be important to reevaluate these estimates as more complete genome sequence becomes available.

Implications for medical and population genetics

We describe a map of publicly available SNPs (as of November 2000), fully integrated with the sequence, physical and genetic maps of the human genome. We anticipate immediate application to studies of human population genetics, candidate-gene studies for disease association, and eventually unbiased, genome-wide association scans. First, the map provides an unprecedented tool for studying the character of human sequence variation. We use these data to describe the first genome-wide view of how human DNA sequence varies in the population, and the public availability of these data should fuel future research into biological and population genetic influences on human genetic diversity.

Second, insights into human evolutionary history will be obtained by using SNPs from the map to characterize haplotype diversity throughout the genome. Human haplotype structure remains largely unexplored, and this map makes it possible to define the extent and variation of haplotype identity, the number and frequencies of common haplotypes, and their distribution among and within existing ethnic groups.

Most practically, where a gene has been implicated in causing disease (by chromosomal position relative to linkage peaks, known biological function or expression pattern), it is desirable exhaustively to survey allelic variation for any association to disease. Using the SNP map, it should be possible to evaluate the extent to which common haplotypes contribute to disease risk. As the speed and efficiency of SNP genotyping increases, such studies will fuel increasingly comprehensive tests of the hypothesis that common variants contribute significantly to the risk of common diseases. To the extent that such studies are successful, they should profoundly affect our understanding of disease, methods of diagnosis, and ultimately the development of new and more effective therapies.

Methods

SNP identification

Candidate SNPs were identified by detection of high-confidence base differences in aligned sequences. For TSC, sequence reads were filtered to exclude low quality reads and those containing predominantly known repetitive sequence. Sequences were aligned to each other using the reduced representation shotgun (RRS) method, and by genomic alignment (GA) as described18,22. For GA of TSC data, reads were compared to available large-insert clones (finished and draft with available PHRAP quality scores) in Genbank. For the analysis of clone overlaps, all available finished and unfinished genomic sequence accessions were aligned. Two methods were used to detect SNPs. The NQS relies upon the sequence trace quality surrounding the SNP base to increase base-calling confidence18,22; most data discovered using the NQS was processed using SsahaSNP, an ultrafast, hash-based implementation of the algorithm (Z.N., A. Cox and J.C.M, manuscript in preparation). The second method calculates confidence scores on the basis of a Bayesian analysis of confidence scores24. A variety of methods were used to find SNPs in expressed sequence tag (EST) overlaps24,25,27 and for targeted resequencing; details of the remaining SNPs can be found in the individual dbSNP entries (www.ncbi.nlm.nih.gov/SNP/).

Mapping of SNPs and features

MEGABLAST44 was used to align TSC SNP flanking sequences to the genomic sequence accessions. A SNP was considered mapped if a high-quality match (99% identity or greater) was found across the available flanking sequence of no less than 270 bp. SNPs that matched more than three accessions with identity > 98% were judged to be possible repetitive regions and set aside. SNP coordinates were generated relative to the OO18 build of the genome assembly (5 September 2000) and the OO15 build (15 July 2000), using the AGP format files provided by D. Haussler (http://genome.ucsc.edu).

The NCBI RefSeq mRNA transcripts31 were aligned to the Genome Assembly using the NCBI SPIDEY alignment tool. Alignment required > 97% sequence similarity between mRNA and genome sequence; alignments were refined by taking into account the donor/acceptor sites. In cases where CDS annotations were available in the GenBank record, exons of the CDS were aligned within the confines of the mRNA alignment. Regions of known human repeats were annotated directly using RepeatMasker (A. Smit, unpublished).

Nucleotide diversity analysis

To characterize nucleotide diversity, we required a data set in which all data could be analysed both for the number of high-quality bases meeting quality standards for SNP detection, and for the number of SNPs. To ensure homogeneity of analysis, we performed a single analysis of 4.5 million high-quality TSC reads from the Sanger Centre, Washington University in St. Louis and the Whitehead Center for Genome Research. The GC content of these reads was 41%, the same as the genome as a whole32, and the distribution of read GC content across deciles of the genome (sorted by GC content) was within 10% of the expected value for all bins. The read coverage was well distributed: 88% of contiguous 200,000-bp windows contained over 10,000 aligned bases (5%) surveyed for SNPs (see below). Using a single analytic tool (SsahaSNP, an implementation of the NQS; Z.N., A. Cox and J.C.M, in preparation), these reads were aligned to the available genome sequence (finished and draft with quality scores) and the number of high-quality bases (meeting NQS) and SNPs counted. We limited the analysis to SNPs found by genomic alignment so that the cluster depth of each comparison would be exactly two chromosomes. We precisely measured the target size for SNP discovery by counting the number of positions meeting the NQS. This is desirable because alignments contain positions of both high and low quality, but only those meeting the NQS are candidates for SNP discovery. Where a single TSC read aligned to multiple (overlapping) BACs from the HGP, we averaged the number of SNPs and aligned bp for all pairwise alignments of that read; this weighted evenly those reads mapping to a single BAC and those aligning to a region of overlap.

Reads representing repeat loci were excluded using validated criteria18,22: alignments of reads to genome were excluded if they were less than 99% identical. The genome was then divided into contiguous bins of 200,000 bp (based on chromosome-relative coordinates). Individual reads were filtered for repeats: any that aligned to more than one bin in the genome assembly were rejected. Finally, heterozygous positions and bases meeting the NQS were counted. As a final filter for regions containing a high proportion of repeats, we rejected any bin for which more than 10% of the reads mapping to that bin also mapped to another chromosome. Finally, to avoid statistical fluctuation due to inadequate sampling, we examined only the 88% of bins in which at least 10,000 aligned bases met the NQS and thus could be examined for SNPs.

Coalescent modelling was performed by simulation38, and assumed a constant-sized population of 10,000 individuals and a mutation rate adjusted for each read on the basis of its GC content (Fig. 2c) and length. To assess the standard deviation around this estimate, the simulation was repeated 100 times. For the observed data, calculating a standard deviation around the CV is difficult owing to the correlation of gene history for closely linked sites. In expectation, this correlation should not alter the mean of the observed coefficient of variation, but does influence its variance. To estimate the variance around the CV for the observed data, we selected 100 reduced data sets, each containing one randomly chosen read from each 200,000-bp bin along the autosomes. In using this approach, we assume that these reads, 200,000 bp apart and sampled from unrelated individuals, have independent genealogies. This random sampling procedure was repeated 100 times to estimate the mean and variance of the observed CV.

The data for the heterozygosity analysis, including the coordinates of each bin, the number of bases examined and number of SNPs identified, are available as Supplementary Information.

Received 28 November; accepted 27 December 2000.

1. Collins, F. S. Of needles and haystacks: finding human disease genes by positional cloning. Clin. Res. 39, 615–623 (1991).
2. Collins, F. S., Guyer, M. S. & Chakravarti, A. Variations on a theme: cataloging human DNA sequence variation. Science 278, 1580–1581 (1997).
3. Lander, E. S. The new genomics: global views of biology. Science 274, 536–539 (1996).
4. Risch, N. & Merikangas, K. The future of genetic studies of complex human diseases. Science 273, 1516–1517 (1996).


5. Li, W. H. & Sadler, L. A. Low nucleotide diversity in man. Genetics 129, 513–523 (1991).
6. Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes [published erratum appears in Nature Genet. 23, 373 (1999)]. Nature Genet. 22, 231–238 (1999).
7. Cambien, F. et al. Sequence diversity in 36 candidate genes for cardiovascular disorders. Am. J. Hum. Genet. 65, 183–191 (1999).
8. Fullerton, S. M. et al. Apolipoprotein E variation at the sequence haplotype level: implications for the origin and maintenance of a major human polymorphism. Am. J. Hum. Genet. 67, 881–900 (2000).
9. Halushka, M. K. et al. Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nature Genet. 22, 239–247 (1999).
10. Nickerson, D. A. et al. DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nature Genet. 19, 233–240 (1998).
11. Rieder, M. J., Taylor, S. L., Clark, A. G. & Nickerson, D. A. Sequence variation in the human angiotensin converting enzyme. Nature Genet. 22, 59–62 (1999).
12. Templeton, A. R., Weiss, K. M., Nickerson, D. A., Boerwinkle, E. & Sing, C. F. Cladistic structure within the human lipoprotein lipase gene and its implications for phenotypic association studies. Genetics 156, 1259–1275 (2000).
13. Eaves, I. A. et al. The genetically isolated populations of Finland and Sardinia may not be a panacea for linkage disequilibrium mapping of common disease genes. Nature Genet. 25, 320–323 (2000).
14. Taillon-Miller, P. et al. Juxtaposed regions of extensive and minimal linkage disequilibrium in human Xq25 and Xq28. Nature Genet. 25, 324–328 (2000).
15. Kruglyak, L. Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nature Genet. 22, 139–144 (1999).
16. Collins, A., Lonjou, C. & Morton, N. E. Genetic epidemiology of single-nucleotide polymorphisms. Proc. Natl Acad. Sci. USA 96, 15173–15177 (1999).
17. Reich, D. E. et al. Linkage disequilibrium in the human genome. Nature (submitted).
18. Altshuler, D. et al. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407, 513–516 (2000).
19. Nachman, M. W., Bauer, V. L., Crowell, S. L. & Aquadro, C. F. DNA variability and recombination rates at X-linked loci in humans. Genetics 150, 1133–1141 (1998).
20. Wang, D. G. et al. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 280, 1077–1082 (1998).
21. Jorde, L. B. Linkage disequilibrium and the search for complex disease genes. Genome Res. 10, 1435–1444 (2000).
22. Mullikin, J. C. et al. An SNP map of human chromosome 22. Nature 407, 516–520 (2000).
23. Collins, F. S., Brooks, L. D. & Chakravarti, A. A DNA polymorphism discovery resource for research on human genetic variation [published erratum appears in Genome Res. 9, 210 (1999)]. Genome Res. 8, 1229–1231 (1998).
24. Marth, G. T. et al. A general approach to single-nucleotide polymorphism discovery. Nature Genet. 23, 452–456 (1999).
25. Buetow, K. H., Edmonson, M. N. & Cassidy, A. B. Reliable identification of large numbers of candidate SNPs from public EST data. Nature Genet. 21, 323–325 (1999).
26. Gu, Z., Hillier, L. & Kwok, P. Y. Single nucleotide polymorphism hunting in cyberspace. Hum. Mutat. 12, 221–225 (1998).
27. Irizarry, K. et al. Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nature Genet. 26, 233–236 (2000).
28. Picoult-Newberg, L. et al. Mining SNPs from EST databases. Genome Res. 9, 167–174 (1999).
29. Marth, G. T. et al. Single nucleotide polymorphisms in the public database: how useful are they? Nature Genet. (submitted).
30. Yang, Z. et al. Sampling SNPs. Nature Genet. 26, 13–14 (2000).
31. Pruitt, K. D., Katz, K. S., Sicotte, H. & Maglott, D. R. Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends Genet. 16, 44–47 (2000).
32. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
33. Bohossian, H. B., Skaletsky, H. & Page, D. C. Unexpectedly similar rates of nucleotide substitution found in male and female hominids. Nature 406, 622–625 (2000).
34. Cooke, H. J., Brown, W. R. & Rappold, G. A. Hypervariable telomeric sequences from the human sex chromosomes are pseudoautosomal. Nature 317, 687–692 (1985).
35. Shen, P. et al. Population genetic implications from sequence variation in four Y chromosome genes. Proc. Natl Acad. Sci. USA 97, 7354–7359 (2000).
36. Underhill, P. A. et al. Detection of numerous Y chromosome biallelic polymorphisms by denaturing high-performance liquid chromatography. Genome Res. 7, 996–1005 (1997).
37. Tajima, F. Evolutionary relationship of DNA sequences in finite populations. Genetics 105, 437–460 (1983).
38. Hudson, R. R. in Oxford Surveys in Evolutionary Biology (eds Futuyma, D. & Antonovics, J.) 1–44 (Oxford Univ. Press, Oxford, 1991).
39. Lindblad-Toh, K. et al. Large-scale discovery and genotyping of single-nucleotide polymorphisms in the mouse. Nature Genet. 24, 381–386 (2000).
40. Kimmel, M. et al. Signatures of population expansion in microsatellite repeat data. Genetics 148, 1921–1930 (1998).
41. Reich, D. E. & Goldstein, D. B. Genetic evidence for a Paleolithic human population expansion in Africa [published erratum appears in Proc. Natl Acad. Sci. USA 95, 11026 (1998)]. Proc. Natl Acad. Sci. USA 95, 8119–8123 (1998).
42. Miller, R. D., Taillon-Miller, P. & Kwok, P. Y. Regions of low single-nucleotide polymorphism (SNP) incidence in human and orangutan Xq: deserts and recent coalescences. Genomics (in the press).
43. Horton, R. et al. Large-scale sequence comparisons reveal unusually high levels of variation in the HLA-DQB1 locus in the class II region of the human MHC. J. Mol. Biol. 282, 71–97 (1998).
44. Zhang, Z., Schwartz, S., Wagner, L. & Miller, W. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214 (2000).

Supplementary Information is available on Nature's World-Wide Web site (http://www.nature.com) or as paper copy from the London editorial office of Nature.

Acknowledgements

The SNP Consortium, the Wellcome Trust and the National Human Genome Research Institute funded SNP discovery and data management at Cold Spring Harbor Laboratories, The Sanger Centre, Washington University in St. Louis, and the Whitehead/MIT Center for Genome Research. Work in P.Y.K.'s laboratory is supported in part by grants from the SNP Consortium and the National Human Genome Research Institute. P.Y.K. thanks Q. Li, M. Minton, R. Donaldson and S. Duan for technical assistance. D.M.A. was supported during a phase of this work under a Postdoctoral Fellowship for Physicians from the Howard Hughes Medical Institute. For full list of contributors to TSC programme, see www.snp.cshl.org.

Correspondence and requests for materials should be addressed to D.A. (e-mail: [email protected]) or D.B. (e-mail: [email protected]).

* The International SNP Map Working Group (contributing institutions are listed alphabetically).

Cold Spring Harbor Laboratories: Ravi Sachidanandam1, David Weissman1, Steven C. Schmidt1, Jerzy M. Kakol1 & Lincoln D. Stein1

National Center for Biotechnology Information: Gabor Marth2 & Steve Sherry2

The Sanger Centre: James C. Mullikin3, Beverley J. Mortimore3, David L. Willey3, Sarah E. Hunt3, Charlotte G. Cole3, Penny C. Coggill3, Catherine M. Rice3, Zemin Ning3, Jane Rogers3, David R. Bentley3

Washington University in St. Louis: Pui-Yan Kwok4, Elaine R. Mardis4, Raymond T. Yeh4, Brian Schultz4, Lisa Cook4, Ruth Davenport4, Michael Dante4, Lucinda Fulton4, LaDeana Hillier4, Robert H. Waterston4 & John D. McPherson4

Whitehead/MIT Center for Genome Research: Brian Gilman5, Stephen Schaffner5, William J. Van Etten5,6, David Reich5, John Higgins5, Mark J. Daly5, Brendan Blumenstiel5, Jennifer Baldwin5, Nicole Stange-Thomann5, Michael C. Zody5, Lauren Linton5, Eric S. Lander5,7 & David Altshuler5,8

1, Cold Spring Harbor, New York 11724, USA; 2, Building 38A, 8600 Rockville Pike, Bethesda, Maryland 20894, USA; 3, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK; 4, 660 S. Euclid Ave, St. Louis, Missouri 63110, USA; 5, 9 Cambridge Center, Cambridge, Massachusetts 02139, USA; 6, Present address: Blackstone Technology Group, Boston, Massachusetts 02110, USA; 7, Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142, USA; 8, Departments of Genetics and Medicine, Harvard Medical School; Department of Molecular Biology and Diabetes Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA.

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com © 2001 Macmillan Magazines Ltd

© 1999 Nature America Inc. • http://genetics.nature.com
letter

A general approach to single-nucleotide polymorphism discovery

Gabor T. Marth1, Ian Korf1, Mark D. Yandell1, Raymond T. Yeh1, Zhijie Gu2, Hamideh Zakeri2, Nathan O. Stitziel1, LaDeana Hillier1, Pui-Yan Kwok2 & Warren R. Gish1

Single-nucleotide polymorphisms (SNPs) are the most abundant form of human genetic variation and a resource for mapping complex genetic traits1. The large volume of data produced by high-throughput sequencing projects is a rich and largely untapped source of SNPs (refs 2–5). We present here a unified approach to the discovery of variations in genetic sequence data of arbitrary DNA sources. We propose to use the rapidly emerging genomic sequence6,7 as a template on which to layer often unmapped, fragmentary sequence data8–11 and to use base quality values12 to discern true allelic variations from sequencing errors. By taking advantage of the genomic sequence we are able to use simpler yet more accurate methods for sequence organization: fragment clustering, paralogue identification and multiple alignment. We analyse these sequences with a novel, Bayesian inference engine, POLYBAYES, to calculate the probability that a given site is polymorphic. Rigorous treatment of base quality permits completely automated evaluation of the full length of all sequences, without limitations on alignment depth. We demonstrate this approach by accurate SNP predictions in human ESTs aligned to finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery.

We started with 1,268,211 bp finished (less than 1 error per 10,000 bp) human reference sequence of 10 genomic clones, with EST content typical of gene-bearing clones. To initiate the analysis procedure (Fig. 1) to identify human ESTs that originated from these clones, we performed a database search against the public EST set (dbEST) and recovered 1,954 hits (representing potentially multiple exons of 1,365 unique ESTs) for which chromatograms were available. Sequence clusters were constructed as groups of overlapping alignments (147 clusters). Sequence traces were re-processed with the PHRED base-calling program13,14 to obtain base quality values. Subsequent analyses used the full length of the ESTs, including low-quality portions. Cluster members were multiply aligned with an anchored alignment technique. Unlike traditional algorithms, this method rapidly produces correct multiple alignments even in the presence of abundantly expressed or alternatively spliced transcripts. In total, EST clusters represented 80,469 bp of expressed genomic sequence, 38% of this in regions of single EST coverage and 81% in regions covered by 8 or fewer ESTs (Table 1).

Inclusion of sequences representing highly similar regions duplicated elsewhere in the genome may give rise to false SNP predictions, and the presence of such sequence paralogues points to difficulties during marker development. We devised a Bayesian15

[Figure 1: panels a–g. Panel labels: genomic anchor, ESTs, paralogues, native ESTs, candidate SNP, STS, confirmed SNP, trace from CHM1 DNA, trace from DNA pool.]

Fig. 1 Application of the POLYBAYES procedure to EST data. a, Regions of known human repeats in a genomic sequence are masked. b, Matching human ESTs are retrieved from dbEST and traces are re-called. c, Paralogous ESTs are identified and discarded. d, Alignments of native EST reads are screened for candidate variable sites. e, An STS is designed for the verification of a candidate SNP. f, The uniqueness of the genomic location is determined by sequencing the STS in CHM1 (homozygous DNA). g, The presence of a SNP is analysed by sequencing the STS from pooled DNA samples.

1Department of Genetics and Genome Sequencing Center and 2Division of Dermatology, Washington University, St. Louis, Missouri, USA. Correspondence should be addressed to G.T.M. (e-mail: [email protected]) or P.-Y.K. (e-mail: [email protected]).

nature genetics • volume 23 • december 1999

Table 1 • SNP discovery in EST alignments of varying coverage

             No. of clusters                  No. of aligned sites               Distribution of SNPs
Depth(a)     before          after            before            after            Candidate(f)  Analysed(g)  Confirmed(h)  Confirmation
             paralogue       paralogue        paralogue         paralogue                                                  rate(i)
             filtering(b)    filtering(c)     filtering(d)      filtering(e)
1            47 (32.0%)      40 (32.0%)       30,828 (38.3%)    26,275 (37.7%)   12 (22.2%)    6 (16.7%)    5 (25.0%)     83%
2            25 (17.0%)      24 (19.2%)       15,771 (19.6%)    15,072 (21.6%)   8 (14.8%)     7 (19.4%)    2 (10.0%)     29%
3,4          23 (15.6%)      21 (16.8%)       12,478 (15.5%)    9,937 (14.2%)    17 (31.5%)    8 (22.2%)    5 (25.0%)     63%
5–8          17 (11.6%)      14 (11.2%)       6,627 (8.2%)      5,467 (7.8%)     7 (13.0%)     7 (19.4%)    1 (5.0%)      14%
9–16         14 (9.5%)       8 (6.4%)         7,704 (9.6%)      6,383 (9.2%)     3 (5.5%)      3 (8.4%)     3 (15.0%)     100%
17 or more   21 (14.3%)      18 (14.4%)       7,061 (8.8%)      6,662 (9.5%)     7 (13%)       5 (13.9%)    4 (20.0%)     80%
Total        147 (100%)      125 (100%)       80,469 (100%)     69,756 (100%)    54 (100%)     36 (100%)    20 (100%)     Overall 56%

(a) Depth of coverage (or cluster size), not including the genomic reference sequence.
(b) Number of clusters of given cluster size before removal of paralogous ESTs.
(c) Number of clusters of given cluster size after removal of paralogous ESTs.
(d) Number of sites of given alignment depth in multiple alignments before removal of paralogous ESTs.
(e) Number of sites of given alignment depth in multiple alignments after removal of paralogous ESTs.
(f) Number of candidate SNPs found at sites of given alignment depth.
(g) Number of unambiguously analysed candidate SNPs.
(h) Number of SNPs confirmed in at least one of four population pools.
(i) SNP confirmation rate.
(b–i) Numbers in parentheses indicate percentages of relevant total.
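The confirmation rates in Table 1 can be re-derived from the analysed and confirmed counts. The following is a minimal sketch of that arithmetic; the dictionary layout and the function name `confirmation_rate` are illustrative choices, not part of POLYBAYES, and rounding half up is an assumption made so that 5 of 8 is reported as 63%.

```python
# Analysed and confirmed SNP counts per alignment depth, copied from Table 1.
# Keys are the depth classes; values are (analysed, confirmed).
table1 = {
    "1": (6, 5),
    "2": (7, 2),
    "3,4": (8, 5),
    "5-8": (7, 1),
    "9-16": (3, 3),
    "17 or more": (5, 4),
}


def confirmation_rate(analysed, confirmed):
    """Return confirmed/analysed as a whole-number percentage, rounding half up."""
    return int(100 * confirmed / analysed + 0.5)


# Per-depth confirmation rates.
rates = {depth: confirmation_rate(a, c) for depth, (a, c) in table1.items()}

# Column totals and the overall confirmation rate.
total_analysed = sum(a for a, _ in table1.values())
total_confirmed = sum(c for _, c in table1.values())
overall = confirmation_rate(total_analysed, total_confirmed)

for depth, rate in rates.items():
    print(f"depth {depth}: {rate}%")
print(f"overall: {total_confirmed}/{total_analysed} = {overall}%")
```

The totals come to 20 confirmed out of 36 analysed candidates, in line with the overall 56% reported in the table.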

discrimination algorithm (Fig. 2a) that takes into account base quality values to calculate the probability, PNAT, that a cluster member is native to (derived from) the given genomic region. The bimodal distribution of these probability values (Fig. 2b) indicates that we can distinguish between less accurate sequences that nevertheless originate from the same underlying genomic location, and more accurate sequences with high-quality discrepancies that are likely to be paralogous. Using a conservative threshold value, PNAT,MIN, of 0.75, 23% of cluster members were declared paralogous and removed from further consideration, leaving 69,756 sites of native EST coverage.

Once a proper data set is organized, the key to reliable detection of SNPs is the ability to discern true allelic variation from sequencing error. To this end, we have developed a Bayesian-statistical model for the mathematically rigorous treatment of sequence differences within a multiple alignment that takes into account the depth of coverage, the base quality values of the sequences and the a priori expected rate of polymorphic sites in the region. For each site within a multiple alignment of native sequences, the POLYBAYES algorithm calculates the probability, PSNP, that the site is polymorphic, as opposed to monomorphic. The distribution of probability scores (Fig. 3a) exhibits a high level of specificity: most sites (99.83%) produce scores below 0.1. They represent sites either with no disagreements between aligned sequences or with low-quality discrepancies that are likely the result of sequencing errors or possibly very rare SNPs. By marking a site as a candidate SNP if the corresponding SNP probability exceeded a threshold value, PSNP,MIN, of 0.40, we extracted 97 candidates. Of these, 38 were located in adenine-rich regions of the genomic clones matching the 3´ ends of ESTs. Subsequent negative verification results are consistent with the hypothesis16 that these sites result from internal priming events during cDNA library construction and that the adenine allele is contributed by the reverse transcription primer rather than the RNA template.

We validated candidate sites with a pooled sequencing approach17 that allowed us to confirm true positives, provided the minor allele frequency was above 10%. We eliminated five candidates that did not fulfil this requirement. An additional 18 sites could not be analysed for lack of unique amplification (9 candidates in regions of low complexity or repetitive sequence, 4 candidates for unknown reasons and in 5 cases, the homozygous control genome18 indicated the presence of paralogues absent in the EST set). Of the remaining 36 sites, 20 were confirmed in at least 1 of 4 populations screened (13 transitions, 7 transversions), yielding a 56% overall confirmation rate.

The confirmation rate is somewhat lower than the average SNP score of 0.78. Some of this difference may be due to systematic base-calling errors (compressions) and reverse transcriptase errors introduced during cDNA library construction. Several of the candidate sites may be true polymorphisms specific to the donors of the cDNA samples but absent in the population pools used in verification. Although precise calibration of the SNP probability values would require analysing the genomic source of

[Figure 2: a, probability (0 to 1) against number of discrepancies (d, 0 to 15), with curves for P(d|ModelNAT), P(d|ModelPAR) and PNAT, and with DNAT, DPAR and PNAT,MIN marked. b, number of ESTs (0 to 1,100) against PNAT (0 to 1.0), with PNAT,MIN separating paralogous from native sequences.]

Fig. 2 Paralogue discrimination. a, Example probability distributions for a matching sequence with (hypothetical) uniform base quality values of 20, in pair-wise alignment with base-perfect genomic anchor sequence (quality values 40), over a length of 250 bp. PPOLY,2 = 0.001, PPAR = 0.02, E = 2.525, DNAT = 2.775 and DPAR = 7.525. If the posterior probability, PNAT, is higher than PNAT,MIN, the EST is considered native; otherwise, it is considered paralogous. b, Distribution of the posterior probability values, PNAT, calculated for 1,954 cluster members anchored to ten genomic clone sequences.
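The example in Fig. 2a can be reproduced numerically. The sketch below assumes the standard PHRED relation between quality value and error probability, Q = –10 log10(P), together with the Poisson likelihoods and flat priors described in Methods. The function names are illustrative and are not taken from the POLYBAYES software.

```python
import math


def phred_to_error(quality):
    """Convert a PHRED quality value to an error probability (Q = -10 log10 P)."""
    return 10 ** (-quality / 10)


def expected_discrepancies(length, rate, est_quality, anchor_quality):
    """Expected mismatches: length * rate plus the sequencing-error term E.

    E is approximated as the sum of base error probabilities of both
    sequences along the alignment, assuming uniform quality values.
    """
    e = length * (phred_to_error(est_quality) + phred_to_error(anchor_quality))
    return length * rate + e


def poisson_pmf(d, lam):
    """Probability of observing d events under a Poisson distribution."""
    return math.exp(-lam) * lam ** d / math.factorial(d)


def p_native(d, d_nat, d_par):
    """Posterior probability that an EST is native, with flat priors."""
    like_nat = poisson_pmf(d, d_nat)
    like_par = poisson_pmf(d, d_par)
    return like_nat / (like_nat + like_par)


# Parameters from the Fig. 2a example: 250 bp, EST quality 20, anchor quality 40.
d_nat = expected_discrepancies(250, 0.001, 20, 40)  # PPOLY,2 = 0.001
d_par = expected_discrepancies(250, 0.02, 20, 40)   # PPAR = 0.02

for d in range(8):
    print(f"d = {d}: PNAT = {p_native(d, d_nat, d_par):.3f}")
```

Under these assumptions the expected discrepancy counts match the caption (2.775 and 7.525), and with PNAT,MIN = 0.75 a 250-bp match with three discrepancies would still be called native while one with four would be called paralogous.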


[Figure 3: a, number of sites (0 to 100; the lowest-score bar is truncated and labelled 69639) against PSNP (0.1 to 1.0), with PSNP,MIN marked. b, per cent of SNPs (0 to 100) in PSNP score bins 0.40–0.59, 0.60–0.79 and 0.80–1.00.]

Fig. 3 SNP probability scores. a, Distribution of the posterior probability value that a site is polymorphic, PSNP, for 69,756 sites in multiple alignments of native ESTs. b, Correlation between PSNP score and confirmation rate. The fraction of confirmed candidate SNPs (striped bars) and the fraction of candidate SNPs that were not detected in population-specific DNA pools (shaded bars) are shown. The absolute number of SNPs is shown above each bar.
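To illustrate how a score such as PSNP responds to base quality, the sketch below enumerates all 4^N nucleotide permutations for one alignment column, as outlined in Methods. It is not the POLYBAYES implementation: it uses brute-force enumeration rather than the recursive algorithm, and it makes the simplifying assumption that the prior PPOLY is shared equally among all heterogeneous permutations, whereas the paper gives exponentially lower shares to higher nucleotide multiplicities. The function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def base_likelihoods(called, quality):
    """P(S|R): 1 - P_error for the called base, P_error / 3 for each other base."""
    p_err = 10 ** (-quality / 10)
    return {b: (1 - p_err) if b == called else p_err / 3 for b in BASES}


def snp_probability(reads, p_poly=0.003):
    """Posterior probability that an alignment column is polymorphic.

    reads is a list of (called_base, quality) pairs, one per sequence.
    ASSUMPTION: heterogeneous permutations share p_poly equally.
    """
    n = len(reads)
    if n < 2:
        raise ValueError("at least two aligned bases are required")
    likes = [base_likelihoods(called, q) for called, q in reads]
    prior_mono = (1 - p_poly) / 4        # each of the 4 monomorphic permutations
    prior_het = p_poly / (4 ** n - 4)    # simplified flat share per permutation
    mono = het = 0.0
    for perm in product(BASES, repeat=n):
        weight = 1.0
        for like, s in zip(likes, perm):
            weight *= like[s] / 0.25     # divide by uniform base prior PPrior(Si)
        if len(set(perm)) == 1:
            mono += weight * prior_mono
        else:
            het += weight * prior_het
    return het / (mono + het)


agree = snp_probability([("A", 40)] * 4)
low_quality = snp_probability([("A", 40), ("A", 40), ("A", 40), ("G", 5)])
high_quality = snp_probability([("A", 40), ("A", 40), ("G", 40), ("G", 40)])
print(agree, low_quality, high_quality)
```

With this simplified prior, a column with no disagreement or with a single low-quality discrepancy scores far below the 0.40 threshold, while a column with two high-quality reads for each allele scores above it. Enumeration costs 4^N operations, so the sketch is practical only for shallow columns.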

[Figure 4: a, quality value (0 to 50) against allele frequency (5% to 50%) for alignment depths 20, 40 and 60. b, quality value (0 to 50) against alignment depth (2 to 10) for PSNP = 0.2, 0.4, 0.6 and 0.8.]

Fig. 4 Sensitivity of the SNP detection algorithm. a, Minimum base quality requirement for the detection of minor alleles of a given frequency, in alignments of depth N = 20, 40, 60, at a detection threshold value PSNP,MIN = 0.40. b, Base quality requirement for the detection of a single minor allele in alignments of depth N = 2,…,10, and SNP probability threshold values PSNP,MIN = 0.20, 0.40, 0.60 and 0.80. In (a,b), the quality value for each base was assumed to be uniform.

each EST, an undertaking beyond the scope of this study, higher SNP probability scores correspond to higher confirmation rates (Fig. 3b). This is the true significance of the SNP score: it enables one to strike a balance between true positive rates and the recovery of low-frequency alleles. Using a higher detection threshold reduces the number of false positives, but also discards more true polymorphic sites. Conversely, the recovery of rare SNPs requires a lower threshold, which in turn increases the false-positive rate, reflective of the fact that rare alleles or alleles in low-quality sequence are indistinguishable from sequencing error. The sensitivity of the algorithm as a function of allele frequency, sequence quality, alignment depth and SNP probability threshold is reported (Fig. 4). The algorithm successfully detected variations in clusters containing a single EST aligned to the reference sequence (five confirmed sites), indicating that POLYBAYES is effective even in very shallow alignments (Table 1). For the same reasons, our mining efficiency (1 candidate per 25 ESTs and 1 confirmed SNP per 68 ESTs analysed) compares favourably with recently published results4,5.

During verification of candidates, we found only two novel SNPs in 11,455 bp of STS sequence. One SNP was outside an EST cluster and could not have been found in the data set. The other one was a rare variation present in one of four sampled populations, but not within the EST cluster members. The dearth of novel SNPs unique to the population pools suggests that the ESTs contained most common variations in the analysed regions, and that POLYBAYES successfully detected them.

We evaluated the performance of POLYBAYES with assembled shotgun, ‘working-draft’ quality genomic reference sequence. To this end, we simulated clone sequences of 2–6-fold shotgun coverage by reassembling random subsets of the original shotgun reads for 5 of 10 clones with the PHRAP (P. Green, unpublished data) fragment assembler. Using the resulting contig sequences as a reference, we repeated the subsequent SNP analysis with unchanged parameters (Fig. 5). Even at threefold shotgun coverage, an average 94% of ESTs were identified and 81% of confirmed SNPs detected (respectively, 98% and 94% at fivefold coverage), indicating that POLYBAYES does not require base-perfect reference sequence to be effective and will work well with draft-quality sequences that have begun to dominate sequence production19.

Because expressed regions comprise but a small fraction of the genome, polymorphic sites recovered from ESTs alone, however valuable, are


unlikely to produce a SNP marker map of uniformly high density. Through the coordinated efforts of large-scale sequencing efforts worldwide, the nearly complete sequence of the human genome will soon be available, augmented by the generation of a staggering amount of fragmentary sequences. Our study demonstrates that through precise treatment of the data, combined with objective evaluation of data quality, it is possible to discover variations in these sequences with great efficiency, contributing to the creation of valuable resources20–22 with which to analyse complex genetic traits and further our understanding of human origins.

[Figure 5: bar chart of per cent recovery (0 to 100) against shotgun depth (2× to 6×).]

Fig. 5 SNP detection with assembled shotgun genomic reference sequence. Fractions of ESTs recovered (white bars) and SNPs recovered (grey bars) are shown. Percentages were based on the 733 ESTs anchored by 5 of 10 genomic clones in the primary experiment, and the 14 confirmed SNPs detected among these sequences. Error bars indicate standard deviation among 20 consecutive experiments.

Methods

Data organization. Known human repeats in the genomic sequences were masked with RepeatMasker (A.F.A. Smit and P. Green, unpublished data) and searched against dbEST with WU-BLAST (W.R.G., http://blast.wustl.edu) with parameters: M=5, N=–11, Q=11, R=11, S=170, gapS2=150, filter=seg (P-value cutoff 10^–50). Sequence traces that were available at the Washington University ftp site (ftp://genome.wustl.edu/pub/gsc1/est) were processed with the PHRED base-calling program; the full length of each sequence, together with base quality values (expressing the likelihood that the called nucleotide is incorrect), was used in the subsequent analysis. Distinct groups of matching ESTs were registered as clusters. Each cluster member was first pair-wise aligned to the genomic anchor sequence with CROSS_MATCH (P. Green, unpublished data). We then produced a multiple alignment by propagating gaps and insertions in the pair-wise alignments into all remaining sequences, a procedure known as ‘sequence padding’. The computational complexity of the algorithm grows linearly with the length and number of sequences.

Paralogue identification. We identified paralogous sequences by determining if the number of mismatches observed between the genomic reference sequence and a matching EST was consistent with polymorphic variation as opposed to sequence difference between duplicated chromosomal locations, taking into account sequence quality. On the basis of our annotation experience of over 40 Mb of genomic sequence, we stipulated that most ‘paralogous’ sequences exhibit a pair-wise dissimilarity rate higher than PPAR = 0.02 (2%) compared with the average pair-wise polymorphism rate, PPOLY,2 = 0.001 (0.1%). In a pair-wise match of length L, we expect L × PPOLY,2 mismatches due to polymorphism, versus L × PPAR mismatches due to paralogous difference. In both cases, an additional number, E, of mismatches are expected to arise from sequencing errors, approximated as the sum of base error probabilities of both sequences (calculated from the base quality values) along the pair-wise alignment. We considered two models: an EST is either native (ModelNAT) and we expect DNAT = L × PPOLY,2 + E discrepancies, or it is paralogous (ModelPAR) and we expect DPAR = L × PPAR + E mismatches. The probability of observing d discrepancies in the pair-wise alignment is approximated by a Poisson distribution, with parameter λ = DNAT for ModelNAT and λ = DPAR for ModelPAR. In absence of reliable a priori knowledge of the expected proportions of native versus paralogous ESTs, we used uninformed (flat) priors. The posterior probability, PNAT = P(ModelNAT|d), that the EST represents native sequence was determined as:

$$P(\mathrm{Model}_{NAT} \mid d) = \frac{1}{1 + e^{(D_{NAT} - D_{PAR})} \left( \dfrac{D_{PAR}}{D_{NAT}} \right)^{d}}$$

ESTs that scored above a cutoff value, PNAT,MIN, were considered native; sequences scoring below the threshold were declared paralogous.

SNP detection in multiple alignments. The algorithm identifies polymorphic locations by evaluating the likelihood of nucleotide heterogeneity within cross-sections of a multiple alignment. Each of the nucleotides, S1,…,SN, in such a cross-section of N sequences, R1,…,RN, can be any one of the four DNA bases, for a total of 4^N nucleotide permutations. The likelihood, P(Si|Ri), that a nucleotide, Si, is A, C, G or T is estimated from the error probability, PError,i, obtained from the base quality value. We assign (1–PError,i) to the called base and (PError,i/3) to each of the three uncalled bases. In the absence of likelihood estimates, insertions and deletions are not considered. Each heterogeneous (polymorphic) permutation is classified according to its nucleotide multiplicity, the specific variation and the distribution of alleles. We used the value PPOLY = 0.003 (1 polymorphic site in 333 bp) as the total a priori probability that a site is polymorphic21,22 (1/1,000 polymorphism rate between any pair of sequences). This value was distributed to assign a prior probability, PPrior(S1,…,SN), to each permutation. Permutations of higher nucleotide multiplicities received exponentially lower shares, in accordance with a random allele generation model. In this study, we assigned equal shares to different variation types (although unequal shares can be specified in the software to account for a higher rate of transitions compared to transversions). A prior value of (1–PPOLY)/4 was assigned to each of the four non-polymorphic permutations, corresponding to a uniform base composition, PPrior(Si). The Bayesian posterior probability of a particular nucleotide permutation was calculated through another application of Bayesian inference, considering the 4^N different permutations as the set of conflicting models:

$$P(S_1,\dots,S_N \mid R_1,\dots,R_N) = \frac{\dfrac{P(S_1 \mid R_1)}{P_{Prior}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{Prior}(S_N)} \; P_{Prior}(S_1,\dots,S_N)}{\displaystyle\sum_{\text{every } (S_{i_1},\dots,S_{i_N})} \dfrac{P(S_{i_1} \mid R_1)}{P_{Prior}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{Prior}(S_{i_N})} \; P_{Prior}(S_{i_1},\dots,S_{i_N})}$$

The Bayesian posterior probability of a SNP, PSNP, is the sum of posterior probabilities of all heterogeneous permutations. The computation is performed with an efficient, recursive algorithm. A site within a multiple alignment is reported as a candidate SNP if the corresponding posterior probability exceeds a set threshold value, PSNP,MIN. We examined the sensitivity of the detection algorithm under the simplifying assumption of uniform base quality. We determined the relationship between observed minor allele frequency and base quality to produce a SNP probability score PSNP = 0.4 (the threshold value used in this study), in alignments of various depths of coverage. We also determined the minimum base quality value required for detecting a single minor allele in alignments of ten or fewer sequences at various threshold values.

Software. POLYBAYES was developed in a UNIX environment and runs efficiently on a conventional workstation. Sequence clustering is performed with custom scripts. The anchored alignment, paralogue filtering and SNP detection are accessed through a single program. SNP locations and probabilities are reported in text files or as a database compatible with the CONSED sequence editor23 to enable viewing the multiple alignments,


quality values, sequence traces and annotated SNPs in toto. Instructions for obtaining POLYBAYES are available (at no cost for non-profit use, see http://genome.wustl.edu/gsc/polybayes).

Accession numbers. 127H14, NID:g2439515; DJ0604G05, NID:g3006227; DJ0777O23, NID:g3242763; DJ327A19, NID:g2341021; GS345D13, NID:g2078461; GS541B18, NID:g2781380; GS542D18, NID:g2388554; RG085C05, NID:g1669367; RG104I04, NID:g1809226; RG119C02, NID:g3004572. Confirmed SNPs were submitted to dbSNP, NCBI assay ID 4277–4281, 4618–4628, 4643–4648.

Acknowledgements
We thank T. Blackwell and S. Eddy for informative discussions during the development of the mathematical framework of the technique. This work was supported by NIH grants P50HG01458 (L.H. and W.R.G.), R01HG1720 (P.-Y.K.) and T32AR07284 (Z.G.), and an equipment loan from Compaq Computer Corporation.

Received 17 August; accepted 18 October 1999.

1. Collins, F.S., Guyer, M.S. & Chakravarti, A. Variations on a theme: cataloging human DNA sequence variation. Science 278, 1580–1581 (1997).
2. Wang, D.G. et al. Large-scale identification, mapping, and genotyping of single nucleotide polymorphisms in the human genome. Science 280, 1077–1082 (1998).
3. Taillon-Miller, P., Gu, Z., Hillier, L. & Kwok, P.-Y. Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res. 8, 748–754 (1998).
4. Picoult-Newberg, L. et al. Mining SNPs from EST databases. Genome Res. 9, 167–174 (1999).
5. Buetow, K.H., Edmondson, M.N. & Cassidy, A.B. Reliable identification of large numbers of candidate SNPs from public EST data. Nature Genet. 21, 323–325 (1999).
6. The Sanger Centre & The Washington University Genome Sequencing Center. Toward a complete human genome sequence. Genome Res. 8, 1097–1108 (1998).
7. Venter, J.C. et al. Shotgun sequencing of the human genome. Science 280, 1540–1542 (1998).
8. Hillier, L. et al. Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6, 807–828 (1996).
9. Adams, M.D., Soares, M.B., Kerlavage, A.R., Fields, C. & Venter, J.C. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nature Genet. 4, 373–380 (1993).
10. Hudson, T.J. et al. An STS-based map of the human genome. Science 270, 1945–1954 (1995).
11. Marra, M., Weinstock, L.A. & Mardis, E.R. End sequence determination from large insert clones using energy transfer fluorescent primers. Genome Res. 6, 1118–1122 (1996).
12. Durbin, R. & Dear, S. Base qualities help sequencing software. Genome Res. 8, 161–162 (1998).
13. Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated traces using Phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
14. Ewing, B. & Green, P. Base-calling of automated traces using Phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).
15. Bayes, T. An essay towards solving a problem in the doctrine of chances. Philos. Trans. R. Soc. 53, 370–418 (1763). Reprinted in Biometrika 45, 293–315 (1958).
16. Aaronson, J. et al. Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data. Genome Res. 6, 829–845 (1996).
17. Kwok, P.-Y., Carlson, C., Yager, T., Ankener, W. & Nickerson, D.A. Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products. Genomics 23, 138–144 (1994).
18. Taillon-Miller, P. et al. The homozygous complete hydatidiform mole: a unique resource for genome studies. Genomics 46, 307–310 (1997).
19. Collins, F.S. et al. New goals for the U.S. Human Genome Project: 1998–2003. Science 282, 682–689 (1998).
20. Nickerson, D.A. et al. DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nature Genet. 19, 233–240 (1998).
21. Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet. 22, 231–238 (1999).
22. Halushka, M.K. et al. Patterns of single-nucleotide polymorphisms in candidate genes regulating blood-pressure homeostasis. Nature Genet. 22, 239–247 (1999).
23. Gordon, D., Abajian, C. & Green, P. Consed: a graphical tool for sequence finishing. Genome Res. 8, 195–202 (1998).


LETTER

Sequence Assembly with CAFTOOLS

Simon Dear,1 Richard Durbin,1 LaDeana Hillier,2 Gabor Marth,2 Jean Thierry-Mieg,3 and Richard Mott1,4,5

1Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK; 2Genome Sequencing Center, Washington University, St. Louis, Missouri 63108 USA; 3CRBM du Centre National de la Recherche Scientifique (CNRS), Route de Mende, Montpellier, France; 4SmithKline Beecham Pharmaceuticals, New Frontiers Science Park (North), Harlow, Essex, CM19 5AW, UK

Large-scale genomic sequencing requires a software infrastructure to support and integrate applications that are not directly compatible. We describe a suite of software tools built around the Common Assembly Format (CAF), a comprehensive representation of a sequence assembly as a text file. These tools form the backbone of sequencing informatics at the Sanger Centre and the Genome Sequencing Center. The CAF format is intentionally flexible, and our Perl and C libraries, which parse and manipulate it, provide powerful tools for creating new applications as well as wrappers to incorporate other software. The tools are available free by anonymous FTP from ftp://ftp.sanger.ac.uk/pub/badger/.

Genomic sequencing is now a semi-industrial process that is being increasingly automated. The amount of finished sequence produced in large centers worldwide more than doubles each year. This effort has required a huge investment in bioinformatics, and new software is under continual development both within these centers and in the wider academic community. High-throughput sequence assembly is a complicated multistep pipeline, using many pieces of software, and we as users want to be in a position to use the best set of software tools, even if this causes problems reconciling the various data formats they use. In addition, because more than one tool may be suitable for the same task (e.g., for manually editing sequence assemblies) we also want to offer alternatives within the same framework.

Consequently we require a system that is flexible enough for us to evaluate and incorporate new software as it emerges and yet is easy to maintain and use. We have not found any existing product that meets these requirements completely, so our solution to this problem was to create the Common Assembly Format (CAF), a complete textual description of a sequence assembly, together with Perl and C libraries for parsing and manipulating CAF files, and applications written with these libraries to perform tasks for which no third-party software exists.

The purpose of this paper is threefold: (1) to describe CAF and its associated software package CAFTOOLS; (2) to illustrate their use for genomic sequencing at the Sanger Centre; and (3) to propose CAF as a standard format for developers and sequencing centers.

The CAF

Overview

CAF is a restriction of the data file format (.ace file format) conforming to a specific acedb (http://www.sanger.ac.uk/Software/Acedb/) schema for describing sequence assemblies. It is acedb-compliant, although using CAF does not require the use of acedb. A full acedb schema for CAF can be found at the official CAF web site, http://www.sanger.ac.uk/Software/CAF/. CAF is designed to be sufficiently comprehensive that any assembly engine/editor such as phrap (P. Green, pers. comm.), consed (Gordon et al. 1998), gap4 (Bonfield et al. 1995), acembly (J. Thierry-Mieg, unpubl.), FAKII (Larson et al. 1996; Myers 1996), and so forth, can derive all of the information it needs from the CAF file without reading any other data, except for trace information that is held in standard chromatogram format (SCF) files (Dear and Staden 1992). We have written tools to convert to and from each of these systems. Note that because CAF describes a superset of sequence attributes, passing an assembly through any of these editors may result in loss of information.

5Corresponding author. E-MAIL richard [email protected]; FAX 44 1279 622200.

260 GENOME RESEARCH 8:260–267 ©1998 by Cold Spring Harbor Laboratory Press ISSN 1054-9803/98 $5.00; www.genome.org SEQUENCE ASSEMBLY WITH CAFTOOLS

A sequence assembly is essentially a set of contigs, each contig being a multiple alignment of reads. In outline, the information we may need to store about each sequence is

1. The DNA sequence.
2. The base quality (a list of numbers indicating the confidence that the corresponding base in the sequence is correct).
3. The base positions (for reads only: a list of numbers indicating the location of the corresponding base within the SCF trace).
4. General properties (for reads only) such as the sequence template, name of corresponding trace file, etc.
5. Tags (regions of the sequence with some property, e.g., matching vector, repeat sequence, etc.).
6. Alignment of DNA onto the constituent reads, in the case of contigs, or of DNA onto original base calls in the case of reads.

CAF supports all of these features. In CAF the information associated with the sequence Name is divided into a maximum of four data types, which are represented as separate paragraphs of text, separated by blank lines, with header lines.

DNA : Name            1. the DNA sequence
BaseQuality : Name    2. the base quality information for the DNA
BasePosition : Name   3. the base position information for the DNA, i.e., the trace coordinates
Sequence : Name       4–6. all other information about the sequence

Throughout this paper, we will use Courier font for CAF reserved words and Italic Courier for CAF variables. The order of paragraphs within a CAF file is arbitrary. The BaseQuality and BasePosition types are optional but the DNA and Sequence types are mandatory. Contigs and reads have the same DNA and BaseQuality types but specialized Sequence subtypes. DNA sequence is represented as consecutive lines of text following DNA : Name. Base quality is represented as lines of space-separated integers following BaseQuality : Name. The number of quality values must equal the length of the DNA. Base positions are represented similarly following BasePosition : Name.

The format of the data following a Sequence : Name is variable. The minimum requirement is to specify the type (Is_Read or Is_Contig) and state (Padded or Unpadded, see the examples below). For clarity, we divide the Sequence data into simple and coordinate-sensitive attributes.

Simple Sequence Attributes

Sequence attributes that are not linked to coordinates have the general format

Attribute type Value(s)

For example, the CAF fragment in Figure 1 describes the sequence attributes of the read hh26e2.s1. Comments start with // and continue to the end of the line. New simple attributes can be created without constraint because CAFTOOLS will treat unrecognized attributes as text to be carried along unchanged with the associated Sequence object. The order of the attributes is arbitrary. Figure 1 details the most important and commonly used attributes, and a full list may be found at our web site.

Alignments and Coordinate-Sensitive Data

Those data that are coordinate dependent, for example, alignments and tags, are more structured because they must be parsed so that they can mirror changes to their associated DNAs. For example, editing, padding, or depadding a sequence alters the corresponding tag coordinates. The CAFTOOLS handle these changes transparently.

CAF stores two levels of alignment: that of the contig DNA to the read DNA, and that of the read DNA to the original base calls in the SCF trace. The alignment of a read onto a contig is represented by a series of statements of the form

Assembled_from Read c1 c2 r1 r2

which means coordinates c1 to c2 in the contig align with r1 to r2 in the read. Coordinates start at 1. If c1 > c2 then the statement means align the reverse complement of r1 to r2 in the read. The lengths |c1–c2| and |r1–r2| must be the same. The alignment of a read to its original base calls in the SCF trace is similar:

Align_to_SCF r1 r2 t1 t2

that is, coordinates r1 to r2 in the read correspond with t1 to t2 in the trace. Align_to_SCF is only applicable when the base positions of the DNA can be derived from the base position information held in the corresponding SCF trace files (recall that an SCF trace stores the trace data, the original base

GENOME RESEARCH 261 DEAR ET AL.

calls, and the mapping of each called base onto the trace). For certain purposes, for example, consed, we can override the SCF base calls and their positions by storing the coordinates explicitly in a BasePosition paragraph, in which case there is no need to use Align_to_SCF.

Figure 1 An example showing how the attributes of the read hh26e2.s1 are described as a CAF Sequence object. The text following the // on each line are comments. This example illustrates most of the commonly used CAF read attributes.

CAF supports padded and unpadded alignments. Padded means that gaps (-) have been inserted where required in the contigs and their aligned reads so that there is a one-to-one correspondence between the aligned sequences. In a padded assembly, there is exactly one Assembled_from line for each aligned read in a contig, and the DNA objects contain - padding characters.

In an unpadded alignment, all of the pads are removed from the DNA objects and there are multiple Assembled_from lines for each read in a contig. Within each Assembled_from line there is still a one-to-one correspondence between the read and contig.

Some applications, for example, auto-editor (described in Table 1B) and gap4, require padded alignments. Others (e.g., applications that screen the DNA against known sequences like repeats) require unpadded. The programs caf_pad and caf_depad allow one to move transparently between padded and unpadded states, without losing information. In a padded alignment with BaseQuality information it is necessary to attach a quality value to each pad to keep the lengths equal. By convention this is interpolated from neighboring BaseQuality values. BasePosition data are treated similarly.

We illustrate these ideas with a hypothetical example. Suppose the alignment of Read X to Contig Y is

Read X GCTGCCTTCGC–TTAAAA
Contig Y CAGCTGC-TTAGCGCTTAAAA

In padded coordinate space, positions [1,19] in Read X align with [3,21] in Contig Y. However, the presence of pads in Read X makes its alignment to its trace SCF X complicated. Figure 2A shows a fragment of a CAF file describing the padded alignment. Note that the alignment of the read to the contig is attached to the contig's sequence data and not the read's. The equivalent unpadded description is shown in Figure 2B, where the coordinates now refer to the unpadded sequences. Note also that the alignment of the read DNA sequence onto the underlying base calls in the SCF file will change if insertions or deletions have been made to the read.

The other major type of coordinate-sensitive data to consider is the tag. A tag is a region of a sequence with some property. The format must be one of

Tag Type x1 x2 Comment
Seq_vec Method x1 x2 Comment
Clone_vec Method x1 x2 Comment
Clipping Method x1 x2 Comment

The Tag attribute means that positions x1 to x2 of the sequence have property Type, optionally with the free text Comment. This is used extensively to mark regions matching repeat sequences, or those that have been automatically edited with auto-editor. The special tags Seq_vec, Clone_vec, and Clipping are reserved for regions that match sequencing vector, cloning vector, or are high quality. These are held inside CAFTOOLS as separate data structures. Their Method attribute is used to indicate which algorithm was used to generate the relevant coordinates (e.g., so that different quality clip points can be represented).

Finally, CAF supports the phrap concept of the "golden path" of a contig. This is a sequence of abutting intervals covering the contig's DNA, such that each interval of the contig is associated with the read with locally the highest base quality. The format is a series of lines

Golden_path Read x1 x2


Table 1. Summary of the CAFTOOLS

A. General CAF utilities, including tools for communication with other software

general CAF utilities
  caf_pad: Converts an unpadded assembly to a padded one. All coordinate-dependent data are updated (written in C).
  caf_depad: Inverse of caf_pad (C).
  cafcat: Concatenates and consolidates multiple CAF files into a single file. Also reports semantic errors (C).
  cafmerge: Merges two CAF files, replacing duplicated objects rather than concatenating them (cf. cafcat) (C).
  caf2phrap: Extracts a subset of sequences from a CAF file into three files: (1) FASTA DNA; (2) base quality; (3) CAF stub of remaining attributes. This is used to prepare data for phrap but also provides a general way to extract FASTA sequence data from CAF (C).

assembler support
  caf_phrap: Takes a CAF file and an optional list of reads, assembles them with phrap and creates a new CAF file describing the assembly. No other postprocessing is done.
  caf_fak: Similar Perl wrapper for FAKII.

acembly support
  caf2bly: Converts a CAF file into an acembly database.
  bly2caf: Exports a CAF file from an acembly database.
  caf_bly: Takes a CAF file and a script command file, reads a CAF file into an acembly database, performs the script, re-exports a CAF file on standard output and cleans up.

gap support
  caf2gap: Converts a CAF file into a gap4 database. All CAF tags are converted to their gap4 equivalents (C).
  gap2caf: Creates a CAF representation of data in a gap4 database (C).
  exp2caf, update_caf: Converts Staden Experiment files into CAF.

consed support
  consed2caf: Converts a phrap assembly or a consed database into CAF.
  consed2gap: Converts a phrap assembly or consed database into a gap4 database.
  caf2phd: Converts CAF reads to phred PHD files required for consed.
  phd2caf: Converts phred PHD files to CAF.

B. Specialized processing tools. Programs are written in Perl unless indicated otherwise.

tag generators
  cafvector, caftagfeature, cafalu, cafcgi: These wrappers extract contig sequence from CAF, screen it using blast or cross_match against a library of sequences, and create tags for any matches found.

auto-editor
  np_edit: Proposes edits for an assembly by examining the SCF traces in the context of the alignment. A new CAF file is generated listing the suggested edits as special edit tags that are parsed and acted on by nd_edit (C).
  nd_edit: Makes the edits proposed by np_edit. Editing will change the DNA sequences of the reads. nd_edit modifies the coordinates of all tags and base qualities appropriately (C).

clipping
  nd_clip: Clips back all assembled reads according to the Clipping tags. We use this to postprocess phrap assemblies to restrict aligned reads to their higher-quality regions (C).
  ne_clip: Used after nd_clip to extend back clipped reads where necessary to cover holes created (C).
  cafsplit: Alternative to ne_clip. Splits contigs at holes.

finishing
  finish: Analyzes the assembly to choose directed reads for the purpose of finishing.
  cafcop: Checks assembly for finishing errors and regions of insufficient sequence coverage.

clone overlap data management
  Readraid: Incorporates SCF traces and sequence of reads from overlapping regions of neighboring clones in the physical map.
  Conraid: Incorporates consensus sequence from overlapping regions of neighboring clones.
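The coordinate bookkeeping behind moving from a padded to an unpadded description (the job Table 1A assigns to caf_depad) can be illustrated with a small Python sketch. It is not the CAFTOOLS implementation: the function name, the forward-orientation-only restriction, and the example sequences are assumptions made for illustration. It only shows how one padded alignment breaks into several ungapped segments, each of which would become one Assembled_from line in unpadded coordinates.

```python
PAD = "-"  # pad character as it appears in padded DNA strings


def depad_alignment(contig, read, contig_start):
    """Split one padded read-to-contig alignment into ungapped segments.

    contig       -- full padded contig sequence
    read         -- padded read sequence (forward orientation only; assumption)
    contig_start -- 1-based padded contig position aligned to read position 1

    Returns a list of (c1, c2, r1, r2) tuples in 1-based UNPADDED coordinates,
    one tuple per run of columns in which both sequences carry a real base.
    """
    # Unpadded contig coordinate of the last base before the aligned region.
    c = sum(1 for ch in contig[:contig_start - 1] if ch != PAD)
    r = 0
    segments = []
    current = None  # [c1, c2, r1, r2] of the segment being extended
    for i, read_ch in enumerate(read):
        contig_ch = contig[contig_start - 1 + i]
        contig_is_base = contig_ch != PAD
        read_is_base = read_ch != PAD
        if contig_is_base:
            c += 1
        if read_is_base:
            r += 1
        if contig_is_base and read_is_base:
            if current is None:
                current = [c, c, r, r]
            else:
                current[1] = c
                current[3] = r
        elif current is not None:
            # A pad in either sequence ends the current ungapped segment.
            segments.append(tuple(current))
            current = None
    if current is not None:
        segments.append(tuple(current))
    return segments


if __name__ == "__main__":
    # Hypothetical sequences, not the Read X / Contig Y example of the paper.
    contig_y = "CAGCTGC-TTAGC"
    read_x = "GCTGCATT-GC"
    for c1, c2, r1, r2 in depad_alignment(contig_y, read_x, contig_start=3):
        print(f"Assembled_from ReadX {c1} {c2} {r1} {r2}")
```

Each returned segment satisfies the rule that the contig and read intervals have equal length, which is why a single padded line turns into several unpadded ones wherever a pad occurs in either sequence.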

GENOME RESEARCH 263 DEAR ET AL.
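Table 1B lists nd_clip, which acts on Clipping tags, and the Preprocessing section of the paper states the rule used to set quality clip points: subtract 15 from each phred base quality and choose left and right clip points L and R so that the sum of adjusted values from L to R is a maximum. That rule is a maximum-sum contiguous interval problem. The Python sketch below is one way to compute it; the function name, the tie-breaking behaviour, and the choice to return None when no base exceeds the offset are assumptions, not a description of the asp code.

```python
def quality_clip(qualities, offset=15):
    """Return 1-based inclusive (L, R) maximizing sum(q - offset) over L..R.

    qualities -- list of phred base quality values for one read
    offset    -- value subtracted from each quality (15 in the paper)

    Returns None when no interval has a positive adjusted sum, i.e. the
    read has no high-quality region under this rule (assumed convention).
    """
    best_sum = 0
    best = None
    run_sum = 0
    run_start = 0
    for i, q in enumerate(qualities):
        if run_sum <= 0:
            # A non-positive running total can never help; restart here.
            run_sum = 0
            run_start = i
        run_sum += q - offset
        if run_sum > best_sum:
            best_sum = run_sum
            best = (run_start + 1, i + 1)
    return best


if __name__ == "__main__":
    example = [5, 8, 20, 30, 10, 40, 6, 4]  # hypothetical quality values
    print(quality_clip(example))
```

The single pass keeps a running adjusted sum and restarts it whenever it drops to zero or below, so the cost is linear in read length. Note that a low-quality base inside the interval (the value 10 in the example) is tolerated when the surrounding bases more than compensate for it.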

meaning Read provides the golden path for the interval [x1,x2]. Complete examples of CAF files can be found on our CAF web site.

Figure 2 An example alignment in CAF, when the sequences are padded (A) and when they are unpadded (B).

The CAFTOOLS

We have written two libraries for reading, manipulating, and writing CAF files:

1. Perl-5 libraries, which are easy to use and are convenient for creating wrappers for software written by third parties. The general procedure is to extract the relevant data from the input CAF file, convert it into the format required by the program, run the application, parse the output back into CAF, and write an updated CAF file. For example, the CAF tagging applications are based on this model.
2. ANSI-C libraries, which are less flexible but are much faster and can handle very large data sets (up to 50,000 reads) without using too much memory. They also perform much more stringent data checking and understand which information is position-sensitive, so that coordinates are automatically changed in concert with DNA modifications. They also provide error checking that reports (a) references to nonexistent sequences, (b) sequence coordinates out of range (compared with length of DNA), (c) inconsistent alignments, and (d) mixed or unspecified pad states. They are suited for writing computationally intensive applications, such as the auto-editor, which requires access to trace data. Applications written with the C libraries can pad or depad the input data as required by a single function call. They can also handle multiple CAF databases simultaneously in different name spaces.

CAFTOOLS have been tested extensively under Digital Alpha OSF and Sun Solaris UNIX but should work on most UNIX platforms. They do not provide graphics. Most of the applications that we have written with the libraries act as UNIX-style filters, reading a CAF file on standard input, modifying it, and writing a new CAF on standard output. Command-line switches can modify the function of many applications. The only exceptions are those that split the CAF into multiple files (e.g., to prepare data for processing by phrap) or that merge multiple CAF files into one. This means it is possible to pipe together many processing modules in one command. For example, to auto-edit a gap4 database GAPDB.0 one could type

gap2caf -project GAPDB -version 0 -preserve | np_edit -scf | nd_edit | caf2gap -project GAPDB -version 1

This will create a new edited database GAPDB.1. (The -preserve switch ensures that the internal gap4 numbering of the sequences is retained. We use a special attribute, Staden_id, for this purpose.) In practice, we chop up the assembly pipeline at significant breakpoints and write intermediate temporary CAF files to disk so that we can switch modules more easily and perform a postmortem if an error occurs. Full details of how to run each application, including all command-line options, may be found on our web site. Table 1 summarizes the more important utilities and applications written using CAFTOOLS.

Most of the utilities run in under 10 elapsed sec on a cosmid-sized CAF file. The applications are slower, for example, the auto-editor takes 2–3 min to edit a cosmid (40 kb, ∼1000 reads). We can process a cosmid through our complete assembly pipeline in ∼15 min on a 433 MHz DEC Alpha processor.

The Sequence Assembly Pipeline

CAFTOOLS are best illustrated by their use at the Sanger Centre and the Genome Sequencing Center. We use the following pipeline at the Sanger Centre for assembling reads from a bacterial clone (e.g., a cosmid or BAC) into contigs. Most genome sequencing centers that use shotgun sequencing follow a

broadly similar work flow. Figure 3 summarizes the pipeline and shows how CAF is integrated into it.

Preprocessing

As each sample comes off a sequencing machine, we preprocess it using asp (Automated Sequence Preprocessor) (M. Wendl, S. Dear, D. Hodgson, and L. Hillier, in prep.). asp is a chain of modules written in Perl, some of which call C programs, for example, phred (Ewing and Green 1998) and svec_clip (Mott 1998). At the Sanger Centre, asp performs the following operations:

1. Query a central database to determine the world-unique name of the sample, the name of the parent clone (the "project"), the sequencing chemistry, the expected insert size, the sequencing vector, and the priming and cloning sites.
2. Base-call the sample using phred creating an SCF trace file and a file of phred base quality indices.
3. Determine quality clip points (i.e., the good-quality part of the read) from the phred base qualities. To do this we subtract 15 from each base quality index and then find left L and right R clip points such that the sum of adjusted quality values from L to R is a maximum.
4. Identify sequencing vector clip points by alignment of the original SCF trace to the expected vector sequence (svec_clip).
5. Screen the sequence against Escherichia coli and other possible contaminants.

Each sample processed by asp generates an Experiment file (Bonfield and Staden 1996) containing all the information generated about the sample in the run. A sample can be "failed" by asp if it has no high-quality region, is completely sequencing vector, or matches a contaminant. Samples that are passed by asp are moved, together with their associated trace files, into the relevant project directory for assembly. asp reports the fate (Pass or Failure, and the reason for failure) of each sample to a central database. The sequence assembly process will only start once the number of samples exceeds a threshold, typically ∼600 reads for a cosmid and ∼2000 for a PAC.

Figure 3 How CAF and CAFTOOLS are used in the sequence assembly pipeline. Broken arrows show the order in which data are processed; the solid arrows indicate actual data flow and how CAF is used as an interchange mechanism.

Assembly

The automated assembly process is controlled by a Perl script, phrap2gap. Each stage in the assembly pipeline is a module that accepts a CAF description of the current assembly, acts on it in some way, and writes out a new CAF file. Each CAF is a complete and consistent description of the current state of the assembly. Therefore, it is relatively easy to add or replace modules provided they conform to this pattern. For example, we have added support for the editor consed, and for the sequence assembly engines FAKII and acembly. Figure 3 summarizes the pipeline. In greater detail, the steps are

1. Create a CAF file containing all of the raw data from individual Staden Experiment files (update_caf, exp2caf).
2. Extract from the raw CAF input files required by the assembly engine phrap (essentially a file of sequences and a file of base qualities). Assemble into contigs with phrap and merge back with the other information held in the raw CAF. The result is a new CAF file completely describing the reads and how they are assembled (caf_phrap, caf2phrap, cafmerge).
3. Clip back the assembled reads to their high-quality regions determined previously by asp (nd_clip). This process can occasionally create holes in contigs where no high-quality sequence occurs. phrap2gap offers the choice of splitting contigs (cafsplit) or re-extending reads back into their low quality parts to close the gaps (ne_clip).
4. Auto-edit the assembled reads, referring back to the original trace data (np_edit, nd_edit). Depending on the depth of coverage, up to 90% of the edits can be made at an error rate of less than one mistake per 50 kb.
5. Screen (using cafvector, cafalu, cafcgi, and caftagfeature) the assembled contigs against various sequence data sets (e.g., cloning vector, known repeats, transposons) and tag any matching regions.
6. Analyze the assembly to choose directed reads for gap closure and to resolve ambiguities with finish (G. Marth and S. Dear, in prep.).
7. Convert the edited, tagged assembly into a gap4 database for manual finishing (caf2gap).

Further finishing reads are incorporated to close gaps in the contigs and strengthen the consensus in poor quality regions. To do this the user has the choice of adding the new sequences individually into the gap4 database and auto-editing again, or reassembling all of the data from scratch.

This pipeline uses the third-party applications phrap, cross_match (P. Green, pers. comm.), consed, blast (Altschul et al. 1990), and gap4 and two CAF applications auto-editor and finish, plus repeated use of several CAF utilities.

The Genome Sequencing Center uses a similar pipeline, except the auto-editing step is omitted and consed is used in place of gap4. The Center also performs an internal quality-checking step on the final assembly (cafcop, pcop) before the sequence is submitted for analysis.

DISCUSSION

Historically, CAF dates from ∼1993 to 1994, growing out of our need to combine the assembly editor gap4 with the assembly engine phrap and other applications. At that time no existing interchange format was wholly appropriate for the task, for example, the Staden Experiment file format then did not represent contigs as distinct entities and relied on padded alignments. The early versions of CAFTOOLS were entirely Perl-based. We developed the C libraries initially to support auto-editing but soon recognized that the efficiency gains made developing a full C library worthwhile, particularly as we were starting to shotgun sequence 120-kb PAC and BAC clones.

CAF is not the only textual representation of sequence assemblies to be proposed. Staden Experiment file format has been suggested previously as a basis on which to form a standard interchange format (Bonfield and Staden 1996). The Boulder format (Stein et al. 1994) is used at the Whitehead Institute for purposes similar to CAF.

CAF may be viewed as a synthesis of many of the best features of other systems. Many of the CAF sequence attributes have exact correlates in Staden Experiment files, although the format is different and the description of multiple alignments by Assembled_from lines is similar to that used by acembly. We believe that CAF has the following advantages: (1) The plain text acedb file structure used by CAF means that files are easy to read and debug. (2) Holding all of the information about an assembly in a single file, rather than divided into individual sequence files makes it simpler to maintain consistent assemblies and to manipulate assemblies as single entities. (3) CAF is very flexible. We have been able to incorporate new software into our assembly pipeline relatively easily. (4) CAFTOOLS have been tested thoroughly and are in production use. Most of the total (to October 1997) 125 Mb of finished genomic DNA produced at our two sequencing centers was processed by some version of CAFTOOLS, and >50 Mbp was processed by the current version.

The Sanger Centre recently completed the 4.4-Mbp sequence of Mycobacterium tuberculosis (http://www.sanger.ac.uk/Projects/M_tuberculosis/) using CAFTOOLS applied to sequence data from a mixture of cosmids and a whole-genome library. The final stages of this project would have been much more difficult to complete without the infrastructure that CAF provided. This illustrates the power and importance of CAF.

The Sanger Centre is currently sequencing other pathogens, using whole-genome or chromosome libraries. We have found that, in general, assemblies covering more than 1 Mbp of genomic sequence are rather large to hold as single CAF files, although feasible on machines with large memories. More importantly, the final assembly must be split into pieces small enough to be completed by an individual while maintaining the integrity of the entire

assembly. The natural solution is to store the CAF information in a database. It is possible to use acedb for this purpose. Alternatively, one can use a relational database for CAF, and at the Sanger Centre we are currently implementing CAF in Oracle. The database communicates with other software by reading and writing CAF files, so that we can still use our existing CAFTOOLS while we benefit from being able to manipulate subsets of the data in a safe and efficient way. These developments have also led us to explore the extension of the CAF model to support partially assembled groups of contigs, to exploit relative contig order information from forward/reverse read pairs and restriction-digest information, although this is not yet part of the public distribution.

We anticipate that CAF will undergo further modification but that its basic principles will remain intact. We welcome suggestions for changes and additions to CAF. Our hope is that CAF will become more widely used for describing sequence assemblies and that developers will make use of it to the benefit of the wider sequencing community. As all workers in the field of bioinformatics recognize, an inordinate amount of time is spent interconverting file formats, to the detriment of actual software development. After >2 years of intensive development, testing, and production use, we believe CAF and CAFTOOLS are well-positioned to offer an effective solution to these problems in the domain of sequence assembly.

ACKNOWLEDGMENTS

This work was supported by grants from the Wellcome Trust, CNRS, and the National Human Genome Research division of the National Institutes of Health. Also, we acknowledge the contributions of Rob Davies at the Sanger Centre, and Dave Hodgson, formerly at the Sanger Centre, to this work. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.

REFERENCES

Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.

Bonfield, J.K., K.F. Smith, and R. Staden. 1995. A new DNA sequence assembly program. Nucleic Acids Res. 23: 4992–4999.

Bonfield, J.K. and R. Staden. 1996. Experiment files and their application during large-scale sequencing projects. DNA Sequence 6: 109–117.

Dear, S. and R. Staden. 1992. A standard file format for data from DNA sequencing instruments. DNA Sequence 3: 107–110.

Ewing, B. and P. Green. 1998. Base-calling of automated sequencer traces using PHRED. II. Error probabilities. Genome Res. (this issue).

Gordon, D., C. Abajian, and P. Green. 1998. consed: A graphical tool for sequence finishing. Genome Res. (this issue).

Larson, S., M. Jain, E. Anson, and E.W. Myers. 1996. An interface for a fragment assembly kernel. Tech. Rep. TR96-04. Department of Computer Science, University of Arizona, Tucson, AZ.

Mott, R. 1998. Trace alignment and some of its applications. Bioinformatics (in press).

Myers, E. 1996. A suite of UNIX filters for fragment assembly. Tech. Rep. TR96-07. Department of Computer Science, University of Arizona, Tucson, AZ.

Stein, L., A. Marquis, E. Dredge, M.P. Reeve, M. Daly, S. Rozen, and N. Goodman. 1994. Splicing UNIX into a genome mapping laboratory. In USENIX Summer Technical Conference, pp. 221–229. USENIX Association, Berkeley, CA.

Received December 1, 1997; accepted in revised form January 29, 1998.

Copies of abstracts

Copy of cover image, September 2008 issue of the journal Genome Research

Genome Research -- About the Cover (September 2008, 18, (9))
http://genome.cshlp.org/content/vol18/issue9/cover.shtml

About the Cover

Cover: EagleView visualization of next-generation genome assembly. In many applications of next-generation sequencing technologies, data visualization is essential for (1) identification of different types of errors from sequencing, read mapping, and assembly; (2) validation of candidate polymorphisms; (3) software development and testing; and (4) data interpretation and hypothesis generation. The illustration shows a genome assembly of 33-bp Illumina reads. Below the genome coordinates is the genome feature annotation bar with different colors representing promoter regions (pink), exons (cyan), and introns (yellow). The small pink rectangle above the features is the navigation cursor, and the small red rectangle below indicates a single nucleotide polymorphism site. (Cover illustration by Weichun Huang. [For details, see Huang and Marth, pp. 1538-1543.])


Copyright © 2008 by Cold Spring Harbor Laboratory Press.

Invited / keynote talk posters

Register by August 15 and Save up to $200!
Final Agenda

Cambridge Healthtech Institute’s Inaugural
Next-Generation Sequencing Data Analysis
Deciphering the Sequencing Data Deluge
September 21-23, 2008

and

Cambridge Healthtech Institute’s Second Annual
Exploring Next-Generation Sequencing
September 23-25, 2008

Rhode Island Convention Center • Providence, RI

KEYNOTE PRESENTATIONS

Next-Generation Sequencing: The Informatics Angle

Gabor Marth, D.Sc., Assistant Professor, Department of Biology, Boston College

Massively Parallel High Throughput DNA Sequencing: Automation for Microbial Community, Gene Expression and de Novo Deciphering of New Genomes Bruce A. Roe, Ph.D., George Lynn Cross Research Professor of Chemistry and Biochemistry, Advanced Center for Genome Technology, Stephenson Research and Technology Center, University of Oklahoma


Cambridge Healthtech Institute, 250 First Avenue, Suite 300, Needham, Massachusetts 02494
Telephone: 781-972-5400 or toll-free in the U.S. 888-999-6288 • Fax: 781-972-5425 • www.healthtech.com

Next-Generation Sequencing Data Analysis
Deciphering the Sequencing Data Deluge
September 21-23, 2008
Rhode Island Convention Center • Providence, RI

Marwan Alsarraj, Product Manager, Gene Expression Division, Genomic Analysis
Bio Rad Laboratories

As new sequencing platforms are capable of generating gigabytes of data in a single sequence run, leading to terabytes of data in an experiment, data storage, transfer, and analysis will unquestionably be the rate limiting steps in turning the new sequence data into knowledge. CHI’s Inaugural Next-Generation Sequencing Data Analysis convenes engineers who are developing the sequencing platforms, biological researchers who are designing and running the experiments, biostatisticians who are analyzing and interpreting the data, and software developers who are managing and storing the data. Each specialty provides perspectives and must be integrated into a cohesive, comprehensive team to decipher the sequencing data deluge.

SUNDAY, SEPTEMBER 21

1:30 pm Short Course Registration
2:00-5:00 Recommended to Short Course II *
4:00 – 5:00 Early Conference Registration

MONDAY, SEPTEMBER 22

7:30 am Registration and Morning Coffee

Data Generation: Sequencing Centers
“Navigating the Expressway”

8:30 Chairperson’s Opening Remarks

8:45 Keynote Presentation
Next-Generation Sequencing: The Informatics Angle
Gabor Marth, D.Sc., Assistant Professor, Department of Biology, Boston College

9:30 J. Craig Venter Institute
Meeting the Data Management and Analysis Challenges of Next-Gen Sequencing
Saul A. Kravitz, Ph.D., Director of Bioinformatics Software
Next-Gen sequencing platforms like 454 and AB SOLiD present new challenges for the efficient operation of a sequencing center and the cost-effective management of the large datasets produced. This talk will describe the integration of next-generation sequencers within our sequencing center, integration with our LIMS components, quality control, cost-effective management of datasets, and grid-based analysis of next-generation datasets.

10:00 Morning Coffee

10:30 Broad Institute (invited)

10:50 Northwestern University
Scalable Bioinformatics for Next-Generation Sequencing
Jared Flatow, B.S.E.E., Analyst/Programmer, Bioinformatics Core
In order to keep up with the massive amounts of data produced by next-generation sequencing technologies, bioinformaticists need a way of scaling existing tools without completely redesigning them. An understanding of the industrial strength map-reduce paradigm will be invaluable to those looking to cope with the next-generation datasets. Combined with the power of elastic computing clouds, many of the potential barriers to dealing with such large-scale data can be completely eliminated. This talk will explain what map-reduce is, demonstrate how it can be used to formulate some classic bioinformatics problems and discuss how it compares to other ways of parallelizing computations, such as Message Passing Interface (MPI).

11:20 Washington University
Accomplishments and Hurdles in the Context of Next-Gen at the Genome Center
Jarret Glasscock, Ph.D., Research Instructor, Department of Genetics, Technology Development, Genome Sequencing Center
Software and analysis approaches are under continuous development to keep pace with the rapid evolution of Next-Gen sequencing technologies. These new tools are developed to both characterize and apply data obtained from these new sequencing platforms. Each respective Next-Gen platform has its own unique characteristics, which once understood, can be leveraged to address current biological questions.

11:50 Close of Session

12:00pm Luncheon Technology Workshop (Sponsorship Available)

Data Analysis and Interpretation: Software Solutions
“Take a Test Drive”

2:00 Chairperson’s Remarks

2:05 Creating Bioinformatics Processes and an IT Infrastructure That Transforms Next-Generation Sequencing Data into Functional Knowledge
Ron Ranauro, President & Chief Executive Officer, GenomeQuest, Inc.
Low-cost, high-throughput DNA sequencing promises to transform scientific research and usher in the era of personalized medicine. However, the current output of raw sequencing data is enormous, and requires large-scale bioinformatics processing before research can proceed. The infrastructure for this is beyond the reach of all but the most advanced staffs and largest bio-IT budgets. A Web-enabled sequencing informatics service that combines workflow consultation with large-scale data and computational resources delivers actionable scientific information quickly and at a fraction of the cost. We will discuss: the challenges to sequence informatics posed by Next-Gen data volumes; consider a sampling of build-or-buy options for your Next-Gen bioinformatics infrastructure and walk through a case example workflow customization and information delivery methods.

2:35 Tools to Manage the Next-Generation Data Flood
W. Richard McCombie, Ph.D., Professor, Cold Spring Harbor Laboratory

3:20 Refreshment Break

2008 CSHL Meeting on Personal Genomes http://meetings.cshl.edu/meetings/person08.shtml

PERSONAL GENOMES: TECHNOLOGY, INTERPRETATION & CHALLENGES
October 9 - 12, 2008
Abstract Deadline: July 30, 2008

Organizers:
Richard Gibbs, Baylor College of Medicine, Houston
Mary-Claire King, University of Washington, Seattle
Maynard Olson, University of Washington, Seattle
Lincoln Stein, Cold Spring Harbor Laboratory
Jan Witkowski, Banbury Center, Cold Spring Harbor Laboratory

We are pleased to host a special meeting on Personal Genomes, which will begin at 7:30 pm on Thursday, October 9, 2008 and run through lunch on Sunday, October 12.

The meeting is being held both to celebrate and to critically examine a significant milestone in human genetics: the first "personal genomes." These ultra-high-throughput sequencing strategies are used in a very limited number of laboratories, and few scientists, and even fewer clinical geneticists, are familiar with the implications of the "$1000" genome. We believe that a meeting which reviews these topics will be very attractive to a range of scientists including biologists, geneticists, and biomedical researchers.

Tentative Topics: Opening session: ‘Setting the Tone’

Session I: Technical Status of Sequencing Whole Genomes Successes; problems; new developments; new things on the way

Session II: Making Sense of the Content of Whole Genomes Analysis of variation; large scale structure of the genome; human genome evolution

Evening: Panel Discussion - Ethics

Session III: Whole Genome Genetics What are we learning about human genetics from genome scale studies?

Session IV: Applications of Whole Genome Studies How are genome scale studies providing new insights on clinical genetics?


Session V: Preparing for the ‘Whole Genome’ World What happens when everyone has their genome sequenced?

Speakers:
Marc Feldman, Stanford University
Paul Flicek, EMBL-European Bioinformatics Institute, UK
Yang Huanming, Beijing Genomics Institute, China
Jim Lupski, Baylor College of Medicine
Elaine Mardis, Washington University School of Medicine
Gabor Marth, Boston College
Len Pennacchio, Lawrence Berkeley National Laboratory
J. Craig Venter, Center for the Advancement of Genomics
Stephen Warren, Emory University School of Medicine
James D. Watson, Cold Spring Harbor Laboratory

Abstracts should contain only new and unpublished material and are welcome for consideration as poster presentations (a small number may be selected as talks) and must be submitted electronically by the abstract deadline. Selection of material for poster and oral presentation will be made by the organizers. Status (talk/poster) of abstracts will be posted on our web site as soon as decisions have been made by the organizers.

We are eager to have as many young people as possible attend since they are likely to benefit most from this meeting. We have applied for funds from government and industry to partially support graduate students and postdocs. Apply in writing to [email protected] stating need for financial support - preference is given to those submitting abstracts.

We look forward to seeing you at Cold Spring Harbor in October.

This conference is supported in part by funds provided by: XXX

Pricing
Academic Package $925
Graduate/PhD Student Package $770
Corporate Package $1175
Academic/Student No-Housing Package $630
Corporate No-Housing Package $790

Regular packages are all inclusive and cover registration, food, housing, parking, wine-and-cheese party, lobster banquet, etc. No Housing packages include all costs except housing. Full payment is due 4 weeks prior to the meeting.

Cold Spring Harbor Laboratory Meetings & Courses Program PO Box 100, 1 Bungtown Road Cold Spring Harbor, NY 11724-2213 Phone (516) 367-8346 Fax: (516) 367-8845

[email protected]

Book reference to “PolyBayes” approach developed by Prof. Marth

Article reference to “PolyBayes” approach developed by Prof. Marth

REVIEWS

THE BAYESIAN REVOLUTION IN GENETICS

Mark A. Beaumont* and Bruce Rannala‡

Bayesian statistics allow scientists to easily incorporate prior knowledge into their data analysis. Nonetheless, the sheer amount of computational power that is required for Bayesian statistical analyses has previously limited their use in genetics. These computational constraints have now largely been overcome and the underlying advantages of Bayesian approaches are putting them at the forefront of genetic data analysis in an increasing number of areas.

*School of Animal and Microbial Sciences, University of Reading, Whiteknights, P.O. Box 228, Reading RG6 6AJ, UK.
‡Department of Medical Genetics, 839 Medical Sciences Building, University of Alberta, Edmonton, Alberta T6G2H7, Canada.
Correspondence to M.A.B. e-mail: m.a.beaumont@reading.ac.uk
doi:10.1038/nrg1318

STATISTICAL INFERENCE
The process whereby data are observed and then statements are made about unknown features of the system that gave rise to the data.

PROBABILISTIC MODEL
A model in which the data are modelled as random variables, the probability distribution of which depends on parameter values. Bayesian models are sometimes called fully probabilistic because the parameter values are also treated as random variables.

LIKELIHOOD
The probability of the data for a particular set of parameter values.

In many branches of genetics, as in other areas of biology, various complex processes influence the data. Genetics has evolved rich mathematical theories to deal with this complexity. Using these theoretical tools, it is often possible to construct realistic models that explain the data in terms of the processes. Formulating such a model is often the first step towards studying the underlying processes and provides the basis for STATISTICAL INFERENCE. Most genetic properties of individuals, populations or species (such as individual genotypes, population gene frequencies and DNA sequence polymorphisms) are a product of forces that are inherently stochastic and therefore cannot be studied without the use of PROBABILISTIC MODELS. Of course, not every aspect of molecular biology must be studied using probabilistic models. At the biochemical level, for example, particular pathways of gene expression can be studied under more or less controlled conditions that seem (at least to many practitioners) to obviate the need for any statistical analysis. However, even such experimental studies are being increasingly supplemented by the rapidly burgeoning field of functional genomics, a field that has many of the same properties (and problems) as other observational sciences and that requires similar probabilistic analysis.

Genetic data are often the result of a complex process with many mechanisms that can produce the observed data, so what is the best way to choose among the possible causes? As an example, consider the use of genetic data to identify cryptic population structure (that is, individuals with different population ancestries arising from, for example, geographic separation). The calculation of the chance that an individual carrying a particular genotype was born in a population other than the one from which it is sampled (that is, is an immigrant) depends, among other things, on the gene frequencies in that population. Inferences about the population gene frequencies depend, in turn, on inferences about the populations of origin for all other sampled individuals (given their genotypes), which depend, in turn, on the inferred gene frequencies for all other populations, and so on. Bayesian inference is a convenient way to deal with these sorts of problems (that is, models with many interdependent parameters).

In this review, we compare the Bayesian approach to genetic analysis with approaches that use other statistical frameworks. We endeavour to explain why the use of Bayesian methods has increased in many branches of science during the past decade and highlight the aspects of many genetic problems that make Bayesian reasoning particularly attractive [1]. A potentially attractive feature of Bayesian analysis is the ability to incorporate background information into the specification of the model. However, we argue that the recent popularity of Bayesian methods is largely pragmatic, and can be explained by the relative ease with which complex LIKELIHOOD problems can be tackled by the use of computationally intensive MARKOV CHAIN Monte Carlo (MCMC) techniques. To illustrate this, we describe recent applications of Bayesian inference to three areas of modern genetic analysis: population genetics, genomics and human genetics (primarily gene mapping). Finally, we highlight some of the current problems and limitations of Bayesian inference in genetics and outline potential future applications.

NATURE REVIEWS | GENETICS VOLUME 5 | APRIL 2004 | 251 REVIEWS

MARKOV CHAIN
A model that is suitable for modelling a sequence of random variables, such as nucleotide base pairs in DNA, in which the probability that a variable assumes any specific value depends only on the value of a specified number of most recent variables that precede it. In an nth-order Markov chain, the probability distribution of a variable depends on the n preceding observations.

MARGINAL LIKELIHOOD
Also known as the ‘prior predictive distribution’. The probability distribution of the data irrespective of the parameter values.

RANDOM VARIABLE
A quantity that might take any of a range of values (discrete or continuous) that cannot be predicted with certainty but only described probabilistically.

JOINT PROBABILITY DISTRIBUTION
The probability distribution of all combinations of two or more random variables.

PRIOR [DISTRIBUTION]
The probability distribution of parameter values before observing the data.

CONDITIONAL DISTRIBUTION
The distribution of one or more random variables when other random variables of a joint probability distribution are fixed at particular values.

POSTERIOR DISTRIBUTION
The conditional distribution of the parameter given the observed data.

POINT ESTIMATE
A summary of the location of a parameter value. In a Bayesian setting, this is generally the mean, mode or median of the posterior distribution.

INTERVAL ESTIMATE
An estimate of the region in which the true parameter value is believed to be located.

[Figure 1 panels: joint distribution P(D,Φ) with axes Data, D and Parameter, Φ; Prior – P(Φ); Likelihood – P(D|Φ); Marginal likelihood – P(D); Posterior distribution – P(Φ|D)]

Figure 1 | The basic features that underlie Bayesian inference. We imagine that the data D can take any value that is measured along the x-axis of the figure. Similarly, the parameter value Φ can take any value that is measured along the y-axis. Bayesian inference involves creating the joint distribution of parameters and data, P(D,Φ), illustrated by the contour intervals in the figure. This distribution can be obtained simply as the product of the prior P(Φ) and the likelihood P(D|Φ). Typically, the likelihood will arise from a statistical model in which it is necessary to consider how the data can be ‘explained’ by the parameter(s). The prior is an assumed distribution of the parameter that is obtained from background knowledge. The arrows in the figure show that marginal distributions are obtained by summing (integrating) the joint distribution either over the data, recovering the prior (the distribution on the right of the joint distribution), or over the values of the parameter, giving the MARGINAL LIKELIHOOD (the first distribution directly below the joint distribution). Conditional distributions (represented by the ‘|’ in notation) are indicated by the dotted lines in the figure, and represent taking a ‘slice’ through the joint distribution and then rescaling the distribution so that the sum (integral) of possible values is equal to one. The scaling factor that is needed is given by the marginal distribution. Any conditional distribution is simply the joint distribution divided by a marginal distribution. For example, the likelihood can be recovered by dividing the joint distribution by the prior. The posterior distribution, P(Φ|D), the key quantity that we want in Bayesian inference, is the joint distribution divided by the marginal likelihood. It is the computation of the marginal likelihood (that is, the integrations denoted by the arrows that point down from the joint distribution) that is typically problematic.

Principles of Bayesian inference
The essence of the Bayesian viewpoint is that there is no logical distinction between model parameters and data. Both are RANDOM VARIABLES with a JOINT PROBABILITY DISTRIBUTION that is specified by a probabilistic model. From this viewpoint, ‘data’ are observed variables and ‘parameters’ are unobserved variables. The joint distribution is a product of the likelihood and the PRIOR. The prior encapsulates information about the values of a parameter before examining the data in the form of a probability distribution. The likelihood is a CONDITIONAL DISTRIBUTION that specifies the probability of the observed data given any particular values for the parameters and is based on a model of the underlying process. Together, these two functions combine all available information about the parameters. Bayesian statistics simply involves manipulating this joint distribution in various ways to make inferences about the parameters, or the probability model, given the data (FIG. 1). The main aim of Bayesian inference is to calculate the POSTERIOR DISTRIBUTION of the parameters, which is the conditional distribution of parameters given the data.

A POINT ESTIMATE of a parameter is obtained by considering some property of the posterior distribution (usually the mode or the mean). An INTERVAL ESTIMATE of a parameter can be obtained by considering a ‘credible set’ of values (a set or interval that contains the true parameter with probability 1–α, for which α is a pre-specified significance level such as 0.05). An example that uses Bayesian inference to ‘assign’ an individual from an unknown source population to its population of birth on the basis of its genotype is presented in BOX 1.

Other well-known non-Bayesian approaches to statistical inference include the method of maximum likelihood and the METHOD OF MOMENTS, which form the basis of classical or FREQUENTIST INFERENCE [2]. Maximum likelihood bases inferences entirely on the likelihood function, incorporating no prior information and choosing point estimates of parameters that maximize the probability of the data given the parameter (that is, maximizing the likelihood as a function of the parameter for a fixed set of data). Historically, there have been many arguments both for and against the use of various inference frameworks. An old criticism of the Bayesian approach is that there is something unsatisfactorily subjective in choosing a prior. However, this is no different in principle from the choice of likelihood function in the maximum-likelihood method [1]. In fact, as is demonstrated below, modern Bayesian methods often place explicit prior probabilities on alternative likelihood functions to calculate their posterior probability given the data.

There are many practical reasons to use Bayesian inference: if a probability model includes many interdependent variables that are constrained to a particular range of values (as is often the case in genetics), maximum-likelihood inference requires that a constrained multi-dimensional maximization be carried out to find the combined set of parameter values that maximize the likelihood function. This is often a difficult numerical analysis problem and might require enormous computational effort. In addition, under the maximum-likelihood


method, calculation of confidence intervals and statistical tests generally involve approximations that are most accurate for large sample sizes (for example, that the probability distribution of the maximum-likelihood estimate follows a normal distribution). On the other hand, in Bayesian inference (in which the prior automatically imposes the parameter constraints) inferences about parameter values on the basis of the posterior distribution usually require integration (for example, calculating means) rather than maximization, and no further approximation is involved. Moreover, numerical methods that were developed in the 1950s using MCMC methods (BOX 2) and implemented on powerful new computers have greatly facilitated the evaluation of Bayesian posterior probabilities, making the calculations tractable for complicated genetic models that have resisted analysis using maximum likelihood or other classical methods. This is arguably the most important factor that drives the recent surge of popularity of Bayesian inference in most branches of science. Here, we present a range of examples in which Bayesian inference has allowed complicated models to be studied and biologically relevant parameters to be estimated, as well as allowing prior information to be efficiently incorporated.

Population genetics
Population genetics has a rich theoretical heritage that stems from the work of Fisher, Haldane and Wright. Initial statistical methods involved calculating expected values of various estimators as functions of parameters in a genetic model and applying the method of moments. Likelihood approaches were not applied to population-genetic problems until later [3,4]. The development of COALESCENT THEORY [5,6] has strongly influenced many areas of population genetics. Similar to earlier approaches, the theory allows the expected values of statistics to be calculated, but also enables sample data sets to be simulated rapidly for PARAMETRIC BOOTSTRAPPING, which in turn allows for more sophisticated calculation of confidence intervals and hypothesis testing in the frequentist tradition. Although not applicable in all areas of population-genetic analysis, the coalescent theory forms the basis for likelihood calculations in genealogical models [7] and has allowed the use of Bayesian approaches to infer demographic history from genetic data (BOX 3). In addition, Bayesian methods have been used to assign individuals to their population of origin and to detect selection acting on genes.

Estimating parameters in demographic models. A feature of population-genetic inference is that parameters in the likelihood function, such as mutation rate (µ) and EFFECTIVE POPULATION SIZE (Ne), occur only as their product (µNe); that is, they are NON-IDENTIFIABLE. With non-Bayesian inference, if one parameter is of interest, a ‘best-guess’ point estimate is typically used for another [8], and there is no rigorous way to incorporate uncertainty. An arguable [9] strength of the Bayesian approach is that prior information can be used to make inferences about non-identifiable parameters [10,11].

METHOD OF MOMENTS
A method for estimating parameters by using theory to obtain a formula for the expected value of statistics measured from the data as a function of the parameter values to be estimated. The observed values of these statistics are then equated to the expected values. The formula is inverted to obtain an estimate of the parameter.

FREQUENTIST INFERENCE
Statistical inference in which probability is interpreted as the relative frequency of occurrences in an infinite sequence of trials.

COALESCENT THEORY
A theory that describes the genealogy of chromosomes or genes. Under many life-history schemes (discrete generations, overlapping generations, non-random mating, and so on), taking certain limits, the statistical distribution of branch lengths in genealogies follows a simple form. Coalescent theory describes this distribution.

PARAMETRIC BOOTSTRAPPING
The process of repeatedly simulating new data sets with parameters that are inferred from the observed data, and then re-estimating the parameters from these simulated data sets. This process is used to obtain confidence intervals.

EFFECTIVE POPULATION SIZE (Ne)
The size of a random mating population under a simple Fisher–Wright model that has an equivalent rate of inbreeding to that of the observed population, which might have additional complexities such as variable population size or biased sex ratio.

NON-IDENTIFIABLE [PARAMETERS]
One or more model parameters are non-identifiable if different combinations of the parameters generate the same likelihood of the data.

Box 1 | An example of Bayesian inference: assigning individuals to populations

This example should be interpreted with reference to FIG. 1. We imagine a situation in which there are haploid individuals in a population into which immigrants arrive at a low rate. From background information, such as ringing data in birds, we think that the probability that any randomly chosen individual is resident is 0.9 and the probability that it is an immigrant is 0.1: this is our prior (last column on the right). In this population, there are two genotypes at a locus (A and B). Again from background information, we think that the likelihood of genotype A is 0.01 in the immigrant pool and 0.95 in the resident pool (far left column under genotype A). The joint distribution is the product of the prior and the likelihood (middle columns under each genotype): this represents the probability of a particular observation. For example, the joint distribution of an immigrant with genotype A is 0.001. The probability that an observation will be of a particular genotype, irrespective of whether it is resident or immigrant, is given by the lower margin of the table, which is obtained by summing the joint distribution across parameter values. Given that we observe a particular genotype, the posterior probability that it is either immigrant or resident (right-hand columns under each genotype) is given by the joint distribution scaled so that the sum of possibilities is one, obtained by dividing the joint distribution by the probability of the data. So, if we observe genotype B, the posterior probability that it is an immigrant is 0.69 (whereas it was 0.1 before this observation).

Data (observed variables): Genotype A, Genotype B
Parameters (unobserved variables): Immigrant, Resident

                      Genotype A                                Genotype B
                      Likelihood  Joint         Posterior       Likelihood  Joint         Posterior       Prior
                                  distribution  probability                 distribution  probability     probability
Immigrant             0.01        0.001         0.0012          0.99        0.099         0.69            0.1
Resident              0.95        0.855         0.9988          0.05        0.045         0.31            0.9
Probability of data               0.856                                     0.144                         1
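The arithmetic of Box 1 can be reproduced in a few lines. The following is a minimal sketch rather than anything taken from the article: the prior and likelihood values are those given in Box 1, while the dictionary layout and the function name `assign_posterior` are illustrative assumptions.

```python
# Prior and likelihood values as stated in Box 1.
PRIOR = {"immigrant": 0.1, "resident": 0.9}
LIKELIHOOD = {
    "A": {"immigrant": 0.01, "resident": 0.95},
    "B": {"immigrant": 0.99, "resident": 0.05},
}


def assign_posterior(genotype):
    """Return (joint, probability of data, posterior) for one observed genotype.

    Hypothetical helper written for this illustration only.
    """
    # Joint distribution = prior x likelihood for each source population.
    joint = {src: PRIOR[src] * LIKELIHOOD[genotype][src] for src in PRIOR}
    # Probability of the data (marginal likelihood) = sum of the joint
    # distribution over the unobserved parameter values.
    prob_data = sum(joint.values())
    # Posterior = joint distribution rescaled to sum to one.
    posterior = {src: joint[src] / prob_data for src in joint}
    return joint, prob_data, posterior


for genotype in ("A", "B"):
    joint, prob_data, posterior = assign_posterior(genotype)
    print(genotype, joint, round(prob_data, 3),
          {src: round(p, 4) for src, p in posterior.items()})
```

With these inputs the posterior probability that a genotype B individual is an immigrant is 0.099/0.144 = 0.6875, which rounds to the 0.69 quoted in the box.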


Box 2 | Markov chain Monte Carlo methods

Markov chain Monte Carlo (MCMC) describes a class of method that relies on simulating a special type of stochastic process, known as a Markov chain, to study properties of a complicated probability distribution that cannot be easily studied using analytical methods (reviewed in REF. 95). A Markov chain generates a series of random variables such that the probability distribution of future states is completely determined by the current state at any point in the chain. Under certain conditions, a Markov chain will have a ‘stationary distribution’, meaning that if the chain is iterated for a sufficient period, the states it visits will tend to a specific probability distribution that no longer depends on the iteration number or the initial state of the variable. The basic idea that underlies all MCMC methods is to construct a Markov chain with a stationary distribution that is the probability distribution of interest, and then to sample from this distribution to make inferences. In Bayesian analysis, this distribution is usually the joint posterior distribution of one or more parameters. MCMC has also been used for estimating likelihoods and other purposes in maximum-likelihood inference. Monte Carlo refers to the quarter in the principality of Monaco that is famous for its gambling casinos and alludes to the fact that random numbers are generated to simulate the Markov chain: this method has much in common with generating random events (such as rolling dice) as is done in games of chance. The simplest form of MCMC is Monte Carlo integration.

Monte Carlo integration
The basic idea that underlies Monte Carlo (MC) integration is that properties of random variables (such as the mean) can be studied by simulating many instances of a variable and analysing the results (reviewed in REF. 96). Each replicate of the MC simulations is independent and the procedure is therefore equivalent to taking repeated samples from a Markov chain that is ‘stationary’ at points that are sufficiently separated so that they are not correlated. MC integration has been widely applied in statistical genetics (see, for example, REF. 97). The MC simulation method has the advantage that the estimates obtained are unbiased and the standard error of the estimates can be accurately estimated because the simulated random variables are independent and identically distributed. A disadvantage is that with complex multidimensional variables that have a large state space (for example, a range of possible values), enormous numbers of replicate simulations are needed to obtain accurate parameter estimates.

Metropolis–Hastings algorithm
The Metropolis–Hastings (MH) algorithm [98,99] is similar to the MC simulation procedure in that it aims to sample from a stationary Markov chain to simulate observations from a probability distribution. However, in this case, rather than simulating independent observations from the stationary distribution, it simulates sequential values from the chain until it converges and then samples simulated values at intervals from the chain to mimic independent samples from the stationary distribution. The MH algorithm has the advantage that it can improve the efficiency of simulations when the state space is large because it focuses the simulated variables on values with high probability in the stationary chain. Disadvantages include the fact that in most practical applications, there are no rigorous methods available to determine when the chain has converged or what the optimal intervals between samples are to extract the most information while preserving independence between observations.

HIERARCHICAL BAYESIAN MODEL
In a standard Bayesian model, the parameters are drawn from prior distributions, the parameters of which are fixed by the modeller. In a hierarchical model, these parameters, usually referred to as ‘hyperparameters’, are also free to vary and are themselves drawn from priors, often referred to as ‘hyperpriors’. This form of modelling is most useful for data that is composed of exchangeable groups, such as genes, for which the possibility is required that the parameters that describe each group might or might not be the same.

APPROXIMATE BAYESIAN COMPUTATION
The data are simplified by representation as a set of summary statistics and simulations used to draw samples from the joint distribution of parameters and summary statistics (that is, the distribution shown in figure 1). The posterior distribution is approximated by estimating the conditional distribution of parameters in the vicinity of the summary statistics that are measured from the data (the vertical dotted line in figure 1) avoiding the need to calculate a likelihood function.

MULTILOCUS GENOTYPES
The combinations of alleles that are observed when individuals are simultaneously genotyped at two or more genetic marker loci.

ASSOCIATION STUDY
If two or more variables have joint outcomes that are more frequent than would be expected by chance (if the two variables were independent), they are associated. An association study statistically examines patterns of co-occurrence of variables, such as genetic variants and disease phenotypes, to identify factors (genes) that might contribute to disease risk.

INBREEDING COEFFICIENT
The probability of homozygosity by descent; that is, the probability that a zygote obtains copies of the same ancestral gene from both its parents because they are related.

Demographic models often have many parameters and it is conceptually easier to make inferences about them individually, or at most, jointly as pairs. Through the use of marginal posterior distributions, Bayesian analysis deals with this problem simply and flexibly. The classical alternatives are to use point estimates for other parameters or to construct confidence intervals on the basis of profile likelihood [12]. However, in demographic inference, likelihood functions can be complicated and the approximations behind the construction of frequentist confidence intervals are probably not accurate and are technically difficult to apply with a large number of parameters [13,14]. Variability among loci in parameters such as mutation rates can be addressed through the use of HIERARCHICAL BAYESIAN MODELS [15,16] (BOX 4), for which no classical counterpart is readily available.

As a result of these strengths, Bayesian analysis has in recent years become more prevalent in demographic inference (BOX 5). Computational difficulties can be addressed by improving the efficiency of MCMC methods [16], and also through the use of alternatives to MCMC. An example of the latter is what has come to be known as ‘APPROXIMATE BAYESIAN COMPUTATION’ (ABC) [17], which in comparisons [18] with the evaluation of the same problem through MCMC [19] can be up to 1,000 times faster, and only slightly less accurate.

Bayesian assignment methods. The study of population differences using genetic markers has a long history (reviewed in Cavalli-Sforza et al. [20]). However, it is only relatively recently that methods to assign individuals to populations on the basis of MULTILOCUS GENOTYPES (assignment methods) have been developed. The fundamental equation used in assignment methods calculates the probability of an individual’s multilocus genotype given the allele frequencies at different loci in different populations (see BOX 1). The range of practical applications of such assignment tests has proven to be broad. These applications include everything from detecting cryptic population admixture in ASSOCIATION STUDIES [21–24] to detecting population sources of sporadic outbreaks or emerging epidemics [25,26].

Recently, individual assignment methods have been extended in several new directions. Many of these new applications rely heavily on Bayesian methodologies and MCMC techniques. In particular, several new Bayesian methods have been proposed to allow the combined inference of both the partitioning of individuals into subpopulations and the assignment of individual migrant ancestries [27,28]. Another recently proposed method aims to enable the joint inference of the presence of subpopulations within a larger population and the estimation of traditional fixation indices

254 | APRIL 2004 | VOLUME 5 www.nature.com/reviews/genetics REVIEWS

Box 3 | Use of MCMC to infer parameters in genealogical models

Markov chain Monte Carlo (MCMC) methods can be used to obtain posterior distributions for demographic parameters, even though it is only possible to calculate likelihoods for individual genealogies. It is assumed that the parameter of interest is twice the product of the effective population size (Ne) and mutation rate. For simplicity, the prior for any parameter value is a constant, and, therefore, the posterior density for a parameter is proportional to the likelihood. From coalescent theory, we can calculate the probability of the data for a specific parameter value and specific genealogy. The MCMC is assumed to have two types of move: changing the parameter value while keeping the same genealogy, and changing the genealogy while keeping the same parameter value. The moves are reversible but those towards higher likelihoods are favoured (represented by the larger arrow heads in the figure). Relative likelihood is indicated by the area of each individual rectangle. The same genealogy is represented by the same colour. The relative likelihood for particular parameter values is the sum of the relative likelihoods of the genealogies, and provided that a representative sample of genealogies is explored, the MCMC will visit parameter values in proportion to their relative likelihood.

[Figure: posterior density plotted against 2Neµ.]

(F statistics29) among and within the identified subpopulations30. Finally, a Bayesian MCMC method has been proposed for inferring short-term migration rates (over the past few generations) using individual multilocus genotypes31. This method also allows for deviations from the Hardy–Weinberg equilibrium (that is, the genotype proportions expected under random mating) within populations by including a separate INBREEDING COEFFICIENT for each population (the value of the inbreeding coefficient is estimated as part of the MCMC inference procedure). The multidimensional complexity of these models makes maximum-likelihood inference difficult and no comparable maximum-likelihood methods have been developed. Multilocus assignment tests are currently in their infancy, but we expect that within a few years they will become a routinely used tool of biologists in fields as disparate as epidemiology, human gene mapping and behavioural ecology.

Detecting selection. Both COMPARATIVE METHODS and population-genetic methods can be used to identify candidate loci that might have been affected by selection32. In the case of population-genetic analysis, one idea is to use hierarchical Bayesian demographic models (BOX 4) in which the demographic parameters are allowed to vary among loci to mimic the effects of selection33,15. If the posterior probability of zero variance in demographic parameters among loci is itself close to zero, it is probable that some of these loci have been subject to selection. A similar approach has been used to identify candidates for adaptive selection in subdivided populations34. A method for finding the distribution of selective effects among loci has also been described35.

Population-genetic methods for detecting selection might be sensitive to the model that is fitted because demographic events, such as bottlenecks, might mimic or mask the effects of selection36. More robust inference is possible using sequence data from different species, in which demographic effects are irrelevant because the segregating variants within a population are not being considered36. Analyses at this level focus on the ratio ω of nucleotide substitutions that result in an amino-acid change to substitutions that leave the amino acid unchanged in the protein. If all amino-acid replacing substitutions are neutral, this ratio should be equal to one. If they are deleterious, this ratio should be less than one, and if favoured (positive selection), it should be more than one. Based on these principles, a Bayesian approach has been used to identify which codons are under positive selection in a gene37. In this approach (an EMPIRICAL BAYES PROCEDURE), maximum likelihood-generated point estimates of phylogenetic parameters are used to calculate the posterior probability that a codon belongs to one of three categories (ω = 0, 1, or >1). Bayesian phylogenetic methods (see REF. 38) might allow more fully Bayesian estimates of these probabilities.

COMPARATIVE METHODS: Methods for comparing traits across species to identify trends in character evolution that indicate the effects of natural selection.

EMPIRICAL BAYES PROCEDURE: A hierarchical model in which the hyperparameter is not a random variable but is estimated by some other (often classical) means.

Genomics

Sequence Analysis. The non-phylogenetic aspects of sequence analysis have a rich and diverse history of model-based methods39, and include an early application of MCMC to a biological problem40. Markov chains or HIDDEN MARKOV MODELS (HMMs) are at the heart of most maximum-likelihood methods of sequence analysis41. These methods use DYNAMIC PROGRAMMING to find high-dimensional maximum-likelihood solutions. Some likelihood-based analyses produce scoring functions that involve a Bayesian calculation. For example, the GeneMark software42, which is used to annotate prokaryote genomes, calculates the likelihood under several different situations (the probability of the data given that it is coding, non-coding, and so on) and then makes an empirical Bayes calculation to pick between them, similar to that described above for detecting nucleotides under selection.

HIDDEN MARKOV MODEL: This is an enhancement of a Markov chain model, in which the state of each observation is drawn randomly from a distribution, the parameters of which follow a Markov chain. For example, the parameter might be an indicator for whether a DNA region is coding or non-coding, and the observation is the base at each nucleotide.

DYNAMIC PROGRAMMING: A large class of programming algorithms that are based on breaking a large problem down (if possible) into incremental steps so that, at any given stage, optimal solutions are known to sub-problems.

A rich strand of Bayesian analysis has stemmed from models that assume that the bases at nucleotide positions, or amino-acid residues, are drawn at random from frequency distributions that vary among regions. The inference problem is then to locate the regions, marginal to other parameters such as base composition within and outside regions. In this context, Bayesian methods initially were used to model protein alignment40–43, an approach that has been extended to local alignment44, and have also been used to identify transcription-factor binding sites45. Bayesian modelling based on this approach has been used to obtain the marginal distribution of change points (boundaries of regions) and base compositions along a sequence46 (see

NATURE REVIEWS | GENETICS VOLUME 5 | APRIL 2004 | 255 REVIEWS

Box 4 | Hierarchical Bayesian models

In a standard Bayesian calculation, as in FIG. 1, the posterior distribution, P(Φ|D), is proportional to P(D|Φ)P(Φ). For example, Φ might be a mutation rate and P(Φ) might be a prior for the mutation rate. Later, however, it might become apparent that the mutation rate varies among loci, and that there are two causes of uncertainty: uncertainty in the ‘type’ of locus and uncertainty in the mutation rate given that type. Therefore, rather than combine these two sources of uncertainty into P(Φ), it is possible to split it into two parts so that σ is a parameter that reflects the type of locus and P(Φ|σ) is the uncertainty in mutation rate given that it is σ. Analogously, Φ might be variance among replicates in expression levels in a microarray experiment. Again, the variance might itself vary among genes, specified by σ. In these cases, the Bayesian calculation could be written as P(Φ,σ|D) ∝ P(D|Φ)P(Φ|σ)P(σ). The parameter σ is then often referred to as a ‘hyperparameter’ and P(σ) as a ‘hyperprior’. For data from a single unit, such as a locus, this might not make much difference in the model, depending on how the priors and hyperpriors are specified. However, if the data consist of several different loci, the types of which can be regarded as a random sample from the distribution that is specified by σ, we can then make inferences about σ, as indicated in the figure. The figure shows the posterior distribution of the parameter Φ inferred for three different units (loci/genes), conditional on three different values of the hyperparameter σ that controls variability in Φ among units. As σ becomes smaller (tends to zero; top panel), the posterior distributions of Φ for each unit become more similar, resulting in more similar means (shrinkage; compare the range of means indicated with a black horizontal line in the three panels) and a reduction in variance occurs (BORROWING STRENGTH; compare the variances of the middle distribution indicated with a pink horizontal line in the three panels).
Borrowing strength refers to the fact that as the priors for Φ become more similar, information is used across units. The inset shows the posterior distribution of σ. The figure implies that the posterior distribution of Φ for any locus, marginal to σ, will be intermediate between the case σ = 0.05 and σ = 0.5. An empirical Bayes procedure would use a point estimate for σ, rather than make inferences about Φ, marginal to σ.

[Figure: three panels (σ = 0.05, σ = 0.5 and σ = 1) showing P(Φ|D) against Φ for three units, annotated with shrinkage and borrowing strength; the inset shows P(σ|D) against σ, with axis ticks at 0.05, 0.5 and 1.]
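The shrinkage and borrowing-strength effects described in Box 4 can be reproduced with a small numerical sketch. It assumes a normal model that is not taken from the article: each unit's estimate y has a known sampling standard deviation s, and Φ for each unit is drawn from a normal distribution with mean mu and standard deviation σ. The names `conditional_posterior`, `summarize`, `unit_estimates`, `s` and `mu` are hypothetical.

```python
# Toy normal-normal hierarchical model (an assumption for illustration only).
# y_i ~ Normal(phi_i, s^2) and phi_i ~ Normal(mu, sigma^2), with s and mu known.

unit_estimates = [-1.0, 0.0, 1.0]  # hypothetical estimates for three loci/genes
s = 0.5                            # assumed sampling standard deviation
mu = 0.0                           # assumed mean of phi across units


def conditional_posterior(y, s, mu, sigma):
    """Posterior mean and variance of phi for one unit, conditional on sigma."""
    weight = sigma**2 / (sigma**2 + s**2)   # weight given to the unit's own data
    mean = weight * y + (1.0 - weight) * mu  # pulled towards mu when sigma is small
    variance = (sigma**2 * s**2) / (sigma**2 + s**2)
    return mean, variance


def summarize(sigma):
    """Range of posterior means across units, and variance for the middle unit."""
    posteriors = [conditional_posterior(y, s, mu, sigma) for y in unit_estimates]
    means = [mean for mean, _ in posteriors]
    return max(means) - min(means), posteriors[1][1]


for sigma in (0.05, 0.5, 1.0):
    spread, variance = summarize(sigma)
    print(f"sigma={sigma}: range of means={spread:.3f}, variance={variance:.4f}")
```

Under these assumptions, a smaller σ pulls the three posterior means together (shrinkage) and narrows each posterior (borrowing strength), which is the pattern the box describes.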

BORROW STRENGTH: This is the tendency in a hierarchical Bayesian model for the posterior distributions of parameters among exchangeable units (for example, genes) to become narrower as a result of pooling information across units.

MODEL SELECTION: The process of choosing among different models given their posterior probability.

also REF. 47). Maximum-likelihood approaches to a problem such as this are generally restricted in the number of parameters considered, and significance testing is often limited because of the high-dimensional optimizations required46. By contrast, the Bayesian approach allows more parameters to be considered (essentially allowing parameters that are assumed to be fixed in maximum-likelihood approaches to vary in the Bayesian analysis), it enables full inference on each parameter and allows more rigorous significance testing through MODEL SELECTION. It is often straightforward to incorporate an HMM model into a MCMC framework48 (see also REF. 47), and so it is likely that Bayesian analyses for sequence data will become more widespread in future, built on the maximum-likelihood framework.
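The maximum-likelihood framework that these Bayesian analyses build on can be illustrated with a two-state hidden Markov model decoded by dynamic programming (the Viterbi algorithm). This is a minimal sketch, not the method of any package cited in the article: the states, the transition and emission probabilities, and the function name `viterbi` are all invented for illustration, and the sequence is assumed to be non-empty.

```python
import math

# Hypothetical two-state HMM: each base is emitted by a hidden state that
# marks the region as coding or non-coding. All probabilities are made up.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Most probable state path for a non-empty DNA string (log-space)."""
    # Best log-probability of any path ending in each state at the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] is the best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Sub-problem: best way to arrive in state s from the previous position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("GCGCGCATATAT"))
```

A Bayesian treatment would replace the single best path with a posterior distribution over paths and parameters, for example by sampling state paths inside an MCMC scheme.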


Box 5 | Examples of Bayesian analysis in demographic inference

Inferring changes in population size
The first fully Bayesian genealogical analysis was applied to Y-linked microsatellite (YLM) data11. Subsequently, there has been interest in inferring population growth. Both approximate Bayesian computation100 and Markov chain Monte Carlo19 approaches have been used for YLM data (these approaches yield similar results18). Methods for unlinked microsatellite markers have also been developed33,101.

Analysis of population structure
Models of populations that diverge and evolve independently without gene flow have been considered both for DNA sequence data16 and also for YLM data19, the latter allowing complex bifurcating histories to be considered. A method that enables both migration and population splitting for DNA sequence data has also been developed13. Equilibrium models with a constant level of migration between populations seem not to have been directly addressed (but an option for Bayesian analysis is now available in the distributed package for the maximum-likelihood estimation method in REF. 12).

Use of temporal samples
Bayesian methods have been developed to deal with genetic data that are taken at different times, allowing for population growth102. This additional temporal information can remove the problem of non-identifiability of parameters. It is then possible to include ancient DNA data to make more accurate inferences about population demography. The method also has applications in viral epidemiology103. Furthermore, simpler models can be used to estimate effective population size in the short-term monitoring of populations104.

Identification of SNPs. The Human Genome Project49,50 has generated an interest in the identification of nucleotide sites that are polymorphic among individuals; that is, single nucleotide polymorphisms (SNPs). There is a large number of SNPs that potentially could be used as markers that are efficient and inexpensive to genotype. The advantages of SNPs for modelling demographic history are offset by the problems of modelling their ascertainment14,51. Typically, SNPs are identified by intensively sequencing a small sample of individuals. However, several factors, such as genotyping errors, can lead to a large number of false positives. This presents an ideal problem for Bayesian modelling in which there are data that can be explained by competing hypotheses, but in which we have prior information with which to make judgements among them.

The details of how the Bayesian approach can be applied will obviously depend on the technical details of how the SNPs are identified. A software package that is widely used in non-human52 as well as human genotyping is PolyBayes53 (see REF. 54 for a related approach). Two important problems in the identification of SNPs are the presence of PARALOGOUS sequences and sequencing errors. Bayesian calculations can deal with both these issues sequentially53. In the first case, the number of mismatches of a sample sequence from a reference sequence is measured. Using prior information on the average pairwise differences between paralogous sequences versus homologous sequences, the probability of obtaining any given number of mismatches under either hypothesis is calculated to obtain the posterior probability that a sequence is not paralogous to the reference sequence. Sequences in which this posterior probability is higher than some critical value are then selected out. The second stage involves performing another Bayesian calculation using aligned sequences, this time with two competing models: first, that the observed variants are the result of sequencing error, and second, that the observed variants are true polymorphisms. In this case, insertions and deletions are ignored. Initial indications are that this is an efficient approach: in a large data set of ESTs, this method discarded around 99.9% of cases as false positives (that is, those in which the variation is inferred to be the result of sequencing error) and 60% of the remaining SNPs were confirmed in a subsequent analysis53.

PARALOGOUS: This refers to sequences that have arisen by duplications within a single genome.

Bayesian haplotype inference through population samples. The inference of haplotypes (that is, determining the phase of non-allelic polymorphisms) is an important goal for many reasons (see REFS 55–65). Haplotype phase can be determined in several ways, including linkage analysis55 and direct molecular techniques, but most are too unreliable, too expensive or too time-consuming to be routinely used. Recently, population-genetic techniques have been proposed for inferring haplotype phase using population samples of genotypes56–59 based on the principle that the distribution of (observed) multilocus genotypes in a random sample of individuals carries information about the underlying distribution of (unobserved) haplotypes.

Bayesian methods58,59 have been proposed as an alternative to the Expectation-Maximization (EM) algorithm60 (a maximum-likelihood approach) for inferring haplotypes from population-genetic data because they do not require all the haplotype frequencies to be retained in computer memory and eliminate the computationally expensive maximization step of the EM algorithm. The Bayesian approach seeks to estimate the posterior probability distributions of the population haplotype frequencies, F, and/or the individual diplotypes (pairs of haplotypes), H, given the sampled genotypes, G. This requires that an explicit prior probability distribution for the population haplotype frequencies, Pr(F), be specified. Niu et al.58 use an arbitrary distribution for F, whereas Stephens et al.59 use a distribution that is loosely based on a population-genetic (coalescent) model. Although the methods of Stephens et al. and Niu et al. differ in many of the details, the basic approach is similar.

A shortcoming of current applications of haplotype-inference algorithms is that the resulting haplotypes are often used directly in subsequent studies (for example,


case–control tests for disease–haplotype associations) without accounting for the uncertainty of the individual’s inferred haplotypes. In other words, a point estimate of the individual haplotype is treated as an observation in carrying out such tests and this can make the test outcome unreliable if the posterior distribution of haplotypes is not highly concentrated. New methods are needed for carrying out tests of association, and so on, that integrate over the posterior probability distribution of haplotypes and thereby explicitly take account of uncertain phase in carrying out the test. A likelihood ratio test for differences in haplotype frequencies between cases and controls has been proposed by Slatkin and Excoffier61, but equivalent Bayesian methods have yet to be developed.

Inferring levels of gene expression and regulation. The introduction of methods for measuring levels of gene expression on the basis of DNA/RNA hybridization has provoked substantial interest in the statistical problems that arise62. Bayesian statisticians have taken on the challenge of this showcase area in droves, although many of these studies remain in the statistical journals. Although interesting statistical problems are raised in the actual processing of signals from hybridization data63, the questions that have attracted most attention are: which genes are affected by treatments (for example, tissues and times after treatment, and so on), and what is the model structure that best characterizes expression patterns?

Two issues are important when evaluating the effect of treatment on expression level: making maximum use of the information among genes to model variability among replicate experiments using a particular gene, and minimizing the false-positive and false-negative rates. In the first case, the idea is that with limited replication, it is difficult to be sure whether an observed difference is significant or not; therefore, we need to use the information from other genes. This can be achieved using a hierarchical Bayesian model, in which it is possible to borrow strength from different genes (BOX 4): a partially Bayesian treatment along these lines has already been proposed64. These and similar methods would then use a sequential p-value method to minimize the number of false positives (for example, see REF. 65). Alternatively, a more fully Bayesian method is possible66,67, in which the affected genes are picked out through model selection. The advantage of this approach is that great flexibility can be introduced into deciding the level of stringency of discrimination68.

Microarray studies are often used to group genes that show similar patterns of expression with different treatments. Traditionally, non-parametric ordination or clustering techniques have been used69. The advantage of applying Bayesian modelling instead is that it is then possible to carry out statistical tests and obtain confidence bounds on particular groupings, which are not easily obtained using the classical approaches. One approach, which models time-series gene-expression data using regression in a Bayesian framework, defines partitions in which genes have the same regression parameters, and then hierarchically clusters expression patterns on the basis of the posterior probability of partitions, starting with an initial state in which each gene belongs to its own partition70.

Box 6 | Analysis of complex traits and quantitative trait locus mapping

Complex genetic traits, such as body weight or height and many human diseases (for example, type II diabetes and schizophrenia), are determined by the combined influences of multiple genes and the environment. Such polygenic traits are often referred to as ‘quantitative’ because they are most often measured traits that have a more or less continuous distribution in the population. Genes that have a major effect on a quantitative trait are known as quantitative trait loci (QTLs). A common goal of much research in animal and plant genetics, as well as in human-disease genetics, is to map QTLs to regions of chromosomes in the hope that the causal loci might ultimately be identified by positional cloning. In animal populations, QTL mapping has been carried out for many years using controlled crosses. In humans, controlled crosses are not possible (for obvious reasons) and existing pedigrees must instead be used to map the loci through linkage analysis. Mapping through pedigrees has recently become popular in agricultural and livestock genetics as well.

One serious problem that is encountered when attempting to map QTLs through pedigree analysis is that the QTLs that influence human diseases, or other traits, often have low penetrance (penetrance refers to the probability that an individual who carries one or more copies of the gene has the disease/trait). Low penetrance greatly reduces the power of linkage analysis55. The size of the pedigrees can be increased to compensate for this reduction in power. However, maximum-likelihood methods for multipoint linkage analysis that use the ELSTON–STEWART ALGORITHM105 or the LANDER–GREEN–KRUGLYAK ALGORITHM106,107 are limited to either a small number of linked loci or fewer than approximately a dozen individuals per pedigree, respectively. Recently, Markov chain Monte Carlo methods for carrying out linkage analysis under complex models of inheritance have been developed108,109. The methods seem promising in that they allow much larger pedigrees to be analysed for many linked loci. Several of the most recently developed methods are Bayesian (reviewed by REF. 110) owing to the fact that the complex multidimensional space of the pedigree analysis problem with complex traits has limited progress for maximum-likelihood methods.

ELSTON–STEWART ALGORITHM: An iterative algorithm for linkage mapping. The algorithm calculates the likelihood of marker genotypes on a pedigree. Calculations on the basis of the algorithm are efficient for relatively large families, but its application is typically limited to a small number of markers.

LANDER–GREEN–KRUGLYAK ALGORITHM: An iterative algorithm that is used for linkage mapping. It iteratively calculates the likelihood across markers on a chromosome, rather than across families, as in the Elston–Stewart algorithm. This allows efficient calculation of pedigree likelihoods for small families with many linked markers.

Human genetics

The rapid expansion of human genetic data during the past few decades is unprecedented. The Human Genome Project produced a genetic blueprint of our chromosomes49,50 and documented similarities and differences between individuals; the current haplotype map project (HapMap; see online links box) seeks to further characterize the distribution of nucleotide polymorphisms across chromosomes in human populations71. These data present new opportunities to identify genes that are involved in human diseases, for both simple single-gene disorders, such as cystic fibrosis, and complex disorders that are caused by multiple genes and the environment, such as schizophrenia (reviewed in REF. 72; see BOX 6). Genetic marker polymorphisms in human populations can be used to identify genes or genomic regions that are associated with diseases and to aid in the positional cloning of a disease mutation. These objectives require complex statistical modelling, and Bayesian inference has made more rigorous statistical methods feasible in both areas.

Association mapping. Association-mapping methods attempt to locate disease mutations by detecting associations between the incidence of a genetic polymorphism and that of a disease (reviewed in REF. 73). Often referred to as ‘case–control studies’, such methods have seen widespread application to disease studies using genetic


markers in recent years. Association studies that rely on linkage disequilibrium might provide a new tool for mapping genes that influence complex diseases (reviewed in REF. 74).

Although association methods have been shown to be potentially more powerful than linkage analysis for detecting genes that influence complex disease in some circumstances, they are plagued by false-positive results for various reasons73. One source of false-positive associations is population stratification. If a disease mutation and a particular marker allele both happen to have an increased, or decreased, frequency in some particular population (for example, owing to random effects such as joint genetic drift to a higher, or lower, frequency of susceptibility alleles and other non-causal alleles, or as a result of confounding variables such as environmental effects), the allele and the disease might seem to be associated; however, the allele is really a marker of population affiliation rather than being linked to a disease locus and is therefore a false association.

In the early 1990s, FAMILY-BASED ASSOCIATION TESTS (FBATs), such as the transmission disequilibrium test75, were proposed to allow association studies to be carried out in the presence of population stratification. The basic idea was to examine trios of parents and an affected offspring and to use the non-transmitted alleles from parents as controls and the transmitted alleles as cases. This procedure ensures that the proper control allele is used in each comparison even in cases in which the parental mating represents admixture between populations. The currently available FBATs have several shortcomings. First, they test the composite null hypothesis of either no linkage or no association. In many cases, either linkage or association might be of specific interest. Second, the methods do not readily allow information from other prior linkage or association studies to be incorporated into the test. Recently, a Bayesian FBAT has been proposed as a potential solution76. The new method combines the likelihood function for FBATs developed by Sham and Curtis77 with flexible prior probability densities for model parameters such as the recombination fraction between the disease and marker loci that allow either uninformative (uniform) or informative priors to be used depending on the available information. Standard techniques for model testing, based on the BAYES FACTOR, are then used to directly test specific hypotheses about linkage, and so on.

FAMILY-BASED ASSOCIATION TESTS: A general class of genetic association tests that uses families with one or more affected children as the observations rather than unrelated cases and controls. The analysis treats the allele that is transmitted to (one or more) affected children from each parent as the ‘case’ and the untransmitted allele is treated as the ‘control’ to avoid the influence of population subdivision.

BAYES FACTOR: The ratio of the prior probabilities of the null versus the alternative hypotheses over the ratio of the posterior probabilities. This can be interpreted as the relative odds that the hypothesis is true before and after examining the data. If the prior odds are equal, this simplifies to become the likelihood ratio.

An alternative way to correct for the effects of population stratification in association analyses is to examine unlinked genetic markers (so-called ‘genomic controls’) to correct for population subdivision in association studies21. Multilocus assignment tests developed in recent years78,79 have been applied to the problem of association mapping in admixed populations21,22. These methods have at least two limitations: they were not specifically developed for mapping susceptibility alleles that influence complex traits, and they do not adequately account for the statistical uncertainty of genomic ancestries and admixture proportions. Several Bayesian approaches have been proposed that attempt to correct for these deficiencies. Sillanpaa et al.80 proposed a fully Bayesian approach for association-based quantitative trait locus mapping using unlinked neutral markers as genomic controls. More recently, Hoggart et al.81 proposed a hybrid Bayesian–classical method that uses MCMC to integrate over uncertain admixture proportions and uncertain numbers of founding populations that are involved in an admixture, with a classical generalized linear model approach used to specify trait values.

Fine-mapping of disease-susceptibility genes. In the 1980s, the first genome-wide genetic markers were developed using restriction fragment length polymorphisms (RFLPs). This allowed disease genes to be assigned to specific chromosomal intervals using pedigree-based linkage analysis and raised the possibility of positionally cloning a disease gene. The size of a candidate interval defined by linkage analysis (determined by the number of informative meioses) is typically 1 Mb or more, however, which is much larger than could be sequenced using 1980s technologies. One solution is to genotype polymorphic markers that span the candidate region among unrelated individuals. In this way, ‘ancestral’ haplotypes that are shared between disease chromosomes can be detected and used to further narrow the candidate region82,83. The basic idea is that disease mutations arise on particular chromosomes that carry specific haplotypes, and ancestral recombination increasingly disrupts haplotype sharing in regions that are further from the disease-mutation location84. Because alleles at markers near a disease mutation are in greater linkage disequilibrium (LD) than those further away, this technique has come to be known as LD MAPPING.

LD MAPPING: A procedure for fine-scale localization to a region of a chromosome of a mutation that causes a detectable phenotype (often a disease) by use of linkage disequilibrium between the phenotype that is induced by the mutation and markers that are located near the mutation on the chromosome.

Early methods for LD mapping could only be used for pairwise analyses using single-linked genetic markers; the basic approach was to solve for the expected fraction of non-recombinant haplotypes under a simple demographic model and then to use this result to derive an estimate of the disease location assuming a Poisson recombination process on the candidate interval85. Subsequent methods used parametric models based on coalescent theory that were more realistic for human populations and solved for the maximum-likelihood estimate of the disease-mutation position (reviewed in REF. 86). As the models were made more realistic, and attempts were made to include factors such as multiple linked markers and genetic heterogeneity (for example, multiple disease alleles), it became increasingly difficult to derive tractable maximum-likelihood estimates. Bayesian methods that use MCMC offer a potentially powerful alternative for such analyses. These methods allow integration (averaging) over nuisance parameters such as the unknown genealogy (coalescent tree) and ancestral haplotypes that underlie a sample of disease (and control) chromosomes87,88, and over the unknown ages of disease mutations89. These new methods also allow the direct use of multilocus haplotypes or genotypes90,91 and have been extended to allow the incorporation of additional genomic information into LD mapping through the prior for the disease location. Rannala and Reeve87 used information from an annotated human genome sequence (National Center for Biotechnology

NATURE REVIEWS | GENETICS VOLUME 5 | APRIL 2004 | 259 REVIEWS

Information (NCBI); see online links box) and the Human Gene Mutation Database (HGMD; see online links box) to modify prior probabilities for the location of a novel disease mutation taking account of the likelihood that disease mutations reside in introns, exons or non-coding DNA. Other innovations made possible by the Bayesian approach include the direct use of genotype data, rather than haplotypes90,91, by integrating over possible haplotypes in the MCMC algorithm. Allelic heterogeneity can also be modelled using so-called ‘shattered coalescent’ methods that model independent disease mutations as having separate underlying genealogies88.

Prospects and caveats

The enormous flexibility of the Bayesian approach, illustrated by the examples given in this article, also points to the need for rigorous model testing. In frequentist inference, a common practice has been to simulate large numbers (thousands) of test data sets in which the true parameter values are known, and then measure the bias, mean squared error and coverage of the estimates. Such a method sits uneasily within the Bayesian model, but is often the simplest way to compare with frequentist approaches18. For model-checking in Bayesian inference, it has been suggested that parameters should be drawn from the posterior distribution and then used to simulate other data sets2. This is the posterior predictive distribution: the distribution of other data sets given the observed data set. Summary statistics measured in the real data can then be compared with those in the simulated data to see whether the model is reasonable. However, in practice this approach has seldom been taken. Similarly, although it is important to check the sensitivity of the model to the priors, in complicated hierarchical models it is generally unfeasible to systematically examine the effect of different priors on the many parameters in the model. Another issue for studies based on MCMC is the problem of assessing CONVERGENCE, which can be particularly acute for models with a variable number of dimensions. Generally, most Bayesian methods are slow, which provides a strong disincentive for anything more than rudimentary model-checking.

Current trends indicate that modifications to standard MCMC methods will be increasingly explored92. For cases in which there are a large number of parameters that are not of interest (such as genealogical history in population-genetic models) and only a few that are of interest, the ABC18,17 approach seems particularly promising. It is also a ‘democratizing’ method in that it will attract, for example, biologists, who enjoy computer simulation but have little background in probability, into converting their favourite simulation into a tool for inference. Another burgeoning area, not covered in this review, is the use of Bayesian networks for combining the results from different analyses on the same data sets93,94. It could, however, be argued that such approaches, although useful and commercially advantageous, are technical fixes that do not easily lend themselves to scientific enquiry. By contrast, the methods described here are based on probabilistic models of the processes that give rise to a pattern. They have parameters that bear some relation to quantities that could in principle be measured and tested. At the moment, the Bayesian revolution is in its earliest phase, and it will be some time yet before the dust has settled and we can judge which are the most promising avenues for exploration.

CONVERGENCE
The inexorable tendency for a mathematical function to approach some particular value (or set of values) with increasing n. In the case of Markov chain Monte Carlo, n is the number of simulation replicates and the values that the chain approaches are the posterior probabilities.
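The posterior predictive check described in the text above (draw parameters from the posterior, simulate other data sets, compare a summary statistic with the observed one) can be illustrated with a minimal sketch. It is not taken from the article. It assumes a deliberately simple model in which every sample shares one allele frequency with a uniform Beta(1, 1) prior, and it uses the between-sample variance of allele counts as the summary statistic. The function name, the example counts and the number of draws are all hypothetical.

```python
import random
import statistics


def posterior_predictive_pvalue(counts, n_per_sample, n_draws=2000, seed=1):
    """Posterior predictive check for a shared-allele-frequency binomial model.

    counts: observed allele counts, one per sample (hypothetical data).
    n_per_sample: number of chromosomes typed in each sample.
    Returns the fraction of replicated data sets whose between-sample
    variance is at least as large as the observed variance.
    """
    rng = random.Random(seed)
    total = sum(counts)
    trials = n_per_sample * len(counts)
    # Assumption: uniform Beta(1, 1) prior, so the posterior is also a Beta.
    a, b = 1 + total, 1 + trials - total
    observed_stat = statistics.pvariance(counts)
    exceed = 0
    for _ in range(n_draws):
        p = rng.betavariate(a, b)  # draw the parameter from the posterior
        # Simulate one replicated data set under the model with this p.
        replicate = [
            sum(rng.random() < p for _ in range(n_per_sample))
            for _ in counts
        ]
        if statistics.pvariance(replicate) >= observed_stat:
            exceed += 1
    return exceed / n_draws


# Hypothetical example data: six samples of 20 chromosomes each.
homogeneous = [10, 11, 9, 10, 12, 8]
structured = [2, 18, 3, 17, 1, 19]
print(posterior_predictive_pvalue(homogeneous, 20))
print(posterior_predictive_pvalue(structured, 20))
```

A value near zero for the second data set would suggest that the single-frequency model cannot reproduce the observed spread between samples, which is the kind of misfit the check is meant to expose. Other summary statistics could be substituted without changing the structure of the loop.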

1. Shoemaker, J. S., Painter, I. S. & Weir, B. S. Bayesian statistics in genetics: a guide for the uninitiated. Trends Genet. 15, 354–358 (1999).
2. Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. Bayesian Data Analysis (Chapman and Hall, London, 1995).
3. Cavalli-Sforza, L. L. & Edwards, A. W. F. Phylogenetic analysis: models and estimation procedures. Evolution 32, 550–570 (1967).
4. Ewens, W. J. The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3, 87–112 (1972).
   The first use of a sampling distribution in population genetics. This paper anticipates modern approaches, such as the coalescent theory, that model the sampling distribution of chromosomes.
5. Kingman, J. F. C. The coalescent. Stochastic Process. Appl. 13, 235–248 (1982).
6. Hudson, R. R. Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23, 183–201 (1983).
7. Felsenstein, J. Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates. Genet. Res. 59, 139–147 (1992).
8. Griffiths, R. C. & Tavaré, S. Ancestral inference in population genetics. Statistical Sci. 9, 307–319 (1994).
9. Markovtsova, L., Marjoram, P. & Tavaré, S. The effect of rate variation on ancestral inference in the coalescent. Genetics 156, 1427–1436 (2000).
10. Tavaré, S., Balding D. J., Griffiths, R. C. & Donnelly, P. Inferring coalescence times from DNA sequence data. Genetics 145, 505–518 (1997).
11. Wilson, I. J. & Balding, D. J. Genealogical inference from microsatellite data. Genetics 150, 499–510 (1998).
    An early paper that uses MCMC to carry out a fully Bayesian analysis of population-genetic data.
12. Beerli, P. & Felsenstein, J. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proc. Natl Acad. Sci. USA 98, 4563–4568 (2001).
13. Nielsen, R. & Wakeley, J. Distinguishing migration from isolation: a Markov chain Monte Carlo approach. Genetics 158, 885–896 (2001).
14. Wakeley, J., Nielsen, R., Liu-Cordero, S. N. & Ardlie, K. The discovery of single-nucleotide polymorphisms and inferences about human demographic history. Am. J. Hum. Genet. 69, 1332–1347 (2001).
15. Storz, J. F., Beaumont, M. A. & Alberts, S. C. Genetic evidence for long-term population decline in a savannah-dwelling primate: inferences from a hierarchical Bayesian model. Mol. Biol. Evol. 19, 1981–1990 (2002).
16. Rannala, B. & Yang, Z. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164, 1645–1656 (2003).
17. Marjoram, P., Molitor, J., Plagnol, V. & Tavaré, S. Markov chain Monte Carlo without likelihoods. Proc. Natl Acad. Sci. USA 100, 15324–15328 (2003).
18. Beaumont, M. A., Zhang, W. & Balding, D. J. Approximate Bayesian computation in population genetics. Genetics 162, 2025–2035 (2002).
19. Wilson, I. J., Weale, M. E. & Balding, D. J. Inferences from DNA data: population histories, evolutionary processes and forensic match probabilities. J. Roy. Stat. Soc. A Sta. 166, 155–188 (2003).
20. Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. The History and Geography of Human Genes (Princeton Univ. Press, Princeton, 1994).
21. Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
22. Pritchard, J. K. & Rosenberg, N. A. Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet. 65, 220–228 (1999).
23. Pritchard, J. K., Stephens, M., Rosenberg, N. A. & Donnelly, P. Association mapping in structured populations. Am. J. Hum. Genet. 67, 170–181 (2000).
24. Pritchard, J. K. & Donnelly, P. Case–control studies of association in structured or admixed populations. Theor. Popul. Biol. 60, 227–237 (2001).
25. Davies, N., Villablanca, F. X. & Roderick, G. K. Bioinvasions of the medfly Ceratitis capitata: source estimation using DNA sequences at multiple intron loci. Genetics 153, 351–360 (1999).
26. Bonizzoni, M. et al. Microsatellite analysis of medfly bioinfestations in California. Mol. Ecol. 10, 2515–2524 (2001).
27. Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
    An influential paper in the development of Bayesian methods to study cryptic population structure. The program described in it, Structure, has been widely used in molecular ecology.
28. Dawson, K. J. & Belkhir, K. A Bayesian approach to the identification of panmictic populations and the assignment of individuals. Genet. Res. 78, 59–77 (2001).
29. Wright, S. Evolution and the Genetics of Populations: The Theory of Gene Frequencies (Chicago Univ. Press, Chicago, 1969).
30. Corander, J., Waldmann, P. & Sillanpaa, M. J. Bayesian analysis of genetic differentiation between populations. Genetics 163, 367–374 (2003).
31. Wilson, G. A. & Rannala, B. Bayesian inference of recent migration rates using multilocus genotypes. Genetics 163, 1177–1191 (2003).
32. Bamshad, M. & Wooding, S. P. Signatures of natural selection in the human genome. Nature Rev. Genet. 4, 99–111 (2003).
33. Storz, J. F. & Beaumont, M. A. Testing for genetic evidence of population expansion and contraction: an empirical analysis of microsatellite DNA variation using a hierarchical Bayesian model. Evolution 56, 154–166 (2002).
34. Beaumont, M. A. & Balding, D. J. Identifying adaptive genetic divergence among populations from genome scans. Mol. Ecol. (in the press).
35. Bustamante, C. D., Nielsen, R. & Hartl, D. L. Maximum likelihood and Bayesian methods for estimating the distribution of selective effects among classes of mutations using DNA polymorphism data. Theor. Popul. Biol. 63, 91–103 (2003).


36. Nielsen, R. Statistical tests of selective neutrality in the age of genomics. Heredity 86, 641–647 (2001).
37. Nielsen, R. & Yang, Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148, 929–936 (1998).
    The first formal statistical method for inferring site-specific selection on DNA codons.
38. Holder, M. & Lewis, P. O. Phylogeny estimation: traditional and Bayesian approaches. Nature Rev. Genet. 4, 275–284 (2003).
    Reviews the many recent applications of Bayesian inference in phylogeny estimation.
39. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological Sequence Analysis, (Cambridge Univ. Press, Cambridge,
40. Lawrence, C. E. et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208–214 (1993).
    The methods and models used in this paper have led to the development of a large number of Bayesian methods for the analyses of sequence data by some of the authors and their groups.
41. Churchill, G. A. Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51, 79–94 (1989).
    One of the earliest papers to use a hidden Markov model to analyse DNA sequence data.
42. Borodovsky, M. & McIninch, J. Genmark: parallel gene recognition for both DNA strands. Comput. Chem. 17, 123–133 (1993).
43. Liu, J. S., Neuwald, A. F. & Lawrence, C. E. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Ass. 90, 1156–1170 (1995).
44. Webb, B. M., Liu, J. S. & Lawrence, C. E. BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res. 30, 1268–1277 (2002).
45. Thompson, W., Rouchka, E. C. & Lawrence, C. E. Gibbs recursive sampler: finding transcription factor binding sites. Nucleic Acids Res. 31, 3580–3585 (2003).
46. Liu, J. S. & Lawrence, C. E. Bayesian inference on biopolymer models. Bioinformatics 15, 38–52 (1999).
47. Liu, J. S. & Logvinenko, T. in Handbook of Statistical Genetics (eds Balding, D. J., Bishop, M. & Cannings, C.) 66–93 (John Wiley and Sons, Chichester, 2003).
48. Churchill, G. A. & Lazareva, B. Bayesian restoration of a hidden Markov chain with applications to DNA sequencing. J. Comput. Biol. 6, 261–277 (1999).
49. Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
50. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
51. Polanski, A. & Kimmel, M. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics 165, 427–436 (2003).
52. Zhu, Y. L. et al. Single-nucleotide polymorphisms in soybean. Genetics 163, 1123–1134 (2003).
53. Marth, G. T. et al. A general approach to single-nucleotide polymorphism discovery. Nature Genet. 23, 452–456 (1999).
54. Irizarry, K. et al. Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nature Genet. 26, 233–236 (2000).
55. Ott, J. Analysis of Human Genetic Linkage (Johns Hopkins, Baltimore, 1999).
56. Long, J. C., Williams, R. C. & Urbanek, M. An E-M algorithm and testing strategy for multiple-locus haplotypes. Am. J. Hum. Genet. 56, 799–810 (1995).
57. Excoffier, L. & Slatkin, M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12, 921–927 (1995).
58. Niu, T., Qin, Z. S., Xu, X. & Liu, J. S. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet. 70, 157–169 (2002).
59. Stephens, M., Smith, N. J. & Donnelly, P. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68, 978–989 (2001).
60. Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B39, 1–38 (1977).
61. Slatkin, M. & Excoffier, L. Testing for linkage disequilibrium in genotypic data using the Expectation-Maximization algorithm. Heredity 76, 377–383 (1996).
62. Butte, A. The use and analysis of microarray data. Nature Rev. Genet. 1, 951–960 (2002).
63. Huber, W., von Heydebreck, A. & Vingron, M. in Handbook of Statistical Genetics (eds Balding, D. J., Bishop, M. & Cannings, C.) 162–187 (John Wiley and Sons, Chichester, 2003).
64. Baldi, P. & Long, A. D. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17, 509–519 (2001).
65. Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA 100, 9440–9445 (2003).
66. Ibrahim, J. G., Chen, M. H. & Gray, R. J. Bayesian models for gene expression with DNA microarray data. J. Am. Stat. Ass. 97, 88–99 (2002).
67. Ishwaran, H. & Rao, J. S. Detecting differentially expressed genes in microarrays using Bayesian model selection. J. Am. Stat. Ass. 98, 438–455 (2003).
68. Lee, K. E., Sha, N., Dougherty, E. R., Vannucci, M. & Mallick, B. K. Gene selection: a Bayesian variable selection approach. Bioinformatics 19, 90–97 (2003).
69. Zhang, M. Q. Large-scale gene expression data analysis: a new challenge to computational biologists. Genome Res. 9, 681–688 (2003).
70. Heard, N. A., Holmes, C. C. & Stephens, D. A. A quantitative study of gene regulation involved in the immune response of anopheline mosquitoes: an application of Bayesian hierarchical clustering of curves. Department of Statistics, Imperial College, London [online], (2003).
71. Dove, A. Mapping project moves forward despite controversy. Nature Med. 12, 1337 (2002).
72. Rannala, B. Finding genes influencing susceptibility to complex diseases in the post-genome era. Am. J. Pharmacogenomics 1, 203–221 (2001).
73. Sham, P. Statistics in Human Genetics, (Oxford Univ. Press, New York, 1998).
74. Jorde, L. B. Linkage disequilibrium and the search for complex disease genes. Genome Res. 10, 1435–1444 (2000).
75. Spielman, R. S., McGinnis, R. E. & Ewens, W. J. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52, 506–516 (1993).
    The first application of a family-based association test. The transmission disequilibrium test has been highly influential and spawned many related approaches.
76. Denham, M. C. & Whittaker, J. C. A Bayesian approach to disease gene location using allelic association. Biostatistics 4, 399–409 (2003).
77. Sham, P. C. & Curtis, D. An extended transmission/disequilibrium test (TDT) for multi-allele marker loci. Ann. Hum. Genet. 59, 323–336 (1995).
78. Paetkau, D., Calvert, W., Stirling, I. & Strobeck, C. Microsatellite analysis of population-structure in Canadian polar bears. Mol. Ecol. 4, 347–354 (1995).
79. Rannala, B. & Mountain, J. L. Detecting immigration by using multilocus genotypes. Proc. Natl Acad. Sci. USA 94, 9197–9201 (1997).
80. Sillanpaa, M. J., Kilpikari, R., Ripatti, S., Onkamo, P. & Uimari, P. Bayesian association mapping for quantitative traits in a mixture of two populations. Genet. Epidemiol. 21 (Suppl. 1), S692–S699 (2001).
81. Hoggart, C. J. et al. Control of confounding of genetic associations in stratified populations. Am. J. Hum. Genet. 72, 1492–1504 (2003).
82. Bodmer, W. F. Human genetics: the molecular challenge. Cold Spring Harb. Symp. Quant. Biol. 51, 1–13 (1986).
83. Lander, E. S. & Botstein, D. Mapping complex genetic traits in humans: new methods using a complete RFLP linkage map. Cold Spring Harb. Symp. Quant. Biol. 51, 49–62 (1986).
84. Dean, M. et al. Approaches to localizing disease genes as applied to cystic fibrosis. Nucleic Acids Res. 18, 345–350 (1990).
85. Hastbacka, J. et al. Linkage disequilibrium mapping in isolated founder populations: diastrophic dysplasia in Finland. Nature Genet. 2, 204–211 (1992).
86. Rannala, B. & Slatkin, M. Methods for multipoint disease mapping using linkage disequilibrium. Genet. Epidemiol. 19 (Suppl. 1), S71–S77 (2000).
    A comprehensive review of the various likelihood approximations used in linkage-disequilibrium gene mapping.
87. Rannala, B. & Reeve, J. P. High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome sequence. Am. J. Hum. Genet. 69, 159–178 (2001).
    The first use of the human genome sequence as an informative prior for Bayesian gene mapping.
88. Morris, A. P., Whittaker, J. C. & Balding, D. J. Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. Am. J. Hum. Genet. 70, 686–707 (2002).
89. Rannala, B. & Reeve, J. P. Joint Bayesian estimation of mutation location and age using linkage disequilibrium. Pac. Symp. Biocomput. 526–534 (2003).
90. Reeve, J. P. & Rannala, B. DMLE+: Bayesian linkage disequilibrium gene mapping. Bioinformatics 18, 894–895 (2002).
91. Liu, J. S., Sabatti, C., Teng, J., Keats, B. J. & Risch, N. Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Res. 11, 1716–1724 (2001).
92. Liu, J. S. Monte Carlo Methods for Scientific Computing (Springer, New York, 2001).
93. Pavlovic, V., Garg, A. & Kasif, S. A Bayesian framework for combining gene predictions. Bioinformatics 18, 19–27 (2002).
94. Jansen, R. et al. A Bayesian networks approach for predicting protein–protein interactions from genomic data. Science 302, 449–453 (2003).
95. Ross, S. M. Simulation, (Academic, New York, 1997).
96. Ripley, B. D. Stochastic Simulation (Wiley and Sons, New York, 1987).
97. Hudson, R. R. Gene genealogies and the coalescent process. Oxford Surveys Evol. Biol. 7, 1–44 (1990).
98. Metropolis, N., Rosenbluth, A. N., Rosenbluth, M. N., Teller, A. H. & Teller, E. Equations of state calculations by fast computing machine. J. Chem. Phys. 21, 1087–1091 (1953).
99. Hastings, W. K. Monte Carlo sampling methods using Markov chains and their application. Biometrika 57, 97–109 (1970).
100. Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A. & Feldman, M. W. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. 116, 1791–1798 (1999).
     The first paper to use an ABC approach to infer population-genetic parameters in a complicated demographic model.
101. Beaumont, M. A. Detecting population expansion and decline
102. Drummond, A. J., Nicholls, G. K., Rodrigo, A. G. & Solomon, W. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics 161, 1307–1320 (2002).
103. Pybus, O. G., Drummond, A. J., Nakano, T., Robertson, B. H. & Rambaut, A. The epidemiology and iatrogenic transmission of hepatitis C virus in Egypt: a Bayesian coalescent approach. Mol. Biol. Evol. 20, 381–387 (2003).
104. Beaumont, M. A. Estimation of population growth or decline in genetically monitored populations. Genetics 164, 1139–1160 (2003).
105. Elston, R. C. & Stewart, J. A general model for the analysis of pedigree data. Human Heredity 21, 523–542 (1971).
106. Lander, E. S. & Green, P. Construction of multilocus genetic linkage maps in humans. Proc. Natl Acad. Sci. USA 84, 2362–2367 (1987).
107. Kruglyak, L., Daly, M. J. & Lander, E. S. Rapid multipoint linkage analysis of recessive traits in nuclear families, including homozygosity mapping. Am. J. Hum. Gen. 56, 519–527 (1995).
108. Lange, K. & Sobel, E. A random walk method for computing genetic location scores. Am. J. Hum. Gen. 49, 1320–1334 (1991).
109. Thompson, E. A. in Computer Science and Statistics: Proceedings of the 23rd Symposium on the Interface (eds Keramidas, E. M. & Kaufman, S. M.) 321–328 (Interface Foundation of North America, Fairfax Station, Virginia, 1991).
110. Hoeschele, I. in Handbook of Statistical Genetics (ed. Balding, D. J.) 599–644 (John Wiley and Sons, New York, 2001).
     An extensive review of methods used to map quantitative trait loci in humans and other species.

Acknowledgements
We thank the four anonymous referees for their comments. Work on this paper was supported by grants from the Biotechnology and Biological Sciences Research Council and the Natural Environment Research Council to M.A.B., and by grants from the National Institutes of Health and the Canadian Institute of Health Research to B.R.

Competing interests statement
The authors declare that they have no competing financial interests.

Online links

DATABASES
The following terms in this article are linked online to:
OMIM: http://www.ncbi.nlm.nih.gov/Omim
cystic fibrosis | schizophrenia | type II diabetes

FURTHER INFORMATION
Bayesian haplotyping programs: http://www.stats.ox.ac.uk/mathgen/software.html; http://www-personal.umich.edu/~qin
Bayesian population genetics programs and links: http://evolve.zoo.ox.ac.uk/beast; http://www.maths.abdn.ac.uk/~ijw; http://www.rubic.rdg.ac.uk/~mab/software.html
Bayesian sequence analysis web sites: http://www.wadsworth.org/resnres/bioinfo; http://www.people.fas.harvard.edu/~junliu/index1.html#Computational_Biology
Detecting selection with comparative data, population genetic analysis: http://abacus.gene.ucl.ac.uk/ziheng/ziheng.html
DMLE+ LD Mapping Program: http://dmle.org
Genetic analysis software links (linkage analysis): http://linkage.rockefeller.edu/soft
Genetic Software Forum (discussion list): http://rannala.org/gsf
HapMap: http://www.hapmap.org
Human Gene Mutation Database: http://archive.uwcm.ac.uk/uwcm/mg/hgmd0.html
National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov
SNP discovery software: http://www.genome.wustl.edu/groups/informatics/software/polybayes/pages/main.html
Software for sequence annotation: http://opal.biology.gatech.edu/GeneMark
Structure program (Reference 27): http://pritch.bsd.uchicago.edu
Access to this interactive links box is free online.

News items

Appendix B. Professional activities / materials

SAB memberships

Scientific Advisory Board members, Marshfield Clinic Research Foundation, Personalized Medicine Research Project.

Scientific and Technical Advisory Board members, Omicia, Inc.

Dr. Gabor Marth
Department of Biology, Boston College
Chestnut Hill, MA 02467

July 5, 2008

Dear Gabor,

I am writing to ask if you would be willing to join a Scientific Advisory Committee (SAC) for the Ontario Institute for Cancer Research (OICR) program for participation in the International Cancer Genome Consortium (ICGC). The OICR ICGC Program (Director, John McPherson) includes elements from the Cancer Genomics Platform (Director, John McPherson), the Informatics and Biocomputing Platform (Director, Lincoln Stein) and the ICGC Secretariat (Tom Hudson, OICR Director and President). The main OICR ICGC program includes the collection and analysis of pancreatic tumors and matched controls and the ICGC Data Coordinating Centre (DCC).

The current dynamic nature of the DNA sequencing field necessitates that we set short-term procedural goals that mesh well with our long-term programmatic goals but remain adaptive to new technologies. Our operating plan includes: optimizing next generation sequencing platforms and applications, establishing the necessary supporting bioinformatics infrastructure, establishing the infrastructure for collecting pancreatic tumor samples and generating xenografts, establishing the ICGC DCC, and operating the ICGC secretariat. We have set out an operating plan for the next 12 months and are seeking input from experts in the fields of sequencing, genomics, bioinformatics and cancer biology to review our plans and provide insight and guidance in the future.

The current operating plan will be distributed shortly for SAC review and brief written comment. Conference calls will be scheduled to provide input with a frequency deemed appropriate by the SAC. The SAC will be invited to two meetings in Toronto, one in Fall 2008 and one in Spring 2009. Travel expenses and an honorarium of CDN$1500.00 will be provided for each of these on-site meetings.

On behalf of the OICR ICGC investigators, I hope that you will accept this invitation to join the SAC. We highly value your input and will benefit greatly from your guidance.

Sincerely,

John D. McPherson, Ph.D.

Gabor Marth

From: Elaine Mardis [[email protected]]
Sent: Monday, July 28, 2008 6:04 PM
To: Gabor Marth
Subject: for your consideration

Dear Gabor,

First, I wanted to say thanks to you and Michael and Michelle for such an awesome job of teaching the students this year at the course. I couldn't have done it without you guys-- it just added so much.

Second, I wanted you to consider something, as follows. As we all realize, the emergence of next-generation sequencing is rapidly changing the face of biological inquiry. While there is much promise to be realized, there are many pitfalls to be aware of, such that the common prediction that many small single-investigator laboratories will soon have next-generation sequencers simply will not be borne out. In addition, the complexity of bioinformatics required to intelligently utilize next-gen data poses significant challenges that only the most skilled laboratories will be able to handle. Nonetheless, demand for access to these platforms will remain high, especially for investigators that seek only the data and the ability to interpret it intelligently as a springboard to other biological inquiries, not the purchase of the actual instrumentation. In anticipation of these needs, I have been working with an existing supplier of molecular biology consumables, Edge BioSystems, to create a new entity named Edge BioServ. We feel this new company will offer a unique collaborative and client-oriented combination of next-generation sequencing access with novel bioinformatics-based analytical capability. As a part of this, I would like to gauge your interest in serving as a member of the Scientific Advisory Board. This would entail participation in quarterly meetings to be held at the Gaithersburg, MD site of Edge BioServ, and occasional teleconferences. Compensation for your services would include reimbursement for travel and hotel lodging and a quarterly retainer of $750. Membership is to be renewed yearly.

I look forward to hearing about your interest in participating at Edge BioServ. Please let me know if there are questions that I can answer about this SAB opportunity that will aid your decision. Thanks for considering this request.

e

Elaine R. Mardis, Ph.D.
Associate Professor in Genetics and Molecular Microbiology
Co-director, The Genome Center at Washington University
Washington University School of Medicine
4444 Forest Park Blvd.
St. Louis MO 63108
(314) 286-1805
[email protected]

Invitations to serve on NIH study sections

Gabor Marth

From: Nakamura, Ken (NIH/NHGRI) [E] [[email protected]]
Sent: Wednesday, April 09, 2008 1:54 PM
To: Marth, Gabor
Subject: Request from the National Human Genome Research Institute

Dear Dr. Marth,

I am in the Review Branch at the NHGRI and would ask you to consider participating on a review committee to evaluate applications responding to an RFA focused on "Development and Application of New Technologies to Targeted Genome-wide Resequencing in Well-Phenotyped Populations". The details of the RFA are at http://grants.nih.gov/grants/guide/rfa-files/RFA-HL-08-004.html , but as the title indicates the goal is to develop and validate approaches to low cost, high throughput resequencing methods that could be applied to large-scale medical sequencing targets. This includes development of target capture methods that are adaptable to a production sequencing pipeline employing new sequencing technologies such as 454, Illumina, ABSolid. The emphasis is on high-throughput and low cost, so those factors will be primary review criteria, but the application also needs to develop plans to fulfill bioinformatic needs including data capture and analysis.

We received about a dozen applications so this will certainly take less than a day to review. It would be a meeting here in the Washington DC area and I am currently exploring dates in June. A typical reviewer workload would be 3-4 applications. Dr. Maynard Olson, University of Washington, has agreed to chair the panel. We would clearly benefit from your experience and expertise, so I hope this is something that interests you and that you can fit into your undoubtedly busy schedule.

Thanks for giving this request your kind consideration,

Ken

***************************************************
Ken D. Nakamura, Ph.D.
Scientific Review Branch
National Human Genome Research Institute
National Institutes of Health
Phone: 301 402-0838
Fax: 301 435-1580
5635 Fishers Lane
Suite 4076 MSC 9306
Bethesda, MD 20892-9306 (regular mail)
Rockville, MD 20852 (courier/overnight service).

***************************************************

Electronic submission of NIH grant applications is here! Information about the transition timeline and submission procedures are available at http://era.nih.gov/ElectronicReceipt/.


Gabor Marth

From: Ward, Lucy (NIH/NIAID) [E] [[email protected]]
Sent: Thursday, August 28, 2008 2:14 PM
To: [email protected]
Cc: Ward, Lucy (NIH/NIAID) [E]; Walker-Abbey, Annie (NIH/NIAID) [E]
Subject: NIH Inquiry
Importance: High

Dear Dr. Marth,

This email is to serve as an invitation to assist the NIH in the review of contract proposals submitted in response to RFP-NIH-NIAID-DMID-AI2008-010, Genomic Sequencing Centers for Infectious Diseases (for details please click on title).

An assembled Special Emphasis Panel (SEP) of experts will meet for a one day face-to-face meeting in the Washington Metropolitan area in mid to late October (preferably either October 22nd, 23rd, or 24th depending on reviewer availability) to review the submitted proposals.

If you are interested and available to serve on this SEP, or if you are unavailable, please let me know as soon as possible by replying to this email or calling me at (301) 593-3385.

And I ‘Thank You’ for your time and consideration, and hope to hear favorably from you soon!

Thanks again - Lucy

Dr. Lucy A. Ward, DVM, PhD Division of Extramural Activities

National Institute of Allergy & Infectious Diseases

National Institutes of Health/DHHS

Room 3117, 6700-B Rockledge Drive MSC-7616 Bethesda, MD 20892 (express mail zip code 20817) phone: (301) 594-6635 fax: (301) 480-2408 email: [email protected]

***********************************************************************

Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. The National Institute of Allergy and Infectious Diseases (NIAID) shall not accept liability for any statements made that are the sender’s own and not expressly made on behalf of the NIAID by one of its representatives.

Gabor Marth

From: Fan, Ping (NIH/CSR) [E] [[email protected]] Sent: Wednesday, May 07, 2008 10:20 AM To: [email protected] Subject: NIH Review Invitation

Gabor Marth, Ph.D. Assistant Professor Department of Biology Higgins Hall, Room 415 Boston College 140 Commonwealth Avenue Chestnut Hill, MA 02467-3961

Dear Dr. Marth,

It was a great pleasure to see your name on projects for Genome Sciences, sequencing and computational tools, as well as the scientific publications that document your academic achievements. As scientific review officer of CSR/NIH, I would like to invite you to participate in the peer review process for an NIH Roadmap Project (Development of New Tools for Computational analysis of Human Microbiome Data). Your participation will help to guide the direction and assure the quality of research activities in the area of technology development related to biomedical research.

The review meeting will be held in Washington, DC on July 11, 2008. Please let me know if you are able to attend and would like more information.

Your help will be greatly appreciated not only by us at NIH but also by researchers, inventors, supporting personnel and patients in relevant fields and in many different countries.

I look forward to hearing from you.

With best regards

Sincerely

Ping Fan, M.D., Ph.D. Scientific Review Officer Instrumentation and Systems Development Center for Scientific Review, NIH 6701 Rockledge Drive, Room 5154, Bethesda, MD 20892-7840 Voice: 301-435-1740; Fax 301-480-4184

Gabor Marth

From: Charles, Vinod (NIH/NIMH) [E] [[email protected]] Sent: Monday, April 28, 2008 5:30 PM Cc: Charles, Vinod (NIH/NIMH) [E] Subject: RFA MH 08-040 METHODS OF STATISTICAL ANALYSIS OF DNA SEQUENCE DATA... (August 2008 Council)

Dear Reviewers,

My name is Vinod Charles and I am a Scientific Review Officer at NIMH overseeing the evaluation of “Genomic-related” grants for the coming June/July review cycle. I would like to formally invite you to consider being on a panel titled ZMH1 ERB-C-06, to review several grants received under RFA-MH-08-040, having to do with “Methods of Statistical Analysis of DNA Sequence Data”. This RFA will use the NIH Research Project Grant (R01) award mechanism and the description can be found here: RFA-MH-08-040.

In consultation with my colleague Dr. Thomas Lehner (Chief, NIMH Genomics Research Branch) and others here in the Extramural Office, your name was discussed as one whose expertise fit well with the mission of the RFA and one whose help in reviewing these applications would be highly desirable and appreciated. Your participation in the upcoming phone review would be welcomed and is tentatively being scheduled for mid- to late-June 2008 (subject to change). The entire review will take about 2 hours to complete via teleconference.

With regards to the RFA: This Funding Opportunity Announcement (FOA) will encourage the development of novel methods of statistical analysis of DNA sequence data in studies that aim to relate genetic variation to disease. Areas of interest include, but are not limited to, designing sequencing studies and statistical methods for relating the variation to phenotype; assessing the significance of the associations; incorporating population genetic factors such as population history, admixture, and natural selection; and finding sets of variants that may include functional variants.

The overall goal of this FOA is to support the development of statistical methods for designing sequencing studies, for analyzing the data to find associations with phenotypes, to narrow down and prioritize regions for further study, and to provide information on possible functional roles of the variation. Because of linkage disequilibrium (LD), statistical methods will generally allow the identification of sets of variants that possibly contain variants that affect function, without allowing the identification of the specific causal variants. This FOA supports the development of methods that statistically analyze the sequence data to identify sets of associated variants that contain functional variants, and to provide clues about function that will guide the choice of sets of variants for later functional studies and the types of functional studies to be done. There are several areas where such statistical analyses need to be developed.

I hope that you will consider helping out and will accept this invitation to take part in this grant review meeting. Please RSVP to me as soon as possible so that I can begin to organize the study section and finalize a date and time. Thank you once again for aiding in the peer-review process; your work, as always, is greatly appreciated.

Sincerely, Vin

Vinod Charles, Ph.D. Health Sciences Administrator Division of Extramural Activities National Institute of Mental Health NIH - Neuroscience Center 6001 Executive Blvd. Room 6151, MSC 9606 Bethesda, MD. 20892-9605 (301) 443-1606 [email protected]

Gabor Marth

From: Jeff Derge [[email protected]] Sent: Thursday, October 30, 2003 4:28 PM To: '[email protected]' Subject: MGC Review for NCI

Dear Dr. Marth

I am writing to request your participation in the review of proposals we expect to receive this week for the Mammalian Gene Collection (MGC) Project sponsored jointly by the NCI and NHGRI. This will be the second phase of competitive proposals, focused on completing the human and mouse collections of full open reading frame cDNA clones and additional organisms which have been added. The current status of the MGC can be viewed at their web site: http://mgc.nci.nih.gov/

SAIC is the prime operating contractor of the National Cancer Institute at Frederick and has taken a lead in managing several of the research subcontracts supporting the CGAP and MGC and other major research initiatives of the Office of Cancer Genomics. I work closely with Dr. Daniela Gerhard who is the lead person at the NCI for the MGC. You were recommended to me by Dr. Gerhard and Dr. Elise Feingold.

We have received sixteen proposals. An initial evaluation suggests that you should not have any direct conflicts of interest. If you are able to participate, I will send a complete list of offerors to be certain. We would like to complete this review in early December if at all possible.

If you are able to participate in this review, please respond, and I will send additional details and try to establish open dates.

Thank you very much for your consideration. I hope you will be able to help us continue with this important project. This is essentially the second phase of the project you helped review at its inception. If you have the opportunity, you would provide a welcome element of continuity for us.

Dr. Jeffery G. Derge SAIC Frederick PO Box B NCI-Frederick, Frederick, MD, 21702 [email protected] PH: 301-228-4018 FAX: 301-644-2049

Gabor Marth

From: Day, Camilla (NIH/CSR) [[email protected]] Sent: Tuesday, December 30, 2003 4:03 PM To: '[email protected]' Subject: your help?

Dr. Marth,

Would you be willing to be a reviewer at the winter meeting of the Genome Study Section? The dates are February 26-27th. As you may know, the Genome Study Section, which is run out of the Center for Scientific Review, reviews a high fraction of the genomics-related research grant applications to the NIH. Given the applications we have in, your expertise would clearly be VERY helpful.

Other members with computational interest/expertise who will be attending the meeting include: Tom Cassavant, Gary Chase, Jim Fickett, Michael Newton, Bruce Weir, Wing Wong, Laura Almasy, Christian Stockert, Laura Lazzeroni, and Eleanor Feingold.

I'd be happy to explain more before you decide.

Sincerely,

Camilla Day, Ph.D. SRA, Genome Study Section Center for Scientific Review, NIH V: 301-435-1037

Gabor Marth

From: Shannon Mondoux [[email protected]] Sent: Wednesday, July 04, 2007 2:24 PM To: [email protected] Subject: Genome Canada External Review of Hegele Progress Report Attachments: External Reviewers COI Conf and Terms of Service Forms.doc; Hegele- Summary Report- April 10 2007.doc

Dear Dr. Marth,

I am contacting you to ask if you would be willing to provide a written assessment as an external reviewer of the Progress Report submitted by the large-scale project led by Robert Hegele, entitled “Structural and Functional Annotation of the Human Genome for Disease Study”, which was awarded funding in Genome Canada’s Competition III and is now undergoing interim review.

Genome Canada is a not-for-profit corporation established in 2000 through funding from the federal government of Canada. Genome Canada’s principal objective is to support and coordinate large-scale genomics and proteomics research to enable Canada to become a world leader in selected sectors that are of strategic importance to this country, such as health, agriculture, environment, forestry, fisheries and GE3LS (ethical, environmental, economic, legal and social issues related to genomics). To date Genome Canada has funded 112 large-scale research projects and science & technology (S&T) platforms with a total investment of $1.2 billion (CDN) when combined with funding from other partners.

The projects funded in Competition III are undergoing an interim review of each project’s progress to date relative to the approved milestones in order to: i) evaluate a project’s progress to determine whether funding should be continued, reduced or cancelled; and ii) provide advice regarding alternative approaches and avenues to strengthen the project. A multidisciplinary Panel of international experts has been established to review the Progress Reports submitted. The Panel will review these reports, prepare a detailed evaluation of a project’s progress, and provide feedback and advice to Genome Canada’s Board of Directors. To assist the Panel members in their evaluation, written reports from external reviewers with expertise specific to each project, such as you, will be made available to them.

To assist you in making your decision to review, please find attached an executive summary of the project we’d like you to review. We would be pleased to provide $350 US as a token of our appreciation for your effort.

Progress Reports will be submitted to Genome Canada on July 13th and external reviewers will be given access to them and instructions for completing their review early in the week of July 16th. The deadline for receipt of your written assessment will be August 1, 2007 allowing you approximately three weeks to complete your review.

If you are willing to serve as an external reviewer, please complete the attached Confidentiality, Terms of Service and Conflict of Interest form. In order to complete the Conflict of Interest portion of the form, please read the description of the conflict of interest and refer to the executive summary which includes a list of project leader(s), co-investigators, collaborators and Science Advisory Board (SAB) members. If you collaborate with collaborators of this project or members of their SAB, this will likely not preclude you from reviewing this progress report, but we would like you to indicate all potential conflicts on the form. Please return the completed form to me either by email (you must have electronic signature for us to accept electronic forms) or fax them to my attention at +1 613 751 4474.

We appreciate your consideration of our invitation. If you have any questions please do not hesitate to contact me by email or at the number below.

Best wishes,

Shannon

Shannon Mondoux Data Manager and Programs Administrator Genome Canada 2100-150 Metcalfe Street Ottawa, Ontario K2P 1P1 Tel: (613) 751-4460 ext. 126 Fax: (613) 751-4474

Seminar / keynote talk invitations

Gabor Marth

From: Miguel Perez-Enciso [[email protected]] Sent: Wednesday, April 30, 2008 12:23 PM To: [email protected] Subject: Next Generation Sequencing symposium in Barcelona2009

Dear Dr Marth,

We are organizing a next generation sequencing symposium in Barcelona (Spain) on October 1-3, 2009 (next year). Please visit http://web.mac.com/sramosonsins/ICREA-NGS2009/Welcome.html for a very preliminary list of sessions. We would be extremely pleased if you could give an invited talk on polymorphism discovery with NGS. If you accept, as we hope, we will cover all your expenses, including travel (tourist class) plus housing and meals. The meeting will take place in a beautiful building in the centre of Barcelona. If you do not wish to accept, for whatever reason, we would appreciate it if you could provide us with some alternative names. Please feel free to contact me for further details. Thanks a lot on behalf of the organizing committee.

Miguel

======Miguel Perez-Enciso ICREA professor Dept. Ciencia Animal i dels Aliments Facultat de Veterinaria Universitat Autonoma de Barcelona 08193 Bellaterra, SPAIN Phone: +34 93 581 4225 Fax: +34 93 581 2106 [email protected] http://www.icrea.es/pag.asp?id=Miguel.Perez ======

Gabor Marth

From: Zhaohui Qin [[email protected]] Sent: Wednesday, June 11, 2008 3:08 PM To: [email protected] Cc: Jun Li; Robert H Lyons Subject: invitation to talk at Michigan

Dear Dr. Marth,

My name is Steve Qin, an Assistant Professor of Biostatistics at U of Michigan. I am interested in analyzing sequencing-related data. Like everybody, I am quite impressed by the work and software programs developed in your lab in this area.

Together with my colleagues Jun Li (Assistant Professor in Human Genetics) and Bob Lyons (Director of the University DNA Sequencing Core), we recently initiated a "Next-Generation Sequencing Technologies" seminar series as an opportunity for UM faculty and students to learn from national or international leaders in this rapidly advancing field. Because your work in analyzing next generation sequencing data is widely recognized, we are writing to invite you to visit Ann Arbor in the fall (September-December of 2008) and speak in this seminar series.

Currently, about 30 research groups in our university are applying (or planning to apply) the new sequencing methods in their own research. This seminar series has the support of the Department of Human Genetics, the Center for Genetics in Health and Medicine, and other interdepartmental units such as the Center for Computational Medicine and Biology, the Bioinformatics Graduate Program and the Center for Statistical Genetics. It is expected to attract a large audience.

The seminar will be held on a Monday afternoon. The typical schedule is for you to arrive on Sunday (or no later than midday on Monday) and stay till Tuesday afternoon. We will cover the entire cost of your visit and provide a modest honorarium. We hope you will accept this invitation and let us know the week(s) that you will be available. If your fall schedule precludes a Monday visit we have the option to move your talk to Wednesday or Friday by coordinating with other seminar series in the Center for Computational Medicine and Biology or the Center for Statistical Genetics.

With best wishes,

Steve Qin ([email protected], 734-763-5965) Jun Li ([email protected], 734-615-5754) Bob Lyons ([email protected], 734-764-8531)

Gabor Marth

From: Wortman, Jennifer [[email protected]] Sent: Sunday, April 27, 2008 11:07 PM To: [email protected] Subject: Invitation to IGS

Hi Gabor –

How are you? I had the opportunity to be on a study section with Mark last month, and your name came up in conversation, prompting me to drop you a line.

I’ve been at the Institute for Genome Sciences at the University of Maryland, Baltimore since October, helping Claire get set up and trying to find my way as a new faculty member. Now that our sequencing facility is up and running ‐‐ our Sanger and 454 machines are fully functional and we are expecting a Solexa machine in the next few weeks ‐‐ we are very interested in exploring the best technologies to support assembly and resequencing. I was hoping I could convince you to come visit and give a seminar to highlight your recent work. Please let me know if you would be interested in a trip to scenic Baltimore and what timeframe would be convenient.

I hope we have the opportunity to catch up soon.

Best, Jennifer

Jennifer Russo Wortman Assistant Professor Department of Medicine Institute for Genome Sciences University of Maryland School of Medicine 685 West Baltimore St. HSF‐I, Room 144 Baltimore, MD 21201 410‐706‐6784 [email protected]

Gabor Marth

From: Angela Frederick Amar [[email protected]] Sent: Wednesday, August 13, 2008 6:11 PM To: [email protected] Subject: opportunity in School of Nursing

Hi Gabor,

We would love to have you speak about your research to faculty and grad students in the nursing school. Our Brown Bag lunch series is held on Tuesdays from 12-1. Lunch is provided. I'm hoping you can present on either Jan 27, Feb 10, or April 7. You would present on your program of research for about 35-40 minutes and we'd use the rest of the time for questions. Our purpose is to highlight research and to facilitate potential collaborative efforts. Your area of research and funding mechanism is one that would be of interest to our faculty and students. I'm hoping you are available on one of the dates. If you have any further questions, don't hesitate to ask. I hope you are enjoying your summer.

Best and I look forward to hearing from you, Angela

Angela Frederick Amar, PhD, RN Assistant Professor Boston College William F. Connell School of Nursing 140 Commonwealth Avenue Chestnut Hill, MA 02467 617-552-0180

Gabor Marth

From: Xiaole Shirley Liu [[email protected]] on behalf of Xiaole Shirley Liu [[email protected]] Sent: Thursday, July 31, 2008 2:18 AM To: [email protected] Subject: Invitation

Dear Dr. Marth,

On behalf of the organizing committee, I am writing to invite you to be a panelist for our conference on Emerging Quantitative Issues in Parallel sequencing (http://www.hsph.harvard.edu/research/pqg-annual-conference/index.html). Organized by Harvard School of Public Health, Harvard Medical School, and Dana-Farber Cancer Institute, the conference will be in the New Research Building on 9/23-9/25/08. It focuses on the following three areas: I. Genomic and meta-genomic sequencing II. Phenotypes and populations III. Transcriptome and transcription regulation

There will be a panel discussion in the afternoon of 9/24 comparing the features of different sequencing technologies (i.e. 454, SOLiD, Solexa, and Helicos). The panel includes one technical representative from each company and two academic scientists (one bio and one informatics). Your pioneering work on sequencing informatics and first-hand experience with the unique characteristics of each platform make you the perfect fit for this panel. Your expert opinion and participation at this conference will greatly benefit the community of parallel sequencing technology users.

We thank you for considering our invitation and hope you can inform us of your decision within a week.

P.S. We would also appreciate your help in spreading the word about the conference, as registration and abstract submission are now open.

Xiaole Shirley Liu Associate Professor CLS-11022, 44 Binney St, Boston 02115 Dept of Biostats and Comp Bio Dana-Farber Cancer Institute Harvard School of Public Health Tel: (617) 632-2472

Gabor Marth

From: [email protected] Sent: Wednesday, January 30, 2008 4:12 PM To: [email protected] Subject: Speaker Invitation in Epigenomics & Sequencing -2008 Meeting at the Harvard Medical School on July 14-15, 2008

Dear Dr. Gabor Marth:

Upon the successful inauguration of the First international “Epigenomics & Sequencing 2007 Meeting” on ‘Chromatin Methylation to Disease Biology & Theranostics’, we are organizing the Second international “Epigenomics & Sequencing 2008 Meeting” on ‘Chromatin Methylation to Disease Biology & Theranostics’ on July 14-15, 2008 at The Conference Center at Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA.

We are pleased to invite you as a speaker to this interesting conference on July 14-15 2008 and present your work on sequencing.

We are pleased to inform you that we have gathered some excellent young and accomplished scientists as other speakers in this meeting. Keeping our motto “Bridging Academia and Industry” we try to bring world renowned experts from academia, biotech and large pharmaceutical industry for the benefit of educating the researchers in the cutting-edge technologies. For details about our past meetings please visit website www.expressgenes.com

We will be planning to cover topics such as: Mechanisms of Chromatin in gene regulation; Nuclear dynamics and Methylation Assays; Parental imprinting and Histone Deacetylation inhibitors as drugs; Epigenetic re-programming in stem cells; Cutting-edge sequencing technology; Epigenome Sequencing; Epigenetic regulatory processes in diseases & environment; PharmacoEpigenomics.

Scientific Organizing Committee: Krishnarao Appasani, PhD., MBA (Chair) Founder & CEO, GeneExpression Systems, Inc. Waltham, MA USA Shuji Ogino, M.D. Ph.D. Assistant Professor of Pathology, Brigham & Women’s Hospital, Harvard Medical School Mukesh Verma, PhD. Program Director of Epidemiology and Genetics Program National Cancer Institute, National Institutes of Health, Bethesda, MD Laurie Jackson-Grusby, Ph.D. Assistant Professor of Pathology, Children's Hospital Boston, MA, USA

We have a track record of bringing Nobel laureates and National Academy members to our focused theme conferences.

GeneExpression Systems Previous Lifetime Achievement Awardees
Eric Kandel (2004) 2002 Nobel Laureate-Columbia University
Marshall Nirenberg (2005) 1968 Nobel Laureate-NIH
Sidney Altman (2005) 1987 Nobel Laureate-
Paul Berg (2006) 1980 Nobel Laureate-Stanford University
Richard Roberts (2006) 1993 Nobel Laureate-New England Biolabs
Earl Stadtman (2006) Pioneer in Enzymology-National Inst. of Health
Tim Hunt (2006) 2001 Nobel Laureate-London Research Inst., UK
Irving L. Weissman (2007) Director of Stanford Stem Cell Biology Institute
Alexander Rich (2007) Sedwick Professor, Massachusetts Institute of Technology
Dudley Herschbach (2007) 1986 Nobel Laureate-Harvard University
Peter Mansfield (2007) 2003 Nobel Laureate-University of Nottingham, UK

GeneExpression Systems Previous microRNAs & RNAi Innovator Awardees
Craig Mello (2004) 2006 Nobel Laureate-U Massachusetts Med School
Gary Ruvkun (2004) Harvard Medical School
Ronald Plasterk (2005) Hubrecht Lab, Netherlands
David Bartel (2005) Massachusetts Institute of Technology
Richard Jorgensen (2006) University of Arizona
Julie Ahringer (2006) University of Cambridge, UK
Eric Miska (2006) University of Cambridge, UK
Elizabeth Blackburn (2007) 2006 Lasker Award Winner-Univ. of California-San Francisco
Phillip Zamore (2007) U Massachusetts Med School

I am sure this advance notice will allow you to mark your calendar. Please look into your schedule and let us know.

Thanking you in advance, -Krishna ______Krishnarao Appasani, PhD., MBA. Founder & CEO GeneExpression Systems, Inc. P.O. Box 540170 Waltham, Massachusetts 02454-0170 USA Tel: 781-891-8181; Fax: 781-891-8234 E-mail Personal: [email protected] E-mail: [email protected] Internet: www.expressgenes.com

Goals of the Meeting: Our focused meeting is for a group of 100-200 people, and our intention is to educate them in newly emerging scientific disciplines. Before you make your decision, we would like to explain that our goal is to bring academicians and industry leaders onto one platform to foster scientific and business collaborations by sharing ideas. We would like to be a catalyst like the Keystone, Gordon and Cold Spring Conferences. Most meetings of the above organizations are for academic scientists. However, the trend is changing: good science is coming from industry labs too. Therefore, we would like to create a dialogue between these scientific leaders, and GeneExpression Systems wants to bridge the gap between these two schools of thought leaders.

About the Chief Organizer: Krishnarao Appasani: Dr. Appasani is presently the CEO of GeneExpression Systems, Inc., Waltham, MA, USA, and a visiting scientist at Harvard Medical School. Prior to that, he worked at PerkinElmer Life Sciences and Carl Zeiss Imaging, Inc., and was also a member of the faculty of Harvard Medical School. After his PhD in 1986 from Banarus Hindu University, India, he did postdoctoral research at Tufts Medical School with Edward Goldberg, then at MIT with Nobel laureate H. Gobind Khorana. Dr. Appasani has edited “Perspectives in Gene Expression”, a technical book for Eaton Publishing Company, with a foreword by Nobel laureate Dr. Phillip A. Sharp of MIT. He also edited a book for Cambridge University Press on “RNA interference: From Basic Science to Drug Development,” for which RNAi co-discoverer & 2006 Nobel Laureate Dr. Andrew Fire and 1968 Nobel Laureate Marshall Nirenberg of NIH wrote forewords. A book from Springer Press on “Bioarrays: From Basic Science to Diagnostics” (for which Sir Edwin Southern wrote the foreword) will be released in the late spring. Additionally, another one from Cambridge University Press on “MicroRNAs: From Basic Science to Disease Biology” (for which microRNAs discoverer Dr. Victor Ambros of Dartmouth and 1989 Nobel Laureate Dr. Sidney Altman of Yale wrote forewords) was launched in January 2008.

Gabor Marth

From: [email protected] Sent: Wednesday, December 12, 2007 10:12 AM To: [email protected] Subject: Personal Genomes: Technology, Interpretation, and Challenges Importance: High Attachments: program_v5.pdf; Personal_07_12_11.pdf

Dear Dr. Marth:

I am sorry that, for various reasons, this invitation is coming to you at very short notice. We would be very grateful if you could let us know as soon as possible whether or not you will be able to participate.

With best wishes Jan

We are writing to invite you to be a speaker at the meeting "Personal Genomes: Technology, Interpretation, and Challenges," being organized by Richard Gibbs (Baylor College of Medicine, Houston), Mary-Claire King (University of Washington, Seattle), Maynard Olson (University of Washington, Seattle) and Lincoln Stein (Cold Spring Harbor Laboratory).

This is part of a series of meetings on topics of special interest to Jim Watson, which he feels are in urgent need of review.

The meeting will begin in the evening of Monday March 3 and finish at lunchtime on Thursday March 6, 2008. It will be an open meeting, held at the National Academy of Sciences Beckman Center, Irvine. (Not at the Banbury Center.) We pay for the costs of invited speakers - travel, and board and lodging. We hope to raise additional funds so as to be able to waive the registration fee for attendees.

The meeting is being held both to celebrate and to critically examine a significant milestone in human genetics-the first "personal genomes." These ultra high throughput sequencing strategies are used in a very limited number of laboratories and few scientists, and even fewer clinical geneticists, are familiar with the implications of the "$1000" genome. We believe that a meeting which reviews these topics will be very attractive to a range of scientists including biologists, geneticists, and biomedical researchers. There will be five sessions, each session having five presentations. Each presentation will have a talk of 30 minutes followed by 10 minutes discussion. In addition, there will be evening sessions on special topics. We attach a draft outline and a list of invited speakers, both of which, needless to say, are subject to change.

With best wishes Jan

--

Jan A. Witkowski, Ph.D. Executive Director, Banbury Center

Cold Spring Harbor Laboratory PO Box 534, Cold Spring Harbor NY 11724 ph. (516) 367-8398; fax (516) 367-5106

Professor, Watson School of Biological Sciences Editor-in-Chief "Trends in Biochemical Sciences" Home page: http://www.cshl.edu/banbury/witkowski.html Banbury Center: http://www.cshl.edu/banbury

NEW 3rd EDITION: "Recombinant DNA: Genes and Genomes - A Short Course" James D. Watson. Amy A. Caudy; Richard M. Myers & Jan A. Witkowski http://www.whfreeman.com/college/book.asp?disc=&id_product=2001002416 http://www.amazon.com/Recombinant-DNA-Genes-Genomics-Course/dp/0716728664/

Gabor Marth

From: MaryAnn Brown [[email protected]] Sent: Friday, June 20, 2008 4:02 PM To: [email protected] Cc: 'Lisa Mooradian' Subject: Keynote Invitation - Next-Generation Sequencing Data Analysis Attachments: SEQ SDA Prelim.pdf

Hello Gabor,

This is to follow up on my voice mail message earlier today.

I want to take the opportunity to reintroduce myself and the upcoming Next‐Generation Sequencing Data Analysis meeting this September 21‐23 in Providence, RI. You spoke at a similar meeting I organized last year. We have added this new meeting as it is a great complement to the Four Day Data‐Driven Discovery Summit, which includes meetings on:

Next‐Generation Sequencing Data Analysis
Exploring Next‐Generation Sequencing
Multiplexed Genomics Tools
Integrative Data Analysis

Attached please find the confidential working draft agenda for the NGS grouping. I would like to invite you to give the Kick‐off Keynote at the Next‐Generation Data Analysis meeting on Monday morning, September 22, addressing the data management, storage, analysis, and interpretation needs of the next‐generation sequencing data deluge. I know that you have developed a new software program, but from the Keynote Speaker the audience would like to hear about the needs and how you went about solving them, not specifically the software itself.

If you are able to participate, I will need the following information by Friday, June 27:

1) Title of your presentation
2) Brief 3‐5 sentence summary
3) Complete contact information
4) Picture of yourself 300 dpi in a jpg or tif format

As an invited speaker, CHI is pleased to cover local transportation and 2 nights hotel.

If you have any questions – please do not hesitate to contact me. I look forward to working with you again!

Best regards,

Mary Ann

P.S. We are preparing to send the brochure off to the printers – I know that this is short notice but if you are at all interested and available we would like to include you as a confirmed speaker. If not, I do need to know right away as an alternative speaker must be secured.

Mary Ann Brown Executive Director Conferences Cambridge Healthtech Institute 250 First Avenue, Suite 300 Needham, MA 02494 T: 781‐972‐5497 E: [email protected]

You've Sequenced the Genome Now use IT!

Gabor Marth

From: Michael C. Zody [[email protected]] Sent: Wednesday, May 23, 2007 3:03 PM To: [email protected] Subject: visiting the Broad

Gabor,

After talking about it for a while, it seemed I should actually invite you to come over to the Broad and give a talk and meet with folks. I'll be away the first 2 weeks in June, but around most of the summer after that. Is there some time later in the summer (say late June or anytime in July) that would work well for you? It would be great if you could come over and give an extended version of your Marco/CSH talk and then have some time to talk with our tech dev and assembly/SNP detection guys for half a day or so.

best,

Mike

Gabor Marth

From: [email protected] on behalf of Lenore Cowen [[email protected]] Sent: Tuesday, August 22, 2006 4:27 PM To: [email protected] Cc: [email protected] Subject: Invitation to speak at Tufts

Dear Dr. Marth,

I am writing to invite you to come speak in the new weekly Bioinformatics and Computational Biology colloquium that is starting this Fall at Tufts. Our colloquium will be meeting from 5-6pm on Wednesdays (so late so that people from the medical school campus and from other area universities and industry can come) on the main (Medford) campus. For a possible date, let me suggest Wednesday, September 20th.

Please let me know if you can come and if the date would work for you. Because we got $$ from the Engineering dean, I can offer you a $200 honorarium. We can also take you out to dinner afterwards, if that works with your schedule.

I'd very much look forward to meeting you,

Best, Lenore

------

Lenore J. Cowen Associate Professor Department of Computer Science Tufts University http://www.cs.tufts.edu/~cowen/compbio

------

Gabor Marth

From: Francis Ouellette [[email protected]] Sent: Friday, June 11, 2004 9:54 AM To: [email protected] Cc: VanBUG dev group Subject: [Fwd: VanBUG talk in Vancouver]

Dear Gabor,

How have you been? How is Boston treating you? We would like to invite you to present at the VanBUG seminar series. VanBUG is the Vancouver Bioinformatics User Group (http://vanbug.org) and we present bioinformatics talks (we highlight local and outside of BC and outside of Canada speakers).

We would like you to kick off our 2004-2005 series and give the first talk, which is in the evening of September 9th. We have other dates in the year that we can offer if this Sept date doesn't work for you.

As you will be able to see from the vanbug.org web site, the themes vary, and all speakers leave their PowerPoint slides (in the spirit of open access). This coming year (our third year of VanBUG) we are adding, before each highlighted talk, a 10 min presentation from our local bioinformatics trainees (from the UBC/SFU CIHR program), and, as well, we plan to Web Cast all of the talks.

In practice, our speakers are free to talk about any aspect of their work, but we particularly appreciate a "nuts and bolts" kind of talk that highlights some of the bioinformatics challenges and discoveries that have been made recently. The audience is sophisticated with multiple backgrounds, and from diverse (industry/academic and government) environments. We usually draw in 100-200 people on that Thursday evening (talks are from 6-7) and are followed by an hour of beer and pizza (but we usually take the speaker to a nice place for dinner after that :).

Please let me know when and if you can make this date (you will need a very good reason not to make this date :-). We will put you up at the Sheraton Wall Centre (downtown Vancouver) and we obviously pay for your accommodation and travel and we will organize your visit here in Vancouver, and arrange it so you can see all the people you want to see.

Hope to hear from you shortly, all the best, f., on behalf of the VanBUG dev group:

Stefanie Butland Ryan Brinkman Francis Ouellette Stephen Montgomery

-- BF Francis Ouellette http://bioinformatics.ubc.ca/ouellette

Invitations for editorial duties

Gabor Marth

From: [email protected] Sent: Tuesday, June 07, 2005 11:25 AM To: [email protected] Subject: PLoS Computational Biology manuscript 05-PLCB-RA-0112

**********************************************************************
DO NOT REPLY TO THIS EMAIL – USE THE LINKS TO ACCEPT OR DECLINE
***********************************************************************

Dear Gabor,

A manuscript entitled "SNPdetector: a Software Tool for Sensitive and Accurate Detection of Single Nucleotide Polymorphisms in fluorescence-based resequencing" has been submitted to PLoS Computational Biology; the corresponding author is Dr. Zhang. A copy of the abstract is provided below.

I hope you will be able to act as the Associate Editor for this manuscript. Your name was suggested to us. Note that PLoS Computational Biology does not publish papers just on tools, but biological outcomes from those tools. Feel free to recommend rejection on those grounds or assign reviewers as you see fit.

Within the next 24 hours if possible, please use the following links to accept or to decline this assignment.

To accept, please press the URL below:

To decline, please press the URL below:

When you accept, we ask you first to evaluate the suitability of the manuscript for PLoS Computational Biology within the following 1-2 days. If you view the paper favorably, please be prepared to assign 6 reviewers. If you do not view the paper favorably, we ask that you draft a decision letter explaining briefly the reason for rejecting the paper. I will view and comment as appropriate and forward the decision letter with both our signatures to the authors.

Please don't hesitate to contact me should you have any questions.

Thank you,

Philip Bourne EIC PLoS Computational Biology

Manuscript Title:

SNPdetector: a Software Tool for Sensitive and Accurate Detection of Single Nucleotide Polymorphisms in fluorescence-based resequencing

Authors: Jinghui Zhang (National Cancer Institute); David Wheeler (Baylor College of Medicine); Imtiaz Yakub (Baylor College of Medicine); Sharon Wei (Baylor College of Medicine); Raman Sood (National Human Genome Research Institute); William Rowe (National Cancer Institute/NIH); Paul Liu (NIH); Richard Gibbs (Baylor College of Medicine); Kenneth Buetow (National Cancer Institute/NIH)

Abstract: Identification of Single Nucleotide Polymorphisms (SNPs) and mutations is important for the discovery of genetic predisposition to complex diseases. PCR re-sequencing is the method of choice for de novo SNP discovery. However, manual data analysis has been a major bottleneck for its application in high-throughput screening due to lack of a sensitive and accurate computational method for automated SNP detection. We developed a software tool, SNPdetector, for automated identification of SNPs and mutations by fluorescence-based resequencing. SNPdetector was designed to model the process of human visual inspection and has a very low false positive and false negative rate. We demonstrate the superior performance of SNPdetector in SNP and mutation analysis by comparing its results with those derived by human inspection, a popular SNP detection tool polyphred, and independent genotype assays in three large-scale investigations. The first study identifies and validates inter- and intra-subspecies variations in 4,650 traces of 25 inbred mouse strains that belong to either the Mus musculus species or the Mus spretus species. Unexpected heterozygosity in the wild-derived inbred strain Cast/Ei was found in two out of 1,167 mouse SNPs. The second study identifies novel SNPs in 133,440 traces in four ENCODE regions of the human genome (ENCODE Consortium, 2004) that were subsequently validated by independent genotyping. The third study detects ENU-induced mutations (at 0.04% allele frequency) in 64,896 traces of 1,236 zebra fish. The three large and diverse test data sets analyzed in this study demonstrate that SNPdetector is an effective tool not only for genome-scale research investigation but also for clinical studies involving large patient sample sets. SNPdetector runs on Unix/Linux platform and is available publicly (http://lpg.nci.nih.gov).

To view the entire manuscript, please press the following URL:

Invitations to review journal articles

Gabor Marth

From: [email protected] Sent: Thursday, September 11, 2008 2:14 AM To: [email protected] Subject: Request to review from PLoS Genetics: manuscript 08-PLGE-PI-1199

Dear Dr. Marth,

I am writing to ask if you can kindly review 'The Power of Resequencing Studies: A GroupWise Association Test for Rare Disease Susceptibility Mutations' by Bo Madsen and Sharon Browning, submitted for publication in PLoS Genetics. A copy of the abstract is appended below.

**************************************************************************
Please use the links to ACCEPT or DECLINE

Click to ACCEPT this assignment: http://genetics.plosjms.org/cgi-bin/main.plex?el=A5Bf4XiW2A4LzC7D1A987gSbSqPfO36jiuEp0xU5wZ

Click to DECLINE this assignment and suggest alternative reviewers: http://genetics.plosjms.org/cgi-bin/main.plex?el=A3Bf6XiW7A4LzC7E5A987gSbSqPfO36jiuEp0xU5wZ

**************************************************************************

Upon clicking on the agreement link, you will be directed to reviewers' materials and the manuscript. We aim to have reviews returned to us within 10 days, although you should feel free to contact the journal office at [email protected] if you will need additional time or if you are delayed.

If you have any conflicts of interest that preclude an objective evaluation or are otherwise unable to review this manuscript at this time, you should click on the decline link above. You will then be prompted for suggestions of other potential reviewers who would be qualified to assess this manuscript, which we would greatly appreciate.

In the spirit of PLoS Biology, with open access and high-quality papers, PLoS Genetics reflects the full breadth and interdisciplinary nature of genetics and genomics research by publishing outstanding original contributions from all areas of biology. Please visit http://www.plosgenetics.org for more information about PLoS Genetics.

Many thanks,

Nicholas Schork Associate Editor PLoS Genetics

********************************Confidential********************************


Manuscript Title: The Power of Resequencing Studies: A GroupWise Association Test for Rare Disease Susceptibility Mutations

Authors: Bo Madsen (University of Aarhus) Sharon Browning (University of Auckland)

Abstract: Resequencing is an emerging tool for identification of rare disease-associated mutations. Rare mutations are difficult to tag with SNP genotyping, as genotyping studies are designed to detect common variants. However, studies have shown that genetic heterogeneity is a probable scenario for many common diseases, in which multiple rare mutations together explain a large proportion of the genetic basis for the disease. We thus propose a weighted-sum method to jointly analyse a group of mutations, in order to test for groupwise association with disease status. Such a group of mutations may result from resequencing a gene, for example. We compare the proposed weighted-sum method to alternative methods, and show that it is powerful to identify disease-associated genes, both on simulated and Encode data. Using the weighted-sum method, a resequencing study can identify a disease-associated gene with an overall Population Attributable Risk (PAR) of 2%, (Odds Ratio: 1.2) even when each individual mutation has much lower PAR; using 1000 to 7000 affected and unaffected individuals, depending on the underlying genetic model. This study thus demonstrates that resequencing studies can identify important genetic associations, provided that specialised analysis methods, such as the weighted-sum method, are used.

****************************************************************************
If you have any questions, feel free to contact Catriona Silvey at [email protected]; visit PLoS Genetics online: http://www.plosgenetics.org/; Email Alerts: http://register.plos.org/

This e-mail is confidential to the intended recipient. If you have received it in error, please notify the sender and delete it from your system. Any unauthorized use, disclosure, or copying is not permitted. The views or opinions presented are solely those of the sender and do not necessarily represent those of the Public Library of Science unless otherwise specifically stated. Please note that neither the Public Library of Science nor any of its agents accept any responsibility for any viruses that may be contained in this e-mail or its attachments and it is your responsibility to scan the e-mail and attachments (if any).

Gabor Marth

From: [email protected] Sent: Monday, June 30, 2008 11:27 AM To: [email protected] Subject: Nature Methods Review Request - manuscript NMETH-A05981

A manuscript has been submitted to Nature Methods, which I was hoping you would be interested in reviewing. The manuscript is an Article entitled "Large-scale enrichment and discovery of gene-associated SNPs" and comes from Michael Gore and colleagues. Its abstract is pasted below.

If you are willing to comment on this manuscript, I would like to receive your report within 14 days.

If you are unable to review the manuscript for any reason, suggestions of other highly qualified referees would be greatly appreciated.

If you need any further information, please do not hesitate to contact me.

Thank you in advance for your help and I look forward to hearing from you soon.

Yours sincerely,

Michelle Pflumm, Ph.D. Assistant Editor Nature Methods

Large-scale enrichment and discovery of gene-associated SNPs

Michael Gore, Mark Wright, Elhan Ersoz, Pascal Bouffard, Edward Szekeres, Thomas Jarvie, Bonnie Hurwitz, Apurva Narechania, Tim Harkins, George Grills, Doreen Ware, and Edward Buckler

To conduct genome-wide association studies in diverse maize, several million single nucleotide polymorphism (SNP) markers primarily concentrated within the genic and low copy regions of the maize genome are needed. High throughput, DNA sequencing technologies are potentially powerful for sequencing the hundreds of millions of bases needed for SNP discovery at this scale. To that end, we constructed gene-enriched HpaII genomic libraries for two maize inbred lines, and sequenced them using massively parallel pyrosequencing. A novel SNP calling pipeline was developed that dramatically reduced the number of false positive SNPs by identifying and preventing SNP calls from alignments of non-orthologous sequences. With this pipeline, 108,269 putative SNPs were identified between the B73 and Mo17 inbred lines at an estimated false discovery rate of 11.9%. Sanger sequencing of B73 and Mo17 amplicons successfully validated 91% (600/659) of a subset of these putative SNPs. These results show that this approach has wide applicability for efficiently and accurately detecting gene-associated SNPs in large, complex plant genomes.


This email has been sent through the NPG Manuscript Tracking System NY-610A-NPG&MTS.

Gabor Marth

From: [email protected] Sent: Monday, December 18, 2006 1:06 PM To: [email protected] Subject: Nature review request - manuscript 2006-11-12549A

Dear Dr. Marth,

A manuscript has been submitted to Nature, which we were hoping you would be interested in reviewing [and you may already be somewhat familiar with it]. The manuscript comes from David Reich et al., and is entitled "Genomic analysis reveals a more intense bottleneck in Asian than European demographic history". Its first paragraph is pasted below.

Is this a paper that you would be able to review for us by 29 December? If so, please let me know as soon as possible, and I will send instructions to you on how to access the manuscript. Failing that, it would be helpful to us if you could suggest alternative referees.

Many thanks in advance for your help; I look forward to hearing from you.

Best wishes, Chris

Chris Gunter, PhD Senior Editor Nature 75 Varick St 9th Floor New York,NY 10013-1917 Tel: 212 726 9200 Fax: 212 696 9006

Genomic analysis reveals a more intense bottleneck in Asian than European demographic history

David Reich, Alon Keinan, James Mullikin, and Nick Patterson

Advances in molecular genetics have permitted the collection of large scale data sets on genetic variation, but inferences about human history have been compromised by biases in the ways markers were chosen. Here we identify large subsets of SNPs from the International Haplotype Map that are essentially free of ascertainment bias, and make these data sets available for studying history and natural selection. Analyzing these data, we show that East Asians and Europeans shared the same bottleneck expanding out of Africa, but that in both populations there was also a more recent bottleneck, coincident with the Last Glacial Maximum ~20,000 years ago. This second bottleneck was stronger in East Asians than Europeans, a novel discovery about the history of these populations.

Please note that your contact details are being held on our editorial database which is used only for this journal's management of the peer review process. If you would prefer us not to contact you in the future please let us know by emailing [email protected].

This email has been sent through the NPG Manuscript Tracking System NY-610A-NPG&MTS

Gabor Marth

From: [email protected] on behalf of JMB Editors [[email protected]] Sent: Saturday, December 29, 2007 5:43 PM To: [email protected] Subject: Manuscript JMB91 for review

Dear Professor Gabor Marth,

In view of your expertise I would be very grateful if you could review the following manuscript which has been submitted to Journal of Mathematical Biology.

Manuscript Number: JMB91

Title: A Stochastic Model for Estimation of Mutation Rates in Multiple-replication Proliferation Processes

Author(s): Dr. Xiaoping Xiong, James M Boyett, Ph.D.; Robert G Webster, Ph.D.; Juergen Stech, Ph.D.

Abstract: In this paper we propose a stochastic model based on the branching process for estimation and comparison of the mutation rates in proliferation processes of cells or microbes. We assume in this model that cells or microbes (the elements of a population) are reproduced by generations and thus the model is more suitably applicable to situations in which the new elements in a population are produced by older elements from the previous generation rather than by newly created elements from the same current generation. Cells and bacteria proliferate by binary replication, whereas the RNA viruses proliferate by multiple replication. The model is in terms of multiple replications, which includes the special case of binary replication. We propose statistical procedures for estimation and comparison of the mutation rates from data of multiple cultures with divergent culture sizes. The mutation rate is defined as the probability of mutation per replication per genome and thus can be assumed constant in the entire proliferation process. We derive the number of cultures for planning experiments to achieve desired accuracy for estimation or desired statistical power for comparing the mutation rates of two strains of microbes. We establish the efficiency of the proposed method by demonstrating how the estimation of mutation rates would be affected when the culture sizes were assumed similar but actually diverge.

In case you accept to review this submission please click on this link: http://jomb.edmgr.com/l.asp?i=1373&l=5D32ISBL

If you do not have time to do this, or do not feel qualified, please click on this link: http://jomb.edmgr.com/l.asp?i=1372&l=52QWLPJK

We hope you are willing to review the manuscript. If so, would you be so kind as to return your review to us within 45 days of agreeing to review? Thank you.

You are requested to submit your review online by using the Editorial Manager system which can be found at: http://jomb.edmgr.com/. Your username is: GMarth-799 and your password is: marth.

IN ORDER TO KEEP DELAYS TO A MINIMUM, PLEASE ACCEPT OR DECLINE THIS ASSIGNMENT ONLINE AS SOON AS POSSIBLE!

If you have any questions, please do not hesitate to contact us. We appreciate your assistance.

With kind regards,

Peter G. Clote, Ph.D, D.Sc.
Journal of Mathematical Biology

Gabor Marth

From: [email protected] Sent: Friday, August 05, 2005 5:14 PM To: [email protected] Cc: [email protected] Subject: Human Mutation Review Request - humu-2005-0405

Dear Dr. Marth:

The following paper has been submitted to Human Mutation for consideration as a Research Article:

Title: Non-Synonymous SNPs: Validation Characteristics, Derived Allele Frequency Patterns, And Evidence For Natural Selection

Corresponding author: Dr. David Fredman

Contributing authors: 1) David Fredman 2) Sarah Sawyer 3) Linda Strömqvist 4) Salim Mottagui-Tabar 5) Kenneth Kidd 6) Claes Wahlestedt 7) Stephen Chanock 8) Anthony Brookes

Abstract: at bottom of email

I would like to invite you, on behalf of Communicating Editor Dr. Pui-Yan Kwok, to serve as an expert referee for this paper. Please reply via e-mail as soon as possible to confirm whether you are able to review the paper within about 2 weeks.

If you agree, you will be notified by e-mail shortly with instructions on how to log into the Human Mutation web submission and online review system. You will then have access to the manuscript in your online "Reviewer Center" in order to report your evaluation.

The board members and I realize that our expert reviewers are the foundation for the high quality articles we publish in Human Mutation, and we appreciate the time and effort required to perform this valuable service for the community.

I understand that schedules can be hectic. If for any reason you are unable to accept our invitation, please let me know immediately. Could you suggest a friend or colleague as an alternative reviewer? We would be deeply grateful to receive their name, affiliation, and e-mail address.

Many thanks for your consideration.

Best wishes,

Mark Paalman

Cc: Dr. Pui-Yan Kwok

______
Mark H. Paalman, Ph.D.
Managing Editor, Human Mutation
Official Journal of the Human Genome Variation Society (HGVS)
http://humu-wiley.manuscriptcentral.com
John Wiley & Sons, Inc.
111 River Street
Hoboken, New Jersey 07030
phone: 201-748-6404
fax: 201-748-6398
[email protected]
______

ABSTRACT: We experimentally investigated more than 1200 dbSNP variants that would change amino-acids (nsSNPs), using 18 global populations comprising over 1000 DNAs. First, we mined our data for any SNP features correlated with a high validation rate. Useful predictors for validated SNPs included multiple submissions to dbSNP, having a dbSNP validation statement, and being present in a low number of ESTs. Together, these features improved validation rates by almost 10-fold. Higher abundance SNPs (e.g., T/C variants) also validated more frequently. Second, we considered derived alleles, and noted a considerably (~10%) increased average derived allele frequency (DAF) in Europeans versus Africans, plus a further increase in some other populations. This was not primarily due to a SNP ascertainment bias, nor to the effects of natural selection. Instead, it can be explained as a drift-based, progressive increase in DAF over many generations, with this phenomenon becoming exaggerated during population bottlenecks. This observation suggests novel DAF-based tests for comparing demographic histories. Finally, we considered individual marker patterns and thereby identified 37 SNPs having allele frequency variance or FST values consistent with the effects of population-specific natural selection. Four particularly striking clusters of these markers were apparent, of which 3 coincide with genes/regions suggested by others to carry signatures of selection.

Gabor Marth

From: Hillary Sussman [[email protected]] Sent: Monday, March 10, 2008 10:06 AM To: Gabor Marth Subject: Genome Research - Review Request MS# GENOME/2008/078212

MS ID#: GENOME/2008/078212 MS Editor: Hillary Sussman

Dear Dr. Marth,

We have received a manuscript, "Mapping short DNA sequencing reads and calling variants using mapping quality scores," from Richard Durbin and colleagues, and are wondering if you, or a colleague, would have time to review this manuscript for Genome Research in the next two weeks. (Note: if you can do the review, but need some additional time, please email the Editor directly. We can then determine if this is suitable: we would prefer to have referees prepared to do the review over a slightly longer period than to delay the review process in continued search of suitable referees.) For additional information on the manuscript, a copy of the abstract is printed at the end of this e-mail.

If you are too busy to review this manuscript at this time, we certainly understand; please let us know as well. We would also very much appreciate suggestions of alternative referees.

To accept or decline this request to review, please select the following link and choose from the actions available to you after you have entered your "Reviewer Area":

Please click on the below link to access your Reviewer Area: http://submit.genome.org/tracking/a/a?t=r&k=96279879P28k136339

If you are able to review you will gain immediate access to a PDF of the manuscript and any supplementary material at our online submission site (http://submit.genome.org).

Thank you for your help in this matter. I look forward to hearing from you.

With kind regards,

Hillary Sussman, Ph.D. Executive Editor Genome Research

Mapping short DNA sequencing reads and calling variants using mapping quality scores

BY: Heng Li, Jue Ruan, and Richard Durbin

ABSTRACT: New sequencing technologies promise a new era in the use of DNA sequence. However, some of these technologies produce very short reads, typically of a few tens of base pairs, and to use these reads effectively requires new algorithms and software. In particular, there is a major issue in efficiently aligning short reads to a reference genome, and handling ambiguity or lack of accuracy in this alignment. Here we introduce the concept of mapping quality, a measure of the confidence that a read actually comes from the position it is aligned to by the mapping algorithm. We describe the software MAQ that can build assemblies by mapping shotgun short reads to a reference genome, using quality scores to derive genotype calls of the consensus sequence of a diploid genome, e.g. from a human sample. MAQ makes full use of mate-pair information and accurately estimates the error probability of each read alignment. Error probabilities are also derived for the final genotype calls, using a Bayesian statistical model that incorporates the mapping qualities, error probabilities from the raw sequence quality scores, sampling of the two haplotypes, and an empirical model for correlated errors at a site. Both read mapping and genotype calling are evaluated on simulated data and real data. MAQ is accurate, efficient, versatile and user friendly. It is freely available at http://maq.sourceforge.net.

------
**TO ACCESS OUR ONLINE WEBSITE FOR THE FIRST TIME**

1. Print this page.
2. Click on "Create a New Account" in the upper left hand side of the Bench>Press homepage ( http://submit.genome.org ).
3. Enter the email address we used to contact you ([email protected]) in the space provided.
4. IMPORTANT: Reviewers, please be sure to enter the e-mail address that you received this e-mail message at (again, this is [email protected]). You can change this to your preferred e-mail address if you wish once you have created an account (by going to the personal information area).
5. Choose a password for yourself and enter it in the spaces provided.
6. Complete the question of your choice to be used in the event you cannot remember your password at a later time.
7. Click on the "Save" button at the bottom of the screen.
8. Check the e-mail account you registered under, and look for a new e-mail (this may take up to 3 minutes). The required verification number is automatically e-mailed to you for security purposes.
9. Once you receive this verification number, switch back to this site and enter the verification number that was sent to you.

***HAVE REGISTERED, BUT FORGOT YOUR PASSWORD?***

Ask for your security question at the site. If you don't remember the answer to the security question, E-mail us at [email protected], and we will send you the security question and your answer.
------

Gabor Marth

From: Genomics [[email protected]] Sent: Tuesday, November 16, 2004 1:37 PM To: [email protected] Subject: Review of GENO-D-04-00650

Dear Dr. Marth,

On behalf of Dr. MARK ADAMS, Associate Editor of Genomics, we would appreciate your review of a Genomics manuscript entitled "Allelic variation in gene expression identified through computational analysis of dbEST database." Because of your expertise in the area, you were suggested as a possible reviewer. Please respond to this invitation within 5 days from the date of this letter.

The manuscript abstract can be found below. To accept or decline this assignment, please access the Editorial Manager as a Reviewer. http://geno.edmgr.com Username: gab8r Password: marth1130.

Click on "Agree to Review" if you are willing to review the paper or "Decline to Review" if you are unable to assist us at this time. After agreeing to review, you will have access to the entire manuscript file.

If this manuscript is a revision, comments from previous reviewers (including yourself) may be included at the bottom of this email for your consideration.

Should you be unable to review the manuscript within ten business days after receipt, we would greatly appreciate a reply to this email with the suggestion of an alternate reviewer.

Again, your help in this matter will be greatly appreciated.

Sincerely,

Kay Felicano Elsevier Editorial Office - Genomics http://geno.edmgr.com

ABSTRACT
------
Differential expression between the two alleles of an individual and between people with different genotypes has been commonly observed. Quantitative difference in gene expression between people may provide the genetic basis for the phenotypic difference between individuals and may be the primary cause of complex diseases. In this paper, we developed a computational method to identify genes that displayed allelic variation in gene expression in human EST libraries. To model allele-specific gene expression, we first identified EST libraries in which both A and B alleles were expressed and then identified allelic variation in gene expression based on the EST counts for each allele using a binomial test. Among 1,155 SNPs that had sufficient number of ESTs for the analysis, 572 (50%) displayed allelic variation in at least one cDNA library. The frequency of allelic variation observed in EST libraries was similar to the previous studies using SNP chip and primer extension method. We found genes that displayed allelic variation were distributed throughout the human genome and were enriched in certain chromosome regions. The SNPs and genes identified in this study will provide a rich source for evaluating the effects of those SNPs and associated haplotypes in human health and diseases.

Reviewer comments on previous version of this paper (if applicable):
------

Gabor Marth

From: [email protected] Sent: Friday, September 26, 2008 3:39 AM To: Gabor Marth Subject: Review Genetics GENETICS/2008/095265?

MS ID#: GENETICS/2008/095265
MS Title: Comparison of Genetic Distance Measures Using Human SNP Genotype Data
Authors: Ondrej Libiger, Caroline M Nievergelt, and Nicholas J Schork
MS Associate Editor: Laurent Excoffier

Dear Dr. Marth:

I would be very grateful if you would be willing to review the paper listed above within 21 days.

Click one of the links (below) to register your response.

Decline to review http://submit.genetics.org/info?revkey=0952651129055&type=ra&act=1
Accept to review http://submit.genetics.org/info?revkey=0952651129055&type=ra&act=2
Submit conflict of interest http://submit.genetics.org/info?revkey=0952651129055&type=ra&act=3

If you are unable to review the paper, we would appreciate your suggestions for other reviewers. Please email to [email protected], or send directly to the Associate Editor.

Sincerely, Laurent Excoffier Associate Editor

MS Title: Comparison of Genetic Distance Measures Using Human SNP Genotype Data MS Abstract: Quantification of the genetic distance between populations is instrumental in many genetic research initiatives, and a large number of formulae for this purpose have been proposed. However, the choice of an appropriate measure can sometimes be difficult. We compared results obtained with nine widely used genetic distance measures applied to high-density, whole genome SNP genotype data obtained on individuals from 51 world populations. We found substantial differences among the nine distance measures based on the concordance of their values, Procrustes analysis as well as the topology of resulting phylogenetic trees. Overall, Cavalli-Sforza and Edwards’ distance (CE) measure differed the most from the other measures. Wright’s FST for diploid data, Latter’s, Reynolds’ and Nei’s minimum distance measures each yielded values that were most consistent with the other eight distance measures in terms of ordering populations based on genetic distance. CE and Nei’s geometric distance were least consistent. Simulation studies showed that CE is relatively more sensitive in distinguishing genetically very similar populations, while Reynolds’ genetic distance provided the highest sensitivity for highly divergent populations. Finally, our study suggests that using CE may provide less power for studies concerning human migration history.

Gabor Marth

From: [email protected] Sent: Tuesday, November 02, 2004 8:06 AM To: [email protected] Subject: Bioinformatics - Original Paper BIOINF-2004-1582 - Invitation to review

2 Nov 2004 Manuscript ID: BIOINF-2004-1582 Word Count: 3884 Title: SNP-PHAGE-ML: Application of Machine Learning in SNP Discovery Author(s): 1) Curtis Van Tassell 2) Lakshmi Matukumalli 3) John Grefenstette 4) Perry Cregan 5) Ik-Young Choi 6) David Hyten

Dear Prof. Marth

I wish to invite you to referee the above Original Paper submitted to Bioinformatics.

The abstract of the manuscript is at the foot of this e-mail in order to help you make your decision as to whether you are able to review it.

This paper is one of a related pair, and should you agree to review this paper, we would appreciate it if you would also agree to review the other and comment on whether they comprise two separate pieces of work.

Should you agree to review the paper, we will require a written assessment on the scientific quality of the work, including any revisions required to ensure high scientific quality, but also to ensure efficiency of the publication process. We currently ask that reviewers submit their report within 2 weeks of the manuscript being made available to them.

Please note that you will be expected to submit your report using the Bioinformatics online submission and reviewing system http://bioinformatics.manuscriptcentral.com. Instructions will be provided if you agree to review the manuscript.

If you are unable to review it, could you kindly recommend an alternative reviewer?

Please respond as soon as possible to [email protected]. I look forward to hearing from you.

Yours sincerely

Frank Dudbridge Associate Editor, Bioinformatics

Here is the abstract:

We have applied machine learning (ML) to reduce the cost of expert intervention in single nucleotide polymorphisms (SNP) discovery. In a large-scale polymorphism discovery project many candidate SNP are detected. Each of these SNP has to be expertly evaluated by visual inspection of the sequence assembly and classified as true or false. This step tends to be time-consuming, expensive and difficult to automate. The ML program C4.5 was applied to a set of carefully chosen features to build an automated SNP classifier from a training dataset of 17,590 observations. The classifier achieved an average prediction accuracy of 97.3% in a five-fold cross-validation and 99.6% accuracy on an unseen test dataset of 21,514 examples. We also analyzed the decision trees and production rules generated during the training phase to understand the inherent expert criteria in polymorphism classification. This method is incorporated as a part of the SNP discovery pipeline (SNP-PHAGE).

Availability: The system and source code are available upon request from the authors.

Contact: [email protected] Supplementary information: http://SNP-PHAGE.binf.gmu.edu

Gabor Marth

From: [email protected] on behalf of Trends in Genetics Editorial Office [[email protected]] Sent: Tuesday, November 06, 2007 12:41 PM To: [email protected] Subject: Invite to review a Review manuscript for Trends in Genetics, TIGS-D-07-00164

06 Nov 2007

Dear Dr Marth,

Manuscript reference TIGS-D-07-00164. I was wondering whether you would be able to referee a Review manuscript by Mihai Pop, Ph.D.; Steven Salzberg, Ph.D. entitled "Bioinformatics challenges of new sequencing technology" that has been submitted to Trends in Genetics.

To accept or decline this invitation, please do not reply to this email but click on the appropriate link below:

To accept this invitation http://tigs.edmgr.com/l.asp?i=1998&l=36N0I8OC To decline this invitation http://tigs.edmgr.com/l.asp?i=1997&l=YHPRGI6P

We have included the abstract below for your consideration.

If you do not feel you will be able to review it in a timely fashion or feel you have a potential conflict of interest that might impact your perception of the article or your impartiality in the review process, I'd appreciate any suggestions you have for other appropriate reviewers.

We would be very grateful if you would let us know as soon as possible if you are able to help us on this occasion.

With best wishes

Treasa Creavin Trends in Genetics

Abstract: New DNA sequencing technologies can sequence up to one billion bases in a single day at very low cost. Despite their speed, the new technologies produce very short read lengths, as little as 25-30 nucleotides. These short read lengths make it far more difficult to assemble and annotate genomes, but they put large-scale sequencing within the reach of many scientists who could not previously afford it. Thus despite these difficulties, many researchers are forging ahead with projects to sequence a vast range of species using the new technologies. Here we review the challenges and describe some of the bioinformatics systems that are being proposed to solve them.

Interviews

September 16, 2008 | Vol. 2 No. 37

Study Shows 454, Illumina, ABI Can Profile SNPs in Whole Genomes at High Coverage

By Julia Karow

Researchers from Agencourt Bioscience, Boston College, Applied Biosystems, and the Department of Energy’s Joint Genome Institute have published a paper comparing how second-generation sequencing platforms made by ABI, Illumina, and Roche’s 454 Life Sciences resequence the genome of a yeast strain for mapping its SNPs comprehensively.

The study, which used earlier versions of the three technologies than are currently available, concluded that all three are equally suited for the task at above 10-15-fold sequence coverage.

Though the researchers did not provide a cost analysis in their paper, which marks the first time that a direct comparison of the three platforms has appeared in a peer-reviewed journal, prices quoted by sequencing service providers suggest that the short-read technologies have an edge over 454 in this particular application.

“This study really shows the power of the new technologies for finding very rare [mutational] events,” said Gabor Marth, an assistant professor at Boston College and an author of the study. He pointed out that the strain he and his colleagues sequenced only has one SNP per megabase, a lower SNP density than the human genome.

The study, which appears online in Genome Research this month, has been long in the making (see In Sequence 3/6/2007). Paul Richardson, former program head of R&D and head of the microbial program at JGI, and Doug Smith, director of science and technology at Agencourt Genomic Services, Agencourt’s sequencing facility, conceived of the study about two years ago, according to Richardson.

At the time, JGI had 454 and Illumina instruments on site, “and we were trying to get a better idea of which ones of those were useful for different applications, and we also wanted to compare them to the SOLiD,” he told In Sequence last week.

For their comparison, they chose to sequence a mutant strain of Pichia stipitis, a haploid yeast with a 15.4-megabase genome, which is unusually efficient in converting xylose into ethanol.

The team mapped the unpaired sequence data to the genome of a reference strain of P. stipitis that was sequenced by the Sanger method and that JGI and its collaborators published last year.

The data for the latest project was generated over a period of about a year, starting two years ago, according to Richardson, who joined workflow automation company Progentech as vice president of R&D this spring.

JGI sequenced the mutant strain in a single run on an Illumina Genome Analyzer “classic,” generating 826 megabases of filtered data, or 44.2-fold coverage from aligned reads; Agencourt provided sequence data obtained from its 454 Genome Sequencer FLX platform, generating 199 megabases of filtered data in two runs, or 10.8-fold coverage from aligned reads; while ABI produced data on the first version of its SOLiD system, generating 7.9 gigabases of unfiltered data in a single run, or 175-fold coverage from aligned reads.

Marth’s team mapped the Illumina and 454 reads to the reference genome using its Mosaik alignment program. Since Mosaik, at the time, was unable to align SOLiD data, which uses two-base encoding or “color-space,” ABI analyzed the SOLiD data using its own SOLiD alignment tool.

The scientists then screened the Illumina and 454 read alignments for SNPs using the Gigabayes program, a new version of Marth’s Polybayes software; and the SOLiD color-space alignments using ABI’s own mutation-analysis software.

In total, the three technologies discovered 17 candidate mutations that differed from the reference genome, which were all confirmed by Sanger sequencing. Three of them turned out to be mistakes in the reference sequence.

At 10-fold sequence coverage, the SOLiD data resulted in zero false-positive, or spurious, SNPs and zero false-negative, or missed, SNPs.

Illumina’s data, on the other hand, yielded two false-positive and zero false-negative calls at 13-fold coverage, and zero errors at 19.4-fold coverage.

The 454 data, at 10.8-fold coverage, generated one false-positive SNP — which “mostly likely” resulted from a PCR error during sequence library construction, according to the paper — and no false negatives.

Even though ABI was the only vendor who generated and analyzed its own data for the study, all three vendors were aware of the project and were able to comment on the data prior to publication of the study, Richardson said. “It was a concern, but we tried to be as even-handed as possible.”

“Of course we know from experience that typically, the machine manufacturers can sequence the best [on their platform],” said Marth. “So whenever the data comes from them directly, that’s basically the best quality.”

However, because the study is based on a single dataset from each platform, the differences in the results are not statistically significant, Marth said. Also, he pointed out, the study used a different analysis pipeline for the SOLiD data than for the other two platforms, making the results less comparable.

“There was not a clear winner,” he said. “We were able to find the same mutations with all the platforms.”

“Illumina and ABI were very close in their ability to detect mutations at the lower coverage levels,” Richardson said. “All three were equally good at finding them at the higher coverage levels. […] I think the take-home message is that you need probably 15-fold-ish data to be absolutely sure you have got most” of the mutations.

The technologies differed slightly in how well they covered the P. stipitis genome. In order to map the unpaired reads from the three technologies uniquely, the scientists had to mask repeat regions. Because of its longer reads, the 454 technology could cover a larger fraction of the genome, 96.7 percent, than the two other technologies, which covered 93.2 percent.

However, the researchers found that the distribution of sequence coverage across the genome was “similar” for the three sequencing technologies, though they all deviated from a Poisson distribution, suggesting that “there are regions of the Pichia genome that are more facile to sequence than others,” according to the paper.

“There did not seem to be any specific regional biases,” Richardson noted, adding, “that’s not to say that there might not be some underlying sequence-specific biases, but we did not find any.”

But according to Michael Egholm, 454’s vice president of research and development, the results show that 454 is “the clear winner” with regard to sequencing coverage bias.

Since the Agencourt/JGI team generated its data for the study, Illumina has replaced its GA “classic” with the GA II, ABI has upgraded its SOLiD platform to version 2.0, and 454 is about to roll out its Titanium upgrade for the GS FLX. All three vendors say the new versions provide greater throughput and data quality.

As a result, the study “is not completely relevant to today’s technology,” said Agencourt’s Smith, adding that “the results we reported in this paper would be a kind of worst-case scenario for sequencing a haploid genome [today].”

“It’s rapidly changing technology, which is why it was difficult to make too many strong conclusions,” Richardson agreed. However, based on data he has seen from upgrades of the platforms, “I think by and large, the conclusions of the paper haven’t changed.”

Cost Analysis

Cost is another factor that users likely deem important in a cross-platform comparison, but the researchers decided not to include that information in the paper. “There was a lot of discussion about that, and we wanted to try to include that, but there are several reasons we did not,” Richardson said.

One reason is that the technologies are changing rapidly, causing throughput to rise and costs per base to decline.

“But I guess the bigger issue is that the costs for running these are different for everyone,” he said. “Everyone is, really, paying different prices — and quite significantly different prices — for reagents, and the instruments themselves, in some cases.”

But according to an In Sequence poll of three commercial and academic sequencing service providers that employ more than one second-generation sequencing platform, customers currently pay less on the short-read platforms from ABI and Illumina than on the 454 platform to obtain the same amount of sequence coverage. All providers asked to remain anonymous.

One provider said it charges customers $5,000 to $6,400 for about 200 megabases of unpaired sequence data — or 13-fold coverage of the P. stipitis genome — on the Illumina GA II, and $10,700 for the same amount of sequence data on the GS FLX Titanium. Both prices include SNP and indel detection.
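The coverage figure quoted by the first provider follows from dividing the amount of sequence by the genome size. The short sketch below reproduces that arithmetic using the 15.4-megabase P. stipitis genome size given earlier in the article; the function and constant names are illustrative only.

```python
GENOME_SIZE_MB = 15.4  # P. stipitis genome size as quoted in the article

def fold_coverage(sequence_mb: float, genome_mb: float = GENOME_SIZE_MB) -> float:
    """Average fold coverage: megabases sequenced divided by genome size in megabases."""
    return sequence_mb / genome_mb

# About 200 megabases of sequence over a 15.4-megabase genome
print(round(fold_coverage(200)))  # prints 13
```

This is raw coverage; the per-platform coverage values reported in the study were computed from aligned reads and are therefore lower than a simple division would give.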

Another provider charges $10,000 for 400 megabases of sequence data, or 27-fold coverage of the P. stipitis genome, on the SOLiD, and $25,000 for the same amount of sequence on the GS FLX Titanium.

A third provider told In Sequence that a quarter of a SOLiD plate — which he estimated generates at least 20-fold coverage but probably 40- to 60-fold coverage of the P. stipitis genome — starts at $2,500.

He said that half a plate of sequencing on the GS FLX Titanium, which he expects will provide “at least” 15-fold sequence coverage of P. stipitis, will probably cost about $6,500. A whole run, which “should deliver well over 20x coverage,” will likely cost about $12,000.

But a technology and cost comparison might yield different results if the goal was to discover sequence variations other than SNPs, such as copy number variations or small indels.

“What we were not able to do in this study, which we really wanted to do, is to look at small deletions, but the nature of the data we generated at the time was mostly unpaired libraries,” said Richardson. “That limited our ability.”

Also, the P. stipitis mutant did not appear to harbor many small indels, according to Marth. In a different genome, with more of this type of variation, the results of the comparison “might have looked a little different,” he said.

Whole-Genome Mutation Analysis

The study’s authors believe that whole-genome sequencing will soon become a widely used method for characterizing model organisms with mutant phenotypes, even those with more complex genomes than yeast.

“With another tenfold increase in throughput, this will be very cheap, and it could become routine” for model organisms such as C. elegans, according to Marth.

“We are going to see more and more of this in larger and larger genomes, and probably more complex genomes,” Richardson said. “It’s going to be much easier and more cost-effective to just do a whole genome scan and mutation profiling over a more targeted PCR-type approach, and we may have already reached that point in certain cases.”

But according to Egholm, the mapping approach used in this study has its limits when it comes to discovering structural variations. He said that 454 actually provided a de novo assembly of the P. stipitis genome for the study, which could reveal potential structural variations, but was not included in the paper. “We firmly believe that all resequencing will eventually move to be based — at least in part — on de novo sequencing,” he said.

Agencourt already offers whole-genome sequencing services for mutation analysis on its 454 and its SOLiD platforms, depending on a customer’s needs.

“I think that application is going to be very important, especially for bacterial and fungal organisms, where it’s very efficient,” Smith said.

The company plans to perform more platform comparisons internally, he said.

TGen Team Tests Barcodes for Multiplexed Targeted Resequencing Studies in Humans

By Julia Karow

Researchers at the Translational Genomics Research Institute have devised a barcoding method for sequencing multiple samples in parallel on Illumina’s Genome Analyzer, and have tested it by resequencing several regions in 46 HapMap individuals.

The scientists plan to apply the method in targeted resequencing studies, for example in follow-up projects to genome-wide association studies. Coupled with genome capture or partitioning methods, it could be used to sequence hundreds of samples in parallel, they said.

The aim of the study, which was published in Nature Methods this week, was to develop a way to resequence multiple targeted genomic regions in parallel, and to develop an analysis framework for discovering genetic polymorphisms.

The researchers used six-base barcodes to index 46 HapMap samples. In each sample, they amplified multiple 5-kilobase regions by long-range PCR — 10 in one experiment and 14 in another — most of which had previously been sequenced as part of the Encyclopedia of DNA Elements, or ENCODE, project.

May 22, 2007 | Vol. 1 No. 20

Team Marries DNA Amplification and Next-Gen Sequencers to Enable Targeted Resequencing

By Julia Karow

Researchers at Stanford University’s Genome Technology Center and their colleagues have coupled a multiplexed DNA-amplification method with next-generation sequencing to resequence multiple human cancer genes in parallel and at lower cost than traditional methods.

Unlike traditional PCR, the DNA-amplification technology, also known as the selector technology, allows researchers to amplify many different DNA regions in a single reaction tube.

The researchers used 454 Life Sciences’ sequencing technology in their proof-of-concept study, which was published in last week’s Proceedings of the National Academy of Sciences. They are now testing the technology in a larger project that will use other new sequencing platforms, including Illumina’s Genome Analyzer.

Combining multiplexed DNA amplification and next-gen sequencing could significantly lower the cost and time of large-scale, targeted resequencing projects, and make such studies possible in smaller laboratories that do not have access to the same infrastructure as large genome centers, according to the researchers.

“This is a really great (and important) step forward in the right direction,” Jay Shendure, a researcher in George Church’s group at Harvard University, wrote to In Sequence in an e-mail message last week. Shendure, who has been working on a similar amplification method and was not involved in this study, called the results “impressive.”

Methods to amplify subsets of the human genome will be crucial for targeted resequencing studies, which Shendure and others believe will initially dominate human genome resequencing.

[…] characteristics” that can also be used in the industrial production of biological molecules, such as anti-oxidants, the abstract states.

Microbial Genome Sequencing: Genome Sequences for Four Phototrophic Prokaryotes. Start date: Sept. 1, 2006 Expires: June 30, 2007 Awarded amount to date: $412,651 Principal investigator: Robert Blankenship Sponsor: Washington University

Funds a project to sequence the genomes of four photosynthetic bacteria: Heliobacterium modesticaldum, Roseobacter denitrificans, Rhodocista centenaria, and Acaryochloris marina. The genome sequences of these organisms are expected to fill “large gaps in the available genomic data for photosynthetic organisms,” according to the grant abstract. --

SBIR Phase I: Genetic Data Processing for Viral Researchers and Diagnostics. Start date: July 1, 2007 Expires: Dec. 31, 2007 Awarded amount to date: $97,637 Principal investigator: Susanna Lamers Sponsor: BioInfoExperts

Supports development of a web-based tool for analyzing viral sequences. “Availability of a sequence analysis tool that would help investigators manipulate viral sequences and detect contaminants would be of value to researchers as well as to diagnostic laboratories,” according to the grant abstract.

BC Bioinformaticist Gabor Marth Tackles IT Challenges of Next-Gen Sequencers

Name: Gabor Marth
Title: Assistant Professor, Department of Biology, Boston College
Age: 42

Experience and Education:
Staff Scientist (with Stephen Altschul), National Center for Biotechnology Information, 2000-2003
Postdoc (with Robert Waterston), Washington University Human Genome Center, 1995-2000
D.Sc. (Systems Science and Mathematics), Washington University, 1994
BS-MS (Electrical Engineering), Budapest Technical University, 1987

At last week’s Biology of Genomes meeting at Cold Spring Harbor Laboratory, Gabor Marth gave a talk about the informatics challenges of next-generation sequencing, and presented some software tools his group at Boston College has developed.

In Sequence spoke with Marth last week to get more details.

Tell me about your background. Where does your interest in next-gen sequencing derive from?

Originally, I was at the Wash U Genome Center, [where I was a postdoc] after my PhD, where we did genome sequencing informatics for the Human Genome Project. When people started thinking about not only sequencing a single genome, but to see what the difference is between the different genomes, we started writing computer software and developing methods to find polymorphisms. I developed an algorithm called polyBayes, which was one of the first comprehensive polymorphism discovery tools, looking for SNPs and short insertions and deletions in sequences.

Then I went to the [National Center for Biotechnology Information], where we used these and other tools for the first large-scale organismal SNP discoveries, and collaborated with a bunch of other places to publish the first big polymorphism map of the human genome that came out in Nature in 2001.

In addition, I did population genetics and ancestral demographic modeling, but when these machines made their presence felt, there was a real need to apply the old methods, and update a lot of the methods, and write new methods to leverage the next-generation sequencing data.

Why was there such a need? What is different about the new technologies, compared to traditional Sanger sequencing?

Virtually everything. [One difference is that the] signals that come from these machines are fundamentally different from the Sanger machines. [The closest to the] four-color traces that the old Sanger machines produce is the Illumina, or Solexa, sequencer, which produces a four-color image, but it’s still different, because it’s discrete positions where you measure color intensities. The 454 sequencer is very different because it does not measure individual nucleotides; it measures intensities of two or three or as many nucleotides as incorporated in a single mononucleotide run. And then the [Applied Biosystems] SOLiD technology is again very different, because the measurement is made in what they call color space. So basically, just to interpret these [data] and produce the nucleotides and base confidence values, or base quality values, it’s actually quite different for all these machines.
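The point about the 454 signal can be made concrete with a small sketch. It assumes a cyclic nucleotide flow order and simply rounds each flow's intensity to the nearest whole number of incorporated bases. The flow order, function name, and rounding rule are assumptions for illustration; this is a naive stand-in and not the method of any real base caller.

```python
FLOW_ORDER = "TACG"  # assumed cyclic order in which nucleotides are flowed

def call_bases_from_flows(flow_intensities, flow_order=FLOW_ORDER):
    """Naively turn per-flow intensities into a base sequence.

    Each intensity is treated as proportional to the length of the
    mononucleotide run incorporated during that flow.
    """
    bases = []
    for i, intensity in enumerate(flow_intensities):
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up to the nearest non-negative integer run length
        run_length = max(0, int(intensity + 0.5))
        bases.append(nucleotide * run_length)
    return "".join(bases)
```

For example, the intensities 1.02, 0.05, 2.1, 0.9 would be read as one T, no A, two Cs, and one G. Because an intensity such as 2.5 is ambiguous between a run of two and a run of three, run-length errors are the characteristic uncertainty of this kind of signal.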

The second [difference] is the read length. Sanger reads tended to be 750, even up to 1,000 base pairs after they optimized the technology. With these [new machines], even the 454 FLX machine, [produces only] about 250 base-pair reads, and the really high-throughput sequencers, the Illumina and the SOLiD, they produce what’s called short reads, up to 50 [base pairs, but] typically [you get] more like 30 base pair reads.

And the data that comes off of them is just humungous. We are talking about multiple gigabytes per run. In the old paradigm, you looked at sequences as individual files, for example. You can no longer do that. Just being able to manage this data on a computer and access it fast enough [so] you can do something with 100 million or 200 million reads in a project is just a huge challenge.

How do you analyze the data?

The first challenge [is] you have to look at the raw data that these machines produce, and you have to interpret them and translate them into DNA bases, and [assign] confidence values, which tell you how accurate you think that base is. The general name for such software is ‘base caller.’ For some of the technologies it’s more important to write base callers because the ones that are supplied with the machine don’t perform very well.

We have written several other base callers for various needs, so we have the methodology down, and at least for the 454 machine, [we] were able to write a base caller [called PyroBayes] that seems to perform a lot better. But for example, for the Solexa platform, we didn’t have to write one, because the base calls that come with the machine are actually quite accurate. There is one step that we do with them, but it’s a fairly simple step, more like an adjustment, a calibration step.

[For] the other technologies — the SOLiD technology, for example — we are only starting to get data from them [now]. And [ABI is] actually very interested in us working with their data, but we haven’t actually seen much of their data.

The second [step], which I think is really the crux of dealing with these short reads, is the sequence alignment. [We developed a program for that called Mosaik.] There is a lot of commonality between [different aligners currently being developed] in an algorithmic sense. They have to take these reads and quickly find where, in large genomes, they could possibly fit. Usually, [this kind of] software has a first, sort of quick-and-dirty step, where you are looking through the genome very fast and have an initial scan of where this read could be aligned. Usually, there is a secondary step where you take a more in-depth look. That’s common between many of these programs. What differentiates the programs is the specifics of how they actually do it and how much effort goes into optimizing the code to various read lengths.
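The two-stage pattern described here, a fast initial scan followed by a closer look, can be sketched as follows. This is a generic seed-and-verify illustration and not Mosaik's actual algorithm; the function names, the k-mer hash index, and the mismatch threshold are all assumptions, and the verification step here handles substitutions only.

```python
from collections import defaultdict

def build_kmer_index(reference: str, k: int) -> dict:
    """Hash every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def align_read(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) for each acceptable placement of the read."""
    # Step 1: quick scan. Every k-mer of the read that hits the index
    # proposes a candidate start position in the reference.
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Step 2: closer look. Compare the read with each candidate window
    # base by base and keep placements with few enough mismatches.
    placements = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            placements.append((start, mismatches))
    return placements
```

The index is built once per reference and reused for every read, which is what makes the first step fast relative to scanning the whole genome for each read.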

Another thing that makes [our] software different is, [it can] deal with situations where a short read has an inserted or a deleted base, relative to the reference genome. For example, I know that Illumina’s own software, called Eland, does not have that capability. I know some of the other software [packages] that people are writing can deal with substitution-style differences, but not insertions or deletions. If you cannot align reads that have insertions or deletions relative to the reference sequence, you cannot detect polymorphisms that are insertions or deletions. [This capability] would [also] be an absolute requirement for the 454 reads, because the number of bases in a homopolymeric run is highly variable with the 454 technology. So if you cannot align reads with a couple of base pairs of insertions or deletions, you are going to be throwing out a lot of the reads.
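A minimal sketch of gap-tolerant placement is given below. It uses a standard dynamic-programming edit distance in which the read must be aligned in full but may start and end anywhere in a reference window, so that an inserted or deleted base costs one edit instead of derailing the alignment. This is a textbook technique chosen for illustration, not the algorithm used by Mosaik, and the function name and unit costs are assumptions.

```python
def best_gapped_hit(read: str, window: str):
    """Return (edit_distance, end_position) of the best fit of read in window.

    Substitutions, insertions, and deletions each cost 1. The read may begin
    at any window position at no cost (first row is all zeros).
    """
    prev = [0] * (len(window) + 1)
    for i in range(1, len(read) + 1):
        curr = [i] + [0] * len(window)
        for j in range(1, len(window) + 1):
            substitution = prev[j - 1] + (read[i - 1] != window[j - 1])
            extra_read_base = prev[j] + 1        # base present in read only
            missing_read_base = curr[j - 1] + 1  # base present in window only
            curr[j] = min(substitution, extra_read_base, missing_read_base)
        prev = curr
    # The read may also end anywhere: pick the cheapest end position
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end
```

With this scoring, a read carrying one extra T in a homopolymer run still fits its true location at a cost of a single edit, whereas a substitution-only comparison would count every base after the extra T as a mismatch.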

There are a couple of other things that are going into the algorithmic details. Sometimes, what you want from an assembler is to take a read and place it somewhere in the genome, as long as it can be uniquely placed. But there are many reads that come from really repetitive regions, so there is not a single place where you can place them. Then decisions have to be made: Do you just throw this read out, or do you report every position it can be placed? Sometimes, you are interested in a read not only if it exactly matches somewhere in the genome, but if it matches with a couple of mismatches or insertions.

To find all these locations for a read is actually computationally very intensive. So assemblers will vary in terms of their performance, and their philosophy as to how they will deal with this situation, and whether they are capable of reporting and really finding every position. And it all depends on what your application is, because sometimes it’s not a problem if you don’t find them all. If you just want to know whether there is a single location, or if there is more than a single location, that’s one possible way to look at it. Another application might declare that you find them all, so you know every place where this read could be placed. Our [assembler] is flexible in the sense that we are able to specify how we want our alignments.

Can Mosaik also be used for different read lengths?

Yes, that was the No. 1 design consideration, that we can do it for the short reads, up to 50 base pairs in length; we can do it for the medium size, the 100- to 250-base-pair 454 reads; and the ABI [capillary electrophoresis] reads, which are up to 1,000 base pairs [long]. Because the idea is that [for] some applications, you want to co-assemble reads from the different platforms. People are still exploring how to use these machines for de novo sequencing and resequencing, for structural variations, and SNP discovery and mutational profiling, so you want flexibility in the aligner, so that you can try out various assembly strategies, and then you pick the best one, the one that gives you accuracy at the lowest cost. Plus, the way we view our assembler is basically as a research tool. If we need to align transcriptome sequences, as opposed to genome sequences, there may be different algorithmic requirements for it.

The third difference between different aligners is that, it’s one thing to align a single read to a genome. And it’s another thing to align many reads to the genome and then make multiple alignments from all those reads, where each read is not only aligned relative to the reference genome, but relative to each other. That’s what sometimes people call a multiple alignment, or sometimes people call it an assembly. Mosaik has functional units, it has the aligner, and it has the assembler. And the assembler takes each read aligned to the genome individually and then makes a montage out of it, [which is] the multiple alignment. Most programs don’t actually do that; very few programs can do this assembly step.

Have you published a description of Mosaik?

No, it’s new. I have a phenomenal student, Michael Strömberg, who is developing it, and he is a real pro, but he is a 2nd year graduate student. He just developed this, and the publication plan is for this summer, and a beta release is [due] hopefully next month.

In your talk, you mentioned the concept of ‘resequenceability.’ Can you explain that a little?

If you take a read and you can place it into two or more different locations in the genome, because it aligns to all those locations, then you really cannot say with confidence where the DNA came from, because it could have been coming from any of those locations. So regions that are so repetitive that you cannot reassign a read to them uniquely are not really resequenceable because you really don’t know whether the read came from there or someplace else. Of course it’s not an absolute concept because it may be that with a single read, you cannot really decide whether this read came from here or there. But with a paired-end read, because the other end of that DNA fragment can be uniquely placed, you [can now] choose between the locations where this read came from. So really, resequenceability is a relative concept that depends on read length. It may even depend on the number of errors you expect in a read, and it depends on your strategy, whether you are doing single reads or you are doing paired-end reads. And then for each of these technologies, you can make reasonable decisions of what you consider resequenceable or not.
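The dependence on read length can be illustrated with a toy calculation. The Python sketch below is a simplification under stated assumptions: it considers only error-free single reads on one strand, and the function name `unique_read_fraction` is made up for this example. It reports the fraction of read start positions whose read sequence occurs exactly once in the reference.

```python
from collections import Counter

def unique_read_fraction(reference, read_length):
    """Fraction of start positions whose read of the given length is unique.

    Assumes exact matching on the forward strand only; sequencing errors,
    the reverse strand, and paired-end information are ignored.
    """
    reads = [reference[i:i + read_length]
             for i in range(len(reference) - read_length + 1)]
    counts = Counter(reads)
    unique = sum(1 for read in reads if counts[read] == 1)
    return unique / len(reads)

# A toy reference with a short repeat (a run of A's) at its start.
reference = "AAAAACGT"
for length in (2, 5):
    print(length, round(unique_read_fraction(reference, length), 2))
```

On this toy reference, reads of length 2 that fall inside the run of A's cannot be placed uniquely, while every read of length 5 can, which mirrors the point that the same region may be resequenceable with one read length and not with another.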

Tell me about the assembly format working group you are heading.

That relates to the data volumes, the huge amounts of data as it comes off the machines. After that, when we take those reads, and we align them to the genome and produce an assembly, all the data has to be represented in a way that the downstream software can use. For example, if you have a viewer application, so you want to look at the assembly, and you have to look at, say, 200 million reads in a 200-megabase genome, the amount of data will be so large that you can’t keep all that stuff in the computer’s memory. So you have to find ways in which you can pan across the genome, or focus in on specific regions of the genome. And you have to manage the data in such a way that it’s not all kept in the memory, but it’s very fast to read them from disk, for example. The take-home message is that you really have to keep the data in formats that are conducive for easy and fast access by other software applications that people then use.
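The idea of fetching only the reads that overlap the region being viewed can be sketched as a position-sorted index. The Python below is a hypothetical illustration, not the working group's format: the class name `ReadIndex` and its layout are assumptions, and it keeps the index in memory for brevity, whereas a real format would store file offsets so that the reads themselves stay on disk.

```python
import bisect

class ReadIndex:
    """Position-sorted index over aligned reads (hypothetical sketch).

    Each read is a (start, length, name) tuple and covers the half-open
    reference interval [start, start + length).
    """

    def __init__(self, reads):
        self.reads = sorted(reads)
        self.starts = [read[0] for read in self.reads]
        self.max_length = max((read[1] for read in self.reads), default=0)

    def overlapping(self, region_start, region_end):
        """Return the reads overlapping [region_start, region_end)."""
        # No read starting before this coordinate can reach region_start.
        lo = bisect.bisect_left(self.starts, region_start - self.max_length + 1)
        # Reads starting at or after region_end cannot overlap the region.
        hi = bisect.bisect_left(self.starts, region_end)
        return [read for read in self.reads[lo:hi]
                if read[0] + read[1] > region_start]

index = ReadIndex([(0, 35, "r1"), (30, 35, "r2"), (100, 35, "r3"), (990, 35, "r4")])
print([name for _, _, name in index.overlapping(60, 110)])
```

Because the lookup uses binary search, the cost of displaying a window depends on the number of reads in that window rather than on the total size of the assembly.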

There are two groups: the first one is the short-read format group that’s managed by the University of British Columbia. Their [goal] is to produce data formats that the machine manufacturers would be subscribing to. When they produce their data, the way it comes off the machine, in that standard format, then it’s easy for genome centers and other users to immediately use.

The other [group deals with] the assembly format, which we moderate here at Boston College. Here the thrust is slightly different: It is to produce data formats that are conducive for applications. In addition to just the file format, we are also collaborating to produce software libraries that people could use and other software developers would have access to [for] their applications. They would produce just pre-canned methodologies to access the data in an efficient way.

You also developed a viewer, EagleView?

One of my postdocs is developing [EagleView] to be able to look at very large assemblies of tens of millions, or hundreds of millions of reads and be able to browse through [them] very fast. The function of the viewer has also changed. Back in the old days, when people were finishing genomes based on long reads, they would edit reads if they thought that the base caller was making a mistake. And that’s gone away; there is no way anybody would edit 100 million reads; that’s just not going to happen. Primarily, these applications are there for quality assessment and for software development, because that way, you can look at the data and you can see whether your software tools are doing the right thing with the data.

What about your update of polyBayes, your SNP calling software?

Basically, I am developing quite new, quite different versions of that software now for use with these short reads. The major differences are regarding performance. Looking at a few thousand, even 100,000-long ABI reads is quite different from looking at 100 million reads, or 5 million reads even, with these short reads. The performance had to be really improved in this application. Plus the data types are changing; these data that we collect with these new short-read machines are all haploid, meaning that they only sequence one or the other chromosome. Also, for SNP calling, it’s very important to know the number of DNA molecules involved in your sequences. [In a cancer sample, it could be] many-ploid and only a small fraction of the cells [might] actually have the cancer mutations, whereas others don’t. So these are algorithmic details, but they are very important for accurate mutation detection.

You mentioned testing the software in a number of projects involving the 454 and the Illumina platforms. You said you are now starting to analyze ABI SOLiD data. Are you hoping to add other platforms as they come along, like Helicos?

I have not seen any Helicos data, [and] I don’t know anyone who has seen Helicos data other than the Helicos guys themselves, but I understand that there will be some data released there, too. We are obviously very interested in their methodology. Their machines will yet be different, the type of data they produce will be different from the other four. That’s what my lab does: We want to look at every new data source. We look at them critically and see where software is needed, and if we can, would like to develop software to be able to leverage the data for the community.

When are you going to analyze a mammalian genome using short reads?

The projects that you have heard about so far with the short-read machines have been on smaller than human genomes. It turns out that the informatics scale-up is actually very substantial even up to that point. The C. elegans genome is 100 megabases, the Drosophila genome is 180 megabases, the human genome is 3 gigabases. So we have another 20- to 30-fold scale-up [to do]. If you need maybe a couple of runs of Solexa data to cover the C. elegans genome, then you need 20 to 30 times that much to cover the human genome at the same coverage.
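The scale-up arithmetic can be checked directly. The Python sketch below uses the genome sizes quoted above; reading "a couple of runs" as exactly two runs is an assumption made only for illustration.

```python
# Genome sizes in megabases, as quoted in the interview.
GENOME_SIZE_MB = {"C. elegans": 100, "Drosophila": 180, "human": 3000}

def fold_scale_up(small_genome, large_genome="human"):
    """Ratio of genome sizes, i.e. the data scale-up at equal coverage."""
    return GENOME_SIZE_MB[large_genome] / GENOME_SIZE_MB[small_genome]

def runs_needed(runs_for_small, small_genome, large_genome="human"):
    """Runs needed for the larger genome at the same coverage."""
    return runs_for_small * fold_scale_up(small_genome, large_genome)

for organism in ("C. elegans", "Drosophila"):
    print(f"{organism} -> human: {fold_scale_up(organism):.1f}-fold")
print(f"Runs for human (assuming 2 for C. elegans): {runs_needed(2, 'C. elegans'):.0f}")
```

With these sizes the ratio is 30-fold starting from C. elegans and roughly 17-fold starting from Drosophila, so the quoted 20- to 30-fold figure is best read as a round estimate.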

I really think that’s the last big scale-up there. I think on the data end, people have whole human datasets, but I think to do the comprehensive study that you can do for a 100-megabase genome, we are [still] working on the informatics of that, I have not seen that happening. I would still give it another probably four to six months before we can reliably, confidently do mammalian-style comprehensive genome analysis with these short-read sequences.

454 Life Sciences, Qiagen, SanAir, Gene Systems, CSHL, NHGRI, TriLink, Integrated Genomics, TMO Renewables, Illumina

Rothberg Predicts 454 Revenues to Exceed $70M in 2007

In a presentation at the BioIT World conference in Boston earlier this month, 454 Life Sciences founder Jonathan Rothberg said that the company expects that its revenues will exceed $70 million this year, and that the company will sell more than 100 sequencing instruments.

Earlier this month, CuraGen said in a filing with the Securities and Exchange Commission that 454 generated $12.9 million in revenues in the first quarter of 2007. Of those revenues, $8.4 million derived from products, $2.6 million from sequencing services, $375,000 from collaborations, and $1.5 million from milestone payments (see In Sequence 5/15/2007).

Last year, 454 brought in revenues of $37.3 million and placed more than 40 Genome Sequencers.

Qiagen to Distribute Whatman’s DNA-Storage Technology

Qiagen and Whatman said this week that they have signed a non-exclusive agreement that gives Qiagen the right to market and sell Whatman’s FTA DNA-handling technology in the life science research, molecular diagnostics, and applied testing markets.

The agreement gives Qiagen distribution rights for existing Whatman FTA products as well as customized products. Qiagen will pay Whatman an upfront fee that grants it non-exclusive distribution rights, and then the company will pay Whatman royalties on the sale of any products that use FTA.

Whatman's FTA technology allows scientists to collect, transport, archive, and release nucleic acids at room temperature, and has a range of potential applications including forensics, pharmacogenomics, biobanking, and genomics research, Qiagen said.

Appendix C. Teaching materials

BC Course syllabi

INTRODUCTION TO BIOINFORMATICS: BI420

Instructors:
Prof. Gabor Marth (Section 1), 415 Higgins Hall, [email protected]
Prof. Stephen Wicks (Section 2), [email protected]
Teaching Assistant: TA Deb, 420 Higgins Hall, [email protected]

Class location and meeting times:
MARTH (Section 1): TuTh 1:30-2:45 pm, Higgins 425
WICKS (Section 2): MW 10:00-11:15 am, Higgins 425

Class Communication: We will be using Blackboard Vista. Please be familiar with using Blackboard Vista. If you need help, there are many tutorials and help-sessions available through BC.

OFFICE HOURS: TA Deb – T/TH (1pm-1:30pm) & M/W (11:15-11:45am) and by appointment. Profs. Marth and Wicks -- by appointment

Course description
Bioinformatics is an emerging field at the confluence of biology, mathematics and computer science. It strives to better understand the molecules essential for life, by harnessing the power and speed of computers. This introductory course requires that students have a basic understanding of molecular biology, genetics, and the Internet, but does not require extensive background in mathematics or programming. Students will learn how to use bioinformatic tools from the public domain, to mine and analyze public domain databases.

Mandatory text: Discovering Genomics, Proteomics, & Bioinformatics by A. Malcolm Campbell & Laurie J. Heyer, publisher CSHL Press

Optional texts:

1. Bioinformatics and Molecular Evolution, by Paul G. Higgs & Teresa Attwood, publisher Wiley & Sons (on sequence alignments, database searching, statistical significance of alignments, phylogenetics; Chapters 6, 7, 8).
2. Running Linux, Fourth Edition, by Matt Welsh, publisher O'Reilly & Associates (on the LINUX operating system and basic UNIX commands).
3. Programming Perl, 3rd Edition, by Tom Christiansen, Larry Wall, and Jon Orwant, publisher O'Reilly & Associates.
4. Beginning Perl for Bioinformatics, by James Tisdall, publisher O'Reilly & Associates (useful text for beginner Bioinformatics programmers).

Grading
Your grade is based on in-class participation, homework, midterm exams, and a final presentation:

In-class participation = 10%; Homeworks (4) = 40%; Midterm examinations (2 @ 15% ea) = 30%; Final presentations = 20%

Academic Integrity Policy: Please be sure that you familiarize yourself with the Academic Integrity Policy of Boston College. Any work handed in with your name on it is presumed to be your own work. This applies to ALL coursework, including homework assignments, final projects, and tests. If you use library / Internet resources, full bibliographic references / precise URLs should be given. Any deviation from this policy can immediately result in a course grade of "F" and your being turned over to the Board of Academic Integrity for a hearing.

Course Outline:

I. Introduction to Bioinformatics: aspects of cellular and genome organization that lend themselves to computational scientific research.
II. Genome sequencing and data mining: genome sequence generation, functional and structural sequence annotations, genetic variation discovery, gene expression analysis, proteomics, storage and retrieval of Biological data.
III. Classical Bioinformatics methods: sequence alignments, phylogenetic analysis, and data classification.
IV. Computational Genomics: evolutionary genomics, population genomics, medical genomics.
V. Practical Bioinformatics: basic UNIX computer skills, programming, using and building Biological databases.
VI. Final Presentations: full details TBA, usually ~8 min individual presentation on a relevant bioinformatic topic.

Section I. Introduction

Class 1. (MARTH: Tuesday Sept 2, 2008 WICKS: Wednesday September 3, 2008). Introduction to Bioinformatics. We will introduce the main fields of Bioinformatics in the context of genome organization on the molecular and the species level. Reading Assignments: Genome Sequencing, Campbell and Heyer pgs 34-39; Lander ES et al. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921.

Section II. Genome sequencing and data mining

Class 2. (MARTH: Thursday Sept 4, 2008 WICKS: Monday September 8, 2008). Web Ex Genome sequencing informatics We will learn about the process of DNA sequencing to generate the complete genomes of living organisms, and the many Bioinformatics software tools to automate this process. The instructor will also demonstrate some of the main tools that were used in the Human Genome Project and in the sequencing of other species. Reading assignments: Venter et al. The sequence of the human genome. Science. 2001 Feb 16;291(5507):1304-51. Erratum in: Science 2001 Jun 5;292(5523):1838. Campbell & Heyer pgs 41-42, 44, 49-54.

Class 3. (MARTH: Tues Sept 9, 2008 WICKS: Wednesday September 10, 2008). New Genome Sequencing Technology We will learn about next-generation sequencing technologies, which are surpassing the initial sequencing technologies by producing short-read sequences, reducing the cost of whole-genome sequencing, and increasing the accuracy of sequencing. Reading assignment: Hillier et al. Whole-genome sequencing and variant discovery in C. elegans. Nat Methods. 2008 Feb;5(2):183-8. Epub 2008 Jan 20.

Class 4. (MARTH Thursday Sept 11, 2008 WICKS: Monday September 15, 2008). Genome annotation and the landscape of the human genome We will learn about the Bioinformatics tools that aid the annotation of genomes: e.g. finding protein coding or RNA genes, regions of known human repeats, or measuring nucleotide composition. We will then learn about the main features of the human genome, and make comparisons to the completed genomes of other organisms. Reading Assignment: Campbell & Heyer pgs 59-60, 62-66, 74-76, 78-90, 90-96 (optional), 96-104.

Class 5. (MARTH: Tues September 16, 2008 WICKS: Wednesday September 17, 2008.) Variation discovery I (theory) http://bioinformatics.bc.edu/~marth/BI420 We will learn about the informatics aspects of polymorphism discovery in DNA sequences. Specifically, we will learn about the PolyBayes SNP discovery approach. Reading assignment. Campbell & Heyer pgs 186-192, pgs 193-198 (optional) Homework assignment #1 given out Thursday.

Class 6. (MARTH: Thurs September 18, 2008 WICKS: Monday September 22, 2008.) Variation Discovery II (practice): The Polybayes Lab We will continue learning about polymorphism discovery informatics. During this class the instructor will demonstrate the use of the PolyBayes SNP discovery tool, validation experiments, and SNP detection in diploid traces. We will then learn about genome-scale polymorphism discovery projects. Reading assignment: Marth, GT. A general approach to single-nucleotide polymorphism discovery. Nat Genet. 1999 Dec;23(4):452-6. Campbell and Heyer pg 208 (math minute, Bayes Theorem), pgs 198-207 (optional), 210-213 (optional).

Class 7. (MARTH: Tuesday September 23, 2008 WICKS: Wednesday September 24, 2008). Gene expression analysis We will learn about the informatics of gene expression analysis using DNA microarrays. We will cover the main application areas of this important technology and the computational methods used to analyze expression data. Reading assignment. Campbell & Heyer pgs 235-246, 261 Homework assignment #1 DUE WEDNESDAY in the Biology office (Higgins 355) at 1PM. Put in mailbox for TA Deborah Ritter.

Class 8. (MARTH: Thurs September 25, 2008 WICKS: Monday September 29, 2008). Gene safari (a tour of Biological databases) During this class we set out on a web expedition to tour NCBI, Ensembl, specialized variation resource web sites, PharmGKB, OMIM, etc. for gene information.

Class 9. (MARTH: Tuesday September 30, 2008 WICKS: Wednesday October 1, 2008). Practical Lab: Translational Bioinformatics (TA DEB TEACH) During this class we will overview UCSC Genome Browser/GALAXY, to analyze data using web-interface programs built by bioinformaticians for laboratory biologists. We will also discuss new companies created from personal genome sequencing, visit websites and discuss ethical implications of genomic analysis. Reading Assignment: Campbell and Heyer pgs. 205-214

NO CLASS (MARTH) THURSDAY OCTOBER 2ND!!!!

Class 10. (WICKS: Monday October 6 , 2008 MARTH: Tuesday October 7, 2008).

Examination #1.

Section III. Classical Bioinformatics methods

Materials at the library relevant to the sequence alignment algorithms: Campbell and Heyer, pp. 46-47 (Substitution Matrices). Bioinformatics: Genomics and post-genomics, Dardel and Kepes, pp. 34-38. Essential Bioinformatics, J. Xiong, pp. 31-50. Bioinformatic Databases and Algorithms, N. Gautham, pp. 68-90. BLAST, I. Korf, M. Yandell, and J. Bedell. pp. 40-54, pp. 75-87

Class 11. (WICKS: Wednesday October 8, 2008 MARTH: Thursday October 9, 2008). Sequence alignment (DR. WICKS TEACH, DR. MARTH OUT OF TOWN) During this class, we will discuss the fundamental concepts behind sequence alignments and the basic algorithms to find optimal alignments and to visualize them. Reading assignment. Higgs & Attwood Chapter 6 (pp. 121-137). Homework assignment #2 out (Thursday).

NO CLASS (WICKS) OCTOBER 13th COLUMBUS DAY!!!!!

Class 12. (MARTH: Tuesday October 14, 2008 WICKS: Wednesday October 15, 2008). Dynamic Programming Algorithms During this class, we will continue sequence alignment algorithms and practice local and global alignments in-class. Homework assignment #2 DUE WEDNESDAY in the Biology office (Higgins 355) at 1PM. Put in mailbox for TA Deborah Ritter.

Class 13. (MARTH Thursday October 16, 2008 WICKS: Monday October 20, 2008). Searching sequence databases, statistical significance of alignments. Hidden Markov Models. During this class, we will discuss methods that allow us to decide whether sequence alignments represent true biological relationships and the statistical significance of alignments. Reading assignment. Higgs & Attwood Chapter 7 (pp. 139-157).

Class 14. (MARTH: Tuesday October 21, 2008 WICKS: Wednesday October 22, 2008). Phylogenetics (DR. WICKS TEACH, DR.MARTH OUT OF TOWN) We will discuss the most commonly used methods of phylogenetics tree construction. Reading assignment. Higgs & Attwood Chapter 8 (pp. 158-194). Homework assignment #3 out (Thursday).

Section IV. Computational Genomics

Class 15. (MARTH: Thursday October 23, 2008 WICKS: Monday October 27, 2008). Evolutionary genomics We will learn about the main forces that drive species evolution and the computational methods that allow the determination of the evolutionary relationships among living organisms. Reading assignment. Campbell & Heyer pgs 114-132, 139-143 (including math minute).

Class 16. (MARTH: Tuesday October 28, 2008 WICKS: Wednesday October 29, 2008). Medical genomics In this class we will discuss the medical significance of genetic variations, methods to map disease-causing genetic variants, and the informational reagents that have been developed to aid such efforts. Reading assignment. Campbell & Heyer pgs 152-154, 162-172. Homework Assignment #3 Due Wednesday in the Biology office (Higgins 355) at 1:30PM. Put in mailbox for Deborah Ritter.

Class 17. (MARTH: Thursday October 30, 2008 WICKS: Monday November 3, 2008). Gene regulation and Systems Biology In this class we will sample methods for identifying sequences important in gene regulation. We will discuss complexity in gene regulation and overview gene regulatory networks / systems biology. Reading Assignment: Campbell and Heyer pgs 342-346 PAPER: A survey of DNA motif finding algorithms, by Das, M.K and Dai, H.K. in BMC Bioinformatics

Class 18. (MARTH: Tuesday November 4, 2008 WICKS: Wednesday November 5, 2008). Examination #2

Section V. Practical Bioinformatics

Class 19. (MARTH: Thursday November 6, 2008 WICKS: Monday November 10, 2008). Working on UNIX computers (part I) During this class we will familiarize ourselves with the LINUX computer environment (a flavor of UNIX) installed on the PC laptop computers. We will practice starting up the operating system, and performing simple, basic UNIX commands. Reading assignment. Please refer to the appropriate section of the Beginning UNIX text.

Class 20. (MARTH: Tuesday Nov 11, 2008 WICKS: Wednesday November 12, 2008). Working on UNIX computers (part II) (TA DEB TEACH, DR. MARTH OUT OF TOWN) During this class we will continue with UNIX commands, and learn to use the EMACS text editor program. We will also learn how to execute command-line versions of known Bioinformatics software e.g. BLAST. Reading assignment. Please refer to the appropriate section of the Beginning UNIX text. Homework assignment #4 out Thursday.

Class 21. (MARTH: Thursday November 13, 2008 WICKS: Monday November 17, 2008). Practical Lab: Building our own Biological database (TA MICHELE / TA DEB TEACH) During this class we will build a small MySQL database to store information about SNP discovery in overlapping sections of BAC clones.

Class 22. (MARTH: Tuesday Nov 18, 2008 WICKS: Wednesday November 19, 2008). Simple Bioinformatics programming (part I) (DR. MARTH TEACH / TA DEB TEACH) During this class we will write PERL computer programs implementing basic Bioinformatics programming tasks e.g. file parsing and command pipelines. Reading assignment. Please refer to the appropriate section of the Programming Perl reference book and the Beginning Perl for Bioinformatics text.

Class 23. (MARTH Thursday November 20, 2008. WICKS: Monday November 24, 2008). Simple Bioinformatics programming (part II) (DR. MARTH TEACH / TA DEB TEACH) During this class we will continue writing simple PERL computer programs for Bioinformatics applications. Reading assignment. Please refer to the appropriate section of the Programming Perl reference book and the Beginning Perl for Bioinformatics text. Homework Assignment #4 Due Wednesday in the Biology office (Higgins 355) at 1:30PM. Put in mailbox for Deborah Ritter.

Class 24. (MARTH: Tuesday November 25, 2008 WICKS: Monday December 1, 2008). Practical Lab: Serving Biological information through the web (DR. MARTH TEACH) During this class we will build a web server that allows users to retrieve information from the SNP discovery database built in the previous class.

NO CLASS WEDS NOV 26-28!! THANKSGIVING HOLIDAY!!!

Section VI. Final presentations

The last two classes will consist of individual presentations. A list of topics for these presentations will be maintained by TA Deb; students are to sign up for one topic each. Presentations must be limited to ~8 minutes to allow time for every presenter.

Class 25. (MARTH: Tuesday December 2, 2008 WICKS: Wednesday December 3, 2008).
Class 26. (MARTH: Thursday December 4, 2008 WICKS: Monday December 8, 2008).

Final Exam: There is no final exam. Your presentation, presentation attendance, and class participation are combined into the final exam grade. A detailed grading rubric for the presentations will be provided.

Main Page - BI616 http://bioinformatics.bc.edu/marthlab/BI616/index.php/Main_Page

Main Page

From BI616

Contents

1 Syllabus
1.1 Grading policy
1.2 Text Books
1.3 Click Here to See What You Will Learn In BI616
1.4 Classes
1.5 Homeworks: Subject to Change Until After Class
1.6 Links To Useful Resources
1.7 Academic Integrity Policy

Syllabus

Instructor: Gabor Marth, 415 Higgins Hall. Email: marth at bc dot edu.

Teaching assistant: Michele Busby, 416 Higgins Hall. Email: busbym at bc dot edu

Class location and time: Higgins 425, Tuesdays and Thursdays 4:00 PM - 6:00 PM.

Office hours: By appointment

Prerequisites: No formal requirements

Course synopsis: The computer is rapidly becoming an indispensable research tool for the bench Biologist. Mega-scale projects such as genome sequencing, large-scale genotyping, expression microarray analysis, and imaging studies have filled public databases with information that is invaluable for the quest to understand Biological function in living organisms. Although some of these resources are available through web portals, it is impossible to realize the full potential of the vast amount of data without more sophisticated, more subtle, custom analyses. The aim of this course is to develop computer-aided data analysis skills that open the door to Biological information only accessible with Bioinformatics methods. This will be a completely hands-on course where each student works on their own designated UNIX computer.

The following will be covered:

(1) Using the UNIX environment and its productivity tools, e.g. executing programs from the command line, editing and manipulating text files;
(2) Programming basics in the PERL computer language;
(3) Bioinformatics-specific programming skills acquired via solving realistic and typical data manipulation and analysis problems;
(4) Creation of automated data analysis pipelines to perform large-scale and repetitive data analysis tasks; and
(5) Special topics such as the creation and management of user databases, and programmed access to web resources.

Grading policy

Your grade will be based on class participation, frequent homework assignments, and a term programming project that you will be presenting at the end of the semester, in the following proportions:


Class participation = 20%
Homework = 40%
Term project = 40%

Course syllabus: downloadable in PDF format by clicking here.

Text Books

You need to get a good reference textbook for Perl.

We suggest:

Programming Perl, 3rd Edition, by Tom Christiansen, Larry Wall, and Jon Orwant, publisher O'Reilly & Associates
Running Linux, Fourth Edition, by Matt Welsh, publisher O'Reilly & Associates

Note: These are the same books used as last year, so you may be able to get a free copy off of your lab mates.

Also good are:

Beginning Perl for Bioinformatics, by James Tisdall, publisher O'Reilly & Associates
Perl Core Language Little Black Book, by Steven Holzner, publisher Paraglyph Press (a favorite in the Chuang lab)

Click Here to See What You Will Learn In BI616

Classes

(Click on the class to access material for the class)

Class 1: Introduction To Linux (1 of 2) (Thursday 9/4/08)
Class 2: Introduction To Linux (2 of 2) (Tuesday 9/9/08)
Class 3: Introduction to Perl Programming (1 of 2) (Thursday 9/11/08)
Class 4: Introduction to Perl Programming (2 of 2) (Tuesday 9/16/08)
Class 5: Complex Data Structures (1 of 2): Lists and Arrays (Thursday 9/18/08)
Class 6: Complex data structures (2 of 2): hashes (associative arrays) (Tuesday 9/23/08)
Class 7: Subroutines and review of regular expressions (Thursday 9/25/08)
Class 8: File parsing (Tuesday 9/30/08)
Class 9: Bioinformatics-specific programming: processing nucleotide and protein sequences (Thursday 10/2/08)
Class 10: Web programming (Tuesday 10/7/08)
Class 11: Databases, accessing databases from PERL (Thursday 10/9/08)
Class 12 and a half: Databases, accessing databases from PERL (Tuesday 10/14/08)
Class 13: Accessing databases through web forms (Thursday 10/16/08)
No class on 10/21/08
Class 14: Final project discussion, advanced programming topics (Thursday 10/23/08)
Class 15: Presentation of term projects (Tuesday 11/4/08)

Homeworks: Subject to Change Until After Class

Homework 1: Using Linux
Homework 2: Introduction to Perl (Restriction Enzymes), Due: 9/23/08
Homework 3: Arrays and Hash Tables: DNA Translation, Due: 9/30/08
Homework 4: Regular Expression (Protein Structure Prediction), Due: 10/7/08
Homework 5: File Parsing and Sequence Processing (Gene Models), Due:
Homework 6: Databases (Meta-Analysis of Expression Data), Due:


Project Topics 2007
Project Topics 2008 (draft)

Links To Useful Resources

Accessing Bioclass

Unix Tutorial

SQL Tutorial

Regular Expression Tutorial

DBI Tutorial


Academic Integrity Policy

Please be sure that you familiarize yourself with the Academic Integrity Policy of Boston College. Any work handed in with your name on it is presumed to be your own work. This applies to ALL coursework, including homework assignments, final projects, and tests. Any submitted work must be your own, and NOT COPIED or plagiarized; if you use library / Internet resources, full bibliographic references / precise URLs should be given. Any deviation from this policy can immediately result in a course grade of "F" and your being turned over to the Board of Academic Integrity for a hearing.

Notes for next year

Retrieved from "http://bioinformatics.bc.edu/marthlab/BI616/index.php/Main_Page"

This page was last modified 21:36, 30 September 2008.

BI820: Seminar in Quantitative and Computational Problems in Genomics

Instructor: Gabor Marth, 415 Higgins Hall. Phone: 617.552-3571. Email: [email protected].
Class location and time: O’Neill Library Room 245. Mondays, 2:00 – 4:00 PM.
Office hours: By appointment
Course web site: http://clavius.bc.edu/~marth/BI820
Prerequisites: No formal requirements

Course organization: We will cover three main topics: 1) Polymorphism discovery, 2) Sources, structure, and the utility of human variation, and 3) Advanced topics in population, statistical, and medical genetics (I have included an outline of the course material at the end of this document). The material will be presented by: an introductory lecture, student presentations based on a combination of review articles and primary literature, and computer sessions.

Grading policy: Your grade will be based on presentations of the assigned reference material (presentations may require additional literature search on your part), and in-class activity, in the following proportions:
Presentations – 60%
In-class activity – 40%

Planned course material:

Topic 1: The informatics of DNA sequencing and polymorphism discovery
Introductory lecture by instructor.
Genome sequencing informatics: methods and tools.
Genome sequencing computer session.
Sequence resources for polymorphism discovery (STS, EST, full-length cDNA, BAC-end, whole-genome shotgun sequences).
Polymorphism mining informatics: methods and tools.
Polymorphism mining computer session.
Genome-scale single-nucleotide polymorphism (SNP) mining projects.
Sequence variation databases: exploration on the web.
Sequence variation databases: building a SNP database in class.
Genotyping technologies.
Properties of human polymorphisms.
Background material: Computational SNP discovery in DNA sequence data. Gabor Marth. In: Methods in Molecular Biology, vol. 212: Single Nucleotide Polymorphisms: Methods and Protocols. Humana Press, 2003.

Topic 2: Polymorphism structure, function, and ancestral inference
Introductory lecture by instructor.
Sources of variation, mechanics of propagation (the mutation process, genealogy and random genetic drift, long-term demography and subdivision, recombination, and the various forms of selection).
Analyzing neutral variations, the Coalescent process, simulation and modeling.
Ancestral inference from polymorphism data.
The effects of selection on contemporary polymorphism structure and numerical tests of selection.
Multi-allele association: linkage disequilibrium and human haplotype structure.
Simulating and visualizing human polymorphism and haplotype structure: computer session.
Background material: Molecular Evolution. Wen-Hsiung Li. Sinauer Associates, 1997. Principles of Population Genetics. Daniel L. Hartl and Andrew G. Clark. Sinauer Associates, 1997.

Topic 3: Advanced topics in population, statistical, and medical genetics
Introductory lecture by instructor.
Human genetic diseases and the allelic structure of functional polymorphisms.
Genetic linkage analysis.
Association mapping, case-control studies to track causative loci of human diseases.
Candidate gene approaches to gene mapping.
Forensic applications – DNA identification.
Background material: Handbook of Statistical Genetics. D.J. Balding, M. Bishop, and C. Cannings. Wiley and Sons, 2001.

BC Course evaluation summaries

Boston College Course Evaluation Summary Report - BI42001-2008F-tx1 INTRO TO BIOINFORMATICS Marth, Gabor

Group                                                   Total Surveys  Surveys Submitted  Response Rate  Q1    Q2    Q3    Q4    Q5    Q6
2007 Fall Term - BC2008F                                61,596         54,551             88.56%         3.90  4.02  4.18  4.22  4.05  3.35
Arts & Sciences                                         43,257         38,507             89.02%         3.85  3.99  4.15  4.18  4.01  3.35
Biology                                                 4,337          3,801              87.64%         3.43  3.58  3.97  3.92  3.71  3.37
BI42001-2008F-tx1 INTRO TO BIOINFORMATICS Marth, Gabor  24             21                 87.50%         3.43  3.81  4.14  4.10  3.20  2.71

1. What rating does this instructor deserve as a teacher?
   Scale: Poor = 1/Acceptable = 2/Good = 3/Very Good = 4/Excellent = 5
   Response    1        2        3        4        5        N/A      No Response
   Count       0        1        12       6        2        0        0
   Percent     0.00%    4.76%    57.14%   28.57%   9.52%    0.00%    0.00%
   Mean: 3.43   STD: 0.75

2. Regular class attendance was necessary for learning the required content.
   Scale: Strongly Disagree = 1/Disagree = 2/Uncertain = 3/Agree = 4/Strongly Agree = 5
   Response    1        2        3        4        5        N/A      No Response
   Count       0        3        2        12       4        0        0
   Percent     0.00%    14.29%   9.52%    57.14%   19.05%   0.00%    0.00%
   Mean: 3.81   STD: 0.93

3. The course helped me to acquire factual information.
   Scale: Strongly Disagree = 1/Disagree = 2/Uncertain = 3/Agree = 4/Strongly Agree = 5
   Response    1        2        3        4        5        N/A      No Response
   Count       0        0        3        12       6        0        0
   Percent     0.00%    0.00%    14.29%   57.14%   28.57%   0.00%    0.00%
   Mean: 4.14   STD: 0.65

4. The course helped me to understand principles and concepts.
   Scale: Strongly Disagree = 1/Disagree = 2/Uncertain = 3/Agree = 4/Strongly Agree = 5
   Response    1        2        3        4        5        N/A      No Response
   Count       0        0        2        15       4        0        0
   Percent     0.00%    0.00%    9.52%    71.43%   19.05%   0.00%    0.00%
   Mean: 4.10   STD: 0.54

5. The instructor was available for help outside of class.
   Scale: Strongly Disagree = 1/Disagree = 2/Uncertain = 3/Agree = 4/Strongly Agree = 5
   Response    1        2        3        4        5        N/A      No Response
   Count       0        2        13       4        1        1        0
   Percent     0.00%    9.52%    61.90%   19.05%   4.76%    4.76%    0.00%
   Mean: 3.20   STD: 0.70

6. Compared to other courses having the same credits and hours, the time required for this course was:
   Scale: Much Less = 1/Less = 2/The Same = 3/More = 4/Much More = 5
   Response    1        2        3        4        5        N/A      No Response
   Count       0        9        10       1        1        0        0
   Percent     0.00%    42.86%   47.62%   4.76%    4.76%    0.00%    0.00%
   Mean: 2.71   STD: 0.78

7. What are the strengths of this instructor?

* Excellent knowledge of material. Good rapport with students. Very interesting subject matter.

* good teacher.... makes bioinformatics fun!

* He was a very personable and enthusiastic professor. Having never take such a course before, the material was refreshing and new.

* Extremely nice, very funny. Made class fun, even when the material could be dry at times.

* He knows alot about the material

* He is friendly and approachable. You can tell he wants you to actually learn the material, and aware that it takes some people more than others. He is very patient. Also, he is very aware of the subject and seems heavily involved in research.

* Knows material well, communicates most in a direct and easy to understand way.

* Keeps a relaxed atmosphere even though there's a lot to learn.

* knowledgeable

* Professor Marth is a great professor. He talks very clearly and explains the material well. He also has great eye contact which allowed me to focus on him and his lectures.

* Enthusiastic, fun

* Good teacher, very knowledgeable and able to explain complicated techniques and concepts rather easily.

* good knowledge, funny

* easy to listen to, especially for what would seem a boring subject. very entertaining. very clear.

* Very confident in teaching. Knew how to convey important principles. Solid teacher.

* He is very knowledgeable about the material, especially since he is actively doing research which is nice to have. Also the TA was available for help when Marth wasn't which was a nice option.

* Enthusiastic about the material. Encouraged class participation.

8. How could this instructor improve the course?

* Sometimes the homeworks took a very long time to complete and there were a lot of questions on the exams which I couldn't completely finish on time.

* More organization and preparation

* Make sure you review the slides before coming to class, and trying to plan out examples before hand. Also for the computer part, having some of the code online beforehand would be nice so we wouldn't have to stare at the screen.

* attend more often

* Greater preparation on lectures. Sometimes instructor was unsure about materials.

* i loved it. maybe more computer activities

* First 2/3 are great, then the last third with PERL programming and such begins to lose focus. The last few weeks of class were mainly spent just copying lines of code from overhead screens, which, to me, was a waste of time for the students and professors.

* make it more interesting... it was tempting to skip class because all the information was online and everyone would just play with the computers during class

* It seemed like the course was taught more by the TA, Deb (who was wonderful), than Professor Marth himself.

* None really. I think he's doing his best with the material at hand, and with that, is doing an excellent job.

* The only problem is that the instructor seems very busy with his own research. He missed a couple classes to go to conferences and normally I would go to the TA if I ever needed help, who was great.

* Make a more stable schedule and try to find one text to cover the whole class, or make a packet handout for the class. If we had a step by step handout on the programing parts it would have helped to follow along.

9. Additional comments:

* The way this course was divided was nice, having background info then moving on to more practical applications was good. More time spent on perl and mysql would have been nice, but ending with presentations was a nice finish, although 5 minutes is too short for a presentation...

* Instead of teaching programming, which few, if not none, of us will ever use, it might instead be interesting to examine out some publications/experiments where bioinformatics is used in a practical application.

* The computers in the lab could use some updating.

* i liked the class set-up with the 2 exams and homeworks, as well as the final presentation. i wish there was more programming during the class though because i really enjoyed that.

* none

* Prof. Marth had to miss several class days. I understand his absence but i think he could have had a little better communication with TA Deb, some days they did not seem to be on the same page.


Boston College Course Evaluation Summary Report - BI61601-2008F-tx1 GRADUATE BIOINFORMATICS Marth, Gabor

Group                                                   Total Surveys  Surveys Submitted  Response Rate  Q1    Q2    Q3    Q4    Q5    Q6
2007 Fall Term - BC2008F                                61,596         54,551             88.56%         3.90  4.02  4.18  4.22  4.05  3.35
Arts & Sciences                                         43,257         38,507             89.02%         3.85  3.99  4.15  4.18  4.01  3.35
Biology                                                 4,337          3,801              87.64%         3.43  3.58  3.97  3.92  3.71  3.37
BI61601-2008F-tx1 GRADUATE BIOINFORMATICS Marth, Gabor  13             9                  69.23%         4.56  4.78  4.33  4.44  4.33  3.89

1. What rating does this instructor deserve as a teacher?
   Scale: Poor = 1/Acceptable = 2/Good = 3/Very Good = 4/Excellent = 5
   Response    1        2        3        4        5        N/A      No Response
   Count       0        0        1        2        6        0        0
   Percent     0.00%    0.00%    11.11%   22.22%   66.67%   0.00%    0.00%
   Mean: 4.56   STD: 0.73

2. Regular class attendance was necessary for learning the required content.
   Scale: Strongly Disagree = 1/Disagree = 2/Uncertain = 3/Agree = 4/Strongly Agree = 5
   Response    1        2        3        4        5        N/A      No Response
   Count       0        0        0        2        7        0        0
   Percent     0.00%    0.00%    0.00%    22.22%   77.78%   0.00%    0.00%
   Mean: 4.78   STD: 0.44

3. The course helped me to acquire factual information.
   Scale: Strongly Disagree = 1/Disagree = 2/Uncertain = 3/Agree = 4/Strongly Agree = 5
   Response    1        2        3        4        5        N/A      No Response
   Count       0        1        0        3        5        0        0
   Percent     0.00%    11.11%   0.00%    33.33%   55.56%   0.00%    0.00%
   Mean: 4.33   STD: 1.00

4. The course helped me to understand principles and concepts.
   Scale: Strongly Disagree = 1/Disagree = 2/Uncertain = 3/Agree = 4/Strongly Agree = 5
   Response    1        2        3        4        5        N/A      No Response
   Count       0        1        0        2        6        0        0
   Percent     0.00%    11.11%   0.00%    22.22%   66.67%   0.00%    0.00%
   Mean: 4.44   STD: 1.01

5. The instructor was available for help outside of class.
   Scale: Strongly Disagree = 1/Disagree = 2/Uncertain = 3/Agree = 4/Strongly Agree = 5
   Response    1        2        3        4        5        N/A      No Response
   Count       0        0        1        4        4        0        0
   Percent     0.00%    0.00%    11.11%   44.44%   44.44%   0.00%    0.00%
   Mean: 4.33   STD: 0.71

6. Compared to other courses having the same credits and hours, the time required for this course was:
   Scale: Much Less = 1/Less = 2/The Same = 3/More = 4/Much More = 5
   Response    1        2        3        4        5        N/A      No Response
   Count       0        0        3        4        2        0        0
   Percent     0.00%    0.00%    33.33%   44.44%   22.22%   0.00%    0.00%
   Mean: 3.89   STD: 0.78

7. What are the strengths of this instructor?

* Dr. Marth is a very energetic teacher. Teaching advanced bioinformatics to people who never programmed is challenging and he does a good job to get everyone working and programming. Michele, the TA, was excellent too. She helped teach the MySql section and her homeworks and advice were very well thought and helpful for learning the material.

* The best programming class I have taken. Dr. Marth was very receptive to the pace of the class and did not simply run through examples without explanation.

* Very enthusiastic about the subject; very good at answering questions when asked.

8. How could this instructor improve the course?

* I felt that some basic computer science concepts could have been reviewed at the beginning of class; I had never heard of "loop statements" and "strings" before this class, and I feel that it more or less assumed that students had a grasp on these concepts.

* Enforce the homework policies.

* This course should be split between grads who are getting an emphasis in Bioinformatics and those who are not. I would have liked even more challenging homeworks and advanced topics, and to continue through the semester.

9. Additional comments:

* I thought the TA for this course was exceptionally good!

* Dr. Marth should consider teaching a graduate seminar in bioinformatics!!


Outside teaching information

2007 CSHL course in Revolutionary Sequencing Technologies & Applica... http://meetings.cshl.edu/courses/c-seqtech07.shtml

REVOLUTIONARY SEQUENCING TECHNOLOGIES & APPLICATIONS
November 6 - 17, 2007
Application Deadline: July 15, 2007

Instructors:
Greg Hannon, Cold Spring Harbor Laboratory
Elaine Mardis, Washington University School of Medicine
Gabor Marth, Boston College
W. Richard McCombie, Cold Spring Harbor Laboratory
John McPherson, Baylor College of Medicine
Michael Zody, The Broad Institute

Over the last decade, large-scale DNA sequencing has markedly impacted the practice of modern biology and is beginning to affect the practice of medicine. With the recent introduction of several revolutionary sequencing technologies, costs and timelines have been reduced by orders of magnitude, enabling investigators to conceptualize and perform sequencing-based projects that heretofore were prohibitive. Furthermore, the application of these technologies to answer questions previously not experimentally approachable is broadening their impact and application.

This intensive twelve-day course will explore applications of next-generation sequencing technologies, with a focus on commercially available methods. Students will be instructed in the detailed operation of several revolutionary sequencing platforms, including sample preparation procedures, general data handling through pipelines, and in-depth data analysis. A diverse range of biological questions will be explored, including DNA re-sequencing of human genomic regions (using cancer samples as a test case), de novo DNA sequencing of bacterial genomes, and the use of these technologies in studying small RNAs, among others. Guest lecturers will highlight their own applications of these revolutionary technologies.

We encourage applicants from a diversity of scientific backgrounds including molecular evolution, development, neuroscience, cancer, plant biology and microbiology.

Sponsored equally by Applied Biosystems, Illumina and 454 Life Sciences

Cost (including board and lodging): $2,915

2008 CSHL course in Revolutionary Sequencing Technologies & Applica... http://meetings.cshl.edu/courses/c-seqtech08.shtml

REVOLUTIONARY SEQUENCING TECHNOLOGIES & APPLICATIONS
July 6 - 17, 2008
Application Deadline: March 15, 2008

Instructors:
Elaine Mardis, Washington University School of Medicine
Gabor Marth, Boston College
W. Richard McCombie, Cold Spring Harbor Laboratory
John McPherson, Baylor College of Medicine
Michael Zody, The Broad Institute

Over the last decade, large-scale DNA sequencing has markedly impacted the practice of modern biology and is beginning to affect the practice of medicine. With the recent introduction of several revolutionary sequencing technologies, costs and timelines have been reduced by orders of magnitude, enabling investigators to conceptualize and perform sequencing-based projects that heretofore were prohibitive. Furthermore, the application of these technologies to answer questions previously not experimentally approachable is broadening their impact and application.

This intensive twelve-day course will explore applications of next-generation sequencing technologies, with a focus on commercially available methods. Students will be instructed in the detailed operation of several revolutionary sequencing platforms, including sample preparation procedures, general data handling through pipelines, and in-depth data analysis. A diverse range of biological questions will be explored, including DNA re-sequencing of human genomic regions (using cancer samples as a test case), de novo DNA sequencing of bacterial genomes, and the use of these technologies in studying small RNAs, among others. Guest lecturers will highlight their own applications of these revolutionary technologies.

We encourage applicants from a diversity of scientific backgrounds including molecular evolution, development, neuroscience, cancer, plant biology and microbiology.

Sponsored by Applied Biosystems, Illumina and 454 Life Sciences

Cost (including board and lodging): $3,035
