The Bioinformatics Roadshow Tórshavn, The Faroe Islands 28-29 November 2012

The 1000 Genomes Project

Bert Overduin, Ph.D.
Vertebrate Genomics Team
EMBL - European Bioinformatics Institute
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

EBI is an Outstation of the European Molecular Biology Laboratory.

Outline

• Introduction
• Pilot phase
• Phase 1
• Data access
• Tools

Aim

“The aim of the 1000 Genomes Project is to discover, genotype and provide accurate haplotype information on all forms of human DNA polymorphism in multiple human populations.

Specifically, the goal is to characterize over 95% of variants that are in genomic regions accessible to current high-throughput sequencing technologies and that have allele frequency of 1% or higher in each of five major population groups (populations in or with ancestry from Europe, East Asia, South Asia, West Africa and the Americas).”

Pilot phase

Aim: To develop and assess multiple strategies to detect and genotype variants of various types and frequencies using high-throughput sequencing.

Strategy:
• Low-coverage (2-6x) whole-genome sequencing of 179 individuals from four populations (YRI, CEU, CHB, JPT)
• High-coverage (average 42x) sequencing of two mother-father-child trios (YRI, CEU)
• Exon-targeted sequencing (average >50x) of 697 individuals from seven populations (YRI, LWK, CEU, TSI, CHB, JPT, CHD)

Pilot results

• Robust protocols now exist for generating both whole-genome shotgun and targeted sequence data.
• Algorithms to detect variants from each of these designs have been validated.
• Low-coverage sequencing offers an efficient approach to detect variation genome wide, whereas targeted sequencing offers an efficient approach to detect and accurately genotype rare variants in regions of functional interest (such as exons).

Nature 467, 1061-1073, doi:10.1038/nature09534 (2010).
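The trade-off between low-coverage and deep sequencing can be illustrated with a back-of-the-envelope calculation. The sketch below is not the project's power analysis. It assumes that the number of reads covering each haplotype copy is Poisson distributed with mean coverage/2, ignores sequencing and mapping error, and treats a variant as detected once at least `min_reads` reads across all samples carry the non-reference allele. The function name and the `min_reads=2` threshold are illustrative assumptions.

```python
import math


def detection_probability(coverage, allele_copies, min_reads=2):
    """P(at least min_reads variant-supporting reads across all carriers).

    Assumption: each haplotype copy is covered by Poisson(coverage / 2) reads,
    so the total number of variant reads is Poisson(allele_copies * coverage / 2).
    """
    mean_reads = allele_copies * coverage / 2.0
    # P(X < min_reads) for X ~ Poisson(mean_reads)
    p_below = sum(
        math.exp(-mean_reads) * mean_reads ** k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_below


# Compare 4x (low coverage) with 42x (trio coverage) for rare and commoner variants.
for copies in (1, 2, 5, 10):
    low = detection_probability(4, copies)
    deep = detection_probability(42, copies)
    print(copies, round(low, 3), round(deep, 3))
```

Under these simplifying assumptions a variant present on a single haplotype is detected at 4x in roughly 59% of cases, while a variant carried on several haplotypes in the sample is almost always seen. This matches the qualitative point above: low coverage across many individuals suits shared variation, and deep sequencing is needed to genotype rare variants accurately.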

Production phase

• Low-coverage whole-genome sequencing
• Array-based genotyping
• Deep targeted sequencing of all coding regions

• 2,500 individuals from five large regions in the world

• Phase 1: Initial round of low-coverage and exome sequencing of 1,092 individuals ✔
• Phase 2: Sequencing of expanded set of 1,700 individuals and method improvement (in progress)
• Phase 3: Sequencing of 2,500 individuals and a final variation catalogue (in progress)

Additional populations in Phase 2/3:
• Chinese Dai in Xishuangbanna (CDX), Kinh in Ho Chi Minh City, Vietnam (KHV)
• Gambian in Western Division, The Gambia (GWD), Mende in Sierra Leone (MSL), Esan in Nigeria (ESN)
• African Caribbean in Barbados (ACB), Peruvian in Lima, Peru (PEL)
• Gujarati Indian in Houston, TX (GIH), Punjabi in Lahore, Pakistan (PJL), Bengali in Bangladesh (BEB), Sri Lankan Tamil in the UK (STU), Indian Telugu in the UK (ITU)

Power to detect SNPs as a function of variant count (and proportion) across the entire set of samples, estimated by comparison to independent SNP array data in the exome (green) and whole genome (blue).

Nature 491, 56-65, doi:10.1038/nature11632 (2012).

Phase 1 results

• Discovered and genotyped 38 million SNPs, 1.4 million bi-allelic indels and 14,000 large deletions.
• Individuals from different populations carry different profiles of rare and common variants.
• Low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection.
• Evolutionary conservation and coding consequence are key determinants of the strength of purifying selection.
• Rare-variant load varies substantially across biological pathways.
• Each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites.

Data access

• Project website
  http://www.1000genomes.org/
• FTP site
  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp
  http://ftp.1000genomes.ebi.ac.uk/vol1/ftp
  ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp
• Amazon Web Services cloud
  http://aws.amazon.com/1000genomes/
• Browser
  http://browser.1000genomes.org
  http://pilotbrowser.1000genomes.org
  http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/

FTP site

• Sequence data (FASTQ)
• Alignment data (BAM)
• Variant calls (VCF)
• Reference genome (FASTA)
• Annotation sets (BED / GTF)
• FTP search
  http://www.1000genomes.org/ftpsearch
• current.tree file
  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree

• Can be accessed using Aspera http://asperasoft.com/
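Because the FTP site is large, it is often easier to search a downloaded copy of the current.tree index than to browse directories. A minimal sketch follows, assuming the index is tab-separated text with the file path in the first column and an entry type (`file` or `directory`) in the second. That layout is an assumption and should be checked against the real file. The function name, sample lines and paths are made up for illustration.

```python
def find_files(tree_lines, suffix, path_contains=""):
    """Return paths of 'file' entries that end in suffix and contain a substring."""
    matches = []
    for line in tree_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        path, entry_type = fields[0], fields[1]
        if entry_type == "file" and path.endswith(suffix) and path_contains in path:
            matches.append(path)
    return matches


# Hypothetical excerpt in the assumed layout (path, type, then further columns).
SAMPLE_TREE = [
    "ftp/example_dir\tdirectory\t4096\n",
    "ftp/example_dir/sample_A.chr20.bam\tfile\t123456\n",
    "ftp/example_dir/sample_A.chr20.bam.bai\tfile\t789\n",
    "ftp/example_dir/calls.chr20.vcf.gz\tfile\t4567\n",
]

print(find_files(SAMPLE_TREE, ".bam"))
print(find_files(SAMPLE_TREE, ".vcf.gz", path_contains="chr20"))
```

With a local copy of the index, the same function can be given an open file object instead of the sample list, since it only iterates over lines.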

Tools

• Data Slicer
  Get a subset of data from a BAM or VCF file.
• Variant Effect Predictor
  Analyse your own variants and predict the functional consequences of known and unknown variants.
• Variation Pattern Finder
  Identify variation patterns in a chromosomal region of interest for different individuals.
• VCF to PED Converter
  Parse a VCF file to create a linkage pedigree file (PED) and a marker information file, which together may be loaded into LD visualization tools like Haploview.
• http://browser.1000genomes.org/tools.html

Nature Methods 9, 459-462, doi:10.1038/nmeth.1974 (2012).
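What a region slice of a VCF file amounts to can be sketched in a few lines. The example below is not the Data Slicer's implementation. It is a minimal illustration, assuming an uncompressed VCF held as a list of text lines, that keeps the header lines plus the records whose position falls inside a 1-based inclusive region. A second helper computes the non-reference allele frequency of a record from its `GT` fields. The function names and the three-sample records are hypothetical.

```python
def slice_vcf(lines, chrom, start, end):
    """Keep header lines plus records on chrom with start <= POS <= end."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        fields = line.split("\t")
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            kept.append(line)
    return kept


def alt_allele_frequency(record):
    """Fraction of called alleles that are non-reference, from the GT fields."""
    fields = record.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # FORMAT is column 9
    alt = called = 0
    for sample in fields[9:]:
        genotype = sample.split(":")[gt_index]
        for allele in genotype.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing call
            called += 1
            if allele != "0":
                alt += 1
    return alt / called if called else 0.0


# Hypothetical three-sample VCF used only for illustration.
VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "20\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1",
    "20\t5000\t.\tC\tT\t.\tPASS\t.\tGT\t0|0\t0|0\t0|1",
    "21\t1500\t.\tG\tA\t.\tPASS\t.\tGT\t0|1\t0|0\t0|0",
]

subset = slice_vcf(VCF, "20", 1, 2000)
records = [line for line in subset if not line.startswith("#")]
print(len(records), alt_allele_frequency(records[0]))
```

For real project files, which are compressed and indexed, a linear scan like this would be slow; the sketch only shows the filtering logic, not how an indexed lookup works.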

Help & keeping in touch

• Helpdesk
  [email protected]
• FAQs
  http://www.1000genomes.org/faq
• Mailing list
  http://listserver.1000genomes.org/mailman/listinfo/1000announce
• Twitter
  http://twitter.com/1000genomes
• RSS feed
  http://www.1000genomes.org/announcements/rss.xml

Acknowledgements

The 1000 Genomes Consortium (close to 800 people from more than 100 different groups)

Data Coordination Centre (DCC) at EBI:
• Paul Flicek
• Laura Clarke
• Richard Smith
• Ian Streeter
• Holly Zheng Bradley

Other large-scale sequencing projects

• 1001 Genomes: 1,001 Arabidopsis thaliana strains
  http://www.1001genomes.org/
• UK10K: 10,000 people in the UK
  http://www.uk10k.org/
• Genome 10K: 10,000 vertebrate species
  http://genome10k.soe.ucsc.edu/
• i5k: 5,000 insect and related arthropod species
  http://www.arthropodgenomes.org/wiki/i5K

etc. etc.