The raw data

3,000,000,000 bases
A’s, T’s, C’s, G’s and N’s

[Slide background: raw genome sequence in mixed upper and lower case, interleaved with runs of N]
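One way to get a feel for raw sequence of this kind is simply to tally its letters. The sketch below is an illustration only and is not taken from the slides: the function names and the short test string are made up, and it assumes the sequence is held in memory as a plain string in which case does not matter for counting.

```python
from collections import Counter


def base_composition(sequence: str) -> dict:
    """Count A, C, G, T and N in a sequence, ignoring upper/lower case."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ACGTN"}


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C among called bases (N is excluded)."""
    comp = base_composition(sequence)
    called = comp["A"] + comp["C"] + comp["G"] + comp["T"]
    return (comp["G"] + comp["C"]) / called if called else 0.0


# Hypothetical fragment: a gap (N) followed by mixed-case bases.
fragment = "NNNNNagcattaacaAATGAAAACAG"
print(base_composition(fragment))
print(round(gc_fraction(fragment), 3))
```

Excluding N from the denominator is a design choice here: N marks positions with no base call, so counting them would understate the GC fraction of the sequenced part.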

Composition of the human genome

• Nearly half the genome is repeats
• Only approximately 1.5% is known coding genes
• Unknown functional fraction?!

The repeat content

1. Transposition-derived repeats ("jumping")
2. Inactive retroposed cellular genes
3. Simple repeats - microsatellites
4. Segmental duplications
5. Tandem repeats (telomere, centromere)

Fewer than expected genes

GeneSweep – (Wellcome Trust Sanger Institute)
The happy winner: Lee Rowen of the Institute for Systems Biology. 25,947 genes.

Genome complexity

Alternative splicing
• 56% for humans
• 22% for worms

Regulatory elements
• Promoters, enhancers, repressors…

This is where it gets complicated.

Variation among chromosomes
Variation within chromosomes

[Figures; panel labels: Recombination, GC, Gene density. Source: International Human Genome Sequencing Consortium, "Initial sequencing and analysis of the human genome", Nature 409, 860-921 (15 February 2001)]

The genome is non-random in its organisation

• Overall recombination rate depends on chromosome length.
• Large variation in gene density between chromosomes.
• Difference in organisation

Recombination – high at telomeres
GC – variation at many scales – isochores
Gene density – organisation by function

New observations

• Variation at multiple scales within and between chromosomes
• Only twice as many genes as flies and worms – but more proteins
• Genes have arrived from bacteria and transposable elements
• Transposons inactive and LTR probably also (Alus in GC-rich regions)
• Most mutations occur in males (higher mutation rate)
• GC-poor regions correspond to dark bands.
• Recombination rates are higher at telomeres
• Lots of between-individual variation

Completing the Human Genome

Human Genome Project starts: 1990
Draft Human Genome completed: 2001
Gene-rich regions completed: 2003
Each chromosome compiled and annotated: 2006!

Compared with the 2001 draft:
• Fewer gaps: 147,821 down to 341
• More continuity: 81 kb up to 38,500 kb
• Error rate of ~1 per 100,000 bases
• 2.85 billion bases
• Covers ~99% of the euchromatic genome

Go home?

Not quite finished

New builds:
• Build 36, May 2006
• Build 35, May 2004
• Build 34, July 2003
• Build 33, April 2003

[Figure panels: December 2001 - NCBI 28; July 2003 - NCBI 34]

Chromosome 1

Segmental duplications - allow genes to diversify and acquire novel functions.
• Duplication of a gene from one to many positions on the chromosome.
• A pericentric inversion follows a duplication of two genes

Chromosomes 2 and 4

Gene deserts
Megabase-sized genomic segments containing no known coding genes (some show conservation).
Role of these regions?

Chromosome 3

Lowest rate of segmental duplication
Large inversion from our ancestor with chimps.

Lowest recombination rates of all the autosomes

Chromosome 7

Complex repeat patterns and fragile locations
Williams-Beuren syndrome associated with a large deletion (1.6 Mb).
Lots of repetitive and duplicated DNA. What is the true sequence?

“It is characterized by a distinctive, "elfish" facial appearance, along with a low nasal bridge; an unusually cheerful demeanor and ease with strangers, coupled with unpredictably occurring negative outbursts; mental retardation coupled with an unusual facility with language; a love for music; and cardiovascular problems, such as supravalvular aortic stenosis and transient hypercalcemia.”

Chromosome 10

Multi-species alignment – gene involved in cancer
Conservation indicates the location of functional elements.
• Some are known genes.
• Others aren’t – higher levels of conservation!

Chromosome 19

Very high gene density
• Increase in all classes of known genes.
• 26 genes per megabase.

What is special about this chromosome?
• Has high recombination rate.
• And repeat density.
• And GC content.

Chromosomes 12 and 3

Recombination rate variation
• Knowing the physical positions of variants allows recombination rates to be estimated.
• Male and female rates differ
• Fine scale variation
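As a rough illustration of how physical positions feed into recombination rates: if each marker has both a physical position (in bases) and a genetic map position (in centimorgans), the rate in each interval is the genetic distance divided by the physical distance. The sketch below is not from the slides; the function name and the marker values are invented for the example, and it assumes the markers are already sorted along the chromosome.

```python
def recombination_rates(markers):
    """Return (start_bp, end_bp, rate in cM per Mb) for each adjacent marker pair.

    `markers` is a list of (physical_position_bp, genetic_position_cM) tuples,
    assumed to be sorted by physical position.
    """
    rates = []
    for (bp1, cm1), (bp2, cm2) in zip(markers, markers[1:]):
        length_mb = (bp2 - bp1) / 1_000_000
        if length_mb <= 0:
            raise ValueError("markers must be sorted with distinct positions")
        rates.append((bp1, bp2, (cm2 - cm1) / length_mb))
    return rates


# Hypothetical markers: the second interval is a local hotspot.
example = [(1_000_000, 0.0), (3_000_000, 1.0), (4_000_000, 4.0)]
for start, end, rate in recombination_rates(example):
    print(start, end, rate)
```

Running the same calculation on separate male and female genetic maps would give the sex-specific rates, and shrinking the intervals exposes the fine-scale variation.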

Where is the data available?

N.C.B.I. www.ncbi.nlm.nih.gov/genome/guide/human/
• Part of the National Institutes of Health.
• Has a number of important associated projects.
• Mr NCBI – David Lipman.

Ensembl www.ensembl.org/Homo_sapiens/
• A joint project between EMBL and the Sanger Institute.
• Primarily funded by the Wellcome Trust.
• Mr Ensembl – Ewan Birney

UCSC genome.ucsc.edu/cgi-bin/hgGateway
• Based at the University of California Santa Cruz.
• Largely funded by the NHGRI.
• Mr UCSC – David Haussler

What data is available?

• Compositional
  - Base composition
  - Insertion deletions
  - Segmental duplications
  - Repeats
  - Transposable elements
• Functional
  - Genes
  - Regulatory elements
  - Gene expression
• Evolutionary
  - Species comparison
  - Variation data
  - Population genetic analysis

[Screenshot: UCSC Genome Browser track controls. Track groups shown: Mapping and Sequencing Tracks; Phenotype and Disease Associations; Genes and Gene Prediction Tracks; mRNA and EST Tracks; Expression and Regulation]

Orientation

• Human chromosomes are numbered
• Arms are labelled p and q
• Regions labelled ascending from centromere.
• Bases numbered from beginning of small arm to end of long arm.

Annotation - Repeats

Transposable elements
• Make up a large proportion of the genome

Microsatellites and repeats
• Important in many common diseases
• Some of the most polymorphic loci
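A microsatellite is a short unit (a few bases) repeated in tandem many times, which makes it easy to spot with a pattern match. The sketch below is a simplified illustration, not the method used by any of the annotation tracks: the function name, the unit-length limit of 6 and the minimum of 5 copies are arbitrary assumptions, and only perfect repeats are detected.

```python
import re


def find_microsatellites(sequence, max_unit=6, min_repeats=5):
    """Find perfect tandem repeats of a short unit.

    Returns a list of (start_index, unit, copy_number). The unit is matched
    lazily, so the shortest repeating unit is reported.
    """
    seq = sequence.upper()
    # Group 1 is the unit; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Hypothetical sequence containing six copies of CAAA.
print(find_microsatellites("ggt" + "caaa" * 6 + "tg"))
```

Because copy number changes readily between individuals, comparing the reported copy number at the same locus across samples is what makes these loci so polymorphic.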

Annotation - genes

• Different levels of evidence for genes
• Based on homology
• Based on expression
• Based on prediction

[Figure: gene annotation tracks. Labels: mRNA evidence; Protein evidence; Gene prediction; EST evidence; Predicted transcripts (Known, Novel); Manually annotated genes]

Annotation – Expression and Regulation

[Figure panels: Expression Levels & Tissues; Regulatory Elements]

• Regulatory elements might be important in complex diseases
• Microarray technology is generating expression data on a large scale

Expression varies in space and time

Annotation – Evolutionary

Cross Species (issues - alignment)
Within Humans (issues - ascertainment)

Variation is the most important feature of the genome!?

Encyclopedia of DNA Elements - ENCODE

1% of genome
14 manually chosen regions (alpha & beta globin, HOXA, FOXP2 and CFTR)
Plus 30 random regions

• Variation group – SNPs, indels
• Function group – Promoters, transcription and binding
• Chromatin group – Chromatin modification, replication origins
• Multiple sequence alignment – Conservation vs constraint

Aim: Understand everything possible about these regions.

Human Variation

SNPs – most common variation in the human genome
10 million common variants.
Synonymous and non-synonymous variation

• Information in the density of SNPs.
• Information in the frequency of SNPs.
• Information in the correlation between SNPs.

HapMap Project

2002 HapMap phase I begins
  - Three populations
  - (YRI) Yoruba in Ibadan, Nigeria: 90
  - (CEU) Utah, USA: 90
  - (CHB) Han Chinese in Beijing: 45
  - (JPT) Japanese in Tokyo: 44
  - Approximately 1 million SNPs

2005 Phase I complete, phase II begins
  - Increase from 1 million to ~4.6 million

2006 Phase II complete, “phase III” begins
  - Additional 6 populations
  - Kenya, African Americans, Mexican Americans, Italy, India

The International HapMap

Learning from studies of human variation

• Can learn about how genetic diversity is structured across the globe
• Identify regions which have been under recent positive selection
• Identify recombination hotspots

• Linkage disequilibrium
• Population genetic information is an important tool; annotation is often sample specific
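Linkage disequilibrium between two SNPs is commonly summarised as r², the squared correlation between alleles at the two sites. The sketch below computes it from phased haplotypes. It is an illustration under stated assumptions, not HapMap code: the function name and the toy haplotypes are invented, and alleles are assumed to be coded 0/1.

```python
def ld_r_squared(haplotypes):
    """Compute r^2 between two biallelic SNPs.

    `haplotypes` is a list of (allele_at_snp1, allele_at_snp2) pairs,
    one per phased chromosome, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


# Toy samples: alleles always travel together, then fully shuffled.
print(ld_r_squared([(0, 0), (0, 0), (1, 1), (1, 1)]))
print(ld_r_squared([(0, 0), (0, 1), (1, 0), (1, 1)]))
```

Because allele frequencies differ between populations, the same pair of SNPs can give different r² values in different samples, which is one sense in which this kind of annotation is sample specific.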

Hot Topics

• Micro RNAs
  20mers of RNA that perform a diversity of roles – e.g. regulating mRNA levels

• Structural variation
  The genome is full of polymorphic insertions and deletions, from 1 kb to a megabase

• Genome-wide association studies
  Millions of £s being spent on scanning the genome for loci showing association with disease status.

Chromosomes X and Y

Sex chromosomes
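At a single SNP, the simplest version of the association test used in genome-wide scans compares allele counts in cases and controls with a chi-square test on a 2x2 table. The sketch below is a minimal illustration with invented counts and a hypothetical function name. It assumes a plain allelic test with one degree of freedom and ignores issues such as population structure and correction for the very large number of SNPs tested.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Chi-square test of association for one SNP.

    Each argument is (count of allele 1, count of allele 2).
    Returns (chi-square statistic, p-value for 1 degree of freedom).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom the upper tail is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: allele 1 is more common among cases than controls.
print(allelic_chi_square((60, 40), (40, 60)))
```

In a real scan this test is repeated at every genotyped SNP, so a much stricter significance threshold than 0.05 is needed before a locus is reported.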