Reference Genomes + Sequence Comparison
Mike Cherry
Genomics
Genetics 211 - Winter 2015

Homo sapiens GRCh38
Number of regions with alternative loci or patches: 234
Total sequence length: 3,221,487,035
Total assembly gap length: 160,028,125
Gaps between scaffolds: 349
Number of scaffolds: 766
Scaffold N50: 67,794,873
Number of contigs: 1,423
Contig N50: 56,413,054
Total number of chromosomes and plasmids: 25

hg19 to GRCh38
• several thousand corrected bases in both coding and non-coding regions
• 100 assembly gaps closed or reduced
• satellite repeats modeled to fill megabase-sized gaps in centromere regions
• updated mitochondrial reference sequence
• many alternative reference sequences – how to use in mapping?

Practical mapping to non-linear References
First step: GRCh38 - Alternate Loci

GRCh38
• 178 regions with alt loci: 2% of chromosome sequence (61.9 Mb)
• 261 Alt Loci: 3.6 Mb novel sequence relative to chromosomes

Region        Chr   Unique Genes
APOBEC        22    APOBEC3A_B
MHC           6     HLA-DRB2, HLA-DRB3, HLA-DRB4
CCL5_TBC1D3   17    CCL3L1, CCL4L1, CCL3P1
CYP2D6        22    CYP2D6 pseudogene
LRC_KIR       19    LILRA3, KIR2DS1
TRB           7     TRBV5-8

- Typically 10s-100s kb each – anchored to primary
- Many stretches are (near) identical: a few Mb novel in terms of 1000GP decoy from 109 Mb total
- Not all are haplotypes, but is that meaningful…

Alignments of the ALT sequences to the primary are part of the assembly

What’s mapping to repeats in the reference?

Blacklist for hg19
• If the signal artifact (uniquely mapping) is extremely severe (> 1000 fold), we flag the region.
• If the signal artifact is present in most of the tracks, independent of cell line or experiment type, we flag the region.
• If the region has dispersed high-mappability and low-mappability coordinates, then it is more likely to be an artifact region.
• If the region has a known repeat element, then it is more likely to be an artifact.
• We check if the stranded read counts/signal is structured in the UwDNase, UncFAIRE and input/control datasets, i.e.
do we see the offset mirror peaks on the + and - strands that are typically observed in real, functional peaks? If so, we remove these regions from the artifact list.
• If the region exactly overlaps a known gene’s TSS, or is in the vicinity of or within a known gene, it is more likely to be removed from the artifact list. Our intention is to give such regions the benefit of the doubt of being real peaks.
Anshul Kundaje, 2014, A comprehensive collection of signal artifact blacklist regions in the human genome

Blacklist composition
(Kundaje, 2014)

Signal due to unique mapping reads
(Kundaje, 2014)

Alignment of CTCF ChIP-Seq (NA12878) to hg19 reference assembly + Mapping Target that Represents Heterochromatic Regions (Sponge)
chr1:121,179,675-121,374,269
[Figure: MACS peak calling and CTCF binding tracks; peaks A, B and C are labeled True Positive, False Positive and False Positive]
Karen Miga & Jim Kent, in press

Signal due to multi-mapping reads
(Kundaje, 2014)

Reference from a single individual
• GRCh38 is built from multiple individuals and from diploid sources, so the haploid context is difficult to determine
• hydatidiform mole: haploid growth that forms inside the uterus at the beginning of a pregnancy
• Isolated by CHORI and BAC library CHORI-17 in 2002.
• Sanger, Illumina, PacBio, BioNano (optical)
• GRC refers to this as the platinum genome
• Others have built ancestry alternate SNP references

Mus musculus GRCm38 - mm10
Number of regions with alternative loci or patches: 84
Total sequence length: 2,800,055,571
Total assembly gap length: 79,356,534
Gaps between scaffolds: 191
Number of scaffolds: 279
Scaffold N50: 52,589,046
Number of contigs: 780
Contig N50: 32,273,079
Total number of chromosomes and plasmids: 22

Drosophila melanogaster Release 6
Total sequence length: 143,726,002
Total assembly gap length: 1,152,978
Number of contigs: 2,442
Contig N50: 31,485,538
Total number of chromosomes and plasmids: 8

Caenorhabditis elegans WBcel235
Total sequence length: 100,286,401
Total assembly gap length: 0
Number of contigs: 7
Contig N50: 17,493,829
Total number of chromosomes and plasmids: 7

Saccharomyces cerevisiae R64
Total sequence length: 12,157,105
Total assembly gap length: 0
Total number of chromosomes and plasmids: 17

Transitioning between assemblies
• UCSC liftOver software
– start with a BED file
– matching regions between assemblies are identified with BLAT, then the coordinates are altered
– a chain file defines the changes needed between assemblies
• NCBI Genome Remapping Service
– http://www.ncbi.nlm.nih.gov/genome/tools/remap

For processed data you typically do not want to liftOver; rather, remap the reads and then rerun the analysis.

One reference doesn’t fit all.
B. Paten, A. Novak & D. Haussler. (2015) Mapping to a Reference Genome Structure. arXiv. 2014:1–26

Reference Genome in a Graph Representation
(Paten, Novak & Haussler, 2015)

Sequence Comparison

Goals of Sequence Comparison:
• Find similarity such that an inference of homology is justified.
– Similarity = observed with sequence alignment
– Homology = shared evolutionary history (ancestry)
• Find a new sequence (gene) of interest
• Provide biologically appropriate results.
– Substitutions, insertions and deletions
• Compare as many sequences as fast as possible.

Local vs. Global Alignment
• Global Alignment
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |  | |  | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C
• Local Alignment: better alignment to find conserved segment
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

Dynamic Programming Basics for sequence alignment. Smith-Waterman method.
Use scores to complete the matrix, row by row. Add to, or subtract from, the neighboring cell with the highest score, using this order: 1) diagonal 2) up 3) right
Scoring for nucleotides: Match = 2, Gap = -1, Mismatch = -1

     A       C
A    2       2-1=1
C    2-1=1   2+2=4

Points: match +2, mismatch or gap -1
Order: Diagonal, right, down
Rows - to C filled; rows A, C, G, C, T still empty:
    -   T   A   C   T   A   A   C   G   C
-   0   0   0   0   0   0   0   0   0   0
A   0   0   2   1   0   2   2   1   0   0
C   0   0   1   4   3   2   1   4   3   2

Points: match +2, mismatch or gap -1
Order: Diagonal, right, down
Next two rows filled; rows G, C, T still empty:
    -   T   A   C   T   A   A   C   G   C
-   0   0   0   0   0   0   0   0   0   0
A   0   0   2   1   0   2   2   1   0   0
C   0   0   1   4   3   2   1   4   3   2
A   0   0   2   3   3   5   4   3   3   2
C   0   0   1   4   3   4   4   6   5   5

Points: match +2, mismatch or gap -1
Order: Diagonal, right, down
Completed matrix:
    -   T   A   C   T   A   A   C   G   C
-   0   0   0   0   0   0   0   0   0   0
A   0   0   2   1   0   2   2   1   0   0
C   0   0   1   4   3   2   1   4   3   2
A   0   0   2   3   3   5   4   3   3   2
C   0   0   1   4   3   4   4   6   5   5
G   0   0   0   3   3   3   3   5   8   7
C   0   0   0   2   2   2   2   5   7  10
T   0   2   1   1   4   3   2   4   6   9

Points: match +2, mismatch or gap -1
Order: Diagonal, left, up
Traceback through the completed matrix (zero cells not shown on the slide), starting from the highest score, 10, gives the local alignment:
TACTAACGC
 || | |||
 AC-A-CGCT

Five flavors of BLAST
Comparisons   Program   Query     DB type
1             BLASTN    DNA       DNA
1             BLASTP    protein   protein
6             BLASTX    DNA       protein
6             TBLASTN   protein   DNA
36            TBLASTX   DNA       DNA

The BLAST Search Algorithm
query word (W=3)
Query: GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDL
neighborhood words, with neighborhood score threshold (T = 13):
PQG 17
PEG 14
PRG 13
PKG 13
PNG 12
PDG 12
PHG 12
PMG 12
PSG 12
PQN 11
PQA 10
etc ...

Query: 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365
           +LA++L+   TP G R++ +W+  P+ D   + ER +   A
Sbjct: 290 TLASVLDCTVTPKGSRMLKRWLHMPVRDTRVLLERQQTIGA 330
High-scoring Segment Pair (HSP)
(from NCBI Web site)

Raw Scores (S values) from an Alignment
$S = \sum M_{ij} - cO - dG$, where
M = score from a similarity matrix for a particular pair of amino acids (ij)
c = number of gaps
O = penalty for the existence of a gap
d = total length of gaps
G = per-residue penalty for extending the gap
Kerfeld and Scott, PLoS Biology 2011

BLAST Parameters and Constants
S = raw score (scoring matrix derived)
S’ = bit score
E = expected number of chance HSPs with score >= S
λ = constant based on scoring matrix
K = constant based on gap penalty
n = effective length of database
m = effective length of query

BLAST Scoring System
Raw score (S): Sum of scores for each aligned position and scores for gaps
$S = \sum(\text{matches}) - \sum(\text{mismatches}) - \sum(\text{gap penalties})$
note: this score varies with the scoring matrix used and thus may not be meaningfully compared across different searches
Bit score (S’): Version of the raw score that is normalized by the scale of the scoring matrix (λ) and the scale of the gap penalty (K)
$S' = (\lambda S - \ln K) / \ln 2$
note: because it is normalized, the bit score can be meaningfully compared across searches
E value: Number of alignments with bit score S’ or better that one would expect to find by chance in a search of a database of the same size
$E = mn \, 2^{-S'}$
m = effective length of query sequence
n = effective length of database
note: E values may change if databases of different sizes are searched

BLAST output
[Figure: BLAST output annotated to show where S’, S, E, n, λ, K and m appear]

E values or p values
$p = 1 - e^{-E}$
Very small E values are very similar to p values, because $e^{-E} \approx 1 - E$ when E is small.
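
The scoring formulas above (bit score, E value, p value) can be chained in a few lines. The following is a minimal Python sketch, not how BLAST itself is implemented. The function names are hypothetical, and the numeric values of λ, K, m and n in the demo are illustrative assumptions that do not come from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, m, n):
    """E = m * n * 2^(-S'), with m and n the effective query and database lengths."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """p = 1 - e^(-E); expm1 keeps precision when E is very small."""
    return -math.expm1(-e)


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from the slides).
    lam, k = 0.267, 0.041
    m, n = 250, 1_000_000_000
    for raw in (50, 100, 200):
        bits = bit_score(raw, lam, k)
        e = e_value(bits, m, n)
        print(f"S={raw:4d}  S'={bits:7.1f}  E={e:.3g}  p={p_value(e):.3g}")
```

Substituting the bit score into the E value formula gives E = K·m·n·e^(−λS), so E shrinks exponentially as the raw score grows, and for small E the printed p value is essentially equal to E.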
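
Returning to the Smith-Waterman example from the dynamic programming slides, the matrix fill and traceback can be sketched in Python. This is a minimal illustration under the slide's scoring (match +2, mismatch -1, gap -1, scores floored at 0). The function name and its return layout are hypothetical. The traceback here follows the recurrence with the order diagonal, left, up, so where two paths score equally it may place a gap in a different column than the slide does, while the score is the same.

```python
def smith_waterman(row_seq, col_seq, match=2, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned_col_seq, aligned_row_seq, matrix)."""
    n_rows, n_cols = len(row_seq) + 1, len(col_seq) + 1
    H = [[0] * n_cols for _ in range(n_rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix row by row; each cell takes the best of
    # diagonal (match/mismatch), up (gap), left (gap), or 0.
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            sub = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while H[i][j] > 0:
        sub = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:      # diagonal
            top.append(col_seq[j - 1])
            bottom.append(row_seq[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i][j - 1] + gap:        # left: gap in row sequence
            top.append(col_seq[j - 1])
            bottom.append("-")
            j -= 1
        else:                                     # up: gap in column sequence
            top.append("-")
            bottom.append(row_seq[i - 1])
            i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom)), H


if __name__ == "__main__":
    score, top, bottom, H = smith_waterman("ACACGCT", "TACTAACGC")
    print(score)
    print(top)
    print(bottom)
```

With the two sequences from the slides (rows ACACGCT, columns TACTAACGC) the filled matrix should agree with the completed matrix shown earlier, with a best local score of 10.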