Applied Bioinformatics (Of Nucleic Acid Sequences)

Applied Bioinformatics of Nucleic Acid Sequences
David A. Hendrix
January 9, 2020

For the students and learners of the world.

Contents

1 Introduction to Biological Sequences, Biopython, and GNU/Linux
  1.1 Nucleic Acid Bioinformatics
    1.1.1 GNU/Linux and the command line
  1.2 Sequences, Strings, and the Genetic Code
    1.2.1 Introduction to Sequences and Biopython
    1.2.2 The Central Dogma
    1.2.3 Subsequences and Reverse Complement
  1.3 Sequences File Formats
    1.3.1 FASTA
    1.3.2 FASTQ
    1.3.3 GNU/Linux and Sequence Files
  1.4 Lab 1: Introduction to GNU/Linux and FASTA files
  1.5 Biological Sequence Databases
    1.5.1 NCBI
    1.5.2 Ensembl
    1.5.3 UCSC Genome Bioinformatics
    1.5.4 Uniprot
  1.6 Lab 2: FASTQ and Quality Scores

2 Sequence Motifs
  2.1 Introduction to Motifs
  2.2 String Matching
  2.3 Consensus Sequences
    2.3.1 Searching Consensus Sequences with Biopython
  2.4 Motif Finding
    2.4.1 Sequence Complexity
    2.4.2 Weight Matrices
    2.4.3 Relative Entropy
    2.4.4 Building a Weight Matrix
    2.4.5 Biopython Motifs
  2.5 Promoters
    2.5.1 Core Promoters
    2.5.2 Databases of Promoters/TSSs
  2.6 De novo Motif Finding
    2.6.1 Gibbs Sampling
    2.6.2 MEME and the EM Algorithm
  2.7 Lab 3: Introduction to Motifs
    2.7.1 Part 1: Building a motif and LOGO image
    2.7.2 Part 2: JASPAR Database and "sites" format
    2.7.3 Part 3: Running MEME on the command line

3 Sequence Alignments
  3.1 Alignment Algorithms and Dynamic Programming
    3.1.1 Needleman-Wunsch Algorithm
    3.1.2 Smith-Waterman
    3.1.3 Comparison
    3.1.4 Aligning DNA vs Proteins
  3.2 Alignment Software
    3.2.1 BLAST: Basic Local Alignment Search Tool
  3.3 Alignment Statistics
    3.3.1 Running BLAST from the command line
  3.4 Short Read Mapping
  3.5 Lab 4: Using BLAST on the command line
    3.5.1 Part 1: BLASTing to a protein database
    3.5.2 Biopython and BLAST (optional)
    3.5.3 Part 2: BLASTing to a genome

4 Multiple Sequence Alignments, Molecular Evolution, and Phylogenetics
  4.1 Multiple Sequence Alignment
    4.1.1 MSA Methods
    4.1.2 MSA File Formats
  4.2 Phylogenetic Trees
    4.2.1 Representing a Phylogenetic Tree
    4.2.2 Pairwise Distances
  4.3 Models of mutations
    4.3.1 Genetic Drift
    4.3.2 Substitution Models
    4.3.3 Jukes-Cantor 1969 (JC69)
    4.3.4 Kimura 1980 model (K80)
    4.3.5 Felsenstein 1981 model (F81)
    4.3.6 The Hasegawa, Kishino and Yano model (HKY85)
    4.3.7 Generalized Time-Reversible Model
    4.3.8 Building Phylogenetic Trees
    4.3.9 Evaluating the Quality of a Phylogenetic Tree
    4.3.10 Tree Searching
  4.4 Lab 5: Phylogenetics
    4.4.1 Download Sequences from NCBI
    4.4.2 Create a Multiple Sequence Alignment and Phylogenetic Tree with Clustalw
    4.4.3 Create a Multiple Sequence Alignment and Phylogenetic Tree with phyML

5 Genomics
  5.1 The Three Fundamental "Gotchas" of Genomics
    5.1.1 Different Genome Assemblies/Annotations
    5.1.2 Different Chromosome Deflines
    5.1.3 0 vs 1-based coordinates
  5.2 Genomic Data and File Formats
    5.2.1 Formats for Genomic Locations
    5.2.2 Quantitative Tracks
  5.3 Genome Browsers
    5.3.1 IGV
    5.3.2 UCSC Genome Browser
    5.3.3 Gbrowse
    5.3.4 JBrowse
  5.4 Lab 6: Genome Annotation Data
    5.4.1 Part I: Storing a Genome to a Dictionary
    5.4.2 Part II: Storing a GFF to a list
    5.4.3 Part III: Find a gene of interest in Drosophila melanogaster

6 Transcriptomics
  6.1 High-throughput Sequencing (HTS)
  6.2 RNA Deep Sequencing
  ...
