Homology & Alignment
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Mouse Kcnip2 Conditional Knockout Project (CRISPR/Cas9)
https://www.alphaknockout.com Mouse Kcnip2 Conditional Knockout Project (CRISPR/Cas9) Objective: To create a Kcnip2 conditional knockout Mouse model (C57BL/6J) by CRISPR/Cas-mediated genome engineering. Strategy summary: The Kcnip2 gene (NCBI Reference Sequence: NM_145703 ; Ensembl: ENSMUSG00000025221 ) is located on Mouse chromosome 19. 10 exons are identified, with the ATG start codon in exon 1 and the TAG stop codon in exon 10 (Transcript: ENSMUST00000162528). Exon 4 will be selected as conditional knockout region (cKO region). Deletion of this region should result in the loss of function of the Mouse Kcnip2 gene. To engineer the targeting vector, homologous arms and cKO region will be generated by PCR using BAC clone RP23-98F2 as template. Cas9, gRNA and targeting vector will be co-injected into fertilized eggs for cKO Mouse production. The pups will be genotyped by PCR followed by sequencing analysis. Note: Mice homozygous for disruptions in this gene are susceptible to induced cardiac arrhythmias but are otherwise normal. Exon 4 starts from about 27.65% of the coding region. The knockout of Exon 4 will result in frameshift of the gene. The size of intron 3 for 5'-loxP site insertion: 574 bp, and the size of intron 4 for 3'-loxP site insertion: 532 bp. The size of effective cKO region: ~625 bp. The cKO region does not have any other known gene. Page 1 of 8 https://www.alphaknockout.com Overview of the Targeting Strategy Wildtype allele gRNA region 5' gRNA region 3' 1 2 3 4 5 6 7 8 9 10 Targeting vector Targeted allele Constitutive KO allele (After Cre recombination) Legends Exon of mouse Kcnip2 Homology arm cKO region loxP site Page 2 of 8 https://www.alphaknockout.com Overview of the Dot Plot Window size: 10 bp Forward Reverse Complement Sequence 12 Note: The sequence of homologous arms and cKO region is aligned with itself to determine if there are tandem repeats. -
To Find Information About Arabidopsis Genes Leonore Reiser1, Shabari
UNIT 1.11 Using The Arabidopsis Information Resource (TAIR) to Find Information About Arabidopsis Genes Leonore Reiser1, Shabari Subramaniam1, Donghui Li1, and Eva Huala1 1Phoenix Bioinformatics, Redwood City, CA USA ABSTRACT The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) is a comprehensive Web resource of Arabidopsis biology for plant scientists. TAIR curates and integrates information about genes, proteins, gene function, orthologs gene expression, mutant phenotypes, biological materials such as clones and seed stocks, genetic markers, genetic and physical maps, genome organization, images of mutant plants, protein sub-cellular localizations, publications, and the research community. The various data types are extensively interconnected and can be accessed through a variety of Web-based search and display tools. This unit primarily focuses on some basic methods for searching, browsing, visualizing, and analyzing information about Arabidopsis genes and genome, Additionally we describe how members of the community can share data using TAIR’s Online Annotation Submission Tool (TOAST), in order to make their published research more accessible and visible. Keywords: Arabidopsis ● databases ● bioinformatics ● data mining ● genomics INTRODUCTION The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) is a comprehensive Web resource for the biology of Arabidopsis thaliana (Huala et al., 2001; Garcia-Hernandez et al., 2002; Rhee et al., 2003; Weems et al., 2004; Swarbreck et al., 2008, Lamesch, et al., 2010, Berardini et al., 2016). The TAIR database contains information about genes, proteins, gene expression, mutant phenotypes, germplasms, clones, genetic markers, genetic and physical maps, genome organization, publications, and the research community. In addition, seed and DNA stocks from the Arabidopsis Biological Resource Center (ABRC; Scholl et al., 2003) are integrated with genomic data, and can be ordered through TAIR. -
BIO4342 Exercise 2: Browser-Based Annotation and RNA-Seq Data
BIO4342 Exercise 2: Browser-Based Annotation and RNA-Seq Data Jeremy Buhler March 15, 2010 This exercise continues your introduction to practical issues in comparative annotation. You’ll be annotating genomic sequence from the dot chromosome of Drosophila mojavensis using your knowledge of BLAST and some improved visualization tools. You’ll also consider how best to integrate information from high-throughput sequencing of expressed RNA. 1 Getting Started To begin, go to our local genome browser at http://gander.wustl.edu/. Select “Genome Browser” from the left-side menu and choose the “Improved Dot” assembly of D. mojavensis for viewing. Finally, hit submit to start looking at the sequence. The entire dot assembly is about 1.69 megabases in length; zoom out to see everything. This assembly is built from a set of overlapping fosmid clones prepared for the 2009 edition of BIO 4342. We’ve added a variety of information to the genome browser to help you annotate, such as: • gene-structure predictions from several different tools; • repeats annotated using the RepeatMasker program; • BLAST hits to D. melanogaster proteins; • RNA-Seq data, which we’ll describe in more detail later. Having all this evidence available at once is somewhat overwhelming. To keep the view to a manageable level, I’d suggest that you initially set all the gene prediction tracks (Genscan, Nscan, SNAP, Geneid), as well as the repeat tracks, to “dense” mode, so that each displays on a single line. Set the BLAST hit track (called “D. mel proteins”) to “pack” to see the locations of all BLAST hits, and set the “RNA-Seq Coverage” track to “full” and the “TopHat junctions” track to “pack” to get a detailed view of these results. -
BLAT—The BLAST-Like Alignment Tool
Resource BLAT—The BLAST-Like Alignment Tool W. James Kent Department of Biology and Center for Molecular Biology of RNA, University of California, Santa Cruz, Santa Cruz, California 95064, USA Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. BLAT’s speed stems from an index of all nonoverlapping K-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly. BLAT has several major stages. It uses the index to find regions in the genome likely to be homologous to the query sequence. It performs an alignment between homologous regions. It stitches together these aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible. This paper describes how BLAT was optimized. Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches. BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications. http://genome.ucsc.edu hosts a web-based BLAT server for the human genome. Some might wonder why in the year 2002 the world needs sions on any number of perfect or near-perfect hits. Where another sequence alignment tool. -
A Multithread Blat Algorithm Speeding up Aligning Sequences to Genomes Meng Wang and Lei Kong*
Wang and Kong BMC Bioinformatics (2019) 20:28 https://doi.org/10.1186/s12859-019-2597-8 SOFTWARE Open Access pblat: a multithread blat algorithm speeding up aligning sequences to genomes Meng Wang and Lei Kong* Abstract Background: The blat is a widely used sequence alignment tool. It is especially useful for aligning long sequences and gapped mapping, which cannot be performed properly by other fast sequence mappers designed for short reads. However, the blat tool is single threaded and when used to map whole genome or whole transcriptome sequences to reference genomes this program can take days to finish, making it unsuitable for large scale sequencing projects and iterative analysis. Here, we present pblat (parallel blat), a parallelized blat algorithm with multithread and cluster computing support, which functions to rapidly fine map large scale DNA/RNA sequences against genomes. Results: The pblat algorithm takes advantage of modern multicore processors and significantly reduces the run time with the number of threads used. pblat utilizes almost equal amount of memory as when running blat. The results generated by pblat are identical with those generated by blat. The pblat tool is easy to install and can run on Linux and Mac OS systems. In addition, we provide a cluster version of pblat (pblat-cluster) running on computing clusters with MPI support. Conclusion: pblat is open source and free available for non-commercial users. It is easy to install and easy to use. pblat and pblat-cluster would facilitate the high-throughput mapping of large scale genomic and transcript sequences to reference genomes with both high speed and high precision. -
Assembly Exercise
Assembly Exercise Turning reads into genomes Where we are • 13:30-14:00 – Primer Design to Amplify Microbial Genomes for Sequencing • 14:00-14:15 – Primer Design Exercise • 14:15-14:45 – Molecular Barcoding to Allow Multiplexed NGS • 14:45-15:15 – Processing NGS Data – de novo and mapping assembly • 15:15-15:30 – Break • 15:30-15:45 – Assembly Exercise • 15:45-16:15 – Annotation • 16:15-16:30 – Annotation Exercise • 16:30-17:00 – Submitting Data to GenBank Log onto ILRI cluster • Log in to HPC using ILRI instructions • NOTE: All the commands here are also in the file - assembly_hands_on_steps.txt • If you are like me, it may be easier to cut and paste Linux commands from this file instead of typing them in from the slides Start an interactive session on larger servers • The interactive command will start a session on a server better equipped to do genome assembly $ interactive • Switch to csh (I use some csh features) $ csh • Set up Newbler software that will be used $ module load 454 A norovirus sample sequenced on both 454 and Illumina • The vendors use different file formats unknown_norovirus_454.GACT.sff unknown_norovirus_illumina.fastq • I have converted these files to additional formats for use with the assembly tools unknown_norovirus_454_convert.fasta unknown_norovirus_454_convert.fastq unknown_norovirus_illumina_convert.fasta Set up and run the Newbler de novo assembler • Create a new de novo assembly project $ newAssembly de_novo_assembly • Add read data to the project $ addRun de_novo_assembly unknown_norovirus_454.GACT.sff -
A Dissertation
A Dissertation entitled Strategies for Membrane Protein Studies and Structural Characterization of a Metabolic Enzyme for Antibiotic Development by Buenafe T. Arachea Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Doctor of Philosophy Degree in Chemistry Dr. Ronald E. Viola, Committee Chair Dr. Max O. Funk, Committee Member Dr. Donald Ronning, Committee Member Dr. Marcia McInerney, Committee Member Dr. Patricia R. Komuniecki, Dean College of Graduate Studies The University of Toledo August 2011 Copyright © 2011, Buenafe T. Arachea This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author. An Abstract of Strategies for Membrane Protein Studies and Structural Characterization of a Metabolic Enzyme for Antibiotic Development by Buenafe T. Arachea Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Doctor of Philosophy Degree in Chemistry The University of Toledo August 2011 Membrane proteins are essential in a variety of cellular functions, making them viable targets for drug development. However, progress in the structural elucidation of membrane proteins has proven to be a difficult task, thus limiting the number of published structures of membrane proteins as compared with the enormous structural information obtained from soluble proteins. The challenge in membrane protein studies lies in the production of the required sample for characterization, as well as in developing methods to effectively solubilize and maintain a functional and stable form of the target protein during the course of crystallization. To address these issues, two different approaches were explored for membrane protein studies. -
The Uniprot Knowledgebase BLAST
Introduction to bioinformatics The UniProt Knowledgebase BLAST UniProtKB Basic Local Alignment Search Tool A CRITICAL GUIDE 1 Version: 1 August 2018 A Critical Guide to BLAST BLAST Overview This Critical Guide provides an overview of the BLAST similarity search tool, Briefly examining the underlying algorithm and its rise to popularity. Several WeB-based and stand-alone implementations are reviewed, and key features of typical search results are discussed. Teaching Goals & Learning Outcomes This Guide introduces concepts and theories emBodied in the sequence database search tool, BLAST, and examines features of search outputs important for understanding and interpreting BLAST results. On reading this Guide, you will Be aBle to: • search a variety of Web-based sequence databases with different query sequences, and alter search parameters; • explain a range of typical search parameters, and the likely impacts on search outputs of changing them; • analyse the information conveyed in search outputs and infer the significance of reported matches; • examine and investigate the annotations of reported matches, and their provenance; and • compare the outputs of different BLAST implementations and evaluate the implications of any differences. finding short words – k-tuples – common to the sequences Being 1 Introduction compared, and using heuristics to join those closest to each other, including the short mis-matched regions Between them. BLAST4 was the second major example of this type of algorithm, From the advent of the first molecular sequence repositories in and rapidly exceeded the popularity of FastA, owing to its efficiency the 1980s, tools for searching dataBases Became essential. DataBase searching is essentially a ‘pairwise alignment’ proBlem, in which the and Built-in statistics. -
Multiple Alignments, Blast and Clustalw
Multiple alignments, blast and clustalW 1. Blast idea: a. Filter out low complexity regions (tandem repeats… that sort of thing) [optional] b. Compile list of high-scoring strings (words, in BLAST jargon) of fixed length in query (threshold T) c. Extend alignments (highs scoring pairs) d. Report High Scoring pairs: score at least S (or an E value lower than some threshold) 2. Multiple Sequence Alignments: a. Attempts to extend dynamic programming techniques to multiple sequences run into problems after only a few proteins (8 average proteins were a problem early in 2000s) b. Heuristic approach c. Idea (Progressive Approach): i. homologous sequences are evolutionarily related ii. Build multiple alignment by series of pairwise alignments based off some phylogenetic tree (the initial tree or the guide tree ) iii. Add in more distantly related sequences d. Progressive Sequence alignment example: 1. NYLS & NKYLS: N YLS N(K|-)YLS NKYLS 2. NFS & NFLS: N YLS NF S NF(L|-)S NKYLS NFLS 3. N(K|-)YLS & NF(L|-)S N YLS N(K|-)(Y|F)(L|-)S NKYLS N YLS N F S N FLS e. Assessment: i. Works great for fairly similar sequences ii. Not so well for highly divergent ones f. Two Problems: i. local minimum problem: Algorithm greedily adds sequences based off of tree— might miss global solution ii. Alignment parameters: Mistakes (misaligned regions) early in procedure can’t be corrected later. g. ClustalW does multiple alignments and attempts to solve alignment parameter problem i. gap costs are dynamically varied based on position and amino acid ii. weight matrices are changed as the level of divergence between sequence increases (say going from PAM30 -> PAM60) iii. -
BLAST Practice
Using BLAST BLAST (Basic Local Alignment Search Tool) is an online search tool provided by NCBI (National Center for Biotechnology Information). It allows you to “find regions of similarity between biological sequences” (nucleotide or protein). The NCBI maintains a huge database of biological sequences, which it compares the query sequences to in order to find the most similar ones. Using BLAST, you can input a gene sequence of interest and search entire genomic libraries for identical or similar sequences in a matter of seconds. The amount of information on the BLAST website is a bit overwhelming — even for the scientists who use it on a frequent basis! You are not expected to know every detail of the BLAST program. BLAST results have the following fields: E value: The E value (expected value) is a number that describes how many times you would expect a match by chance in a database of that size. The lower the E value is, the more significant the match. Percent Identity: The percent identity is a number that describes how similar the query sequence is to the target sequence (how many characters in each sequence are identical). The higher the percent identity is, the more significant the match. Query Cover: The query cover is a number that describes how much of the query sequence is covered by the target sequence. If the target sequence in the database spans the whole query sequence, then the query cover is 100%. This tells us how long the sequences are, relative to each other. FASTA format FASTA format is used to represent either nucleotide or peptide sequences. -
Tutorial 1: Exploring the UCSC Genome Browser
Last updated: May 22, 2013 Tutorial 1: Exploring the UCSC Genome Browser Open the homepage of the UCSC Genome Browser at: http://genome.ucsc.edu/ In the blue bar at the top, click on the Genomes link. Change the drop down menus so that clade is set to Vertebrate, genome is set to Human and assembly is set to Feb. 2009 (GRCh37/hg19). Then type in “adam2” in the position or search term window and click submit. On the subsequent page, click on the top link under the UCSC Genes section. This should bring up a window that looks similar to that shown in Figure 1: Figure 1: ADAM2 as displayed in the UCSC genome browser with default tracks. The UCSC Genome browser is highly configurable. If your view looks much different than Fig. 1, try clicking the default tracks button below the graphic window. This is a complicated window, so we will go through it from top to bottom: 1. In the row labeled move, are buttons that allow you to move left to right on the chromosome as well as zoom in and out on the information. 2. The position window shows the genomic coordinates of the chromosome sequence displayed. You can edit this manually or use the move controls to zoom in and out or change positions. The entire chromosome can be viewed at once, though this takes a very long time to load. To the right of the jump and clear buttons, the size the genomic region being viewed is listed. The view is taken up by the ADAM2 gene, corresponding to about 94Kb in length, not atypical for a mammalian gene. -
Genome-Wide DNA Methylation in Chronic Myeloid Leukaemia
Genome-wide DNA Methylation in Chronic Myeloid Leukaemia Alexandra Bazeos Imperial College London Department of Medicine Centre for Haematology Thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy of Imperial College London 2015 1 Abstract Epigenetic alterations occur frequently in leukaemia and might account for differences in clinical phenotype and response to treatment. Despite the consistent presence of the BCR-ABL1 fusion gene in Philadelphia-positive chronic myeloid leukaemia (CML), the clinical course of patients treated with tyrosine kinase inhibitors (TKI) is heterogeneous. This might be due to differing DNA methylation profiles between patients. Therefore, a validated, epigenome-wide survey in CML CD34+ progenitor cells was performed in newly diagnosed chronic phase patients using array-based DNA methylation and gene expression profiling. In practice, the CML DNA methylation signature was remarkably homogeneous; it differed from CD34+ cells of normal persons and did not correlate with an individual patient’s response to TKI therapy. Using a meta-analysis tool it was possible to demonstrate that this signature was highly enriched for developmentally dynamic regions of the human methylome and represents a combination of CML-unique, myeloid leukemia- specific and pan-cancer sub-signatures. The CML profile involved aberrantly methylated genes in signaling pathways already implicated in CML leukaemogenesis, including TGF-beta, Wnt, Jak-STAT and MAPK. Furthermore, a core set of differentially methylated promoters were identified that likely have a role in modulating gene expression levels. In conclusion, the findings are consistent with the notion that CML starts with the acquisition of a BCR-ABL1 fusion gene by a haematopoietic stem cell, which then either causes or cooperates with a series of DNA methylation changes that are specific for CML.