UniProt Domain Architecture Alignment: A New Approach for Protein Similarity Search using InterPro Domain Annotation

Tunca Doğan, Alex Bateman, Maria J. Martin
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Correspondence: [email protected]

ABSTRACT

Motivation: Similarity-based methods have been widely used to infer the properties of genes and gene products that have little or no experimental annotation. The most popular of these are sequence similarity search methods such as BLAST. New approaches that overcome the limitations of methods relying solely on sequence similarity are emerging. One of these is the comparison of the organization (architecture) of the structural domains in proteins. The idea is that shared structural units may indicate shared evolutionary and functional properties.

Results: Here we propose a new algorithm for the comparison of domain architectures in order to identify similarities and to propagate functional annotations between the proteins in the UniProt Database. The method, "UniProt Domain Architecture Alignment", differs from previous approaches in three major ways: (i) the use of the InterPro Database for the domain annotation, (ii) the incorporation of domain weights into the dynamic programming step, and (iii) the inclusion of information about non-annotated regions of the proteins in the domain architectures. The performance of the method was measured through the identification of orthology using the OMA database (F1 score: 0.62). The results indicated the effectiveness of the approach for similarity detection. We plan to integrate the algorithm into a learning-based system for the automatic annotation of uncharacterized proteins in the UniProtKB/TrEMBL database.

INTRODUCTION

• Discovery of functional properties for proteins is a key step in biomedical research.
• Experimental identification of proteins is still a laborious and expensive task.
• This has led to the development of many computational methods that infer the unknown properties of proteins from their sequence similarity to experimentally annotated proteins (e.g. BLAST, PSI-BLAST).
• Different approaches have been tried lately, especially in the field of protein function prediction, to augment the performance of sequence methods.
• One of these approaches is the study of protein domains: the structural building blocks in proteins that are able to function and fold independently from the rest of the protein.
• The concept of domain architectures (DA), defined as the organizational properties of a protein regarding the domains it …

METHODOLOGY

Generation of the Domain Architectures:
1) Collect the hits for each protein from InterPro.
2) Remove all non-domain type hits.
3) Order the domain hits sequentially.
4) Merge the hits from the same InterPro hierarchy into single hits using the condensed view algorithm provided by this resource.
5) Treat the overlapping hits from unrelated InterPro entries.
6) Add the stretches of residues without domain hits (> 30 a.a.) as "GAP" domains in the DAs.

Figure 1. Different types of overlapping domain hits on protein sequences.
Figure 2. Resolution process for the overlap hits.

Domain weighting:
Inverse domain frequency:
  N_t: total number of proteins in the test set
  N_d: number of proteins containing domain d
Neighboring domain count:
  E_d: total number of distinct neighboring domains to d
Term frequency:
  N_d,p: copy number of domain d in protein p
  D_p: total number of domains in protein p
Domain hit sizes:
  Z_min(d1,d2), Z_max(d1,d2): sizes of the shorter and longer hits, respectively, of domain d in protein 1 and in protein 2
  Z_av: average size of all domain hits on all proteins in the set
Domain similarity measure:
  O_d,e: similarity ratio between domain d and domain e
Weight matrix:
  A_p1,p2, C_p1,p2, F_p1,p2, S_p1,p2, I_p1,p2: local weight matrices
Final scoring matrix:
  R_p1,p2: raw scoring matrix
  W_p1,p2: general weight matrix between proteins 1 and 2

Weighted Domain Architecture Alignment:
The Needleman-Wunsch global sequence alignment algorithm (Needleman and Wunsch, 1970) is the core of the proposed DA alignment method:
• Modification of the algorithm to work with 7137 distinct InterPro domains as its alphabet instead of 20 amino acids.

RESULTS & DISCUSSION

InterPro Domains, DAs and DA Alignment

Domain annotation coverage difference between domain databases:
Figure 3. Domain hit statistics of UniProtKB/SwissProt proteins from various databases.

Overlap domain hits problem in the InterPro database:
Figure 4. The fraction of overlap hits by InterPro domains on the residues of all UniProtKB/SwissProt proteins.

Statistics about the directionality in DAs:
Figure 5. Co-occurrence frequencies of a selection of domain pairs, hit together on UniProtKB/SwissProt proteins (InterPro accessions of the domains are shown at the top of the bars).

Evaluation of the performance of the method

The performance of the proposed method was assessed on the identification of orthologous proteins from the Orthologous Matrix project (OMA), release March 2014 (Altenhoff et al., 2011). Randomly selected UniProtKB/SwissProt proteins from the OMA groups were subjected to the DA alignment procedure. The method was evaluated by its ability to identify the orthologous proteins, since orthologs usually share the same function.

Table 1. Performance results of the proposed method in the identification of orthologous proteins in OMA groups.
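The DA generation steps and the Needleman-Wunsch modification listed under METHODOLOGY can be illustrated with a minimal Python sketch. It is not the authors' implementation. It covers only step 3 (ordering the hits) and step 6 (adding "GAP" tokens for unannotated stretches longer than 30 residues), then runs a plain global alignment whose alphabet is domain labels rather than amino acids. The `Hit` class, the function names, the match/mismatch/indel values, and the `weight` callable are all assumptions made for illustration; the poster's actual domain weights are not reproduced here, and the hierarchy merge and overlap resolution (steps 4 and 5) are left out.

```python
from dataclasses import dataclass

GAP_TOKEN = "GAP"   # label the poster uses for long unannotated stretches
MIN_GAP_LEN = 30    # poster: stretches > 30 a.a. without hits become GAP


@dataclass(frozen=True)
class Hit:
    """Hypothetical container for one domain hit on a protein."""
    accession: str  # InterPro accession used as the alignment symbol
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive


def build_da(hits, protein_length):
    """Order hits along the sequence and insert GAP tokens (steps 3 and 6)."""
    da = []
    cursor = 1  # first residue not yet covered by a hit
    for hit in sorted(hits, key=lambda h: h.start):
        if hit.start - cursor > MIN_GAP_LEN:
            da.append(GAP_TOKEN)
        da.append(hit.accession)
        cursor = max(cursor, hit.end + 1)
    if protein_length - cursor + 1 > MIN_GAP_LEN:
        da.append(GAP_TOKEN)
    return da


def align_das(da1, da2, match=1.0, mismatch=-1.0, indel=-1.0, weight=None):
    """Needleman-Wunsch global alignment over domain labels.

    `weight` is a placeholder for per-domain weighting (assumed form:
    a callable scaling the match score); by default every domain counts 1.0.
    """
    weight = weight or (lambda domain: 1.0)
    n, m = len(da1), len(da2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    trace = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], trace[i][0] = i * indel, "up"
    for j in range(1, m + 1):
        score[0][j], trace[0][j] = j * indel, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = da1[i - 1], da2[j - 1]
            sub = match * weight(a) if a == b else mismatch
            candidates = [
                (score[i - 1][j - 1] + sub, "diag"),
                (score[i - 1][j] + indel, "up"),
                (score[i][j - 1] + indel, "left"),
            ]
            # max() keeps the first best candidate, so ties prefer "diag".
            score[i][j], trace[i][j] = max(candidates, key=lambda c: c[0])
    # Walk back from the bottom-right corner to recover the alignment.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = trace[i][j]
        if move == "diag":
            row1.append(da1[i - 1]); row2.append(da2[j - 1]); i -= 1; j -= 1
        elif move == "up":
            row1.append(da1[i - 1]); row2.append("-"); i -= 1
        else:
            row1.append("-"); row2.append(da2[j - 1]); j -= 1
    return score[n][m], row1[::-1], row2[::-1]
```

Keeping the alignment indel symbol (`-`) distinct from the `GAP` token matters here: `GAP` is an ordinary member of the alphabet that can match another `GAP`, whereas `-` marks a position with no counterpart in the other architecture.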
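The abstract reports an F1 score of 0.62 for ortholog identification without spelling out how it was computed. As a reminder of what the figure means, the sketch below uses the standard definition, the harmonic mean of precision and recall. Treating each unordered protein pair as one prediction is an assumption, as are the function name and the example identifiers; the poster does not state how positives were counted.

```python
def f1_score(predicted_pairs, true_pairs):
    """Standard F1 over unordered protein pairs (assumed unit of evaluation)."""
    # frozenset makes ("P1", "P2") and ("P2", "P1") the same pair.
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical identifiers: three predicted ortholog pairs, two true ones.
example = f1_score(
    [("P1", "P2"), ("P1", "P3"), ("P2", "P4")],
    [("P1", "P2"), ("P3", "P1")],
)
```

In this example precision is 2/3 and recall is 1, so `example` evaluates to 0.8.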