Molecules, Languages, and Automata

VL Algorithmen und Datenstrukturen für Bioinformatik (19400001), WS 2015/2016, Week 11
Tim Conrad, AG Medical Bioinformatics, Institut für Mathematik & Informatik, Freie Universität Berlin
Contains material from David Searls (U Pennsylvania) and Masbaul Polash

Linguistics and Bioinformatics
• Automata Theory
• Languages
• Grammars

Parsing Genes
• Intron structure (figure: parse tree over the sequence tataaa gt ag gt ag aataaa, with nodes Gene, Promoter, Transcript, Intron, Donor, Acceptor, PolyAsite)

Alan Turing (1912-1954)
• A pioneer of automata theory
• One of the fathers of modern computer science
• English mathematician
• Studied abstract machines, now called Turing machines, even before computers existed
• Heard of the Turing test?

What is Automata Theory?
• The study of abstract computing devices, or “machines”
• Automaton = an abstract computing device
  – Note: a “device” need not be physical hardware!
• A fundamental question in computer science:
  – Find out what different models of machines can and cannot do
• The theory of computation
  – Computability vs. complexity

• Languages: “A language is a collection of sentences of finite length all constructed from a finite alphabet of symbols”
• Grammars: “A grammar can be regarded as a device that enumerates the sentences of a language” (nothing more, nothing less)
N. Chomsky, Information and Control, Vol 2, 1959

Tim Conrad, VL AlDaBi, WT015/16

LANGUAGES & GRAMMARS?

Problems
• In automata theory, a problem is to decide whether a given string is a member of some particular language.
• This formulation is general enough to capture the difficulty of other problems as well: a problem instance can be encoded as a string, and the instances whose answer is “yes” form a language.
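The "problem = language membership" view on the Problems slide can be made concrete with a small sketch. The language chosen here is an assumption for illustration only: all strings over {a, c, g, t} that contain the promoter signal tataaa used in the gene grammar of these slides. The names `ALPHABET`, `SIGNAL` and `in_language` are hypothetical.

```python
# Illustrative language (an assumption, not from the slides):
# L = { w over {a,c,g,t} : w contains the promoter signal "tataaa" }
ALPHABET = set("acgt")
SIGNAL = "tataaa"


def in_language(w: str) -> bool:
    """Decide the membership problem 'is w in L?'."""
    # w must use only alphabet symbols and contain the signal somewhere
    return set(w) <= ALPHABET and SIGNAL in w


if __name__ == "__main__":
    for w in ["ggtataaagt", "ggtatagt", "ggtataaaxt"]:
        print(w, in_language(w))
```

The question "does this sequence carry a promoter signal?" thereby becomes the question "is this string a member of L?", which is the form in which automata theory states problems.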
Natural Language Structure
• A sentence has a hierarchical structure, e.g.: “The linguist sees the biologist.”
  Sentence(NounPhrase(Determiner(the) Noun(linguist)) VerbPhrase(Verb(sees) NounPhrase(Determiner(the) Noun(biologist))))

A Natural Language Grammar
• Grammars employ modular, hierarchical rules
  Sentence → NounPhrase VerbPhrase
  NounPhrase → Determiner Noun | NounPhrase PrepositionalPhrase
  VerbPhrase → Verb NounPhrase | VerbPhrase PrepositionalPhrase
  PrepositionalPhrase → Preposition NounPhrase
  Noun → linguist | biologist | telescope | ...
  Verb → sees | ...
  Determiner → the | a
  Preposition → with | ...

Dependency
• Grammars capture long-range dependencies: in “the linguist with the telescope sees the biologist”, the subject noun and the verb are separated by a prepositional phrase
  Sentence(NounPhrase(NounPhrase(Determiner(the) Noun(linguist)) PrepositionalPhrase(Preposition(with) NounPhrase(Determiner(the) Noun(telescope)))) VerbPhrase(Verb(sees) NounPhrase(Determiner(the) Noun(biologist))))

Recursion
• Rules can call each other recursively: NounPhrase → NounPhrase PrepositionalPhrase and PrepositionalPhrase → Preposition NounPhrase
  the linguist with the biologist with the telescope ...

Ambiguity
• Grammars also allow for syntactic ambiguity: “the linguist sees the biologist with the telescope” has two parse trees
  – PrepositionalPhrase attached to the object NounPhrase:
    Sentence(NounPhrase(the linguist) VerbPhrase(Verb(sees) NounPhrase(NounPhrase(the biologist) PrepositionalPhrase(with the telescope))))
  – PrepositionalPhrase attached to the VerbPhrase:
    Sentence(NounPhrase(the linguist) VerbPhrase(VerbPhrase(Verb(sees) NounPhrase(the biologist)) PrepositionalPhrase(with the telescope)))

Gene “Parsing”: BIOLOGY?
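Before turning to genes, the natural-language grammar above can be tried out directly. The following is a minimal sketch, not part of the original slides: it counts parse trees with a CYK-style chart, which works here because every rule has either two nonterminals or a single word on the right-hand side. Only the words listed on the slide are in the lexicon (the entries elided with "..." are left out), and the names `BINARY`, `LEXICON` and `count_parses` are hypothetical.

```python
from collections import defaultdict

# Rules with two nonterminals on the right-hand side (from the slide)
BINARY = {
    "Sentence": [("NounPhrase", "VerbPhrase")],
    "NounPhrase": [("Determiner", "Noun"),
                   ("NounPhrase", "PrepositionalPhrase")],
    "VerbPhrase": [("Verb", "NounPhrase"),
                   ("VerbPhrase", "PrepositionalPhrase")],
    "PrepositionalPhrase": [("Preposition", "NounPhrase")],
}
# Lexical rules: only the words that are spelled out on the slide
LEXICON = {
    "linguist": "Noun", "biologist": "Noun", "telescope": "Noun",
    "sees": "Verb", "the": "Determiner", "a": "Determiner",
    "with": "Preposition",
}


def count_parses(sentence: str, start: str = "Sentence") -> int:
    """Count the parse trees of a whitespace-separated sentence."""
    words = sentence.lower().split()
    n = len(words)
    if n == 0:
        return 0
    # chart[i][j][A] = number of trees deriving words[i:j] from A
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        if word in LEXICON:
            chart[i][i + 1][LEXICON[word]] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for lhs, rules in BINARY.items():
                    for left, right in rules:
                        chart[i][j][lhs] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n][start]


if __name__ == "__main__":
    print(count_parses("the linguist sees the biologist"))
    print(count_parses("the linguist sees the biologist with the telescope"))
```

Under these assumptions the first sentence has one parse and the second has two, matching the two attachments of the prepositional phrase shown on the Ambiguity slide.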
A Gene Grammar
• Grammars can describe basic gene structure
  Gene → Promoter Transcript
  Transcript → Intron Transcript | Intron PolyAsite | Skip Transcript
  Intron → Donor Acceptor
  Skip → gt | ag
  Promoter → tataaa
  PolyAsite → aataaa
  Donor → gt
  Acceptor → ag | Skip Acceptor
• More elaborate grammars can incorporate coding regions, more complex signals, etc.

Alternative Splicing
• Most genes have multiple exons, and most of these genes are alternatively spliced, i.e., their parse is ambiguous
• Maintaining reading frame is a dependency
(figure panels: Exon skipping; Intron retention; Alternative 5’ donor sites; Alternative 3’ acceptor sites; Mutually exclusive exons)

Parsing Genes
All three parses below are over the same sequence: tataaa gt ag gt ag aataaa
• Intron structure:
  Gene(Promoter(tataaa) Transcript(Intron(Donor(gt) Acceptor(ag)) Transcript(Intron(Donor(gt) Acceptor(ag)) PolyAsite(aataaa))))
• Exon skipping:
  Gene(Promoter(tataaa) Transcript(Intron(Donor(gt) Acceptor(Skip(ag) Acceptor(Skip(gt) Acceptor(ag)))) PolyAsite(aataaa)))
• Intron retention:
  Gene(Promoter(tataaa) Transcript(Skip(gt) Transcript(Skip(ag) Transcript(Intron(Donor(gt) Acceptor(ag)) PolyAsite(aataaa)))))

RNA Secondary Structure: BIOLOGY?

Why RNA Is Interesting
• In addition to messenger RNA (mRNA), there are other RNA molecules that play key roles in biology
  – ribosomal RNA (rRNA): ribosomes are complexes that incorporate several RNA subunits in addition to numerous protein units
  – transfer RNA (tRNA): transports amino acids to the ribosome during translation
  – the spliceosome, which performs intron splicing, is a complex with several RNA units
  – the genomes of many viruses (e.g. HIV) are encoded in RNA
  – etc.
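Before looking at RNA structure in detail, the gene grammar from the Parsing Genes slides can be made concrete. The following is a minimal sketch and not the lecture's own implementation: a top-down parser that enumerates every parse tree of a whitespace-separated signal sequence. It assumes the rule set as reconstructed from the grammar slide, with the promoter terminal written as tataaa as in the Promoter rule. The names `GRAMMAR`, `parses` and `parse_gene` are hypothetical.

```python
# Gene grammar: nonterminal -> list of alternative right-hand sides.
# Any symbol that is not a key of GRAMMAR is treated as a terminal.
GRAMMAR = {
    "Gene": [["Promoter", "Transcript"]],
    "Transcript": [["Intron", "Transcript"],
                   ["Intron", "PolyAsite"],
                   ["Skip", "Transcript"]],
    "Intron": [["Donor", "Acceptor"]],
    "Skip": [["gt"], ["ag"]],
    "Promoter": [["tataaa"]],
    "PolyAsite": [["aataaa"]],
    "Donor": [["gt"]],
    "Acceptor": [["ag"], ["Skip", "Acceptor"]],
}


def parses(symbol, tokens, i):
    """Yield (tree, next_index) for every way `symbol` derives tokens[i:next_index]."""
    if symbol not in GRAMMAR:  # terminal: must match the next token
        if i < len(tokens) and tokens[i] == symbol:
            yield symbol, i + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Match the right-hand side left to right, keeping all partial matches.
        partial = [([], i)]
        for part in rhs:
            partial = [(kids + [tree], k)
                       for kids, j in partial
                       for tree, k in parses(part, tokens, j)]
        for kids, j in partial:
            yield f"{symbol}({' '.join(kids)})", j


def parse_gene(sequence: str):
    """Return all parse trees that cover the whole sequence."""
    tokens = sequence.split()
    return [tree for tree, end in parses("Gene", tokens, 0) if end == len(tokens)]


if __name__ == "__main__":
    for tree in parse_gene("tataaa gt ag gt ag aataaa"):
        print(tree)
```

Plain recursion terminates here because the grammar has no left recursion: every recursive rule consumes at least one token first. For the example sequence the sketch finds three parse trees, corresponding to the intron structure, exon skipping and intron retention readings.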
RNA Secondary Structure
• RNA is typically single-stranded
• folding is determined in large part by base pairing
• A-U and C-G are the canonical base pairs
• other bases will sometimes pair, especially G-U
• the base-paired structure is referred to as the secondary structure of RNA
• related RNAs often have homologous secondary structure without significant sequence similarity

tRNA Secondary Structure
(figure: tRNA secondary structure and tertiary structure)

Small Subunit Ribosomal RNA Secondary Structure
(figure)

Base Pairing as Dependency
• A context-free grammar (single nonterminals on the left) models base pairs:
  Pair → x Pair x̄ | ε
  where x̄ = base complement of x
• The base pairs create nested dependencies, and in fact the parse tree mimics an RNA stem
(figure: parse tree of nested rules g Pair c, a Pair u, c Pair g, g Pair c, ending in ε, drawn next to the RNA stem it describes)

Orthodox Secondary Structure
• Adding a branching rule makes arbitrary orthodox secondary structure possible:
  Pair → Pair Pair | x Pair x̄ | ε
• Specific structures, such as tRNA, ribozymes, ..., can also be specified
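The two Pair grammars above can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's implementation: only the canonical pairs A-U and C-G count as complements (G-U is left out), sequences are lowercase, and every base must be paired because the grammar has no rule for unpaired bases. `is_stem` recognizes the language of Pair → x Pair x̄ | ε. `count_structures` counts the distinct non-crossing complete pairings allowed once the branching rule Pair → Pair Pair is added. It counts structures rather than parse trees, since the branching grammar together with ε gives one structure many parse trees. All names are hypothetical.

```python
from functools import lru_cache

# Canonical base complements only (assumption: G-U pairs are ignored)
COMPLEMENT = {"a": "u", "u": "a", "c": "g", "g": "c"}


def is_stem(s: str) -> bool:
    """Recognize Pair -> x Pair x' | epsilon, i.e. one fully nested stem."""
    while s:
        # the outermost two bases must be complementary
        if len(s) < 2 or COMPLEMENT.get(s[0]) != s[-1]:
            return False
        s = s[1:-1]  # descend into the inner Pair
    return True


def count_structures(s: str) -> int:
    """Count non-crossing pairings of s in which every base is paired."""

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        # number of structures for the substring s[i:j]
        if i == j:
            return 1  # epsilon
        total = 0
        # s[i] pairs with some s[k]; the inside s[i+1:k] and the
        # remainder s[k+1:j] then fold independently of each other
        for k in range(i + 1, j, 2):
            if COMPLEMENT.get(s[i]) == s[k]:
                total += count(i + 1, k) * count(k + 1, j)
        return total

    return count(0, len(s))


if __name__ == "__main__":
    print(is_stem("gaucgauc"))
    print(count_structures("gaucgauc"))
```

Under these assumptions gaucgauc is accepted as a stem, and the branching grammar admits three complete structures for it.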
Secondary Structure Ambiguity
• Ambiguity allows for all possible structures
(figure: three alternative parse trees for the sequence gaucgauc, labeled STEM, CRUCIFORM, and DUMBBELL)
  – A lexicalized version of this grammar generates each possible structure exactly once, allowing it to be used to count alternative structures of varying energies and study the distribution of folds over sequence space

Pseudoknots
• Nonorthodox structures like pseudoknots have crossing dependencies
(figure: pseudoknot over the sequence gacugagucuca, with crossing Pair dependencies)

Protein Structure: BIOLOGY?

Protein Structure
• Side-chain interactions embody dependencies in folded protein chains
• Secondary structures are a local abstraction
• Dependencies are parallel / antiparallel β orientations and chirality
(figure: topology diagram of 2BOP with α and β elements numbered 1 to 8)

Structural Complexity
(figure: structures 1LBU, 1PMI, 1SBP; panel labels: Concatenation, Insertion, Translocation)

TOOLS OF LINGUISTICS

Spoonerisms
• Spoonerisms switch initial letters, syllables, or words
  “Drink is the curse of the working class.”
  “Work is the curse of the drinking class.”
• Proteins may also exchange features, even entire globular domains, in a domain swap
(figure: 1DDT)

Rosetta Stone Proteins
• Proteins that interact or participate in the same pathway are often fused in evolution:
  E. coli: γ-glutamyl phosphate reductase + glutamate-5-kinase
  human: δ-1-pyrroline-5-carboxylate synthetase
• Catalogues of fusions can predict function
  – Called collocation analysis in lexical semantics, which studies word relations, ontologies, etc.
  – “Promiscuous” domains (e.g., SH3, WD-repeats, ABC, …) are poor predictors, as are common morphemic affixes (inter-, -ism, pre-, -tion, …)

Correspondences
• The organizing paradigms of linguistics and biology seem to correspond

  Proteins      Languages
  Sequence      Lexical
  Structure     Syntactic
  Function      Semantic
  Role          Pragmatic

• Proteins and words share a number of analogous concepts

  Proteins      Languages
  Evolution     Etymology
  Paralogy      Paronymy
  Convergence   Homonymy
  Pleiotropy    Polysemy
  Redundancy    Synonymy

MORE FORMALLY

Why Automata Theory?
To study abstract computing devices, which are closely related to today’s computers.
A simple example of a finite state machine:
(figure: two states, off and on; the machine starts in off, and each input 1 switches between off and on)
There are many different kinds of machines.

Another Example
(figure: finite state machine with start state off, a second off state and an on state, with transitions labeled 0 and 1)
When will this be on? Try 100, 1001, 1000, 111, 00, …

Grammar and Languages
Grammars and languages are closely related to automata
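The first finite state machine above (the on/off switch) can be simulated with a transition table. This is a minimal sketch covering only that two-state machine. The transitions of the three-state machine in "Another Example" cannot be recovered from the text, so it is not encoded here. The names `TRANSITIONS` and `run` are hypothetical, and treating an input with no transition as an error is an assumption.

```python
# Transition table of the on/off switch: each input "1" toggles the state.
TRANSITIONS = {
    ("off", "1"): "on",
    ("on", "1"): "off",
}


def run(inputs: str, start: str = "off") -> str:
    """Feed the input symbols to the machine and return the final state."""
    state = start
    for symbol in inputs:
        key = (state, symbol)
        if key not in TRANSITIONS:
            # assumption: symbols without a transition are rejected
            raise ValueError(f"no transition from {state!r} on {symbol!r}")
        state = TRANSITIONS[key]
    return state


if __name__ == "__main__":
    for word in ["", "1", "11", "111"]:
        print(repr(word), run(word))
```

The machine ends in state on exactly when it has read an odd number of 1s. Encoding another machine only requires a different transition table.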
