VL Algorithmen und Datenstrukturen für Bioinformatik (19400001) WS15/2016 Woche 11
Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin
Contains material from David Searls U Pennsylvania & Masbaul Polash Linguistics and Bioinformatics
• Automata Theory • Languages • Grammars
Parsing Genes
[Figure: parse tree for intron structure over the sequence tataaaa gt ag gt ag aataaa; nodes: Gene, Promoter, Transcript (twice), Intron (twice, each with Donor and Acceptor), PolyAsite]
Alan Turing (1912-1954)
• A pioneer of automata theory
• One of the fathers of modern computer science
• English mathematician
• Studied abstract machines, now called Turing machines, even before computers existed
• Heard of the Turing test?
What is Automata Theory?
• Study of abstract computing devices, or “machines”
• Automaton = an abstract computing device
• Note: a “device” need not even be physical hardware!
• A fundamental question in computer science:
• Find out what different models of machines can and cannot do
• The theory of computation
• Computability vs. complexity
• Languages: “A language is a collection of sentences of finite length all constructed from a finite alphabet of symbols”
• Grammars: “A grammar can be regarded as a device that enumerates the sentences of a language” - nothing more, nothing less
N. Chomsky, Information and Control, Vol 2, 1959
Tim Conrad, VL AlDaBi, WT015/16 LANGUAGES & GRAMMARS?
Problems
• In automata theory, a problem is to decide whether a given string is a member of some particular language.
• This formulation is general enough to capture the difficulty levels of all problems.
Natural Language Structure
• A sentence has a hierarchical structure, e.g.: “The linguist sees the biologist.”
[Sentence [NounPhrase [Determiner the] [Noun linguist]] [VerbPhrase [Verb sees] [NounPhrase [Determiner the] [Noun biologist]]]]
A Natural Language Grammar
• Grammars employ modular, hierarchical rules
Sentence → NounPhrase VerbPhrase
NounPhrase → Determiner Noun | NounPhrase PrepositionalPhrase
VerbPhrase → Verb NounPhrase | VerbPhrase PrepositionalPhrase
PrepositionalPhrase → Preposition NounPhrase
Noun → linguist | biologist | telescope | ...
Verb → sees | ...
Determiner → the | a
Preposition → with | ...
Dependency
• Grammars capture long-range dependencies
[Sentence [NounPhrase [NounPhrase [Determiner the] [Noun linguists]] [PrepositionalPhrase [Preposition with] [NounPhrase [Determiner the] [Noun telescope]]]] [VerbPhrase [Verb sees] [NounPhrase [Determiner the] [Noun biologist]]]]
Recursion
• Rules can call each other recursively
[Figure: parse tree of the noun phrase “the linguist with the biologist with the telescope ...”, built by repeated use of NounPhrase → NounPhrase PrepositionalPhrase and PrepositionalPhrase → Preposition NounPhrase]
Ambiguity
• Grammars also allow for syntactic ambiguity: “the linguist sees the biologist with the telescope” has two parse trees
• Reading 1, the PrepositionalPhrase attaches to the NounPhrase:
[Sentence [NounPhrase [Determiner the] [Noun linguist]] [VerbPhrase [Verb sees] [NounPhrase [NounPhrase [Determiner the] [Noun biologist]] [PrepositionalPhrase [Preposition with] [NounPhrase [Determiner the] [Noun telescope]]]]]]
• Reading 2, the PrepositionalPhrase attaches to the VerbPhrase:
[Sentence [NounPhrase [Determiner the] [Noun linguist]] [VerbPhrase [VerbPhrase [Verb sees] [NounPhrase [Determiner the] [Noun biologist]]] [PrepositionalPhrase [Preposition with] [NounPhrase [Determiner the] [Noun telescope]]]]]
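The two readings can also be counted mechanically. The following is a minimal Python sketch, not part of the original slides, that fills a CYK-style chart for the natural language grammar and counts parse trees. The open-ended word lists (“...”) are cut down to the words shown on the slides, and the names `BINARY_RULES`, `LEXICON`, and `count_parses` are illustrative assumptions.

```python
from collections import defaultdict

# Binary rules of the slide grammar: left-hand side -> list of (left, right) pairs.
BINARY_RULES = {
    "Sentence": [("NounPhrase", "VerbPhrase")],
    "NounPhrase": [("Determiner", "Noun"), ("NounPhrase", "PrepositionalPhrase")],
    "VerbPhrase": [("Verb", "NounPhrase"), ("VerbPhrase", "PrepositionalPhrase")],
    "PrepositionalPhrase": [("Preposition", "NounPhrase")],
}

# Word lists; the slides leave these open-ended, so this is only a sample.
LEXICON = {
    "Noun": {"linguist", "biologist", "telescope"},
    "Verb": {"sees"},
    "Determiner": {"the", "a"},
    "Preposition": {"with"},
}


def count_parses(sentence, start="Sentence"):
    """Return the number of distinct parse trees of a space-separated sentence."""
    words = sentence.lower().split()
    n = len(words)
    # chart[i, j, symbol] = number of parse trees of words[i:j] rooted at symbol
    chart = defaultdict(int)
    for i, word in enumerate(words):
        for category, vocabulary in LEXICON.items():
            if word in vocabulary:
                chart[i, i + 1, category] += 1
    # Combine narrower spans into wider ones.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for lhs, alternatives in BINARY_RULES.items():
                    for left, right in alternatives:
                        chart[i, j, lhs] += chart[i, k, left] * chart[k, j, right]
    return chart[0, n, start]


print(count_parses("the linguist sees the biologist"))
print(count_parses("the linguist sees the biologist with the telescope"))
```

A count above 1 signals syntactic ambiguity; a count of 0 means the word sequence is not a sentence of this grammar.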
Gene “Parsing” BIOLOGY?
A Gene Grammar
• Grammars can describe basic gene structure
Gene → Promoter Transcript
Transcript → Intron Transcript | Intron PolyAsite | Skip Transcript
Intron → Donor Acceptor
Skip → gt | ag
Promoter → tataaa
PolyAsite → aataaa
Donor → gt
Acceptor → ag | Skip Acceptor
• More elaborate grammars can incorporate coding regions, more complex signals, etc.
Alternative Splicing
• Most genes have multiple exons, and most of these are alternatively spliced, i.e., ambiguous
• Maintaining reading frame is a dependency
[Figure: types of alternative splicing: exon skipping, intron retention, alternative 5’ donor sites, alternative 3’ acceptor sites, mutually exclusive exons]
Parsing Genes
Three parses of the same sequence, tataaaa gt ag gt ag aataaa:
• Intron structure:
[Gene [Promoter tataaaa] [Transcript [Intron [Donor gt] [Acceptor ag]] [Transcript [Intron [Donor gt] [Acceptor ag]] [PolyAsite aataaa]]]]
• Exon skipping:
[Gene [Promoter tataaaa] [Transcript [Intron [Donor gt] [Acceptor [Skip ag] [Acceptor [Skip gt] [Acceptor ag]]]] [PolyAsite aataaa]]]
• Intron retention:
[Gene [Promoter tataaaa] [Transcript [Skip gt] [Transcript [Skip ag] [Transcript [Intron [Donor gt] [Acceptor ag]] [PolyAsite aataaa]]]]]
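As a rough check that the gene grammar really is ambiguous in the way these slides show, here is a small Python sketch (an illustration, not the lecture's own code) that counts parse trees of a token sequence with a memoized top-down parser. It assumes the signals are already split into tokens and uses the promoter token `tataaa` from the grammar rather than the seven-letter spelling in the figure; `GRAMMAR` and `count_parses` are hypothetical names.

```python
from functools import lru_cache

# The gene grammar from the slides; symbols without an entry are terminals.
GRAMMAR = {
    "Gene": [("Promoter", "Transcript")],
    "Transcript": [("Intron", "Transcript"), ("Intron", "PolyAsite"),
                   ("Skip", "Transcript")],
    "Intron": [("Donor", "Acceptor")],
    "Skip": [("gt",), ("ag",)],
    "Promoter": [("tataaa",)],
    "PolyAsite": [("aataaa",)],
    "Donor": [("gt",)],
    "Acceptor": [("ag",), ("Skip", "Acceptor")],
}


def count_parses(tokens, start="Gene"):
    """Count the parse trees of a token sequence under GRAMMAR."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # Number of ways symbol derives exactly tokens[i:j].
        if symbol not in GRAMMAR:  # terminal: must match a single token
            return int(j == i + 1 and tokens[i] == symbol)
        total = 0
        for rhs in GRAMMAR[symbol]:
            if len(rhs) == 1:
                total += count(rhs[0], i, j)
            else:
                left, right = rhs
                for k in range(i + 1, j):  # every split into two non-empty parts
                    total += count(left, i, k) * count(right, k, j)
        return total

    return count(start, 0, len(tokens))


print(count_parses("tataaa gt ag gt ag aataaa".split()))
```

Under these assumptions the two-intron sequence has exactly three parses, matching the intron structure, exon skipping, and intron retention trees; a single-intron sequence such as `tataaa gt ag aataaa` has one.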
RNA Secondary Structure BIOLOGY?
Why RNA Is Interesting
• In addition to messenger RNA (mRNA), there are other RNA molecules that play key roles in biology
• ribosomal RNA (rRNA)
• ribosomes are complexes that incorporate several RNA subunits in addition to numerous protein units
• transfer RNA (tRNA)
• transport amino acids to the ribosome during translation
• the spliceosome, which performs intron splicing, is a complex with several RNA units
• the genomes of many viruses (e.g. HIV) are encoded in RNA
• etc.
RNA Secondary Structure
• RNA is typically single stranded
• folding is determined in large part by base pairing
• A-U and C-G are the canonical base pairs
• other bases will sometimes pair, especially G-U
• the base-paired structure is referred to as the secondary structure of RNA
• related RNAs often have homologous secondary structure without significant sequence similarity
tRNA Secondary Structure
[Figure: tRNA secondary structure and tertiary structure]
Small Subunit Ribosomal RNA Secondary Structure
[Figure: secondary structure of small subunit ribosomal RNA]
Base Pairing as Dependency
• A context-free grammar (single nonterminals on the left) models base pairs:
Pair → x Pair x̄ | ε
where x̄ = base complement of x
• The base pairs create nested dependencies, and in fact the parse tree mimics an RNA stem
[Figure: parse tree with nested rule applications g Pair c, a Pair u, ..., ending in ε, drawn next to the corresponding RNA stem]
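The stem rule can be read directly as a recursive test. The sketch below is an illustration in Python, assuming only the canonical pairs A-U and C-G in lowercase; `COMPLEMENT` and `derives_pair` are hypothetical names. It strips one complementary pair from the two ends per step and accepts once the empty string ε is reached.

```python
# Canonical base complements (wobble pairs such as G-U are deliberately left out).
COMPLEMENT = {"a": "u", "u": "a", "c": "g", "g": "c"}


def derives_pair(seq):
    """Return True if seq can be derived from Pair -> x Pair x-bar | epsilon."""
    if seq == "":
        return True  # Pair -> epsilon
    if len(seq) < 2:
        return False  # a single base has no partner
    # Pair -> x Pair x-bar: the outermost bases must be complementary.
    return COMPLEMENT.get(seq[0]) == seq[-1] and derives_pair(seq[1:-1])


print(derives_pair("gaucgauc"))
print(derives_pair("gaac"))
```

Each recursive call corresponds to one node of the parse tree, so the call depth equals the number of stacked base pairs in the stem.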
Orthodox Secondary Structure
• Adding a branching rule makes arbitrary orthodox secondary structure possible:
Pair → Pair Pair | x Pair x̄ | ε
• Specific structures can also be specified, such as tRNA, ribozymes, ...
Secondary Structure Ambiguity
• Ambiguity allows for all possible structures
• Three parses of gaucgauc, written with matching parentheses for paired bases:
STEM: (((())))
CRUCIFORM: (()()())
DUMBBELL: (())(())
– A lexicalized version of this grammar generates each possible structure exactly once, allowing it to be used to count alternative structures of varying energies and study the distribution of folds over sequence space
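Counting structures rather than derivations needs a grammar that produces each structure once. As a hedged sketch, the Python below uses the unambiguous reformulation Pair → x Pair x̄ Pair | ε, which is my own choice and not necessarily the lexicalized grammar the slide refers to. It assumes canonical pairs only and that every base is paired, as in the branching grammar; `count_structures` is a hypothetical name.

```python
from functools import lru_cache

# Canonical base complements only (an assumption; G-U wobble pairs are ignored).
COMPLEMENT = {"a": "u", "u": "a", "c": "g", "g": "c"}


def count_structures(seq):
    """Count nested secondary structures of seq in which every base is paired."""

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of structures of seq[i:j] under Pair -> x Pair x-bar Pair | epsilon.
        if i == j:
            return 1  # Pair -> epsilon
        total = 0
        # seq[i] pairs with seq[k]; the enclosed span seq[i+1:k] must have even length.
        for k in range(i + 1, j, 2):
            if seq[k] == COMPLEMENT.get(seq[i]):
                total += count(i + 1, k) * count(k + 1, j)
        return total

    return count(0, len(seq))


print(count_structures("gaucgauc"))
```

For gaucgauc this gives three structures, in line with the STEM, CRUCIFORM, and DUMBBELL parses.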
Pseudoknots
• Nonorthodox structures like pseudoknots have crossing dependencies
[Figure: pseudoknot in the sequence gacugagucuca; two groups of three base pairs whose dependencies cross]
Protein Structure BIOLOGY?
Protein Structure
• Side-chain interactions embody dependencies in folded protein chains
• Secondary structures are a local abstraction
• Dependencies are parallel / antiparallel orientations and chirality
[Figure: topology diagram of 2BOP with α and β secondary-structure elements numbered 1 to 8]
Structural Complexity
[Figure: structures 1LBU, 1PMI, and 1SBP, captioned Concatenation, Insertion, and Translocation]
TOOLS OF LINGUISTICS
Spoonerisms
• Spoonerisms switch initial letters, syllables, or words: “Drink is the curse of the working class.” ↔ “Work is the curse of the drinking class.”
• Proteins may also exchange features, even entire globular domains, in a domain swap
[Figure: domain swap, 1DDT]
Rosetta Stone Proteins
• Proteins that interact or participate in the same pathway are often fused in evolution:
E. coli: γ-glutamyl phosphate reductase + glutamate-5-kinase
human: δ-1-pyrroline-5-carboxylate synthetase
• Catalogues of fusions can predict function
– Called collocation analysis in lexical semantics, which studies word relations, ontologies, etc.
– “Promiscuous” domains (e.g., SH3, WD-repeats, ABC, …) are poor predictors, as are common morphemic affixes (inter-, -ism, pre-, -tion, …)
Correspondences
• The organizing paradigms of linguistics and biology seem to correspond
• Proteins and words share a number of analogous concepts
Proteins | Languages
Sequence | Lexical
Structure | Syntactic
Function | Semantic
Role | Pragmatic
Evolution | Etymology
Paralogy | Paronymy
Convergence | Homonymy
Pleiotropy | Polysemy
Redundancy | Synonymy
MORE FORMALLY
Why Automata Theory?
To study abstract computing devices which are closely related to today’s computers. A simple example of a finite state machine:
[Figure: two states, off (start) and on; each input 1 switches between them]
There are many different kinds of machines.
Another Example
[Figure: finite state machine with states off (start), off, and on, and transitions labelled 0 and 1]
When will this be on? Try 100, 1001, 1000, 111, 00, …
Grammar and Languages
Grammars and languages are closely related to automata theory and are the basis of many important software components, such as:
• Compilers and interpreters
• Text editors and processors
• Search engines
• System verification components
PRELIMINARIES
• Alphabets
• Strings
• Languages
• Problems
Strings
• A string is a finite sequence of symbols from an alphabet. • Examples:
• 0011 and 11 are strings from Σ = {0,1}
• abc and bbb are strings from Σ = {a, b, … , z}
• (()(())) and )(() are strings from Σ = {(, )}
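Strings
• Empty string: ε
• Length of a string: |0010| = 4, |aa| = 2, |ε| = 0
• Prefix of a string: any leading part, e.g., a, aaa, and aaabc are prefixes of aaabc
• Proper prefix of a string: a prefix other than the whole string, e.g., a and aaa
• Suffix of a string: any trailing part, e.g., c, abc, and aaabc are suffixes of aaabc
• Proper suffix of a string: a suffix other than the whole string, e.g., c and abc
• Substring of a string: any contiguous part, e.g., aab, ab, and aaabc are substrings of aaabc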
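These definitions translate almost literally into slicing. The following Python sketch is an illustration only; the helper names are hypothetical, and it assumes the convention that a proper prefix or suffix is any prefix or suffix other than the whole string (some texts also exclude ε).

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]


def proper_prefixes(s):
    """Prefixes of s other than s itself (assumed convention)."""
    return [p for p in prefixes(s) if p != s]


def proper_suffixes(s):
    """Suffixes of s other than s itself (assumed convention)."""
    return [x for x in suffixes(s) if x != s]


def substrings(s):
    """The set of all contiguous parts of s, including the empty string."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


print(prefixes("aaabc"))
print(proper_suffixes("aaabc"))
print("aab" in substrings("aaabc"))
```

Every prefix and every suffix is also a substring, which is why `substrings` only needs the two slice bounds.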
Strings
• Concatenation: ω = abd, α = ce, ωα = abdce
• Exponentiation: ω = abd, ω^3 = abdabdabd, ω^0 = ε
• Reversal: ω = abd, ω^R = dba
• Σ^k = set of all strings of length k formed from symbols in Σ, e.g., Σ = {a, b}, Σ^2 = {ab, ba, aa, bb}, Σ^0 = {ε}
• What is Σ^1? Is Σ^1 different from Σ? How?
Strings
• Kleene closure: Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ … = ∪_{k≥0} Σ^k, e.g., Σ = {a, b}, Σ* = {ε, a, b, ab, aa, ba, bb, aaa, aab, abb, … } is the set of all strings formed from a’s and b’s.
• Σ+ = Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ … = ∪_{k>0} Σ^k, i.e., Σ* without the empty string.
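Since Σ* is infinite, a program can only enumerate it up to some length. A minimal Python sketch under that restriction follows; the function names `sigma_k` and `kleene_up_to` are illustrative assumptions, and strings are built with `itertools.product`.

```python
from itertools import product


def sigma_k(alphabet, k):
    """The set of all strings of length k over the alphabet."""
    return {"".join(symbols) for symbols in product(sorted(alphabet), repeat=k)}


def kleene_up_to(alphabet, max_len):
    """The strings of the Kleene closure of length at most max_len, shortest first."""
    result = []
    for k in range(max_len + 1):
        result.extend(sorted(sigma_k(alphabet, k)))
    return result


print(sigma_k("ab", 2))
print(sigma_k("ab", 0))
print(kleene_up_to("ab", 2))
```

The case k = 0 yields the set containing only the empty string, and dropping that first element from the enumeration gives the corresponding finite part of Σ+.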
Languages
• A language is a set of strings over an alphabet.
• Examples:
• Σ = {(, )}, L1 = {(), )(, (())} is a language over Σ.
• Σ = {a, b, c, … , z}, the set L of all legal English words is a language over Σ.
• The set {ε} is a language over any alphabet.
• What is the difference between ∅ (the empty language) and {ε}?
Languages
• Other Examples:
• Σ = {0, 1}, L = {0^n 1^n | n ≥ 1} is a language over Σ consisting of the strings {01, 0011, 000111, … }
• Σ = {0, 1}, L = {0^i 1^j | j ≥ i ≥ 0} is a language over Σ consisting of the strings with some 0’s (possibly none) followed by at least as many 1’s.
Problems
• In automata theory, a problem is to decide whether a given string is a member of some particular language.
• This formulation is general enough to capture the difficulty levels of all problems.
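To make the idea of a problem as a membership test concrete, here is a small Python sketch (illustrative; the function names are assumptions) that decides membership for the two example languages over Σ = {0, 1}, namely strings of n zeros followed by n ones with n ≥ 1, and strings of i zeros followed by j ones with j ≥ i ≥ 0.

```python
import re


def in_equal_blocks(word):
    """Decide membership in the language of n zeros followed by n ones, n >= 1."""
    n = len(word) // 2
    return n >= 1 and word == "0" * n + "1" * n


def in_at_least_as_many_ones(word):
    """Decide membership in the language of i zeros followed by j ones, j >= i >= 0."""
    match = re.fullmatch(r"(0*)(1*)", word)
    return match is not None and len(match.group(2)) >= len(match.group(1))


for w in ["", "01", "0011", "001", "011", "10"]:
    print(repr(w), in_equal_blocks(w), in_at_least_as_many_ones(w))
```

Both functions answer yes or no for every input string, which is exactly what solving the problem means in this setting.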
Finite Automata (or Finite State Machines)
• This is the simplest kind of machine.
• We will consider three types of Finite Automata:
• Deterministic Finite Automata (DFA)
• Non-deterministic Finite Automata (NFA)
• Finite Automata with ε-transitions (ε-NFA)
Deterministic Finite Automata (DFA)
We have seen a simple example before:
[Figure: two states, off (start) and on; each input 1 switches between them]
There are some states and transitions (edges) between the states. The edge labels tell when we can move from one state to another.
Definition of DFA
• A DFA is a 5-tuple (Q, Σ, δ, q0, F) where
• Q is a finite set of states
• Σ is a finite input alphabet
• δ is the transition function mapping Q × Σ to Q
• q0 in Q is the initial state (only one)
• F ⊆ Q is a set of final states (zero or more)
Definition of DFA
For example:
[Figure: two states, off (start) and on (double circle); each input 1 switches between them]
• Q is the set of states: {on, off}
• Σ is the set of input symbols: {1}
• δ is the transitions: off × 1 → on; on × 1 → off
• q0 is the initial state: off
• F is the set of final states (double circle): {on}
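The 5-tuple maps naturally onto a small data structure. The sketch below is one possible Python rendering, not a canonical one: the class name `DFA`, its field names, and the `accepts` method are assumptions, with the on/off machine as the instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DFA:
    """A DFA as the 5-tuple (Q, Sigma, delta, q0, F)."""
    states: frozenset
    alphabet: frozenset
    delta: dict          # maps (state, symbol) -> state
    start: str
    finals: frozenset

    def accepts(self, word):
        """Run the machine on word and report whether it ends in a final state."""
        state = self.start
        for symbol in word:
            if symbol not in self.alphabet:
                raise ValueError(f"symbol {symbol!r} is not in the alphabet")
            state = self.delta[(state, symbol)]
        return state in self.finals


# The on/off example: off x 1 -> on, on x 1 -> off.
toggle = DFA(
    states=frozenset({"off", "on"}),
    alphabet=frozenset({"1"}),
    delta={("off", "1"): "on", ("on", "1"): "off"},
    start="off",
    finals=frozenset({"on"}),
)

print(toggle.accepts("1"))
print(toggle.accepts("11"))
```

With this instance, `accepts` returns True exactly for inputs with an odd number of 1s, since every 1 toggles the state.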
Definition of DFA
Another Example:
[Figure: transition diagram with states q0 (start), q1, and q2, and edges labelled 0 and 1]
What are Q, Σ, δ, q0 and F in this DFA?
Transition Table
For the previous example, the DFA is (Q, Σ, δ, q0, F) where Q = {q0, q1, q2}, Σ = {0, 1}, F = {q2} and δ is given by:
State | Input 0 | Input 1
q0 | q1 | q0
q1 | q2 | q0
*q2 | q2 | q0
(* marks the final state.)
Note that there is exactly one transition for each input symbol from each state.
Language of a DFA
• Given a DFA M, the language accepted (or recognized) by M is the set of all strings that, starting from the initial state, reach one of the final states after the whole string has been read.
• For example, the language accepted by the previous example is the set of all strings over {0, 1} that end with 00.
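The claim about strings ending in 00 can be checked by brute force. This Python sketch assumes the transition table reads q0: (q1, q0), q1: (q2, q0), q2: (q2, q0) for inputs 0 and 1; the q2 row in particular is my reading of the slide, chosen to be consistent with the stated language. The names `DELTA`, `accepts`, and `agrees_with_suffix_test` are illustrative.

```python
from itertools import product

# Assumed transition table: state -> {input symbol -> next state}.
DELTA = {
    "q0": {"0": "q1", "1": "q0"},
    "q1": {"0": "q2", "1": "q0"},
    "q2": {"0": "q2", "1": "q0"},  # assumed row, see the note above
}


def accepts(word, start="q0", finals=("q2",)):
    """Read the whole word and report whether the DFA ends in a final state."""
    state = start
    for symbol in word:
        state = DELTA[state][symbol]
    return state in finals


def agrees_with_suffix_test(max_len=8):
    """Compare the DFA with str.endswith('00') on every string up to max_len."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            word = "".join(symbols)
            if accepts(word) != word.endswith("00"):
                return False
    return True


print(accepts("100"), accepts("1001"))
print(agrees_with_suffix_test())
```

Exhaustive testing up to a fixed length is not a proof, but a mismatch would immediately expose a wrong table or a wrong description of the language.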
DFA Example
Consider the DFA M = (Q, Σ, δ, q0, F) where Q = {q0, q1, q2, q3}, Σ = {0, 1}, F = {q0} and δ is:
State | Input 0 | Input 1
q0 | q2 | q1
q1 | q3 | q0
q2 | q0 | q3
q3 | q1 | q2
OR
[Figure: the same transitions drawn as a transition diagram with start state q0]
We can use a transition table or a transition diagram to specify the transitions. What input can take you to the final state in M?
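One way to explore the question is to run M on a few inputs and watch the states. The Python sketch below assumes the table rows q0: (q2, q1), q1: (q3, q0), q2: (q0, q3), q3: (q1, q2) for inputs 0 and 1, where the q1 and q2 rows are reconstructed from scattered subscripts in the slide text; `TABLE`, `trace`, and `accepts` are hypothetical names.

```python
from itertools import product

# Assumed transition table: state -> (next state on 0, next state on 1).
TABLE = {
    "q0": ("q2", "q1"),
    "q1": ("q3", "q0"),
    "q2": ("q0", "q3"),
    "q3": ("q1", "q2"),
}


def trace(word, start="q0"):
    """Return the list of states visited while reading word."""
    states = [start]
    for symbol in word:
        states.append(TABLE[states[-1]][int(symbol)])
    return states


def accepts(word):
    """M accepts when it finishes in its only final state q0."""
    return trace(word)[-1] == "q0"


print(trace("0101"))
accepted = ["".join(w) for n in range(5) for w in product("01", repeat=n)
            if accepts("".join(w))]
print(accepted)
```

Under this table each state records the parity of the 0s and 1s read so far, so the accepted strings are those with an even number of 0s and an even number of 1s.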
EXAMPLE
Recognizing Terminators with SCFGs (stochastic context-free grammars)
• [Bockhorst & Craven, IJCAI 2001]
[Figure: a prototypical terminator: prefix c-u-c-a-a-a-g-g-, a stem of paired bases closed by a loop, and suffix -u-u-u-u-u-u-u-u]
• a prototypical terminator has the structure above
• the lengths and base compositions of the elements can vary a fair amount
Terminator Grammar
START → PREFIX STEM_BOT1 SUFFIX
PREFIX → B B B B B B B B B
STEM_BOT1 → t_l STEM_BOT2 t_r
STEM_BOT2 → t_l* STEM_MID t_r* | t_l* STEM_TOP2 t_r*
STEM_MID → t_l* STEM_MID t_r* | t_l* STEM_TOP2 t_r*
STEM_TOP2 → t_l* STEM_TOP1 t_r*
STEM_TOP1 → t_l LOOP t_r
LOOP → B B LOOP_MID B B
LOOP_MID → B LOOP_MID | λ
SUFFIX → B B B B B B B B B
B → a | c | g | u
t = {a,c,g,u}, t* = {a,c,g,u,λ}
Nonterminals are uppercase, terminals are lowercase
Three Key Questions
• How likely is a given sequence?
• the Inside algorithm
• What is the most probable parse for a given sequence?
• the Cocke-Younger-Kasami (CYK) algorithm
• How can we learn the SCFG parameters given a grammar and a set of sequences?
• the Inside-Outside algorithm
OUTLOOK
The Chomsky Hierarchy
Language | Grammar | Automaton | Recognition | Dependency | Operations | Biology
Recursively Enumerable Languages | Unrestricted, Baa → A | Turing Machine | Undecidable | Arbitrary ? | diagonalization ? | Unknown ?
Context-Sensitive Languages | Context-Sensitive, At → aA | Linear-Bounded | Exponential? | Crossing (parallel) | duplication, inversion, transposition | Pseudoknots
Context-Free Languages | Context-Free, S → gSc | Pushdown (stack) | Polynomial | Nested (antiparallel) | insertion | Hairpins
Regular Languages | Regular, A → cA | Finite-State Machine | Linear | Strictly Local (processive) | concatenation, disjunction, iteration (∗) | Transcription
More information online at medicalbioinformatics.de/teaching
Further questions: Tim Conrad, AG Medical Bioinformatics, www.medicalbioinformatics.de