
VL Algorithmen und Datenstrukturen für Bioinformatik (19400001) WS15/2016 Woche 11

Tim Conrad, AG Medical Bioinformatics, Institut für Mathematik & Informatik, Freie Universität Berlin

Contains material from David Searls U Pennsylvania & Masbaul Polash Linguistics and Bioinformatics

• Languages
• Grammars

Alan Turing (1912-1954)

• A pioneer of automata theory
• One of the fathers of modern computer science
• English mathematician
• Studied abstract machines called Turing machines even before computers existed
• Heard of the Turing test?

What is Automata Theory?

• Study of abstract devices, or “machines”

• Automaton = an abstract computing device
• Note: A “device” need not even be physical hardware!

• A fundamental question in computer science:
  • Find out what different models of machines can and cannot do

• Languages: “A language is a collection of sentences of finite length all constructed from a finite alphabet of symbols”

• Grammars: “A grammar can be regarded as a device that enumerates the sentences of a language” - nothing more, nothing less

N. Chomsky, Information and Control, Vol 2, 1959

Tim Conrad, VL AlDaBi, WT015/16

LANGUAGES & GRAMMARS?

Natural Language Structure

• A sentence has a hierarchical structure, e.g.: “The linguist sees the biologist.”

Sentence
  NounPhrase
    Determiner: the
    Noun: linguist
  VerbPhrase
    Verb: sees
    NounPhrase
      Determiner: the
      Noun: biologist

A Natural Language Grammar

• Grammars employ modular, hierarchical rules

  Sentence → NounPhrase VerbPhrase
  NounPhrase → Determiner Noun | NounPhrase PrepositionalPhrase
  VerbPhrase → Verb NounPhrase | VerbPhrase PrepositionalPhrase
  PrepositionalPhrase → Preposition NounPhrase
  Noun → linguist | biologist | telescope | ...
  Verb → sees | ...
  Determiner → the | a
  Preposition → with | ...

Dependency

• Grammars capture long-range dependencies

Sentence
  NounPhrase
    NounPhrase
      Determiner: the
      Noun: linguist
    PrepositionalPhrase
      Preposition: with
      NounPhrase
        Determiner: the
        Noun: telescope
  VerbPhrase
    Verb: sees
    NounPhrase
      Determiner: the
      Noun: biologist

Recursion

• Rules can call each other recursively

[Figure: parse tree of the NounPhrase “the linguist with the biologist with the telescope ...”, in which NounPhrase and PrepositionalPhrase repeatedly contain each other]

Ambiguity

• Grammars also allow for syntactic ambiguity: “the linguist sees the biologist with the telescope” has two parse trees

Reading 1 (the PrepositionalPhrase attaches to the object NounPhrase):

Sentence
  NounPhrase
    Determiner: the
    Noun: linguist
  VerbPhrase
    Verb: sees
    NounPhrase
      NounPhrase
        Determiner: the
        Noun: biologist
      PrepositionalPhrase
        Preposition: with
        NounPhrase
          Determiner: the
          Noun: telescope

Reading 2 (the PrepositionalPhrase attaches to the VerbPhrase):

Sentence
  NounPhrase
    Determiner: the
    Noun: linguist
  VerbPhrase
    VerbPhrase
      Verb: sees
      NounPhrase
        Determiner: the
        Noun: biologist
    PrepositionalPhrase
      Preposition: with
      NounPhrase
        Determiner: the
        Noun: telescope
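The ambiguity can be made concrete by counting parse trees. The following is a minimal sketch, assuming a CYK-style chart over the natural language grammar from the earlier slide with the elided "..." alternatives left out; the names `BINARY`, `LEXICON` and `count_parses` are illustrative and not part of the lecture material.

```python
from collections import defaultdict

# Binary rules (parent -> left right), copied from the natural language grammar.
BINARY = [
    ("Sentence", "NounPhrase", "VerbPhrase"),
    ("NounPhrase", "Determiner", "Noun"),
    ("NounPhrase", "NounPhrase", "PrepositionalPhrase"),
    ("VerbPhrase", "Verb", "NounPhrase"),
    ("VerbPhrase", "VerbPhrase", "PrepositionalPhrase"),
    ("PrepositionalPhrase", "Preposition", "NounPhrase"),
]
# Lexical rules (word -> category); only the words listed on the slide.
LEXICON = {
    "linguist": "Noun", "biologist": "Noun", "telescope": "Noun",
    "sees": "Verb", "the": "Determiner", "a": "Determiner",
    "with": "Preposition",
}

def count_parses(sentence, start="Sentence"):
    """Count the parse trees of a whitespace-separated sentence."""
    words = sentence.split()
    n = len(words)
    # chart[(i, j)][symbol] = number of trees for words[i:j] rooted at symbol
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        if word in LEXICON:
            chart[(i, i + 1)][LEXICON[word]] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between left and right child
                for parent, left, right in BINARY:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][start]

print(count_parses("the linguist sees the biologist with the telescope"))
```

The chart multiplies the counts of the two children at every split point, so a sentence with two attachment sites for the prepositional phrase ends up with a count of two.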

Gene „“ BIOLOGY?

A Gene Grammar

• Grammars can describe basic gene structure

  Gene → Promoter Transcript
  Transcript → Intron Transcript | Intron PolyAsite | Skip Transcript
  Intron → Donor Acceptor
  Skip → gt | ag
  Promoter → tataaa
  PolyAsite → aataaa
  Donor → gt
  Acceptor → ag | Skip Acceptor

• More elaborate grammars can incorporate coding regions, more complex signals, etc.
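To see how this grammar assigns several structures to one sequence, the rules can be run through a small top-down parser. This is a sketch under stated assumptions: the input is already split into signal tokens, the promoter is spelled tataaa as in the rule, and the names `GRAMMAR` and `parses` are hypothetical helpers for illustration.

```python
from functools import lru_cache

# The gene grammar; one-element tuples are terminal rules.
GRAMMAR = {
    "Gene": [("Promoter", "Transcript")],
    "Transcript": [("Intron", "Transcript"), ("Intron", "PolyAsite"),
                   ("Skip", "Transcript")],
    "Intron": [("Donor", "Acceptor")],
    "Skip": [("gt",), ("ag",)],
    "Promoter": [("tataaa",)],
    "PolyAsite": [("aataaa",)],
    "Donor": [("gt",)],
    "Acceptor": [("ag",), ("Skip", "Acceptor")],
}

def parses(tokens, start="Gene"):
    """Return every parse tree of the token list as a bracketed string."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        found = []
        for rhs in GRAMMAR[symbol]:
            if len(rhs) == 1:
                # Terminal rule: must match exactly one token.
                if j - i == 1 and tokens[i] == rhs[0]:
                    found.append(f"({symbol} {rhs[0]})")
            else:
                # Binary rule: try every split into two non-empty parts.
                left, right = rhs
                for k in range(i + 1, j):
                    for lt in trees(left, i, k):
                        for rt in trees(right, k, j):
                            found.append(f"({symbol} {lt} {rt})")
        return tuple(found)

    return list(trees(start, 0, len(tokens)))

gene = "tataaa gt ag gt ag aataaa".split()
for tree in parses(gene):
    print(tree)
```

For this token sequence the sketch prints three trees, one per way of reading the gt and ag signals as donors, acceptors or skipped sites.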

Alternative Splicing

• Most genes have multiple exons and most of these are alternatively spliced, i.e., ambiguous
• Maintaining reading frame is a dependency

[Figure: types of alternative splicing: exon skipping, intron retention, alternative 5’ donor sites, alternative 3’ acceptor sites, mutually exclusive exons]

Parsing Genes

• Intron structure:

Gene
  Promoter: tataaaa
  Transcript
    Intron
      Donor: gt
      Acceptor: ag
    Transcript
      Intron
        Donor: gt
        Acceptor: ag
      PolyAsite: aataaa

Parsing Genes

• Exon skipping:

Gene
  Promoter: tataaaa
  Transcript
    Intron
      Donor: gt
      Acceptor
        Skip: ag
        Acceptor
          Skip: gt
          Acceptor: ag
    PolyAsite: aataaa

Parsing Genes

• Intron Retention:

Gene
  Promoter: tataaaa
  Transcript
    Skip: gt
    Transcript
      Skip: ag
      Transcript
        Intron
          Donor: gt
          Acceptor: ag
        PolyAsite: aataaa

RNA Secondary Structure BIOLOGY?

Why RNA Is Interesting

• In addition to messenger RNA (mRNA), there are other RNA molecules that play key roles in biology
  • ribosomal RNA (rRNA)
    • ribosomes are complexes that incorporate several RNA subunits in addition to numerous protein units
  • transfer RNA (tRNA)
    • transport amino acids to the ribosome during translation
  • the spliceosome, which performs intron splicing, is a complex with several RNA units
  • the genomes for many viruses (e.g. HIV) are encoded in RNA
  • etc.

RNA Secondary Structure

• RNA is typically single stranded
• folding, in large part, is determined by base-pairing
  • A-U and C-G are the canonical base pairs
  • other bases will sometimes pair, especially G-U
• the base-paired structure is referred to as the secondary structure of RNA
• related RNAs often have homologous secondary structure without significant sequence similarity

tRNA Secondary Structure

[Figure: tRNA secondary structure and tertiary structure]

Small Subunit Ribosomal RNA Secondary Structure

Base Pairing as Dependency

• A context-free grammar (single nonterminals on the left) models base pairs:

  Pair → x Pair x̄ | ε    where x̄ = base complement of x

• The base pairs create nested dependencies, and in fact the parse tree mimics an RNA stem

[Figure: an RNA stem drawn next to its parse tree of nested Pair rules (g Pair c, a Pair u, …, ending in ε)]
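The Pair rule translates almost literally into a recursive membership test. A minimal sketch, assuming only the canonical complements a-u and c-g and lowercase input; `COMPLEMENT` and `is_stem` are illustrative names.

```python
# Canonical base complements (G-U wobble pairs are deliberately left out).
COMPLEMENT = {"a": "u", "u": "a", "c": "g", "g": "c"}

def is_stem(seq):
    """True if seq can be derived from Pair -> x Pair x̄ | ε."""
    if seq == "":
        return True                      # Pair -> ε
    if len(seq) < 2:
        return False                     # a single base cannot be paired
    # Pair -> x Pair x̄: the outermost bases must be complementary,
    # and whatever lies between them must again be a Pair.
    return COMPLEMENT.get(seq[0]) == seq[-1] and is_stem(seq[1:-1])

print(is_stem("gaucgauc"))  # g-c, a-u, u-a, c-g, then ε
```

Each recursive call corresponds to one level of the parse tree, which is why the call stack has the same shape as the stem.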

Orthodox Secondary Structure

• Adding a branching Pair rule makes arbitrary orthodox secondary structure possible:

  Pair → Pair Pair | x Pair x̄ | ε

• Specific structures can also be specified, such as tRNA, ribozymes, ...

[Figure: branching parse tree built from Pair rules]

Secondary Structure Ambiguity

• Ambiguity allows for all possible structures, e.g. for gaucgauc:

STEM:

Pair
  g Pair c
    a Pair u
      u Pair a
        c Pair g
          ε

[Figure: CRUCIFORM parse of gaucgauc, built with the branching rule Pair → Pair Pair]

[Figure: DUMBBELL parse of gaucgauc, built with the branching rule Pair → Pair Pair]

– A lexicalized version of this grammar generates each possible structure exactly once, allowing it to be used to count alternative structures of varying energies and study the distribution of folds over sequence space
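Counting alternative structures can be sketched with a small dynamic program. Note that this is not the lexicalized grammar mentioned above but a stand-in that also counts every structure exactly once. It assumes that every base is paired, that only canonical pairs are allowed, and that there is no minimum loop length; `CANONICAL` and `count_structures` are hypothetical names.

```python
from functools import lru_cache

CANONICAL = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")}

def count_structures(seq):
    """Count fully paired, nested (orthodox) structures of seq."""

    @lru_cache(maxsize=None)
    def count(i, j):
        if i == j:
            return 1  # empty span, corresponds to Pair -> ε
        total = 0
        # The first base of the span pairs with some later base k. The bases
        # enclosed by the pair and the bases after k then fold independently.
        for k in range(i + 1, j):
            if (seq[i], seq[k]) in CANONICAL:
                total += count(i + 1, k) * count(k + 1, j)
        return total

    return count(0, len(seq))

print(count_structures("gaucgauc"))
```

Because each structure is identified by the partner of its first base, no structure is counted twice, which is the property the ambiguous Pair grammar lacks.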

Pseudoknots

• Nonorthodox structures like pseudoknots have crossing dependencies

[Figure: pseudoknot formed by gacugagucuca, in which two groups of Pair dependencies cross each other]
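The difference between nested and crossing dependencies can be checked mechanically on a list of base-pair positions. In the sketch below the pair lists are assumptions: the stem pairs follow the gaucgauc example, and the pseudoknot pairs are one plausible reading of gacugagucuca (gac pairing with guc, uga pairing with uca), not something stated in the lecture. `classify_pairs` is an illustrative name.

```python
def classify_pairs(pairs):
    """Return 'nested' if no two base pairs cross, otherwise 'crossing'."""
    pairs = [tuple(sorted(p)) for p in pairs]
    for a, (i, j) in enumerate(pairs):
        for k, l in pairs[a + 1:]:
            # Two pairs cross when exactly one end of one pair
            # lies strictly between the ends of the other.
            if i < k < j < l or k < i < l < j:
                return "crossing"
    return "nested"

# 0-based positions in gaucgauc: a simple stem.
stem = [(0, 7), (1, 6), (2, 5), (3, 4)]
# 0-based positions in gacugagucuca: assumed pseudoknot pairing.
pseudoknot = [(0, 8), (1, 7), (2, 6), (3, 11), (4, 10), (5, 9)]

print(classify_pairs(stem), classify_pairs(pseudoknot))
```

Nested pair sets are exactly those a context-free Pair grammar can generate; a crossing set needs more expressive power.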

Protein Structure BIOLOGY?

Protein Structure

• Side-chain interactions embody dependencies in folded protein chains
• Secondary structures are a local abstraction
• Dependencies are parallel / antiparallel β orientations and chirality

[Figure: PDB structure 2BOP with its α and β elements labelled]

Structural Complexity

[Figure: PDB structures 1LBU, 1PMI and 1SBP, labelled Concatenation, Insertion and Translocation]

TOOLS OF LINGUISTICS

Spoonerisms

• Spoonerisms switch initial letters, syllables, or words

  Drink is the curse of the working class.
  Work is the curse of the drinking class.

• Proteins may also exchange features, even entire globular domains, in a domain swap

[Figure: domain swap in PDB structure 1DDT]

Rosetta Stone Proteins

• Proteins that interact or participate in the same pathway are often fused in evolution:
  E. coli: γ-glutamyl phosphate reductase + glutamate-5-kinase
  human: δ-1-pyrroline-5-carboxylate synthetase
• Catalogues of fusions can predict function
  – Called collocation analysis in lexical semantics, which studies word relations, ontologies, etc.
  – “Promiscuous” domains (e.g., SH3, WD-repeats, ABC, …) are poor predictors, as are common morphemic affixes (inter-, -ism, pre-, -tion, …)

Correspondences

• The organizing paradigms of linguistics and biology seem to correspond

  Proteins  | Languages
  Sequence  | Lexical
  Structure | Syntactic
  Function  | Semantic
  Role      | Pragmatic

• Proteins and words share a number of analogous concepts

  Evolution   | Etymology
  Paralogy    | Paronymy
  Convergence | Homonymy
  Pleiotropy  | Polysemy
  Redundancy  | Synonymy

MORE FORMALLY

Why Automata Theory?

To study abstract computing devices which are closely related to today’s computers. A simple example of a finite machine:

[Figure: a machine with two states, off (start) and on; the input 1 moves it from off to on and from on back to off]

There are many different kinds of machines.

Another Example

[Figure: a machine with three states, off (start), off and on, whose transitions are labelled 0 and 1]

When will this be on? Try 100, 1001, 1000, 111, 00, …

Grammar and Languages

Grammars and languages are closely related to automata theory and are the basis of many important software components like:

• Compilers and interpreters
• Text editors and processors
• Search engines
• Verification components

PRELIMINARIES

• Alphabets
• Strings
• Languages
• Problems

Strings

• A string is a finite sequence of symbols from an alphabet.
• Examples:
  • 0011 and 11 are strings from Σ = {0,1}
  • abc and bbb are strings from Σ = {a, b, … , z}
  • (()(())) and )(() are strings from Σ = {(, )}

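Strings

• Empty string: ε
• Length of string: |0010| = 4, |aa| = 2, |ε| = 0
• Prefix of string: a leading part of the string, e.g. aaabc has the prefixes a, aaa and aaabc
• Proper prefix of string: a prefix other than the whole string, e.g. a and aaa for aaabc
• Suffix of string: a trailing part of the string, e.g. aaabc has the suffixes c, abc and aaabc
• Proper suffix of string: a suffix other than the whole string, e.g. c and abc for aaabc
• Substring of string: a contiguous part of the string, e.g. aaabc has the substrings aab, bc and aaabc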
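These notions map directly onto string slicing. The sketch below enumerates them for aaabc; it assumes the common convention that the empty string ε counts as a prefix and a suffix, and that "proper" only excludes the whole string. The function names are illustrative.

```python
def prefixes(s):
    """All prefixes of s, from ε up to s itself."""
    return [s[:k] for k in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from s itself down to ε."""
    return [s[k:] for k in range(len(s) + 1)]

def substrings(s):
    """All contiguous parts of s; the set removes repeated entries."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def proper(parts, s):
    """Drop the whole string from a list of prefixes or suffixes."""
    return [p for p in parts if p != s]

w = "aaabc"
print(prefixes(w))
print(proper(suffixes(w), w))
print(sorted(substrings(w), key=len))
```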

Strings

• Concatenation: ω=abd, α=ce, ωα=abdce
• Exponentiation: ω=abd, ω^3=abdabdabd, ω^0=ε
• Reversal: ω=abd, ω^R = dba
• Σ^k = set of all k-length strings formed by symbols in Σ
  e.g., Σ={a,b}, Σ^2={ab, ba, aa, bb}, Σ^0={ε}
• What is Σ^1? Is Σ^1 different from Σ? How?
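The string operations above have direct counterparts in Python. A short sketch; the helper `sigma_k` and the variable names are assumptions made for illustration, and the empty Python string stands in for ε.

```python
from itertools import product

def sigma_k(alphabet, k):
    """Σ^k: the set of all strings of length k over the alphabet."""
    return {"".join(symbols) for symbols in product(sorted(alphabet), repeat=k)}

w, a = "abd", "ce"
concatenation = w + a   # ωα
power_3 = w * 3         # ω^3
power_0 = w * 0         # ω^0, the empty string ε
reversal = w[::-1]      # ω^R

print(concatenation, power_3, repr(power_0), reversal)
print(sigma_k({"a", "b"}, 2))
print(sigma_k({"a", "b"}, 0))
```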

Strings

• Kleene Closure: Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ … = ∪_{k≥0} Σ^k
  e.g., Σ={a, b}, Σ* = {ε, a, b, ab, aa, ba, bb, aaa, aab, abb, … } is the set of all strings formed by a’s and b’s.
• Σ+ = Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ … = ∪_{k>0} Σ^k
  i.e., Σ* without the empty string.
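Since Σ* is infinite, a program can only enumerate it lazily. The following sketch yields the strings in order of increasing length; it assumes a non-empty alphabet, and `kleene_star` and `kleene_plus` are illustrative names.

```python
from itertools import count, islice, product

def kleene_star(alphabet):
    """Yield the strings of Σ* by increasing length (infinite generator).

    Assumes a non-empty alphabet; with an empty one the generator would
    yield ε and then search forever without producing anything.
    """
    symbols = sorted(alphabet)
    for k in count(0):                       # k = 0, 1, 2, ...
        for combo in product(symbols, repeat=k):
            yield "".join(combo)             # one string of Σ^k

def kleene_plus(alphabet):
    """Yield Σ+, i.e. Σ* without the empty string."""
    return (w for w in kleene_star(alphabet) if w != "")

first_seven = list(islice(kleene_star({"a", "b"}), 7))
print(first_seven)
```

Using `islice` keeps the example finite: it takes the first few elements and never asks the generator for more.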

Languages

• A language is a set of strings over an alphabet.
• Examples:
  • Σ={(, )}, L1={(), )(, (())} is a language over Σ.
  • Σ={a, b, c, … , z}, the set L of all legal English words is a language over Σ.
• The set {ε} is a language over any alphabet.
• What is the difference between ∅ and {ε}?

Languages

• Other Examples:
  • Σ={0, 1}, L={0^n 1^n | n≥1} is a language over Σ consisting of the strings {01, 0011, 000111, … }
  • Σ={0, 1}, L = {0^i 1^j | j≥i≥0} is a language over Σ consisting of the strings with some 0’s (possibly none) followed by at least as many 1’s.
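Membership in these two example languages can be tested with a few lines of code. This is a sketch: a regular expression splits the string into its block of 0’s and block of 1’s, and the comparison of the block lengths is done in plain Python, because that counting step is what a regular expression alone cannot express. The names `in_L1` and `in_L2` are hypothetical.

```python
import re

def in_L1(w):
    """Membership in {0^n 1^n | n >= 1}."""
    m = re.fullmatch(r"(0+)(1+)", w)
    return bool(m) and len(m.group(1)) == len(m.group(2))

def in_L2(w):
    """Membership in {0^i 1^j | j >= i >= 0}."""
    m = re.fullmatch(r"(0*)(1*)", w)
    return bool(m) and len(m.group(2)) >= len(m.group(1))

for word in ["01", "0011", "001", "", "011", "10"]:
    print(repr(word), in_L1(word), in_L2(word))
```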

Problems

• In automata theory, a problem is to decide whether a given string is a member of some particular language.

• This formulation is general enough to capture the difficulty levels of all problems.

Finite Automata (or Finite State Machines)

• This is the simplest kind of machine.
• We will consider three types of Finite Automata:
  • Deterministic Finite Automata (DFA)
  • Non-deterministic Finite Automata (NFA)
  • Finite Automata with ε-transitions (ε-NFA)

Deterministic Finite Automata (DFA)

We have seen a simple example before:

[Figure: the two-state machine with states off (start) and on and transitions labelled 1]

There are some states and transitions (edges) between the states. The edge labels tell when we can move from one state to another.

Definition of DFA

• A DFA is a 5-tuple (Q, Σ, δ, q0, F) where
  • Q is a finite set of states
  • Σ is a finite input alphabet
  • δ is the transition function mapping Q × Σ to Q
  • q0 in Q is the initial state (only one)
  • F ⊆ Q is a set of final states (zero or more)

Definition of DFA

For example:

[Figure: the two-state machine with states off (start) and on and transitions labelled 1]

• Q is the set of states: {on, off}
• Σ is the set of input symbols: {1}
• δ is the transitions: off × 1 → on; on × 1 → off
• q0 is the initial state: off
• F is the set of final states (double circle): {on}
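The 5-tuple can be written down as a small data structure. The following is a minimal sketch, assuming δ is stored as a dictionary keyed by (state, symbol); the class `DFA` and the instance `switch` are illustrative names, not part of the lecture.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DFA:
    states: frozenset     # Q
    alphabet: frozenset   # Σ
    delta: dict           # δ as {(state, symbol): next_state}
    start: str            # q0
    finals: frozenset     # F

    def accepts(self, word):
        """Run the automaton on word and report whether it ends in F."""
        state = self.start
        for symbol in word:
            if symbol not in self.alphabet:
                raise ValueError(f"symbol {symbol!r} is not in the alphabet")
            state = self.delta[(state, symbol)]
        return state in self.finals

# The on/off example with Q = {on, off}, Σ = {1}, q0 = off, F = {on}.
switch = DFA(
    states=frozenset({"on", "off"}),
    alphabet=frozenset({"1"}),
    delta={("off", "1"): "on", ("on", "1"): "off"},
    start="off",
    finals=frozenset({"on"}),
)

print(switch.accepts("1"), switch.accepts("11"), switch.accepts("111"))
```

Storing δ as a dictionary makes determinism visible: each (state, symbol) key can hold only one successor state.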

Definition of DFA

Another Example:

[Figure: transition diagram with states q0 (start), q1 and q2 and edges labelled 0 and 1]

What are Q, Σ, δ, q0 and F in this DFA?

Transition Table

For the previous example, the DFA is (Q,Σ,δ,q0,F) where Q = {q0,q1,q2}, Σ = {0,1}, F = {q2} and δ is such that

  States | Input 0 | Input 1
  q0     | q1      | q0
  q1     | q2      | q0
  *q2    | q2      | q0

Note that there is exactly one transition for each input from each state.

Language of a DFA

• Given a DFA M, the language accepted (or recognized) by M is the set of all strings that, starting from the initial state, reach one of the final states after the whole string is read.
• For example, the language accepted by the previous example is the set of all strings over {0,1} that end with 00.
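A claim about the language of a DFA can be spot-checked by brute force over all short strings. In the sketch below the transition table is an assumption: it is the three-state table that is consistent with the statement that the accepted strings are those ending in 00. The names `DELTA`, `accepts` and `agrees_up_to` are hypothetical.

```python
from itertools import product

# Assumed transitions of the three-state example (q2 is the final state).
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q2", ("q1", "1"): "q0",
    ("q2", "0"): "q2", ("q2", "1"): "q0",
}
START, FINALS = "q0", {"q2"}

def accepts(word):
    """Follow one transition per symbol and test the final state."""
    state = START
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state in FINALS

def agrees_up_to(max_len):
    """Compare acceptance with 'ends with 00' for every string up to max_len."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            word = "".join(symbols)
            if accepts(word) != word.endswith("00"):
                return False
    return True

print([w for w in ["100", "1001", "1000", "111", "00"] if accepts(w)])
print(agrees_up_to(10))
```

Such a check is not a proof, since it covers only finitely many strings, but it quickly exposes a wrong transition or a wrong description of the language.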

DFA Example

Consider the DFA M=(Q,Σ,δ,q0,F) where Q = {q0,q1,q2,q3}, Σ = {0,1}, F = {q0} and δ is:

  States | Input 0 | Input 1
  q0     | q2      | q1
  q1     | q3      | q0
  q2     | q0      | q3
  q3     | q1      | q2

OR

[Figure: the same transitions drawn as a transition diagram with start state q0]

We can use a transition table or a transition diagram to specify the transitions. What input can take you to the final state in M?

EXAMPLE

Recognizing Terminators with SCFGs (stochastic context-free grammars)

• [Bockhorst & Craven, IJCAI 2001]

[Figure: a prototypical terminator: prefix c-u-c-a-a-a-g-g-, a base-paired stem closed by a loop, and suffix -u-u-u-u-u-u-u-u]

• a prototypical terminator has the structure above
• the lengths and base compositions of the elements can vary a fair amount

Terminator Grammar

  START → PREFIX STEM_BOT1 SUFFIX
  PREFIX → B B B B B B B B B
  STEM_BOT1 → tl STEM_BOT2 tr
  STEM_BOT2 → tl* STEM_MID tr* | tl* STEM_TOP2 tr*
  STEM_MID → tl* STEM_MID tr* | tl* STEM_TOP2 tr*
  STEM_TOP2 → tl* STEM_TOP1 tr*
  STEM_TOP1 → tl LOOP tr
  LOOP → B B LOOP_MID B B
  LOOP_MID → B LOOP_MID | λ
  SUFFIX → B B B B B B B B B
  B → a | c | g | u

t = {a,c,g,u}, t* = {a,c,g,u, λ}
Nonterminals are uppercase, terminals are lowercase

Three Key Questions

• How likely is a given sequence?
  • the Inside algorithm
• What is the most probable parse for a given sequence?
  • the Cocke-Younger-Kasami (CYK) algorithm
• How can we learn the SCFG parameters given a grammar and a set of sequences?
  • the Inside-Outside algorithm

OUTLOOK

The Chomsky Hierarchy

  Language | Grammar | Automaton | Recognition | Dependency | Operations | Biology
  Recursively Enumerable Languages | Unrestricted (Baa → A) | ? | Undecidable | Arbitrary ? | diagonalization | Unknown ?
  Context-Sensitive Languages | Context-Sensitive (At → aA) | Linear-Bounded | Exponential? | Crossing (parallel) | duplication, inversion, transposition | Pseudoknots
  Context-Free Languages | Context-Free (S → gSc) | Pushdown (stack) | Polynomial | Nested (antiparallel) | insertion | Hairpins
  Regular Languages | Regular (A → cA) | Finite-State Machine | Linear | Strictly Local (processive) | concatenation, disjunction, iteration (∗) | Transcription

More information online at medicalbioinformatics.de/teaching

Further questions: Tim Conrad, AG Medical Bioinformatics, www.medicalbioinformatics.de