DNA, RNA, Protein ... Homology


Sequence analysis

• Analysis of primary and secondary (not tertiary) structures
• Biological sequences. Central dogma.
• Similarities (orthologs, paralogs)
• Methods, algorithms (alignments, models)
• Databases (primary, secondary)

Sequences: DNA, RNA, protein ...

Genome: DNA
  ↓ transcription
Primary transcript: pre-mRNA, pre-ncRNA
  ↓ processing (splicing*, cleavage)
Processed transcript: mRNA, ncRNA (tRNA, rRNA ...)
  ↓ translation, modification
[a] Translated sequence: protein (amino acids). [b] Mature ncRNA
  ↓ protein cleavage ...
Mature protein.

[ESTs are nucleotide sequences; they may be unspliced, spliced ...]

* Spliceosomal splicing of pre-mRNA only occurs in eukaryotes.

SEQUENCE ANALYSIS: Where and why?

• Sequencing projects, assembly of sequence data
• Identification of functional elements in sequences
• Sequence comparison
• Classification of proteins
• Comparative genomics
• RNA structure prediction
• Protein structure prediction
• Evolutionary history

Alignments and database searches (Summary)

Common biological problem: We have a novel protein sequence. What can we infer from this sequence about the biological function of the protein?

* Sequence homology - BLAST, FASTA, SSEARCH
  Simple example: an unknown human protein is highly similar to a protein with known function from another organism
  => The human protein probably has the same function (it's a homolog: ortholog or paralog)
* Pattern/profile search - PROSITE (patterns), Pfam (profiles)
** Secondary structure prediction
** Prediction of transmembrane domains (~25% of all proteins are membrane bound!)

Comparing non-identical sequences
Protein sequence comparison - basic concepts

When two protein sequences are compared and the similarity is statistically significant, it is highly likely that the two proteins are evolutionarily related. There are two kinds of biological relationships:

Orthologs: proteins that carry out the same function in different species
Paralogs: proteins that perform different but related functions within one organism

Homology: orthologs & paralogs

Proteins are homologous if they are related by divergence from a common ancestor. Orthology describes genes in different species that derive from a common ancestor (= MouseA, ChickA, FrogA, which all come from the alpha-chain gene in the common ancestor). Paralogy describes homologous genes within a single species that diverged by gene duplication (= MouseA and MouseB).

Methods in sequence analysis

• Simple transformation/extraction
  a) Translation: RNA > protein
  b) Reverse translation: protein > RNA
  c) Splicing (removing introns in pre-mRNA, pre-rRNA ...)
• Comparison of primary sequences
  a) Identity: finding sites, pattern matches
  b) Alignments: non-identical seqs (pair/multiple/phylogeny)
• Analyzing for other properties
  a) statistical composition
  b) profile analysis (PSI-BLAST)
  c) HMMs (probabilities of aa in position, Pfam)
  d) higher order structure (secondary structure in RNA/prot)

Translation of sequences

• Different nucleotide sequences may translate into identical amino acid sequences.
• A nucleotide sequence may yield different amino acid seqs. (6 reading frames)
• Reverse translation does not give a unique nucleotide sequence.
• Different splicing of pre-mRNA: 1 gene - several proteins!

The (degenerate) Genetic code

UUU Phe F    UCU Ser S    UAU Tyr Y    UGU Cys C
UUC Phe F    UCC Ser S    UAC Tyr Y    UGC Cys C
UUA Leu L    UCA Ser S    UAA Stop*    UGA Stop*
UUG Leu L    UCG Ser S    UAG Stop*    UGG Trp W

CUU Leu L    CCU Pro P    CAU His H    CGU Arg R
CUC Leu L    CCC Pro P    CAC His H    CGC Arg R
CUA Leu L    CCA Pro P    CAA Gln Q    CGA Arg R
CUG Leu L    CCG Pro P    CAG Gln Q    CGG Arg R

AUU Ile I    ACU Thr T    AAU Asn N    AGU Ser S
AUC Ile I    ACC Thr T    AAC Asn N    AGC Ser S
AUA Ile I    ACA Thr T    AAA Lys K    AGA Arg R
AUG Met M    ACG Thr T    AAG Lys K    AGG Arg R

GUU Val V    GCU Ala A    GAU Asp D    GGU Gly G
GUC Val V    GCC Ala A    GAC Asp D    GGC Gly G
GUA Val V    GCA Ala A    GAA Glu E    GGA Gly G
GUG Val V    GCG Ala A    GAG Glu E    GGG Gly G

Translation:
AUGUUGGGUUGA=MLG*
||| | || | |
AUGCUAGGAUAA=MLG*

Reverse translation:
MLG* = AUG UUA GGU UAA    1
       AUG UUA GGU UAG    2
       AUG UUA GGU UGA    3
       ...
       AUG CUG GGG UGA   72
(1x6x4x3 possible seqs)

3rd position is not so important!

Changes that affect translation

AUGUUGGGUUGA=MLG*          original
AUGUUAGGUUGA=MLG*          same protein
AUGUUCGGUUGA=MFG*          one amino acid changed
AUGUGAGGUUGA=M*G*(=M*!)    premature Stop
AUG-UGGGUUGA=MWV(+GA.)     Frameshift => new AA seq

Last example: no Stop!
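The translation and reverse-translation examples above can be checked with a few lines of Python. This is a minimal sketch that assumes the Standard Code only; the names `CODON_TABLE`, `translate` and `count_reverse_translations` are illustrative and not taken from any package.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order
# (first base varies slowest, third base fastest).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna):
    """Translate an RNA string codon by codon; 1-2 trailing bases are ignored."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def count_reverse_translations(protein):
    """Number of distinct RNA sequences that encode the given protein string."""
    total = 1
    for aa in protein:
        total *= sum(1 for residue in CODON_TABLE.values() if residue == aa)
    return total


print(translate("AUGUUGGGUUGA"))           # MLG*
print(translate("AUGCUAGGAUAA"))           # MLG* (different RNA, same protein)
print(translate("AUGUGGGUUGA"))            # MWV  (frameshift after deleting one U)
print(count_reverse_translations("MLG*"))  # 72 = 1 x 6 x 4 x 3
```

The count for MLG* comes out as 72 because Met has one codon, Leu six, Gly four and Stop three.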

Open Reading Frame (ORF)

Forward reading frames (frames 1-3):
AUGUUGGGUUGA=MLG*
.UGUUGGGUUGA=CWV
..GUUGGGUUGA=VGL
...UUGGGUUGA= LG*

Backward reading frames (frames 4-6, on the reverse (minus) strand):
AUGUUGGGUUGA original
AGUUGGGUUGUA rev
UCAACCCAACAU +complement = STQH, QPN, ...

Example unknown RNA:

1 AUGUUCCGUCUCACGCUCACCAAACGGCUAGCCCGCGCUUCUGCACACGUCACUCCGUCG 60
  UACAAGGCAGAGUGCGAGUGGUUUGCCGAUCGGGCGCGAAGACGUGUGCAGUGAGGCAGC

Frames 1-3:
M F R L T L T K R L A R A S A H V T P S
C S V S R S P N G * P A L L H T S L R R
V P S H A H Q T A S P R F C T R H S V A

Frames 4-6:
H E T E R E G F P * G A S R C V D S R R
T G D * A * W V A L G R K Q V R * E T A
N R R V S V L R S A R A E A C T V G D G

Translation tables

• The coding for amino acids depends on species and/or nuclear/mitochondrial DNA.
• At least 17 translation tables exist:
  * The Standard Code
  * The Vertebrate Mitochondrial Code
  * The Yeast Mitochondrial Code
  * The Mold, Protozoan, and Coelenterate Mitochondrial Code and ...
  * The Invertebrate Mitochondrial Code
  * The Ciliate, Dasycladacean and Hexamita Nuclear Code
  * The Echinoderm and Flatworm Mitochondrial Code
  * The Euplotid Nuclear Code
  * ...

Tables with comments may be found at NCBI: http://www.ncbi.nlm.nih.gov/Taxonomy/
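The six reading frames of the short example can be reproduced as follows. This is a sketch assuming the Standard Code; the frame numbering (1-3 forward, 4-6 on the reverse complement, counted from its 5' end) is one common convention and may differ from the numbering used by a particular program. All function names are illustrative.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Complement every base, then reverse, giving the minus strand 5'->3'."""
    return rna.translate(COMPLEMENT)[::-1]


def translate(rna):
    """Translate complete codons only; leftover bases at the end are dropped."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def six_frames(rna):
    """Return {frame number: protein} for the three forward and three reverse frames."""
    minus = reverse_complement(rna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(rna[offset:])
        frames[offset + 4] = translate(minus[offset:])
    return frames


for frame, protein in six_frames("AUGUUGGGUUGA").items():
    print(frame, protein)
```

For `AUGUUGGGUUGA` this gives MLG*, CWV and VGL in the forward frames and STQH, QPN and NPT on the reverse strand.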

Translation tables (cont), examples

Example: The Vertebrate Mitochondrial Code (transl_table=2)

Differences from the Standard Code:
        Code 2    Standard
AGA     Ter *     Arg R
AGG     Ter *     Arg R
AUA     Met M     Ile I
UGA     Trp W     Ter *

Alternative Initiation Codon:
Bos: AUA
Homo: AUA, AUU
Mus: AUA, AUU, AUC
Coturnix, Gallus: also GUG.

Example: The Yeast Mitochondrial Code (transl_table=3)

Differences from the Standard Code:
        Code 3    Standard
AUA     Met M     Ile I
CUU     Thr T     Leu L
CUC     Thr T     Leu L
CUA     Thr T     Leu L
CUG     Thr T     Leu L
UGA     Trp W     Ter *
CGA     absent    Arg R
CGC     absent    Arg R

Big differences if start (initiation) and stop (termination) codons differ!

Ambiguous sequence notation

Nucleotide examples:
A or C, [AC]: symbol M
A or G, [AG]: symbol R
A or T, [AT]: symbol W
A or C or G, [ACG]: symbol V
... etc.

G A A A A C
G A G A T C
G C A A C C
G C G A G C
-----------
G[AC][AG]A[ATCG]C

The 4 sequence example may be written as a sequence: GMRANC,
or as a pattern: G-[AC]-[AG]-A-x(1)-C

Wildcard: x(N) represents N arbitrary symbols.
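The ambiguity notation above can be derived mechanically from a set of equal-length sequences. The sketch below uses the standard IUPAC nucleotide ambiguity symbols for DNA; the function names are illustrative.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("AT"): "W", frozenset("CG"): "S",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus(seqs):
    """One IUPAC symbol per column of the (ungapped, equal-length) sequences."""
    return "".join(IUPAC[frozenset(column)] for column in zip(*seqs))


def bracket_pattern(seqs):
    """Same information written with explicit [..] alternatives per column."""
    parts = []
    for column in zip(*seqs):
        letters = sorted(set(column))
        parts.append(letters[0] if len(letters) == 1 else "[" + "".join(letters) + "]")
    return "".join(parts)


seqs = ["GAAAAC", "GAGATC", "GCAACC", "GCGAGC"]
print(consensus(seqs))        # GMRANC
print(bracket_pattern(seqs))  # G[AC][AG]A[ACGT]C
```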

Identity (pattern matching)

• Finding short exact matches
  GAATTC - recognition site for enzyme EcoRI
  GDSGGP - typical of serine proteases (e.g. G-[DE]-S-G-[GS]-[SAPHV])
• Patterns for multiple matches
  GA-[AG]-L-[ST] : GA + A or G + L + S or T
    GAALS, GAGLS, GAALT, GAGLT match
  GA-x-G-[STLAG] : GA + any 1 aa + G + S or T or L or A or G
    100 different sequences match
  C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
    pattern for zinc finger proteins (millions of possible sequences)

Programs that use these kinds of patterns:
"Findpatterns" searches a sequence (or set of sequences) for a pattern.
"Motifs" searches a sequence for motifs present in the PROSITE database.
PROSITE has patterns for >1000 protein families.

Important: Match or no match - just true or false, no score!
("Profiles" have probabilities for different amino acids in certain positions.)

Pairwise alignments:

Global alignment
Considers similarity across the full extent of the sequences
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
| | ||||||| | |
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Local alignment (most common)
Considers regions of similarity in parts of the sequences only.
xxxxxxx
|||||||    region of similarity
xxxxxxx
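The PROSITE-style patterns listed above map almost directly onto regular expressions. The converter below is a sketch covering only the constructs that appear on this slide (plain letters, `[..]` alternatives, `x`, and `(n)` / `(n,m)` repeats); it is not a full PROSITE parser, and `prosite_to_regex` is a made-up name.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]+|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a dash-separated PROSITE-style pattern into a Python regex string."""
    pieces = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # x = any single residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low           # x(3)   -> .{3}
        else:
            repeat = "{%s,%s}" % (low, high)  # x(2,4) -> .{2,4}
        if repeat and len(core) > 1 and not core.startswith("["):
            core = "(?:" + core + ")"       # keep a repeat bound to a multi-letter run
        pieces.append(core + repeat)
    return "".join(pieces)


print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"))
print(bool(re.search(prosite_to_regex("G-[DE]-S-G-[GS]-[SAPHV]"), "AAGDSGGPLV")))

# Count how many 5-residue sequences match GA-x-G-[STLAG].
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
regex = re.compile(prosite_to_regex("GA-x-G-[STLAG]"))
hits = sum(1 for x in AMINO_ACIDS for last in AMINO_ACIDS
           if regex.fullmatch("GA" + x + "G" + last))
print(hits)
```

As on the slide, the result of a pattern search is only match or no match; nothing here produces a score.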

Comparing 2 sequences - Dotplot analysis

[Figure: dotplot of MAKLQGALGKRY against MAKIQGALAKRY. The diagonal is interrupted at 2 mismatches.]

Sequence alignment

M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y

Comparing 2 sequences - Gaps

[Figure: dotplot of MAKLQLGKRY against MAKLQGALGKRY. The diagonal is shifted where residues are missing.]

Sequence alignment

M A K L Q - - L G K R Y    (Gap)
* * * * *     * * * * *
M A K L Q G A L G K R Y

Comparing 2 sequences: What are gaps?

Gaps are results of mutations (changes in DNA) that occur during evolution.

For instance, consider this deletion mutation:

AACTTGACGGACTGGGCGTATCGGGCACCG    DNA
TTGAACTGCCTGACCCGCATAGCCCGTGGC
 N  L  T  D  W  A  Y  R  A  P     protein

After the deletion (Gap):

AACTTGACGCGGGCACCG
TTGAACTGCGCCCGTGGC
 N  L  T  R  A  P

Alignment report example

[Figure: graphical alignment report.]
Red lines = matches full sequence (high identity)
Purple lines = matches contain gap (good identity)

Best alignment = highest score!

Give scores for match, mismatch and gap (and gap extension).

What is better: mismatch or gap?

Calculate the best score for each position, then "trace back" to find the best alignment.

"Dynamic programming" algorithms.

Very slow algorithm, too slow for routine searches of large databases!

BLAST lists all matching "words"*

[Figure: Query and Subject sequences with the short word matches marked.]

For each short match, the program tries to extend in both directions.

* A word is 7-11 nucleotides or 3-.. aa
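The dynamic-programming idea described above (score every position, then trace back) can be sketched for a global alignment. The scores used here (match +1, mismatch -1, gap -1, no separate gap-extension score) are arbitrary assumptions chosen for illustration, not the defaults of any program.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row1, row2 = global_align("MAKLQLGKRY", "MAKLQGALGKRY")
print(best)
print(row1)
print(row2)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two sequence lengths, which is why heuristics such as BLAST and FastA are used against large databases.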

Improvement of speed as compared to local alignment algorithm:

Searching databases with BLAST
Initial search is for short words.
Word hits are then extended in either direction.
→ we only extend words that are in both sequences
→ fast, but gap can't be long between two close words

Searching databases with FastA
Initial search for short words.
Words are extended, but also linked if they are close!
→ slower, but longer alignments

An alignment that BLAST and FastA can't find!

  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Aligning two sequences - Gap extension penalty.
Alignment of genomic sequence with mRNA (Global alignment!)

Alignment of the following two sequences: V00594 (Human mRNA for metallothionein) and J00271 (corresponding genomic sequence).

Default setting → Extend gap = 3
[Figure: alignment with the default setting.]

New settings: Extend gap = 0
[Figure: alignment with the new setting; Exon 1, Exon 2 and Exon 3 are marked.]

In a global alignment all residues are matched.

Output from Blast

BLASTP 2.0.11 [Jan-20-2000]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq
         (75 letters)

Database: nr
           457,798 sequences; 140,871,481 total letters    (size of database)

Searching...... done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...   126  2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...   126  2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...    74  1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                46  3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                36  0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...    29  3.6

Score column: your score. E Value column: expected number of random hits.

E-value, as important as score!

Expect Value
E = number of database hits with at least this score that you expect to find by chance.
High score → low E-value.
Small database = few random hits. Big database = many random hits!
The E-value is an expected count for a database of this size, not a probability. The same score gives a lower E-value in a small database and a higher E-value in a big one.

Alignments

>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
          Length = 65

 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64

In protein alignments some mismatches are marked "similar" (+).
Substitution matrices are used to score matches/mismatches!

Are there better/worse substitutions?

• From comparisons of known proteins, it is known that some changes/mutations are more frequent than others.
• Also, not all amino acids* are equally common ...
  If a rare amino acid is matched, it is more significant than if a common amino acid is matched.
• How can we give a score to a mismatch/match that is biologically significant?
  → substitution matrices

* There are 20 amino acids, but only 4 nucleotides!
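Returning to the Expect value in the BLAST report above: for a bit score S', the expected number of chance hits is approximately E = m * n * 2^(-S'), where m is the query length and n the total database length. The sketch below applies that approximation to the numbers in the report. Real BLAST uses length-corrected "effective" search spaces, so its reported E-values are somewhat smaller than this raw estimate; the function name is illustrative.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score, ignoring effective-length corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


QUERY_LENGTH = 75             # letters in ramp4.seq
DATABASE_LENGTH = 140871481   # total letters in nr at the time of the search

for bits in (126, 74, 46, 36, 29):
    print(bits, "%.1e" % expect_value(bits, QUERY_LENGTH, DATABASE_LENGTH))

# The same score in a database half the size: the E-value is halved as well.
full = expect_value(46, QUERY_LENGTH, DATABASE_LENGTH)
half = expect_value(46, QUERY_LENGTH, DATABASE_LENGTH / 2)
print(full / half)
```

Each additional bit of score halves the E-value, and doubling the database doubles it.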

BLOSUM 62 scores

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X

• Common amino acids have low weights
• Rare amino acids have high weights
• Negative for less likely substitutions
• Positive for more likely substitutions

Substitution matrices

Unitary matrices (nucleotide, protein)
All matches get '10', all mismatches '0'.
Used for nucleotide seqs. Bad protein hits due to identities by chance.

Point Accepted Mutation, PAM (proteins)
PAM30, PAM70 ... matrices. Based on evolutionary distance:
1 PAM = 1 point mutation / 100 residues.
Can't handle distant relationships well.

Blocks Substitution Matrix, BLOSUM (prots)
BLOSUM50, BLOSUM62 ... matrices.
Based on alignments in the BLOCKS db.
Sequence segments of a certain identity are clustered (BLOSUM62: >62% identity).
The most used matrices. BLOSUM62 is the default in BLAST.

Remember: Any substitution matrix is making a statement about the probability of observing a pair of aligned residues in real alignments!
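To see what a substitution matrix changes, the sketch below scores one ungapped alignment twice: with the unitary 10/0 scheme and with BLOSUM62. Only the handful of BLOSUM62 entries needed for this example are copied from the table above; `BLOSUM62_SUBSET` and the function names are illustrative.

```python
# Just the BLOSUM62 entries needed for the example, keys in alphabetical order.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("G", "G"): 6, ("K", "K"): 5, ("L", "L"): 4,
    ("M", "M"): 5, ("Q", "Q"): 5, ("R", "R"): 5, ("Y", "Y"): 7,
    ("I", "L"): 2,   # conservative substitution: small positive score
    ("A", "G"): 0,   # neutral substitution
}


def blosum_score(seq1, seq2):
    """Sum of matrix scores over the aligned columns (no gaps handled here)."""
    return sum(BLOSUM62_SUBSET[tuple(sorted(pair))] for pair in zip(seq1, seq2))


def unitary_score(seq1, seq2, match=10, mismatch=0):
    """Identity-only scoring: every match counts the same, every mismatch too."""
    return sum(match if x == y else mismatch for x, y in zip(seq1, seq2))


a = "MAKLQGALGKRY"
b = "MAKIQGALAKRY"
print(unitary_score(a, b))  # 100: ten identical columns, two mismatches ignored
print(blosum_score(a, b))   # 52: Y-Y (7) counts more than A-A (4); L/I still earns 2
```

The unitary score cannot tell the conservative L/I exchange from any other mismatch, whereas the matrix score can.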

Blast report

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

pir||F69494 (R)-hydroxyglutaryl-CoA dehydratase activator (hgdC)...   462  e-129
gb|AAD31675.1| (AF123384) (R)-2-hydroxyglutaryl-CoA dehydratase ...   233  1e-060
sp|P39383|YJIL_ECOLI HYPOTHETICAL 27.4 KD PROTEIN IN IADA-MCRD I...   184  9e-046
emb|CAA67409.1| (X98916) orf6 [Methanopyrus kandleri]                 170  1e-041
gb|AAF13150.1|AF156260_1 (AF156260) unknown [Methanosarcina bark...   143  2e-033
pir||A69117 activator of (R)-2-hydroxyglutaryl-CoA - Methanobact...   132  4e-030
pir||A72369 (R)-2-hydroxyglutaryl-CoA dehydratase activator-rela...   129  4e-029
gb|AAC23928.1| (U75363) benzoyl-CoA reductase subunit [Rhodopseu...   117  1e-025
pir||S04476 hypothetical protein (hdgA 5' region) - Acidaminococ...   104  1e-021
sp|P27542|DNAK_CHLPN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...    42  0.005
gb|AAC15473.1| (AF016711) heat shock protein 70 [Burkholderia ps...    39  0.036
pir||F75029 o-sialoglycoprotein endopeptidase (gcp) PAB1159 - Py...    38  0.082
pir||F72514 probable glucokinase APE2091 - Aeropyrum pernix (str...    37  0.18
sp|P42373|DNAK_BURCE DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...    37  0.18
emb|CAA10035.1| (AJ012470) mitochondrial-type hsp70 [Encephalito...    36  0.31
sp|P56836|DNAK_CHLMU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...    36  0.41
gb|AAF39496.1| (AE002336) dnaK protein [Chlamydia muridarum]           36  0.41
pir||B70189 rod shape-determining protein (mreB-1) homolog - Lym...    36  0.41
sp|O57716|GCP_PYRHO PUTATIVE O-SIALOGLYCOPROTEIN ENDOPEPTIDASE (...    36  0.54
sp|O33522|DNAK_ALCEU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...    36  0.54
ref|NP_012874.1| Ykl050cp >gi|549677|sp|P35736|YKF0_YEAST HYPOTH...    36  0.54
emb|CAA53420.1| (X75781) D513 [Saccharomyces cerevisiae] >gi|158...    36  0.54
sp|P30722|DNAK_PAVLU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) >gi|99...    36  0.54
pir||A40158 dnaK-type molecular chaperone - Chlamydia trachomati...    34  1.2
gb|AAF07742.1|AE001584_39 (AE001584) hypothetical protein [Borre...    34  1.6
gb|AAF07521.1|AE001577_35 (AE001577) hypothetical protein [Borre...    34  1.6
gb|AAF38963.1| (AE002276) cell shape-determining protein MreB [C...    34  2.1
gb|AAG08147.1|AE004889_10 (AE004889) DnaK protein [Pseudomonas a...    33  2.7
dbj|BAB03215.1| (AB017035) dnaK [Bacillus thermoglucosidasius]         33  2.7
sp|P43736|DNAK_HAEIN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...    33  2.7
sp|P45554|DNAK_STAAU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...    33  2.7
sp|Q58303|FLA3_METJA FLAGELLIN B3 PRECURSOR                            32  4.7
gb|AAG08239.1|AE004898_10 (AE004898) phosphoribosylaminoimidazol...    32  6.1

Bad scores/E-values might sometimes not matter.

Evolution of protein genes: secondary and tertiary structure conserved

ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
M A K L E K L N Q A G L M V A G

60% nucleotide identity
ATGGCTAGGTTGGAGAAGAUAAACCAAGCTGGGATAATAGTTGCAGGA
M V R L E K I N Q A G L L V A G
69% amino acid identity

M V R I Q K I N E K G A L L A G
38%

Q V R I Q K I Y E K G A L L A A
19% ('twilight zone')

Q V R I Q K I Y E K T A L L F A
6% ('midnight zone')

Low complexity sequence tends to
(1) increase the number of non-specific hits to database sequences
(2) correspond to regions in proteins not associated with a known biological function (typically unstructured parts of the protein)

Therefore, low complexity parts are filtered out by default in BLAST searches. (Don't use filtering if you want exact matches.)

  1 MSAAPVQDKDTLSNAERAKNVNGLLQVLMDINTLNGGSSDTADKIRIHAKNFEAALFAKS 60
 61 SSKKEYMDSMNEKVAVMRNTYNTRKNAVTAAAANNNIKPVEQHHINNLKNSGNSANNMNV 120
121 NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK 180
181 QLLQRIPNIPPNINTWQQVTALAQQKLLTPQDMEAAKEVYKIHQQLLFKARLQQQQAQAQ 240
241 AQANNNNNGLPQNGNINNNINIPQQQQMQPPNSSANNNPLQQQSSQNTVPNVLNQINQIF 300
301 SPEEQRSLLQEAIETCKNFEKTQLGSTMTEPVKQSFIRKYINQKALRKIQALRDVKNNNN 360
361 ANNNGSNLQRAQNVPMNIIQQQQQQNTNNNDTIATSATPNAAAFSQQQNASSKLYQ

Blast variants:

           Query      Database
blastp     Protein    Protein
blastn     DNA        DNA
tblastn    Protein    DNA
blastx     DNA        Protein
tblastx    DNA        DNA

Example:
Searching a new genome assembly for a protein homolog.
Input: protein.
Database: DNA (genome sequences)
→ tblastn

Rules of database searches (like BLAST)

• Database sequence searches involving proteins should be carried out at the protein level and not at the DNA level *
• Use the smallest possible database (not too small though)
• Sequence statistics should be used rather than percent identity/similarity as criterion for homology
• Consider different scoring matrices and gap penalties

* 1) DNA sequences encoding the same protein sequence can be very different, due to the degeneracy of the genetic code.

TTTCGATTCTCAACAAGAAGC
**  * **    **  *   *
TTCAGGTTTAGCACGCGGTCC
 F  R  F  S  T  R  S

2) For nucleotide-nucleotide searches, it is often good to set the word size low (-W 7)

BLAST at NCBI

[Figure: NCBI BLAST web page (tblastn).]
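The first rule can be illustrated with the two DNA sequences shown above: they share fewer than half of their nucleotides but encode the same peptide. A minimal sketch, assuming the Standard Code; the helper names are illustrative.

```python
from itertools import product

# Standard genetic code for DNA codons, enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate frame 1 of a DNA string; incomplete trailing codons are dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(seq1, seq2):
    """Number of columns with the same symbol in two equal-length sequences."""
    return sum(1 for x, y in zip(seq1, seq2) if x == y)


dna1 = "TTTCGATTCTCAACAAGAAGC"
dna2 = "TTCAGGTTTAGCACGCGGTCC"

print(identical_positions(dna1, dna2), "of", len(dna1), "nucleotides identical")
print(translate(dna1), translate(dna2))
```

At the DNA level only 9 of 21 positions agree, while the protein-level comparison is a perfect match (FRFSTRS in both cases).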

BLAST output at NCBI

[Figure: graphical overview from the NCBI BLAST results page, with these annotations:]
• Best hit
• Next best hit
• 1 perfect hit, some hits with parts of sequence matched
• "HSP" = high scoring pair - there may be several!
• Alignments below

BLAST output, with many HSPs

gb|CM000011.1| Canis familiaris chromosome 11, whole genome shot...    86  9e-15

>gb|CM000011.1| Canis familiaris chromosome 11, whole genome shotgun sequence
          Length = 75769841

 Score = 85.7 bits (43), Expect = 9e-15
 Identities = 89/102 (87%), Gaps = 3/102 (2%)
 Strand = Plus / Minus

Query: 4        cgtgctgaaggcctgtatcctaggctacacactgaggactctgttcctcccctttccgcc 63
Sbjct: 53542401 cgtgctgaaggcctgtttcctaggctacagacggaggact-tgttcctta--tttgcgcc 53542345

Query: 64       taggggaaagtccccggacctcgggcagagagtgccacgtgc 105
Sbjct: 53542344 taggggaaagtccccggacccttggcagagagtgccgcgtgc 53542303

 Score = 75.8 bits (38), Expect = 9e-12
 Identities = 75/86 (87%), Gaps = 1/86 (1%)
 Strand = Plus / Minus

Query: 181      ggggcgtcatccgtcagctccctctagttacgcaggcagtgcgtgtcc-gcgcaccaacc 239
Sbjct: 53542216 ggggcgtcgtccgtcaactctatctagttacgcaggcagcgcgcctggtgcgcgccaacc 53542157

Query: 240      acacggggctcattctcagcgcggct 265
                ||||||||||||||||||||||||||
Sbjct: 53542156 acacggggctcattctcagcgcggct 53542131

 Score = 36.2 bits (18), Expect = 7.7
 Identities = 18/18 (100%)
 Strand = Plus / Minus

Query: 25       aggctacacactgaggac 42
                ||||||||||||||||||
Sbjct: 42727936 aggctacacactgaggac 42727919

Note: Only the best HSP is shown in the list before the alignments. Check the positions to understand in which order the HSPs match. The strand must be the same!

Databases at NCBI available for BLAST searches

Protein sequence databases
nr           All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
swissprot    the last major release of SWISS-PROT

DNA sequence databases
nr           All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
dbest        Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions

You may also blast against single genomes ...

Multiple alignments - applications

• Identify conserved motifs - patterns (PROSITE)
• Profiles (Pfam)
• Phylogenetic studies
• Prediction of protein secondary structure
• Experimental: design of probes

Multiple sequence alignment programs (CLUSTALW, PileUp, T-coffee ...)

PILEUP

PileUp does a series of progressive, pairwise alignments between sequences and clusters of sequences to generate the final multiple alignment. A cluster consists of two or more already-aligned sequences.

PileUp begins by doing pairwise alignments that score the similarity between every possible pair of sequences. These similarity scores are used to create a clustering order that can be represented as a dendrogram. The clustering strategy represented by the dendrogram is called UPGMA, which stands for unweighted pair-group method using arithmetic averages (Sneath, P.H.A. and Sokal, R.R. (1973) in Numerical Taxonomy (pp. 230-234), W.H. Freeman and Company, San Francisco, California, USA).

The dendrogram shows the order of the pairwise alignments of sequences and clusters of sequences that together generate the final alignment.

Trees

[Figure: MSA and the resulting tree. For example: 3 large groups (SRP54, SRPR, FtsY).]

Multiple alignment software
• Pileup (GCG)
• Clustalw / Clustalx
• MSA (program that in principle finds the true optimal multiple alignment by the dynamic programming method)
• T-coffee

[Figures: Clustalx, njplot]

Multiple alignment editors/viewers
• SeqLab (GCG)
• MACAW (search for motifs, blocks)
• Jalview
• CINEMA
• Genedoc
• Bioedit
• Boxshade

Colours of amino acids according to type: charged, hydrophobic ... Makes it easier to see matches.

How to find homologs with low sequence identity

• Sequence identity is high if evolutionary distance is small, but low if the distance is big.
• Many amino acid positions change.
• An amino acid may be substituted differently in different species.
• If we have many known homologs, we can search with "all of them" as queries, but the unknown sequence may have yet another set of substitutions compared to the known homologs.
→ align known sequences and make a "profile"

Position Specific Substitution Rates

[Figure: alignment with two serine positions marked: "Typical serine" and "Active site serine".]

Position Specific Score Matrix (PSSM)

Example sequence (rows) against amino acids (columns):

          A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D     0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G    -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V    -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I    -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S    -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S     4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C    -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N    -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G    -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D    -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S    -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G    -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G    -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P    -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L    -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N    -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C     0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q     0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A    -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Serine is scored differently in these two positions: 211 and 216 (active site nucleophile).
How does Serine score in positions 211 and 216?

PSIBLAST - a more sensitive BLAST!

PSI-BLAST is an important tool to identify remote protein similarity. It proceeds by way of the following steps:

(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program.
(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query.
(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.
(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.
(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.

Profile-alignment statistics allow PSI-BLAST to proceed as a natural extension of BLAST; the results produced in iterative search steps are comparable to those produced from the first pass.

Advantage: Unlike most profile-based search methods, PSI-BLAST runs as one program, starting with a single protein sequence, and the intermediate steps of multiple alignment and profile construction are invisible to the user.
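The question about serine can be answered by a simple lookup in the PSSM. The sketch below copies six rows (positions 211-216) from the matrix above and scores residues against them; `PSSM_ROWS` and the function names are illustrative, and scoring an ungapped window is a simplification of what PSI-BLAST actually does.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Rows 211-216 of the PSSM shown above (position -> scores in COLUMNS order).
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    212: [-4, -7, -6, -7, 12, -7, -7, -5, -6, -5, -5, -7, -5, 0, -7, -4, -4, -5, 0, -4],
    213: [-2, 0, 2, -1, -6, 7, 0, -2, 0, -6, -4, 2, 0, -2, -5, -1, -3, -3, -4, -3],
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def position_score(position, residue):
    """Score of one residue at one profile position."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]


def window_score(start, peptide):
    """Ungapped score of a peptide placed at consecutive profile positions."""
    return sum(position_score(start + offset, aa) for offset, aa in enumerate(peptide))


print(position_score(211, "S"))   # 4: serine is one of several accepted residues here
print(position_score(216, "S"))   # 7: at the active site serine is strongly preferred
print(position_score(216, "A"))   # -2: alanine is penalised at 216 ...
print(position_score(211, "A"))   # 4:  ... but scores as well as serine at 211
print(window_score(211, "SCNGDS"))
```

A single BLOSUM62 entry would give S-S the same score at both positions; the profile is what makes the score position specific.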

PSI-BLAST creates profiles automatically

[Figure: 1st BLAST round → profile → 2nd BLAST round → profile → 3rd BLAST round; hits above the threshold enter the profile.]

When no more new sequences are found, the search terminates.

Problem: If bad sequences enter the profile, it finds only trash!

Example of homology: SRP9/14/21

• SRP9 & SRP14 are related (common ancestor)
• SRP9 is not found in Fungi (but SRP21 is)
• But weak SRP9 hit in the fungus S.pombe (YE07)
• Weak similarity SRP9 S.pombe & SRP21
• Make a profile of known SRP21 sequences and search a database of all known proteins! Can we detect any similarity SRP9/21?

Profilesearch - based on Saccharomyces SRP21 sequences

    Sequence     ZScore   Orig    Length  Comment
 1. S_BAY        84.40   349.21   146     SRP21 S. bayanus
 2. S_PAR        83.75   351.99   169     Paradoxus
 3. S_KUD        83.71   346.53   146     Kudria
 4. SR21_YEAST   82.41   346.02   166     P32342 saccharomyces cerevisiae
 5. S_MIK        82.06   339.92   145     Mikatae
 6. S_KLU        75.51   314.59   145     Kluyveri
 7. S_CAS        74.91   308.02   125     Castellii
 8. C_ALB        21.67   107.92   168     Candida
 9. N_CRA        12.74    74.20   197     Neurospora
10  YE07_SCHPO    9.61    58.63   120     O13804 schizosaccharomyces pombe
11  CD3D_RAT      8.74    57.34   173     P19377 rattus norvegicus (rat).
12  ARP2_PLAFA    8.52    65.04   451     P13824 plasmodium falciparum.
13  Q23147        8.50    60.12   284     Q23147 caenorhabditis elegans.
14  SR09_ARATH    8.45    53.56   103     Q9smu7 arabidopsis thaliana (mouse-ear
15  Q8K2G5        8.45    60.59   306     Q8k2g5 mus musculus (mouse). riken cdna
16  Q8BFQ4        8.40    60.59   313     Q8bfq4 mus musculus (mouse).
17  Q8I562        8.39    64.65   459     Q8i562 plasmodium falciparum (isolate 3d7).
18  AAH44174      8.28    60.11   313     Aah44174 brachydanio rerio (zebrafish)
19  SR09_MAIZE    8.17    52.51   103     O04438 zea mays (maize). signal recognition
20  CD3D_MOUSE    8.11    54.85   173     P04235 mus musculus (mouse). t-cell surface
21  SR09_CAEEL    7.90    50.47    76     P34642 caenorhabditis elegans. signal

Green box = sequences in profile (should be first!)
Yellow box = unknown SRP21 (incl YE07 from S.pombe)
Red box = SRP9 sequences (Best hits in db of >1 million proteins!)

SRP21 aligned to SRP9 & 14

[Figure: alignment of SRP21, SRP9 and SRP14 sequences, with an unaligned box.]

• Residues marked according to similarity in sequence and chemical properties.
• SRP9/14 secondary structure (Birse et al.) shown as cylinders (alpha helices) and arrows (beta strands).
• The most conserved residues are in secondary structure elements. SRP9, SRP21 more similar.
• Secondary structure prediction by PSI-Pred also showed the conserved structure.

Proteins share domains

• In primary sequence searches the found proteins are aligned because they share domains
• If the sequences are very different outside the shared domain, they may be paralogs.
• The next example shows a MSA in which the middle part is a GTPase domain. The first or last part is missing ...

[Figure: MSA from N-terminal to C-terminal.] Two different proteins (4+4 sequences) are aligned. They share a domain.

Pfam - protein domains DB

• From multiple alignments of many related proteins, profiles (HMMs) are made
• Input a sequence, match to all families/HMMs.
• Known sequences are in Pfam database.

Pfam DB: Karolinska Inst., Sanger (UK), St. Louis (USA), Pasteur (F)

Pfam model amino acid probability plot in the "structure logo" style

[Figure: Structure logo for Pfam motif trypsin (only part of the model shown). x-axis: positions in the model.]

The size of the letters = probability of finding that amino acid in the position.
In these positions, some amino acids are much more common than others.

Search a sequence for matches to Pfam models

For a known protein one may use the UNIPROT accession to get a precomputed alignment.

If the protein is not in the database, just input the sequence ...

The HMMER package is used for searching sequences against one or all Pfam models (or a model that you have made yourself).

As in BLAST searches, you get a score, E-value and an alignment.

Searches may be done at Pfam WWW.

Search at Pfam (Sanger)

HMM file:       /dbs/pfam/Pfam_ls
Sequence file:  pop3_spombe
------
Query sequence: gi|3560259|emb|CAA20744.1|
Accession:      [none]
Description:    SPCC16C4.05 [Schizosaccharomyces pombe]

Scores for sequence family classification (score includes all domains):
Model         Description            Score    E-value   N
------
RNase_P_pop3  RNase P subunit Pop3   332.1    8.6e-97   1

Parsed for domains:
Model         Domain  seq-f  seq-t     hmm-f  hmm-t     score   E-value
------
RNase_P_pop3    1/1       7    165 ..      1    175 []  332.1   8.6e-97

Alignments of top-scoring domains:
RNase_P_pop3: domain 1 of 1, from 7 to 165: score 332.1, E = 8.6e-97
                   *->KrkQvyKPVLeNPytNEAkLWPhVtdqklvlELLqekvlkklvhalk
                      K+kQ++K+VL+NP++++   WP+V+++      +qek++++lv++l+
  gi|3560259     7    KVKQTVKLVLRNPLSIS---WPIVDAN------TQEKLAQTLVQWLP 44

                      ashKgneesevtvGfNeivelLsraCsesddvTQPAVvlfvcnkDgtPsv
                      ashK++++s++tvG+N+++elL+r+C++++dvTQPAVv++++++D   s+
  gi|3560259    45    ASHKDILDSKLTVGLNSVNELLERCCQNAKDVTQPAVVFILHDQD---SM 91

                      LlsQlPLLvavanltGSSKVkLVqLpksaqakfdehlGlskavHDGmlLv
                      L++++P+Lva+an++GSSK++LV+L++saqa+++++lGls+a   G+++v
  gi|3560259    92    LVTHMPQLVANANFYGSSKCRLVPLGFSAQALIAKKLGLSRA---GAIAV 138

                      rkdasldksfadlvdskvEepqiPWLep<-*
                      ++d++l+k+++dlv++ +Eepq++WL++
  gi|3560259   139    QDDSPLWKYLKDLVMN-IEEPQARWLSE    165

WWW Pfam results

[Figure: Pfam WWW result page. Labels: "Good match (RNaseP_pop3)" and "Matches below threshold".]

Pfam notes ...

• Even though the Pfam alignments are curated, they may contain sequences that are very different from your sequence => bad score and E-value!
• If your sequence gets a very high score and good E-value, it probably is in the alignment that was used to create the model.
• Pfam B models are made “automatically” and not curated (use with caution)
• Some Pfam models are domains, others are almost complete proteins ...

SRS: InterPro – search all domain DB

[Figure: Seq input searched against PROSITE, Pfam, PRODOM, PRINTS, SMART ...]

InterPro had a “bad” reputation some years ago, but it is a good idea!

Transmembrane prediction

• 25% of all proteins are membrane bound
• By comparing known transmembrane proteins, programs like TMHMM make predictions. Some use neural networks that are trained on known TM proteins.
• Other methods can be combined to get a higher specificity of TMHMM predictions or other programs (all methods have a flaw somewhere)

TMHMM output

RF47_[Guillardia  len=68 ExpAA=37.41 First60=32.65 PredHel=2 Topology=i2-19o47-64i
ORF74_[Odontella  len=74 ExpAA=39.05 First60=32.92 PredHel=2 Topology=i2-24o48-65i
ORF71_[Porphyra   len=71 ExpAA=36.0  First60=26.14 PredHel=2 Topology=i7-24o53-70i
ORF70_[Chlorella  len=70 ExpAA=38.67 First60=32.40 PredHel=2 Topology=i2-21o45-67i
------
PredHel=2 (= 2 TM dom)
Topology=i2-21o45-67i
inside-TRANSMEMBRANE-outside-TRANSMEMBRANE-inside

[Figure: TMHMM probability plot.] Example in which scores for first TM domain are too low.

PSI-pred: secondary structure

Output sent by mail:

[Figure: PSI-pred output. One row gives the confidence in prediction of each residue.]

EEE = beta strand, HHH = alpha helix, CCC = coil (“normal”)

Link to image files.

http://www.psipred.net
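The TMHMM summary lines shown above (name followed by `len=`, `ExpAA=`, `First60=`, `PredHel=` and `Topology=` fields) are easy to take apart in a script. The following is a rough sketch, assuming only the line layout visible on the slide; the function names `parse_short_line` and `parse_topology` are hypothetical helpers and not part of TMHMM.

```python
import re


def parse_topology(topology):
    """Split a topology string such as 'i2-19o47-64i' into helices.

    Each helix is reported with its start and end residue and the side
    ('inside' or 'outside') of the loop that precedes it.
    """
    helices = []
    for side, start, end in re.findall(r"([io])(\d+)-(\d+)", topology):
        helices.append({
            "start": int(start),
            "end": int(end),
            "preceding_loop": "inside" if side == "i" else "outside",
        })
    return helices


def parse_short_line(line):
    """Parse one summary line into a dict of its key=value fields."""
    tokens = line.split()
    record = {"name": tokens[0]}
    for token in tokens[1:]:
        key, value = token.split("=", 1)
        record[key] = value
    record["helices"] = parse_topology(record.get("Topology", ""))
    return record


example = ("ORF74_[Odontella len=74 ExpAA=39.05 First60=32.92 "
           "PredHel=2 Topology=i2-24o48-65i")
record = parse_short_line(example)
print(record["name"], record["PredHel"], record["helices"])
```

Reading `i2-24o48-65i` this way gives the same picture as the slide: inside, transmembrane helix, outside, transmembrane helix, inside.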

Looking for short sequences

• Sometimes you want to find out if there are short sequences (often called words) that are in a set of sequences. They may, for instance, be transcription factor binding sites ...
• Alignment programs won't find these ...
• MEME is a program that finds “words” of a specified length in a set of sequences.
• MAST may be used to search for known words

But what about RNA genes?

• RNA genes are genes that do not code for protein (they are not translated). They are usually called “noncoding RNAs”
• There are structural, catalytic and regulatory ncRNAs; few are conserved in all organisms
• Many ncRNAs are part of ribonucleoprotein complexes (RNPs)
• Some commonly known ncRNAs are: ribosomal RNAs (rRNA), transfer RNAs (tRNAs), signal recognition particle RNA (SRP RNA), ribonuclease P RNA (RNaseP RNA)
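To make the notion of a "word" shared by a set of sequences concrete, here is a naive sketch that lists all exact words of a given length present in every input sequence. It is not how MEME works (MEME fits a probabilistic motif model and tolerates mismatches); the function name `shared_words` and the toy sequences are assumptions made for illustration.

```python
def shared_words(sequences, k):
    """Return the set of exact length-k words found in every sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not sequences:
        return set()
    # One set of k-mers per sequence, then keep only the common ones.
    word_sets = [
        {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for seq in sequences
    ]
    return set.intersection(*word_sets)


# Hypothetical upstream regions that all contain the word TATAAA.
toy_sequences = ["ACGTATAAAGC", "TTTATAAACCG", "GGTATAAATTT"]
print(shared_words(toy_sequences, 6))
```

Exact matching like this misses degenerate sites, which is one reason programs such as MEME use a statistical model instead.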

ncRNAs are often not annotated

NC_006270.1     -TTGCCGTGCTAAGCGGGGAGGTAGCGGTGCCCTGTACTCGCAATCCGCTCGAGCGAGGC
X06802|BAC.SUB. NTTGCCGTGCTAAGCGGGGAGGTAGCGGTGCCCTGTACCTGCAATCCGCTCTAGCAGGGC
                 *************************************  *********** ***  ***

NC_006270.1     CGAATCCCTTTCTCGAGGTTCGTTTACTTTAAGGTCTGCCTTAAGCAAGTGGTGTTGACG
X06802|BAC.SUB. CGAATCCCTT-CTCGAGGTTCGTTTACTTTAAGGCCTGCCTTAAGTAAGTGGTGTTGACG
                ********** *********************** ********** **************

NC_006270.1     CTTGGGTCCTGCGCAATGGGAATCCATGAACCATGTCAGGTCCGGAAGGAAGCAGCATTA
X06802|BAC.SUB. TTTGGGTCCTGCGCAATGGGAATTCATGAACCATGTCAGGTCCGGAAGGAAGCAGCATTA
                 ********************** ************************************

NC_006270.1     AGTGGAACCTTCCATGTGCCGCAGGGTTGCCTGGGCTGAGCTAACTGCTTAAGTAACGCT
X06802|BAC.SUB. AGTGAAACCTCTCATGTGCCGCAGGGTTGCCTGGGCCGAGCTAACTGCTTAAGTAACGCT
                **** *****  ************************ ***********************

NC_006270.1     TAGGGTAGCGAATCGACAGAAGGTGCACGGTA
X06802|BAC.SUB. TAGGGTAGCGAATCGACAGAAGGTGCACGGTA
                ********************************

Sequence alignment of annotated SRP RNA from Bacillus subtilis and identified SRP RNA from the newly sequenced and “fully” annotated Bacillus licheniformis. Sequence identity = 94%! Still no SRP RNA is annotated. SRPDB is needed.

ncRNAs in the 3 Kingdoms of Life

Rfam: annotating non-coding RNAs in complete genomes. Sam Griffiths-Jones, Simon Moxon, Mhairi Marshall, Ajay Khanna, Sean R. Eddy and Alex Bateman. Nucleic Acids Res. 2005 33:D121-D124.
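A figure such as the 94% identity quoted for the B. subtilis / B. licheniformis SRP RNA alignment is obtained by counting identical columns in the pairwise alignment. The sketch below assumes one common convention (identical, non-gap columns divided by the alignment length); other definitions exist and give slightly different numbers. The function name `percent_identity` is hypothetical, and only the first 60-column block of the alignment is used as input.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns with identical, non-gap characters."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b)
        if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


# First block of the SRP RNA alignment (NC_006270.1 vs X06802).
ROW_A = "-TTGCCGTGCTAAGCGGGGAGGTAGCGGTGCCCTGTACTCGCAATCCGCTCGAGCGAGGC"
ROW_B = "NTTGCCGTGCTAAGCGGGGAGGTAGCGGTGCCCTGTACCTGCAATCCGCTCTAGCAGGGC"

print(round(percent_identity(ROW_A, ROW_B), 1))
```

The value for this single block differs from the overall figure on the slide, which refers to the full alignment.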

RNA structure

5’- GGGGAUGUAGCU UAGUGGUAGAGC AUUGGAGUUAUA AUCCGGAGGCGC GGGUUCGAAUCC CGUUAUCCCC -3’

[Figure: primary, secondary and tertiary structure of the RNA.]

• The ncRNAs basepair (G-C, A-U, G-U) creating secondary structure
• Mutations may maintain secondary structure (G-C => G-U => A-U)
• ncRNAs first fold into a secondary structure before adding tertiary interactions => The secondary structure must not change!

RNA: Conserved secondary structure
AU, GC base pairing create ”hairpins”

CAGGAAACUG  seq1
...|.||...
GCUGCAAAGC  seq2
|||||||...
GCUGCAACUG  seq3

Seq1 and seq2 are not similar, but they both have a hairpin structure, which is not shared by seq3!

[Figure: hairpin drawings of seq1, seq2 and seq3.]

The alignment of the primary sequences (structure) doesn’t give us any information.

Secondary structure pattern

Pattern:
h1 s1 h1’
h1 NNN:NNN
s1 GMAA

Note: M = [AC], N = [AUGC]. ”h” stands for helix, ”s” stands for strand.

[Figure: hairpin drawings of seq1 (loop GAAA) and seq2 (loop GCAA).]

Compensatory base changes maintain secondary structure. We need a way to specify the base pairing!

Programs that search for secondary structure: Patscan, RNAbob.
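The pattern above (a 3-bp helix h1/h1’ enclosing the single strand s1) can be checked with a few lines of code. This is a simplified sketch and not the Patscan or RNAbob implementation: the loop is written as G[AC]AA, i.e. M is read as the IUPAC code for A or C, which matches both example loops (GAAA in seq1, GCAA in seq2), and G-U pairs are allowed alongside G-C and A-U. The function name `find_hairpins` is made up for illustration.

```python
import re

# Allowed base pairs: Watson-Crick plus the G-U wobble pair.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}
LOOP = re.compile(r"G[AC]AA")   # s1: GMAA with M = A or C
LOOP_LEN = 4


def find_hairpins(seq, stem=3):
    """Return (start, subsequence) for every h1 s1 h1' match in seq."""
    hits = []
    span = 2 * stem + LOOP_LEN
    for i in range(len(seq) - span + 1):
        left = seq[i:i + stem]
        loop = seq[i + stem:i + stem + LOOP_LEN]
        right = seq[i + stem + LOOP_LEN:i + span]
        # The first base of h1 pairs with the last base of h1', and so on.
        paired = all((l, r) in PAIRS for l, r in zip(left, reversed(right)))
        if paired and LOOP.fullmatch(loop):
            hits.append((i, seq[i:i + span]))
    return hits


for name, seq in [("seq1", "CAGGAAACUG"),
                  ("seq2", "GCUGCAAAGC"),
                  ("seq3", "GCUGCAACUG")]:
    print(name, find_hairpins(seq))
```

seq1 and seq2 both match even though their primary sequences differ, while seq3 fails because its outermost bases (G and G) cannot pair. A pattern gives a yes/no answer only; it does not score how good a match is.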

Creating probabilistic covariance models from alignments (Rfam)

[Figure: tetraloop with stem.]

Both primary sequence and secondary structure conservation captured in probabilistic model.

COVE (Eddy 1994) used for creating models, searching.

Covariance models are equivalent to stochastic context-free grammars.

The Rfam database is the “Pfam for ncRNAs”

Multiple Sequence Alignments for ncRNAs must specify basepairings

Patterns: Hard to make large patterns and patterns that find new structures. Yes/No match, no scoring. Fast!

Models: covariance model of which bases appear together created automatically from alignment. Time-complexity: O(n^3). Slow!

Idea: Use smaller pattern to filter, use covariance model on filtered sequences => Fast and sensitive!

SRP RNA variants

[Figure: SRP RNA secondary structures from Bacteria, Archaea and Eukarya.]

Helix 8 is the only part found in all SRP RNA!

Helices H3 and H4 missing in yeast!

Comment: t1 and t2 depict tertiary interactions

ncRNA structure evolution

1. Mutations in ncRNAs maintain the secondary structure => primary sequence is poorly conserved => hard to detect similarities by primary sequence searches

2. Structure evolves by losing / adding helices => big gaps in alignments even when primary sequence is conserved

An example ...

Fungi SRP

[Figure: SRP RNA secondary structures with numbered helices for S. pombe, Candida albicans, Neurospora crassa and Yarrowia lipolytica.]

These RNAs are < 300 nts. Note that Yarrowia (bottom) has an extra helix.

BUT ... S.cerevisiae (length 519 nts) SRP RNA was not possible to fit to this type of SRP RNAs. How do we decide on the structure of this gene?

Comparative analysis of SRP RNA

Saccharomyces species RNA lack helices 3,4

Using the known SRP sequences from Saccharomyces cerevisiae as queries, regions of the genomes of S.paradoxus, S.mikatae, S.kudriavzevii, S.castelli and S.kluyveri were retrieved from Washington Univ., St. Louis.

By comparative analysis, SRP RNA sequences (453-547 nts) and structures were identified*.

The results showed that all species had large inserts in the helix 5 region, especially close to the small Alu domain, and that helix 7 also was variable.

* The secondary structures were predicted with MFOLD.

Saccharomyces helix insertions

[Figure: SRP RNA secondary structure of S. bayanus with numbered helices (Helix 7 labelled), shown next to Yarrowia lipolytica.]

Saccharomyces have a unique inserted part in helix 5 close to the Alu domain. This was found in all Saccharomyces species.

This structure is not in C.albicans, S.pombe.

MicroRNAs – regulatory ncRNAs

[Figure: miRNA hairpin.] Red part is the mature miRNA, the sequence is complementary to mRNA!

RNAi pathway

[Figure: RNAi pathway.] miRNA and siRNA work in a similar fashion. Cell. 2004 Apr 2;117(1):1-3.

Post-genomic Bioinformatics/Genomics
Cross-species genome comparisons

Cross-species genomic sequence conservation can be used
1. for discovery of new regions with regulatory functions
2. to enhance gene predictions, and
3. alternative splicing predictions (1 gene => >1 mRNA => >1 protein)
4. reveal transcription factor binding sites

Cross-species gene location conservation can be used for
1. identification of unknown ORFs (predicted proteins)
2. adding evidence for discovered new genes

Cross-species gene prevalence can be used for prediction of
1. the probability for the existence of a gene in a species (Keep looking!)
2. the function of a certain gene/protein/RNA (Is the product essential?)

And much more ... (We will show some examples later ...)

Mitochondria and Chloroplasts are endosymbionts

SRP component searches

The SRP pathway is conserved in all domains of life: Eukarya, Bacteria, Archaea.

All organisms have an SRP particle, but it looks different.

This is part of the secretory pathway.

Origin of photosynthetic organisms (have chloroplasts with own genome!)

Primary endosymbiosis: Cyanobacteria + Eukaryote => algae

Secondary endosymbiosis: algae + Eukaryote

Genome map of P.purpurea chloroplast at NCBI

We downloaded 26 chloroplast genomes and searched with pattern and model for bacterial SRP RNA.

Found SRP RNA gene candidates (low scores) in 8 chloroplast genomes

[Figure: phylogenetic tree with the Red algal group and the Green plant group marked.]

Odontella and Guillardia have chloroplasts of secondary endosymbiosis origin

Genome position for SRP RNA candidates in “green plant” group

[Figure: genome maps with conserved clusters.] The candidates in phylogenetically linked organisms are all found in this position. No overlap with known genes! (Conserved gene clusters are marked with ‘3’, ‘4’ ...)

rpoC-rpoB-trnC-RNA

Genome locations of SRP RNA candidates in chloroplasts

• 2 clear groups: Red algae and Green algae

Red algae (incl. secondary endosymbionts)
Porphyra purpurea      psaJ-apcD-RNA-fabH-(tRnaLeu)-psbX-accD-psbV
Cyanidioschyzon mer.   psaJ-apcD-RNA------(tRnaLeu)-psbX-accD-psbV
Cyanidium caldarium    psaJ-apcD-RNA------(tRnaLeu)------accD-psbV
Guillardia theta       psaJ------RNA------psbX------psbV
Odontella sinensis     (tRnaPhe)-RNA------psbX-p120-psbV

Green algae + ancestral streptophyta
Mesostigma viride      (ycf6)-RNA------(trnC)-rpoB-rpoC1-rpoC2
Nephroselmis olivacea  ndbH---RNA------(trnC)-rpoB-rpoC1-rpoC2
Chlorella vulgaris     p133---RNA-p134-(trnC)-rpoB-rpoC1-rpoC2

Some of these also contain rnpB (gene for RNase P RNA)

The predicted SRP RNAs have conserved promoters (as in cyanobacteria)

[Figure: promoter regions of the chloroplast candidates aligned with Cyanobacteria.]

Distances between –10 TATA box and sequence (5-8 nts), and promoter sequences are consistent with experimentally verified promoters in Prochlorococcus (Vogel et al. 2003)