Basics on bioinformatics Lecture 8

Nunzio D’Agostino

Protein domain: terminology

Superfamily: Proteins that have low sequence identity, but whose structural and functional features suggest a common evolutionary origin.

Family: Proteins clustered together into families are clearly evolutionarily related (accepted rule: pairwise residue identity between the proteins > 30%).

Domain: A domain is defined as a polypeptide chain, or a part of a polypeptide chain, that can independently fold into a stable tertiary structure. Domains are also units of function and are not unique to the protein products of one gene or one gene family but instead appear in a variety of proteins.

Motif: A pattern of amino acids that is conserved across many proteins and confers a particular function to the protein.

Site: The binding site where catalysis occurs (active site). The structure and chemical properties of the active site allow the recognition and binding of the substrate.
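The 30% identity rule quoted for families can be checked on a pairwise alignment. The sketch below is only illustrative: it assumes the two sequences are already aligned, uses "-" as the gap symbol, and counts identity over columns where neither sequence has a gap (other definitions of the denominator exist). The function names and example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def same_family(aligned_a: str, aligned_b: str, threshold: float = 30.0) -> bool:
    """Apply the > 30% pairwise identity rule of thumb."""
    return percent_identity(aligned_a, aligned_b) > threshold


# Hypothetical aligned pair: 7 gap-free columns, 5 of them identical.
print(percent_identity("MKV-LAAGT", "MRVQLA-GS"))
print(same_family("MKV-LAAGT", "MRVQLA-GS"))
```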

Why protein domain identification?

By identifying domains we can:
o classify a new protein as belonging to a specific family
o infer functionality
o infer cellular localization of a protein

Domain representation-patterns

Some biologically significant amino acid patterns (motifs) can be summarised in the form of regular expressions.

A regular expression is a powerful notational algebra that describes a string or a set of strings. It can be used whenever one wants to find patterns in strings.

The standard notations for describing regular expressions use these conventions:
[AS] = A or S allowed.
D = D allowed.
x = Any symbol.
x4 = Four arbitrary symbols.
{PG} = Any symbol except P and G.
[FY]2 = Two positions where F or Y allowed.
x(3,7) = Minimum 3 and maximum 7 residues.

Domain representation-patterns

MSA

Detect functionally important residues

Pattern as regular expression: [AVL]-L-[IV]-M-[TS]-C-[DE]-R-[FY]2-Q
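As a rough illustration, a pattern written in the notation above can be translated into a Python regular expression and used to scan a sequence. This is a minimal sketch, not the PROSITE scanner: the function name `pattern_to_regex` and the test sequence are hypothetical, and only the conventions listed in these slides are handled (including the bare repeat count, as in [FY]2 and x4).

```python
import re

# One pattern element: a residue, a [..] class, a {..} exclusion or x,
# optionally followed by a repeat written as (n), (n,m) or a bare number.
ELEMENT = re.compile(
    r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\)|(\d+))?"
)


def pattern_to_regex(pattern: str) -> str:
    """Translate a slide-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        symbol, low, high, bare = m.groups()
        if symbol == "x":
            core = "."                        # any symbol
        elif symbol.startswith("{"):
            core = "[^" + symbol[1:-1] + "]"  # any symbol except these
        else:
            core = symbol                     # single residue or [..] class
        if bare:
            core += "{" + bare + "}"
        elif low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


regex = pattern_to_regex("[AVL]-L-[IV]-M-[TS]-C-[DE]-R-[FY]2-Q")
hit = re.search(regex, "MKALIMTCDRFYQGG")  # hypothetical sequence
print(regex, hit.start(), hit.group())
```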

Domain representation-PSSM/profile

A PSSM is a Position Specific Scoring Matrix. A profile is one type of PSSM.

Sequence profile (Gribskov et al. 1987) is essentially a table that lists the frequencies of each amino acid in each position of protein sequence. Frequencies are calculated from multiple alignments of related sequences (containing a domain of interest).
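A frequency table of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions, not the method of Gribskov et al.: it expects a gap-free alignment block, adds a pseudocount of 1 to every amino acid, and turns frequencies into rounded log2-odds scores against a uniform background. The alignment and all function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=1.0):
    """Per-position amino acid frequencies for a gap-free alignment block."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_scores(profile, background=None):
    """Rounded log2(frequency / background) for every position and residue."""
    if background is None:
        # Assumption: uniform background, 1/20 for each amino acid.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    return [
        {aa: round(math.log2(column[aa] / background[aa])) for aa in AMINO_ACIDS}
        for column in profile
    ]


alignment = ["ALIM", "ALVM", "VLIM", "LLIM"]  # hypothetical aligned block
profile = column_frequencies(alignment)
scores = log_odds_scores(profile)
print(profile[1]["L"], scores[1]["L"], scores[1]["A"])
```

With only four sequences the pseudocount dominates, so the scores stay small; real profiles are built from much larger alignments and an empirical background.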

PSSM scores are generally shown as positive or negative integers. Positive scores indicate that the given amino acid substitution occurs more frequently in the alignment than expected by chance, while negative scores indicate that the substitution occurs less frequently than expected. Large positive scores often indicate critical functional residues, which may be active site residues or residues required for other intermolecular interactions.

Domain representation-Markov model

Markov model: a way of describing a process that goes through a series of states.

In a regular Markov model, the state is directly visible to the observer. Each state has a probability of transitioning to the other states.

Xk is the random variable for the state at step k.

States are ∈ {A, C, G, T}

State transition example (rows: state at the K-th step; columns: state at the (K+1)-th step):

       A     C     G     T
A     0.3   0.2   0.1   0.4
C     0.1   0.6   0.2   0.1
G     0.2   0.4   0.1   0.3
T     0.5   0.1   0.2   0.1
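The transition table can be used directly to score a nucleotide sequence. The sketch below is a minimal illustration with two assumptions that are not in the slides: the first nucleotide is drawn from a uniform distribution, and each row is rescaled to sum to 1 (the T row as printed sums to 0.9). Function names are hypothetical.

```python
STATES = "ACGT"

# Transition table as printed: rows are the state at step k,
# columns are the state at step k+1, in the order A, C, G, T.
RAW = {
    "A": [0.3, 0.2, 0.1, 0.4],
    "C": [0.1, 0.6, 0.2, 0.1],
    "G": [0.2, 0.4, 0.1, 0.3],
    "T": [0.5, 0.1, 0.2, 0.1],  # sums to 0.9 as printed
}


def normalise(table):
    """Rescale every row so that its probabilities sum to 1."""
    result = {}
    for state, row in table.items():
        total = sum(row)
        result[state] = {nxt: p / total for nxt, p in zip(STATES, row)}
    return result


TRANSITIONS = normalise(RAW)


def sequence_probability(seq, transitions=TRANSITIONS, initial=None):
    """Probability of a sequence under a first-order Markov chain."""
    if initial is None:
        # Assumption: uniform distribution over the first state.
        initial = {s: 1.0 / len(STATES) for s in STATES}
    prob = initial[seq[0]]
    for current, nxt in zip(seq, seq[1:]):
        prob *= transitions[current][nxt]  # depends on the current state only
    return prob


print(sequence_probability("ACG"))  # 0.25 * P(A->C) * P(C->G)
```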

Domain representation-hidden Markov model

In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states.

An essential characteristic of a Markov process is that the change is dependent only on the current state. (Bayes' theorem) The history of the system does not matter.

The states that the system has been in before are not relevant, only the current state determines what will happen next. The system has no memory.

The Hidden Markov model method was originally used in speech recognition before being applied to biological sequence analysis.

HMMs are used to represent sequence families. A particular type of HMM, the profile HMM, is suited to modeling multiple alignments.

Domain representation-hidden Markov model

Each oval shape represents a random variable that can adopt a number of values. The random variable x2 is the hidden state at time 2. From the diagram, it is clear that the value of the hidden variable x2 (at time 2) only depends on the value of the hidden variable x1 (at time 1). The arrows in the diagram denote conditional dependencies. Similarly, the value of the observed variable y2 only depends on the value of the hidden variable x2 (both at time 2).

Hidden Markov model: Each state x emits an output y, at a specific probability. We only know the output (observations). Thus, the states are hidden.
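Because the states are hidden, the probability of an observed output sequence has to be summed over every possible state path. A standard way to do this is the forward algorithm, sketched below. The two-state model ("core" and "surface" states emitting hydrophobic "h" or polar "p" residues) and all of its probabilities are hypothetical and chosen only for illustration.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations), summed over all hidden state paths."""
    # alpha[s] = P(observations so far, current hidden state is s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs]
            * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical model: hidden states x, observed outputs y.
STATES = ("core", "surface")
START = {"core": 0.5, "surface": 0.5}
TRANS = {
    "core": {"core": 0.8, "surface": 0.2},
    "surface": {"core": 0.3, "surface": 0.7},
}
EMIT = {
    "core": {"h": 0.9, "p": 0.1},
    "surface": {"h": 0.2, "p": 0.8},
}

print(forward("hp", STATES, START, TRANS, EMIT))
```

Each update uses only the previous step's values, which mirrors the dependency structure described above: x2 depends only on x1, and y2 only on x2.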

[Figure: chain of hidden states X1, X2, X3, X4, each emitting one observation Y1, Y2, Y3, Y4]

Basic Architecture of a profile HMM

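[Figure: profile HMM with Start and End states, delete states d1 d2 d3, insert states i0 i1 i2 i3 and match states m1 m2 m3; emission labels shown: C C Y, 0.01 0.5 0.01]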

Insert states: Model insertions of random letters between two alignment positions

Silent states: Model deletions, which correspond to a gap in the alignment

Transitions: States of neighboring positions are connected by transitions that indicate the possibility of going from one state to the other

Match State: Model the distribution of symbols in the corresponding column of an alignment

Domain representation-hidden Markov model

Given a multiple sequence alignment of a particular domain family, one uses statistical methods to build a specific HMM for that domain family.
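The lecture does not say which statistical methods are used, so the following is only a sketch of one common textbook heuristic: alignment columns with fewer than half gaps become match states, match emissions are estimated from column counts with a pseudocount of 1, and each aligned sequence is read as a path of match (M), delete (D) and insert (I) states. The alignment, thresholds and function names are all hypothetical.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as match states (gap fraction below threshold)."""
    n = len(alignment)
    return [
        col
        for col in range(len(alignment[0]))
        if sum(seq[col] == "-" for seq in alignment) / n < max_gap_fraction
    ]


def match_emissions(alignment, columns, pseudocount=1.0):
    """Emission probabilities of each match state, smoothed with a pseudocount."""
    emissions = []
    for col in columns:
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        emissions.append(
            {aa: (counts[aa] + pseudocount) / total for aa in ALPHABET}
        )
    return emissions


def state_path(aligned_seq, columns):
    """Label each aligned position as match (M), delete (D) or insert (I)."""
    match_set = set(columns)
    path = []
    for col, symbol in enumerate(aligned_seq):
        if col in match_set:
            path.append("M" if symbol != "-" else "D")
        elif symbol != "-":
            path.append("I")  # residue in a column that is not a match state
    return "".join(path)


alignment = ["AC-DE", "AC-DE", "ACWDE", "A--DE"]  # hypothetical alignment
columns = match_columns(alignment)
emissions = match_emissions(alignment, columns)
print(columns, [state_path(seq, columns) for seq in alignment])
```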

In contrast to patterns and profiles, HMMs allow consistent treatment of insertions and deletions.

In contrast to patterns and profiles, Markov models take into account the information about neighboring residues.

Protein domain db – PROSITE

PROSITE is a database of biologically significant patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. There are a number of protein families as well as functional or structural domains that cannot be detected using patterns due to their extreme sequence divergence; the use of techniques based on weight matrices (also known as profiles) allows the detection of such domains.

http://prosite.expasy.org

[Figure: example query ABL1_HUMN]

Protein domain db – PRINTS

The PRINTS database houses a collection of protein family fingerprints.

Fingerprints = A recognized and powerful method of classifying new protein families is to use conserved regions within multiple alignments of related proteins. Each homologous region is a "motif", and sets of motifs provide a signature or fingerprint for unique identification. These motifs usually denote a common structure and/or function between individual family members.

http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php

[Figure: example query ABL1_HUMN]

Protein domain db – Pfam

Pfam is a database of protein domain families. It contains curated multiple sequence alignments for each family and corresponding profile Hidden Markov Models (HMMs).

HMMs are built with HMMER, a hidden Markov model software package that is also available as a stand-alone program.

This database is made up of two parts:

o Pfam A: curated multiple alignments
   o Grows slowly
   o Quality controlled by experts

o Pfam B: automatically generated (ProDom derived)
   o Complements Pfam-A
   o New sequences instantly incorporated
   o Unchecked: false positives, ...

http://pfam.sanger.ac.uk

[Figure: example query ABL1_HUMN]

Protein domain db – ProDom

ProDom families are built by an automated process based on a recursive use of PSI-BLAST (Position-Specific Iterated BLAST) homology searches.

PSI-BLAST

PSI-BLAST is designed for more sensitive protein-protein similarity searches. Position-Specific Iterated (PSI)-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related proteins or new members of a protein family. Use PSI-BLAST when your standard protein-protein BLAST search either failed to find significant hits, or returned hits with descriptions such as "hypothetical protein" or "similar to...".

Protein domain db – SMART

SMART (Simple Modular Architecture Research Tool) domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. SMART alignments are optimised manually and then used to construct the corresponding hidden Markov models (HMMs).

http://smart.embl-heidelberg.de

[Figure: example query ABL1_HUMN]

Protein domain db – InterPro

The resources examined until now have different areas of optimum application owing to the different strengths and weaknesses of their underlying analysis methods.

Thus, for best results, search strategies should ideally combine all of them.

InterPro (The InterPro Consortium 2001) is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, non-redundant characterisation of a given protein family, domain or functional site.

The InterPro project home page is available at:

http://www.ebi.ac.uk/

Entry types in InterPro

o Family: group of evolutionarily related proteins that share one or more domains/repeats in common.
o Domain: independent structural unit which can be found alone or in conjunction with other domains or repeats.
o Repeat: region occurring more than once that is not expected to fold into a globular domain on its own.
o PTM (post-translational modification): the sequence motif is defined by the molecular recognition of this region in a cell.
o Active site: catalytic pockets of enzymes where the catalytic residues are known.
o Binding site: binds compounds but is not necessarily involved in catalysis.

InterProScan

InterProScan is a tool that combines different protein signature recognition methods native to the InterPro member databases into one resource with look up of corresponding InterPro and GO annotation.

Protein signature recognition methods:

-- Sequence-motif methods
PROSITE, home of regular expressions and profiles;
Gene3D, PANTHER, PIRSF, Pfam, SMART, SUPERFAMILY and TIGRFAMs, keepers of HMMs;
PRINTS, provider of fingerprints.

-- Sequence-cluster methods
ProDom uses PSI-BLAST to find homologous sequences, which are clustered in the same ProDom entry.

InterProScan: creation of InterPro

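[Figure: InterProScan workflow. Protein sequences are passed to InterProScan, which runs each method against its database; the results are combined, looked up in InterPro, and a list of InterPro hits is returned.]

Method             Database
ScanRegEXP         Prosite
ProfileScan        Prosite Profile
FingerPrintScan    PRINTS
HMMPfam            Pfam
BlastProdom        ProDom
HMMPfam            SMART
HMMPfam            TIGRFAMs
HMMPanther         PANTHER

Application of InterPro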

Diagnostic protein family signature database, useful for:
o Member databases themselves
o Enhancing the functional annotation of TrEMBL entries
o Classification of proteins through text and sequence search tools
o Large-scale classification using GO terms
o Enhancing genome annotation
o Proteome Analysis Database

Protein domain: summary

o Domains are the functional units of proteins
o Identifying a domain within a new protein may teach us much about it
o There are several types of models to represent domains
o These models can also be used to identify the domain they represent
o Many Internet databases are available to catalogue and identify families

HMMER

HMMER is used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs).

Compared to BLAST, FASTA, and other sequence alignment and database search tools based on older scoring methodology, HMMER aims to be significantly more accurate and more able to detect remote homologs because of the strength of its underlying mathematical models. In the past, this strength came at significant computational expense, but in the new HMMER3 project, HMMER is now essentially as fast as BLAST.

HMMER

o hmmalign - align sequences to an HMM profile
o hmmbuild - build a profile HMM from an alignment
o hmmcalibrate - calibrate HMM search statistics
o hmmconvert - convert between profile HMM file formats
o hmmemit - generate sequences from a profile HMM
o hmmfetch - retrieve an HMM from an HMM database
o hmmindex - create a binary SSI index for an HMM database
o hmmpfam - search one or more sequences against an HMM database
o hmmsearch - search a sequence database with a profile HMM
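A typical build-then-search run can be driven from Python. The sketch below only assembles the command lines for hmmbuild and hmmsearch and runs them if the programs are installed. It assumes a HMMER3 installation on the PATH; the file names (domain.sto, domain.hmm, proteins.fasta, hits.tbl) and the helper names are hypothetical.

```python
import shutil
import subprocess


def hmmbuild_command(hmm_out, alignment_file):
    """Arguments for building a profile HMM from a multiple alignment."""
    return ["hmmbuild", hmm_out, alignment_file]


def hmmsearch_command(hmm_file, sequence_db, table_out=None):
    """Arguments for searching a sequence database with a profile HMM."""
    cmd = ["hmmsearch"]
    if table_out is not None:
        cmd += ["--tblout", table_out]  # per-target hits as a text table
    return cmd + [hmm_file, sequence_db]


def run_if_available(cmd):
    """Run the command only if its executable is on PATH; report whether it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


# Hypothetical file names, printed rather than executed.
build = hmmbuild_command("domain.hmm", "domain.sto")
search = hmmsearch_command("domain.hmm", "proteins.fasta", table_out="hits.tbl")
print(" ".join(build))
print(" ".join(search))
```

Passing the arguments as a list avoids shell quoting problems with file names; call `run_if_available(build)` and then `run_if_available(search)` to execute the two steps.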