KP-28

Structural Feature Analysis of Human Olfactory Receptors Based on the Triplet Pattern

Chisato MORISHITA*, Hiroaki KATO

Department of Knowledge-based Information Engineering, Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku-cho, Toyohashi, Aichi 441-8580 Japan

1. Introduction

Odorant receptors (ORs) belong to the family of seven-transmembrane G protein-coupled receptors (GPCRs) and constitute the largest family (family A) in the genome [1]. Many ORs are orphan receptors, i.e. receptors whose endogenous ligands have not yet been identified, and their structures have not been experimentally determined. It is well known that particular peptide fragments in a protein are closely related to its function, but such fragments do not necessarily appear in a consecutive region of the amino acid sequence. In the present work, the authors propose a novel representation of protein structural features based on combinations of local patterns in the amino acid sequence, and apply it to the structural feature analysis of human ORs.

2. Method

2.1 Dataset

In the present work, the authors refer to the database of 907 GPCRs in [2]. It contains 377 ORs from sixteen different OR families. FASTA format files are used for the protein sequence data.

2.2 Definition of the triplet pattern of a protein sequence

To describe the characteristic pattern of an amino acid sequence, we define a triplet as a group of three consecutive amino acid residues, i.e. a residue together with its two immediate neighbors, which preserves the local context in the sequence. For a central glycine (G) residue, for example, there are 400 (= 20*20) triplets, such as G-G-G, A-G-G, G-G-A, and A-G-A. In total, 8,000 (= 20*20*20) triplets can be considered. For a given protein sequence with n residues, every residue except the N-terminal and C-terminal residues is scanned as a triplet center, which generates n-2 triplets. These are summarized either as an 8,000-dimensional vector holding the number of occurrences of each triplet, or as an 8,000-bit binary string recording the presence (1) or absence (0) of each triplet. We refer to such a vector or bit string as a triplet pattern. Figure 1 shows a schematic flow of the generation of a triplet pattern. The graphs at the bottom of the figure are visualizations of each triplet pattern.

[Figure 1 (schematic): a given protein sequence, <N-terminal> M R E G G G A Q N S T L …… F G G A I S S <C-terminal>, is scanned into the count vector T(x) = { 0, 1, 0, 0, 1, 0, 2, 0, 1, ……, 0, 1, 2, 0, 0, 1, 0, 2 } (8,000 dim), which is then binarized.]

Fig.1 Generation of the triplet pattern for a protein sequence.
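
The generation of a triplet pattern described in Section 2.2 can be sketched in Python as follows. This is only a minimal illustration, not the authors' implementation. The ordering of the 8,000 triplets (alphabetical one-letter codes), the function names, and the choice to skip windows containing non-standard residues are all assumptions, since the text does not specify them.

```python
from itertools import product

# Assumption: the 20 standard amino acids in alphabetical one-letter order.
# The source does not state how the 8,000 triplets are ordered in the vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPLET_INDEX = {
    "".join(t): i for i, t in enumerate(product(AMINO_ACIDS, repeat=3))
}


def triplet_counts(sequence):
    """Return the 8,000-dimensional occurrence vector of a one-letter sequence."""
    sequence = sequence.upper()
    counts = [0] * len(TRIPLET_INDEX)
    # Every interior residue is the center of one triplet: n - 2 windows.
    for pos in range(1, len(sequence) - 1):
        index = TRIPLET_INDEX.get(sequence[pos - 1:pos + 2])
        if index is not None:  # assumption: skip windows with non-standard residues
            counts[index] += 1
    return counts


def binarize(counts):
    """Turn the occurrence vector into a presence (1) / absence (0) pattern."""
    return [1 if c > 0 else 0 for c in counts]


example = "MREGGGAQNSTL"  # leading residues of the sequence drawn in Fig. 1
counts = triplet_counts(example)
bits = binarize(counts)
print(len(counts), sum(counts), sum(bits))  # prints "8000 10 10"
```

For this 12-residue fragment, the 10 (= 12 - 2) triplets are all different, so the occurrence vector and the bit string both have ten non-zero entries.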

2.3 Comparison of protein sequences using the triplet patterns

Here, the binary triplet pattern (bit string) is used for the comparison of protein sequences. For a pair of triplet patterns, a logical conjunction (bitwise AND) is computed, and the number of common bits, i.e. bits set in both patterns, is defined as the similarity measure between the two protein sequences. The Tanimoto coefficient is also used as a relative similarity score [3].

kato@tutkie.tut.ac.jp
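
The comparison described in Section 2.3 can be sketched as follows. This is a minimal illustration under stated assumptions: the bit string is stored as a Python integer, the bit ordering and function names are hypothetical, and the two toy sequences are made up for the example. The Tanimoto coefficient is computed in its usual bit-string form, c / (a + b - c), where a and b are the numbers of bits set in each pattern and c is the number of common bits.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues
RESIDUE_RANK = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def triplet_bits(sequence):
    """Encode the presence/absence triplet pattern as an 8,000-bit integer."""
    bits = 0
    for pos in range(1, len(sequence) - 1):
        window = sequence[pos - 1:pos + 2]
        # Assumption: windows with non-standard residues are skipped.
        if all(aa in RESIDUE_RANK for aa in window):
            a, b, c = (RESIDUE_RANK[aa] for aa in window)
            bits |= 1 << (400 * a + 20 * b + c)
    return bits


def popcount(bits):
    """Number of bits set to 1."""
    return bin(bits).count("1")


def tri_score(bits_a, bits_b):
    """Number of bits set in both patterns (logical conjunction)."""
    return popcount(bits_a & bits_b)


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient c / (a + b - c) of two bit patterns."""
    common = tri_score(bits_a, bits_b)
    union = popcount(bits_a) + popcount(bits_b) - common
    return common / union if union else 0.0


original = triplet_bits("MREGGGAQNSTL")  # toy sequence
shuffled = triplet_bits("AQNSTLMREGGG")  # same residues with the two halves swapped
print(tri_score(original, shuffled), round(tanimoto(original, shuffled), 3))
# prints "8 0.667"
```

In this toy case, swapping the two halves destroys only the two triplets that span the cut point (G-G-A and G-A-Q), so eight of the ten triplets are still shared.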

3. Results and Discussion

3.1 Structural similarity search based on triplet pattern

To validate the performance of our method, OR1A1 (UniProt: Q9P1Q5) and modified versions of it were prepared. The modifications are module shufflings of the protein sequence, for example by permutation of TM-helix 1 to TM-helix 3, and 5 to 7, and so on. For the original OR1A1 with 309 residues, 294 bits are set in the triplet pattern. The comparisons yield high similarity scores between the original and the shuffled sequences. In contrast, it is difficult to detect a set of common peptide fragments in such TM-helices using traditional global sequence alignment.

The similarity search based on the triplet pattern was carried out on the dataset of 377 ORs. The search result for the OR1A1 query is shown in Table 1, ordered by the similarity score (TriScore). It is known that OR1A1 and OR1A2 belong to the same OR subfamily and that their sequences are quite similar. For reference, a global alignment score obtained by dynamic programming (DpScore; PAM120 matrix, gap penalty -8) is also shown in Table 1. In general, the result shows that TriScore correlates well with DpScore. There are some exceptions, for example OR7E154P and OR5M3, which have low DpScores because of insertion, deletion, or permutation of some peptide fragments. Their TriScores, however, are not so low, and the common triplets in the permuted segments can still be detected.

Table 1. Result of similarity search.

Target     Residues  DpScore  TriScore
OR1A1      309       1677     294
OR1A2      309       1422     184
OR1J2      313       846      87
OR1J4      313       839      84
OR1E1      314       802      83
…
OR7E154P   315       86       45
OR5M3      247       67       45
OR12D2     307       630      44
OR4K14     310       623      44
OR6C4      309       618      44

3.2 Detection of combination of common triplets

To detect the common triplets in a particular group, frequent triplets are identified for each OR family. Here, the user can specify a threshold value for the appearance frequency, e.g. 80% for the OR1 family, which contains 31 proteins. As a result, we found twenty frequent triplets such as V-A-I, L-R-N, and A-I-C. Subsequently, we also tried to detect the common triplets between several OR families. Some triplets are conserved in their position in the amino acid sequences, and also lie at similar positions in the transmembrane domains [4] (Figure 2). These results show the potential applicability of the present approach to the structural feature analysis of ORs. A set of feature triplets is also used for analysis of the odor-structure relationship.

[Figure 2: two panels showing the frequent positions of P-M-Y and of F-S-T.]

Fig.2 Result of frequent position analysis.
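
The frequent-triplet detection of Section 3.2 can be sketched as follows. This is a minimal illustration that assumes the appearance frequency of a triplet is the fraction of proteins in a family that contain it at least once. The function names and the toy sequences are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def triplet_set(sequence):
    """Set of distinct triplets (three consecutive residues) in a sequence."""
    return {sequence[i:i + 3] for i in range(len(sequence) - 2)}


def frequent_triplets(sequences, min_fraction=0.8):
    """Triplets present in at least `min_fraction` of the given sequences."""
    presence = Counter()
    for seq in sequences:
        presence.update(triplet_set(seq))  # each triplet counts once per sequence
    total = len(sequences)
    return sorted(t for t, n in presence.items() if n / total >= min_fraction)


# Hypothetical toy "family" of five short sequences.
toy_family = ["AVAIC", "GVAIC", "VAICL", "LRNVA", "KVAIW"]
print(frequent_triplets(toy_family, 0.8))  # prints "['VAI']"
print(frequent_triplets(toy_family, 0.6))  # prints "['AIC', 'VAI']"
```

Here V-A-I occurs in four of the five toy sequences (80%) and A-I-C in three (60%), so lowering the threshold from 0.8 to 0.6 adds A-I-C to the result.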

References

[1] Axel R. and Buck L. B., http://nobelprize.org/nobel_prizes/medicine/laureates/2004/index.html (2004).
[2] Zhang Y., Devries M. E., and Skolnick J., PLoS Comput Biol., 2, 88-99 (2006).
[3] Leach A. R. and Gillet V. J., An Introduction to Chemoinformatics, Kluwer Academic Publishers, Dordrecht (2003).
[4] Hirokawa T., Boon-Chieng S., and Mitaku S., Bioinformatics, 14, 378-9 (1998).