
Designing exons for human olfactory receptor gene subfamilies using a mathematical paradigm

SK. SARIF HASSAN1,3, PABITRA PAL CHOUDHURY1, AMITA PAL2, R L BRAHMACHARY3 and ARUNAVA GOSWAMI3,*

1Applied Statistics Unit, 2Bayesian Interdisciplinary Research Unit (BIRU), and 3Biological Sciences Division, Indian Statistical Institute, 203 B T Road, Calcutta 700 108, India

*Corresponding author (Email, [email protected], [email protected])

Ligands for only two human olfactory receptors are known. One of them, OR1D2, binds to Bourgeonal, a volatile chemical constituent of the fragrance of the mythical flower, Lily of the valley or Our Lady’s tears, Convallaria majalis (also the national flower of Finland). OR1D2, OR1D4 and OR1D5 are three full-length olfactory receptors present in an olfactory locus in the human genome. These receptors are more than 80% identical in DNA sequences and have 108 mismatches among them. Computer pattern recognition tools reveal no striking pattern in these mismatch positions. In an attempt to find a mathematical rule in those mismatches, we find that an L-system generated sequence can be inserted into the OR1D2 subfamily-specific star model and novel full-length olfactory receptors can be generated. This mathematical principle could be utilized for making new subfamily olfactory receptor members from any olfactory receptor subfamily. The aroma and electronic nose industry might utilize this rule in future.

[Hassan Sk S, Choudhury P P, Pal A, Brahmachary R L and Goswami A 2010 Designing exons for human olfactory receptor gene subfamilies using a mathematical paradigm; J. Biosci. 35 389–393] DOI 10.1007/s12038-010-0044-0

Keywords. ClustalW; human olfactory receptor; L-system; olfaction; star model

http://www.ias.ac.in/jbiosci J. Biosci. 35(3), September 2010, 389–393, © Indian Academy of Sciences

1. Introduction

The loci of olfactory receptors (ORs) in the human genome occur in clusters ranging from ~51 to 105 and are unevenly spread over 21 chromosomes (Malnic et al. 2004; Young et al. 2008). A conservative estimate suggests that 339 full-length OR genes and 297 OR pseudogenes are present in these clusters (Malnic et al. 2004). Theoretically, there are two possible ways of OR-odorant molecular binding, viz. (i) each OR binds to a large number of different odorants and (ii) each OR binds to a small number of odorants. In either case, odorant detection at the OR level follows a combinatorial rule, though the stringency of the rule would differ in the two cases. Experimentally, it has been demonstrated that each OR recognizes a large number of odorants and perhaps a large class of various concentrations of the odorants tested (Malnic et al. 1999). The OR gene (conceptually translated to protein sequences) family (>40% amino acid identity) can be divided into subfamilies (>60% identity) and subfamily members might have more than 90% identity (Glusman et al. 2001). Subfamily members are highly similar in DNA and protein sequences, but they are capable of recognizing different odorant molecules.

We hypothesized that there might be a nature-inspired mathematical rule that determines the sequences of subfamily members and could be extended to subfamily and family. If such a rule exists, it would be of great interest for basic research; furthermore, one could construct ORs useful for applied research (viz. for studies in connection with an electronic nose). Three full-length model subfamily OR members were chosen from the HORDE database (http://genome.weizmann.ac.il/horde/): OR1D2 (gene length: 936 bp), OR1D4 (gene length: 936 bp) and OR1D5 (gene length: 936 bp). OR1D2 (chromosomal position: 17p13.3; synonym: hOR17-4) recognizes the odorant Bourgeonal, which is perceived as Lily of the valley fragrance (Malnic et al. 2004). Incidentally, OR1D2, OR1D4 and OR1D5 show very little or no polymorphism in the sequences deposited in the published sequence databases by different research groups (data not shown). It is possible that these groups might have used the same samples or the same source while cloning and sequencing. OR1D2, OR1D4 and OR1D5 were aligned using ClustalW and were found to contain 108 base pair mismatches out of 936 base pairs available (data not shown).

If we consider OR1D2, OR1D4 and OR1D5 each as a string of A/T/G/C, then out of 936 positions, the 828 positions outside the 108 mismatches were found to be chosen by nature as fixed or evolutionarily conserved positions. As OR1D2, OR1D4 and OR1D5 are highly related sequences, a canonical sequence for this subfamily, termed the ‘star model’ of the OR sequence, was made by using a computer program, where 108 gaps were introduced in the respective positions (figure 1).

A context-free L-system (Prusinkiewicz and Lindenmayer 1990) was used to generate a 243 bp long DNA sequence.

L-System:
Set of variables: A, T, C and G
Axiom: C (C is the starting symbol)
Production rule: A→CTG, C→CCA, T→TGC and G→GAC

Following the production rule above, the first and second iterations would give CCA (3 bp) and CCACCACTG (9 bp), respectively. Four iterations yield an 81 base pair sequence. This is insufficient to fill the 108 mismatch positions. Five such iterations generate the following 243 bp sequence:

CCACCACTGCCACCACTGCCATGCGACCCACC
ACTGCCACCACTGCCATGCGACCCACCACTGTGC
GACCCAGACCTGCCACCACCACTGCCACCACTGC
CATGCGACCCACCACTGCCACCACTGCCATGCGA
CCCACCACTGTGCGACCCAGACCTGCCACCACC
ACTGCCACCACTGCCATGCGACTGCGACCCAGA
CCTGCCACCACCACTGGACCTGCCACCATGCGACC
CACCACTG (i)

Using a C computer program, nucleotides present in sequence (i) were sequentially introduced from the 5′-end of the sequence into the star model gaps shown in figure 1. Briefly,

Step 1: First, in all the gaps (with 1 bp, 2 bp, 3 bp and 4 bp) in the star model, only one nucleotide would be inserted.
Step 2: 1 bp gaps in the star model would become 0 gaps. Then the remaining gaps (1 bp, 2 bp and 3 bp) would be filled up and the process would be repeated until all the gaps are filled.

The resultant OR sequence is shown in (ii) below.

ATGGATGGAGCCAACCAGAGTGAGTCCTCACAGT
TCCTTCTCCTGGGGATGTCAGAGAGTCCTGAGCAG
CAGCAGATCCTGTTTTGGATGTTCCTGTCCATGTAC
CTGGTCACGGTGCTGGGAAATGTGCTCATCATCCT
GGCCATCAGCTCTGATTCCCCCCTGCACACCCCCG
TGTACTTCTTCCTGGCCAACCTCTCCTTCACTGACC
TCTTCTTTGTCACCAACACAATCCCCAAGATGCTG
GTGAACCTCCAGTCCCAGAACAAAGCCATCTCCTA
TGCAGGGTGTCTGACACAGCTCTACTTCCTGGTCT
CCTTGGTGACCCTGGACAACCTCATCCTGGCCGTG
ATGGCCTATGATCGCTATGTGGCCAGCTGCTGCCCC
CTCCACTACGCCACAGCCATGAGCCCTGCGCTCTG
TCTCTTCCTCCTGTCCTTGTGTTGGGCGCTGTCAGT
CCTCTATGGCCTCCTGCCCACCGTCCTCATGACCAG
C CGTGACCTTCTGTGGGCCTCGAGACATCCACTACG
TCTTCTGTGACATGTACCTGGTGCTGCGGTTGGCAT
GTTCCAACAGCCACATGAATCACACAGCGCTGATT
GCCACGGGCTGCTTCATCTTCCTCACTCCCTTGGGA
TTCCTGACCAGGTCCTATGTCCCCATTGTCAGACCC
ATCCTGGGAATACCCTCCGCCTCTAAGAAATACAA
AGCCTTCTCCACCTGTGCCTCCCATTTGGGTGGAG
TCTCCCTCTTATATGGGACCCTTCCTATGGTTTACCT
GGAGCCCCTCCATACCTACTCCCTGAAGGACTCAG
TAGCCACAGTGATGTATGCTGTGGTGACACCCATGA
TGAACCCGTTCATCTACAGCCTGAGGAACAAGGAC
ATGCATGGGGCTCAGGGAAGACTCCTACGCAGACC
CTTTGAGAGGCAAACA (ii)

Sequence (ii) was used as a Blast query with the DNA–DNA and translated protein–protein (Blastx) search engines in the HORDE and NCBI databases, from which the initial OR1D2, OR1D4 and OR1D5 sequences were obtained. Results of the Blast searches show that with the search parameters available in the HORDE website (which could not be changed by a remote user), sequence (ii) showed 92%, 92% and 91% identity with OR1D2, OR1D4 and OR1D5, respectively. Significantly, these insertions do not produce any stop codon in the exon sequence. It is interesting to note the following rules that might govern this biological process.

(i) If one utilizes a production rule which starts with C→CCC, then a viable OR could be produced. It is tempting to check whether the long poly C-containing region of the telomere serves as a template for insertion as in the case of DNA replication.
(ii) It seems that each OR subfamily utilizes a specific star model. We have tested the OR10J, OR10K and OR3A loci (data not shown). Rules that govern the formation of the star model for each subfamily member are in the process of analysis.
(iii) The senses of smell and taste are primordial in nature. Our current hypothesis is based on the idea that the star model or conserved region of the ORs was produced following an as yet unidentified mathematical rule quite early in evolution. Then mathematical rules such as the L-system and its


Figure 1. (a) ClustalW alignment of three full-length OR sequences (OR1D4, OR1D5 and OR1D2) of the OR1D subfamily locus as found from the HORDE database. Asterisks and gaps numbered with numerals below the sequence demonstrate the conserved and variable base pairs, respectively. (b) Star model of the OR1D subfamily of OR gene sequences generated based on data from figure 1a.
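The construction behind figure 1b can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C program: it assumes the aligned sequences have equal length with no insertions or deletions (consistent with the three 936 bp genes), writes a variable position as `_`, and uses a short made-up alignment rather than the real OR1D sequences. The names `star_model` and `gap_runs` are hypothetical.

```python
from itertools import groupby

GAP = "_"  # assumed placeholder for a variable (mismatch) position


def star_model(aligned):
    """Keep a base where all aligned sequences agree; write GAP elsewhere."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return "".join(col[0] if len(set(col)) == 1 else GAP for col in zip(*aligned))


def gap_runs(model):
    """Return (start, length) for every maximal run of GAP in the model."""
    runs, pos = [], 0
    for ch, grp in groupby(model):
        n = len(list(grp))
        if ch == GAP:
            runs.append((pos, n))
        pos += n
    return runs


# Hypothetical toy alignment, NOT the real OR1D2/OR1D4/OR1D5 sequences.
toy = ["ATGGATCCA", "ATGCATCTA", "ATGGTTCAA"]
model = star_model(toy)
print(model)            # ATG__TC_A
print(gap_runs(model))  # [(3, 2), (7, 1)]
```

Applied to the real alignment, the same procedure would be expected to give 828 conserved positions and 108 gap positions, grouped into runs of 1 to 4 bp.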

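The L-system expansion and the round-by-round filling of the star model gaps can be sketched as follows. The production rules and the axiom are taken from the text; everything else is an assumption made for illustration. In particular, `fill_gaps` is one possible reading of Steps 1 and 2 (each round places one nucleotide in the leftmost free position of every gap run that is not yet full, scanning from the 5′-end), and it does not enforce the constraints listed in the Appendix. Because every rule replaces one symbol by three, n iterations give 3^n symbols, which is why four iterations (81) fall short of 108 gaps and five (243) suffice.

```python
# Production rules quoted from the text; function names are hypothetical.
RULES = {"A": "CTG", "C": "CCA", "T": "TGC", "G": "GAC"}


def lsystem(axiom, rules, iterations):
    """Rewrite every symbol in parallel, `iterations` times."""
    seq = axiom
    for _ in range(iterations):
        seq = "".join(rules[ch] for ch in seq)
    return seq


def fill_gaps(model, filler, gap="_"):
    """Fill gap runs round by round: one base per run per round, 5' to 3'."""
    seq = list(model)
    runs, current = [], []
    for i, ch in enumerate(seq):
        if ch == gap:
            current.append(i)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    if sum(len(r) for r in runs) > len(filler):
        raise ValueError("filler sequence is shorter than the number of gaps")
    bases = iter(filler)
    depth = 0
    while any(len(r) > depth for r in runs):
        for r in runs:
            if len(r) > depth:  # this run still has an empty position
                seq[r[depth]] = next(bases)
        depth += 1
    return "".join(seq)


print(lsystem("C", RULES, 2))       # CCACCACTG
print(len(lsystem("C", RULES, 5)))  # 243
# Toy star model (hypothetical), filled from the first-iteration string CCA.
print(fill_gaps("ATG__TC_A", lsystem("C", RULES, 1)))  # ATGCATCCA
```

In the toy call, the first round puts C into each of the two gap runs and the second round puts A into the remaining position of the 2 bp run.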

variants were used to make the variable regions which contribute to the odorant ligand-binding domains of the ORs. This process of insertion might have happened at the DNA polymerization level.

We have already mentioned earlier in the text that there are 2–5 highly related yet diverse OR subfamily sequences clustered in the human genome. The reason and significance of this special genomic architectural plan has to be searched for in an evolutionary framework at the theoretical level. The results obtained following the aforesaid production rule tempt us to test the hypothesis of whether nature follows this procedure or not. A comparative study of the usage of L-systems in the olfactory subgenomes of lower vertebrates such as mouse with that of human might offer clues in this direction.

In summary, in this paper, we report a relatively simple model of a context-free L-system for making a variable region of the OR and this could be adopted for making artificial ORs. Many more advanced context-free L-systems could be designed once it is experimentally established that this is the kind of rule the OR utilizes for generating subfamily members, at least, if not subfamily and family members, more divergent ORs in the genome. Here, we observe that the computer-generated star model sequence, sequentially filled with A, T, G, C in the way described above from a sequence generated by an L-system, could generate a sequence that is highly similar to those of OR1D2, OR1D4 and OR1D5. Therefore, most likely, this work is purely mathematical in nature at this stage and a large body of experimental evidence is necessary.

Acknowledgements

This work was supported by the Department of Biotechnology (DBT), New Delhi, grants (BT/PR9050/NNT/28/21/2007 and BT/PR8931/NNT/28/07/2007 to AG) and NAIP-ICAR-World Bank grant (Comp-4/C3004/2008-09; Project leader: AG) and ISI plan projects for 2001–2011. The authors are grateful to their visiting students Rajneesh Singh, Snigdha Das and Somnath Mukherjee for their technical help in making advanced C programs and other computer applications on Windows support used for this study.

Appendix

While writing the computer program, the L-system satisfied the following rules.

(A) To fill the single gap of the star model: the system will check the two previous states (the two nucleotides preceding the gap) as well as the two past states (the two nucleotides following the gap).
(a) If the second previous is ‘T’ and the first previous is ‘A’, and the first past is ‘A’ and the second past is either ‘A’ or ‘G’, then the chosen L-system must produce ‘C’ at the gap, e.g. …TA_AA(/G)…
(b) If the second previous is ‘T’ and the first previous is ‘A’, and the first past is ‘G’ and the second past is ‘A’, then the chosen L-system must produce ‘C’ at the gap, e.g. …TA_GA…
(c) If the second previous is ‘T’ and the first previous is ‘A’, and the first past is either ‘T’ or ‘C’, then the chosen L-system must produce ‘C’ or ‘T’ at the gap, e.g. …TA_T(/C)…
(d) If the second previous is ‘T’ and the first previous is ‘G’, and the first past is ‘A’ and the second past is either ‘A’ or ‘G’, then the chosen L-system must produce ‘C’ or ‘G’ at the gap, e.g. …TG_AA(/G)…
(e) If the second previous is ‘T’ and the first previous is ‘G’, and the first past is ‘G’ and the second past is ‘A’, then the chosen L-system must produce ‘C’ or ‘G’ at the gap, e.g. …TG_GA…
(f) If the second previous is ‘T’ and the first previous is ‘G’, and the first past is either ‘T’ or ‘C’, then the chosen L-system must produce ‘C’ or ‘G’ or ‘T’ at the gap, e.g. …TG_T(/C)…
(g) If the first previous is ‘T’, and the first past is ‘A’ and the second past is either ‘C’ or ‘T’, then the chosen L-system must produce ‘T’ or ‘C’ at the gap, e.g. …T_AC(/T)…
(h) If the first previous is ‘T’, and the first past is ‘A’ and the second past is either ‘A’ or ‘G’, then the chosen L-system must produce ‘C’ at the gap, e.g. …T_AA(/G)…
(i) If the first previous is ‘T’, and the first past is ‘G’ and the second past is ‘A’, then the chosen L-system must produce ‘G’ or ‘C’ at the gap, e.g. …T_GA…
(j) If the first previous is ‘T’, and the first past is ‘G’ and the second past is ‘C’ or ‘T’ or ‘G’, then the chosen L-system must produce ‘T’ or ‘C’ or ‘G’ at the gap, e.g. …T_GC(/T/G)…
(k) If the first previous is ‘C’, and the first past is ‘A’ and the second past is either ‘A’ or ‘G’, then the chosen L-system must produce ‘C’ or ‘G’ or ‘A’ at the gap, e.g. …C_AA(/G)…
(l) If the first previous is ‘C’, and the first past is ‘G’ and the second past is ‘A’, then the chosen L-system must produce ‘C’ or ‘G’ or ‘A’ at the gap, e.g. …C_GA…
(m) Else the gap can be filled by any state such as ‘A’ or ‘C’ or ‘T’ or ‘G’.

(B) To fill the double or more than double gap of the star model (one gap fill at a time): the system should check only two previous states of the gap.
(a) If the second previous is ‘T’ and the first previous is ‘A’, then the chosen L-system must produce ‘C’ or ‘T’ at the gap, e.g. …TA_ _…
(b) If the second previous is ‘T’ and the first previous is ‘G’, then the chosen L-system must produce ‘C’ or ‘T’ or ‘G’ at the gap, e.g. …TG_ _…
(c) Else the gap can be filled by any state such as ‘A’ or ‘C’ or ‘T’ or ‘G’.
This rule would be applicable until the number of gaps becomes one. When the number of gaps becomes one, then rule (A) is applicable.

References

Glusman G, Yanai I, Rubin I and Lancet D 2001 The complete human olfactory subgenome; Genome Res. 11 685–702


Malnic B, Godfrey P-A and Buck L-B 2004 The human olfactory receptor gene family; Proc. Natl. Acad. Sci. USA 101 2584–2589; Erratum in: Proc. Natl. Acad. Sci. USA 2004 101 7205
Malnic B, Hirono J, Sato T and Buck L-B 1999 Combinatorial receptor codes for odors; Cell 96 713–723
Prusinkiewicz P and Lindenmayer A 1990 in The algorithmic beauty of plants (New York: Springer-Verlag)
Young J-M, Endicott R-M, Parghi S-S, Walker M, Kidd J-M and Trask B-J 2008 Extensive copy-number variation of the human olfactory receptor gene family; Am. J. Hum. Genet. 83 228–242
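As a cross-check of the gap-filling rules in the Appendix, the sketch below tests one possible reading of them: a nucleotide is allowed at a gap only if it completes none of the triplets TAA, TAG or TGA in any of the three windows that cover the gap. This interpretation is ours and is not stated by the authors; it is suggested by the remark that the insertions produce no stop codon. The function `allowed`, the table `APPENDIX_A` and the convention of passing `None` for a neighbour that a rule leaves unspecified are all hypothetical. The two rules with first previous ‘C’ are lettered k and l here.

```python
STOPS = {"TAA", "TAG", "TGA"}


def allowed(prev2, prev1, next1, next2):
    """Bases that complete no stop triplet in any window covering the gap.

    Pass None for a neighbour the rule does not specify; windows that
    contain None are skipped (an assumption of this sketch).
    """
    ok = set()
    for x in "ACGT":
        windows = [(prev2, prev1, x), (prev1, x, next1), (x, next1, next2)]
        if not any(None not in w and "".join(w) in STOPS for w in windows):
            ok.add(x)
    return ok


# rule: (second previous, first previous, first past options,
#        second past options or None, bases permitted by the Appendix)
APPENDIX_A = {
    "a": ("T", "A", "A", "AG", "C"),
    "b": ("T", "A", "G", "A", "C"),
    "c": ("T", "A", "TC", None, "CT"),
    "d": ("T", "G", "A", "AG", "CG"),
    "e": ("T", "G", "G", "A", "CG"),
    "f": ("T", "G", "TC", None, "CGT"),
    "g": (None, "T", "A", "CT", "TC"),
    "h": (None, "T", "A", "AG", "C"),
    "i": (None, "T", "G", "A", "GC"),
    "j": (None, "T", "G", "CTG", "TCG"),
    "k": (None, "C", "A", "AG", "CGA"),
    "l": (None, "C", "G", "A", "CGA"),
}


def check(rule):
    """True if stop-triplet avoidance reproduces the rule for every option."""
    p2, p1, n1_opts, n2_opts, expected = rule
    return all(
        allowed(p2, p1, n1, n2) == set(expected)
        for n1 in n1_opts
        for n2 in (n2_opts or [None])
    )


for name, rule in APPENDIX_A.items():
    print(name, check(rule))
```

Under these assumptions every listed single-gap rule is reproduced, and the two multi-gap rules of part (B) follow from `allowed("T", "A", None, None)` and `allowed("T", "G", None, None)`. The sketch says nothing about contexts that the Appendix sends to the "any state" fallback.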

MS received 19 January 2010; accepted 3 May 2010 ePublication: 18 June 2010

Corresponding editor: SHAHID KHAN
