Proteins at the plasma membrane: What domains are on the outside and what domains are on the inside?

Kanthida Kusonmano, Kwanjeera Wanichthanarak, Natapol Pornputtapong

Chalmers University of Technology, Sweden

The cover image is taken from http://4e.plantphys.net/image.php?id=80.

Introduction

Membrane proteins are involved in a wide range of important biological processes, such as signaling, transport of membrane-impermeable molecules, cell adhesion and cell–cell communication, many of which are relevant to disease mechanisms and drug target discovery. An understanding of their structure and function is therefore of great importance for biological and pharmacological research. Because membrane proteins are experimentally difficult to work with (for example, they are hard to crystallize), they are rarely found in structural databases. Sequence-based analysis is therefore an important approach for investigating such proteins [1].

Transmembrane proteins are a class of integral membrane proteins that penetrate into or through the lipid bilayer of a cell membrane, such as the plasma membrane. Three regions can be defined: the region outside the cell (extracellular), the region inside the cell (intracellular) and the region within the bilayer (Figure 1).

Figure 1 Representation of a transmembrane (integral membrane) protein (figure taken from [2])

Predicting transmembrane helices from sequence is a key challenge in bioinformatics. In this study we used TMHMM, a hidden Markov model method for predicting transmembrane helices in protein sequences [3], to predict the location and in/out orientation of transmembrane helices in human proteins. We then investigated which protein domains occur in the intracellular and extracellular regions using HMMER, a tool for searching for protein homologs and for making sequence alignments [4] that is based on profile hidden Markov models (profile HMMs).

Material and methods

To examine human transmembrane proteins, we first retrieved all human proteins from UniProt/SwissProt [5]. Because a human cell contains both the plasma membrane and internal membranes (e.g. the ER membrane), we selected only transmembrane proteins of the plasma membrane by requiring two Gene Ontology (GO) terms: ‘plasma membrane’ and ‘integral to membrane’. With this set of transmembrane proteins, we performed two tasks in parallel:

1. Predict transmembrane helices

TMHMM was used to predict the location and in/out orientation of transmembrane helices [3]. Using a hidden Markov model, the tool finds the most probable topology of a protein and thereby assigns each residue to one of three states: on the cytoplasmic side (intracellular), on the non-cytoplasmic side (extracellular), or in a transmembrane helix (within the membrane).

2. Identify domains of transmembrane proteins

HMMER was used to identify the protein domains of the transmembrane proteins [4]. The tool is based on profile HMMs and searches for functional domains in given protein sequences. Pfam-A profiles, which are derived from the high-quality, manually curated protein families in the Pfam database [6], were used as query profiles. The search was performed with hmmsearch, a program in the HMMER package, with the cut-off set at 10E-5.
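The domain search step can be illustrated with a minimal Ruby sketch. It assumes that hmmsearch was run with HMMER3's per-domain table output (`--domtblout`), which the report does not state, and that the cut-off applies to the independent E-value column. The function names, the sample rows and the value 1e-5 (one possible reading of "10E-5") are hypothetical.

```ruby
# Sketch: read per-domain hits from HMMER3 --domtblout text (assumed input format).

# Returns hits as hashes; max_evalue is applied to the independent E-value column.
def parse_domtblout(text, max_evalue)
  hits = []
  text.each_line do |line|
    next if line.start_with?('#') || line.strip.empty?
    fields = line.split(' ', 23) # 22 fixed columns plus free-text description
    next if fields.size < 22
    hit = { protein: fields[0], domain: fields[3], i_evalue: fields[12].to_f,
            from: fields[17].to_i, to: fields[18].to_i }
    hits << hit if hit[:i_evalue] <= max_evalue
  end
  hits
end

# Made-up example rows (not real search results).
def sample_domtblout
  <<~ROWS
    # target name  accession  tlen  query name  accession  qlen  (remaining columns omitted)
    P_DEMO1 - 500 7tm_1 PF00001.14 257 1.2e-50 170.1 0.1 1 1 3.0e-54 1.5e-50 169.8 0.1 1 257 60 310 58 311 0.98 demo protein one
    P_DEMO2 - 300 Ank PF00023.23 33 0.05 12.0 0.0 1 1 0.01 0.02 11.5 0.0 1 33 10 42 10 42 0.80 demo protein two
  ROWS
end

# 1e-5 is an assumed reading of the "10E-5" cut-off quoted in the text.
parse_domtblout(sample_domtblout, 1e-5).each do |hit|
  puts "#{hit[:protein]}\t#{hit[:domain]}\t#{hit[:from]}-#{hit[:to]}"
end
```

With hmmsearch the profile is the query and the protein is the target, so the first column holds the protein identifier and the fourth the Pfam domain name; the alignment start and end columns give the domain position on the protein.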

Information from TMHMM and HMMER was combined to assign domains to the intracellular or extracellular regions of the transmembrane proteins. The analytical steps are illustrated in Figure 2.

Figure 2 Analytical pipeline for human transmembrane proteins

Results and discussion

Transmembrane prediction

As mentioned earlier, the structure of a transmembrane protein can be divided into three parts: intracellular, extracellular and transmembrane helix. All transmembrane proteins share the same core structure, the transmembrane helix, which mostly serves membrane integration and transport. A total of 1076 transmembrane proteins were collected from the UniProt/SwissProt database based on the Gene Ontology terms “plasma membrane” and “integral to membrane”. A Ruby script was used to convert the data from the UniProt/EBI format into FASTA format, as shown in Appendix A.

Figure 3 Number of transmembrane helices in human transmembrane proteins

TMHMM was used to predict the location and in/out orientation of transmembrane helices for these proteins. The number of predicted transmembrane helices ranges from 0 to 14, as shown in Figure 3. For 43 proteins, TMHMM did not identify any transmembrane helix, probably because of limitations of the TMHMM algorithm. Tracing these proteins back to UniProt showed that most of their transmembrane segments are annotated as signal-anchor helices, which may have different properties from typical transmembrane helices. These proteins were therefore excluded from the domain prediction step. Among the remaining proteins, the most common number of transmembrane helices is 7. This group corresponds to the G protein-coupled receptors (GPCRs), the largest group of transmembrane proteins. Even though these proteins share a core functional unit, the seven transmembrane helices, many of them contain different functional domains in their extracellular and intracellular regions [7].
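The helix counts and topologies discussed above could be extracted along the following lines. This is a sketch only: it assumes TMHMM's one-line-per-protein "short" output, in which tab-separated `key=value` fields include `len`, `PredHel` and a `Topology` string such as `i7-29o44-66i`. The report does not say which output format was used, and the function names and sample lines are hypothetical.

```ruby
# Sketch: parse TMHMM "short" output lines (assumed format) and tally helix counts.

# Splits one line "ID<TAB>len=..<TAB>..<TAB>PredHel=..<TAB>Topology=.." into a hash.
def parse_tmhmm_short(line)
  id, *pairs = line.strip.split("\t")
  info = pairs.map { |pair| pair.split('=', 2) }.to_h
  { id: id, length: info['len'].to_i,
    helices: info['PredHel'].to_i, topology: info['Topology'] }
end

# Expands a topology string such as "i7-29o44-66i" into [kind, from, to] regions,
# where kind is :inside, :outside or :tm and positions are 1-based and inclusive.
def topology_regions(topology, length)
  regions = []
  next_pos = 1
  side = nil
  topology.scan(/([io])|(\d+)-(\d+)/) do |letter, from, to|
    if letter
      side = letter == 'i' ? :inside : :outside
    else
      from = from.to_i
      to = to.to_i
      regions << [side, next_pos, from - 1] if from > next_pos
      regions << [:tm, from, to]
      next_pos = to + 1
    end
  end
  regions << [side, next_pos, length] if next_pos <= length
  regions
end

# Counts how many proteins have 0, 1, 2, ... predicted helices.
def helix_histogram(records)
  counts = Hash.new(0)
  records.each { |record| counts[record[:helices]] += 1 }
  counts
end

# Made-up example lines (not real TMHMM results).
def sample_tmhmm_lines
  ["P_DEMO1\tlen=120\tExpAA=44.0\tFirst60=22.0\tPredHel=2\tTopology=i7-29o44-66i",
   "P_DEMO2\tlen=80\tExpAA=0.5\tFirst60=0.5\tPredHel=0\tTopology=o"]
end

records = sample_tmhmm_lines.map { |line| parse_tmhmm_short(line) }
p helix_histogram(records)
records.each { |r| p [r[:id], topology_regions(r[:topology], r[:length])] }
```

A protein with no predicted helix yields a single region covering the whole sequence, which is how the proteins excluded from the domain prediction step would show up in such a tally.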

Domain prediction

Besides the transmembrane helices, a transmembrane protein has extracellular and intracellular regions, which are not inserted into the membrane. The diverse functions of membrane proteins depend on the functional domains present in these two regions. In total, 237 domains were assigned to 2316 positions in the proteins. HMMER provides the positions of the predicted functional domains, while TMHMM gives the in/out orientation of the transmembrane helices. Combining the results of HMMER and TMHMM, we categorized the predicted domains into three groups: domains on transmembrane helices, domains in intracellular regions and domains in extracellular regions.
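One way to combine the two predictions is a coordinate overlap test, sketched below in Ruby. The exact assignment rule is not given in the report; this sketch assumes that a domain overlapping any predicted helix counts as a transmembrane domain, and that otherwise the side of the loop containing it decides. All names and sample values are hypothetical.

```ruby
# Sketch: assign a predicted domain to a topological group by coordinate overlap.
# Assumed rule (not stated in the text): any overlap with a helix means "transmembrane".

# regions: array of [kind, from, to], kind is :inside, :outside or :tm (1-based, inclusive).
def classify_domain(regions, dom_from, dom_to)
  kinds = regions.select { |_kind, from, to| from <= dom_to && to >= dom_from }
                 .map(&:first).uniq
  return :transmembrane if kinds.include?(:tm)
  return :intracellular if kinds == [:inside]
  return :extracellular if kinds == [:outside]
  :unassigned
end

# Made-up topology of a two-helix protein of length 120 (illustration only).
def sample_regions
  [[:inside, 1, 6], [:tm, 7, 29], [:outside, 30, 43], [:tm, 44, 66], [:inside, 67, 120]]
end

# Made-up domain hits: [domain name, start, end].
def sample_domains
  [['DemoLoop', 31, 42], ['DemoTail', 70, 110], ['DemoCore', 5, 60]]
end

sample_domains.each do |name, from, to|
  puts "#{name}\t#{classify_domain(sample_regions, from, to)}"
end
```

A stricter rule, such as requiring a minimum fraction of the domain to overlap a helix, would change how borderline domains are grouped, so the choice of rule is worth stating alongside the results.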

1. Domains on transmembrane helix

The domains in this group span most of the transmembrane region; in other words, they cover several topological regions or almost the whole transmembrane protein. The properties of the domains found in this region are very similar: they reflect membrane integration, which merely confirms the TMHMM prediction. There are 97 domains in this group (listed in Appendix B), and most of them belong to the 7-transmembrane GPCR family. Nearly half of the proteins used in this prediction carry a GPCR family domain in their sequence, which agrees with the TMHMM results.

2. Domains in intracellular region

Domains in this group were predicted by TMHMM to lie in intracellular regions. An intracellular domain is a part of a transmembrane protein that is in contact with the cytoplasm. Only 16 domains were assigned to this group, which is very small compared with the other groups. These domains are expected to connect to intracellular components involved in, for example, cell structure, metabolism and signal transduction. In this study, we found that many of them function as binding regions. The functions of the intracellular domains are less diverse than those of the extracellular domains. The most common domain is Cadherin_C, the cadherin cytoplasmic region, which is commonly found in cadherin proteins. Cadherins are cell adhesion molecules that are needed during tissue differentiation. The cadherin cytoplasmic region forms the cytoplasmic tail, which is linked to the cytoskeleton via catenins. The other domains are listed in Appendix C.

3. Domains in extracellular region

This is the largest group of domains classified in this study: 110 domains were found in the extracellular regions of the transmembrane proteins. According to their descriptions in the Pfam database, most of them are not specific to extracellular regions; in other words, from these descriptions we cannot be certain whether a domain is specific to the inside or the outside of the cell. Only 14 domains are highly specific to extracellular regions, as shown in Appendix D, for example Cadherin, ig, SEA and Sushi. In addition, one domain that Pfam describes as intracellular, Calx-beta, was predicted here to be extracellular. The Calx-beta domain is a tandem repeat in the cytoplasmic domains of Calx Na-Ca exchangers and is also present in the cytoplasmic tail of mammalian integrin-beta4. This motif is used for calcium binding and regulation [8]. Three proteins were predicted to contain a Calx-beta domain. To check the prediction, we plotted the domain architecture of two of these proteins, as shown in Figures 4 and 5. The predicted Calx-beta domain usually occurs together with other extracellular domains such as VWC, EPTP and GPS. This conflict may stem from an error in the TMHMM prediction, or the domain may have another function in the extracellular region.

Some domains were detected in both intracellular and extracellular regions based on the TMHMM topology. This might be due either to wrong predictions of helix positions and orientations by TMHMM or to false positives from HMMER. The table in Appendix E lists the domains that occur in both groups. Some of them, such as I-set and V-set, occur far more often in extracellular than in intracellular regions.
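The comparison of domains seen on both sides can be expressed as a small tally, sketched here in Ruby. The input is assumed to be a list of (domain, side) assignments produced by a classification step; the function names and the sample assignments are hypothetical and do not reproduce the counts of this study.

```ruby
# Sketch: find domains assigned to both sides of the membrane and count each side.

# assignments: array of [domain name, side], side is :intracellular or :extracellular.
def side_counts(assignments)
  counts = Hash.new { |hash, domain| hash[domain] = Hash.new(0) }
  assignments.each { |domain, side| counts[domain][side] += 1 }
  counts
end

# Keeps only the domains seen at least once on each side.
def on_both_sides(counts)
  counts.select { |_domain, c| c[:intracellular] > 0 && c[:extracellular] > 0 }
end

# Made-up assignments (illustration only, not the study's counts).
def sample_assignments
  [['I-set', :extracellular], ['I-set', :extracellular], ['I-set', :intracellular],
   ['Cadherin', :extracellular], ['Cadherin_C', :intracellular]]
end

on_both_sides(side_counts(sample_assignments)).each do |domain, c|
  puts "#{domain}\tin=#{c[:intracellular]}\tout=#{c[:extracellular]}"
end
```

Reporting the per-side counts next to each domain makes it easy to see which domains are strongly skewed toward one side and are therefore more likely to reflect isolated prediction errors.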

Figure 4 Positions of transmembrane helices and functional domains in protein Q8WXG9

Figure 5 Positions of transmembrane helices and functional domains in protein Q86XX4

References

1. Nugent T and Jones DT, 2009, Transmembrane protein topology prediction using support vector machines, BMC Bioinformatics, 10:159.

2. Hurwitz N, Pellegrini-Calace M and Jones DT, 2006, Towards genome-scale structure prediction for transmembrane proteins, Philos Trans R Soc Lond B Biol Sci., 361(1467):465-75.

3. Krogh A, Larsson B, von Heijne G and Sonnhammer EL, 2001, Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes, J Mol Biol., 305(3):567-80.

4. HMMER: biosequence analysis using profile hidden Markov models. Available at: http://hmmer.org/.

5. The UniProt Consortium, 2010, The Universal Protein Resource (UniProt) in 2010, Nucleic Acids Res., 38(Database issue):D142-8.

6. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer ELL, Eddy SR and Bateman A, 2009, The Pfam protein families database, Nucleic Acids Res., 38:D211-D222.

7. Bjarnadóttir TK, Gloriam DE, Hellstrand SH, Kristiansson H, Fredriksson R and Schiöth HB, 2006, Comprehensive repertoire and phylogenetic analysis of the G protein-coupled receptors in human and mouse, Genomics, 88:263-273.

8. Schwarz E and Benzer S, 1999, The recently reported NIbeta domain is already known as the Calx-beta motif, Trends Biochem Sci., 24:260.

Appendix A parse_uniprot.rb

#!/usr/bin/env ruby

# Use the BioRuby library
require 'bio'

# Read the UniProt/EBI flat file one entry at a time (entries end with "//")
File.open('uniprot_sprot_human.dat').each("//\n") do |entry|
  obj = Bio::UniProt.new(entry)

  # Keep entries annotated with both GO cellular component terms
  if obj.dr['GO'].inspect =~ /C:plasma membrane/ &&
     obj.dr['GO'].inspect =~ /C:integral to membrane/

    # Collect the annotated transmembrane segments (not printed below)
    tmpos = []
    if obj.ft.has_key?('TRANSMEM')
      obj.ft['TRANSMEM'].each do |item|
        tmpos.push([item['From'], item['To'], "tm"])
      end
    end

    # Print the protein sequence in FASTA format to STDOUT
    puts ">#{obj.accession}|#{obj.entry_name}"
    puts obj.seq
  end
end

Appendix B

List of predicted domains in transmembrane helices

Domain accession  Domain name
7tm_1             7 transmembrane receptor (rhodopsin family)
7tm_2             7 transmembrane receptor (Secretin family)
7tm_3             7 transmembrane sweet-taste receptor of 3 GCPR
7TM_GPCR_Srsx     Serpentine type 7TM GPCR chemoreceptor Srsx
7TM_GPCR_Srv      Serpentine type 7TM GPCR chemoreceptor Srv
7TM_GPCR_Srw      Serpentine type 7TM GPCR chemoreceptor Srw
7TM_GPCR_Srx      Serpentine type 7TM GPCR chemoreceptor Srx
AA_permease       Amino acid permease
Aa_trans          Transmembrane amino acid transporter protein
ABC_membrane      ABC transporter transmembrane region
ABC2_membrane     ABC-2 type transporter
Alpha-2-MRAP_N    Alpha-2-macroglobulin RAP, N-terminal domain
Aph-1             Aph-1 protein
ApoL              Apolipoprotein L
ASC               Amiloride-sensitive sodium channel
Asp-B-Hydro_N     Aspartyl beta-hydroxylase N-terminal region
ATG22             Vacuole effluxer Atg22 like
ATP1G1_PLM_MAT8   ATP1G1/PLM/MAT8 family
Cation_ATPase_C   Cation transporting ATPase, C-terminus
Cation_ATPase_N   Cation transporter/ATPase, N-terminus
Cation_efflux     Cation efflux family
CD225             Interferon-induced transmembrane protein
CDP-OH_P_transf   CDP-alcohol phosphatidyltransferase
Choline_transpo   Plasma-membrane choline transporter
CitMHS            Citrate transporter
CSF-1             Macrophage colony stimulating factor-1 (CSF-1)
DC_STAMP          DC-STAMP-like protein
Dicty_CAR         Slime mold cyclic AMP receptor
DIE2_ALG10        DIE2/ALG10 family
DUF21             Domain of unknown function DUF21
DUF2781           Protein of unknown function (DUF2781)
DUF300            Domain of unknown function
DUF3481           Domain of unknown function (DUF3481)
DUF3522           Protein of unknown function (DUF3522)
DUF803            Protein of unknown function (DUF803)
E1-E2_ATPase      E1-E2 ATPase
EamA              EamA-like transporter family
EMP24_GP25L       emp24/gp25L/p24 family/GOLD
Endomucin         Endomucin
FA_hydroxylase    Fatty acid hydroxylase superfamily
Ferric_reduct     Ferric reductase like transmembrane component
Frag1             Frag1/DRAM/Sfk1 family
Frizzled          Frizzled/Smoothened family membrane region
GAPT              GRB2-binding adapter (GAPT)
Gate              Nucleoside recognition
Glycophorin_A     Glycophorin A
GP41              Retroviral envelope protein
HlyIII            Haemolysin-III related
Ion_trans         Ion transport protein
Ion_trans_2       Ion channel
L6_membrane       L6 membrane protein
Lamp              Lysosome-associated membrane glycoprotein (Lamp)
Lectin_N          Hepatic lectin, N-terminal domain
LMBR1             LMBR1-like membrane protein
LSR               Lipolysis stimulated receptor (LSR)
Ly49              Ly49-like protein, N-terminal region
MAPEG             MAPEG family
MARVEL            Membrane-associating domain
MatE              MatE
MFS_1             Major Facilitator Superfamily
MgtE              Divalent cation transporter
Na_Ca_ex          Sodium/calcium exchanger protein
Na_H_Exchanger    Sodium/hydrogen exchanger family
Na_K-ATPase       Sodium / potassium ATPase beta chain
Na_sulph_symp     Sodium:sulfate symporter transmembrane region
Neur_chan_memb    Neurotransmitter-gated ion-channel transmembrane region
Neuregulin        Neuregulin family
NKAIN             Na,K-Atpase Interacting protein
Nucleos_tra2_C    Na+ dependent nucleoside transporter C-terminus
Nucleos_tra2_N    Na+ dependent nucleoside transporter N-terminus
OATP              Organic Anion Transporter Polypeptide (OATP) family
PAP2              PAP2 superfamily
Patatin           Patatin-like phospholipase
PKD_channel       Polycystin cation channel
Protocadherin     Protocadherin
RELT              Tumour necrosis factor receptor superfamily member 19
SBF               Sodium Bile acid symporter family
SCF               Stem cell factor
Serinc            Serine incorporator (Serinc)
SMC_N             RecF/RecN/SMC N terminal domain
SSF               Sodium:solute symporter family
Sugar_tr          Sugar (and other) transporter
Sulfatase         Sulfatase
Sulfate_transp    Sulfate transporter family
T4_deiodinase     Iodothyronine deiodinase
TAS2R             Mammalian taste receptor protein (TAS2R)
Tetraspannin      Tetraspanin family
TLV_coat          ENV polyprotein (coat polyprotein)
TRAM_LAG1_CLN8    TLC domain
UNC-93            Ion channel regulatory protein UNC-93
UPF0220           Uncharacterised (UPF0220)
V_ATPase_I        V-type ATPase 116kDa subunit family
V1R               Vomeronasal organ pheromone receptor family, V1R
XK-related        XK-related protein
Y_phosphatase     Protein-tyrosine phosphatase
Yip1              Yip1 domain
Zip               ZIP Zinc transporter

Appendix C

List of predicted domains in intracellular regions

Domain accession  Domain name
Acyltransferase   Acyltransferase
Ank               Ankyrin repeat
Ant_C             Anthrax receptor C-terminus region
APP_amyloid       beta-amyloid precursor protein C-terminus
BCMA-Tall_bind    BCMA, TALL-1 binding
Cadherin_C        Cadherin cytoplasmic region
DUF1053           Domain of Unknown Function (DUF1053)
efhand            EF hand
F420_oxidored     NADP oxidoreductase coenzyme F420-dependent
Latrophilin       Latrophilin Cytoplasmic C-terminal region
Plexin_cytopl     Plexin cytoplasmic RasGAP domain
SEFIR             SEFIR domain
SPRY              SPRY domain
Syntaxin          Syntaxin
TGF_beta_GS       Transforming growth factor beta type I GS-motif
TIR               TIR domain

Appendix D

List of predicted domains in extracellular regions

Domain accession  Domain name
A4_EXTRA          Amyloid A4 extracellular domain
Activin_recp      Activin types I and II receptor domain
ADAM_CR           ADAM cysteine-rich
adh_short         short chain dehydrogenase
Alpha-2-MRAP_C    Alpha-2-macroglobulin RAP, C-terminal domain
AMP-binding       AMP-binding enzyme
ANF_receptor      Receptor family ligand binding region
Anth_Ig           Anthrax receptor extracellular domain
Antistasin        Antistasin family
BAMBI             BMP and activin membrane-bound inhibitor (BAMBI) N-terminal domain
Band_7            SPFH domain / Band 7 family
BK_channel_a      Calcium-activated BK potassium channel alpha subunit
BRICHOS           BRICHOS domain
C1-set            Immunoglobulin C1-set domain
C2                C2 domain
C2-set_2          CD80-like C2-set immunoglobulin domain
C8                C8 domain
Cadherin          Cadherin domain
Cadherin_2        Cadherin-like
Calx-beta         Calx-beta domain
Ceramidase_alk    Neutral/alkaline non-lysosomal ceramidase
CH                Calponin homology (CH) domain
Collagen          Collagen triple helix repeat (20 copies)
Cu_amine_oxid     Copper amine oxidase, enzyme domain
Cu_amine_oxidN2   Copper amine oxidase, N2 domain
Cu_amine_oxidN3   Copper amine oxidase, N3 domain
CUB               CUB domain
CXCR4_N           CXCR4 Chemokine receptor N terminal
Death             Death domain
Disintegrin       Disintegrin
DUF1986           Domain of unknown function (DUF1986)
DUF3497           Domain of unknown function (DUF3497)
EGF               EGF-like domain
EGF_CA            Calcium-binding EGF domain
EMI               EMI domain
EpoR_lig-bind     Erythropoietin receptor, ligand binding
EPTP              EPTP domain
Evr1_Alr          Erv1 / Alr family
Exo_endo_phos     Endonuclease/Exonuclease/phosphatase family
F5_F8_type_C      F5/8 type C domain
Flt3_lig          flt3 ligand
fn3               Fibronectin type III domain
G_glu_transpept   Gamma-glutamyltranspeptidase
Gal_Lectin        Galactose binding lectin domain
GDA1_CD39         GDA1/CD39 (nucleoside phosphatase) family
GDPD              Glycerophosphoryl diester phosphodiesterase family
Glyco_hydro_1     Glycosyl hydrolase family 1
GnHR_trans        Gonadotropin hormone receptor transmembrane region
GPS               Latrophilin/CL-1-like GPS domain
HRM               Hormone receptor domain
ICAM_N            Intercellular adhesion molecule (ICAM), N-terminal domain
ig                Immunoglobulin domain
IL6Ra-bind        Interleukin-6 receptor alpha chain, binding
IL8               Small cytokines (intecrine/chemokine), interleukin-8 like
Ins145_P3_rec     Inositol 1,4,5-trisphosphate/ryanodine receptor
Kazal_1           Kazal-type serine inhibitor domain
Kazal_2           Kazal-type serine protease inhibitor domain
KR                KR domain
Kunitz_BPTI       Kunitz/Bovine pancreatic trypsin inhibitor domain
Laminin_EGF       EGF-like (Domains III and V)
Laminin_G_1       Laminin G domain
Laminin_G_2       Laminin G domain
Ldl_recept_a      Low-density lipoprotein receptor domain class A
Ldl_recept_b      Low-density lipoprotein receptor repeat class B
Lep_receptor_Ig   Ig-like C2-type domain
Lipase_3          Lipase (class 3)
LRRNT             Leucine rich repeat N-terminal domain
MAM               MAM domain
Methyltransf_3    O-methyltransferase
MIR               MIR domain
MORN              MORN repeat
Motile_Sperm      MSP (Major sperm protein) domain
NCD3G             Nine Cysteines Domain of family 3 GPCR
Nitroreductase    Nitroreductase family
OLF               Olfactomedin-like domain
P_proprotein      Proprotein convertase P-domain
Pentaxin          Pentaxin family
Pep_M12B_propep   Reprolysin family propeptide
Peptidase_M13     Peptidase family M13
Peptidase_M13_N   Peptidase family M13
Peptidase_M2      Angiotensin-converting enzyme
Peptidase_S8      Subtilase family
Phosphodiest      Type I phosphodiesterase / nucleotide pyrophosphatase
PKD               PKD domain
PSI               Plexin repeat
REJ               REJ domain
Reprolysin        Reprolysin (M12B) family zinc metalloprotease
Rib_hydrolayse    ADP-ribosyl cyclase
RIH_assoc         RyR and IP3R Homology associated
RYDR_ITPR         RIH domain
SEA               SEA domain
Sema              Sema domain
SRCR              Scavenger receptor cysteine-rich domain
Sushi             Sushi domain (SCR repeat)
Syntaxin-6_N      Syntaxin 6, N-terminal
Thioredoxin       Thioredoxin
TIG               IPT/TIG domain
TIL               Trypsin Inhibitor like cysteine rich domain
Tissue_fac        Tissue factor
TNF               TNF(Tumour Necrosis Factor) family
Trefoil           Trefoil (P-type) domain
TRP_2             Transient receptor ion channel II
Trypsin           Trypsin
TSP_1             Thrombospondin type 1 domain
VWA               von Willebrand factor type A domain
VWC               von Willebrand factor type C domain
VWD               von Willebrand factor type D domain
WAP               WAP-type (Whey Acidic Protein) 'four-disulfide core'
Zona_pellucida    Zona pellucida-like domain
ZU5               ZU5 domain

Appendix E

List of predicted domains in both intracellular and extracellular regions

Domain accession  Domain name
ABC_tran          ABC transporter
ATP_Ca_trans_C    Plasma membrane calcium transporter ATPase C terminal
Fz                Fz domain
Guanylate_cyc     Adenylate and Guanylate cyclase catalytic domain
Hydrolase         haloacid dehalogenase-like hydrolase
I-set             Immunoglobulin I-set domain
Lectin_C          Lectin C-type domain
Neur_chan_LBD     Neurotransmitter-gated ion-channel ligand binding domain
Pkinase           Protein kinase domain
Pkinase_Tyr       Protein tyrosine kinase
SNARE             SNARE domain
STAS              STAS domain
TNFR_c6           TNFR/NGFR cysteine-rich region
V-set             Immunoglobulin V-set domain
