Department of Physics, Chemistry and Biology

Master’s Thesis

Automated annotation of protein families

Eric Elfving

LiTH-IFM-EX--11/2551--SE

Department of Physics, Chemistry and Biology Linköpings universitet SE-581 83 Linköping, Sweden

Master’s Thesis LiTH-IFM-EX--11/2551--SE

Automated annotation of protein families

Eric Elfving

Supervisor: Joel Hedlund, IFM, Linköpings universitet
Examiner: Bengt Persson, IFM, Linköpings universitet

Linköping, 17 June, 2011

Division, Department: Bioinformatics, Department of Physics and Measurement Technology, Linköpings universitet, SE-581 83 Linköping, Sweden
Date: 2011-06-17
Language: English
Report category: Examensarbete
ISRN: LiTH-IFM-EX--11/2551--SE
URL for electronic version: http://www.ifm.liu.se/bioinfo/ http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-69393
Title (Swedish): Automatiserad annotering av proteinfamiljer
Title: Automated annotation of protein families
Author: Eric Elfving


Keywords: Data integration

Abstract

Introduction: The great challenge in bioinformatics is data integration. The amount of available data is always increasing and there are no common unified standards of where, or how, the data should be stored. The aim of this work is to build an automated tool to annotate the different member families within the protein superfamily of medium-chain dehydrogenases/reductases (MDR), by finding common properties among the member proteins. The goal is to increase the understanding of the MDR superfamily as well as the different member families. This will add to the amount of knowledge gained for free when a new, unannotated, protein is matched as a member to a specific MDR member family.

Method: The different types of data available all needed different handling. Textual data was mainly compared as strings, while numeric data needed some special handling such as statistical calculations. Ontological data was handled as tree nodes where ancestry between terms had to be considered. This was implemented as a plugin-based system to make the tool easy to extend with additional data sources of different types.

Results: The biggest challenge was data incompleteness, yielding little (or no) results for some families and thus decreasing the statistical significance of the results. Results show that all the human and mouse MDR members have a Pfam ADH domain (ADH_N and/or ADH_zinc_N) and take part in an oxidation-reduction process, often with NAD or NADP as cofactor. Many of the proteins contain zinc and are expressed in liver tissue.

Conclusions: A Python-based tool for automatic annotation has been created to annotate the different MDR member families. The tool is easily extendable to be used with new databases, and many of the results agree with information found in the literature. The utility and necessity of this system, as well as the quality of its produced results, are expected to only increase over time, even if no additional extensions are produced, as the system itself is able to make further and more detailed inferences as more and more data become available.

Sammanfattning (Summary)

Introduction: The great challenge in bioinformatics is data integration. The amount of available data is constantly increasing and there are no common standards for how, or where, data should be stored. The aim of this work is to create an automated tool for annotating protein families within the MDR superfamily by finding common properties among the member proteins. The purpose is to increase the understanding of both the superfamily as a whole and the individual protein families, and thereby to increase the amount of knowledge gained for free when newly found proteins match one of the member families.

Method: The different databases contain several types of data that required different handling. Textual data was handled through ordinary string processing, while numeric data required more advanced treatment such as statistical calculations. Ontology data was handled as tree nodes to take the built-in relationships between terms into account. The tool was implemented as a plugin-based system to simplify extension and the inclusion of more data sources.

Results: The biggest challenge was dealing with missing data, above all within less studied families, which gave lower statistical confidence in the results. It turned out that MDR proteins in human and mouse have at least one of the Pfam domains ADH_N and ADH_zinc_N. They also take part in oxidation-reduction processes and often use NAD or NADP as cofactor. Many of the proteins bind zinc and are expressed in liver.

Conclusions: A Python-based tool for automatic annotation of protein families belonging to the MDR superfamily has been created. The tool is easy to extend for use with new data sources, and its results agree well with the literature. The usefulness of, and need for, the system, as well as the quality of its results, will continuously increase over time, even if no new extensions are created, since the system can make more, and more detailed, inferences as new data becomes available.

Acknowledgments

Firstly, I would like to thank my supervisor Joel Hedlund for all the good comments and constant encouragement, and my examiner Bengt Persson. I would also like to thank Torbjörn Jonsson, my boss and mentor at IDA, who gave me a lot of administrative work instead of the usual scheduled education so that I had time to focus on this thesis. Thanks also to Christopher and Erik at home for good comments, discussions and support, to my family for all their support and encouragement, and finally to all my friends at [hg] (and especially Petter) for all the good times and things to do in my spare time.

Contents

1 Introduction
  1.1 Bioinformatics
  1.2 The MDR families
    1.2.1 ADH - Alcohol Dehydrogenase
    1.2.2 VAT1 - Vesicle Amine Transport Protein 1
    1.2.3 FAS - Fatty Acid Synthase
    1.2.4 MECR - Mitochondrial trans-2-enoyl-CoA Reductase
    1.2.5 PDH - Pyruvate Dehydrogenase
    1.2.6 PTGR - Prostaglandin Reductase
    1.2.7 vertQOR - Vertebrate Quinone Oxidoreductase
    1.2.8 ZADH2 - Zinc-binding Alcohol Dehydrogenase domain containing protein 2

2 Methods
  2.1 Implementation
  2.2 Databases and services
    2.2.1 UniProt
    2.2.2 PICR
    2.2.3 BioGPS
    2.2.4 Human Protein Atlas
    2.2.5 ArrayExpress
    2.2.6 KEGG
    2.2.7 Reactome
    2.2.8 Gene Ontology
    2.2.9 InterPro
    2.2.10 PFAM
    2.2.11 PROSITE
  2.3 Dendrogram generation

3 Results and discussion
  3.1 ADH
  3.2 VAT1
  3.3 vertQOR
  3.4 PTGR
  3.5 FAS
  3.6 MECR
  3.7 PDH
  3.8 ZADH2

4 Conclusions

5 Future Work

A Tables

B List of Supplementary data

C Manual for using the Protein Family Annotation Software (PFAS)

1 Introduction

The great challenge in bioinformatics is data integration. Each group focuses on “their” topic and publishes their own data. There are no common unified standards of where, or how, the data should be stored but rather multiple, independently evolving, de facto standards. The amount of available data, as well as the number of databases, is growing steadily.

Hedlund et al. [1] have assembled a database of proteins in the medium-chain dehydrogenases/reductases (MDR) protein superfamily by creating hidden Markov models (HMMs) describing the member families. The aim of this project is to build an automated tool to annotate the different member families in this database by finding common properties of member proteins. The data will be gathered from several publicly available databases. The MDRs are widespread and exist in all kingdoms of life; however, the main focus of this project will be on human and mouse proteins, since these two species are the richest in reliable annotation sources.

The goal is to find common properties of each member family to increase the understanding of both the member families and the entire MDR superfamily. This knowledge will then be helpful when a new, unannotated, protein is matched as a member to a specific MDR member family. The tool will be created in a general way, making it easy to extend to include other databases and to annotate other protein families. The utility and necessity of this system, as well as the projected quality of its produced results, are expected to only ever increase over time, even if no additional extensions are produced, as the system itself is able to make further and more detailed inferences as more and more data become available.

1.1 Bioinformatics

Around the year 2000 the Human Genome Project (HUGO) was started. The goal of HUGO was to sequence the human genome and this generated a lot of data. Handling this vast amount of data required advanced computational algorithms and thus the field of bioinformatics was born. The main goal of bioinformatics is to increase the knowledge of all biological processes by collecting, organising, interpreting and visualising the available data. Easy access to good, annotated data makes it easier for researchers to find new fields of study. At present, there are a lot of data on proteins but there is not as much data on families and superfamilies, and we have found no other software that annotates protein families.

1.2 The MDR families

Hedlund et al. [1] estimate the number of MDR families to be around 500. In their work, 86 of those have been classified. In this thesis the focus is on the eight families that have at least two protein members expressed in human or mouse tissue. These families are described below (adapted from [1,2]).

1.2.1 ADH - Alcohol Dehydrogenase

ADH exists in five different forms (class I-V) in vertebrates. Class I is typically found in liver tissue and has high activity against several alcohols, including ethanol. Class II has been found in human liver. It is much less studied but metabolises peroxidic aldehyde and norepinephrine [3]. Class III metabolises formaldehyde and fatty acids. Class IV acts in stomach and other mucosa-producing tissues as a first step in metabolism of gastric alcohols. As an interesting side note, ADH class IV has an increased expression in esophageal cancer [4], possibly due to the carcinogenicity of acetaldehyde, the first metabolite of ethanol. Class V ADH has been found in human foetal liver.

1.2.2 VAT1 - Vesicle Amine Transport Protein 1

VAT1 shows expression during the development of the nervous system but also in hypothalamus and cerebellum. Members show ATPase activity and bind to ATP and calcium. The family is thought to be involved in vesicle transport of neurotransmitters.

1.2.3 FAS - Fatty Acid Synthase

The main function of the proteins is to synthesise fatty acids by elongation of acetyl-CoA with malonyl-CoA. The reaction is NADPH dependent. In mammals, FAS is a homo-dimer, each consisting of five domains. In bacteria and plants, on the other hand, the reaction is carried out by several, monofunctional, enzymes in a system [5].

1.2.4 MECR - Mitochondrial trans-2-enoyl-CoA Reductase

MECR seems to be a regulating transcription factor for mitochondrial respiratory proteins. Members are mainly expressed in muscle tissue in human, but they are present in plants, insects, worms, fish and mammals. Members have been shown to act in fatty acid synthesis in yeast mitochondria. Members catalyse the reduction of trans-2-enoyl-CoA to acyl-CoA and depend on NADPH.

1.2.5 PDH - Pyruvate Dehydrogenase

PDH transforms pyruvate into acetyl-CoA in a NAD-dependent way. It exists in mitochondria, providing a link between glycolysis and the tricarboxylic acid (TCA) cycle. The general structure is a heterotetramer of two alpha and two beta subunits.

1.2.6 PTGR - Prostaglandin Reductase

This family contains both the human prostaglandin reductase (PGR) 1 and 2, but most of the family consists of prokaryotic members. Members are commonly expressed in liver, kidney, intestine, spleen and stomach but also in bronchial epithelial cells and heart.

1.2.7 vertQOR - Vertebrate Quinone Oxidoreductase

This family consists of both zinc-containing members and members lacking zinc. Generally, the zinc-containing members have been studied to a greater extent, making the available data for the zinc-lacking members less detailed. The enzymatic function is described as quinone reduction with NADPH as cofactor, but a catalytic mechanism has yet to be proposed [6].

1.2.8 ZADH2 - Zinc-binding Alcohol Dehydrogenase domain containing protein 2

ZADH2 is a very small family with only four members in SwissProt, of which only two exist in human or mouse. This family has not yet been studied to any greater extent, and only very little information can currently be found.

2 Methods

The tool was implemented in Python using the algorithm described below.

The MDR families were read from a local version of the mdr-enzymes.org database. Each protein was then annotated with data from UniProt to get synonyms for the protein names in different databases and some general information such as Gene Ontology [7] terms and InterPro [8], Pfam [9] and PROSITE [10] sites found in the protein sequence. The UniProt database also gives primary and secondary accession numbers for each protein, which were used to remove any duplicates in the data. A duplicate entry was defined as having a primary accession number that is a secondary accession number for another protein. Manual sequence comparison of the matched proteins confirmed the validity of this method.

Duplicate removal was followed by synonym gathering. Many databases use the same name for a specific protein, but some use other names. Therefore mapping between different databases had to be done. Several plugins were developed to fetch synonyms for proteins.

When synonyms had been found, information from chosen databases was gathered for each protein. Common values from all proteins in each protein family were then collected. A value found in at least half of all proteins in a family was declared common for the entire family. A flowchart of the method can be seen in figure 2.1.
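The two rules stated above, duplicate removal by accession number and the 50% inclusion limit, can be illustrated with a minimal sketch. It is not the thesis source code: the record layout (dictionaries with `primary`, `secondary` and `pfam` keys), the function names and the accession strings are hypothetical and chosen only for illustration.

```python
from collections import Counter


def remove_duplicates(proteins):
    """Drop entries whose primary accession is a secondary accession of another entry."""
    secondary = set()
    for protein in proteins:
        secondary.update(protein["secondary"])
    return [p for p in proteins if p["primary"] not in secondary]


def common_values(proteins, key, threshold=0.5):
    """Return the values found in at least `threshold` of the proteins."""
    counts = Counter()
    for protein in proteins:
        # A set ensures each protein contributes at most once per value.
        counts.update(set(protein.get(key, ())))
    limit = threshold * len(proteins)
    return sorted(value for value, n in counts.items() if n >= limit)


# Hypothetical records; "B2" is a secondary accession of "A1" and is dropped.
family = [
    {"primary": "A1", "secondary": ["B2"], "pfam": ["ADH_N", "ADH_zinc_N"]},
    {"primary": "B2", "secondary": [], "pfam": ["ADH_N"]},
    {"primary": "C3", "secondary": [], "pfam": ["ADH_zinc_N"]},
    {"primary": "D4", "secondary": [], "pfam": ["ADH_zinc_N"]},
]
unique = remove_duplicates(family)
print(common_values(unique, "pfam"))
```

With three unique proteins the limit is 1.5 occurrences, so only a value seen in two or more proteins is reported for the family.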

2.1 Implementation

The final tool is plugin based with four types of plugins: source plugins, in which protein families are read from a source database; synonym plugins, which find synonyms for proteins; database plugins, which query a data source for information on proteins and find common values; and output plugins, which write the results to file. All plugins get automatic access to configuration files to ease the alteration of limits and paths to data files and results. There are both global settings and local settings for each plugin that can override the global configuration. For further information see the source code in appendix B and a short manual in appendix C.
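As a rough illustration of such a design, the sketch below shows four plugin base classes and a configuration in which local settings override global ones. All class, method and setting names are assumptions made for this example; the actual interfaces are those in the source code referred to in appendix B.

```python
class Plugin:
    """Common base: global settings, overridden by plugin-local settings."""

    def __init__(self, global_config, local_config=None):
        self.config = dict(global_config)
        self.config.update(local_config or {})


class SourcePlugin(Plugin):
    def read_families(self):
        """Return a dict mapping family name to a list of protein ids."""
        raise NotImplementedError


class SynonymPlugin(Plugin):
    def find_synonyms(self, protein_id):
        """Return alternative names for one protein."""
        raise NotImplementedError


class DatabasePlugin(Plugin):
    def fetch(self, protein_id):
        """Return the set of values the data source holds for one protein."""
        raise NotImplementedError

    def common_values(self, protein_ids):
        """Values present in at least config['threshold'] of the proteins."""
        limit = self.config["threshold"] * len(protein_ids)
        counts = {}
        for protein_id in protein_ids:
            for value in self.fetch(protein_id):
                counts[value] = counts.get(value, 0) + 1
        return sorted(v for v, n in counts.items() if n >= limit)


class OutputPlugin(Plugin):
    def write(self, results):
        """Write the collected results to file."""
        raise NotImplementedError


class InMemoryDatabase(DatabasePlugin):
    """Toy data source used only for this illustration."""

    DATA = {"P1": {"zinc"}, "P2": {"zinc", "NAD"}, "P3": {"NAD"}, "P4": {"zinc"}}

    def fetch(self, protein_id):
        return self.DATA.get(protein_id, set())


GLOBAL_CONFIG = {"threshold": 0.5, "outdir": "results"}
default = InMemoryDatabase(GLOBAL_CONFIG)
strict = InMemoryDatabase(GLOBAL_CONFIG, {"threshold": 0.75})  # local override
members = ["P1", "P2", "P3", "P4"]
print(default.common_values(members), strict.common_values(members))
```

Keeping the inclusion limit in the configuration rather than in the code means a single data source can be made stricter without affecting the others.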


1. Gather proteins from source databases
2. Extract information from UniProt
3. Remove duplicates
4. Find synonyms
5. Gather information from selected databases
6. Find common values
7. Save results to file

Figure 2.1: Summarised work flow of the tool.

2.2 Databases and services

The many different databases and services used during this project are described briefly in the following sections.

2.2.1 UniProt

The widely used Universal Protein Resource (UniProt) is a protein sequence and annotation data resource that consists of several databases. The UniProt Knowledgebase provides both manually curated information within the SwissProt subsection, and automatically added annotation within the much larger TrEMBL subsection. No distinction has been made between the two sources in this project.

2.2.2 PICR

The Protein Identifier Cross-Referencing service [11] (PICR) is a web application that maps protein names between different databases. It bases its results on the UniProt Archive (UniParc), which is a data warehousing service updated daily by UniProt to account for new releases of source databases. Many of the results found should also be found with the UniProt polling, but the UniParc database stores data from many other sources than TrEMBL and SwissProt.

2.2.3 BioGPS

BioGPS [12] is a gene annotation portal that lets the user customise the layout of the page. They also have a gene atlas based on human and mouse protein-encoding transcriptomes [13]. They have created a data set based on both a combined Human U133/GNF1H array and the mouse GNF1M and MOE430 arrays. The fact that the tissues studied in different experiments are different increases the number of tissues studied but makes it harder to find common expression profiles. All the calculations are species based to include as many tissues as possible.

Three steps had to be taken to gather data from this data set. Since they had several expression levels for each tissue and the difference between a positive and negative value was quite large, the data was normalised by dividing each data point by the harmonic mean value

\[ m = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \tag{2.1} \]

for that gene and then calculating mean values for each tissue.

The BioGPS dataset is provided in two parts: annotations and data. The data part is organised based on probe id, so in order to be applicable, all proteins needed to be annotated with probe id synonyms using the annotations part. Several different synonyms for the protein could be used to extract the probe id, but SwissProt accession numbers and Ensembl gene id were used primarily.

In order to emphasise large overexpressions, the tissues were sorted according to their t-statistic

\[ t = \frac{x - X}{s} \tag{2.2} \]

where x is an expression value for a specific tissue, X is the mean value for all tissues and s is the sample standard deviation for the data set. The highest scoring tissues are shown in tables A.4 and A.5.
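The normalisation and ranking in equations 2.1 and 2.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the input layout (a dictionary from tissue name to a list of raw expression values for one gene) and the function names are hypothetical, and all values are assumed to be non-zero so that the harmonic mean is defined.

```python
from statistics import mean, stdev


def harmonic_mean(values):
    """Harmonic mean m = n / sum(1 / x_i), as in equation 2.1."""
    return len(values) / sum(1.0 / x for x in values)


def tissue_t_statistics(samples):
    """Rank the tissues of one gene by the t-statistic of equation 2.2.

    `samples` maps a tissue name to the raw expression values measured
    for that tissue (hypothetical layout; at least two tissues needed).
    """
    all_values = [x for values in samples.values() for x in values]
    m = harmonic_mean(all_values)
    # Normalise every data point by m, then average per tissue.
    tissue_means = {
        tissue: mean([x / m for x in values])
        for tissue, values in samples.items()
    }
    overall = mean(list(tissue_means.values()))
    s = stdev(list(tissue_means.values()))  # sample standard deviation
    scores = {tissue: (x - overall) / s for tissue, x in tissue_means.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up expression values for one gene in three tissues.
ranked = tissue_t_statistics(
    {"liver": [8.0, 8.0], "heart": [2.0, 2.0], "brain": [2.0, 2.0]}
)
print(ranked[0][0])
```

Sorting by t rather than by raw mean puts the tissues that stand out most from the gene's overall expression level at the top of the list.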

2.2.4 Human Protein Atlas

The Human Protein Atlas [14,15] (HPA) provides immunohistochemically stained images of both diseased and normal human tissues, effectively providing specific maps of protein presence in the human body in various conditions. It has a coverage of about 25% of the human genome and since all the conclusions are drawn by trained pathologists, the database is quite reliable. Sadly, very few of the MDR proteins are available in the HPA database but it is an active project and hopefully data can be gathered from HPA in future releases of the database.

2.2.5 ArrayExpress

ArrayExpress [16] is a database of functional genomics experiments including the gene expression atlas with a subset of curated data. The ArrayExpress server was queried for every protein in each family and a positive tissue was defined as a tissue having more up-regulated than down-regulated experiments. Tissues that were positive in at least half of the proteins were deemed common for the family.
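The two definitions above can be expressed compactly. The sketch assumes that the query results have already been reduced to counts of up- and down-regulated experiments per tissue; the data layout and function names are hypothetical, and no ArrayExpress query is performed.

```python
from collections import Counter


def positive_tissues(experiments):
    """Tissues with more up-regulated than down-regulated experiments.

    `experiments` maps a tissue name to a (n_up, n_down) tuple.
    """
    return {tissue for tissue, (up, down) in experiments.items() if up > down}


def family_tissues(per_protein, threshold=0.5):
    """Tissues that are positive in at least `threshold` of the proteins."""
    counts = Counter()
    for experiments in per_protein.values():
        counts.update(positive_tissues(experiments))
    limit = threshold * len(per_protein)
    return sorted(tissue for tissue, n in counts.items() if n >= limit)


# Made-up counts for three proteins of one family.
per_protein = {
    "P1": {"liver": (5, 1), "heart": (1, 3)},
    "P2": {"liver": (2, 0), "heart": (4, 1)},
    "P3": {"liver": (0, 2)},
}
print(family_tissues(per_protein))
```

A tissue with equally many up- and down-regulated experiments is not counted as positive here, which follows the wording "more up-regulated than down-regulated".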

2.2.6 KEGG

KEGG, or Kyoto Encyclopedia of Genes and Genomes [17], is a database resource containing data from 16 different sources. In this thesis the focus was on the KEGG Pathway database, containing maps for metabolism and other cellular processes. Data was extracted from UniProt and then processed as in the method description.

2.2.7 Reactome

Reactome [18] is a manually curated database of human cell reactions and pathways with at least 5000 distinct proteins. Reactome data was collected from UniProt and processed according to the method description.

2.2.8 Gene Ontology

The aim of the Gene Ontology [7] (GO) is to standardise the representation of attributes of genes and gene products. It is used as a sort of thesaurus for biological terms where all terms have relationships with other terms (see example 2.1). The relationships can be shown in a tree format for easy overview.

Example 2.1: Gene Ontology terms

GO:0032496 (response to LPS) is a GO:0009617 (response to bacterium)

The GO terms were extracted from UniProt for each protein and then the number of positive terms was increased by using the relations between the terms. The number of occurrences of each term was counted, whereafter the number of occurrences of a more specific term was added to a more general term as shown in figure 2.2. The tree structure was generated using AmiGO [19].
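One possible reading of this cumulative counting is sketched below. It expands each protein's term set with all ancestors before counting, so a protein annotated with both a term and its ancestor is counted once for the ancestor, and it reports only terms that were annotated on at least one protein. Both choices are assumptions of this sketch rather than documented behaviour of the tool. The parent map is a hand-written fragment of the zinc ion binding branch shown in figure 2.2, and the protein names are hypothetical.

```python
from collections import Counter

# Child -> parents, a small fragment of the branch in figure 2.2.
PARENTS = {
    "GO:0008270": ["GO:0046914"],  # zinc ion binding
    "GO:0046914": ["GO:0046872"],  # transition metal ion binding
    "GO:0046872": ["GO:0043169"],  # metal ion binding
    "GO:0043169": ["GO:0043167"],  # cation binding
    "GO:0043167": ["GO:0005488"],  # ion binding
    "GO:0005488": ["GO:0003674"],  # binding
}


def with_ancestors(terms, parents):
    """Return the given terms together with all their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def cumulative_counts(annotations, parents):
    """Count proteins per term, crediting general terms for specific ones."""
    observed = set()
    for terms in annotations.values():
        observed.update(terms)
    counts = Counter()
    for terms in annotations.values():
        # Only terms actually annotated somewhere in the family are reported.
        counts.update(with_ancestors(terms, parents) & observed)
    return counts


annotations = {
    "P1": ["GO:0008270"],
    "P2": ["GO:0046872"],
    "P3": ["GO:0008270", "GO:0046872"],
}
counts = cumulative_counts(annotations, PARENTS)
print(counts["GO:0008270"], counts["GO:0046872"])
```

In this example metal ion binding is annotated directly on two proteins but is credited to all three, because zinc ion binding is one of its descendants.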

2.2.9 InterPro

InterPro [8] combines several member data sources to derive protein signatures. All member databases annotate protein sequences with different focus giving a result with good overview of the sequence. Integration of the source databases is done manually and the resulting entry is annotated with crosslinks to other databases.

GO:0003674 molecular_function
  GO:0005488 binding
    GO:0043167 ion binding
      GO:0043169 cation binding
        GO:0046872 metal ion binding
          GO:0046914 transition metal ion binding
            GO:0008270 zinc ion binding
  GO:0003824 catalytic activity
    GO:0016491 oxidoreductase activity
      GO:0016627 oxidoreductase activity, acting on the CH-CH group of donors
        GO:0016628 oxidoreductase activity, acting on the CH-CH group of donors, NAD or NADP as acceptor
          GO:0032440 2-alkenal reductase activity
          GO:0047522 15-oxoprostaglandin 13-oxidase activity
      GO:0016614 oxidoreductase activity, acting on CH-OH group of donors
        GO:0016616 oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
          GO:0004022 alcohol dehydrogenase (NAD) activity

Figure 2.2: Part of the PTGR Gene Ontology Tree. Terms in thin boxes were found among the proteins, but not with sufficient frequency for inclusion in the results. Terms in boldface boxes had enough occurrences to be included in the results, either in their own right or after cumulative addition of descendant terms. Terms in dashed boxes were not found in any of the proteins, but are included in this figure to show ancestry.

2.2.10 PFAM

PFAM [9] is a collection of protein domain models represented by HMMs. The database contains two kinds of entries, Pfam-A and Pfam-B. The Pfam-A entries are based on manually constructed multiple sequence alignments (MSAs). Pfam-B entries are generated automatically and have lower quality. Since the MDR family HMMs were bootstrapped from the Pfam HMMs ADH_N and ADH_zinc_N, one could expect to find at least one of these signatures in the majority of the results.

2.2.11 PROSITE

PROSITE annotates protein domains using profiles, which are similar to profile HMMs as used by Pfam-A, but it also annotates smaller sequence features, for example functional sites or posttranslationally modified sites, using a simple pattern matching technique similar to the regular expressions used in many programming languages.
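The similarity to regular expressions can be made concrete with a small converter from the basic PROSITE pattern syntax to a Python regular expression. This is an illustrative sketch and not part of the tool: the function name is hypothetical, only the common pattern elements are handled (residues, `x`, `[...]`, `{...}`, repeat counts and the `<`/`>` anchors outside brackets), and the example pattern and sequence are used purely to demonstrate the syntax.

```python
import re

# One pattern element: optional '<', a residue specification,
# an optional repeat count '(n)' or '(n,m)', and an optional '>'.
_ELEMENT = re.compile(
    r"^(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)$"
)


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.match(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        start, residue, low, high, end = match.groups()
        if residue == "x":
            residue = "."  # any amino acid
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"  # any residue except these
        repeat = ""
        if low is not None:
            repeat = "{%s,%s}" % (low, high) if high else "{%s}" % low
        parts.append(("^" if start else "") + residue + repeat + ("$" if end else ""))
    return "".join(parts)


# Illustrative pattern and made-up sequence.
regex = prosite_to_regex("G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC]")
print(regex)
print(bool(re.search(regex, "MKGHEAAGAAAAAGAAIK")))
```

Because the translation is purely syntactic, matching a converted pattern against a sequence is a single `re.search` call.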

2.3 Dendrogram generation

To generate the cladogram seen in Fig. 3.2, three programs were used: ClustalW, MrBayes, and Dendroscope. All protein sequences were downloaded in FASTA format from UniProt and aligned using ClustalW. Trees were then generated with MrBayes and displayed using Dendroscope.

3 Results and discussion

The results sorted by database can be seen in appendix A. In the following chapter the results for each family will be presented and analysed.

3.1 ADH

Both ArrayExpress and BioGPS show that ADH is expressed in liver, as expected according to section 1.2.1, but both sources also show some expression in heart and adipocytes. Only BioGPS shows expression in mucosa such as eye and epidermis, and then only for mouse proteins. This is probably due to the inclusion limit of 50%. As stated earlier, only ADH class IV is commonly expressed outside of the liver, so it is not unexpected to find a majority of the proteins only in liver tissue.

The GO terms show that most members have oxidoreductase activity (GO:0016491) according to Fig. 3.1 and also that they bind zinc. About 60% are part of an ethanol metabolic process. InterPro shows that the members have a NAD(P) binding domain and confirms the zinc binding shown with GO terms.

Only a fraction of the total number of members were found in Reactome, but all six that were found take part in biological oxidations, which confirms the GO results. KEGG only annotated 16 of the members; however, since these are distributed more or less evenly throughout the evolutionary tree of the family (Fig. 3.2), there is a high probability that the results presented here are indeed representative for the family as a whole, especially since the KEGG annotations are in complete consensus. Most of the pathways reported are included because of the alcohol → aldehyde conversion function of most members, but they also take part in cytochrome P450 pathways such as chloral hydrate → trichloroethanol conversion.

3.2 VAT1

VAT1 seems to be expressed mainly in tissues within the central nervous system such as adrenal gland, hypothalamus and the amygdala. Both BioGPS and ArrayExpress also show raised levels in adipose tissue. No satisfying reason for this has been found in the literature, but it is a very interesting avenue to explore in further laboratory experiments. The proteins have the ADH domain and bind zinc according to InterPro and PFAM. They also have a NAD binding domain and match the quinone oxidoreductase / ζ-crystallin PFAM pattern.

GO:0003674 molecular_function
  GO:0003824 catalytic activity
    GO:0016491 oxidoreductase activity
      GO:0016903 oxidoreductase activity, acting on the aldehyde or oxo group of donors
        GO:0016620 oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor
          GO:0018467 formaldehyde dehydrogenase activity
          GO:0019115 benzaldehyde dehydrogenase activity
        GO:0016623 oxidoreductase activity, acting on the aldehyde or oxo group of donors, oxygen as acceptor
          GO:0004031 aldehyde oxidase activity
      GO:0016651 oxidoreductase activity, acting on NADH or NADPH
        GO:0016655 oxidoreductase activity, acting on NADH or NADPH, quinone or similar compound as acceptor
          GO:0003960 NADPH:quinone reductase activity
      GO:0016614 oxidoreductase activity, acting on CH-OH group of donors
        GO:0016616 oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor
          GO:0051903 S-(hydroxymethyl)glutathione dehydrogenase activity
          GO:0004745 retinol dehydrogenase activity
          GO:0004022 alcohol dehydrogenase (NAD) activity
            GO:0004024 alcohol dehydrogenase activity, zinc-dependent
          GO:0004033 aldo-keto reductase (NADP) activity
            GO:0008106 alcohol dehydrogenase (NADP+) activity
              GO:0004032 alditol:NADP+ 1-oxidoreductase activity

Figure 3.1: Part of the ADH GO tree, using the same graphical representation as figure 2.2.

3.3 vertQOR

vertQOR is highly expressed in liver and kidney. This is expected since the main function is to deactivate the toxic quinones, and it seems right that the proteins will be found in the renal system. Members have the NAD(P) binding domain and match the ζ-crystallin pattern, a result that is quite expected since the ζ-crystallin pattern defines the NADP-dependent quinone oxidoreductases. The four members found were all zinc-binding.

3.4 PTGR

Both expression databases show that PTGRs are expressed in small intestine and stomach. Members bind zinc and NAD(P). Many of the proteins exist in cytoplasm according to gene ontology terms and have oxidoreductase activity, mainly 15-oxoprostaglandin 13-oxidase (PGR) activity but also 2-alkenal reductase and alcohol dehydrogenase activity as seen in Fig. 2.2.

Leaves of the cladogram (UniProt entry names): A8MVN9_HUMAN, Q3UKA4_MOUSE, B4DWS1_HUMAN, P00329_MOUSE, P07327_HUMAN, P40394_HUMAN, Q3UMM7_MOUSE, Q9D748_MOUSE, B4DVC3_HUMAN, Q548K2_MOUSE, Q64437_MOUSE, B4E2R9_HUMAN, Q496S1_MOUSE, P00325_HUMAN, Q9QYY9_MOUSE, Q3V0P5_MOUSE, B4E1R1_HUMAN, P08319_HUMAN, A8MYN5_HUMAN, Q6FI45_HUMAN, Q6IRT1_HUMAN, P00326_HUMAN, P11766_HUMAN, Q3UQ40_MOUSE, Q5U043_HUMAN, Q2VIM7_HUMAN, P28474_MOUSE, Q9D932_MOUSE, Q6P5I3_MOUSE, A1L3C0_MOUSE, B4DPD8_HUMAN, Q8IUN7_HUMAN, P28332_HUMAN.

Figure 3.2: Cladogram of the ADH family with members annotated in KEGG in bold face. This is the consensus tree from 10000 Monte Carlo reconstructions using the MrBayes program, wherein no other credible topologies could be detected.

3.5 FAS

FAS was found to be expressed in high levels in adipocyte tissue as expected but also in muscle (skeletal and cardiac). ArrayExpress indicated high expression in many tissues but the results were not confirmed with BioGPS.

In Fig. 3.3, one can see that many of the found GO terms are linked with fatty acid synthase activity, which would have been a very expected result. However, since no member was annotated with that GO term, it was not included in the result table. Potentially, this sort of non-detection could be remedied by stepwise generalisation up through the ontology tree until an ancestral term has been found that can describe sufficiently many of the member proteins. However, as this traversal could prove computationally costly, and could potentially also yield uselessly over-general terms, it has not been implemented in this version of the tool.

All members bind zinc and NAD(P) and most have an acyl carrier and transferase domain.
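The stepwise generalisation discussed above is not part of the tool, but a minimal sketch of how it might work is given here. Each protein's term set is extended one level of parents at a time until some term reaches the inclusion limit or a maximum number of steps is exceeded; the step limit is one way to guard against over-general terms. The function name, the `max_steps` parameter and the protein names are assumptions of this sketch, and the parent links are read from figure 3.3.

```python
from collections import Counter

# Child -> parents, read from figure 3.3 (treated as given for this sketch).
PARENTS = {
    "GO:0004314": ["GO:0004312"],  # [acyl-carrier-protein] S-malonyltransferase
    "GO:0004313": ["GO:0004312"],  # [acyl-carrier-protein] S-acetyltransferase
    "GO:0016297": ["GO:0004312"],  # acyl-[acyl-carrier-protein] hydrolase
    "GO:0016295": ["GO:0016297"],  # myristoyl-[acyl-carrier-protein] hydrolase
}


def generalise(annotations, parents, threshold=0.5, max_steps=3):
    """Climb the ontology until a term covers enough of the proteins.

    Returns (number of steps taken, sorted list of qualifying terms),
    or (None, []) if no term qualifies within `max_steps` steps.
    """
    limit = threshold * len(annotations)
    expanded = {protein: set(terms) for protein, terms in annotations.items()}
    for step in range(max_steps + 1):
        counts = Counter()
        for terms in expanded.values():
            counts.update(terms)
        hits = sorted(term for term, n in counts.items() if n >= limit)
        if hits:
            return step, hits
        # Add the direct parents of every term seen so far.
        for terms in expanded.values():
            terms.update({p for term in list(terms) for p in parents.get(term, ())})
    return None, []


annotations = {
    "P1": ["GO:0004314"],
    "P2": ["GO:0004313"],
    "P3": ["GO:0016295"],
}
print(generalise(annotations, PARENTS))
```

In this made-up family no term is shared by two proteins, but after one step GO:0004312 (fatty acid synthase activity) covers P1 and P2 and reaches the 50% limit.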

GO:0004312 fatty acid synthase activity
  GO:0004314 [acyl-carrier-protein] S-malonyltransferase activity
  GO:0004315 3-oxoacyl-[acyl-carrier-protein] synthase activity
  GO:0016297 acyl-[acyl-carrier-protein] hydrolase activity
    GO:0004315 palmitoyl-[acyl-carrier-protein] hydrolase activity
    GO:0004320 oleoyl-[acyl-carrier-protein] hydrolase activity
    GO:0016295 myristoyl-[acyl-carrier-protein] hydrolase activity
  GO:0004313 [acyl-carrier-protein] S-acetyltransferase activity
  GO:0004317 3-hydroxypalmitoyl-[acyl-carrier-protein] dehydrogenase activity

Figure 3.3: Reduced GO tree for FAS to highlight the terms associated with fatty acid synthase with the same graphical representation as figure 2.2.

3.6 MECR

No common expression profile could be found between the two data sources, but BioGPS reported a high value for heart tissue. It was quite unexpected that neither of the sources confirmed high expression in muscle tissue. According to GO, the proteins are located in mitochondria, have oxidoreductase activity and bind zinc. The zinc binding is confirmed by Pfam and InterPro.

3.7 PDH

The PDH family has members in many different species, but only two proteins were found, one in each species studied. Gene Ontology terms show that the members are located in the mitochondrial membrane, cilium and flagellum. Members are expressed in liver, kidney and testis. The proteins bind zinc and NAD(P).

3.8 ZADH2

Both Pfam and GO show that the members bind zinc. The data gathered from BioGPS and ArrayExpress give no clear indication of where the members are expressed. InterPro and Prosite both show that all members match the ζ-crystallin pattern.

4 Conclusions

The main goal of this project was to create a tool that annotates protein families by finding common properties among the member proteins. The focus was on the member families of the medium-chain dehydrogenase/reductase superfamily with human and/or mouse protein members. The tool was meant to be extendable and easy to use for other protein families.

During this project, a Python-based tool that annotates protein families with data from several data sources has been created. The tool is plugin-based, which makes it easy to expand and change according to each user's preference. Regardless of this, the utility and necessity of this system, as well as the quality of its results, are expected to increase over time even if no additional extensions are produced, as the system itself is able to make further and more detailed inferences as more data become available.

ADH members show expression in liver, heart and adipocytes. BioGPS also shows expression in mucosa such as eye and epidermis, but only for mouse proteins. Members bind zinc and most take part in ethanol metabolic processes. KEGG annotated 16 proteins distributed evenly throughout the evolutionary tree of the family. Most pathways were linked to the alcohol → aldehyde conversion seen in members, but some members also take part in cytochrome P450 pathways such as the chloral hydrate → trichloroethanol conversion.

VAT1 seems to be expressed mainly in tissues within the central nervous system such as adrenal gland, hypothalamus and the amygdala, but also in adipose tissue. No satisfactory explanation for this has been found in the literature, but it is a very interesting avenue to explore in further laboratory experiments. Members match the quinone oxidoreductase / ζ-crystallin Pfam pattern.

vertQOR is highly expressed in liver and kidney. Members also match the quinone oxidoreductase / ζ-crystallin Pfam pattern.

Both expression databases show that PTGRs are expressed in small intestine and stomach. Many of the proteins are located in the cytoplasm and have oxidoreductase activity, mainly 15-oxoprostaglandin 13-oxidase (PGR) activity but also 2-alkenal reductase and alcohol dehydrogenase activity.

FAS was found to be expressed at high levels in adipocyte tissue and in muscle tissue (skeletal and cardiac). Many of the GO terms found are linked to fatty acid synthase activity. All members bind zinc and NAD(P), and most have an acyl carrier and a transferase domain.

No common expression profile for MECR has been found between the two data sources, but BioGPS reported a high value for heart tissue. The proteins are located in mitochondria, have oxidoreductase activity and bind zinc.

PDH members have been shown to be located in the mitochondrial membrane, cilium and flagellum. Members are expressed in liver, kidney and testis. The proteins bind zinc and NAD(P).

The data sources have not given any clear indication of where ZADH2 members are expressed, but the members seem to bind zinc and match the ζ-crystallin pattern.

The results agree quite well with information gathered from the literature and show that all the human and mouse MDR members have a Pfam ADH domain (ADH_N and/or ADH_zinc_N). This is encouraging, since the hidden Markov models used to generate the MDR families were bootstrapped from these signatures. The proteins take part in oxidation-reduction processes, often with NAD or NADP as cofactor, and many of the proteins contain zinc and are expressed in liver tissue. Some also have the quinone oxidoreductase / ζ-crystallin domain.

5 Future Work

During this project I have identified several interesting avenues for further exploration that may lead to increased sensitivity and reliability for the system.

Firstly, running the tool on all proteins in each family, instead of only on families with proteins expressed in human and mouse tissue, would increase the statistical value of the results. A problem would of course be that not all proteins have been annotated, and some of the databases do not support several of the available organisms.

Secondly, the results would be more reliable if a distinction had been made between automatically annotated and expert-reviewed data. During development, I chose not to make this distinction because of the small sample size. If the full MDR database were used, I believe that this distinction would be crucial for obtaining reliable data.

A possible improvement of the Gene Ontology plugin would be to include new terms through stepwise generalisation up the ontology tree until an ancestral term is found that describes sufficiently many of the member proteins, as discussed in section 3.5, possibly with some heuristic to avoid generating uselessly general results.


Bibliography

[1] Joel Hedlund, Hans Jörnvall, and Bengt Persson. Subdivision of the MDR superfamily of medium-chain dehydrogenases/reductases through iterative hidden Markov model refinement. BMC Bioinformatics, 11:534, 2010.

[2] Bengt Persson, Joel Hedlund, and Hans Jörnvall. Medium- and short-chain dehydrogenase/reductase gene and protein families. Cellular and Molecular Life Sciences, 65:3879–3894, 2008.

[3] Roger S. Holmes. Alcohol dehydrogenases: a family of isozymes with differential functions. Alcohol and Alcoholism. Supplement, 2:127, 1994.

[4] Wojciech Jelski, Miroslaw Kozlowski, Jerzy Laudanski, Jacek Niklinski, and Maciej Szmitkowski. The activity of class I, II, III, and IV alcohol dehydrogenase (ADH) isoenzymes and aldehyde dehydrogenase (ALDH) in esophageal cancer. Digestive Diseases and Sciences, 54:725–730, 2009.

[5] Timm Maier, Marc Leibundgut, Daniel Boehringer, and Nenad Ban. Structure and function of eukaryotic fatty acid synthases. Quarterly Reviews of Biophysics, 43(03):373–422, 2010.

[6] Sergio Porté, Agrin Moeini, Irene Reche, Naeem Shafqat, Udo Oppermann, Jaume Farrés, and Xavier Parés. Kinetic and structural evidence of the alkenal/one reductase specificity of human ζ-crystallin. Cellular and Molecular Life Sciences, 68:1065–1077, 2011.

[7] Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. Gene ontology: tool for the unification of biology. Nature Genetics, 25:25–29, May 2000.

[8] Sarah Hunter, Rolf Apweiler, Teresa K. Attwood, Amos Bairoch, Alex Bateman, David Binns, Peer Bork, Ujjwal Das, Louise Daugherty, Lauranne Duquenne, Robert D. Finn, Julian Gough, Daniel Haft, Nicolas Hulo, Daniel Kahn, Elizabeth Kelly, Aurélie Laugraud, Ivica Letunic, David Lonsdale, Rodrigo Lopez, Martin Madera, John Maslen, Craig McAnulla, Jennifer McDowall, Jaina Mistry, Alex Mitchell, Nicola Mulder, Darren Natale, Christine Orengo, Antony F. Quinn, Jeremy D. Selengut, Christian J. A. Sigrist, Manjula Thimma, Paul D. Thomas, Franck Valentin, Derek Wilson, Cathy H. Wu, and Corin Yeats. InterPro: the integrative protein signature database. Nucleic Acids Research, 37:D211–D215, 2009.

[9] Robert D. Finn, Jaina Mistry, John Tate, Penny Coggill, Andreas Heger, Joanne E. Pollington, O. Luke Gavin, Prasad Gunasekaran, Goran Ceric, Kristoffer Forslund, Liisa Holm, Erik L. L. Sonnhammer, Sean R. Eddy, and Alex Bateman. The Pfam protein families database. Nucleic Acids Research, 38:D211–D222, 2010.

[10] Christian J. A. Sigrist, Lorenzo Cerutti, Edouard de Castro, Petra S. Langendijk-Genevaux, Virginie Bulliard, Amos Bairoch, and Nicolas Hulo. PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Research, 38:D161–D166, 2010.

[11] Richard Cote, Philip Jones, Lennart Martens, Samuel Kerrien, Florian Reisinger, Quan Lin, Rasko Leinonen, Rolf Apweiler, and Henning Hermjakob. The protein identifier cross-referencing (PICR) service: reconciling protein identifiers across multiple source databases. BMC Bioinformatics, 8(1):401, 2007.

[12] Chunlei Wu, Camilo Orozco, Jason Boyer, Marc Leglise, James Goodale, Serge Batalov, Christopher L. Hodge, James Haase, Jeff Janes, Jon W. Huss III, and Andrew I. Su. BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biology, 10(11):R130, 2009.

[13] Andrew I. Su, Tim Wiltshire, Serge Batalov, Hilmar Lapp, Keith A. Ching, David Block, Jie Zhang, Richard Soden, Mimi Hayakawa, Michael P. Kreiman, Gabriel Cooke, John R. Walker, and John B. Hogenesch. A gene atlas of the mouse and human protein-encoding transcriptomes. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 101(16):6062–6067, April 2004.

[14] Fredrik Pontén, Karin Jirström, and Mathias Uhlén. The human protein atlas – a tool for pathology. Journal of Pathology, 216:387–393, 2008.

[15] Lisa Berglund, Erik Björling, Per Oksvold, Linn Fagerberg, Anna Asplund, Cristina Al-Khalili Szigyarto, Anja Persson, Jenny Ottosson, Henrik Wernérus, Peter Nilsson, Emma Lundberg, Åsa Sivertsson, Sanjay Navani, Kenneth Wester, Caroline Kampf, Sophia Hober, Fredrik Pontén, and Mathias Uhlén. A genecentric human protein atlas for expression profiles based on antibodies. Molecular & Cellular Proteomics, 7:2019–2027, 2008.

[16] Helen Parkinson, Ugis Sarkans, Nikolay Kolesnikov, Niran Abeygunawardena, Tony Burdett, Miroslaw Dylag, Ibrahim Emam, Anna Farne, Emma Hastings, Ele Holloway, Natalja Kurbatova, Margus Lukk, James Malone, Roby Mani, Ekaterina Pilicheva, Gabriella Rustici, Anjan Sharma, Eleanor Williams, Tomasz Adamusiak, Marco Brandizi, Nataliya Sklyar, and Alvis Brazma. ArrayExpress update – an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Research, 39(suppl 1):D1002–D1004, 2011.

[17] Minoru Kanehisa and Susumu Goto. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1):27–30, 2000.

[18] Lisa Matthews, Gopal Gopinath, Marc Gillespie, Michael Caudy, David Croft, Bernard de Bono, Phani Garapati, Jill Hemish, Henning Hermjakob, Bijay Jassal, Alex Kanapin, Suzanna Lewis, Shahana Mahajan, Bruce May, Esther Schmidt, Imre Vastrik, Guanming Wu, Ewan Birney, Lincoln Stein, and Peter D’Eustachio. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Research, 37:D619–D622, 2009.

[19] Seth Carbon, Amelia Ireland, Christopher J. Mungall, ShengQiang Shu, Brad Marshall, Suzanna Lewis, the AmiGO Hub, and the Web Presence Working Group. AmiGO: online access to ontology and annotation data. Bioinformatics, 25(2):288–289, 2009.

A Tables

Family name         ADH  FAS  PDH  MECR  VAT1  PTGR  vertQOR  ZADH2
Number of proteins  33   5    2    4     14    8     4        5

Table A.1: Number of proteins found in each family

Family   Term                     Number of proteins  Percentage
ADH      PF08240 ADH_N            29                  88
ADH      PF00107 ADH_zinc_N       33                  100
FAS      PF00975 Thioesterase     3                   60
FAS      PF00698 Acyl_transf_1    4                   80
FAS      PF08659 KR               4                   80
FAS      PF00550 PP-binding       4                   80
FAS      PF02801 Ketoacyl-synt_C  4                   80
FAS      PF00109 ketoacyl-synt    4                   80
FAS      PF08242 Methyltransf_12  5                   100
FAS      PF00107 ADH_zinc_N       5                   100
PDH      PF08240 ADH_N            2                   100
PDH      PF00107 ADH_zinc_N       2                   100
MECR     PF08240 ADH_N            4                   100
MECR     PF00107 ADH_zinc_N       4                   100
VAT1     PF08240 ADH_N            12                  86
VAT1     PF00107 ADH_zinc_N       14                  100
PTGR     PF00107 ADH_zinc_N       8                   100
vertQOR  PF08240 ADH_N            4                   100
vertQOR  PF00107 ADH_zinc_N       4                   100
ZADH2    PF08240 ADH_N            4                   80
ZADH2    PF00107 ADH_zinc_N       5                   100

Table A.2: PFAM annotations
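The two numeric columns in the annotation tables relate each term to the size of its family. A minimal sketch of that calculation is shown below, assuming the percentage is the member count divided by the family size and that halves are rounded up (an assumption that is consistent with rows such as 5 of 8 proteins shown as 63). The function name `term_coverage` and the protein and term names are hypothetical.

```python
from collections import Counter


def term_coverage(annotations):
    """Map each term to (number of proteins, percentage of the family).

    `annotations` maps a protein identifier to the terms found for it.
    """
    family_size = len(annotations)
    # set() guards against a term being listed twice for the same protein.
    counts = Counter(t for terms in annotations.values() for t in set(terms))
    # int(x + 0.5) rounds halves up for the non-negative values used here.
    return {
        term: (count, int(100.0 * count / family_size + 0.5))
        for term, count in counts.items()
    }


# Hypothetical family of eight proteins: TERM_A on all, TERM_B on five.
family = {"P%d" % i: ["TERM_A"] for i in range(1, 9)}
for name in ("P1", "P2", "P3", "P4", "P5"):
    family[name].append("TERM_B")

for term, (count, percent) in sorted(term_coverage(family).items()):
    print(term, count, percent)
```

Python's built-in `round` rounds halves to the nearest even number in Python 3, so the explicit half-up rule is used here to keep the sketch's behaviour the same across interpreter versions.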


Family   Term                                                Number of proteins  Percentage
ADH      GO:0016491 oxidoreductase activity                  31                  94
ADH      GO:0008270 zinc ion binding                         25                  76
ADH      GO:0006067 ethanol metabolic process                20                  61
FAS      GO:0031177 phosphopantetheine binding               4                   80
FAS      GO:0055114 oxidation-reduction process              4                   80
FAS      GO:0000036 acyl carrier activity                    4                   80
FAS      GO:0042587 glycogen granule                         3                   60
FAS      GO:0048037 cofactor binding                         4                   80
FAS      GO:0009058 biosynthetic process                     3                   60
FAS      GO:0016740 transferase activity                     4                   80
FAS      GO:0005739 mitochondrion                            3                   60
FAS      GO:0016491 oxidoreductase activity                  3                   60
FAS      GO:0008270 zinc ion binding                         5                   100
PDH      GO:0055114 oxidation-reduction process              2                   100
PDH      GO:0005929 cilium                                   2                   100
PDH      GO:0031966 mitochondrial membrane                   2                   100
PDH      GO:0019861 flagellum                                2                   100
PDH      GO:0005625 soluble fraction                         2                   100
PDH      GO:0003939 L-iditol 2-dehydrogenase activity        2                   100
MECR     GO:0016491 oxidoreductase activity                  4                   100
MECR     GO:0055114 oxidation-reduction process              4                   100
MECR     GO:0008270 zinc ion binding                         4                   100
MECR     GO:0005739 mitochondrion                            3                   75
VAT1     GO:0016491 oxidoreductase activity                  14                  100
VAT1     GO:0055114 oxidation-reduction process              14                  100
VAT1     GO:0008270 zinc ion binding                         14                  100
PTGR     GO:0016491 oxidoreductase activity                  8                   100
PTGR     GO:0055114 oxidation-reduction process              8                   100
PTGR     GO:0032440 2-alkenal reductase activity             5                   63
PTGR     GO:0047522 15-oxoprostaglandin 13-oxidase activity  7                   88
PTGR     GO:0008270 zinc ion binding                         8                   100
PTGR     GO:0005737 cytoplasm                                5                   63
vertQOR  GO:0055114 oxidation-reduction process              4                   100
vertQOR  GO:0016491 oxidoreductase activity                  4                   100
vertQOR  GO:0008270 zinc ion binding                         4                   100
ZADH2    GO:0016491 oxidoreductase activity                  5                   100
ZADH2    GO:0055114 oxidation-reduction process              5                   100
ZADH2    GO:0008270 zinc ion binding                         5                   100

Table A.3: Gene ontology annotations

Family  Organism  Tissue                          t-score
ADH     human     Liver                           6.51
ADH     human     Fetalliver                      3.52
ADH     human     Adipocyte                       3.46
ADH     human     colon                           2.55
ADH     human     small intestine                 2.07
ADH     mouse     liver                           7.03
ADH     mouse     cornea                          4.15
ADH     mouse     adrenal gland                   2.65
ADH     mouse     bladder                         2.10
ADH     mouse     kidney                          1.65
ADH     mouse     intestine large                 1.32
FAS     human     SkeletalMuscle                  2.88
FAS     human     CardiacMyocytes                 2.88
FAS     human     Lymphoma burkitts(Raji)         2.27
FAS     human     AdrenalCortex                   2.27
FAS     human     TrigeminalGanglion              2.27
FAS     human     Fetallung                       1.97
FAS     human     SuperiorCervicalGanglion        1.67
FAS     human     Bonemarrow                      1.67
FAS     human     Tongue                          1.67
FAS     human     Liver                           1.67
FAS     human     Heart                           1.67
FAS     mouse     adipose brown                   7.37
FAS     mouse     adrenal gland                   4.01
FAS     mouse     mammary gland lact              2.67
FAS     mouse     mammary gland non-lactating     2.21
FAS     mouse     ovary                           1.74
FAS     mouse     adipose white                   1.58
MECR    human     Heart                           6.30
MECR    human     Lymphoma burkitts(Raji)         4.22
MECR    human     Lymphoma burkitts(Daudi)        1.76
MECR    human     721 B lymphoblasts              1.50
MECR    human     Wholebrain                      1.37
MECR    human     OccipitalLobe                   1.11
MECR    mouse     B-cells marginal zone           4.12
MECR    mouse     pancreas                        3.73
MECR    mouse     lens                            3.27
MECR    mouse     thymocyte SP CD4+               2.48
MECR    mouse     B-cells GL7negative Alum        1.52
MECR    mouse     T-cells CD8+                    1.41
MECR    mouse     adipose brown                   1.39
MECR    mouse     macrophage bone marrow 2hr LPS  1.34
MECR    mouse     retina                          1.30

Table A.4: BioGPS annotations, part 1

Family   Organism  Tissue                          t-score
PDH      human     Thyroid                         6.10
PDH      human     Prostate                        4.71
PDH      human     FetalThyroid                    2.96
PDH      human     Liver                           2.58
PDH      human     Kidney                          1.64
PDH      mouse     kidney                          6.03
PDH      mouse     liver                           4.88
PDH      mouse     testis                          3.84
PDH      mouse     mast cells IgE                  3.11
PDH      mouse     mast cells IgE+antigen 1hr      1.25
PTGR     human     BronchialEpithelialCells        7.61
PTGR     human     small intestine                 3.07
PTGR     human     SmoothMuscle                    2.13
PTGR     human     CardiacMyocytes                 2.03
PTGR     mouse     stomach                         7.25
PTGR     mouse     cornea                          4.86
PTGR     mouse     bladder                         1.95
VAT1     human     Adrenalgland                    4.60
VAT1     human     SmoothMuscle                    3.56
VAT1     human     Adipocyte                       2.47
VAT1     human     Lung                            2.10
VAT1     human     BronchialEpithelialCells        1.59
VAT1     human     Leukemialymphoblastic(MOLT-4)   1.58
VAT1     human     CardiacMyocytes                 1.38
VAT1     human     AdrenalCortex                   1.31
VAT1     human     retina                          1.28
VAT1     human     Hypothalamus                    1.23
VAT1     mouse     macrophage peri LPS thio 1hrs   5.14
VAT1     mouse     macrophage peri LPS thio 0hrs   5.06
VAT1     mouse     macrophage peri LPS thio 7hrs   4.37
VAT1     mouse     neuro2a                         1.36
VAT1     mouse     mast cells                      1.26
VAT1     mouse     osteoclasts                     1.21
vertQOR  human     721 B lymphoblasts              6.99
vertQOR  human     Fetalliver                      2.64
vertQOR  human     Leukemia promyelocytic-HL-60    2.00
vertQOR  human     Kidney                          1.75
vertQOR  human     pineal night                    1.44
vertQOR  human     Thyroid                         1.34
vertQOR  human     PancreaticIslet                 1.33
vertQOR  mouse     kidney                          9.52
vertQOR  mouse     liver                           1.33
ZADH2    mouse     adipose brown                   4.32
ZADH2    mouse     Baf3                            3.39
ZADH2    mouse     adrenal gland                   2.73
ZADH2    mouse     salivary gland                  2.53
ZADH2    mouse     ovary                           2.33
ZADH2    mouse     liver                           1.83
ZADH2    mouse     heart                           1.62
ZADH2    mouse     stomach                         1.45
ZADH2    mouse     intestine large                 1.18

Table A.5: BioGPS annotations, part 2

Family  Term                                   Number of proteins  Percentage
ADH     IPR011032 GroES-like                   33                  100
ADH     IPR013149 ADH_C                        33                  100
ADH     IPR002328 ADH_Zn_CS                    29                  88
ADH     IPR013154 ADH_GroES-like               29                  88
ADH     IPR002085 ADH_SF_Zn                    33                  100
ADH     IPR016040 NAD(P)-bd_dom                32                  97
FAS     IPR016038 Thiolase-like_subgr          4                   80
FAS     IPR023102 Fatty_acid_synthase_dom_2    3                   60
FAS     IPR009081 Acyl_carrier_prot-like       4                   80
FAS     IPR016035 Acyl_Trfase/lysoPlipase      4                   80
FAS     IPR020843 PKS_ER                       5                   100
FAS     IPR006163 Phsphopanteth-bd             4                   80
FAS     IPR001227 Ac_transferase_dom           4                   80
FAS     IPR020842 PKS/FAS_KR                   4                   80
FAS     IPR001031 Thioesterase                 3                   60
FAS     IPR006162 PPantetheine_attach_site     4                   80
FAS     IPR016036 Malonyl_transacylase_ACP-bd  4                   80
FAS     IPR013149 ADH_C                        5                   100
FAS     IPR013968 PKS_KR                       4                   80
FAS     IPR014030 Ketoacyl_synth_N             4                   80
FAS     IPR013217 Methyltransf_12              5                   100
FAS     IPR011032 GroES-like                   5                   100
FAS     IPR014043 Acyl_transferase             4                   80
FAS     IPR018201 Ketoacyl_synth_AS            4                   80
FAS     IPR014031 Ketoacyl_synth_C             4                   80
FAS     IPR016039 Thiolase-like                4                   80
FAS     IPR016040 NAD(P)-bd_dom                5                   100
FAS     IPR000794 Beta-ketoacyl_synthase       5                   100
PDH     IPR011032 GroES-like                   2                   100
PDH     IPR013149 ADH_C                        2                   100
PDH     IPR002328 ADH_Zn_CS                    2                   100
PDH     IPR013154 ADH_GroES-like               2                   100
PDH     IPR002085 ADH_SF_Zn                    2                   100
PDH     IPR016040 NAD(P)-bd_dom                2                   100
MECR    IPR002085 ADH_SF_Zn                    4                   100
MECR    IPR013154 ADH_GroES-like               4                   100
MECR    IPR011032 GroES-like                   4                   100
MECR    IPR013149 ADH_C                        4                   100
MECR    IPR016040 NAD(P)-bd_dom                4                   100

Table A.6: InterPro annotations, part 1

Family   Term                                     Number of proteins  Percentage
VAT1     IPR011032 GroES-like                     13                  93
VAT1     IPR013149 ADH_C                          14                  100
VAT1     IPR013154 ADH_GroES-like                 12                  86
VAT1     IPR002085 ADH_SF_Zn                      14                  100
VAT1     IPR016040 NAD(P)-bd_dom                  14                  100
VAT1     IPR002364 Quin_OxRdtase/zeta-crystal_CS  14                  100
PTGR     IPR002085 ADH_SF_Zn                      8                   100
PTGR     IPR011032 GroES-like                     8                   100
PTGR     IPR013149 ADH_C                          8                   100
PTGR     IPR014190 B4_12hDH                       5                   63
PTGR     IPR016040 NAD(P)-bd_dom                  8                   100
vertQOR  IPR011032 GroES-like                     4                   100
vertQOR  IPR013149 ADH_C                          4                   100
vertQOR  IPR013154 ADH_GroES-like                 4                   100
vertQOR  IPR002085 ADH_SF_Zn                      4                   100
vertQOR  IPR016040 NAD(P)-bd_dom                  4                   100
vertQOR  IPR002364 Quin_OxRdtase/zeta-crystal_CS  4                   100
ZADH2    IPR011032 GroES-like                     4                   80
ZADH2    IPR013149 ADH_C                          5                   100
ZADH2    IPR013154 ADH_GroES-like                 4                   80
ZADH2    IPR002085 ADH_SF_Zn                      5                   100
ZADH2    IPR016040 NAD(P)-bd_dom                  5                   100
ZADH2    IPR002364 Quin_OxRdtase/zeta-crystal_CS  5                   100

Table A.7: InterPro annotations, part 2

Family   Term                          Matching proteins
ADH      PS00059: ADH_ZINC             29
FAS      PS00012: PHOSPHOPANTETHEINE   4
FAS      PS00606: B_KETOACYL_SYNTHASE  4
FAS      PS50075: ACP_DOMAIN           4
PDH      PS00059: ADH_ZINC             2
VAT1     PS01162: QOR_ZETA_CRYSTAL     14
vertQOR  PS01162: QOR_ZETA_CRYSTAL     4
ZADH2    PS01162: QOR_ZETA_CRYSTAL     5

Table A.8: Prosite annotations

Family  Term                                            Number of proteins  Percentage
ADH     Subcutaneous adipose tissue                     18                  55
ADH     Liver                                           30                  91
ADH     Superior cervical ganglion                      19                  58
ADH     Kidney                                          19                  58
ADH     Heart                                           17                  53
FAS     Pancreas                                        3                   60
FAS     Brain                                           5                   100
FAS     Trachea                                         5                   100
FAS     The region between the LAD artery and the apex  3                   60
FAS     Brown fat from interscapular depression         3                   60
FAS     Preoptic area                                   3                   60
FAS     Embryo                                          3                   60
FAS     Colon                                           5                   100
FAS     Ovary                                           3                   60
FAS     Preputial gland                                 3                   60
FAS     Brainstem                                       3                   60
FAS     Hypothalamus                                    5                   100
FAS     Subcutaneous adipose tissue                     5                   100
FAS     Intestine                                       3                   60
FAS     Epithelium                                      3                   60
FAS     Adipose                                         3                   60
FAS     Gonadal white adipose tissue                    3                   60
FAS     Lateral geniculate nucleus (thalamus)           3                   60
FAS     Periaqueductal gray                             3                   60
FAS     Epidermis                                       5                   100
FAS     Dorsal skin without fur                         3                   60
FAS     Embryonic stem cell                             3                   60
FAS     Extraocular muscle                              3                   60
FAS     Liver                                           5                   100
FAS     Brown fat                                       3                   60
FAS     Adipose tissue                                  5                   100
FAS     Mesenteric lymph node                           3                   60
FAS     Lung                                            3                   60
FAS     Epidydimal white adipose tissue                 3                   60
FAS     Dorsal root ganglion                            5                   100
FAS     Snout epidermis                                 3                   60
FAS     Mammary gland                                   5                   100
FAS     Nucleus accumbens                               3                   60
FAS     Corpus                                          3                   60
FAS     Brown adipose tissue                            3                   60
FAS     Bed nucleus of the stria terminalis             3                   60
FAS     Skeletal muscle (M. vastus lateralis)           3                   60
FAS     Collecting duct                                 3                   60
FAS     Adrenal gland                                   3                   60
FAS     Whole brain                                     3                   60
FAS     Forebrain                                       3                   60
FAS     White adipose tissue                            3                   60

Table A.9: ArrayExpress annotations, part 1

Family  Term                                      Number of proteins  Percentage
PDH     Kidney                                    2                   100
PDH     Dorsal root ganglion                      2                   100
PDH     Muscle                                    2                   100
PDH     Liver                                     2                   100
PDH     Small intestine                           2                   100
PDH     Testis                                    2                   100
MECR    Colon                                     4                   100
MECR    Trachea                                   4                   100
MECR    Uterus                                    4                   100
MECR    Dorsal root ganglion                      4                   100
VAT1    Pituitary gland                           8                   57
VAT1    Lung                                      8                   57
VAT1    Hypothalamus                              13                  93
VAT1    Subcutaneous adipose tissue               10                  71
VAT1    Dorsal root ganglion                      8                   57
VAT1    Amygdala                                  10                  71
VAT1    Heart                                     8                   57
PTGR    Adipose                                   5                   63
PTGR    White adipose tissue                      5                   63
PTGR    Small intestine                           5                   63
PTGR    Nodose ganglion visceral sensory neurons  5                   63
PTGR    Stomach                                   5                   63
PTGR    Epithelium                                5                   63
PTGR    Kidney                                    7                   88
PTGR    Brainstem                                 5                   63
PTGR    Oviduct                                   5                   63
PTGR    Cingulate cortex homogenate               5                   63
PTGR    Anterior tibialis                         5                   63
PTGR    Whole organism                            5                   63
PTGR    Intestine                                 5                   63
PTGR    Preputial gland                           5                   63

Table A.10: ArrayExpress annotations, part 2

Family   Term                                     Number of proteins  Percentage
vertQOR  Kidney                                   3                   75
vertQOR  Uterus                                   3                   75
vertQOR  Pancreas                                 3                   75
vertQOR  Liver                                    3                   75
ZADH2    Urethra                                  3                   60
ZADH2    Paravertebral muscle                     3                   60
ZADH2    Skin                                     3                   60
ZADH2    Kidney cortex                            3                   60
ZADH2    Primary clear-cell renal cell carcinoma  3                   60
ZADH2    Lung                                     3                   60
ZADH2    Brain                                    3                   60
ZADH2    Endometrium                              3                   60
ZADH2    Seminiferous tubule                      3                   60
ZADH2    Ventricular myocardium                   3                   60
ZADH2    Hematopoietic and lymphatic system       3                   60
ZADH2    Cord blood                               3                   60
ZADH2    Frontal cortex, superior motor cortex    3                   60
ZADH2    Umbilical cord                           5                   100
ZADH2    Occipital lobe                           3                   60
ZADH2    Spleen                                   3                   60
ZADH2    Cerebrospinal fluid                      3                   60
ZADH2    Stomach pyloric                          3                   60
ZADH2    Cerebellum                               5                   100
ZADH2    Prostate                                 3                   60
ZADH2    Colon                                    5                   100
ZADH2    Penis                                    3                   60
ZADH2    Blood                                    3                   60
ZADH2    Jejunum                                  3                   60
ZADH2    Endometrium/ovary                        3                   60
ZADH2    Primary visual cortex                    3                   60
ZADH2    Omental adipose                          3                   60
ZADH2    Mammary gland                            3                   60
ZADH2    Stomach fundus                           3                   60
ZADH2    Cerebral cortex                          5                   100
ZADH2    Deltoid muscle                           3                   60
ZADH2    Kidney medulla                           3                   60
ZADH2    Thyroid                                  3                   60
ZADH2    Cancer, LCM                              3                   60

Table A.11: ArrayExpress annotations, part 3

Family  Term                                          Number of proteins  Percentage
ADH     Drug metabolism - cytochrome P450             16                  49
ADH     Fatty acid metabolism                         16                  49
ADH     Retinol metabolism                            16                  49
ADH     Glycolysis / Gluconeogenesis                  16                  49
ADH     Metabolism of xenobiotics by cytochrome P450  16                  49
ADH     Metabolic pathways                            16                  49
ADH     Tyrosine metabolism                           16                  49
FAS     Insulin signaling pathway                     2                   40
FAS     Metabolic pathways                            2                   40
FAS     Fatty acid biosynthesis                       2                   40
MECR    Metabolic pathways                            3                   75
MECR    Fatty acid elongation in mitochondria         3                   75
PDH     Metabolic pathways                            2                   100
PDH     Fructose and mannose metabolism               2                   100

Table A.12: KEGG annotations. Note that the numbers of proteins in KEGG for the four families are 16, 2, 3 and 2, respectively.

B List of Supplementary data

The following supplementary data can be accessed free of charge at http://www.ifm.liu.se/bioinfo/supplements/elfving-2011-thesis:

• Full source code of the Protein Family Annotation Software (PFAS) and various scripts written during the project
• Gene Ontology trees for all the families
• Cladograms for all studied families
• Raw data for the MDR and SDR families

C Manual for using the Protein Family Annotation Software (PFAS)

The full source code is available free of charge and without restrictions; see appendix B. The following text describes how to use the software. The tool is written for and tested under Python 2.7. The software itself requires no extra packages, but some of the plugins do. Please refer to table C.1 for details.

Package        Required in   Version
PyMySQL        mdrDB         0.4
suds           PicrSynonyms  0.4
BeautifulSoup  ArrayExpress  3.2.0

Table C.1: Packages needed for different plugins
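Before enabling a plugin, it can be useful to check that its package from table C.1 is importable. The sketch below is not part of PFAS. The helper name `missing_plugin_packages` is hypothetical, and the module names are the usual import names of these packages (`pymysql`, `suds`, and `BeautifulSoup` for the 3.x series), which is an assumption about how the plugins import them.

```python
import importlib

# Plugin name -> importable module name. The plugin/package pairs follow
# table C.1; the module names are assumed import names, not taken from PFAS.
PLUGIN_MODULES = {
    "mdrDB": "pymysql",
    "PicrSynonyms": "suds",
    "ArrayExpress": "BeautifulSoup",
}


def missing_plugin_packages(plugin_modules):
    """Return {plugin: module} for every module that cannot be imported."""
    missing = {}
    for plugin, module in sorted(plugin_modules.items()):
        try:
            importlib.import_module(module)
        except ImportError:
            missing[plugin] = module
    return missing


if __name__ == "__main__":
    for plugin, module in missing_plugin_packages(PLUGIN_MODULES).items():
        print("Plugin %s needs the missing module %s" % (plugin, module))
```

The check only verifies that a module can be imported. It does not compare installed versions against the ones listed in the table.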

Some of the plugins require separate data files, available at each data provider’s site. The paths to these files must then be entered in each plugin’s configuration file. Installation is quite straightforward: untar the file in a folder of your choice and add that path, together with the paths to the lib and plugins folders, to your PYTHONPATH variable, and the system should be up and running.
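The PYTHONPATH value described above can be assembled as in the following sketch. The installation folder `/opt/pfas` is a hypothetical example and the helper names are not part of PFAS; only the `lib` and `plugins` folder names come from the instructions above.

```python
import os


def pythonpath_entries(install_dir):
    """Return the install folder plus its lib and plugins subfolders."""
    return [
        install_dir,
        os.path.join(install_dir, "lib"),
        os.path.join(install_dir, "plugins"),
    ]


def extended_pythonpath(install_dir, existing=""):
    """Build a PYTHONPATH value, keeping any existing entries at the end."""
    entries = pythonpath_entries(install_dir)
    if existing:
        entries.append(existing)
    # os.pathsep is ':' on Unix-like systems and ';' on Windows.
    return os.pathsep.join(entries)


if __name__ == "__main__":
    # /opt/pfas is a placeholder for the folder the archive was unpacked in.
    print(extended_pythonpath("/opt/pfas", os.environ.get("PYTHONPATH", "")))
```

The printed value can then be exported as PYTHONPATH in the shell used to run the tool.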



Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Eric Elfving