Department of Computer and Information Science

Master's Thesis

A Lexicon for Gene Normalization

by Maria Lingemark

LIU-IDA/LITH-EX-A--09/038

2009-09-01

Linköpings universitet, SE-581 83 Linköping, Sweden


Supervisor: He Tan
Examiner: He Tan

URL for electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-20250

Keywords: Bioinformatics, gene normalization, string matching, text mining

Abstract

Researchers tend to use their own or favourite gene names in scientific literature, even though there are official names. Some names may even be used for more than one gene. This leads to problems with ambiguity when automatically mining biological literature. To disambiguate the gene names, gene normalization is used. In this thesis, we look into an existing gene normalization system and develop a new method to find gene candidates for the ambiguous genes. For the new method a lexicon is created, using information about the gene names, symbols and synonyms from three different databases. The gene mention found in the scientific literature is used as input for a search in this lexicon, and all genes in the lexicon that match the mention are returned as gene candidates for that mention. These candidates are then used in the system's disambiguation step. Results show that the new method gives a better overall result from the system, with an increase in precision and a small decrease in recall.

Contents

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Problem formulation
  1.4 Limitations
  1.5 Method
  1.6 Structure

2 Background
  2.1 Text Mining and Biological Text Mining
  2.2 Gene Names
  2.3 Gene Normalization
    2.3.1 Our System
  2.4 Literature Resources
  2.5 Databases
    2.5.1 Entrez Gene
    2.5.2 HGNC
    2.5.3 SwissProt

3 Related Work

4 Creating the lexicon
  4.1 Preprocessing
  4.2 Merging
    4.2.1 Step 1: Combining Entrez and HGNC
    4.2.2 Step 2: Combining combo file and SwissProt
  4.3 Cleaning
  4.4 Rules
  4.5 Implementation

5 Searching the Lexicon
  5.1 Exact String Matching
  5.2 Approximate String Matching
    5.2.1 Jaccard Index
    5.2.2 Dice's Coefficient
    5.2.3 Levenshtein Distance
    5.2.4 Q-Gram
  5.3 Implementation

6 Evaluation
  6.1 Test Data
  6.2 Method
  6.3 Result
    6.3.1 Exact String Matching
    6.3.2 Approximate String Matching
    6.3.3 Evaluation of the Lexicon in the Gene Normalization System

7 Discussion
  7.1 Comparing Lexicon to Database Query
  7.2 Future Work

A Vocabulary

Chapter 1

Introduction

1.1 Background

The amount of scientific literature has increased rapidly in the past decade, and much of it has been published online as well as in journal articles. The online literature is an important resource for researchers, but the amount of information that can be found can be overwhelming. Manually identifying relevant information is very impractical, and since the information is stored within free text, it is also hard to find the relevant documents. There are automated processes, such as natural language processing (NLP), that can be used to retrieve information from this vast quantity of text. When it comes to biomedical publications the problem is not only that of finding the useful information, but also that of identifying the correct genes mentioned in the text. When trying to find information, the first problem is to identify the parts of the text that mention genes. When this is done, the gene mentions found need to be identified as the actual genes the author is referring to, by linking them to gene database entries. This is called gene normalization. Gene normalization is needed because a gene mention found is not always the official name or symbol of that particular gene. It can be a new name the author of the article decided to use for the gene, or the name or symbol used can be ambiguous. Many systems exist which try to solve this problem, but there is still room for improvement. In this thesis we will look at a method to retrieve gene candidates for gene symbol disambiguation, which means to identify the correct gene if there is an ambiguity. This thesis' work focuses on one part of a larger system, which recognizes gene mentions in raw text, retrieves gene candidates for those mentions and uses the information about the candidates and the gene mentions from the text in a disambiguation step.

1.2 Motivation

This thesis focuses on a gene candidate retrieval step for gene normalization. In gene normalization one needs to find a way to decide which gene a mention refers to. One of the ways to do this is to retrieve gene candidates from a database, or a collection of databases, and match the information about the candidates against the information about the mention found in text. The gene candidate retrieval step aims to find the genes a mention in text might refer to.

1.3 Problem formulation

When creating a lexicon for gene candidate retrieval there are a few things that need to be decided. The problem formulation can be summarized to:

• Can a lexicon be used to retrieve gene candidates with a better result than directly querying databases?

• Which databases should be used in the creation of the lexicon?

• What structure should the lexicon have?

• How can we search the lexicon? Is exact string matching enough or do we need to use approximate string matching?

1.4 Limitations

The only organism considered in this thesis is human, which makes querying databases easier since it is possible to focus on human genes only. Some genes also share their name with the corresponding gene in other species, but we do not have to deal with that problem since we already know we are looking for a human gene. Only three different databases are included in the creation of the lexicon, which limits the number of synonyms found.

1.5 Method

The gene candidate retrieval is done by using a lexicon which is built by combining information, such as gene names and synonyms, from three different databases. The lexicon is used to search for all the genes a given gene mention might refer to. The search is done with exact string matching as well as a few different approximate string matching algorithms, in two different search methods. A lexicon can hold a combination of information, such as gene names and synonyms, from several different gene databases. This can increase the chance of finding the right gene in the gene candidate retrieval step, since some synonyms can only be found in one database.

1.6 Structure

The thesis starts with an introduction and then some background information in the areas relevant to the creation of the lexicon. The following chapter is a short description of related work. Chapters 4 and 5 cover the creation and usage of the lexicon. Chapter 6 is an evaluation of the gene candidate retrieval results using the lexicon and a comparison to the results from querying the database Entrez Gene. In the last chapter the results are discussed along with future work.

Chapter 2

Background

2.1 Text Mining and Biological Text Mining

Text mining is a process where one tries to automatically extract useful information from text documents. The extraction is done by identifying and exploring interesting patterns that are found in the unstructured textual data. A text mining system takes raw data (text documents) as input and produces various types of output, e.g. patterns and trends (Feldman & Sanger, 2007). Text mining can be divided into different steps. Examples of steps are information retrieval (IR), natural language processing (NLP) and information extraction (IE) (National Text Mining Centre & Redfearn, 2008).

• Information Retrieval: Identifies the documents that match the user's query. This narrows down the set of documents to those that are relevant to the problem. If one, for example, is interested in proteins, the set of documents can be narrowed down to those that contain names of proteins (National Text Mining Centre & Redfearn, 2008).

• NLP: Can be used to process text documents using general knowledge of the language. One example of an NLP task is to annotate text with part-of-speech (POS) tags. POS tagging is based on the context in which the words appear. Examples of POS tags are noun, verb and adjective. Another task in NLP is tokenization, which is the process of breaking up text into sentences and words. A challenge in this method is to identify the boundaries of the sentences. This involves distinguishing between the periods that mark the end of a sentence and the ones that are part of a previous token, such as Dr. or Mr. (Feldman & Sanger, 2007). The information from the NLP step can be used in the IE step (National Text Mining Centre & Redfearn, 2008).

• Information Extraction: Extracts structured data from the unstructured data, relying on the information from the NLP step (National Text Mining Centre & Redfearn, 2008).
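The tokenization and sentence-boundary problems described above can be sketched in a few lines. This is an illustrative toy splitter with a small hypothetical abbreviation list, not the NLP component used in any of the cited systems:

```python
import re

# Toy sentence splitter: a period ends a sentence unless it follows a
# known abbreviation such as "Dr." or "Mr." (hypothetical, tiny list).
ABBREVIATIONS = {"Dr", "Mr", "Mrs", "Prof"}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+", text):
        before = text[:match.start()].rsplit(None, 1)
        if before and before[-1].rstrip(".") in ABBREVIATIONS:
            continue  # the period is part of the previous token
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

def tokenize(sentence):
    # Break a sentence into word and punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)
```

With this sketch, a sentence beginning with "Dr. Smith" keeps the period of "Dr." attached to its sentence instead of triggering a split after it.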

Biological text mining is used to find the relevant information in the large number of biological articles that are published. In biological text mining one also needs to consider the heavy use of domain-specific terminology. To assist in the interpretation of scientific articles there are important resources such as terminological repositories and dictionaries. Biological ontologies are used to describe biologically relevant aspects of genes and gene products (Krallinger et al., 2008). One large biological ontology is the Gene Ontology, which is a structured controlled vocabulary that describes the roles of genes and gene products in organisms (The Gene Ontology Consortium, 2000).

2.2 Gene Names

The naming of genes causes some problems in gene normalization. Genes are often named using functional terms, such as insulin, or abbreviations, such as INS for insulin. There are not only gene names describing the genes but also symbols, which may be an abbreviation of the full gene name in some way. Because of the way genes are named, ambiguity exists. Often more than one functional name is used to refer to the same gene, and a functional term can be descriptive of some phenotype of the gene, such as wingless (a gene in the fruit fly D. melanogaster). Problems also exist for gene symbols, where an abbreviation can be used to describe more than one gene, or can be matched to other meanings. Guidelines for naming genes are presented by the HUGO1 Gene Nomenclature Committee (HGNC), which also approves gene names and symbols. All approved symbols are stored in the HGNC database. Even though there are official names for genes, many synonyms are still being used. The usage of names is influenced by important scientific papers, and researchers seem to use their own name for the gene or their favourite gene name rather than the official one. One example of the problem with gene naming can be seen with the gene lymphotactin. Three groups reported the cloning of that gene almost simultaneously in 1995, and they all named it differently (SCM1, ATAC and LTN). These names have been used since then. The name LPTN is also used, as well as the official symbol XCL1 (Tamames & Valencia, 2006).

2.3 Gene Normalization

Gene normalization is the process of identifying genes mentioned in text and giving them a unique identifier from an external source, such as the Entrez GeneID from the Entrez Gene database. Different types of gene normalization systems use different methods to do this. There are systems for gene normalization based on machine learning. This means that they need a training data set for the system to learn how to classify genes. Other systems use the information found in the text where the gene mention is found and compare it to the information about the gene candidates found in external sources, such as databases. These

1Human Genome Organisation (http://www.hugo-international.org/)

systems do not need a training set, and they have an advantage since there is no need for manually curated data to use the system.

2.3.1 Our System

As mentioned earlier, this thesis will look into and try to improve one part of an existing system for gene normalization. A schematic description of the system can be seen in figure 2.1. The disambiguation is done by comparing the information about the gene mention to the information about the gene candidates. No training is needed for the system. This system consists of the following five components (Tan, 2008):

1. Mapper - Outputs a list of concepts from ontologies and their mapping to the schemas of the databases used. This information is used in the matching step.

2. NLP - Processes the raw text input and recognizes gene mentions. MetaMap is used to map terms in text to concepts in the Unified Medical Language System (UMLS) Metathesaurus with a simple string matching algorithm. The Metathesaurus is a multi-purpose vocabulary database. It contains information about health and biomedical related concepts, their names and the relationships among them. The Metathesaurus is built from many different source vocabularies (U.S. National Library of Medicine, 2006).

3. Gene candidate retrieval - Retrieves gene candidates for the gene mentions. Information about the candidates such as nomenclature, organism, gene type, location, descriptive text and pathways that include the gene are retrieved. This part of the system uses a direct query of the Entrez Gene database to retrieve gene candidates. This thesis is focused on a new way of retrieving the gene candidates that might be able to replace the query of the database (marked with grey in figure 2.1).

4. Matching - Matches the information about the genes from the gene candidate retrieval to the information about the gene mention in the text. Different algorithms are used for the matching, based on keyword matching and a linguistic algorithm. The keyword matching is given a list of phrases, such as GO terms and the proteins the gene encodes, and the linguistic algorithm matches information given by sentences, such as the descriptive text.

5. Disambiguation ranking - Ranks the gene candidates based on the result from the matching. The gene with the highest score is proposed as the result. If several genes share the highest score, they are all proposed as the result.

Figure 2.1: Our gene normalization system. The grey area marks the part of the system this thesis focuses on. Original figure from Tan (2008).

2.4 Literature Resources

One of the most important literature resources in biological text mining and biological research is PubMed2 (Krallinger et al., 2008). PubMed is a part of the Entrez retrieval system developed by the U.S. National Library of Medicine's National Center for Biotechnology Information. PubMed provides free access to the citations and abstracts in MEDLINE, which is the primary component of PubMed, as well as links to sites with full text articles and links to related articles (U.S. National Library of Medicine, 2002). At the time of writing PubMed contains over 18 million citations from MEDLINE and other life science journals.

MEDLINE is a bibliographic database that contains references to journal articles in the area of life science, with its concentration on biomedicine. It contains citations from approximately 5200 journals in 37 languages. For the citations added between the years 2000 and 2005, about 79% have English abstracts (U.S. National Library of Medicine, 2004).

2http://www.ncbi.nlm.nih.gov/pubmed/

2.5 Databases

There are several different databases with information about genes and proteins, as well as ontologies such as the Gene Ontology mentioned in section 2.1. Examples of databases not used in this thesis are Ensembl3 and SOURCE4. The databases used in the creation of the lexicon are Entrez Gene, the HGNC database and SwissProt. They are each described in sections 2.5.1 to 2.5.3 below.

2.5.1 Entrez Gene

Entrez Gene is a database at the National Center for Biotechnology Information (NCBI). NCBI is a division of the National Library of Medicine (NLM) located on the campus of the US National Institutes of Health (NIH). The database is gene-specific and provides unique integer identifiers, called GeneIDs, for genes and other loci5 of different organisms. The GeneID is not only unique to the gene it represents, it is also species specific. That means that even though the same gene can be found in different species, the genes will have different GeneIDs in the database. Some types of information found in Entrez Gene are gene names, gene symbols and protein names, which are often the same as the gene names. The position of the gene on the chromosome and the nucleotide sequence can also be found. In this thesis we focus on the gene name, symbol and synonyms. There are different ways to query Entrez Gene, and the most direct is to submit a query from the NCBI home page. Other ways of querying are to use the e-utilities or to download a copy of the database (or parts of it) to a local hard drive. The e-utilities are ways to search and retrieve data from the database from within a program. Downloading parts of the database gives text files with one tab-delimited entry on each row (Maglott et al., 2005). When it is clear from context that no other database is referred to, Entrez Gene may also be called just Entrez.
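As an illustration of the e-utilities route, the sketch below only constructs an esearch request URL against Entrez Gene, restricted to human genes (cf. section 1.4). The endpoint and the [sym]/[orgn] search fields follow NCBI's public E-utilities interface; the helper function itself is hypothetical, and actually sending the request is left out:

```python
from urllib.parse import urlencode

# Base URL of NCBI's esearch E-utility.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(mention):
    # Look the mention up as a gene symbol, limited to Homo sapiens.
    params = {
        "db": "gene",
        "term": f"{mention}[sym] AND Homo sapiens[orgn]",
        "retmode": "xml",
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)
```

The XML returned for such a request lists the GeneIDs of the matching Entrez Gene records, which could then serve as gene candidates.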

2.5.2 HGNC

The HUGO Gene Nomenclature Committee (HGNC) maintains a database that contains human genes with unique and approved gene names and symbols. Other information that can be found in the database is previous names and symbols, as well as synonyms for both name and symbol. It also contains database cross-references to both Entrez Gene and SwissProt, among others. The HGNC database was previously referred to as Genew, but with a change of the database management system in 2005 the name was also changed to the HGNC database. The database can be accessed online or be downloaded by using a custom download where the user can choose which fields to download. It is also possible to download one or more of the predefined files. These files are split up into Core Data, Core Data by Chromosome and All Data6 (Eyre et al., 2006).

3http://www.ensembl.org/
4http://source.stanford.edu/
5Plural of locus - the specific site of a particular gene on its chromosome (http://wordnet.princeton.edu/)

2.5.3 SwissProt

The Universal Protein Resource (UniProt) is a free central resource on protein sequences and functional annotations. UniProt has four major components for different uses. One of the components is the UniProt Knowledgebase (UniProtKB). UniProtKB consists of two parts: UniProtKB/SwissProt and UniProtKB/TrEMBL. The focus of this thesis is on SwissProt, which is the manually annotated and curated part of UniProtKB. SwissProt is not, like the other two databases described above, a gene database, but a protein database. This does not mean that there is no useful gene information in it, but rather that the information is structured around the proteins. This means that an entry in the SwissProt database might contain more than one single gene, as more than one gene can code for a particular protein. It also means that the id, or accession number, cannot be used as an identifier for the genes mentioned. UniProt can be accessed online or be downloaded at http://www.uniprot.org (Bairoch & Apweiler, 2000; The UniProt Consortium, 2008).

6See http://www.genenames.org/data/gdlw_index.html for data fields.
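The line-oriented layout of SwissProt entries, where each row starts with a two-letter code (described again in section 4.1), can be illustrated with a minimal parser. The AC/GN line conventions below follow the public UniProtKB flat-file format, but the parsing details are an illustrative sketch, not the preprocessing code used in this thesis:

```python
import re

def parse_swissprot_entry(lines):
    # Collect accession numbers (AC lines) and gene names/synonyms
    # (GN lines) from one flat-file entry; all other line types
    # (sequence, references, ...) are ignored.
    entry = {"accessions": [], "genes": []}
    for line in lines:
        code, _, content = line.partition("   ")
        if code == "AC":
            entry["accessions"] += [a.strip() for a in content.rstrip(";").split(";")]
        elif code == "GN":
            name = re.search(r"Name=([^;]+);", content)
            syns = re.search(r"Synonyms=([^;]+);", content)
            if name:
                entry["genes"].append({
                    "name": name.group(1),
                    "synonyms": [s.strip() for s in syns.group(1).split(",")] if syns else [],
                })
    return entry
```

A line such as `GN   Name=SAA1; Synonyms=PIG4, TP53I4;` then yields one gene record with two synonyms, and one entry can carry several gene records, which matters for the merging step in chapter 4.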

Chapter 3

Related Work

The use of a lexicon in gene normalization is not limited to the disambiguation step. There are groups working on lexicons used for finding the genes in text. Kors et al. (2005) use five different databases (one of which is no longer available) to build a lexicon. They also apply a few rewrite rules to the terms in the lexicon. The lexicon is used to find mentions of genes in text. They report that their “results indicate that the combination of information from standard genetic databases expands the number of genes beyond that found in each database separately”. Fang et al. (2006) describe a system with focus on gene normalization. They have created a system that takes the output from a named entity recognition (NER) system and normalizes it by using a lexicon created from four online databases. The normalization is done by using string matching and a set of string transformations. The synonym with the highest matching score is considered a match if the score is higher than a defined threshold. No disambiguation is done in this system. The BioCreative II gene normalization challenge1 challenged the participating groups to create a system that could find gene mentions in plain text, in this case MEDLINE abstracts, identify them and return their corresponding Entrez GeneID. The participants were provided with a lexicon built using information from the three databases Entrez Gene, the HGNC database and UniProt. The groups chose to use the lexicon in different ways. Some groups added more terms to the lexicon and some groups pruned it. Many groups used the lexicon to match terms in text against it (Morgan et al., 2008).

The basic ideas for rules 1-7 described in chapter 4 come from a combination of rules used by different groups in the BioCreative II challenge. Hakenberg et al. (2007) used some rules for transformation of synonyms, including splitting abbreviations at every transition between upper or lower case letters, symbols and digits. Other transformations were changes in Greek/English lettering and Roman/Arabic numbering. Chang & Liu (2007) used normalization rules which included normalization of case, replacement of

1http://biocreative.sourceforge.net/

hyphens with space, removal of punctuation and removal of parenthesized material. Team number 4 in the BioCreative challenge also used normalization of case, as well as Roman numbering to Arabic, Greek letters to single letters and removal of punctuation and space (Morgan et al., 2008).
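The kinds of rewrite rules cited above can be collected into a single normalization function. This is a hedged sketch: the Greek-letter and Roman-numeral maps are tiny illustrative samples, not the actual rule sets used by any BioCreative team:

```python
import re

# Illustrative samples only; real rule sets cover the full alphabets.
GREEK = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d"}
ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5"}

def normalize_term(term):
    term = term.lower()                      # normalization of case
    term = re.sub(r"\([^)]*\)", " ", term)   # drop parenthesized material
    term = term.replace("-", " ")            # hyphens to spaces
    term = re.sub(r"[^\w\s]", "", term)      # remove remaining punctuation
    # Rewrite Greek letters and Roman numerals token by token.
    return " ".join(GREEK.get(t, ROMAN.get(t, t)) for t in term.split())
```

For example, "TNF-alpha" and "TNF a" then collapse to the same string. A naive token-level Roman-numeral rule like this one will also rewrite a standalone word "i", which is one reason real systems apply such rules more selectively.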

Chapter 4

Creating the lexicon

As previously mentioned, the databases Entrez Gene, HGNC and SwissProt were chosen for the lexicon creation. The reason for choosing these databases is that they are large, well-known and widely used. They have also been examined and chosen as good databases for building a lexicon by different research teams. Kors et al. (2005) used the three databases along with two others in their combination of databases, and Fang et al. (2006) used four databases, of which HGNC, Entrez Gene and SwissProt were three. The main focus for the lexicon was to include the gene name, symbol and as many synonyms as possible.

4.1 Preprocessing

Since the three databases all have different formats, some preprocessing was done to give them more similar formats and to remove unwanted information. The new formats have entries with one type of information per row and a special character combination to mark the end of the entry. For Entrez Gene and HGNC, which both have a format of one gene per row in the text file, the number of rows in the files increased after preprocessing, whereas for the SwissProt database, where an entry consists of many rows, the resulting file became smaller than the original database file. This is because there is much information in the SwissProt database that is not relevant for creating the lexicon, e.g. sequence information and references to articles. The preprocessing of the HGNC and Entrez Gene files was done by reading the lines one by one, splitting each line into its different fields of information and storing the information in an array. This way the information that we wanted to keep was always in the same position in the array. In HGNC, the information about previous symbols and names was classified as synonyms. In Entrez Gene, if the field Nomenclature status has the value O (official), the symbol from the Symbol from nomenclature authority field is used as symbol; otherwise the symbol from the Symbol field is used. The synonyms were found

in the fields Synonyms and Other designations. The field called description in Entrez Gene was also included in the new file. For SwissProt there is no predefined number of rows for an entry. Each row starts with a two-letter code that can be used to identify the type of information found in the row. The gene names and synonyms can be found in the lines marked as gene information, but these lines also contain some information that is not needed for the lexicon, so not all information in the rows is used. Some information about the protein in the entry, including the protein name and accession number, is saved in addition to the information about the genes. For Entrez Gene, the only database cross-reference saved was the one to HGNC, and for HGNC the database cross-reference saved was the one to Entrez Gene. In the SwissProt database both the cross-references to HGNC and Entrez Gene were saved if they were present. Example entries from Entrez Gene and HGNC can be seen in figures 4.1 and 4.2 respectively. Figure 4.3 shows an entry from SwissProt.

GeneID: 6288
GeneName:
GeneSymbol: SAA1
GeneSynonyms: MGC111216 PIG4 SAA TP53I4 tumor protein p53 inducible protein 4
Description: serum amyloid A1
DBXref: HGNC|10513
//

Figure 4.1: Entry from Entrez Gene

GeneID: HGNC:10513
GeneName: serum amyloid A1
GeneSymbol: SAA1
GeneSynonyms: SAA PIG4 TP53I4
DBXref: Entrez|6288
//

Figure 4.2: Entry from HGNC

AccessionNumber: P02735 P02736 P02737 Q16730 Q16834 Q16835 Q16879 Q3KRB3 Q96QN0
ProteinName: Serum amyloid protein A(4-101)
ProteinShort: SAA
ProteinSynonyms: Amyloid fibril protein AA
GeneName: SAA1
GeneSynonyms:
GeneName: SAA2
GeneSynonyms:
HGNCXref: HGNC:10513 HGNC:10514
EntrezXref: 6288 6289
//

Figure 4.3: Entry from SwissProt
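The Entrez Gene preprocessing described in this section can be sketched as follows. The column indices are modeled on NCBI's tab-delimited gene_info layout and should be treated as assumptions, and the simplified output mimics the entry format of figure 4.1 (the nomenclature-authority check is omitted):

```python
# Assumed column positions in NCBI's tab-delimited gene_info files.
FIELDS = {"gene_id": 1, "symbol": 2, "synonyms": 4, "dbxrefs": 5, "description": 8}

def preprocess_entrez_row(row):
    cols = row.rstrip("\n").split("\t")
    synonyms = cols[FIELDS["synonyms"]]
    if synonyms == "-":                      # NCBI writes "-" for empty fields
        synonyms = ""
    synonyms = synonyms.replace("|", " ")
    # Keep only the HGNC cross-reference, as described above.
    hgnc = [x for x in cols[FIELDS["dbxrefs"]].split("|") if x.startswith("HGNC:")]
    return "\n".join([
        "GeneID: " + cols[FIELDS["gene_id"]],
        "GeneSymbol: " + cols[FIELDS["symbol"]],
        "GeneSynonyms: " + synonyms,
        "Description: " + cols[FIELDS["description"]],
        "DBXref: " + (hgnc[0] if hgnc else ""),
        "//",
    ])
```

Applied to one tab-delimited row, this emits one multi-line entry terminated by "//", i.e. the one-type-of-information-per-row format the merging step expects.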

4.2 Merging

The creation of the lexicon was based on merging the information of the three files created in the preprocessing step. The merging of the files was split up into steps, and only two files were used in each step. First the information in the Entrez1 and HGNC files was combined and written to a new file, called combo. Then the information in the combo file was combined with the information in the SwissProt file, creating a first version of the lexicon. Figure 4.4 shows a schematic picture of the merging. Due to the format of the original databases, mainly SwissProt, the lexicon created had some genes occurring more than once, with the only difference between the entries being their protein information. The decision to keep the protein information in the lexicon, in case it was needed in the future, and the wish to have every gene represented by one single entry meant that the lexicon had to be processed further.

Figure 4.4: Merging the database files

1Note that in this section the Entrez Gene database will only be called Entrez. This is to avoid expressions like “Entrez Gene gene”.

4.2.1 Step 1: Combining Entrez and HGNC

The HGNC file was read and saved in an array, where each element in the array represents one gene entry in the file. The file is stored in an array so that, when an entry in the array is matched against an entry in the Entrez file, the array entry can be removed; this way one does not have to look through entries that have already been matched. The matching between genes is based on the database cross-reference fields in the two databases. When the cross-reference id to Entrez for a gene in HGNC is the same as the GeneID of the Entrez gene, the genes are said to match, and their information is combined and written to the combo file. If the genes do not have the same name, the HGNC name is used as gene name and the Entrez name is added as a synonym. The same is done for the symbol: if the symbols are not the same, the HGNC symbol is used and the Entrez symbol is added as a synonym. The reason that the HGNC name and symbol are preferred when there is a difference is that HGNC approves gene names and symbols and ensures that each gene has only one unique symbol (Eyre et al., 2006). If an Entrez gene does not have a matching gene in HGNC, the Entrez information is still written to the combo file, since we want as much information as possible. The last step in the combination of Entrez and HGNC is to write the left-over HGNC entries to the combo file. A small example of the combination can be seen in figure 4.5.
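A minimal sketch of this first merging step, assuming the files have already been parsed into dictionaries; the field names mirror the example entries, but the record layout itself is an assumption:

```python
def merge_entrez_hgnc(entrez_entries, hgnc_entries):
    # Index HGNC entries by their Entrez cross-reference so that a
    # matched entry can be removed and is never searched again.
    by_xref = {h["entrez_xref"]: h for h in hgnc_entries}
    combo = []
    for e in entrez_entries:
        h = by_xref.pop(e["gene_id"], None)
        if h is None:
            combo.append(dict(e))            # no HGNC match: keep Entrez info
            continue
        merged = {
            "name": h["name"],               # HGNC name/symbol win on conflict
            "symbol": h["symbol"],
            "synonyms": list(e["synonyms"]) + list(h["synonyms"]),
            "dbxrefs": {"Entrez": e["gene_id"], "HGNC": h["gene_id"]},
        }
        if e["name"] != h["name"]:
            merged["synonyms"].append(e["name"])    # demoted to synonym
        if e["symbol"] != h["symbol"]:
            merged["synonyms"].append(e["symbol"])
        combo.append(merged)
    combo.extend(by_xref.values())           # left-over HGNC entries
    return combo
```

Popping matched entries out of the index mirrors the array-removal trick described above: each HGNC entry is considered at most once.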

4.2.2 Step 2: Combining combo file and SwissProt

In the second step the combo file is stored in an array, where every element in the array is one gene entry. For each entry in the SwissProt file, a match is sought in the combo array. There can be more than one gene id of each kind (Entrez and HGNC) in a SwissProt entry, since there can be more than one gene in each entry. There might also be more gene ids from one database than from the other. These things need to be taken into consideration when the merging of the files is done. The matching is done by comparing all the Entrez ids and the HGNC ids in the SwissProt entry to the Entrez ids and HGNC ids in the combo file. If a match is found, the next step is to find the gene corresponding to the id. Only if both a match in the gene id and a match of gene symbol or synonym are found is it considered a match that can be used. The reason not to look for a match in gene name is that the gene names in SwissProt are what we call symbols in the combo file; there is no equivalent to the combo gene name in SwissProt. There are a few cases where there are more database cross-reference ids than there are gene symbols in the SwissProt file. In these cases the id that cannot be matched against any other gene information is not used. If only a gene id without gene information is present in the SwissProt file, it will not add any new information to the lexicon. An example of the second combination step and the resulting lexicon can be seen in figure 4.6.
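The id-plus-name matching of this second step can be sketched as below. For brevity the two-stage comparison (symbol first, then the expanded synonym search) is collapsed into a single set intersection, and the field names are illustrative:

```python
def match_swissprot_gene(sp_entry, combo_gene):
    # A SwissProt entry may reference several genes, so an id match alone
    # is not enough; some name evidence must confirm which gene it is.
    id_match = (
        combo_gene["dbxrefs"].get("Entrez") in sp_entry["entrez_xrefs"]
        or combo_gene["dbxrefs"].get("HGNC") in sp_entry["hgnc_xrefs"]
    )
    if not id_match:
        return None
    combo_terms = {combo_gene["symbol"], *combo_gene["synonyms"]}
    for gene in sp_entry["genes"]:
        if combo_terms & {gene["name"], *gene["synonyms"]}:
            return gene          # this SwissProt gene confirms the match
    return None                  # id matched but no name evidence: not used
```

In the SAA example, both ids of the combo gene SAA1 match the SwissProt entry, and the name comparison then singles out the SAA1 gene record rather than SAA2.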

GeneID: 6288
GeneName: serum amyloid A1
GeneSymbol: SAA1
GeneSynonyms: MGC111216 PIG4 SAA TP53I4
Description: serum amyloid A1
DBXref: HGNC|10513
//
GeneID: 6289
GeneName: serum amyloid A2
GeneSymbol: SAA2
GeneSynonyms:
Description: serum amyloid A2
DBXref: HGNC|10514
//
Entrez example file

GeneID: HGNC:10513
GeneName: serum amyloid A1
GeneSymbol: SAA1
GeneSynonyms: SAA PIG4 TP53I4
DBXref: Entrez|6288
//
GeneID: HGNC:10514
GeneName: serum amyloid A2
GeneSymbol: SAA2
GeneSynonyms:
DBXref: Entrez|6289
//
HGNC example file

GeneName: serum amyloid A1
GeneSymbol: SAA1
GeneSynonyms: MGC111216 PIG4 SAA TP53I4 SAA PIG4 TP53I4
Description: serum amyloid A1
DBXref: Entrez|6288
DBXref: HGNC|HGNC:10513
//
GeneName: serum amyloid A2
GeneSymbol: SAA2
GeneSynonyms:
Description: serum amyloid A2
DBXref: Entrez|6289
DBXref: HGNC|HGNC:10514
//
Combo example file

The gene serum amyloid A1 in the Entrez Gene file is the first to be matched against the genes in the HGNC file. The GeneID for the gene (6288) is searched in the HGNC file, and if a gene entry with the DBXref field Entrez|6288 is found, the information from the two files is combined. Here we can see that the gene name is the same in both files, as well as the gene symbol, so these are used as gene name and symbol in the combo file. The synonyms from the HGNC file are added to the synonym list from the Entrez file, and lastly both gene ids are written in two DBXref fields. This is repeated until all the genes in the Entrez Gene file are processed.

Figure 4.5: Example of the first step of the combination

AccessionNumber: P02735 P02736 P02737 Q16730 Q16834 Q16835 Q16879 Q3KRB3 Q96QN0
ProteinName: Serum amyloid protein A(4-101)
ProteinShort: SAA
ProteinSynonyms: Amyloid fibril protein AA
GeneName: SAA1
GeneSynonyms:
GeneName: SAA2
GeneSynonyms:
HGNCXref: HGNC:10513 HGNC:10514
EntrezXref: 6288 6289
//
SwissProt example file

GeneName: serum amyloid A1
GeneSymbol: SAA1
GeneSynonyms: MGC111216 PIG4 SAA TP53I4 SAA PIG4 TP53I4
ProteinName: Serum amyloid protein A(4-101)
EntrezXref: 6288
HGNCXref: HGNC:10513
//
GeneName: serum amyloid A2
GeneSymbol: SAA2
GeneSynonyms:
ProteinName: Serum amyloid protein A(4-101)
EntrezXref: 6289
HGNCXref: HGNC:10514
//
Lexicon example file

In the second part of the merge, the gene serum amyloid A1 from the combo file is again the first gene to be matched against a gene in the SwissProt file. Both the gene id from Entrez Gene (6288) and the gene id from HGNC (HGNC:10513) that are found in the combo file are used in the matching process. For each entry in the SwissProt file, all the Entrez Gene ids found (6288 and 6289) are compared to 6288, and all the HGNC ids (HGNC:10513 and HGNC:10514) are compared to HGNC:10513.

Here we can see that both the Entrez Gene id and the HGNC id matched, but we can also see that there are two possible genes that the ids might refer to (SAA1 and SAA2). Because of this, we need to look at the gene names and synonyms as well to find the correct gene. The gene names in the SwissProt entry (SAA1 and SAA2) are compared to the gene symbol from the combo entry that gave the match (SAA1). If one of the SwissProt names matches the combo name, as in this case, the information from the matching entries is merged. If no match is found, the search is expanded to compare synonyms from SwissProt to the combo symbol, and synonyms from the combo to the SwissProt name.
If a match is found, an additional step of comparing the gene id(s) from SwissProt (can be both a HGNC id and an Entrez Gene id) to the gene ids (HGNC and Entrez Gene) from the combo file is made to make sure that the ids are still the same. If they match, the merging of information is done. In this merging process the gene name and symbol are chosen from the combo entry and the gene information from the SwissProt entry is added to the synonym list. This matching is then continued until all entries in SwissProt are processed.

Figure 4.6: Example of the second step of the combination

4.3 Cleaning

After the combination of the files is done, some genes will have more than one entry describing them. This happens because of the format of the SwissProt database file, and the only thing that differs between the entries for the same gene is the protein information. The aim is to have only one gene per entry and only one entry per gene, so the file needs to be cleaned and the entries combined one more time. The cleaning starts by storing the file in an array with one entry per element and by creating an empty array for storing the new unique entries. The entries from the lexicon file are then processed one by one, and the id of each entry is compared to the ids of the entries already added to the new array. If the id is found, the new protein information is added to that entry; if it is not found, the entire entry is added to the new array. When all entries are processed, the information in the new array is written to a text file.
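The cleaning pass can be sketched as follows, as a Python illustration (the thesis implementation is in Perl) with invented field names:

```python
# Sketch of the cleaning pass: entries that share a gene id are folded
# into one, keeping the first entry seen and appending any protein
# information that is not already present.

def clean(entries):
    unique, by_id = [], {}
    for entry in entries:
        seen = by_id.get(entry["id"])
        if seen is None:
            by_id[entry["id"]] = entry
            unique.append(entry)
        else:
            for protein in entry["proteins"]:
                if protein not in seen["proteins"]:
                    seen["proteins"].append(protein)
    return unique
```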

4.4 Rules

In order to make the gene names, synonyms and symbols match as many spelling variants as possible a few transformations, from now on called rules, were applied to the lexicon. The following rules were applied:

1. Conversion of Roman numbering to Arabic.

2. Splitting of words at every transition from lower to upper case, or between letter and digit.

3. Conversion of Greek symbol names to their single letter representation.

4. Removal of parenthesized materials.

5. Replacement of punctuation with space.

6. Replacement of hyphen with space.

7. Normalization of case (all to upper).

8. Removal of multiple adjacent spaces that might have been created from the other rules.
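A minimal Python sketch of some of the rules above is shown below. It covers rules 2 and 5-8 only; the Roman-numeral, Greek-letter and parenthesis rules (1, 3 and 4) are omitted for brevity, and the regular expressions are one plausible reading of the rule descriptions, not the thesis's Perl code:

```python
import re

# Sketch of rules 2 (case/digit splitting), 5 (punctuation), 6 (hyphens),
# 7 (upper-casing) and 8 (space squeezing), applied in order.

def normalize(name):
    name = re.sub(r"([a-z])([A-Z])", r"\1 \2", name)   # rule 2: case split
    name = re.sub(r"([A-Za-z])(\d)|(\d)([A-Za-z])",    # rule 2: letter-digit
                  lambda m: " ".join(g for g in m.groups() if g), name)
    name = re.sub(r"[^\w\s-]", " ", name)              # rule 5: punctuation
    name = name.replace("-", " ")                      # rule 6: hyphens
    name = name.upper()                                # rule 7: case
    return re.sub(r" +", " ", name).strip()            # rule 8: spaces
```

Applied to the phrase from figure 4.7, this yields PHOSPHORYLATED HEAT AND ACID STABLE PROTEIN REGULATED BY INSULIN 1.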

It is important that the rules are always applied in the order listed above, since, for example, splitting up an abbreviation might create single letters or letter combinations that can be interpreted as Roman numbers even though they are not. The first rule is limited so that only letters that can be interpreted as Roman numbers with a value of less than 40 are considered Roman numbers and changed to Arabic. This is done so that letters representing different types of the same gene, for example the C in 5-hydroxytryptamine receptor 3 subunit C, stay that way. Otherwise the C would

Table 4.1: Greek-to-Latin alphabet conversion (Wain et al., 2009)

Greek symbol name    Latin upper case conversion
alpha                A
beta                 B
gamma                G
delta                D
epsilon              E
zeta                 Z
eta                  H
theta                Q
iota                 I
kappa                K
lambda               L
mu                   M
nu                   N
xi                   X
omicron              O
pi                   P
rho                  R
sigma                S
tau                  T
upsilon              Y
phi                  F
chi                  C
psi                  U
omega                W

have been identified as a Roman number and changed to 100. The limit 40 is chosen because we consider it likely that most Roman numbers will be below that, as they are mostly used to describe types or classes of genes, e.g. type I and type II. Another reason for choosing 40 as the limit is that the next Roman symbol after 10 (X) is 50 (L). The replacement of Greek symbol names with their single letter representation is done using table 4.1. The rule for removal of parenthesized material removes all complete parentheses, including nested ones. An example of nested parentheses is: this ((is a) nested parentheses) example. In the example, the text this example is left after the removal of the parenthesized material. When applying rules 4, 5 and 6 there is a risk that new spaces are introduced next to previously existing ones. Rule 8 is used to solve that problem. Figure 4.7 shows an example where rule 8 is needed.

Original phrase:    phosphorylated heat- and acid-stable protein regulated by insulin 1
Rules 1-7 applied:  PHOSPHORYLATED HEAT  AND ACID STABLE PROTEIN REGULATED BY INSULIN 1
Rule 8 applied:     PHOSPHORYLATED HEAT AND ACID STABLE PROTEIN REGULATED BY INSULIN 1

(The double space left after HEAT by rule 6 is removed by rule 8.)

Figure 4.7: Example of rule 8

4.5 Implementation

The programs used for creating the lexicon are written in Perl. The modules used are Regexp::Common version 2.122 (2) for the regular expression that finds nested parentheses and Roman version 1.22 (3) for identifying and converting Roman numbers to Arabic. The Perl code was written using the EngInSite Perl Editor Lite (4). The programs are not aimed at being as time efficient as possible but at getting the work done, which means there is probably room for improvement in the code. The time factor is not very important when it comes to building the lexicon, as it only needs to be done once before the lexicon is ready to be used in searches. The lexicon will need to be rebuilt if new updates from the databases are to be included. The SwissProt and Entrez database files used were downloaded October 8, 2008 and the HGNC database file was downloaded October 14, 2008.

2 http://search.cpan.org/~abigail/Regexp-Common-2.122/
3 http://search.cpan.org/~chorny/Roman-1.22/
4 http://www.enginsite.com/Perl.htm

Chapter 5

Searching the Lexicon

Both exact and approximate string matching were used to search the lexicon. If a gene mention is found in the lexicon, i.e. if we get a hit, the corresponding GeneID for the gene name, symbol or synonym that matched is returned. It is possible that more than one GeneID is returned, and if the correct GeneID is found amongst the ones returned, it is considered to be a correct match. From now on the word synonym will be used to refer to everything we try to match in the lexicon, i.e. the gene names, symbols and synonyms.

5.1 Exact String Matching

Our first approach was to simply compare the gene mentions to all the synonyms in the lexicon by using exact string matching, i.e. the two strings must be exactly the same in order to count it as a hit.

5.2 Approximate String Matching

The next approach was to use approximate string matching to search the lexicon. Several different algorithms were used to compare strings in different ways without the need for an exact match, and different thresholds for each algorithm were tried. The algorithms were used both individually and in combinations of two algorithms, where the algorithms and thresholds for the combinations were based on the results from the individual searches. Two different methods were used to search the lexicon. The first method takes a gene mention and, for each synonym in the lexicon, compares it first with exact matching; if that does not give a hit, it tries the first approximate algorithm, and if it still does not match, for the combination searches, a second approximate algorithm is used. The second method is only used in combination searches. Instead of searching for a gene mention using the three (the exact and two approximate) string matching algorithms for each synonym in the lexicon, the entire lexicon is

searched with one algorithm before moving on to the next. Only if no match is found in the lexicon after searching with one algorithm is the next algorithm used. This is repeated until the search gets a hit or all three algorithms have been used. With this method the search for a gene mention in the lexicon can end sooner than with the first method, but some gene mention searches may also take more time than with the first method. The different approximate string matching algorithms used in the lexicon search are described in sections 5.2.1 to 5.2.4.
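The second search method, a cascade of matchers over the whole lexicon, can be sketched as follows. This is an illustrative Python sketch, not the thesis's Java code; the lexicon is assumed to be a list of (synonym, gene id) pairs and each matcher a function from (mention, synonym) to a boolean:

```python
# Sketch of search method 2: run each matcher over the entire lexicon
# in turn, stopping as soon as one of them yields any hit.

def cascade_search(mention, lexicon, matchers):
    for match in matchers:
        hits = [gene_id for synonym, gene_id in lexicon
                if match(mention, synonym)]
        if hits:
            return hits        # later (looser) matchers are never tried
    return []
```

In the thesis the matcher list would be exact matching followed by one or two approximate algorithms; here any predicate works.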

5.2.1 Jaccard Index

The Jaccard index is based on the number of terms in the strings. It is calculated as the size of the intersection of the strings divided by the size of the union of the strings. In equation 5.1, A and B represent the two strings. The intersection of A and B is all the terms that occur in both A and B, and the union is all the different terms that occur in either A or B. The sizes of the intersection and union are thus the number of shared terms and the total number of different terms in the strings. Figure 5.1 shows an example of a string and the terms of the string. Note that the | is only used in the example to better mark the beginning and the end of each token.

J(A, B) = |A ∩ B| / |A ∪ B|    (5.1)

This is a test string, split at whitespace → |This| |is| |a| |test| |string,| |split| |at| |whitespace|

Figure 5.1: Example of a string split into its terms
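Equation 5.1 can be written out in a few lines. This is an illustrative Python sketch (the thesis uses the SimMetrics Java library), treating the strings as sets of whitespace-separated terms as in figure 5.1:

```python
# Term-based Jaccard index (equation 5.1): shared terms over all
# distinct terms, with terms taken as whitespace-separated tokens.

def jaccard(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)
```

For example, jaccard("serum amyloid A1", "serum amyloid A2") is 2/4 = 0.5 (two shared terms, four distinct terms).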

5.2.2 Dice's Coefficient

Dice's coefficient is similar to the Jaccard index and is calculated as two times the size of the intersection divided by the number of terms in the first string plus the number of terms in the second string. In equation 5.2, the strings are represented as A and B.

s = 2|A ∩ B| / (|A| + |B|)    (5.2)
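A corresponding Python sketch of equation 5.2, again illustrative rather than the SimMetrics implementation, with |A| and |B| taken as the numbers of distinct whitespace-separated terms:

```python
# Term-based Dice's coefficient (equation 5.2): twice the number of
# shared terms over the summed term counts of the two strings.

def dice(a, b):
    ta, tb = set(a.split()), set(b.split())
    return 2 * len(ta & tb) / (len(ta) + len(tb))
```

For the same pair as before, dice("serum amyloid A1", "serum amyloid A2") is 2·2/(3+3) ≈ 0.67, illustrating that Dice scores the same pair higher than Jaccard.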

5.2.3 Levenshtein Distance

Levenshtein distance is a basic edit distance function. The edit distance is measured as the number of operations needed to transform one string into another. The operations are insertion, deletion and substitution. The fewer operations needed, the more similar the strings are. In Levenshtein distance all operations have the same cost.
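The standard dynamic-programming formulation of the distance looks as follows; this is a textbook Python sketch, not the SimMetrics code, and the threshold values used in this thesis would apply to a normalized similarity derived from the distance rather than to the raw edit count:

```python
# Levenshtein distance with unit cost for insertion, deletion and
# substitution, computed row by row over the edit-distance table.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

For example, levenshtein("SAA1", "SAA2") is 1: a single substitution turns one string into the other.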

5.2.4 Q-Gram

Q-gram, or n-gram as it is also called, is a method where the strings to be compared are split up into smaller strings of length q. The most common values of q are 2 (bigram) and 3 (trigram). The similarity is calculated as the number of identical q-grams divided by the total number of q-grams. Padding (#) is added at the beginning and end of the string; the number of padding characters at each end is q-1. Figure 5.2 shows an example of a string and its bigrams. Note that the | is only used in the example to better mark the beginning and the end of each token.

This is a test string → |#T| |Th| |hi| |is| |s | | i| |is| |s | | a| |a | | t| |te| |es| |st| |t | | s| |st| |tr| |ri| |in| |ng| |g#|

Figure 5.2: Example of q-gram where q = 2
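A Python sketch of the q-gram similarity follows. It is illustrative rather than the SimMetrics implementation, and it reads "identical q-grams divided by the total number of q-grams" as shared distinct q-grams over the distinct q-grams of both strings; other readings (e.g. over multisets) are possible:

```python
# Q-gram similarity with q-1 padding characters ('#') at each end of
# the string, as in figure 5.2.

def qgrams(s, q=2):
    s = "#" * (q - 1) + s + "#" * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a, b, q=2):
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)
```

For example, qgrams("ab") is {"#a", "ab", "b#"}, and qgram_similarity("SAA1", "SAA2") is 3/7 (three shared bigrams out of seven distinct ones).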

5.3 Implementation

The lexicon search is written in Java using NetBeans IDE 6.1 (1), with the exception of applying the rules to the gene mentions, which is done by calling a Perl program. The reason for this is that the rules applied to the gene mentions are the same as the ones applied to the lexicon, described in section 4.4. To make sure the rules are the same, no new implementation of them was written in Java. For the approximate string matching algorithms the external Java library SimMetrics (2) is used. SimMetrics is an open source library of similarity and distance metrics.

1 http://www.netbeans.org/
2 http://www.dcs.shef.ac.uk/~sam/simmetrics.html

Chapter 6

Evaluation

6.1 Test Data

To evaluate the lexicon we used the training data from the BioCreative II gene normalization challenge (1). The training data consists of files with MEDLINE abstracts and a file that holds the gene mentions found in the abstracts along with their correct GeneID. The latter is the one used for the tests. The test file contains 998 gene mentions for 640 different genes in 279 abstracts.

6.2 Method

To be able to compare the lexicon results to a direct query of a database, we wanted to find the algorithm or combination of algorithms, and the threshold values, that gave the best result. In order to compare different algorithms and thresholds, precision and recall were calculated for each search. Precision is calculated as the number of correct gene ids returned divided by the total number of ids returned. Recall is calculated as the number of correct gene ids returned divided by the number of ids expected to be returned. A simple example of precision and recall calculation can be found in figure 6.1. The F-measure, the harmonic mean of precision and recall, is also calculated:

F = (2 × precision × recall) / (precision + recall)

A second form of precision and recall (precision-2 and recall-2) was also calculated, based on the results for each abstract alone. For each abstract the precision and recall were calculated; the averages were then obtained by summing the precision and recall values over all the abstracts and dividing by the total number of abstracts.

1 http://biocreative.sourceforge.net/biocreative_2_dataset.html

Gene mentions for genes A, B and C are used in a lexicon search. Genes A and C are correctly returned; along with them, genes W, X, Y and Z are also returned.

Precision = number of correct genes returned / total number of genes returned = 2 / 6 = 0.33
Recall = number of correct genes returned / number of expected genes = 2 / 3 = 0.67

Figure 6.1: Example of precision and recall calculation
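The calculations in figure 6.1 can be reproduced with a few lines of Python (an illustrative sketch, not part of the thesis's evaluation code):

```python
# Evaluation measures from section 6.2, applied to the figure 6.1
# example: genes A and C correct out of six returned, three expected.

def scores(returned, expected):
    correct = len(set(returned) & set(expected))
    precision = correct / len(returned)
    recall = correct / len(expected)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

p, r, f = scores(["A", "C", "W", "X", "Y", "Z"], ["A", "B", "C"])
# p = 2/6 ≈ 0.33, r = 2/3 ≈ 0.67
```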

The evaluation was done by first finding good threshold values for each string matching algorithm. This was done by choosing a start threshold and, depending on the result, changing the values until a good enough result was found. To speed things up, the search for a good threshold was done on a file containing approximately one third of the gene mentions. When good threshold values were found, the search was run on the file with all the gene mentions, using the threshold values found in the small search. The results from the runs with the entire gene mention file can be seen in section 6.3 below. The different string matching algorithms were then compared to find the ones with the best balance of precision and recall; these algorithms were used in the combination searches. When all the precision, recall and F-values were calculated, the best algorithms were compared to the Entrez search. The Entrez search takes each gene mention in the test file and uses it to directly query the Entrez Gene database. The query is limited to human genes by extending the query with a condition for the organism to be human. If the correct gene id for the gene mention is returned from the query, it is considered a match.

6.3 Result

6.3.1 Exact String Matching

Table 6.1: Exact string matching

Calculation   Threshold = 1.0
Precision     0.772
Recall        0.543
Precision-2   0.738
Recall-2      0.566

The use of exact string matching alone gave low precision and recall, as can be seen in table 6.1.

6.3.2 Approximate String Matching

Single Metrics

The results from the single metric runs can be seen in tables 6.2-6.6 below.

Table 6.2: Jaccard Index

Threshold   0.50    0.55    0.60    0.65
Precision   0.160   0.160   0.165   0.196
Recall      0.720   0.720   0.717   0.695
F           0.261   0.261   0.268   0.306

Table 6.3: Dice's Coefficient

Threshold   0.65     0.70    0.75    0.80
Precision   0.0391   0.160   0.195   -
Recall      0.792    0.720   0.699   -
F           0.0745   0.261   0.305   -

Table 6.4: Levenshtein Distance

Threshold   0.75     0.80     0.85    0.90
Precision   0.0856   0.0991   0.287   -
Recall      0.648    0.628    0.589   -
F           0.151    0.171    0.386   -

Table 6.5: 2-gram

Threshold   0.70     0.75    0.80    0.85
Precision   0.0927   0.173   0.234   -
Recall      0.725    0.684   0.654   -
F           0.164    0.276   0.345   -

Table 6.6: 3-gram

Threshold   0.65    0.70    0.75    0.80
Precision   0.119   0.193   0.295   -
Recall      0.712   0.672   0.636   -
F           0.204   0.300   0.403   -

As can be seen, no recall is higher than 0.80 and the only metric close to that value has a very low precision (only 0.0391). Most of the algorithms give recall values around 0.70 for the chosen threshold values. One can see that the

threshold values that give decent precision and recall vary a lot between the algorithms. Most noticeable are the threshold values for the Jaccard index, which are very low. One can also see that the Jaccard index and Dice's coefficient give similar results, with most of the difference lying in the threshold values. Levenshtein distance gives good precision at a high threshold but not very high recall, and a small change in threshold gives a big change in precision. The two q-gram distance metrics, 2-gram and 3-gram, give similar results, but 2-gram gives higher recall and 3-gram higher precision when the two are compared at the same thresholds. Since the metrics give similar results, one might want to take into consideration the time it takes to use each metric. The first two metrics are similar in time, which they should be, since they are similar in method too. They are also the fastest of the metrics used, with a time of about 45 minutes to search through the lexicon for all the gene mentions in the test file. Levenshtein distance takes a little longer to run, at about 70 minutes. The slowest metrics used were the q-gram metrics, with times above 300 minutes. The execution times depend on the hardware as well as the software used, i.e. with a faster computer the times are likely to decrease, and more time efficient implementations of the algorithms would also affect the total run time. The high run time for the q-gram metrics can probably be explained by the fact that for every gene mention searched for in the lexicon, that mention and the whole lexicon need to be tokenized into q-grams. There might be a way to save the tokenization of the lexicon so that it only needs to be done once, but that would require much storage space, since the lexicon file is already close to 8 MB in size and the size of a tokenized string is much greater than the size of the original string, see figure 5.2.

Combinations

Some combinations of approximate algorithms were tried in an attempt to get higher recall. The combinations tried were Dice's coefficient with Levenshtein distance, and Dice's coefficient with 2-gram. Different thresholds were used for the combinations: the threshold for Dice's coefficient was 0.70, for Levenshtein the two thresholds 0.80 and 0.85 were used, and the 2-gram thresholds were 0.75 and 0.80. The thresholds for the combinations were based on the best results from the single algorithm runs. In the combination searches the second search method described in section 5.2 was used as well. Results from the combination runs can be seen in tables 6.7-6.10. As expected, the precision from the second method is higher than the precision from the first search method. One can also see that the recall values decreased, but not by much. The results from the direct query of Entrez Gene are shown in table 6.11.

Table 6.7: Dice's coefficient and Levenshtein distance, search method 1

              Dice: 0.70          Dice: 0.70
Threshold     Levenshtein: 0.80   Levenshtein: 0.85
Precision     0.0776              0.133
Recall        0.749               0.733
F             0.141               0.225
Precision-2   0.267               0.424
Recall-2      0.775               0.759
F-2           0.397               0.544

Table 6.8: Dice's coefficient and 2-gram, search method 1

              Dice: 0.70     Dice: 0.70
Threshold     2-gram: 0.75   2-gram: 0.80
Precision     0.108          0.124
Recall        0.757          0.738
F             0.190          0.212
Precision-2   0.328          0.392
Recall-2      0.782          0.763
F-2           0.462          0.518

Table 6.9: Dice's coefficient and Levenshtein distance, search method 2

              Dice: 0.70          Dice: 0.70
Threshold     Levenshtein: 0.80   Levenshtein: 0.85
Precision     0.376               0.407
Recall        0.733               0.723
F             0.497               0.521
Precision-2   0.632               0.646
Recall-2      0.761               0.751
F-2           0.690               0.695

Table 6.10: Dice's coefficient and 2-gram, search method 2

              Dice: 0.70     Dice: 0.70
Threshold     2-gram: 0.75   2-gram: 0.80
Precision     0.373          0.391
Recall        0.738          0.726
F             0.495          0.509
Precision-2   0.626          0.639
Recall-2      0.767          0.754
F-2           0.689          0.692

Table 6.11: Entrez Gene query

Precision     0.0813
Recall        0.826
F             0.148
Precision-2   0.187
Recall-2      0.828
F-2           0.305

6.3.3 Evaluation of the Lexicon in the Gene Normalization System

The algorithm combination Dice's coefficient with threshold 0.70 and 2-gram with threshold 0.75 was chosen to evaluate the lexicon in the gene normalization system. The results from the system using a direct query of Entrez Gene are shown in table 6.12, and the results from the system using the lexicon with the algorithm combination above can be seen in table 6.13.

Table 6.12: Results from system with direct query (Tan, 2008)

Dataset       1      2
Precision     0.48   0.52
Precision-2   0.69   0.67
Recall        0.82   0.78
F             0.61   0.62
F-2           0.75   0.73

Table 6.13: Results from system using the lexicon

Dataset       1      2
Precision     0.65   0.72
Precision-2   0.89   0.90
Recall        0.66   0.73
F             0.65   0.72
F-2           0.76   0.81

As can be seen in the tables, the precision is higher for the lexicon, but the recall is lower. When the system used the direct query of Entrez Gene, it proposed 21 gene candidates for one symbol in dataset 1 and 14 gene candidates for one symbol from dataset 2; for 5 symbols from both datasets, more than 10 gene candidates were retrieved. When the system used the lexicon, it never retrieved more than 10 gene candidates. Dataset one is the training data and dataset two is the test data from the BioCreative II challenge.

Chapter 7

Discussion

7.1 Comparing Lexicon to Database Query

The use of exact string matching alone does not give high enough recall to be used as the single algorithm for the lexicon search. No lexicon method reaches as high recall as the direct query of Entrez Gene, but the precision is much better for all the lexicon search methods and algorithms. If one only wants high recall, the direct query method is still the better option, but one needs to consider that it may be harder to identify the correct gene in the disambiguation step if there are too many gene candidates. The low recall from the lexicon was an expected result, as there are some gene mentions, discussed in the future work section, that will cause problems. With improvements to the lexicon one might be able to increase the recall further.

7.2 Future Work

There are still improvements that can be made, both to the lexicon and to the finding of gene mentions in text. The lexicon could be further pruned, both to reduce search time and as a way to try to increase precision. At the moment the lexicon contains duplicates of some of the gene synonyms, and these should be removed. There are also ids from a database listed as synonyms, and these could probably be removed without any decrease in recall. The rules for the lexicon could probably be improved; one rule in particular is the recognition and transformation of Roman numbers to Arabic. If one wants to improve recall further, adding a new database to the lexicon as well as keeping the lexicon up to date are methods to try. Another way of improving precision could be to look into only returning the synonym(s) with the highest match values from the lexicon search. Known problems with the lexicon arise when the gene mention found in text refers to more than one gene, such as all genes in an interval or a group of genes. Examples of this are the gene mention freac1-freac7, which refers to

all the genes from freac1 to freac7, and the gene mention PKC isoforms alpha, delta, epsilon, which refers to PKC alpha, PKC delta and PKC epsilon. Methods to recognize and separate these types of mentions need to be looked into, and could be added as a step between the gene mention extraction and the gene candidate retrieval steps in the system. The program that combines the databases to form a lexicon can be improved so that adding a new database is easier. One way to do this is to make the combination process the same for all databases by pre-processing them all to have the exact same format. If this could be done, one would be able to add a database by writing only the pre-processing code for the new database.

Appendix A

Vocabulary

This list presents some of the vocabulary used in this thesis work.

• Entry - All the information about a single gene (or protein for SwissProt) in the database

• Gene mention - The gene name or symbol found in text or the part of the text that mentions a gene

• GeneID - The unique id for the genes that is used in Entrez Gene

• HGNC - A gene database

• Entrez Gene - A gene database

• SwissProt - A protein database

Bibliography

[1] Bairoch, A. & Apweiler, R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research, 28, pp. 45-48.

[2] Chiang, J-H. & Liu, H-H. (2007) A Hybrid Gene Normalization approach with capability of disambiguation. Proceedings of the Second BioCreative Challenge Evaluation Workshop, pp. 157-159, Madrid, Spain, 2007.

[3] Eyre T.A. et al. (2006) The HUGO gene nomenclature database, 2006 updates. Nucleic Acids Research, 34, pp. D319-D321.

[4] Fang, H., Murphy, K., Jin, Y., Kim, J.S., White, P.S. (2006) Human Gene Name Normalization using Text Matching with Automatically Extracted Synonym Dictionaries. Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology, pp. 41-48, New York City, New York, USA, 2006.

[5] Feldman, R. & Sanger, J. (2007) The Text Mining Handbook. Cambridge University Press.

[6] Kors, J.A., Schuemie, M.J., Schijvenaars, B.J.A., Weeber, M., Mons, B. (2005) Combination of Genetic Databases for Improving Identification of Genes and Proteins in Text. BioLINK, Detroit, Michigan, USA, 2005.

[7] Krallinger, M., Valencia, A. & Hirschman, L. (2008) Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biology, 9 (Suppl. 2), S8.

[8] Morgan, A.A. et al. (2008) Overview of BioCreative II gene normalization. Genome biology, 9 (Suppl 2), S3.

[9] National Text Mining Centre & Redfearn, J. (2008) Text Mining. Briefing paper.

[10] Tamames, J. & Valencia, A. (2006) The success (or not) of HUGO nomenclature. Genome Biology, 7(5), 402.

[11] Tan, H. (2008) Knowledge-based Gene Symbol Disambiguation. CIKM: Proceeding of the 2nd international workshop on Data and text mining in bioinformatics, pp. 73-76, Napa Valley, California, USA, 2008.

[12] The Gene Ontology Consortium. (2000) Gene Ontology: tool for the unification of biology. Nature Genetics, 25, pp. 25-29.

[13] The UniProt Consortium (2008) The Universal Protein Resource (UniProt). Nucleic Acids Research, 36, pp. D190-D195.

[14] U.S. National Library of Medicine. (2002) PubMed: MEDLINE Retrieval on the World Wide Web Fact Sheet [Online] (Updated 25 April 2008) Available at: http://www.nlm.nih.gov/pubs/factsheets/pubmed.html [Accessed 11 May 2009].

[15] U.S. National Library of Medicine. (2004) MEDLINE Fact Sheet [Online] (Updated 22 April 2008) Available at: http://www.nlm.nih.gov/pubs/factsheets/medline.html [Accessed 11 May 2009].

[16] U.S. National Library of Medicine. (2006) UMLS Metathesaurus Fact Sheet [Online] (Updated 28 March 2006) Available at: http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html [Accessed May 11 2009].

[17] Wain, H.M. et al. (2002) Guidelines for Human Gene Nomenclature. Genomics, 79(4), pp. 464-470. Available at: http://www.genenames.org/guidelines.html [Accessed 5 August 2009].
