Institutionen För Datavetenskap Department of Computer and Information Science
Linköpings universitet, SE-581 83 Linköping, Sweden

Thesis (Examensarbete)
A Lexicon for Gene Normalization
by Maria Lingemark
LIU-IDA/LITH-EX-A--09/038
2009-09-01

Supervisor: He Tan
Examiner: He Tan

Date: 2009-09-01
Language: English
Report category: Examensarbete
ISRN: LIU-IDA/LITH-EX-A--09/038
URL for electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-20250
Title: A Lexicon for Gene Normalization
Author: Maria Lingemark

Abstract
Researchers tend to use their own or favourite gene names in scientific literature, even though there are official names. Some names may even be used for more than one gene. This leads to problems with ambiguity when automatically mining biological literature. To disambiguate the gene names, gene normalization is used. In this thesis, we look into an existing gene normalization system, and develop a new method to find gene candidates for the ambiguous genes. For the new method a lexicon is created, using information about the gene names, symbols and synonyms from three different databases. The gene mention found in the scientific literature is used as input for a search in the lexicon, and all genes in the lexicon that match the mention are returned as gene candidates for that mention. These candidates are then used in the system's disambiguation step.
Results show that the new method gives a better overall result from the system, with an increase in precision and a small decrease in recall.

Keywords: Bioinformatics, gene normalization, string matching, text mining

Contents
1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Problem formulation
  1.4 Limitations
  1.5 Method
  1.6 Structure
2 Background
  2.1 Text Mining and Biological Text Mining
  2.2 Gene Names
  2.3 Gene Normalization
    2.3.1 Our System
  2.4 Literature Resources
  2.5 Databases
    2.5.1 Entrez Gene
    2.5.2 HGNC
    2.5.3 SwissProt
3 Related Work
4 Creating the lexicon
  4.1 Preprocessing
  4.2 Merging
    4.2.1 Step 1: Combining Entrez and HGNC
    4.2.2 Step 2: Combining combo file and SwissProt
  4.3 Cleaning
  4.4 Rules
  4.5 Implementation
5 Searching the Lexicon
  5.1 Exact String Matching
  5.2 Approximate String Matching
    5.2.1 Jaccard Index
    5.2.2 Dice's Coefficient
    5.2.3 Levenshtein Distance
    5.2.4 Q-Gram
  5.3 Implementation
6 Evaluation
  6.1 Test Data
  6.2 Method
  6.3 Result
    6.3.1 Exact String Matching
    6.3.2 Approximate String Matching
    6.3.3 Evaluation of the Lexicon in the Gene Normalization System
7 Discussion
  7.1 Comparing Lexicon to Database Query
  7.2 Future Work
A Vocabulary

Chapter 1 Introduction

1.1 Background
The amount of scientific literature has increased rapidly in the past decade and much of it has been published online as well as in journal articles. The online literature is an important resource for researchers, but the amount of information that can be found can be overwhelming. Identifying relevant information manually is impractical, and because the information is stored in free text, it is also hard to find the relevant documents. There are automated processes, such as natural language processing (NLP), that can be used to retrieve information from this vast quantity of text. When it comes to biomedical publications the problem is not only that of finding the useful information, but also that of identifying the correct genes mentioned in the text.
When trying to find information, the first problem is to identify the parts of the text that mention genes. Once this is done, the gene mentions found need to be identified as the actual genes the author is referring to by linking them to gene database entries. This is called gene normalization. Gene normalization is needed because a gene mention is not always the official name or symbol of that particular gene. It can be a new name the author of the article decided to use for the gene, or the name or symbol used can be ambiguous. Many systems exist which try to solve this problem, but there is still room for improvement. In this thesis we look at a method to retrieve gene candidates for gene symbol disambiguation, that is, identifying the correct gene when there is an ambiguity. The work focuses on one part of a larger system which recognizes gene mentions in raw text, retrieves gene candidates for those mentions, and uses the information about the candidates and the gene mentions from the text in a disambiguation step.

1.2 Motivation
This thesis focuses on a gene candidate retrieval step for gene normalization. In gene normalization one needs to find a way to decide which gene a mention refers to. One way to do this is to retrieve gene candidates from a database, or a collection of databases, and match the information about the candidates against the information about the mention found in text. The gene candidate retrieval step aims to find the genes a mention in text might refer to.

1.3 Problem formulation
When creating a lexicon for gene candidate retrieval there are a few things that need to be decided. The problem formulation can be summarized as:
• Can a lexicon be used to retrieve gene candidates with a better result than directly querying databases?
• Which databases should be used in the creation of the lexicon?
• What structure should the lexicon have?
• How can we search the lexicon?
  Is exact string matching enough, or do we need to use approximate string matching?

1.4 Limitations
The only organism considered in this thesis is human, which makes querying databases easier since it is possible to focus on only human genes. Some genes also share the same name with the corresponding gene from other species, but we will not have to deal with that problem since we already know it is a human gene we are looking for. Only three databases are included in the creation of the lexicon, which limits the number of synonyms found.

1.5 Method
The gene candidate retrieval is done by using a lexicon which is built by combining information, such as gene names and synonyms, from three different databases. The lexicon is used to search for all the genes a given gene mention might refer to. The search is done by using exact string matching as well as a few different approximate string matching algorithms in two different search methods. A lexicon can hold a combination of information, such as gene names and synonyms, from several different gene databases. This can increase the chance of finding the right gene in the gene candidate retrieval step, since some synonyms can only be found in one database.

1.6 Structure
The thesis starts with an introduction, followed by background information in the areas relevant to the creation of the lexicon. The following chapter is a short description of related work. Chapters 4 and 5 cover the creation and usage of the lexicon. Chapter 6 is an evaluation of the gene candidate retrieval results using the lexicon and a comparison to the results from querying the database Entrez Gene. In the last chapter the results are discussed along with future work.

Chapter 2 Background

2.1 Text Mining and Biological Text Mining
Text mining is a process where one tries to automatically extract useful information from text documents.
The extraction is done by identifying and exploring interesting patterns found in the unstructured textual data. A text mining system takes raw data (text documents) as input and produces various types of output, e.g. patterns and trends (Feldman & Sanger, 2007). Text mining can be divided into different steps. Examples of steps are information retrieval (IR), natural language processing (NLP) and information extraction (IE) (National Text Mining Centre & Redfearn, 2008).
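
To connect the information extraction step to the lexicon search outlined in Section 1.5, the following is a minimal Python sketch of how an extracted gene mention might be looked up in a lexicon merged from several sources. It is an illustration under stated assumptions, not the implementation used in this thesis: the gene identifiers and names (G1, ABC1 and so on) are invented placeholders, the normalization rule, the use of character bigrams for the Jaccard index, the 0.6 threshold, and the choice to fall back to approximate matching only when exact matching finds nothing are all hypothetical.

```python
from collections import defaultdict

# Invented placeholder entries standing in for three gene databases.
# Each source maps a gene identifier to the names it lists for that gene.
SOURCE_A = {"G1": ["ABC1", "alpha beta kinase 1"], "G2": ["XYZ"]}
SOURCE_B = {"G1": ["ABC-1"], "G3": ["ABC1"]}  # "ABC1" is ambiguous: G1 and G3
SOURCE_C = {"G2": ["xyz protein"]}


def normalize(name):
    """Lowercase and keep only letters and digits (an assumed normalization)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def build_lexicon(sources):
    """Merge the sources into {normalized name: set of gene identifiers}."""
    lexicon = defaultdict(set)
    for source in sources:
        for gene_id, names in source.items():
            for name in names:
                lexicon[normalize(name)].add(gene_id)
    return dict(lexicon)


def bigrams(text):
    """Return the set of character bigrams of a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard index of the character-bigram sets of two strings."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        # Strings shorter than two characters have no bigrams.
        return 1.0 if a == b else 0.0
    return len(x & y) / len(x | y)


def find_candidates(mention, lexicon, threshold=0.6):
    """Return gene candidates for a mention: exact match first, then approximate."""
    key = normalize(mention)
    if key in lexicon:
        return set(lexicon[key])
    candidates = set()
    for name, gene_ids in lexicon.items():
        if jaccard(key, name) >= threshold:
            candidates |= gene_ids
    return candidates


if __name__ == "__main__":
    lexicon = build_lexicon([SOURCE_A, SOURCE_B, SOURCE_C])
    print(find_candidates("Abc-1", lexicon))         # exact match, ambiguous symbol
    print(find_candidates("xyz proteins", lexicon))  # approximate match only
```

In this sketch the mention "Abc-1" matches exactly after normalization and yields two candidates, because the placeholder symbol is listed for two different genes. It is this kind of ambiguity that the later disambiguation step has to resolve. The mention "xyz proteins" has no exact match and is only found through the approximate comparison.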