Domain-Specific Synonym Expansion and Validation

Domain-Specific Synonym Expansion and Validation for Biomedical Information Retrieval (MultiText Experiments for TREC 2004)

Stefan Büttcher, Charles L. A. Clarke, Gordon V. Cormack
School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada

Abstract

In the domain of biomedical publications, synonyms and homonyms are omnipresent and pose a great challenge for document retrieval systems. For this year's TREC Genomics Ad hoc Retrieval Task, we mainly addressed the problem of dealing with synonyms. We examined the impact of domain-specific knowledge on the effectiveness of query expansion and analyzed the quality of Google as a source of query expansion terms based on pseudo-relevance feedback. Our results show that automatic acronym expansion, realized by querying the AcroMed database of biomedical acronyms, almost always improves the performance of our document retrieval system. Google, on the other hand, produced results that were worse than the other, corpus-based feedback techniques we used as well in our experiments. We believe that the primary reason for Google's bad performance in this task is its highly restricted query language.

1 Introduction

This paper describes the work done by members of the MultiText project at the University of Waterloo for the TREC 2004 Genomics track. The MultiText group only participated in the Ad hoc retrieval task of the Genomics track, which consisted of finding documents that are relevant with respect to certain topics in a subset of the Medline/PubMed database of medical publications [PM04]. The subset included articles published between 1963 and 2004. It consisted of 4,591,008 different documents with a total size of 14 GB (XML text), where the term document refers to incomplete articles, annotated with certain keywords, such as MeSH terms [MeS04]. Out of these 4,591,008 documents, 3,479,798 contained the abstract of the article, the remaining 1,111,210 only the article title and the annotations. In total, 50 topics were given to the track participants. An example topic is shown in Figure 1.

    <TOPIC>
      <ID>14</ID>
      <TITLE>Expression or Regulation of TGFB in HNSCC cancers</TITLE>
      <NEED>
        Documents regarding TGFB expression or regulation in HNSCC cancers.
      </NEED>
      <CONTEXT>
        The laboratory wants to identify components of the TGFB signaling
        pathway in HNSCC, and determine new targets to study HNSCC.
      </CONTEXT>
    </TOPIC>

Figure 1: Example topic in original XML form

Information retrieval in the context of biomedical databases traditionally has to contend with three major problems: the frequent use of (possibly non-standardized) acronyms, the presence of homonyms (the same word referring to two or more different entities) and synonyms (two or more words referring to the same entity). For the two runs we submitted for the Genomics track, we addressed the problems caused by acronyms and synonyms. We chose to follow a strategy that was a mixture of known heuristics for approximate string matching for genomics-related data and a knowledge-based approach that used domain-specific knowledge from various sources, such as the AcroMed database of biomedical acronyms [Acr04]. Since acronyms can be viewed as special forms of synonyms, it was possible to develop a system that can deal with both problems (synonyms and acronyms) in a uniform way.

Furthermore, we implemented an automatic query expansion technique using pseudo-relevance feedback. We evaluated the quality of feedback terms produced by a simple Google query and compared the results with known query expansion techniques. We also studied possibilities to combine both strategies.

2 The MultiText system

Our work is based on the MultiText text retrieval system, which was initially implemented at the University of Waterloo in 1994 [CCB94] and has experienced a variety of extensions and modifications over the last decade. MultiText supports several different document scoring mechanisms. We employed the MultiText implementation of Okapi BM25 [RWJ+94] [RWB98] and MultiText's QAP passage scoring algorithm [CCT00] [CCL01], which had initially been developed for question answering tasks.

2.1 Okapi BM25 document scoring

MultiText's implementation of BM25 is the same as the original function proposed by Robertson et al. [RWJ+94] in the absence of document relevance information. Every query term $T$ is assigned a term weight

$$w_T = \ln\left(\frac{|\mathcal{D}| - |\mathcal{D}_T| + 0.5}{|\mathcal{D}_T| + 0.5}\right), \qquad (1)$$

where $\mathcal{D}$ is the set of all documents in the corpus, and $\mathcal{D}_T$ is the set of documents containing the term $T$. For a given query $Q = \{T_1, \ldots, T_n\}$, the score of a document $D$ is computed using the formula

$$\sum_{T \in Q} w_T \cdot q_T \cdot \frac{d_T \cdot (1 + k_1)}{d_T + k_1 \cdot \left((1 - b) + b \cdot \frac{len_D}{len_{avg}}\right)}, \qquad (2)$$

where $d_T$ is the number of occurrences of the term $T$ in the document $D$, $len_D$ is the length of $D$, $len_{avg}$ is the average document length in the corpus, and $q_T$ is the query-specific relative weight of the term $T$. Usually, $q_T$ equals the number of occurrences of $T$ in the original query. We discuss this parameter in more detail in section 8. The remaining parameters in formula 2 were chosen to be $k_1 = 1.2$ and $b = 0.75$.

In addition to mere term queries, MultiText supports structured queries, and in particular disjunctions of terms. We call a disjunction of query terms

$$T_\vee = T_1 \vee T_2 \vee \cdots \vee T_m$$

a disjunctive query element. The BM25 weight of a disjunctive query element is

$$w_{T_\vee} = \ln\left(\frac{|\mathcal{D}| - |\mathcal{D}_{T_1} \cup \cdots \cup \mathcal{D}_{T_m}| + 0.5}{|\mathcal{D}_{T_1} \cup \cdots \cup \mathcal{D}_{T_m}| + 0.5}\right). \qquad (3)$$

For convenience, we let the "+" symbol denote the disjunction operator ("$\vee$") whenever we present example queries.

2.2 MultiText QAP passage scoring

MultiText's QAP passage scoring algorithm computes text passage scores based on the concept of self-information. It may be extended to a document scoring algorithm by finding the passage with the highest score within a document and assigning the passage's score to the entire document, although this is not the original purpose of QAP. Our pseudo-relevance feedback mechanisms are derived from the QAP scoring function.

Given a query $Q$, every passage $P$ of length $l$, containing the terms $\mathcal{T} \subseteq Q$, is assigned the score

$$H(P) = \sum_{T \in \mathcal{T}} \left(\log_2\left(\frac{N}{f_T}\right) - \log_2(l)\right), \qquad (4)$$

where $N$ is the total number of tokens in the corpus (corpus size), and $f_T$ is the number of occurrences of the term $T$ in the corpus. $H(P)$ is called the self-information of the passage $P$.

3 System overview

An overview of the document retrieval system developed for the Genomics track is given by Figure 2. The system consists of a preprocessing stage and three document retrieval stages. Most parts of the system can be changed by varying the respective parameters. Because there were no training data available, many of these parameters are arbitrarily chosen.

[Figure 2: The document retrieval system. The diagram shows the topic (XML text) passing through parsing and stop word removal, then three stages. Stage 1, query expansion: acronym/synonym finder (AcroMed, EuGenes, LocusLink), species names finder, and variant generator; the unexpanded query is scored by MultiText BM25 and MultiText QAP, the expanded query by MultiText BM25, followed by document fusion (interleave). Stage 2: expansion validator and permutation finder, followed by MultiText BM25. Stage 3, query expansion by pseudo-relevance feedback: Google feedback, passage feedback, and document feedback, term fusion (CombSUM), MultiText BM25, and species names filtering, producing the final result.]

Query preprocessing & Synonym expansion (Stage 1)

In the preprocessing stage, the input data (XML text) is parsed and all stop words are removed. The list of stop words contains the usual terms, like "the" and "and", but also a number of additional terms inferred from the sample topics, e.g. "find", "information", and "literature". After the stop words have been removed, several synonym generation techniques are applied to the query. A detailed explanation of these techniques can be found in section 4. The result of this process is an expanded query containing possible synonyms of the terms in the original query. The unexpanded query is taken by MultiText to score documents using both BM25 and QAP, while the expanded query is used to produce a third, BM25-scored document set.

    ID: 14
    TITLE: Expression or Regulation of TGFB in HNSCC cancers
    NEED: Documents regarding TGFB expression or regulation in HNSCC cancers.
    QUERY_TITLE: "expression" "regulation" "tgfb" "hnscc" "cancers"
    EXPANDED_TITLE: "expression" "regulation" ("tgfb"+"blk"+"ced"+"dlhc"+
        "tgfbeta"+"tgfb1"+"tgfb2"+"tgf r"+"transforming growth factor beta"+
        "latent transforming growth factor beta"+"ltbp 1"+"ltbp1"+"mav"+
        "maverick"+.....) ("hnscc"+"head and neck squamous cell") "cancers"
    QUERY_NEED: "regarding" "tgfb" "expression" "regulation" "hnscc" "cancers"
    EXPANDED_NEED: "regarding" ("tgfb"+"blk"+"ced"+"dlhc"+"tgfbeta"+"tgfb1"+
        .....) "expression" "regulation" ("hnscc"+"head and neck squamous cell")
        "cancers"

Figure 3: Example topic after preprocessing and query expansion in stage 1

Expansion validation (Stage 2)

In the second stage (see section 5 for details), the documents returned by the first stage are analyzed and systematically scanned for any occurrences of terms added to the query in the expansion phase of stage 1 or permutations of query terms. Terms that have been added to the query in stage 1, but occur very infrequently in the documents returned, are removed from the query, whereas frequent permutations of query terms which have not been part of the old query might be added. The resulting query is sent to MultiText, and documents are scored by BM25.

Pseudo-relevance feedback (Stage 3)

The third stage, which is described in sections 6 and 7, performs a set of standard pseudo-relevance feedback operations in order to create new expansion terms. The terms produced by the feedback are added to the query, and BM25 is used again to compute the document scores. Finally, if our retrieval system has found species names in the original query, it tries to match the documents in the result list against these species names and promotes [...]

[...] are likely to have lexical variants).
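The BM25 scoring of section 2.1, including the weight of a disjunctive query element, can be illustrated with a small sketch. This is not the MultiText implementation: the toy corpus, the function names `element_weight` and `bm25_score`, and the representation of a query element as a tuple of alternative terms are all assumptions made for illustration. In particular, the text does not say how $d_T$ is counted for a disjunction; the sketch assumes it is the total number of occurrences of any alternative, and it only handles single-token terms.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # parameter values given in section 2.1

def element_weight(element, corpus):
    """Formulas (1) and (3): weight from the number of documents
    that contain at least one of the alternative terms."""
    matching = sum(1 for doc in corpus if any(t in doc for t in element))
    n = len(corpus)
    return math.log((n - matching + 0.5) / (matching + 0.5))

def bm25_score(query, doc, corpus):
    """Formula (2), summed over the distinct query elements."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    counts = Counter(doc)
    score = 0.0
    # q_t is the number of times the element occurs in the query.
    for element, q_t in Counter(query).items():
        # Assumption: d_t counts occurrences of any alternative term.
        d_t = sum(counts[t] for t in element)
        norm = d_t + K1 * ((1 - B) + B * len(doc) / avg_len)
        score += element_weight(element, corpus) * q_t * d_t * (1 + K1) / norm
    return score

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["tgfb", "expression", "hnscc"],
    ["tgfb1", "regulation"],
    ["protein", "folding"],
    ["cell", "cycle", "control"],
    ["gene", "expression", "yeast"],
]
# ("tgfb"+"tgfb1") "hnscc" in the paper's example-query notation.
query = [("tgfb", "tgfb1"), ("hnscc",)]

ranked = sorted(range(len(corpus)),
                key=lambda i: bm25_score(query, corpus[i], corpus),
                reverse=True)
```

Note that the weight in formulas (1) and (3) becomes negative once an element matches more than half of the documents, which can happen more easily for a heavily expanded disjunction than for a single term.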
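The QAP score of formula (4), and its extension to document scoring described in section 2.2, can be sketched as follows. The brute-force enumeration of every contiguous token span is an assumption made only to keep the example short; it is not how MultiText finds passages. The function names, the corpus statistics, and the example document are hypothetical.

```python
import math

def self_information(passage, query_terms, corpus_freq, corpus_size):
    """Formula (4): sum over the query terms present in the passage of
    log2(N / f_T) - log2(l), where l is the passage length in tokens."""
    present = set(passage) & set(query_terms)
    length = len(passage)
    return sum(math.log2(corpus_size / corpus_freq[t]) - math.log2(length)
               for t in present)

def best_passage_score(doc, query_terms, corpus_freq, corpus_size):
    """Document score: the highest self-information of any passage.
    Assumption: a passage is any contiguous, non-empty token span."""
    best = 0.0  # a passage without query terms scores 0 (empty sum)
    for start in range(len(doc)):
        for end in range(start + 1, len(doc) + 1):
            score = self_information(doc[start:end], query_terms,
                                     corpus_freq, corpus_size)
            best = max(best, score)
    return best

# Hypothetical corpus statistics: N tokens in total, f_T per term.
corpus_size = 1000
corpus_freq = {"tgfb": 10, "hnscc": 5}
doc = ["tgfb", "signaling", "in", "hnscc", "tumors"]

doc_score = best_passage_score(doc, ["tgfb", "hnscc"], corpus_freq, corpus_size)
```

With these numbers the best passage is the four-token span from "tgfb" to "hnscc": it contains both query terms, and extending it to the full document would only increase the length penalty $\log_2(l)$ that is paid once per matched term.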