
Natural language processing
Week 03

Contents

1 Information extraction
   1.1 History
   1.2 Present significance
   1.3 Tasks and subtasks
   1.4 World Wide Web applications
   1.5 Approaches
   1.6 Free or open source software and services
   1.7 Commercial software and services
   1.8 See also
   1.9 References
   1.10 External links

2 Named-entity recognition
   2.1 Problem definition
      2.1.1 Formal evaluation
   2.2 Approaches
   2.3 Problem domains
   2.4 Current challenges and research
   2.5 Software
   2.6 See also
   2.7 References
   2.8 External links

3 Part-of-speech tagging
   3.1 Principle
   3.2 History
      3.2.1 The Brown Corpus
      3.2.2 Use of hidden Markov models
      3.2.3 Dynamic programming methods
      3.2.4 Unsupervised taggers
      3.2.5 Other taggers and methods
   3.3 Issues
   3.4 See also
   3.5 References
   3.6 External links

4 Phrase chunking
   4.1 See also
   4.2 External links

5 Relationship extraction
   5.1 Applications
   5.2 Approaches
   5.3 See also
   5.4 References

6 Sentence boundary disambiguation
   6.1 Strategies
   6.2 Software
   6.3 See also
   6.4 References
   6.5 External links

7 Shallow parsing
   7.1 References
   7.2 External links
   7.3 See also

8 Stemming
   8.1 Examples
   8.2 History
   8.3 Algorithms
      8.3.1 The production technique
      8.3.2 Suffix-stripping algorithms
      8.3.3 Lemmatisation algorithms
      8.3.4 Stochastic algorithms
      8.3.5 n-gram analysis
      8.3.6 Hybrid approaches
      8.3.7 Affix stemmers
      8.3.8 Matching algorithms
   8.4 Language challenges
      8.4.1 Multilingual stemming
   8.5 Error metrics
   8.6 Applications
      8.6.1 Information retrieval
      8.6.2 Domain analysis
      8.6.3 Use in commercial products
   8.7 See also
   8.8 References
   8.9 Further reading
   8.10 External links

9 Text segmentation
   9.1 Segmentation problems
      9.1.1 Word segmentation
      9.1.2 Sentence segmentation
      9.1.3 Topic segmentation
      9.1.4 Other segmentation problems
   9.2 Automatic segmentation approaches
   9.3 See also
   9.4 References
   9.5 External links

10 Tokenization (lexical analysis)
   10.1 Methods and obstacles
   10.2 Software
   10.3 See also
   10.4 References

11 Parsing
   11.1 Human languages
      11.1.1 Traditional methods
      11.1.2 Computational methods
      11.1.3 Psycholinguistics
   11.2 Computer languages
      11.2.1 Parser
      11.2.2 Overview of process
   11.3 Types of parsers
   11.4 Parser development software
   11.5 Lookahead
   11.6 See also
   11.7 References
   11.8 Further reading
   11.9 External links

12 Parse tree
   12.1 Constituency-based parse trees
   12.2 Dependency-based parse trees
   12.3 Phrase markers
   12.4 See also
   12.5 Notes
   12.6 References
   12.7 External links

13 Constituent (linguistics)
   13.1 Constituency tests
      13.1.1 Topicalization (fronting)
      13.1.2 Clefting
      13.1.3 Pseudoclefting
      13.1.4 Pro-form substitution (replacement)
      13.1.5 Answer ellipsis (answer fragments, question test)
      13.1.6 Passivization
      13.1.7 Omission (deletion)
      13.1.8 Coordination
   13.2 Constituency tests and disambiguation
   13.3 Competing theories
   13.4 See also
   13.5 Notes
   13.6 References

14 Dependency grammar
   14.1 History
   14.2 Dependency vs. constituency
   14.3 Dependency grammars
   14.4 Representing dependencies
   14.5 Types of dependencies
      14.5.1 Semantic dependencies
      14.5.2 Morphological dependencies
      14.5.3 Prosodic dependencies
      14.5.4 Syntactic dependencies
   14.6 Linear order and discontinuities
   14.7 Syntactic functions
   14.8 See also
   14.9 Notes
   14.10 References
   14.11 External links

15 Phrase structure grammar
   15.1 Constituency relation
   15.2 Dependency relation
   15.3 Non-descript grammars
   15.4 See also
   15.5 Notes
   15.6 References

16 Verb phrase
   16.1 Verb phrases in phrase structure grammars
   16.2 Verb phrases in dependency grammars
   16.3 Verb phrases narrowly defined
   16.4 See also
   16.5 Notes
   16.6 References

17 Information retrieval
   17.1 Overview
   17.2 History
   17.3 Model types
      17.3.1 First dimension: mathematical basis
      17.3.2 Second dimension: properties of the model
   17.4 Performance and correctness measures
      17.4.1 Precision
      17.4.2 Recall
      17.4.3 Fall-out
      17.4.4 F-score / F-measure
      17.4.5 Average precision
      17.4.6 Precision at K
      17.4.7 R-Precision
      17.4.8 Mean average precision
      17.4.9 Discounted cumulative gain
      17.4.10 Other measures
      17.4.11 Visualization
   17.5 Timeline
   17.6 Awards in the field
   17.7 Leading IR Research Groups
   17.8 See also
   17.9 References
   17.10 Further reading
   17.11 External links

18 Vector space model
   18.1 Definitions
   18.2 Applications
   18.3 Example: tf-idf weights
   18.4 Advantages
   18.5 Limitations
   18.6 Models based on and extending the vector space model
   18.7 Software that implements the vector space model
      18.7.1 Free open source software
   18.8 Further reading
   18.9 See also
   18.10 References

19 tf–idf
   19.1 Motivation
      19.1.1 Term frequency
      19.1.2 Inverse document frequency
   19.2 Definition
      19.2.1 Term frequency
      19.2.2 Inverse document frequency
      19.2.3 Term frequency–inverse document frequency
   19.3 Justification of idf
   19.4 Example of tf–idf
   19.5 tf–idf beyond terms
   19.6 tf–idf derivates
   19.7 See also
   19.8 References
   19.9 External links and suggested reading

20 Synonym
   20.1 Examples
   20.2 See also
   20.3 References
   20.4 External links

21 Relevance
   21.1 Definition
   21.2 Epistemology
   21.3 Relevance logic
   21.4 Application
      21.4.1 Politics
      21.4.2 Economics
      21.4.3 Cognitive science and pragmatics
      21.4.4 Law
      21.4.5 Library and information science
   21.5 See also
   21.6 References
   21.7 External links

22 Library and information science
   22.1 Relations between library science, information science and LIS
   22.2 Difficulties defining LIS
      22.2.1 A multidisciplinary, interdisciplinary or monodisciplinary field?
      22.2.2 A fragmented adhocracy
      22.2.3 Scattering of the literature
   22.3 The unique concern of library and information science
   22.4 LIS theories
   22.5 Journals
   22.6 Conferences
   22.7 Common subfields
   22.8 See also
   22.9 References
   22.10 Further reading

23 Relevance (information retrieval)
   23.1 History
   23.2 Evaluation
   23.3 Clustering and relevance
   23.4 Problems and alternatives
   23.5 References
   23.6 Additional reading

24 Web search engine
   24.1 History
   24.2 How web search engines work
   24.3 Market share
      24.3.1 East Asia and Russia
   24.4 Search engine bias
   24.5 Customized results and filter bubbles
   24.6 Christian, Islamic and Jewish search engines
   24.7 Search engine submission
   24.8 See also
   24.9 References
   24.10 Further reading
   24.11 External links
   24.12 Text and image sources, contributors, and licenses
      24.12.1 Text
      24.12.2 Images
      24.12.3 Content license

Chapter 1

Information extraction

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio and video, can also be seen as information extraction. Due to the difficulty of the problem, current approaches to IE focus on narrowly restricted domains. An example is the extraction from newswire reports of corporate mergers, such as denoted by the formal relation:

MergerBetween(company1, company2, date) from an online news sentence such as:

“Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp.”

A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context. Information extraction is part of a greater puzzle which deals with the problem of devising automatic methods for text management, beyond its transmission, storage and display. The discipline of information retrieval (IR)[1] has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. The complementary approach of natural language processing (NLP) has modelled human language processing with considerable success, given the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks in between IR and NLP. In terms of input, IE assumes the existence of a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner that is similar to those in other documents but differs in the details. As an example, consider a group of newswire articles on American terrorism, with each article presumed to be based upon one or more terroristic acts. For any given IE task, we also define a template, which is a case frame (or set of case frames) to hold the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terroristic act, and the date on which the event happened. An IE system for this problem is required to “understand” an attack article only enough to find data corresponding to the slots in this template.
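To make the template idea concrete, here is a minimal Python sketch, not from the original text: the pattern, helper name, and the way the date is supplied are illustrative assumptions, standing in for the far more robust machinery a real IE system would use.

    import re

    # Hypothetical single-pattern extractor for the
    # MergerBetween(company1, company2, date) relation described above.
    COMPANY = r"[A-Z][\w.]*(?: [A-Z][\w.]*)*"
    PATTERN = re.compile(
        r"(?P<company1>" + COMPANY + r") announced (?:their|its) acquisition of "
        r"(?P<company2>" + COMPANY + r")")

    def extract_merger(sentence, date):
        # Fill the template's slots if the hand-written pattern matches.
        match = PATTERN.search(sentence)
        if match is None:
            return None
        return ("MergerBetween", match.group("company1"), match.group("company2"), date)

    print(extract_merger(
        "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp.",
        "yesterday"))
    # -> ('MergerBetween', 'Foo Inc.', 'Bar Corp.', 'yesterday')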

1.1 History

Information extraction dates back to the late 1970s in the early days of NLP.[2] An early commercial system from the mid-1980s was JASPER built for Reuters by the Carnegie Group with the aim of providing real-time financial news to financial traders.[3] Beginning in 1987, IE was spurred by a series of Message Understanding Conferences. MUC is a competition-based conference that focused on the following domains:


• MUC-1 (1987), MUC-2 (1989): Naval operations messages.
• MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
• MUC-5 (1993): Joint ventures and microelectronics domain.
• MUC-6 (1995): News articles on management changes.
• MUC-7 (1998): Satellite launch reports.

Considerable support came from the U.S. Defense Advanced Research Projects Agency (DARPA), who wished to automate mundane tasks performed by government analysts, such as scanning newspapers for possible links to terrorism.

1.2 Present significance

The present significance of IE pertains to the growing amount of information available in unstructured form. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as the web of documents[4] and advocates that more of the content be made available as a web of data.[5] Until this transpires, the web largely consists of unstructured documents lacking semantic metadata. Knowledge contained within these documents can be made more accessible for machine processing by means of transformation into relational form, or by marking up with XML tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted.[6]

1.3 Tasks and subtasks

Applying information extraction to text is linked to the problem of text simplification, in order to create a structured view of the information present in free text. The overall goal is to create text that is more easily machine-readable for processing the sentences. Typical subtasks of IE include:

• Named entity extraction, which could include:
   • Named entity recognition: recognition of known entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions, employing existing knowledge of the domain or information extracted from other sentences. Typically the recognition task involves assigning a unique identifier to the extracted entity. A simpler task is named entity detection, which aims to detect entities without having any existing knowledge about the entity instances. For example, in processing the sentence “M. Smith likes fishing”, named entity detection would denote detecting that the phrase “M. Smith” does refer to a person, but without necessarily having (or using) any knowledge about a certain M. Smith who is (or, “might be”) the specific person whom that sentence is talking about.
   • Coreference resolution: detection of coreference and anaphoric links between text entities. In IE tasks, this is typically restricted to finding links between previously-extracted named entities. For example, “International Business Machines” and “IBM” refer to the same real-world entity. If we take the two sentences “M. Smith likes fishing. But he doesn't like biking”, it would be beneficial to detect that “he” is referring to the previously detected person “M. Smith”.
   • Relationship extraction: identification of relations between entities, such as:
      • PERSON works for ORGANIZATION (extracted from the sentence “Bill works for IBM.”)
      • PERSON located in LOCATION (extracted from the sentence “Bill is in France.”)
• Semi-structured information extraction, which may refer to any IE that tries to restore some kind of information structure that has been lost through publication, such as:
   • Table extraction: finding and extracting tables from documents.
   • Comment extraction: extracting comments from the actual content of an article in order to restore the link between the author of each sentence and its content.

• Language and vocabulary analysis
   • Terminology extraction: finding the relevant terms for a given corpus
• Audio extraction
   • Template-based music extraction: finding relevant characteristics in an audio signal taken from a given repertoire; for instance, time indexes of occurrences of percussive sounds can be extracted in order to represent the essential rhythmic component of a music piece.[7]

Note that this list is not exhaustive, that the exact meaning of IE activities is not commonly accepted, and that many approaches combine multiple subtasks of IE in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE. IE on non-text documents is an increasingly researched topic, and information extracted from multimedia documents can now be expressed in a high-level structure as is done for text. This naturally leads to the fusion of extracted information from multiple kinds of documents and sources.

1.4 World Wide Web applications

IE has been the focus of the MUC conferences. The proliferation of the Web, however, intensified the need for developing IE systems that help people cope with the enormous amount of data that is available online. Systems that perform IE from online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains. MUC systems fail to meet those criteria. Moreover, linguistic analysis performed for unstructured text does not exploit the HTML/XML tags and layout formats that are available in online text. As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page’s content. Manually developing wrappers has proved to be a time-consuming task, requiring a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce such rules automatically.
Wrappers typically handle highly structured collections of web pages, such as product catalogs and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent effort on adaptive information extraction motivates the development of IE systems that can handle different types of text, from well-structured to almost free text (where common wrappers fail), including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be applied to less structured text.
A recent development is Visual Information Extraction,[8][9] which relies on rendering a webpage in a browser and creating rules based on the proximity of regions in the rendered web page. This helps in extracting entities from complex web pages that may exhibit a visual pattern, but lack a discernible pattern in the HTML source code.
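A hand-written wrapper of the kind described above can be sketched in a few lines of Python; the HTML fragment and field names below are invented for illustration, and real pages would of course require more forgiving rules.

    import re

    # Invented fragment of a highly structured product-catalog page.
    HTML = '''
    <tr><td class="name">Widget A</td><td class="price">$9.99</td></tr>
    <tr><td class="name">Widget B</td><td class="price">$4.50</td></tr>
    '''

    # The wrapper: a highly specific extraction rule tied to this page's layout.
    ROW = re.compile(
        r'<td class="name">(?P<name>[^<]+)</td><td class="price">(?P<price>[^<]+)</td>')

    records = [m.groupdict() for m in ROW.finditer(HTML)]
    print(records)
    # -> [{'name': 'Widget A', 'price': '$9.99'}, {'name': 'Widget B', 'price': '$4.50'}]

The brittleness is the point: the rule breaks as soon as the page layout changes, which is why learning such rules automatically (wrapper induction) is attractive.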

1.5 Approaches

Three standard approaches are now widely accepted:

• Hand-written regular expressions (perhaps stacked)
• Using classifiers
   • Generative: naïve Bayes classifier
   • Discriminative: maximum entropy models such as multinomial logistic regression
• Sequence models
   • Hidden Markov model (HMM)
   • Conditional Markov model (CMM) / maximum-entropy Markov model (MEMM)
   • Conditional random fields (CRF), which are commonly used in conjunction with IE for tasks as varied as extracting information from research papers[10] to extracting navigation instructions.[11]

Numerous other approaches exist for IE, including hybrid approaches that combine some of the standard approaches previously listed.
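The following is a toy sketch, not from the original article, of the generative (naïve Bayes) classifier route listed above; the training pairs and word-shape features are invented for illustration.

    import math
    from collections import Counter, defaultdict

    # Toy training data: (token, is_entity_token) pairs.
    TRAIN = [("Foo", True), ("Inc.", True), ("Bar", True), ("Corp.", True),
             ("announced", False), ("their", False), ("acquisition", False), ("of", False)]

    def features(token):
        return ["capitalized" if token[:1].isupper() else "lowercase",
                "has_period" if "." in token else "no_period"]

    # Count class and feature occurrences.
    class_counts = Counter()
    feat_counts = defaultdict(Counter)
    for token, label in TRAIN:
        class_counts[label] += 1
        for f in features(token):
            feat_counts[label][f] += 1
    vocab = {f for token, _ in TRAIN for f in features(token)}

    def predict(token):
        # Generative scoring: argmax over classes of log P(c) + sum_f log P(f | c),
        # with add-one smoothing over the feature vocabulary.
        scores = {}
        for c in class_counts:
            logp = math.log(class_counts[c] / sum(class_counts.values()))
            total = sum(feat_counts[c].values())
            for f in features(token):
                logp += math.log((feat_counts[c][f] + 1) / (total + len(vocab)))
            scores[c] = logp
        return max(scores, key=scores.get)

    print(predict("Acme"), predict("bought"))  # -> True False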

1.6 Free or open source software and services

• General Architecture for Text Engineering (GATE), which is bundled with a free information extraction system

• OpenNLP Apache OpenNLP is a Java machine learning toolkit for natural language processing

• OpenCalais Automated information extraction web service from Thomson Reuters (Free limited version)

• Machine Learning for Language Toolkit (Mallet) is a Java-based package for a variety of natural language processing tasks, including information extraction.

• DBpedia Spotlight is an open source tool in Java/Scala (and free web service) that can be used for named entity recognition and name resolution.

• Natural Language Toolkit (NLTK) is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language

• See also CRF implementations

1.7 Commercial software and services

• IBM Watson[12][13]

• Wolfram Language[12][14]

1.8 See also

• AI effect

• Applications of artificial intelligence

• DARPA TIPSTER Program

• Faceted search

• Named entity recognition

• Nutch

• Semantic

• Text mining

• Web scraping

Lists

• List of emerging technologies

• Outline of artificial intelligence

1.9 References

[1] Freitag, Dayne. “Machine Learning for Information Extraction in Informal Domains” (PDF). Kluwer Academic Publishers, 2000. Printed in The Netherlands.

[2] Andersen, Peggy M.; Hayes, Philip J.; Huettner, Alison K.; Schmandt, Linda M.; Nirenburg, Irene; Weinstein, Steven P. “Automatic Extraction of Facts from Press Releases to Generate News Stories”. CiteSeerX 10.1.1.14.7943.

[3] Cowie, Jim; Wilks, Yorick. “Information Extraction”. CiteSeerX 10.1.1.61.6480 .

[4] “Linked Data - The Story So Far” (PDF).

[5] “Tim Berners-Lee on the next Web”.

[6] R. K. Srihari, W. Li, C. Niu and T. Cornell, “InfoXtract: A Customizable Intermediate Level Information Extraction Engine”, Journal of Natural Language Engineering, Cambridge U. Press, 14(1), 2008, pp. 33–69.

[7] A. Zils, F. Pachet, O. Delerue and F. Gouyon, “Automatic Extraction of Drum Tracks from Polyphonic Music Signals”, Proceedings of WedelMusic, Darmstadt, Germany, 2002.

[8] Chenthamarakshan, Vijil; Desphande, Prasad M; Krishnapuram, Raghu; Varadarajan, Ramakrishnan; Stolze, Knut. “WYSIWYE: An Algebra for Expressing Spatial and Textual Rules for Information Extraction”. arXiv:1506.08454.

[9] Baumgartner, Robert; Flesca, Sergio; Gottlob, Georg. “Visual Web Information Extraction with Lixto”. CiteSeerX 10.1.1.21.8236 .

[10] Peng, F.; McCallum, A. (2006). “Information extraction from research papers using conditional random fields”. Information Processing & Management. 42 (4): 963. doi:10.1016/j.ipm.2005.09.002.

[11] Shimizu, Nobuyuki; Hass, Andrew (2006). “Extracting Frame-based Knowledge Representation from Route Instructions” (PDF).

[12] Jiang, Jing (2012). “Information Extraction from Text” (PDF). Ohio State University Department of Statistics. Retrieved July 13, 2016.

[13] “IBM Watson Information”. IBM. Retrieved July 13, 2016.

[14] “Wolfram Data Framework: Take Data and Make It Meaningful”. www.wolfram.com. Retrieved 2016-07-13.

1.10 External links

• MUC
• ACE (LDC)

• ACE (NIST)
• Alias-I “competition” page: a listing of academic and industrial toolkits for natural language information extraction
• Gabor Melli’s page on IE: detailed description of the information extraction task

• CRF++: Yet Another CRF toolkit
• A Survey of Web Information Extraction Systems: a comprehensive survey

• A multilingual corpus of news annotated with event information

Chapter 2

Named-entity recognition

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. Most research on NER systems has been structured as taking an unannotated block of text, such as this one:

Jim bought 300 shares of Acme Corp. in 2006.

And producing an annotated block of text that highlights the names of entities:

[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.

In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified. State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 scored an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%.[1][2]
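Annotated output like the example above is commonly serialized token by token in the IOB (inside/outside/beginning) scheme used by the CoNLL shared tasks. A small sketch of the conversion; the helper name, span encoding, and tag abbreviations are illustrative assumptions:

    def spans_to_iob(tokens, spans):
        # spans: list of (start, end_exclusive, entity_type) over token indexes.
        tags = ["O"] * len(tokens)
        for start, end, etype in spans:
            tags[start] = "B-" + etype          # first token of the entity
            for i in range(start + 1, end):
                tags[i] = "I-" + etype          # continuation tokens
        return list(zip(tokens, tags))

    tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]
    print(spans_to_iob(tokens, [(0, 1, "PER"), (5, 7, "ORG"), (8, 9, "TIME")]))
    # -> [('Jim', 'B-PER'), ('bought', 'O'), ('300', 'O'), ('shares', 'O'), ('of', 'O'),
    #     ('Acme', 'B-ORG'), ('Corp.', 'I-ORG'), ('in', 'O'), ('2006', 'B-TIME'), ('.', 'O')]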

2.1 Problem definition

In the expression named entity, the word named restricts the task to those entities for which one or many rigid designators, as defined by Kripke, stand for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as terms for certain biological species and substances.[3] Full named-entity recognition is often broken down, conceptually and possibly also in implementations,[4] into two distinct problems: detection of names, and classification of the names by the type of entity they refer to (e.g. person, organization, location and other[5]). The first phase is typically simplified to a segmentation problem: names are defined to be contiguous spans of tokens, with no nesting, so that “Bank of America” is a single name, disregarding the fact that inside this name, the substring “America” is itself a name. This segmentation problem is formally similar to chunking. Temporal expressions and some numerical expressions (i.e., money, percentages, etc.) may also be considered as named entities in the context of the NER task. While some instances of these types are good examples of rigid designators (e.g., the year 2001) there are also many invalid ones (e.g., I take my vacations in “June”). In the first case, the year 2001 refers to the 2001st year of the Gregorian calendar. In the second case, the month June may refer to the month of an undefined year (past June, next June, June 2020, etc.). It is arguable that the named entity definition is loosened in such cases for practical reasons. The definition of the term named entity is therefore not strict and often has to be explained in the context in which it is used.[6] Certain hierarchies of named entity types have been proposed in the literature. BBN categories, proposed in 2002, is used for question answering and consists of 29 types and 64 subtypes.[7] Sekine’s extended hierarchy, proposed in 2002, is made of 200 subtypes.[8] More recently, in 2011 Ritter used a hierarchy based on common Freebase entity types in ground-breaking experiments on NER over social media text.[9]

2.1.1 Formal evaluation

To evaluate the quality of a NER system’s output, several measures have been defined. While accuracy on the token level is one possibility, it suffers from two problems: the vast majority of tokens in real-world text are not part of entity names as usually defined, so the baseline accuracy (always predict “not an entity”) is extravagantly high, typically >90%; and mispredicting the full span of an entity name is not properly penalized (finding only a person’s first name when their last name follows is scored as ½ accuracy). In academic conferences such as CoNLL, a variant of the F1 score has been defined as follows:[5]

• Precision is the number of predicted entity name spans that line up exactly with spans in the gold standard evaluation data. I.e. when [Person Hans] [Person Blick] is predicted but [Person Hans Blick] was required, precision for the predicted name is zero. Precision is then averaged over all predicted entity names.

• Recall is similarly the number of names in the gold standard that appear at exactly the same location in the predictions.

• F1 score is the harmonic mean of these two.

It follows from the above definition that any prediction that misses a single token, includes a spurious token, or has the wrong class, is a hard error and does not contribute to either precision or recall. Evaluation models based on a token-by-token matching have been proposed.[10] Such models are able to handle also partially overlapping matches, yet fully rewarding only exact matches. They allow a finer grained evaluation and comparison of extraction systems, taking into account also the degree of mismatch in non-exact predictions.
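A minimal sketch of this exact-match scoring in Python; the function name and span encoding are assumptions for the example:

    def span_f1(gold, predicted):
        # gold, predicted: sets of (start, end, type) spans. Exact match only, so any
        # boundary or type mismatch is a hard error, as described above.
        true_pos = len(gold & predicted)
        precision = true_pos / len(predicted) if predicted else 0.0
        recall = true_pos / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    gold = {(0, 2, "PER")}                 # [Person Hans Blick]
    pred = {(0, 1, "PER"), (1, 2, "PER")}  # [Person Hans] [Person Blick]
    print(span_f1(gold, pred))             # -> (0.0, 0.0, 0.0): both predictions are hard errors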

2.2 Approaches

NER systems have been created that use linguistic grammar-based techniques as well as statistical models, i.e. machine learning. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data. Semisupervised approaches have been suggested to avoid part of the annotation effort.[11][12] Many different classifier types have been used to perform machine-learned NER, with conditional random fields being a typical choice.[13]

2.3 Problem domains

Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains.[14] Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems. Early work in NER systems in the 1990s was aimed primarily at extraction from journalistic articles. Attention then turned to processing of military dispatches and reports. Later stages of the automatic content extraction (ACE) evaluation also included several types of informal text styles, such as weblogs and text transcripts from conversational telephone speech conversations. Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. The most common entity of interest in that domain has been names of genes and gene products. There has been also considerable interest in the recognition of chemical entities and drugs in the context of the CHEMDNER competition, with 27 teams participating in this task.[15]

2.4 Current challenges and research

Despite the high F1 numbers reported on the MUC-7 dataset, the problem of named entity recognition is far from being solved. The main efforts are directed at reducing annotation labor by employing semi-supervised learning,[11][16] robust performance across domains,[17][18] and scaling up to fine-grained entity types.[8][19] In recent years, many projects have turned to crowdsourcing, which is a promising solution to obtain high-quality aggregate human judgments for supervised and semi-supervised machine learning approaches to NER.[20] Another challenging task is devising models to deal with linguistically complex contexts such as Twitter and search queries.[21] Some researchers have compared NER performance across different statistical models, such as HMM (hidden Markov model), ME (maximum entropy), and CRF (conditional random fields), and feature sets.[22] Others have recently proposed graph-based semi-supervised learning models for language-specific NER tasks.[23] A recently emerging task of identifying “important expressions” in text and cross-linking them to Wikipedia[24][25][26] can be seen as an instance of extremely fine-grained named entity recognition, where the types are the actual Wikipedia pages describing the (potentially ambiguous) concepts. Below is an example output of a Wikification system:

Michael Jordan is a professor at Berkeley

2.5 Software

• GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API
• OpenNLP includes rule-based and statistical named-entity recognition
• Stanford University also has the Stanford Named Entity Recognizer
• Baleen, a framework for rule-based and statistical named-entity and relationship extraction
• Cogcomp-NER, a state-of-the-art NER tagger that tags plain text with an 18-label type set (based on the OntoNotes corpus). It uses gazetteers extracted from Wikipedia, word class models derived from unlabeled text, and expressive non-local features.

2.6 See also

• Coreference resolution
• Entity linking (aka named entity normalization, entity disambiguation)
• Information extraction
• Knowledge extraction
• Controlled vocabulary
• Onomastics
• Record linkage
• Smart tag (Microsoft)

2.7 References

[1] Elaine Marsh, Dennis Perzanowski, “MUC-7 Evaluation of IE Technology: Overview of Results”, 29 April 1998 PDF

[2] MUC-07 Proceedings (Named Entity Tasks)

[3] Nadeau, David; Sekine, Satoshi (2007). A survey of named entity recognition and classification (PDF). Lingvisticae Investigationes.

[4] Carreras, Xavier; Màrquez, Lluís; Padró, Lluís (2003). A simple named entity extractor using AdaBoost. CoNLL.

[5] Tjong Kim Sang, Erik F.; De Meulder, Fien (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. CoNLL.

[6] Named Entity Definition. Webknox.com. Retrieved on 2013-07-21.

[7] Brunstein, Ada. “Annotation Guidelines for Answer Types”. LDC Catalog. Linguistic Data Consortium. Retrieved 21 July 2013.

[8] Sekine’s Extended Named Entity Hierarchy. Nlp.cs.nyu.edu. Retrieved on 2013-07-21.

[9] Ritter, A.; Clark, S.; Mausam; Etzioni, O. (2011). Named Entity Recognition in Tweets: An Experimental Study (PDF). Proc. Empirical Methods in Natural Language Processing.

[10] Esuli, Andrea; Sebastiani, Fabrizio (2010). Evaluating Information Extraction (PDF). Cross-Language Evaluation Forum (CLEF). pp. 100–111.

[11] Lin, Dekang; Wu, Xiaoyun (2009). Phrase clustering for discriminative learning (PDF). Annual Meeting of the ACL and IJCNLP. pp. 1030–1038.

[12] Nothman, Joel; et al. (2013). “Learning multilingual named entity recognition from Wikipedia”. Artificial Intelligence. 194: 151–175. doi:10.1016/j.artint.2012.03.006.

[13] Jenny Rose Finkel; Trond Grenager; Christopher Manning (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling (PDF). 43rd Annual Meeting of the Association for Computational Linguistics. pp. 363–370.

[14] Poibeau, Thierry; Kosseim, Leila (2001). “Proper Name Extraction from Non-Journalistic Texts”. Language and Computers. 37 (1): 144–157.

[15] Krallinger, M; Leitner, F; Rabal, O; Vazquez, M; Oyarzabal, J; Valencia, A. “Overview of the chemical and drug name recognition (CHEMDNER) task”. Proceedings of the Fourth BioCreative Challenge Evaluation Workshop, vol. 2. pp. 6–37.

[16] Turian, J., Ratinov, L., & Bengio, Y. (2010, July). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 384–394). Association for Computational Linguistics. PDF

[17] Ratinov, L., & Roth, D. (2009, June). Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (pp. 147–155). Association for Computational Linguistics.

[18] Frustratingly Easy Domain Adaptation.

[19] Fine-Grained Named Entity Recognition Using Conditional Random Fields for Question Answering.

[20] Web 2.0-based crowdsourcing for high-quality gold standard development in clinical Natural Language Processing.

[21] Eiselt, Andreas; Figueroa, Alejandro (2013). A Two-Step Named Entity Recognizer for Open-Domain Search Queries. IJCNLP. pp. 829–833.

[22] Han, Li-Feng Aaron; Wong, Fai; Chao, Lidia Sam (2013). Chinese Named Entity Recognition with Conditional Random Fields in the Light of Chinese Characteristics. Proceedings of the International Conference of Language Processing and Intelligent Information Systems. M.A. Klopotek et al. (Eds.): IIS 2013, LNCS Vol. 7912, pp. 57–68.

[23] Han, Li-Feng Aaron; Zeng, Xiaodong; Wong, Derek Fai; Chao, Lidia Sam (2015). Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model. In Proceedings of the SIGHAN workshop at ACL-IJCNLP 2015.

[24] Linking Documents to Encyclopedic Knowledge.

[25] Learning to link with Wikipedia.
[26] Local and Global Algorithms for Disambiguation to Wikipedia.

26. Abedini, Farhad; Mahmoudi, Fariborz; Jadidinejad, Amir Hossein. “From text to knowledge: Semantic entity extraction using ontology.” International Journal of Machine Learning and Computing 1.2 (2011): 113.

27. Abedini, Farhad; Mahmoudi, Fariborz; Mirhashem, Seyedeh Masoumeh. “Using Semantic Entity Extraction Method for a New Application.” International Journal of Machine Learning and Computing 2.2 (2012): 178.

2.8 External links

• Named entity recognition for Arabic - Issues and challenges in morphologically rich languages such as Arabic

• CoNLL Language-independent NER shared tasks (2002) and (2003): NER data sets and methods for Spanish, Dutch, English and German

• Chemical compound and drug name recognition - Community challenge on the recognition of chemical compound and drug entity mentions in text

Chapter 3

Part-of-speech tagging

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill’s tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.

3.1 Principle

Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken. This is not rare—in natural languages (as opposed to many artificial languages), a large percentage of word-forms are ambiguous. For example, even “dogs”, which is usually thought of as just a plural noun, can also be a verb:

The sailor dogs the hatch.

Correct grammatical tagging will reflect that “dogs” is here used as a verb, not as the more common plural noun. Grammatical context is one way to determine this; semantic analysis can also be used to infer that “sailor” and “hatch” implicate “dogs” as 1) in the nautical context and 2) an action applied to the object “hatch” (in this context, “dogs” is a nautical term meaning “fastens (a watertight door) securely”). Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their "case" (role as subject, object, etc.), grammatical gender, and so on; while verbs are marked for tense, aspect, and other things. Linguists distinguish parts of speech to various fine degrees, reflecting a chosen “tagging system”. In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. For example, NN for singular common nouns, NNS for plural common nouns, NP for singular proper nouns (see the POS tags used in the Brown Corpus). Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech, and found that about as many words were ambiguous there as in English. A morphosyntactic descriptor in the case of morphologically rich languages is commonly expressed using very short mnemonics, such as 'Ncmsan' for Category=Noun, Type=common, Gender=masculine, Number=singular, Case=accusative, Animate=no.

3.2 History


3.2.1 The Brown Corpus

Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus, developed at Brown University by Henry Kučera and W. Nelson Francis in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).
The Brown Corpus was painstakingly “tagged” with part-of-speech markers over many years. A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. For example, article then noun can occur, but article verb (arguably) cannot. The program got about 70% correct. Its results were repeatedly reviewed and corrected by hand, and later users sent in errata, so that by the late 70s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree).
This corpus has been used for innumerable studies of word frequency and of part of speech, and inspired the development of similar “tagged” corpora in many other languages. Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS and VOLSUNGA. However, by this time (2005) it had been superseded by larger corpora such as the 100-million-word British National Corpus.
For some time, part-of-speech tagging was considered an inseparable part of natural language processing, because there are certain cases where the correct part of speech cannot be decided without understanding the semantics or even the pragmatics of the context. This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.

3.2.2 Use of hidden Markov models

In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus), and making a table of the probabilities of certain sequences. For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%. Knowing this, a program can decide that “can” in “the can” is far more likely to be a noun than a verb or a modal. The same method can of course be used to benefit from knowledge about following words. More advanced (“higher order”) HMMs learn the probabilities not only of pairs, but triples or even larger sequences. So, for example, if you've just seen a noun followed by a verb, the next item may be very likely a preposition, article, or noun, but much less likely another verb. When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. The combination with highest probability is then chosen. The European group developed CLAWS, a tagging program that did exactly this, and achieved accuracy in the 93–95% range. It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural language parsing (1997),[1] that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns will approach 90% accuracy because many words are unambiguous. CLAWS pioneered the field of HMM-based part of speech tagging, but was quite expensive since it enumerated all possibilities. It sometimes had to resort to backup methods when there were simply too many options (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as “still” that can represent as many as 7 distinct parts of speech (DeRose 1990, p. 82)). HMMs underlie the functioning of stochastic taggers and are used in various algorithms, one of the most widely used being the bi-directional inference algorithm.[2]
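The "enumerate every combination" idea can be made concrete in a few lines of Python; the transition and emission probabilities below are invented in the spirit of the figures quoted above, not estimates from a real corpus.

    from itertools import product

    # Toy model: P(tag | previous tag) and P(word | tag), invented for illustration.
    transition = {("START", "DET"): 0.8, ("DET", "NOUN"): 0.4,
                  ("DET", "VERB"): 0.1, ("DET", "MODAL"): 0.01}
    emission = {("the", "DET"): 0.6, ("can", "NOUN"): 0.01,
                ("can", "VERB"): 0.005, ("can", "MODAL"): 0.2}
    tags_for = {"the": ["DET"], "can": ["NOUN", "VERB", "MODAL"]}

    def best_sequence(words):
        # Score every tag combination by multiplying the probability of each choice.
        best, best_p = None, 0.0
        for tags in product(*(tags_for[w] for w in words)):
            p, prev = 1.0, "START"
            for word, tag in zip(words, tags):
                p *= transition.get((prev, tag), 0.0) * emission.get((word, tag), 0.0)
                prev = tag
            if p > best_p:
                best, best_p = tags, p
        return best

    print(best_sequence(["the", "can"]))  # -> ('DET', 'NOUN'): "can" after "the" is a noun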

3.2.3 Dynamic programming methods

In 1987, Steven DeRose[3] and Ken Church[4] independently developed dynamic programming algorithms to solve the same problem in vastly less time. Their methods were similar to the Viterbi algorithm, known for some time in other fields. DeRose used a table of pairs, while Church used a table of triples and a method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (actual measurement of triple probabilities would require a much larger corpus). Both methods achieved accuracy over 95%. DeRose’s 1990 dissertation at Brown University included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.

These findings were surprisingly disruptive to the field of natural language processing. The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part-of-speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. CLAWS, DeRose’s and Church’s methods did fail for some of the known cases where semantics is required, but those proved negligibly rare. This convinced many in the field that part-of-speech tagging could usefully be separated out from the other levels of processing; this in turn simplified the theory and practice of computerized language analysis, and encouraged researchers to find ways to separate out other pieces as well. Markov models are now the standard method for part-of-speech assignment.
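A minimal Viterbi tagger in this spirit, again over an invented toy model: the dynamic-programming trellis keeps, for each position and tag, only the best-scoring path so far, instead of enumerating all combinations.

    def viterbi(words, tags_for, transition, emission):
        # trellis[i][tag] = (best probability of any tag sequence for words[:i+1]
        # ending in tag, backpointer to the previous tag on that best path)
        trellis = [{tag: (transition.get(("START", tag), 0.0)
                          * emission.get((words[0], tag), 0.0), None)
                    for tag in tags_for[words[0]]}]
        for i in range(1, len(words)):
            column = {}
            for tag in tags_for[words[i]]:
                e = emission.get((words[i], tag), 0.0)
                p, prev = max(((trellis[i - 1][pt][0] * transition.get((pt, tag), 0.0) * e, pt)
                               for pt in trellis[i - 1]), key=lambda x: x[0])
                column[tag] = (p, prev)
            trellis.append(column)
        tag = max(trellis[-1], key=lambda t: trellis[-1][t][0])
        path = [tag]
        for i in range(len(words) - 1, 0, -1):  # follow backpointers
            tag = trellis[i][tag][1]
            path.append(tag)
        return path[::-1]

    tags_for = {"the": ["DET"], "can": ["NOUN", "VERB", "MODAL"], "rusts": ["VERB", "NOUN"]}
    transition = {("START", "DET"): 0.8, ("DET", "NOUN"): 0.4, ("DET", "VERB"): 0.1,
                  ("DET", "MODAL"): 0.01, ("NOUN", "VERB"): 0.3, ("VERB", "VERB"): 0.05,
                  ("MODAL", "VERB"): 0.5, ("NOUN", "NOUN"): 0.05, ("VERB", "NOUN"): 0.2,
                  ("MODAL", "NOUN"): 0.05}
    emission = {("the", "DET"): 0.6, ("can", "NOUN"): 0.01, ("can", "VERB"): 0.005,
                ("can", "MODAL"): 0.2, ("rusts", "VERB"): 0.1, ("rusts", "NOUN"): 0.01}
    print(viterbi(["the", "can", "rusts"], tags_for, transition, emission))
    # -> ['DET', 'NOUN', 'VERB']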

3.2.4 Unsupervised taggers

The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. It is, however, also possible to bootstrap using “unsupervised” tagging. Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word use, and derive part-of-speech categories themselves. For example, statistics readily reveal that “the”, “a”, and “an” occur in similar contexts, while “eat” occurs in very different ones. With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights. These two categories can be further subdivided into rule-based, stochastic, and neural approaches.

3.2.5 Other taggers and methods

Some current major algorithms for part-of-speech tagging include the Viterbi algorithm, Brill tagger, Constraint Grammar, and the Baum-Welch algorithm (also known as the forward-backward algorithm). Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm. The rule-based Brill tagger is unusual in that it learns a set of rule patterns, and then applies those patterns rather than optimizing a statistical quantity. Unlike the Brill tagger, where the rules are ordered sequentially, the POS and morphological tagging toolkit RDRPOSTagger stores rules in the form of a ripple-down rules tree. Many machine learning methods have also been applied to the problem of POS tagging. Methods such as SVM, maximum entropy classifier, perceptron, and nearest-neighbor have all been tried, and most can achieve accuracy above 95%. A direct comparison of several methods is reported (with references) at the ACL Wiki.[5] This comparison uses the Penn tag set on some of the Penn Treebank data, so the results are directly comparable. However, many significant taggers are not included (perhaps because of the labor involved in reconfiguring them for this particular dataset). Thus, it should not be assumed that the results reported there are the best that can be achieved with a given approach; nor even the best that have been achieved with a given approach. A more recent development is using the structure regularization method for part-of-speech tagging, achieving 97.36% on the standard benchmark dataset.[6]

3.3 Issues

While there is broad agreement about basic categories, a number of edge cases make it difficult to settle on a single “correct” set of tags, even in a single language such as English. For example, it is hard to say whether “fire” is an adjective or a noun in

the big green fire truck

A second important example is the use/mention distinction, as in the following example, where “blue” could be replaced by a word from any POS (the Brown Corpus tag set appends the suffix "-NC” in such cases):

the word “blue” has 4 letters.

Words in a language other than that of the “main” text are commonly tagged as “foreign”, usually in addition to a tag for the role the foreign word is actually playing in context. There are also many cases where POS categories and “words” do not map one to one, for example:

David’s
gonna
don't
vice versa
first-cut
cannot
pre- and post-secondary
look (a word) up

In the last example, “look” and “up” arguably function as a single verbal unit, despite the possibility of other words coming between them. Some tag sets (such as Penn) break hyphenated words, contractions, and possessives into separate tokens, thus avoiding some but far from all such problems. It is unclear whether it is best to treat words such as “be”, “have”, and “do” as categories in their own right (as in the Brown Corpus), or as simply verbs (as in the LOB Corpus and the Penn Treebank). “be” has more forms than other English verbs, and occurs in quite different grammatical contexts, complicating the issue. The most popular “tag set” for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project. It is largely similar to the earlier Brown Corpus and LOB Corpus tag sets, though much smaller. In Europe, tag sets from the Eagles Guidelines see wide use, and include versions for multiple languages. POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as Greek and Latin can be very large; tagging words in agglutinative languages such as Inuit may be virtually impossible. At the other extreme, S. Petrov, D. Das, and R. McDonald (“A Universal Part-of-Speech Tagset”, http://arxiv.org/abs/1104.2086) have proposed a “universal” tag set, with 12 categories (for example, no subtypes of nouns, verbs, punctuation, etc.; no distinction of “to” as an infinitive marker vs. preposition, etc.). Whether a very small set of very broad tags or a much larger set of more precise ones is preferable depends on the purpose at hand. Automatic tagging is easier on smaller tag-sets. A different issue is that some cases are in fact ambiguous. Beatrice Santorini gives examples in “Part-of-speech Tagging Guidelines for the Penn Treebank Project” (3rd rev., June 1990), including the following (p. 32) case in which entertaining can be either an adjective or a verb, and there is no syntactic way to decide:

The Duchess was entertaining last night.

3.4 See also

• Semantic net
• Sliding window based part-of-speech tagging
• Trigram tagger
• Word sense disambiguation

3.5 References

[1] Eugene Charniak

[2] CLL POS-tagger

[3] DeRose, Steven J. 1988. “Grammatical category disambiguation by statistical optimization.” Computational Linguistics 14(1): 31–39.

[4] Kenneth Ward Church (1988). “A stochastic parts program and noun phrase parser for unrestricted text”. ANLC '88: Proceedings of the second conference on Applied natural language processing. Association for Computational Linguistics Stroudsburg, PA. doi:10.3115/974235.974260.

[5] POS Tagging (State of the art)

[6] Xu Sun (2014). Structure Regularization for Structured Prediction (PDF). Neural Information Processing Systems (NIPS). pp. 2402–2410.

• Charniak, Eugene. 1997. “Statistical Techniques for Natural Language Parsing”. AI Magazine 18(4):33–44.

• Hans van Halteren, Jakub Zavrel, Walter Daelemans. 2001. Improving Accuracy in NLP Through Combination of Machine Learning Systems. Computational Linguistics 27(2): 199–229. PDF
• DeRose, Steven J. 1990. “Stochastic Methods for Resolution of Grammatical Category Ambiguity in Inflected and Uninflected Languages.” Ph.D. Dissertation. Providence, RI: Brown University Department of Cognitive and Linguistic Sciences. Electronic Edition available at

3.6 External links

• RDRPOSTagger - a robust rule-based toolkit for POS and morphological tagging (Python & Java). RDRPOSTagger supports pre-trained POS and morphological tagging models for 13 languages, as well as pre-trained Universal POS tagging models for 40 languages.
• SMILE POS tagger - free online service, includes an HMM-based POS tagger (Java API)
• Overview of available taggers
• Resources for Studying English Syntax Online
• CLAWS
• LingPipe Commercial Java natural language processing software including trainable part-of-speech taggers with first-best, n-best and per-tag confidence output
• Apache OpenNLP AL 2.0, includes a POS tagger based on maxent and perceptron classifiers
• CRFTagger Conditional Random Fields (CRFs) English POS tagger
• JTextPro A Java-based Toolkit
• Citar LGPL C++ Hidden Markov Model trigram POS tagger, a Java port named Jitar is also available
• Ninja-PoST PHP port of GPoSTTL, based on Eric Brill’s rule-based tagger
• ComplexityIntelligence, LLC Free and Commercial NLP Web Services for Part Of Speech Tagging (and Named Entity Recognition)
• Part-of-Speech tagging based on Soundex features
• FastTag - LGPL Java POS tagger based on Eric Brill’s rule-based tagger
• jspos - LGPL Javascript port of FastTag
• Topia TermExtractor - Python implementation of the UPenn BioIE parts-of-speech algorithm
• Stanford Log-linear Part-Of-Speech Tagger
• Northwestern MorphAdorner POS Tagger
• Part of speech tagger for Spanish
• Stagger – The Stockholm Tagger, for Swedish
• TnT -- Statistical Part-of-Speech Tagging, with one German and one English model
• petraTAG Open-source POS tagger written in Java with special features for tagging translated texts
• Rosette linguistics platform Commercial POS tagger, lemmatizer, base noun phrase extractor and other morphological analysis in Java and C++
• spaCy Open-source (MIT) Python NLP library including trainable part-of-speech tagger

Chapter 4

Phrase chunking

Phrase chunking is a natural language processing task that separates and segments a sentence into its subconstituents, such as noun, verb, and prepositional phrases.

4.1 See also

• Terminology extraction • Part-of-speech tagging

4.2 External links

• TermExtractor

• TreeTagger Chunker

Chapter 5

Relationship extraction

A relationship extraction task requires the detection and classification of semantic relationship mentions within a set of artifacts, typically from text or XML documents. The task is very similar to that of information extraction (IE), but IE additionally requires the removal of repeated relations (disambiguation) and generally refers to the extraction of many different relationships.

5.1 Applications

Application domains where relationship extraction is useful include gene-disease relationships,[1] protein-protein interactions,[2] etc.

5.2 Approaches

One approach to this problem involves the use of domain ontologies.[3][4] Another approach involves visual detection of meaningful relationships in parametric values of objects listed on a data table that shift positions as the table is permuted automatically as controlled by the software user. The poor coverage, rarity and development cost of structured resources such as semantic lexicons (e.g. WordNet, UMLS) and domain ontologies (e.g. the Gene Ontology) have given rise to new approaches based on broad, dynamic background knowledge on the Web. For instance, the ARCHILES technique[5] uses only Wikipedia and search engine page counts for acquiring coarse-grained relations to construct lightweight ontologies. The relationships can be represented using a variety of formalisms/languages. One such representation language for data on the Web is RDF.
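The simplest baseline in this family is sentence-level co-occurrence against entity lexicons; the gene and disease lists below are illustrative stand-ins for what an NER step or a resource like the Gene Ontology would supply, and the relation label is an assumption of the sketch.

    import re
    from itertools import product

    GENES = {"BRCA1", "TP53"}                    # illustrative lexicons
    DISEASES = {"breast cancer", "lung cancer"}

    SENTENCES = [
        "Mutations in BRCA1 are associated with breast cancer.",
        "TP53 is frequently inactivated in lung cancer.",
    ]

    # Emit a candidate relation whenever a gene and a disease co-occur in a sentence.
    def cooccurrence_relations(sentences):
        for s in sentences:
            for gene, disease in product(GENES, DISEASES):
                if re.search(r"\b%s\b" % re.escape(gene), s) and disease in s:
                    yield (gene, "associated_with", disease)

    print(list(cooccurrence_relations(SENTENCES)))
    # -> [('BRCA1', 'associated_with', 'breast cancer'),
    #     ('TP53', 'associated_with', 'lung cancer')]

Real systems replace bare co-occurrence with lexical or syntactic patterns, ontology constraints, or trained classifiers to filter out spurious pairs.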

5.3 See also

• Text analytics • Semantic analytics • • Information extraction • Business Intelligence 2.0

5.4 References

[1] Hong-Woo Chun; Yoshimasa Tsuruoka; Jin-Dong Kim; Rie Shiba; Naoki Nagata; Teruyoshi Hishiki; Jun-ichi Tsujii (2006). “Extraction of Gene-Disease Relations from Medline Using Domain Dictionaries and Machine Learning”. Pacific Symposium on Biocomputing.

[2] Minlie Huang; Xiaoyan Zhu; Yu Hao; Donald G. Payan; Kunbin Qu; Ming Li (2004). “Discovering patterns to extract protein-protein interactions from full texts”. Bioinformatics. 20 (18): 3604–3612. doi:10.1093/bioinformatics/bth451.

[3] T. C. Rindflesch, L. Tanabe, J. N. Weinstein and L. Hunter (2000). “EDGAR: Extraction of drugs, genes, and relations from the biomedical literature”. Proc. Pacific Symposium on Biocomputing. pp. 514–525.

[4] C. Ramakrishnan, K. J. Kochut and A. P. Sheth (2006). “A Framework for Schema-Driven Relationship Discovery from Unstructured Text”. Proc. International Semantic Web Conference. pp. 583–596.

[5] W. Wong; W. Liu; M. Bennamoun (2009). “Acquiring Semantic Relations using the Web for Constructing Lightweight Ontologies”. Proc. 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). doi:10.1007/978-3-642-01307-2_26.

Chapter 6

Sentence boundary disambiguation

Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, a decimal point, an ellipsis, or part of an email address, rather than the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations.[1] Likewise, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang. By contrast, languages like Japanese and Chinese have unambiguous sentence-ending markers.

6.1 Strategies

The standard 'vanilla' approach to locating the end of a sentence:

(a) If it is a period, it ends a sentence.

(b) If the preceding token is in the hand-compiled list of abbreviations, then it does not end a sentence.

(c) If the next token is capitalized, then it ends a sentence.

This strategy gets about 95% of sentences correct.[2] Most of the remaining 5% involves things such as shortened names, e.g. "D. H. Lawrence" (with whitespace between the individual initials), idiosyncratic orthographical spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like ".hack//SIGN"), and non-standard usage of punctuation.

Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model.[3] The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
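A minimal implementation of the three 'vanilla' rules in Python is sketched below; the abbreviation list is a small hypothetical sample, and the tokenization by whitespace is itself a simplification.

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "etc."}  # hand-compiled

def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token.endswith((".", "?", "!")):
            # Rule (b): a known abbreviation does not end a sentence.
            if token.lower() in ABBREVIATIONS:
                continue
            # Rule (c): break only before a capitalized token (or at the end).
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            if following == "" or following[0].isupper():
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith went to Washington. He arrived late."))
# ['Dr. Smith went to Washington.', 'He arrived late.']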

6.2 Software

Perl compatible regular expressions (“PCRE”)

• ((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])

• $sentences = preg_split('/((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]"))(\s|\r\n)(?="?[A-Z])/', $text); (PHP, using the pattern above)

Online use, libraries, and APIs

• sent_detector - Java


• Lingua-EN-Sentence - Perl

• Sentence.pm - Perl

• SATZ - An Adaptive Sentence Segmentation System, by David D. Palmer - C

Toolkits that include sentence detection

• Apache OpenNLP

• Freeling (software)

• Natural Language Toolkit

• Stanford NLP

• GExp

6.3 See also

• Sentence spacing

• Word divider

• Syllabification

• Punctuation

• Text segmentation

• Translation memory

• Multiword expression

6.4 References

[1] E. Stamatatos; N. Fakotakis; G. Kokkinakis. “Automatic Extraction of Rules for Sentence Boundary Disambiguation”. University of Patras. Retrieved 2009-01-03.

[2] “Doing Things with Words, Part Two: Sentence Boundary Detection”. Retrieved 2009-01-03.

[3] “A Maximum Entropy Approach to Identifying Sentence Boundaries” (PDF). Retrieved 2009-01-03.

6.5 External links

• Search for 'sentence boundary disambiguation', Google Scholar.

Chapter 7

Shallow parsing

Shallow parsing (also chunking, “light parsing”) is an analysis of a sentence which first identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and then links them to higher order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.). While the most elementary chunking algorithms simply link constituent parts on the basis of elementary search patterns (e.g. as specified by regular expressions), approaches that use machine learning techniques (classifiers, topic modeling, etc.) can take contextual information into account and thus compose chunks in such a way that they better reflect the semantic relations between the basic constituents.[1] That is, these more advanced methods get around the problem that combinations of elementary constituents can have different higher-level meanings depending on the context of the sentence.

Shallow parsing is widely used in natural language processing, and is similar in spirit to lexical analysis for computer languages. Under the name of the Shallow Structure Hypothesis, it is also used as an explanation for why second language learners often fail to parse complex sentences correctly.[2]
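For contrast with hand-written patterns, the snippet below extracts base noun-phrase chunks with spaCy, whose chunker is driven by a trained statistical parse. It assumes the spacy package and its small English model (en_core_web_sm) are installed; the printed output is indicative rather than guaranteed across model versions.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumped over the lazy dog.")

for chunk in doc.noun_chunks:
    # Each chunk exposes its text and the grammatical role of its head word.
    print(chunk.text, "->", chunk.root.dep_)
# The quick brown fox -> nsubj
# the lazy dog -> pobj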

7.1 References

[1] Jurafsky, Daniel; Martin, James H. (2000). Speech and Language Processing. Singapore: Pearson Education Inc. pp. 577–586.

[2] Clahsen, Harald; Felser, Claudia (2006). “Grammatical Processing in Language Learners”. Applied Psycholinguistics. 27: 3–42. doi:10.1017/S0142716406060024.

• “NP Chunking (State of the art)”. Association for Computational Linguistics. Retrieved 2016-01-30.

• Abney, Steven (1991), Parsing By Chunks (PDF), Kluwer Academic Publishers, pp. 257–278.

7.2 External links

• Apache OpenNLP - OpenNLP includes a chunker.

• GATE (General Architecture for Text Engineering) - GATE includes a chunker.

• NLTK chunking

• Illinois Shallow Parser - Shallow Parser Demo

7.3 See also

• Parser

• Semantic role labeling

• Named entity recognition

Chapter 8

Stemming

For the skiing technique, see Stem (skiing). For the climbing technique, see Glossary of climbing terms § stem.

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (generally a written word form). The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms, as a kind of query expansion, a process called conflation. Stemming programs are commonly referred to as stemming algorithms or stemmers.

8.1 Examples

A stemmer for English, for example, should identify the string “cats” (and possibly “catlike”, “catty” etc.) as based on the root “cat”, and “stems”, “stemmer”, “stemming”, “stemmed” as based on “stem”. A stemming algorithm reduces the words “fishing”, “fished”, and “fisher” to the root word, “fish”. On the other hand, “argue”, “argued”, “argues”, “arguing”, and “argus” reduce to the stem “argu” (illustrating the case where the stem is not itself a word or root) but “argument” and “arguments” reduce to the stem “argument”.
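These examples can be reproduced with the Porter stemmer implementation shipped in NLTK (assuming the nltk package is installed); the expected outputs are shown in comments.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["cats", "fishing", "fished", "argue", "argued", "argument"]:
    print(word, "->", stemmer.stem(word))
# cats -> cat
# fishing -> fish
# fished -> fish
# argue -> argu
# argued -> argu
# argument -> argument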

8.2 History

The first published stemmer was written by Julie Beth Lovins in 1968.[1] This paper was remarkable for its early date and had great influence on later work in this area. A later stemmer was written by Martin Porter and was published in the July 1980 issue of the journal Program. This stemmer was very widely used and became the de facto standard algorithm used for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval. Many implementations of the Porter stemming algorithm were written and freely distributed; however, many of these implementations contained subtle flaws. As a result, these stemmers did not match their potential. To eliminate this source of error, Martin Porter released an official (mostly BSD-licensed) implementation[2] of the algorithm around the year 2000. He extended this work over the next few years by building Snowball, a framework for writing stemming algorithms, and implemented an improved English stemmer together with stemmers for several other languages.

8.3 Algorithms

There are several types of stemming algorithms which differ in respect to performance and accuracy and how certain stemming obstacles are overcome.


A simple stemmer looks up the inflected form in a lookup table. The advantages of this approach are that it is simple, fast, and easily handles exceptions. The disadvantages are that all inflected forms must be explicitly listed in the table: new or unfamiliar words are not handled, even if they are perfectly regular (e.g. iPads ~ iPad), and the table may be large. For languages with simple morphology, like English, table sizes are modest, but highly inflected languages like Turkish may have hundreds of potential inflected forms for each root. A lookup approach may use preliminary part-of-speech tagging to avoid overstemming.[3]

8.3.1 The production technique

The lookup table used by a stemmer is generally produced semi-automatically. For example, if the word is “run”, then the inverted algorithm might automatically generate the forms “running”, “runs”, “runned”, and “runly”. The last two forms are valid constructions under the productive rules of English morphology, but they are unlikely to occur as real words.

8.3.2 Suffix-stripping algorithms

Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root form relations. Instead, a typically smaller list of “rules” is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include:

• if the word ends in 'ed', remove the 'ed'

• if the word ends in 'ing', remove the 'ing'

• if the word ends in 'ly', remove the 'ly'

Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute force algorithms, assuming the maintainer is sufficiently knowledgeable about the challenges of linguistics and morphology and about encoding suffix stripping rules. Suffix stripping algorithms are sometimes regarded as crude, given their poor performance when dealing with exceptional relations (like 'ran' and 'run'). The solutions produced by suffix stripping algorithms are limited to those lexical categories which have well-known suffixes with few exceptions. This is a problem, as not all parts of speech have such a well-formulated set of rules. Lemmatisation attempts to improve upon this challenge.

Prefix stripping may also be implemented. Of course, not all languages use prefixing or suffixing.

Additional algorithm criteria

Suffix stripping algorithms may differ in results for a variety of reasons. One such reason is whether the algorithm constrains whether the output word must be a real word in the given language. Some approaches do not require the word to actually exist in the language lexicon (the set of all words in the language). Alternatively, some suffix stripping approaches maintain a database (a large list) of all known morphological word roots that exist as real words. These approaches check the list for the existence of the term prior to making a decision. Typically, if the term does not exist, alternate action is taken. This alternate action may involve several other criteria. The non-existence of an output term may serve to cause the algorithm to try alternate suffix stripping rules.

It can be the case that two or more suffix stripping rules apply to the same input term, which creates an ambiguity as to which rule to apply. The algorithm may assign (by human hand or stochastically) a priority to one rule or another. Or the algorithm may reject one rule application because it results in a non-existent term, whereas the other overlapping rule does not. For example, given the English term friendlies, the algorithm may identify the ies suffix, apply the appropriate rule, and achieve the result friendl. friendl is likely not found in the lexicon, and therefore the rule is rejected.

One improvement upon basic suffix stripping is the use of suffix substitution. Similar to a stripping rule, a substitution rule replaces a suffix with an alternate suffix. For example, there could exist a rule that replaces ies with y. How this affects the algorithm depends on the algorithm’s design. To illustrate, the algorithm may identify that both the ies suffix stripping rule and the suffix substitution rule apply. Since the stripping rule results in a non-existent term in the lexicon, but the substitution rule does not, the substitution rule is applied instead. In this example, friendlies becomes friendly instead of friendl.

Diving further into the details, a common technique is to apply rules in a cyclical fashion (recursively, as computer scientists would say). After applying the suffix substitution rule in this example scenario, a second pass is made to identify matching rules on the term friendly, where the ly stripping rule is likely identified and accepted. In summary, friendlies becomes (via substitution) friendly, which becomes (via stripping) friend.

This example also helps illustrate the difference between a rule-based approach and a brute force approach. In a brute force approach, the algorithm would search for friendlies in the set of hundreds of thousands of inflected word forms and ideally find the corresponding root form friend. In the rule-based approach, the three rules mentioned above would be applied in succession to converge on the same solution. Chances are that the rule-based approach would be slower, since lookup algorithms have direct access to the solution, while a rule-based algorithm must try several rules, and combinations of rules, before choosing whichever result seems best.
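The friendlies walkthrough above can be traced with a toy rule-based stemmer like the one below; the lexicon, substitution rule, and stripping rules are illustrative assumptions standing in for a real ruleset and root database.

LEXICON = {"friend", "friendly", "run", "runner"}   # stand-in root database
SUBSTITUTIONS = [("ies", "y")]                      # e.g. friendlies -> friendly
STRIPS = ["ies", "ed", "ing", "ly"]                 # plain stripping rules

def stem(word):
    while True:
        # Substitution rules are tried first, as in the example above.
        for suffix, replacement in SUBSTITUTIONS:
            candidate = word[: -len(suffix)] + replacement
            if word.endswith(suffix) and candidate in LEXICON:
                word = candidate
                break
        else:
            # A stripping rule is rejected if its output is not a known term.
            for suffix in STRIPS:
                if word.endswith(suffix) and word[: -len(suffix)] in LEXICON:
                    word = word[: -len(suffix)]
                    break
            else:
                return word   # no rule fired: the stem has converged

print(stem("friendlies"))   # friend (via friendly, then the 'ly' strip)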

8.3.3 Lemmatisation algorithms

A more complex approach to the problem of determining a stem of a word is lemmatisation. This process involves first determining the part of speech of a word, and applying different normalization rules for each part of speech. The part of speech is first detected prior to attempting to find the root since for some languages, the stemming rules change depending on a word’s part of speech. This approach is highly conditional upon obtaining the correct lexical category (part of speech). While there is overlap between the normalization rules for certain categories, identifying the wrong category or being unable to produce the right category limits the added benefit of this approach over suffix stripping algorithms. The basic idea is that, if the stemmer is able to grasp more information about the word being stemmed, then it can apply more accurate normalization rules (which unlike suffix stripping rules can also modify the stem).

8.3.4 Stochastic algorithms

Stochastic algorithms involve using probability to identify the root form of a word. Stochastic algorithms are trained (they “learn”) on a table of root-form-to-inflected-form relations in order to develop a probabilistic model. This model is typically expressed in the form of complex linguistic rules, similar in nature to those used in suffix stripping or lemmatisation. Stemming is performed by inputting an inflected form to the trained model and having the model produce the root form according to its internal ruleset. This again is similar to suffix stripping and lemmatisation, except that the decisions involved (which rule to apply, whether to stem the word at all or simply return it unchanged, and whether to apply two different rules in sequence) are made so that the output word has the highest probability of being correct (that is, the smallest probability of being incorrect, which is how correctness is typically measured).

Some lemmatisation algorithms are stochastic in that, given a word which may belong to multiple parts of speech, a probability is assigned to each possible part. This may take into account the surrounding words, called the context; context-free methods do not take into account any additional information. In either case, after assigning the probabilities to each possible part of speech, the most likely part of speech is chosen, and from there the appropriate normalization rules are applied to the input word to produce the normalized (root) form.

8.3.5 n-gram analysis

Some stemming techniques use the n-gram context of a word to choose the correct stem for a word.[4]

8.3.6 Hybrid approaches

Hybrid approaches use two or more of the approaches described above in unison. A simple example is a suffix tree algorithm which first consults a lookup table using brute force. However, instead of trying to store the entire set of relations between words in a given language, the lookup table is kept small and is only used to store a minute amount of “frequent exceptions” like “ran => run”. If the word is not in the exception list, apply suffix stripping or lemmatisation and output the result.

8.3.7 Affix stemmers

In linguistics, the term affix refers to either a prefix or a suffix. In addition to dealing with suffixes, several approaches also attempt to remove common prefixes. For example, given the word indefinitely, identify that the leading “in” is a prefix that can be removed. Many of the same approaches mentioned earlier apply, but go by the name affix stripping. Jongejan and Dalianis published a study of affix stemming for several European languages.[5]

8.3.8 Matching algorithms

Such algorithms use a stem database (for example a set of documents that contain stem words). These stems, as mentioned above, are not necessarily valid words themselves (but rather common sub-strings, as the “brows” in “browse” and in “browsing”). In order to stem a word the algorithm tries to match it with stems from the database, applying various constraints, such as on the relative length of the candidate stem within the word (so that, for example, the short prefix “be”, which is the stem of such words as “be”, “been” and “being”, would not be considered as the stem of the word “beside”).

8.4 Language challenges

While much of the early academic work in this area focused on the English language (with significant use of the Porter stemmer algorithm), many other languages have been investigated.[6][7][8][9][10] Hebrew and Arabic are still considered difficult research languages for stemming. English stemmers are fairly trivial (with only occasional problems, such as “dries” being the third-person singular present form of the verb “dry”, or “axes” being the plural of both “axe” and “axis”); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language become more complex. For example, an Italian stemmer is more complex than an English one (because of a greater number of verb inflections), a Russian one is more complex still (more noun declensions), a Hebrew one is even more complex (due to nonconcatenative morphology, a writing system without vowels, and the requirement of prefix stripping: Hebrew stems can be two, three or four characters, but not more), and so on.

8.4.1 Multilingual stemming

Multilingual stemming applies morphological rules of two or more languages simultaneously instead of rules for only a single language when interpreting a search query. Commercial systems using multilingual stemming exist.

8.5 Error metrics

There are two error measurements in stemming algorithms: overstemming and understemming. Overstemming is an error where two separate inflected words are stemmed to the same root when they should not have been (a false positive). Understemming is an error where two separate inflected words that should be stemmed to the same root are not (a false negative). Stemming algorithms attempt to minimize each type of error, although reducing one type can increase the other.

For example, the widely used Porter stemmer stems “universal”, “university”, and “universe” to “univers”. This is a case of overstemming: though these three words are etymologically related, their modern meanings lie in widely different domains, so treating them as synonyms in a search engine will likely reduce the relevance of the search results. An example of understemming in the Porter stemmer is “alumnus” → “alumnu”, “alumni” → “alumni”, and “alumna”/“alumnae” → “alumna”. This English word keeps Latin morphology, and so these near-synonyms are not conflated.
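Both failure modes can be observed directly by running the article's own examples through NLTK's Porter stemmer (assuming the nltk package is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

print([stemmer.stem(w) for w in ["universal", "university", "universe"]])
# ['univers', 'univers', 'univers']   <- overstemming: unrelated senses conflated

print([stemmer.stem(w) for w in ["alumnus", "alumni", "alumnae"]])
# ['alumnu', 'alumni', 'alumna']      <- understemming: related forms kept apart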

8.6 Applications

Stemming is used as an approximate method for grouping words with a similar basic meaning together. For example, a text mentioning “daffodils” is probably closely related to a text mentioning “daffodil” (without the s). But in some cases, words with the same morphological stem have idiomatic meanings which are not closely related: a user searching for “marketing” will not be satisfied by most documents mentioning “markets” but not “marketing”.

8.6.1 Information retrieval

Stemmers are common elements in query systems such as Web search engines. The effectiveness of stemming for English query systems was soon found to be rather limited, however, and this led early information retrieval researchers to deem stemming irrelevant in general.[11] An alternative approach, based on searching for n-grams rather than stems, may be used instead. Also, stemmers may provide greater benefits in languages other than English.[12][13]

8.6.2 Domain Analysis

Stemming is used to determine domain vocabularies in domain analysis.[14]

8.6.3 Use in commercial products

Many commercial companies have been using stemming since at least the 1980s and have produced algorithmic and lexical stemmers in many languages.[15][16] The Snowball stemmers have been compared with commercial lexical stemmers, with varying results.[17][18]

Google adopted word stemming in 2003.[19] Previously a search for “fish” would not have returned “fishing”. Other software search algorithms vary in their use of word stemming. Programs that simply search for substrings will obviously find “fish” in “fishing”, but when searching for “fishes” will not find occurrences of the word “fish”.

8.7 See also

• Root (linguistics) - linguistic definition of the term “root”

• Stem (linguistics) - linguistic definition of the term “stem”

• Morphology (linguistics)

• Lemma (morphology) - linguistic definition

• Lemmatization

• Inflection

• Derivation - stemming is a form of reverse derivation

• Natural language processing - stemming is generally regarded as a form of NLP

- stemming algorithms play a major role in commercial NLP software

• Computational linguistics

• Snowball (programming language) - designed for creating stemming algorithms

8.8 References

[1] Lovins, Julie Beth (1968). “Development of a Stemming Algorithm”. Mechanical Translation and Computational Linguistics. 11: 22–31.

[2] http://tartarus.org/~{}martin/PorterStemmer/

[3] Yatsko, V. A.; Y-stemmer

[4] McNamee, Paul (September 21–22, 2005). “Exploring New Languages with HAIRCUT at CLEF 2005” (PDF). CEUR Workshop Proceedings. 1171. Retrieved 3/6/15.

[5] Jongejan, B.; and Dalianis, H.; Automatic Training of Lemmatization Rules that Handle Morphological Changes in pre-, in- and Suffixes Alike, in the Proceedings of ACL-2009, Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Singapore, August 2–7, 2009, pp. 145–153

[6] Dolamic, Ljiljana; and Savoy, Jacques; Stemming Approaches for East European Languages (CLEF 2007)

[7] Savoy, Jacques; Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages, ACM Symposium on Applied Computing, SAC 2006, ISBN 1-59593-108-2

[8] Popovič, Mirko; and Willett, Peter (1992); The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data, Journal of the American Society for Information Science, Volume 43, Issue 5 (June), pp. 384–390

[9] Stemming in Hungarian at CLEF 2005

[10] Viera, A. F. G. & , J. (2007); Uma revisão dos algoritmos de radicalização em língua portuguesa, Information Research, 12(3), paper 315

[11] Baeza-Yates, Ricardo; and Ribeiro-Neto, Berthier (1999); Modern Information Retrieval, ACM Press/Addison Wesley

[12] Kamps, Jaap; Monz, Christof; de Rijke, Maarten; and Sigurbjörnsson, Börkur (2004); Language-Dependent and Language- Independent Approaches to Cross-Lingual Text Retrieval, in Peters, C.; Gonzalo, J.; Braschler, M.; and Kluck, M. (eds.); Comparative Evaluation of Multilingual Information Access Systems, Springer Verlag, pp. 152–165

[13] Airio, Eija (2006); Word Normalization and Decompounding in Mono- and Bilingual IR, Information Retrieval 9:249–271

[14] Frakes, W.; Prieto-Diaz, R.; & Fox, C. (1998); DARE: Domain Analysis and Reuse Environment, Annals of Software Engineering (5), pp. 125-141

[15] Language Extension Packs, dtSearch

[16] Building Multilingual Solutions by using Sharepoint Products and Technologies, Microsoft Technet

[17] CLEF 2003: Stephen Tomlinson compared the Snowball stemmers with the Hummingbird lexical stemming (lemmatization) system

[18] CLEF 2004: Stephen Tomlinson “Finnish, Portuguese and Russian Retrieval with Hummingbird SearchServer”

[19] The Essentials of Google Search, Web Search Help Center, Google Inc.

8.9 Further reading

• Dawson, J. L. (1974); Suffix Removal for Word Conflation, Bulletin of the Association for Literary and Linguistic Computing, 2(3): 33–46

• Frakes, W. B. (1984); Term Conflation for Information Retrieval, Cambridge University Press

• Frakes, W. B. & Fox, C. J. (2003); Strength and Similarity of Affix Removal Stemming Algorithms, SIGIR Forum, 37: 26–30

• Frakes, W. B. (1992); Stemming algorithms, Information retrieval: data structures and algorithms, Upper Saddle River, NJ: Prentice-Hall, Inc.

• Hafer, M. A. & Weiss, S. F. (1974); Word segmentation by letter successor varieties, Information Processing & Management, 10 (11/12), 371–386

• Harman, D. (1991); How Effective is Suffixing?, Journal of the American Society for Information Science, 42 (1), 7–15

• Hull, D. A. (1996); Stemming Algorithms – A Case Study for Detailed Evaluation, JASIS, 47(1): 70–84

• Hull, D. A. & Grefenstette, G. (1996); A Detailed Analysis of English Stemming Algorithms, Xerox Technical Report

• Kraaij, W. & Pohlmann, R. (1996); Viewing Stemming as Recall Enhancement, in Frei, H.-P.; Harman, D.; Schauble, P.; and Wilkinson, R. (eds.); Proceedings of the 17th ACM SIGIR conference held at Zurich, August 18–22, pp. 40–48

• Krovetz, R. (1993); Viewing Morphology as an Inference Process, in Proceedings of ACM-SIGIR93, pp. 191–203

• Lennon, M.; Pierce, D. S.; Tarry, B. D.; & Willett, P. (1981); An Evaluation of some Conflation Algorithms for Information Retrieval, Journal of Information Science, 3: 177–183

• Lovins, J. (1971); Error Evaluation for Stemming Algorithms as Clustering Algorithms, JASIS, 22: 28–40

• Lovins, J. B. (1968); Development of a Stemming Algorithm, Mechanical Translation and Computational Linguistics, 11: 22–31

• Jenkins, Marie-Claire; and Smith, Dan (2005); Conservative Stemming for Search and Indexing

• Paice, C. D. (1990); Another Stemmer, SIGIR Forum, 24: 56–61

• Paice, C. D. (1996); Method for Evaluation of Stemming Algorithms based on Error Counting, JASIS, 47(8): 632–649

• Popovič, Mirko; and Willett, Peter (1992); The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data, Journal of the American Society for Information Science, Volume 43, Issue 5 (June), pp. 384–390

• Porter, Martin F. (1980); An Algorithm for Suffix Stripping, Program, 14(3): 130–137

• Savoy, J. (1993); Stemming of French Words Based on Grammatical Categories, Journal of the American Society for Information Science, 44(1), 1–9

• Ulmschneider, John E.; & Doszkocs, Tamas (1983); A Practical Stemming Algorithm for Online Search Assistance, Online Review, 7(4), 301–318

• Xu, J.; & Croft, W. B. (1998); Corpus-Based Stemming Using Coocurrence of Word Variants, ACM Transactions on Information Systems, 16(1), 61–81

8.10 External links

• Apache OpenNLP includes Porter and Snowball stemmers

• SMILE Stemmer - free online service, includes Porter and Paice/Husk Lancaster stemmers (Java API)

• Themis - open source IR framework, includes Porter stemmer implementation (PostgreSQL, Java API)

• Snowball - free stemming algorithms for many languages, includes source code, including stemmers for five romance languages

• Snowball on C# - port of Snowball stemmers for C# (14 languages)

• Python bindings to Snowball API

• Ruby-Stemmer - Ruby extension to Snowball API

• PECL - PHP extension to the Snowball API

• Oleander Porter’s algorithm - stemming library in C++ released under BSD

• Unofficial home page of the Lovins stemming algorithm - with source code in a couple of languages

• Official home page of the Porter stemming algorithm - including source code in several languages

• Official home page of the Lancaster stemming algorithm - Lancaster University, UK

• Official home page of the UEA-Lite Stemmer - University of East Anglia, UK

• Overview of stemming algorithms

• PTStemmer - A Java/Python/.Net stemming toolkit for the Portuguese language

• jsSnowball - open source JavaScript implementation of Snowball stemming algorithms for many languages

• Snowball Stemmer - implementation for Java

• hindi_stemmer - open source stemmer for Hindi

• czech_stemmer - open source stemmer for Czech

• Comparative Evaluation of Arabic Language Morphological Analysers and Stemmers

• Tamil Stemmer

This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the “relicensing” terms of the GFDL, version 1.3 or later.

Chapter 9

Text segmentation

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages. Compare speech segmentation, the process of dividing speech into linguistically meaningful portions.

9.1 Segmentation problems

9.1.1 Word segmentation

See also: Word § Word boundaries

Word segmentation is the problem of dividing a string of written language into its component words.

In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). (Some examples where the space character alone may not be sufficient include contractions like won't for will not.)

However, the equivalent to this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited. In some writing systems, however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.

The Unicode Consortium has published a Standard Annex on Text Segmentation, exploring the issues of segmentation in multiscript texts.

Word splitting is the process of parsing concatenated text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist. Word splitting may also refer to the process of hyphenation.
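A classic baseline for scripts without word delimiters is greedy longest-match ("maximum matching") segmentation against a dictionary. The sketch below uses a toy English-letter dictionary purely for readability; the algorithm is the same one traditionally applied to Chinese, and its characteristic failure is committing too early to a long dictionary word.

DICTIONARY = {"the", "them", "theme", "me", "men", "us"}  # toy lexicon

def max_match(text, dictionary, longest=5):
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(min(len(text), i + longest), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit a single character.
            words.append(text[i])
            i += 1
    return words

print(max_match("themeus", DICTIONARY))   # ['theme', 'us']
print(max_match("themen", DICTIONARY))    # ['theme', 'n'] - greedy error ('the men')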

9.1.2 Sentence segmentation

See also: Sentence boundary disambiguation

Sentence segmentation is the problem of dividing a string of written language into its component sentences. In English and some other languages, punctuation, particularly the full stop/period character, is a reasonable approximation.


However, even in English this problem is not trivial, due to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. For example, Mr. is not its own sentence in "Mr. Smith went to the shops in Jones Street.” When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.

As with word segmentation, not all written languages contain punctuation characters which are useful for approximating sentence boundaries.

9.1.3 Topic segmentation

Main articles: Topic analysis and Document classification

Topic analysis consists of two main tasks: topic identification and text segmentation. While the first is a simple classification of a specific text, the latter case implies that a document may contain multiple topics, and the task of computerized text segmentation may be to discover these topics automatically and segment the text accordingly. The topic boundaries may be apparent from section titles and paragraphs. In other cases, one needs to use techniques similar to those used in document classification.

Segmenting the text into topics or discourse turns might be useful in some natural language processing tasks: it can significantly improve information retrieval or speech recognition (by indexing and recognizing documents more precisely, or by returning the specific part of a document corresponding to the query). It is also needed in topic detection and tracking systems and in text summarization. Many different approaches have been tried:[1][2] e.g. HMMs, lexical chains, passage similarity using word co-occurrence, clustering, topic modeling, etc.

It is quite an ambiguous task: people evaluating text segmentation systems often differ on topic boundaries. Hence, evaluating text segmentation is also a challenging problem.

9.1.4 Other segmentation problems

Processes may be required to segment text into segments other than those mentioned, including morphemes (a task usually called morphological analysis) or paragraphs.

9.2 Automatic segmentation approaches

Automatic segmentation is the problem in natural language processing of implementing a computer process to segment text.

When punctuation and similar clues are not consistently available, the segmentation task often requires fairly non-trivial techniques, such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints. Effective natural language processing systems and text segmentation tools usually operate on text from specific domains and sources. As an example, processing the text used in medical records is a very different problem than processing news articles or real estate advertisements.

The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:

• Manual analysis of text, followed by writing custom software

• Annotating the sample corpus with boundary information and using machine learning

Some text segmentation systems take advantage of any markup like HTML, and known document formats like PDF, to provide additional evidence for sentence and paragraph boundaries.

9.3 See also

• Hyphenation

• Natural language processing

• Speech segmentation

• Lexical analysis

• Word count

• Line breaking

9.4 References

[1] Freddy Y. Y. Choi (2000). “Advances in domain independent linear text segmentation” (PDF). Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00). pp. 26–33.

[2] Jeffrey C. Reynar (1998). “Topic Segmentation: Algorithms and Applications” (PDF). IRCS-98-21. University of Pennsylvania. Retrieved 2007-11-08.

9.5 External links

• Word Segment An open source software tool for word segmentation in Chinese.

• Word Split An open source software tool designed to split conjoined words into human-readable text.

• Stanford Segmenter An open source software tool for word segmentation in Chinese or morpheme segmentation in Arabic.

• KyTea An open source software tool for word segmentation in Japanese and Chinese.

• Chinese Notes A Chinese–English dictionary that also does word segmentation.

• Zhihuita Segmentor A high precision and high performance Chinese segmentation freeware.

• Python wordsegment module An open source Python module for English word segmentation.

Chapter 10

Tokenization (lexical analysis)

In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation), and in computer science, where it forms part of lexical analysis.

10.1 Methods and obstacles

Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a “word”. Often a tokenizer relies on simple heuristics, for example:

• Punctuation and whitespace may or may not be included in the resulting list of tokens.

• All contiguous strings of alphabetic characters are part of one token; likewise with numbers

• Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.

In languages that use inter-word spaces (such as most that use the Latin alphabet, and most programming languages), this approach is fairly straightforward. However, even here there are many edge cases, such as contractions, hyphenated words, emoticons, and larger constructs such as URIs (which for some purposes may count as single tokens). A classic example is “New York-based”, which a naive tokenizer may break at the space even though the better break is (arguably) at the hyphen.

Tokenization is particularly difficult for languages written in scriptio continua, which exhibit no word boundaries, such as Ancient Greek, Chinese,[1] or Thai. Agglutinative languages, such as Korean, also make tokenization tasks complicated.

Some ways to address the more difficult problems include developing more complex heuristics, querying a table of common special cases, or fitting the tokens to a language model that identifies collocations in a later processing step.
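A heuristic tokenizer along the lines just described can be written with a single regular expression; the pattern below (numbers first, then words with an optional apostrophe part, then any other non-space character) is an illustrative simplification, not a production ruleset.

import re

TOKEN = re.compile(r"""
      \d+(?:\.\d+)?             # integer or decimal number
    | [A-Za-z]+(?:'[A-Za-z]+)?  # word, optionally with a clitic (don't)
    | [^\w\s]                   # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Don't pay $12.50 in New York-based shops!"))
# ["Don't", 'pay', '$', '12.50', 'in', 'New', 'York', '-', 'based', 'shops', '!']

Note that this tokenizer breaks “New York-based” at the hyphen rather than at the space, matching the preference discussed above.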

10.2 Software

• Apache OpenNLP includes rule based and statistical tokenizers which support many languages

• U-Tokenizer is an API over HTTP that can cut Mandarin and Japanese sentences at word boundary. English is supported as well.

• HPE Haven OnDemand Text Tokenization API (Commercial product, with freemium access) uses Advanced Probabilistic Concept Modelling to determine the weight that the term holds in the specified text indexes


10.3 See also

• Tokenization (data security)

10.4 References

[1] Huang, C., Simon, P., Hsieh, S., & Prevot, L. (2007) Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Word break Identification

• “The Art of Tokenization”, developerWorks, Jan 23, 2013.

Chapter 11

Parsing

“Parse” redirects here. For other uses, see Parse (disambiguation). “Parser” redirects here. For the computer programming language, see Parser (CGI language).

Parsing (US /ˈpɑːrsɪŋ/; UK /ˈpɑːrzɪŋ/), syntax analysis or syntactic analysis is the process of analysing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).[1][2]

The term has slightly different meanings in different branches of linguistics and computer science. Traditional sentence parsing is often performed as a method of understanding the exact meaning of a sentence or word, sometimes with the aid of devices such as sentence diagrams. It usually emphasizes the importance of grammatical divisions such as subject and predicate.

Within computational linguistics the term is used to refer to the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree showing their syntactic relation to each other, which may also contain semantic and other information.

The term is also used in psycholinguistics when describing language comprehension. In this context, parsing refers to the way that human beings analyze a sentence or phrase (in spoken language or text) “in terms of grammatical constituents, identifying the parts of speech, syntactic relations, etc.”[2] This term is especially common when discussing what linguistic cues help speakers to interpret garden-path sentences.

Within computer science, the term is used in the analysis of computer languages, referring to the syntactic analysis of the input code into its component parts in order to facilitate the writing of compilers and interpreters. The term may also be used to describe a split or separation.

11.1 Human languages

Main category: Natural language parsing

11.1.1 Traditional methods

The traditional grammatical exercise of parsing, sometimes known as clause analysis, involves breaking down a text into its component parts of speech with an explanation of the form, function, and syntactic relationship of each part.[3] This is determined in large part from study of the language’s conjugations and declensions, which can be quite intricate for heavily inflected languages. To parse a phrase such as 'man bites dog' involves noting that the singular noun 'man' is the subject of the sentence, the verb 'bites’ is the third person singular of the present tense of the verb 'to bite', and the singular noun 'dog' is the object of the sentence. Techniques such as sentence diagrams are sometimes used to indicate relations between elements in the sentence.

Parsing was formerly central to the teaching of grammar throughout the English-speaking world, and widely regarded as basic to the use and understanding of written language. However, the general teaching of such techniques is no longer current.

11.1.2 Computational methods

In some machine translation and natural language processing systems, written texts in human languages are parsed by computer programs. Human sentences are not easily parsed by programs, as there is substantial ambiguity in the structure of human language, whose usage is to convey meaning (or semantics) among a potentially unlimited range of possibilities, of which only some are germane to the particular case. So an utterance “Man bites dog” versus “Dog bites man” is definite on one detail, but in another language might appear as “Man dog bites”, with a reliance on the larger context to distinguish between those two possibilities, if indeed that difference was of concern. It is difficult to prepare formal rules to describe informal behaviour, even though it is clear that some rules are being followed.

In order to parse natural language data, researchers must first agree on the grammar to be used. The choice of syntax is affected by both linguistic and computational concerns; for instance, some parsing systems use lexical functional grammar, but in general, parsing for grammars of this type is known to be NP-complete. Head-driven phrase structure grammar is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn Treebank. Shallow parsing aims to find only the boundaries of major constituents such as noun phrases. Another popular strategy for avoiding linguistic controversy is dependency grammar parsing.

Most modern parsers are at least partly statistical; that is, they rely on a corpus of training data which has already been annotated (parsed by hand). This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts. (See machine learning.) Approaches which have been used include straightforward PCFGs (probabilistic context-free grammars), maximum entropy, and neural nets. Most of the more successful systems use lexical statistics (that is, they consider the identities of the words involved, as well as their part of speech). However, such systems are vulnerable to overfitting and require some kind of smoothing to be effective.

Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties, as is possible with manually designed grammars for programming languages. As mentioned earlier, some grammar formalisms are very difficult to parse computationally; in general, even if the desired structure is not context-free, some kind of context-free approximation to the grammar is used to perform a first pass. Algorithms which use context-free grammars often rely on some variant of the CYK algorithm, usually with some heuristic to prune away unlikely analyses to save time. (See chart parsing.) However, some systems trade speed for accuracy using, e.g., linear-time versions of the shift-reduce algorithm. A somewhat recent development has been parse reranking, in which the parser proposes some large number of analyses and a more complex system selects the best option.

11.1.3 Psycholinguistics

In psycholinguistics, parsing involves not just the assignment of words to categories, but the evaluation of the meaning of a sentence according to the rules of syntax drawn by inferences made from each word in the sentence. This normally occurs as words are being heard or read. Consequently, psycholinguistic models of parsing are of necessity incremental, meaning that they build up an interpretation as the sentence is being processed, which is normally expressed in terms of a partial syntactic structure. Creation of initially wrong structures occurs when interpreting garden-path sentences.

11.2 Computer languages

11.2.1 Parser

A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree or other hierarchical structure – giving a structural representation of the input, checking for correct syntax in the process. The parsing may be preceded or followed by other steps, or these may be combined into a single step. The parser is often preceded by a separate lexical analyser, which creates tokens from the sequence of input characters; alternatively, these can be combined in scannerless parsing. Parsers may be programmed by hand or may be automatically or semi-automatically generated by a parser generator. Parsing is complementary to templating, which produces formatted output. These may be applied to different domains, but often appear together, such as the scanf/printf pair, or the input (front end parsing) and output (back end code generation) stages of a compiler.

The input to a parser is often text in some computer language, but may also be text in a natural language or less structured textual data, in which case generally only certain parts of the text are extracted, rather than a parse tree being constructed. Parsers range from very simple functions such as scanf, to complex programs such as the frontend of a C++ compiler or the HTML parser of a web browser. An important class of simple parsing is done using regular expressions, in which a group of regular expressions defines a regular language and a regular expression engine automatically generates a parser for that language, allowing pattern matching and extraction of text. In other contexts regular expressions are instead used prior to parsing, as the lexing step whose output is then used by the parser.

The use of parsers varies by input. In the case of data languages, a parser is often found as the file reading facility of a program, such as reading in HTML or XML text; these examples are markup languages. In the case of programming languages, a parser is a component of a compiler or interpreter, which parses the source code of a computer programming language to create some form of internal representation; the parser is a key step in the compiler frontend. Programming languages tend to be specified in terms of a deterministic context-free grammar because fast and efficient parsers can be written for them. For compilers, the parsing itself can be done in one pass or multiple passes – see one-pass compiler and multi-pass compiler.

The implied disadvantages of a one-pass compiler can largely be overcome by adding fix-ups, where provision is made for fix-ups during the forward pass, and the fix-ups are applied backwards when the current program segment has been recognized as having been completed. An example where such a fix-up mechanism would be useful would be a forward GOTO statement, where the target of the GOTO is unknown until the program segment is completed. In this case, the application of the fix-up would be delayed until the target of the GOTO was recognized. Obviously, a backward GOTO does not require a fix-up.

Context-free grammars are limited in the extent to which they can express all of the requirements of a language. Informally, the reason is that the memory of such a language is limited. The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently.
Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out at the semantic analysis (contextual analysis) step. For example, in Python the following is syntactically valid code:

x = 1
print(x)

The following code, however, is syntactically valid in terms of the context-free grammar, yielding a syntax tree with the same structure as the previous, but is syntactically invalid in terms of the context-sensitive grammar, which requires that variables be initialized before use:

x = 1
print(y)

Rather than being analyzed at the parsing stage, this is caught by checking the values in the syntax tree, hence as part of semantic analysis: context-sensitive syntax is in practice often more easily analyzed as semantics.

11.2.2 Overview of process

The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.

The first stage is the token generation, or lexical analysis, by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. For example, a calculator program would look at an input such as “12*(3+4)^2” and split it into the tokens 12, *, (, 3, +, 4, ), ^, 2, each of which is a meaningful symbol in the context of an arithmetic expression. The lexer would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like “12*" or "(3” will not be generated.

The next stage is parsing or syntactic analysis, which is checking that the tokens form an allowable expression. This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear. However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with attribute grammars.

The final phase is semantic parsing or analysis, which is working out the implications of the expression just validated and taking the appropriate action. In the case of a calculator or interpreter, the action is to evaluate the expression or program; a compiler, on the other hand, would generate some kind of code. Attribute grammars can also be used to define these actions.
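The following sketch implements both stages for a cut-down version of this calculator language (the ^ operator is omitted): a regular-expression lexer produces the tokens, and a recursive-descent parser checks them against a small context-free grammar while directly evaluating the result. The grammar, written informally, is: expr -> term (('+'|'-') term)*; term -> factor (('*'|'/') factor)*; factor -> NUMBER | '(' expr ')'.

import re

TOKENS = re.compile(r"\d+|[+\-*/()]")   # lexical level: numbers and operators

def parse(text):
    tokens = TOKENS.findall(text.replace(" ", ""))
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        pos += 1
        return token

    def factor():                        # factor -> NUMBER | '(' expr ')'
        if peek() == "(":
            eat("(")
            value = expr()
            eat(")")
            return value
        token = eat()
        if not token.isdigit():
            raise SyntaxError(f"expected a number, got {token!r}")
        return int(token)

    def term():                          # '*' and '/': higher precedence
        value = factor()
        while peek() in ("*", "/"):
            value = value * factor() if eat() == "*" else value / factor()
        return value

    def expr():                          # '+' and '-': lower precedence
        value = term()
        while peek() in ("+", "-"):
            value = value + term() if eat() == "+" else value - term()
        return value

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result

print(parse("12*(3+4)"))   # 84
print(parse("1+2*3"))      # 7: '*' binds tighter because term() sits below expr()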

11.3 Types of parsers

The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways:

• Top-down parsing - Top-down parsing can be viewed as an attempt to find left-most derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules.[4]

• Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is Shift-Reduce parsing.

LL parsers and recursive-descent parsers are examples of top-down parsers that cannot accommodate left recursive production rules. Although it has been believed that simple implementations of top-down parsing cannot accommodate direct and indirect left-recursion and may require exponential time and space complexity while parsing ambiguous context-free grammars, more sophisticated algorithms for top-down parsing have been created by Frost, Hafiz, and Callaghan[5][6] which accommodate ambiguity and left recursion in polynomial time and which generate polynomial-size representations of the potentially exponential number of parse trees. Their algorithm is able to produce both left-most and right-most derivations of an input with regard to a given context-free grammar.

An important distinction with regard to parsers is whether a parser generates a leftmost derivation or a rightmost derivation (see context-free grammar). LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation (although usually in reverse).[4]

11.4 Parser development software

Some of the well known parser development tools include the following. Also see comparison of parser generators.

• ANTLR

• Bison

• Coco/R

• GOLD

• JavaCC

• LuZc

• Parsec


• Spirit Parser Framework

• Syntax Definition Formalism

• SYNTAX

• XPL

11.5 Lookahead

Lookahead establishes the maximum number of incoming tokens that a parser can use to decide which rule it should apply. Lookahead is especially relevant to LL, LR, and LALR parsers, where it is often explicitly indicated by affixing the lookahead to the algorithm name in parentheses, such as LALR(1).

Most programming languages, the primary target of parsers, are carefully defined in such a way that a parser with limited lookahead, typically one token, can parse them, because parsers with limited lookahead are often more efficient. One important change to this trend came in 1990 when Terence Parr created ANTLR for his Ph.D. thesis, a parser generator for efficient LL(k) parsers, where k is any fixed value.

Parsers typically have only a few actions after seeing each token. They are shift (add this token to the stack for later reduction), reduce (pop tokens from the stack and form a syntactic construct), end, error (no known rule applies) or conflict (does not know whether to shift or reduce).

Lookahead has two advantages.

• It helps the parser take the correct action in case of conflicts. For example, parsing the if statement in the case of an else clause.

• It eliminates many duplicate states and eases the burden of an extra stack. A C language non-lookahead parser will have around 10,000 states. A lookahead parser will have around 300 states.

Example: Parsing the Expression 1 + 2 * 3

The expression parsing rules (the grammar) referenced in the traces below are:

Rule1: E → E + E (an expression is the sum of two expressions)

Rule2: E → E * E (an expression is the product of two expressions)

Rule3: E → number (an expression is a simple number)

Rule4: * has higher precedence than +

Most programming languages (except for a few such as APL and Smalltalk) and algebraic formulas give higher precedence to multiplication than addition, in which case the correct interpretation of the example above is (1 + (2*3)). Note that Rule4 is a semantic rule. It is possible to rewrite the grammar to incorporate this into the syntax. However, not all such rules can be translated into syntax.

Simple non-lookahead parser actions

Initially Input = [1,+,2,*,3]

1. Shift “1” onto stack from input (in anticipation of Rule3). Input = [+,2,*,3] Stack = [1]

2. Reduce “1” to expression “E” based on Rule3. Stack = [E]

3. Shift "+" onto stack from input (in anticipation of Rule1). Input = [2,*,3] Stack = [E,+]

4. Shift “2” onto stack from input (in anticipation of Rule3). Input = [*,3] Stack = [E,+,2]

5. Reduce stack element “2” to expression “E” based on Rule3. Stack = [E,+,E]

6. Reduce stack items [E,+] and new input “E” to “E” based on Rule1. Stack = [E]

7. Shift "*" onto stack from input (in anticipation of Rule2). Input = [3] Stack = [E,*]

8. Shift “3” onto stack from input (in anticipation of Rule3). Input = [] (empty) Stack = [E,*,3]

9. Reduce stack element “3” to expression “E” based on Rule3. Stack = [E,*,E]

10. Reduce stack items [E,*] and new input “E” to “E” based on Rule2. Stack = [E]

The parse tree and the code resulting from it are not correct according to the language semantics. To parse correctly without lookahead, there are three solutions:

• The user has to enclose expressions within parentheses. This often is not a viable solution.

• The parser needs to have more logic to backtrack and retry whenever a rule is violated or incomplete. A similar method is followed in LL parsers.

• Alternatively, the parser or grammar needs to have extra logic to delay reduction and reduce only when it is absolutely sure which rule to reduce first. This method is used in LR parsers. It correctly parses the expression, but with many more states and increased stack depth.

Lookahead parser actions

1. Shift 1 onto stack on input 1 in anticipation of rule3. It does not reduce immediately.
2. Reduce stack item 1 to simple Expression on input + based on rule3. The lookahead is +, so we are on the path to E +, so we can reduce the stack to E.
3. Shift + onto stack on input + in anticipation of rule1.
4. Shift 2 onto stack on input 2 in anticipation of rule3.
5. Reduce stack item 2 to Expression on input * based on rule3. The lookahead * expects only E before it.
6. Now the stack has E + E and the input is still *. There are two choices: either shift based on rule2 or reduce based on rule1. Since * has higher precedence than + based on rule4, we shift * onto the stack in anticipation of rule2.
7. Shift 3 onto stack on input 3 in anticipation of rule3.
8. Reduce stack item 3 to Expression after seeing end of input based on rule3.
9. Reduce stack items E * E to E based on rule2.
10. Reduce stack items E + E to E based on rule1.

The parse tree generated is correct, and the parser is simply more efficient than a non-lookahead parser. This is the strategy followed in LALR parsers.
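The same grammar can be handled by a shift-reduce loop with one token of lookahead, reducing only while the operator already on the stack binds at least as tightly as the incoming token (rule4). A minimal sketch with illustrative names, not a full LALR table:

```python
# Shift-reduce with one token of lookahead: reduce E op E -> E only while the
# operator on the stack has precedence >= that of the next input token.
PRECEDENCE = {'+': 1, '*': 2}   # rule4: '*' binds tighter than '+'

def parse_lookahead(tokens):
    stack = []
    for tok in tokens + ['$']:                 # '$' marks end of input
        while (len(stack) >= 3 and stack[-2] in PRECEDENCE
               and PRECEDENCE[stack[-2]] >= PRECEDENCE.get(tok, 0)):
            right, op, left = stack.pop(), stack.pop(), stack.pop()
            stack.append((op, left, right))    # reduce
        if tok != '$':
            stack.append(tok)                  # shift
    return stack[0]

print(parse_lookahead(['1', '+', '2', '*', '3']))
# ('+', '1', ('*', '2', '3')) -- the correct grouping (1 + (2 * 3))
```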

11.6 See also

• Compiler-compiler

• Generating strings

• LALR parser

• Lexical analysis

• Pratt parser

• Shallow parsing

• Left corner parser

• Parsing expression grammar

• ASF+SDF Meta Environment

• DMS Software Reengineering Toolkit

• Program transformation

• Source code generation

11.7 References

[1] “Bartleby.com homepage”. Retrieved 28 November 2010.

[2] “parse”. dictionary.reference.com. Retrieved 27 November 2010.

[3] “Grammar and Composition”.

[4] Aho, A.V., Sethi, R. and Ullman, J.D. (1986) Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

[5] Frost, R., Hafiz, R. and Callaghan, P. (2007) “Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars.” 10th International Workshop on Parsing Technologies (IWPT), ACL-SIGPARSE, pages 109–120, June 2007, Prague.

[6] Frost, R., Hafiz, R. and Callaghan, P. (2008) “Parser Combinators for Ambiguous Left-Recursive Grammars.” 10th International Symposium on Practical Aspects of Declarative Languages (PADL), ACM-SIGPLAN, Volume 4902/2008, pages 167–181, January 2008, San Francisco.

11.8 Further reading

• Chapman, Nigel P., LR Parsing: Theory and Practice, Cambridge University Press, 1987. ISBN 0-521-30413-X

• Grune, Dick; Jacobs, Ceriel J.H., Parsing Techniques – A Practical Guide, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands. Originally published by Ellis Horwood, Chichester, England, 1990; ISBN 0-13-651431-6

11.9 External links

• The Lemon LALR Parser Generator

• Stanford Parser – The Stanford Parser

• Turin University Parser – Natural language parser for Italian, open source, developed in Common Lisp by Leonardo Lesmo, University of Torino, Italy.

• Short history of parser construction

Flow of data in a typical parser

Chapter 12

Parse tree

A parse tree or parsing tree[1] or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term parse tree itself is used primarily in computational linguistics; in theoretical syntax the term syntax tree is more common.
Parse trees concretely reflect the syntax of the input language, making them distinct from the abstract syntax trees used in computer programming. They are also distinct from the sentence diagrams, such as Reed-Kellogg diagrams, used for teaching grammar.
Parse trees are usually constructed based on either the constituency relation of constituency grammars (phrase structure grammars) or the dependency relation of dependency grammars. Parse trees may be generated for sentences in natural languages (see natural language processing), as well as during processing of computer languages, such as programming languages.
A related concept is that of phrase marker or P-marker, as used in transformational generative grammar. A phrase marker is a linguistic expression marked as to its phrase structure. This may be presented in the form of a tree, or as a bracketed expression. Phrase markers are generated by applying phrase structure rules, and are themselves subject to further transformational rules.

12.1 Constituency-based parse trees

The constituency-based parse trees of constituency grammars (= phrase structure grammars) distinguish between terminal and non-terminal nodes. The interior nodes are labeled by non-terminal categories of the grammar, while the leaf nodes are labeled by terminal categories. The image below represents a constituency-based parse tree; it shows the syntactic structure of the English sentence John hit the ball:

The parse tree is the entire structure, starting from S and ending in each of the leaf nodes (John, hit, the, ball). The following abbreviations are used in the tree:

• S for sentence, the top-level structure in this example


• NP for noun phrase. The first (leftmost) NP, a single noun “John”, serves as the subject of the sentence. The second one is the object of the sentence.

• VP for verb phrase, which serves as the predicate

• V for verb. In this case, it’s a transitive verb hit.

• D for , in this instance the definite article “the”

• N for noun

Each node in the tree is either a root node, a branch node, or a leaf node.[2] A root node is a node that doesn't have any branches on top of it. Within a sentence, there is only ever one root node. A branch node is a mother node that connects to two or more daughter nodes. A leaf node, however, is a terminal node that does not dominate other nodes in the tree. S is the root node, NP and VP are branch nodes, and John (N), hit (V), the (D), and ball (N) are all leaf nodes. The leaves are the lexical tokens of the sentence.[3]
A node can also be referred to as a parent node or a child node. A parent node is one that has at least one other node linked by a branch under it. In the example, S is a parent of both NP and VP. A child node is one that has at least one node directly above it to which it is linked by a branch of the tree. From the example, hit is a child node of V. The terms mother and daughter are also sometimes used for this relationship.
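For readers who want to experiment with such trees programmatically, the sketch below builds the tree for John hit the ball using the NLTK library (an assumption: NLTK is installed, and the bracketing used is one plausible rendering of the tree described above):

```python
# Build the constituency tree and recover the root, leaves, and local subtrees.
from nltk import Tree

tree = Tree.fromstring("(S (NP (N John)) (VP (V hit) (NP (D the) (N ball))))")

print(tree.label())    # 'S'  -- the root node
print(tree.leaves())   # ['John', 'hit', 'the', 'ball'] -- the lexical tokens

# Print each mother node together with its daughters.
for subtree in tree.subtrees():
    daughters = " ".join(
        child.label() if isinstance(child, Tree) else child
        for child in subtree)
    print(subtree.label(), "->", daughters)
```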

12.2 Dependency-based parse trees

The dependency-based parse trees of dependency grammars[4] see all nodes as terminal, which means they do not acknowledge the distinction between terminal and non-terminal categories. They are simpler on average than constituency-based parse trees because they contain fewer nodes. The dependency-based parse tree for the example sentence above is as follows:

This parse tree lacks the phrasal categories (S, VP, and NP) seen in the constituency-based counterpart above. Like the constituency-based tree, however, this analysis acknowledges constituent structure. Any complete sub-tree of the tree is a constituent. Thus this dependency-based parse tree acknowledges the subject noun John and the object noun phrase the ball as constituents just like the constituency-based parse tree does.
The constituency vs. dependency distinction is far-reaching. Whether the additional syntactic structure associated with constituency-based parse trees is necessary or beneficial is a matter of debate.

12.3 Phrase markers

Phrase markers, or P-markers, were introduced in early transformational generative grammar, as developed by Noam Chomsky and others. A phrase marker representing the deep structure of a sentence is generated by applying phrase structure rules; this may then undergo further transformations. Phrase markers may be presented in the form of trees (as in the above section on constituency-based parse trees), but are often given instead in the form of bracketed expressions, which occupy less space. For example, a bracketed expression corresponding to the constituency-based tree given above may be something like:

[S [NP John][VP [V hit][NP the [N ball]]]]
As with trees, the precise construction of such expressions and the amount of detail shown can depend on the theory being applied and on the points that the author wishes to illustrate.

12.4 See also

• Constituent (linguistics)

• Dependency grammar

• Computational linguistics

• Terminal and non-terminal functions

• Parsing

• Phrase structure grammar

• Sentence diagram

• Verb phrase

• Parse Thicket

12.5 Notes

[1] See Chiswell and Hodges 2007: 34.

[2] See Carnie (2013:118ff.) for an introduction to the basic concepts of syntax trees (e.g. root node, terminal node, non-terminal node, etc.).

[3] See Alfred et al. 2007.

[4] See for example Ágel et al. 2003/2006.

12.6 References

• Ágel, V., Ludwig Eichinger, Hans-Werner Eroms, Peter Hellwig, Hans Heringer, and Hennig Lobin (eds.) 2003/6. Dependency and valency: An international handbook of contemporary research. Berlin: Walter de Gruyter.

• Carnie, A. 2013. Syntax: A generative introduction, 3rd edition. Malden, MA: Wiley-Blackwell.

• Chiswell, Ian and Wilfrid Hodges 2007. Mathematical logic. Oxford: Oxford University Press.

• Aho, Alfred et al. 2007. Compilers: Principles, techniques, & tools. Boston: Pearson/Addison Wesley.

12.7 External links

• Syntax Tree Editor

• Linguistic Tree Constructor

• phpSyntaxTree – Online parse tree drawing site

• phpSyntaxTree (Unicode) – Online parse tree drawing site (improved version that supports Unicode)

• Qtree – LaTeX package for drawing parse trees

• TreeForm Syntax Tree Drawing Software

• rSyntaxTree – Enhanced version of phpSyntaxTree in Ruby with Unicode and vectorized graphics

• Visual Introduction to Parse Trees – Introduction and Transformation

• OpenCourseOnline Dependency Parse Introduction (Christopher Manning)

Chapter 13

Constituent (linguistics)

In syntactic analysis, a constituent is a word or a group of words that function(s) as a single unit within a hierarchical structure. The analysis of constituent structure is associated mainly with phrase structure grammars, although dependency grammars also allow sentence structure to be broken down into constituent parts. The constituent structure of sentences is identified using constituency tests. These tests manipulate some portion of a sentence and, based on the result, deliver clues about the immediate constituent structure of the sentence.
Many constituents are phrases. A phrase is a sequence of one or more words (in some theories two or more) built around a head lexical item and working as a unit within a sentence. A word sequence is shown to be a phrase/constituent if it exhibits one or more of the behaviors discussed below.[1]

13.1 Constituency tests

Constituency tests are diagnostics used to identify the constituent structure of sentences.[2] There are numerous constituency tests applied to English sentences, many of which are listed here: 1. topicalization (fronting), 2. clefting, 3. pseudoclefting, 4. pro-form substitution (replacement), 5. answer ellipsis (question test), 6. passivization, 7. omission (deletion), 8. coordination, etc.
These tests are rough-and-ready tools which grammarians employ to reveal clues about syntactic structure. A word of caution is warranted when employing them, since they often deliver contradictory results. Some syntacticians even arrange the tests on a scale of reliability, with less-reliable tests treated as useful to confirm constituency though not sufficient on their own.[3] Failing a single test does not mean that the unit is not a constituent, and conversely, passing a single test does not necessarily mean that the unit is a constituent. It is best to apply as many tests as possible to a given unit in order to prove or to rule out its status as a constituent.

13.1.1 Topicalization (fronting)

Topicalization involves moving the test sequence to the front of the sentence. It is a simple movement operation:[4]

He is going to attend another course to improve his English. To improve his English, he is going to attend another course.

13.1.2 Clefting

Clefting involves placing a sequence of words X within the structure beginning with It is/was: It was X that...[5]

She bought a pair of gloves with silk embroidery. It was a pair of gloves with silk embroidery that she bought.


13.1.3 Pseudoclefting

Pseudoclefting (also preposing) is similar to clefting in that it puts emphasis on a certain phrase in a sentence. It involves inserting a sequence of words before is/are what or is/are who:[6]

She bought a pair of gloves with silk embroidery. A pair of gloves with silk embroidery is what she bought.

13.1.4 Pro-form substitution (replacement)

Pro-form substitution, or replacement, involves replacing the test constituent with the appropriate pro-form (e.g. pronoun). Substitution normally involves using a definite pro-form like it, he, there, here, etc. in place of a phrase or a clause. If such a change yields a grammatical sentence where the general structure has not been altered, then the test sequence is a constituent:[7]

I don't know the man who is sleeping in the car.
*I don't know him who is sleeping in the car. (ungrammatical)
I don't know him.

The ungrammaticality of the first changed version and the grammaticality of the second one demonstrate that the whole sequence, the man who is sleeping in the car, and not just the man, is a constituent functioning as a unit.

13.1.5 Answer ellipsis (answer fragments, question test)

The answer ellipsis test refers to the ability of a sequence of words to stand alone as a reply to a question. It is often used to test the constituency of a verbal phrase but can also be applied to other phrases:[8]

What did you do yesterday? - Worked on my new project.
What did you do yesterday? - *Worked on. (unacceptable, so worked on is not a constituent)

Linguists do not agree whether passing the answer ellipsis test is sufficient, though at a minimum they agree that it can help confirm the results of another constituency test.

13.1.6 Passivization

Passivization involves changing an active sentence to a passive sentence, or vice versa. The object of the active sentence is changed to the subject of the corresponding passive sentence:[9]

A car driving too fast nearly hit the little dog.
The little dog was nearly hit by a car driving too fast.

If passivization results in a grammatical sentence, the phrases which have been moved can be regarded as constituents.

13.1.7 Omission (deletion)

Omission checks whether a sequence of words can be omitted without influencing the grammaticality of the sentence — in most cases, local or temporal adverbials can be safely omitted and thus qualify as constituents.[10]

Fred relaxes at night on his couch.
Fred relaxes on his couch.
Fred relaxes at night.

Since they can be omitted, the prepositional phrases at night and on his couch are constituents.

13.1.8 Coordination

The coordination test assumes that only constituents can be coordinated, i.e., joined by means of a coordinator such as and:[11]

He enjoys [writing sentences] and [reading them].
[He enjoys writing] and [she enjoys reading] sentences.
[He enjoys] but [she hates] writing sentences.

Based on the fact that writing sentences and reading them are coordinated using and, one can conclude that they are constituents. The validity of the coordination test is challenged by additional data, however. The latter two sentences, which are instances of so-called right node raising, suggest that the bracketed sequences should be understood as constituents. Most grammars do not view sequences such as He enjoys, to the exclusion of the VP writing sentences, as a constituent. Thus while the coordination test is widely employed as a diagnostic for constituent structure, it faces major difficulties and is therefore perhaps the least reliable of all the tests mentioned.[12]

13.2 Constituency tests and disambiguation

Syntactic ambiguity characterizes sentences which can be interpreted in different ways depending solely on how one perceives syntactic connections between words and arranges them into phrases. Possible interpretations of the sentence They killed the man with a gun are:

'The man was shot.'
'The man who was killed had a gun with him.'

The ambiguity of this sentence results from two possible arrangements into constituents:

They killed [the man] [with a gun].
They killed [the man with a gun].

In the first sentence, with a gun is an independent constituent with instrumental meaning. In the second sentence, it is embedded in the noun phrase the man with a gun and is modifying the noun man. The autonomy of the unit with a gun in the first interpretation can be tested by the answer ellipsis test:

How did they kill the man? - With a gun.

However, the same test can be used to prove that the man with a gun in the second sentence should be treated as a unit:

Who(m) did they kill? - The man with a gun.

The ability of constituency tests to disambiguate certain sentences in this manner bears witness to their utility. Most if not all syntacticians employ constituency tests in some form or another to arrive at the structures that they assign to sentences.
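The attachment ambiguity just discussed can also be demonstrated mechanically. The sketch below (assuming the NLTK library; the toy grammar is purely illustrative) licenses both attachments of with a gun and prints the two resulting parse trees:

```python
import nltk

# A toy grammar in which a PP may attach under VP (instrumental reading)
# or inside an NP (modifier reading).
grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> Pro | Det N | NP PP
  VP  -> V NP | VP PP
  PP  -> P NP
  Pro -> 'They'
  V   -> 'killed'
  Det -> 'the' | 'a'
  N   -> 'man' | 'gun'
  P   -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = ['They', 'killed', 'the', 'man', 'with', 'a', 'gun']
for tree in parser.parse(sentence):
    print(tree)
# One parse attaches the PP under VP ('The man was shot.'); the other embeds
# it in the NP "the man with a gun" ('The man who was killed had a gun.').
```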

13.3 Competing theories

Alternate theoretical approaches to syntax make different assumptions regarding what is considered a constituent. In mainstream phrase structure grammar (and its derivatives), individual words are constituents in and of themselves as well as being parts of other constituents, whereas in dependency grammar,[13] certain core words in each phrase are not a constituent by themselves, but only members of a phrasal constituent.
The following trees show the same sentence in two different theoretical representations, with a phrase structure representation on the left and a dependency grammar representation on the right. In both trees, a constituent is understood to be the entire tree or any labelled subtree (a node plus all the nodes dominated by that node); note that words like killed and with, for instance, form subtrees (and are considered constituents) in the phrase structure representation but not in the dependency structure representation.[14]

13.4 See also


• Non-finite verb

13.5 Notes

[1] Tests for constituent structure can be found in most textbooks on syntax. See for instance Sobin (2011: 30).

[2] See for instance Burton-Roberts (1997:7–23) and Carnie (2002:51-53).

[3] April 22, 2006 Language Log posting by Eric Bakovic of University of California, San Diego

[4] For examples of topicalization used as a constituency test, see for instance Allerton (1979:114), Borsley (1991:24), Napoli (1993:422), Burton-Roberts (1997:17), Poole (2002:32), Radford (2004:72), Haegeman (2006:790).

[5] For examples of clefting used as a constituency test, see Brown and Miller (1980:25), Borsley (1991:24), Napoli (1993:148), McCawley (1997:64), Haegman and Guéron (1999:49), Santorini and Kroch (2000), Akmajian et al. (2001:178); Carnie (2002:52), Haegeman (2006:85).

[6] For examples of pseudoclefting used as a constituency test, see Brown and Miller (1980:25), Borsley (1991:24), McCawley (1997:661), Haegeman and Guéron (1999:50), Haegeman (2006).

[7] For examples of pro-form substitution used as a constituency test, see Radford (1988:92, 1997:109), Haegeman and Guéron (1999:46), Lasnik (2000:9), Santorini and Kroch (2000), Dalrymple (2001:48), Carnie (2002:51), Poole (2002:29), Radford (2004:71), Haegeman (2006:74).

[8] For examples of answer ellipsis used as a constituency test, see Brown and Miller (1980:25), Radford (1988:91, 96), Burton-Roberts (1997:16), Radford (1997:107), Haegeman and Guéron (1999:46), Santorini and Kroch (2000), Carnie (2002:52), Haegeman (2006:82).

[9] For an example of passivization used as a test for constituent structure, see Borsley (1991:24).

[10] For examples of omission used as a constituency test, see Allerton (1979:101f.), Burton-Roberts (1997:15), and Haegeman and Guéron (1999:49).

[11] For examples of coordination used as a test for constituent structure, see Radford (1988:90), Borsley (1991:25), Cowper (1992:34), Napoli (1993:165), Ouhalla (1994:17), Jacobson (1996:60), McCawley (1997:58), Radford (1997:104), Lasnik (2000:11), Akmajian et al. (2001:179), Poole (2002:31).

[12] The problems with coordination as a test for constituent structure have been pointed out in numerous places in the literature. See for instance Brinker (1972:52), Dalrymple (2001:48), Nerbonne (1994:120f.), Carnie (2002:53).

[13] Two prominent sources on dependency grammar are Tesnière (1959) and Ágel, et al. (2003/2006).

[14] For a comparison of these two competing views of constituent structure, see Osborne (2008:1126-32).

13.6 References

• Ágel, V., L. Eichinger, H.-W. Eroms, P. Hellwig, H. Heringer, and H. Lobin (eds.) 2003/6. Dependency and valency: An international handbook of contemporary research. Berlin: Walter de Gruyter.

• Akmajian, A., R. Demers, A. Farmer and R. Harnish. 2001. Linguistics: An introduction to language and communication, 5th edn. Cambridge: MIT Press.

• Allerton, D. 1979. Essentials of grammatical theory: A consensus view of syntax and morphology. London: Routledge and Kegan Paul.

• Borsley, R. 1991. Syntactic theory: A unified approach. London: Edward Arnold.

• Brinker, K. 1972. Konstituentengrammatik und operationale Satzgliedanalyse: Methodenkritische Untersuchungen zur Syntax des einfachen deutschen Satzes. Frankfurt a. M.: Athenäum.

• Brown, K. and J. Miller 1980. Syntax: A linguistic introduction to sentence structure. London: Hutchinson.

• Burton-Roberts, N. 1997. Analysing sentences: An introduction to English syntax. 2nd edition. Longman.

• Carnie, A. 2002. Syntax: A generative introduction. Oxford: Blackwell.

• Carnie, A. 2010. Constituent Structure. Oxford: Oxford University Press.

• Cowper, E. 1992. A concise introduction to syntactic theory: The government-binding approach. Chicago: The University of Chicago Press.

• Dalrymple, M. 2001. Lexical functional grammar. Syntax and semantics 34. San Diego: Academic Press.

• Haegeman, L. 2006. Thinking syntactically: A guide to argumentation and analysis. Malden, MA: Blackwell.

• Haegeman, L. and J. Guéron 1999. English grammar: A generative perspective. Oxford: Basil Blackwell.

• Jacobson, P. 1996. Constituent structure. In Concise encyclopedia of syntactic theories. Cambridge: Pergamon.

• Lasnik, H. 2000. Syntactic structures revisited: Contemporary lectures on classic transformational theory. Cambridge: MIT Press.

• McCawley, J. 1997. The syntactic phenomena of English, 2nd edn. Chicago: University of Chicago Press.

• Napoli, D. 1993. Syntax: Theory and problems. New York: Oxford University Press.

• Nerbonne, J. 1994. Partial verb phrases and spurious ambiguities. In: J. Nerbonne, K. Netter and C. Pollard (eds.), German in Head-Driven Phrase Structure Grammar, CSLI Lecture Notes Number 46, 109–150. Stanford: CSLI Publications.

• Osborne, T. 2008. Major constituents: And two dependency grammar constraints on sharing in coordination. Linguistics 46, 6, 1109–1165.

• Ouhalla, J. 1994. Introducing transformational grammar: From rules to principles and parameters. Oxford: Oxford University Press.

• Poole, G. 2002. Syntactic theory. New York: Palgrave.

• Radford, A. 1988. Transformational grammar: A first course. Cambridge, UK: Cambridge University Press.

• Radford, A. 1997. Syntactic theory and the structure of English: A minimalist approach. Cambridge, UK: Cambridge University Press.

• Radford, A. 2004. English syntax: An introduction. Cambridge, UK: Cambridge University Press.

• Santorini, B. and A. Kroch 2000. The syntax of natural language: An online introduction using the trees program. Available at (accessed on March 14, 2011): http://www.ling.upenn.edu/~beatrice/syntax-textbook/00/index.html.

• Sobin, N. 2011. Syntactic analysis: The basics. Malden, MA: Wiley-Blackwell.

• Tesnière, L. 1959. Éléments de syntaxe structurale. Paris: Klincksieck.

Chapter 14

Dependency grammar

Dependency grammar (DG) is a class of modern syntactic theories that are all based on the dependency relation (as opposed to the constituency relation) and that can be traced back primarily to the work of Lucien Tesnière. Dependency is the notion that linguistic units, e.g. words, are connected to each other by directed links. The (finite) verb is taken to be the structural center of clause structure. All other syntactic units (words) are either directly or indirectly connected to the verb in terms of the directed links, which are called dependencies. DGs are distinct from phrase structure grammars (constituency grammars) since DGs lack phrasal nodes although they acknowledge phrases. Structure is determined by the relation between a word (a head) and its dependents. Dependency structures are flatter than constituency structures in part because they lack a finite verb phrase constituent, and they are thus well suited for the analysis of languages with free , such as Czech, Turkish, and Warlpiri.

14.1 History

The notion of dependencies between grammatical units has existed since the earliest recorded grammars, e.g. Pāṇini, and the dependency concept therefore arguably predates the constituency notion by many centuries.[1] Ibn Maḍāʾ, a 12th-century linguist from Córdoba, Andalusia, may have been the first grammarian to use the term dependency in the grammatical sense that we use it today. In early modern times, the dependency concept seems to have coexisted side by side with the constituency concept, the latter having entered Latin, French, English and other grammars from the widespread study of term logic of antiquity.[2] Dependency is also concretely present in the works of Sámuel Brassai (1800–1897), a Hungarian linguist, and of Heimann Hariton Tiktin (1850–1936), a Romanian linguist.[3]
Modern dependency grammars, however, begin primarily with the work of Lucien Tesnière. Tesnière was a Frenchman, a polyglot, and a professor of linguistics at the universities in Strasbourg and Montpellier. His major work Éléments de syntaxe structurale was published posthumously in 1959 – he died in 1954. The basic approach to syntax he developed seems to have been seized upon independently by others in the 1960s[4] and a number of other dependency-based grammars have gained prominence since those early works.[5] DG has generated a lot of interest in Germany[6] in both theoretical syntax and language pedagogy. In recent years, the great development surrounding dependency-based theories has come from computational linguistics and is due, in part, to the influential work that David Hays did in machine translation at the RAND Corporation in the 1950s and 1960s. Dependency-based systems are increasingly being used to parse natural language and generate tree banks. Interest in dependency grammar is growing at present, international conferences on dependency linguistics being a relatively recent development (Depling 2011, Depling 2013, Depling 2015).

14.2 Dependency vs. constituency

Dependency is a one-to-one correspondence: for every element (e.g. word or morph) in the sentence, there is exactly one node in the structure of that sentence that corresponds to that element. The result of this one-to-one correspondence is that dependency grammars are word (or morph) grammars. All that exist are the elements and the dependencies that connect the elements into a structure. This situation should be compared with the constituency relation of phrase structure grammars. Constituency is a one-to-one-or-more correspondence, which means that, for every element in a sentence, there are one or more nodes in the structure that correspond to that element. The result of this difference is that dependency structures are minimal[7] compared to their constituency structure counterparts, since they tend to contain many fewer nodes.

These two trees illustrate just two possible ways to render the dependency and constituency relations (see below). This dependency tree is an “ordered” tree, i.e. it reflects actual word order. Many dependency trees abstract away from linear order and focus just on hierarchical order, which means they do not show actual word order. This constituency tree follows the conventions of bare phrase structure (BPS), whereby the words themselves are employed as the node labels.
The distinction between dependency- and constituency-based grammars derives in large part from the initial division of the clause. The constituency relation derives from an initial binary division, whereby the clause is split into a subject noun phrase (NP) and a predicate verb phrase (VP). This division is certainly present in the basic analysis of the clause that we find in the works of, for instance, Leonard Bloomfield and Noam Chomsky. Tesnière, however, argued vehemently against this binary division, preferring instead to position the verb as the root of all clause structure. Tesnière’s stance was that the subject-predicate division stems from term logic and has no place in linguistics.[8] The importance of this distinction is that if one acknowledges the initial subject-predicate division in syntax as something real, then one is likely to go down the path of constituency grammar, whereas if one rejects this division, then the only alternative is to position the verb as the root of all structure, which means one has chosen the path of dependency grammar.

14.3 Dependency grammars

The following frameworks are dependency-based:

• Algebraic syntax

• Operator grammar

• Functional generative description

• Lexicase

• Meaning–text theory

• Word grammar

• Extensible dependency grammar

Link grammar is based on the dependency relation, but link grammar does not include directionality in the dependencies between words, and thus does not describe head-dependent relationships. Hybrid dependency/constituency grammar uses dependencies between words, but also includes dependencies between phrasal nodes – see for example the Quranic Arabic Dependency Treebank. The derivation trees of tree-adjoining grammar are dependency-based, although the full trees of TAG are constituency-based, so in this regard, it is not clear whether TAG should be viewed more as a dependency or a constituency grammar.

Hybrid constituency/dependency tree from the Quranic Arabic Corpus

There are major differences between the grammars just listed. In this regard, the dependency relation is compatible with other major tenets of theories of grammar. Thus like constituency grammars, dependency grammars can be mono- or multistratal, representational or derivational, construction- or rule-based.

14.4 Representing dependencies

There are various conventions that DGs employ to represent dependencies. The following schemata (in addition to the tree above and the trees further below) illustrate some of these conventions:

The representations in (a–d) are trees, whereby the specific conventions employed in each tree vary. Solid lines are dependency edges and lightly dotted lines are projection lines. The only difference between tree (a) and tree (b) is that tree (a) employs the category class to label the nodes whereas tree (b) employs the words themselves as the node labels.[9] Tree (c) is a reduced tree insofar as the string of words below and the projection lines are deemed unnecessary and are hence omitted. Tree (d) abstracts away from linear order and reflects just hierarchical order.[10] The arrow arcs in (e) are an alternative convention used to show dependencies and are favored by Word Grammar.[11] The brackets in (f) are seldom used, but are nevertheless quite capable of reflecting the dependency hierarchy; dependents appear enclosed in more brackets than their heads. And finally, the indentations like those in (g) are another convention that is sometimes employed to indicate the hierarchy of words.[12] Dependents are placed underneath their heads and indented. Like tree (d), the indentations in (g) abstract away from linear order.
The point to these conventions is that they are just that, namely conventions. They do not influence the basic commitment to dependency as the relation that is grouping syntactic units.
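Convention (g) in particular is easy to mechanize. The following sketch (illustrative names and encoding) stores a dependency tree as (word, head-index) pairs and prints each dependent indented beneath its head, abstracting away from linear order just as convention (g) does:

```python
# A dependency tree for "John hit the ball" as (word, head-index) pairs;
# head index 0 marks the root, and words are indexed from 1 in list order.
TREE = [("hit", 0), ("John", 1), ("ball", 1), ("the", 3)]

def print_indented(head, depth=0):
    """Print every dependent of `head`, each above its own dependents."""
    for i, (word, h) in enumerate(TREE, start=1):
        if h == head:
            print("  " * depth + word)
            print_indented(i, depth + 1)

print_indented(0)
# hit
#   John
#   ball
#     the
```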

14.5 Types of dependencies

The dependency representations above (and further below) show syntactic dependencies. Indeed, most work in dependency grammar focuses on syntactic dependencies. Syntactic dependencies are, however, just one of three or four types of dependencies. Meaning–text theory, for instance, emphasizes the role of semantic and morphological dependencies in addition to syntactic dependencies.[13] A fourth type, prosodic dependencies, can also be acknowledged. Distinguishing between these types of dependencies can be important, in part because if one fails to do so, the likelihood that semantic, morphological, and/or prosodic dependencies will be mistaken for syntactic dependencies is great. The following four subsections briefly sketch each of these dependency types. During the discussion, the existence of syntactic dependencies is taken for granted and used as an orientation point for establishing the nature of the other three dependency types.

14.5.1 Semantic dependencies

Semantic dependencies are understood in terms of predicates and their arguments.[14] The arguments of a predicate are semantically dependent on that predicate. Often, semantic dependencies overlap with and point in the same direction as syntactic dependencies. At times, however, semantic dependencies can point in the opposite direction of syntactic dependencies, or they can be entirely independent of syntactic dependencies. The hierarchy of words in the following examples shows standard syntactic dependencies, whereas the arrows indicate semantic dependencies:

The two arguments Sam and Sally in tree (a) are dependent on the predicate likes, whereby these arguments are also syntactically dependent on likes. What this means is that the semantic and syntactic dependencies overlap and point in the same direction (down the tree). Attributive adjectives, however, are predicates that take their head noun as their argument, hence big is a predicate in tree (b) that takes bones as its one argument; the semantic dependency points up the tree and therefore runs counter to the syntactic dependency. A similar situation obtains in (c), where the preposition predicate on takes the two arguments the picture and the wall; one of these semantic dependencies points up the syntactic hierarchy, whereas the other points down it. Finally, the predicate to help in (d) takes the one argument Jim but is not directly connected to Jim in the syntactic hierarchy, which means that that semantic dependency is entirely independent of the syntactic dependencies.

14.5.2 Morphological dependencies

Morphological dependencies obtain between words or parts of words.[15] When a given word or part of a word influences the form of another word, then the latter is morphologically dependent on the former. Agreement and concord are therefore manifestations of morphological dependencies. Like semantic dependencies, morphological dependencies can overlap with and point in the same direction as syntactic dependencies, overlap with and point in the opposite direction of syntactic dependencies, or be entirely independent of syntactic dependencies. The arrows are now used to indicate morphological dependencies.

The plural houses in (a) demands the plural of the demonstrative determiner, hence these appears, not this, which means there is a morphological dependency that points down the hierarchy from houses to these. The situation is reversed in (b), where the singular subject Sam demands the appearance of the agreement suffix -s on the finite verb works, which means there is a morphological dependency pointing up the hierarchy from Sam to works. The type of determiner in the German examples (c) and (d) influences the inflectional suffix that appears on the adjective alt. When the indefinite article ein appears, it lacks gender, so the strong masculine ending -er appears on the adjective. When the definite article der appears, in contrast, it shows masculine gender, which means the weak ending -e appears on the adjective. Thus since the choice of determiner impacts the morphological form of the adjective, there is a morphological dependency pointing from the determiner to the adjective, whereby this morphological dependency is entirely independent of the syntactic dependencies. Consider further the following French sentences:

The masculine subject le chien in (a) demands the masculine form of the predicative adjective blanc, whereas the feminine subject la maison demands the feminine form of this adjective. A morphological dependency that is entirely independent of the syntactic dependencies therefore points again across the syntactic hierarchy.
Morphological dependencies play an important role in typological studies. Languages are classified as mostly head-marking (Sam work-s) or mostly dependent-marking (these houses), whereby most if not all languages contain at least some minor measure of both head and dependent marking.[16]

14.5.3 Prosodic dependencies

Prosodic dependencies are acknowledged in order to accommodate the behavior of clitics.[17] A clitic is a syntactically autonomous element that is prosodically dependent on a host. A clitic is therefore integrated into the prosody of its host, meaning that it forms a single word with its host. Prosodic dependencies exist entirely in the linear dimension (horizontal dimension), whereas standard syntactic dependencies exist in the hierarchical dimension (vertical dimension). Classic examples of clitics in English are reduced auxiliaries (e.g. -ll, -s, -ve) and the possessive marker -s. The prosodic dependencies in the following examples are indicated with the hyphen and the lack of a vertical projection line:

The hyphens and the lack of projection lines indicate prosodic dependencies. A hyphen that appears on the left of the clitic indicates that the clitic is prosodically dependent on the word immediately to its left (He'll, There’s), whereas a hyphen that appears on the right side of the clitic (not shown here) indicates that the clitic is prosodically dependent on the word that appears immediately to its right. A given clitic is often prosodically dependent on its syntactic dependent (He'll, There’s) or on its head (would've). At other times, it can depend prosodically on a word that is neither its head nor its immediate dependent (Florida’s).

14.5.4 Syntactic dependencies

Syntactic dependencies are the focus of most work in dependency grammar, as stated above. How the presence and the direction of syntactic dependencies are determined is of course often open to debate. In this regard, it must be acknowledged that the validity of syntactic dependencies in the trees throughout this article is being taken for granted. However, these hierarchies are such that many dependency grammars can largely support them, although there will certainly be points of disagreement. The basic question about how syntactic dependencies are discerned has proven difficult to answer definitively. One should acknowledge in this area, however, that the basic task of identifying and discerning the presence and direction of the syntactic dependencies of dependency grammars is no easier or harder than determining the constituent groupings of constituency grammars. A variety of heuristics are employed to this end, basic constituency tests being useful tools; the syntactic dependencies assumed in the trees in this article are grouping words together in a manner that most closely matches the results of standard permutation, substitution, and ellipsis constituency tests. Etymological considerations also provide helpful clues about the direction of dependencies. A promising principle upon which to base the existence of syntactic dependencies is distribution.[18] When one is striving to identify the root of a given phrase, the word that is most responsible for determining the distribution of that phrase as a whole is its root.

14.6 Linear order and discontinuities

Traditionally, DGs have had a different approach to linear order (word order) than constituency grammars. Dependency-based structures are minimal compared to their constituency-based counterparts, and these minimal structures allow one to focus intently on the two ordering dimensions.[19] Separating the vertical dimension (hierarchical order) from the horizontal dimension (linear order) is easily accomplished. This aspect of dependency-based structures has allowed DGs, starting with Tesnière (1959), to focus on hierarchical order in a manner that is hardly possible for constituency grammars. For Tesnière, linear order was secondary to hierarchical order insofar as hierarchical order preceded linear order in the mind of a speaker. The stemmas (trees) that Tesnière produced reflected this view; they abstracted away from linear order to focus almost entirely on hierarchical order. Many DGs that followed Tesnière adopted this practice, that is, they produced tree structures that reflect hierarchical order alone, e.g.

The traditional focus on hierarchical order generated the impression that DGs have little to say about linear order, and it has contributed to the view that DGs are particularly well-suited to examine languages with free word order. A negative result of this focus on hierarchical order, however, is that there is a dearth of dependency-based explorations of particular word order phenomena, such as standard discontinuities. Comprehensive dependency grammar accounts of topicalization, wh-fronting, scrambling, and extraposition are mostly absent from many established dependency-based frameworks. This situation can be contrasted with constituency grammars, which have devoted tremendous effort to exploring these phenomena.
The nature of the dependency relation does not, however, prevent one from focusing on linear order. Dependency-based structures are as capable of exploring word order phenomena as constituency-based structures. The following trees illustrate this point; they represent one way of exploring discontinuities using dependency-based structures. The trees suggest the manner in which common discontinuities can be addressed. An example from German is used to illustrate a scrambling discontinuity:

The a-trees on the left show projectivity violations (= crossing lines), and the b-trees on the right demonstrate one means of addressing these violations. The displaced constituent takes on a word as its head that is not its governor. The words in red mark the catena (=chain) of words that extends from the root of the displaced constituent to the governor of that constituent.[20] Discontinuities are then explored in terms of these catenae. The limitations on topicalization, wh-fronting, scrambling, and extraposition can be explored and identified by examining the nature of the catenae involved.
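The “crossing lines” (projectivity violations) mentioned above can also be checked mechanically. A small sketch, treating each dependency as an arc between 1-based word positions (the arcs below are illustrative, not taken from the German trees):

```python
# An ordered dependency tree drawn above the sentence is projective when no
# two arcs cross; crossing arcs correspond to discontinuities.
def crossing(arc1, arc2):
    """True if the two arcs interleave, i.e. cross when drawn as arcs."""
    (a, b), (c, d) = sorted(arc1), sorted(arc2)
    return a < c < b < d or c < a < d < b

def is_projective(arcs):
    return not any(crossing(x, y)
                   for i, x in enumerate(arcs) for y in arcs[i + 1:])

print(is_projective([(2, 1), (2, 4), (4, 3)]))  # True: the arcs nest
print(is_projective([(2, 1), (2, 4), (3, 5)]))  # False: (2,4) and (3,5) cross
```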

14.7 Syntactic functions

Traditionally, DGs have treated the syntactic functions (= grammatical functions, grammatical relations) as primitive. They posit an inventory of functions (e.g. subject, object, oblique, determiner, attribute, predicative, etc.). These functions can appear as labels on the dependencies in the tree structures, e.g.[21]

The syntactic functions in this tree are shown in green: ATTR (attribute), COMP-P (complement of preposition), COMP-TO (complement of to), DET (determiner), P-ATTR (prepositional attribute), PRED (predicative), SUBJ (subject), TO-COMP (to complement). The functions chosen and abbreviations used in the tree here are merely representative of the general stance of DGs toward the syntactic functions. The actual inventory of functions and designations employed varies from DG to DG.
As a primitive of the theory, the status of these functions is much different than in some constituency grammars. Traditionally, constituency grammars derive the syntactic functions from the constellation. For instance, the object is identified as the NP appearing inside finite VP, and the subject as the NP appearing outside of finite VP. Since DGs reject the existence of a finite VP constituent, they were never presented with the option to view the syntactic functions in this manner. The issue is a question of what comes first: traditionally, DGs take the syntactic functions to be primitive and then derive the constellation from these functions, whereas constituency grammars traditionally take the constellation to be primitive and then derive the syntactic functions from the constellation.
This question about what comes first (the functions or the constellation) is not an inflexible matter. The stances of both grammar types (dependency and constituency grammars) are not narrowly limited to the traditional views. Dependency and constituency are both fully compatible with both approaches to the syntactic functions. Indeed, monostratal systems, be they dependency- or constituency-based, will likely reject the notion that the functions are derived from the constellation or that the constellation is derived from the functions. They will take both to be primitive, which means neither can be derived from the other.

14.8 See also

• Catena

• Constituent

• Dependency relation (in mathematics)

• Discontinuity

• Finite verb

• Lucien Tesnière

• Phrase structure grammar

• Predicate

• Verb phrase

14.9 Notes

[1] Concerning the history of the dependency concept, see Percival (1990).

[2] Concerning the influence of term logic on the theory of grammar, see Percival (1976).

[3] Concerning dependency in the works of Brassai, see Imrényi (2013), and concerning dependency in the works of Tiktin, see Coseriu (1980).

[4] Concerning early dependency grammars that may have developed independently of Tesnière’s work, see for instance Hays (1960), Gaifman (1965), and Robinson (1970).

[5] Some prominent dependency grammars that were well established by the 1980s are from Hudson (1984), Sgall, Hajičová and Panevová (1986), Mel’čuk (1988), and Starosta (1988).

[6] Some prominent dependency grammars from the German schools are from Heringer (1996), Engel (1994), and Eroms (2000); Ágel et al. (2003/6) is a massive two-volume collection of essays on dependency and valence grammars from more than 100 authors.

[7] The minimality of dependency structures is emphasized, for instance, by Ninio (2006) and by Osborne et al. (2011).

[8] Concerning Tesnière’s rejection of the subject-predicate division of the clause, see Tesnière (1959:103–105), and for discussion of empirical considerations that support Tesnière’s point, see Matthews (2007:17ff.), Miller (2011:54ff.), and Osborne et al. (2011:323f.).

[9] The conventions illustrated with trees (a) and (b) are preferred by Osborne et al. (2011, 2013).

[10] Unordered trees like (d) are associated above all with Tesnière’s stemmas and with the syntactic strata of Mel’čuk’s Meaning-Text Theory.

[11] Three major works on Word Grammar are Hudson (1984, 1990, 2007).

[12] Lobin (2003) makes heavy use of these indentations.

[13] For a discussion of semantic, morphological, and syntactic dependencies in Meaning-Text Theory, see Melʹc̆uk (2003:191ff.).

[14] Concerning semantic dependencies, see Melʹc̆uk (2003:192f.).

[15] Concerning morphological dependencies, see Melʹc̆uk (2003:193ff.).

[16] The distinction between head- and dependent-marking was established by Nichols (1986). Nichols was using a dependency-based understanding of these distinctions.

[17] Concerning prosodic dependencies and the analysis of clitics, see Groß (2011).

[18] Distribution is the primary principle used by Owens (1984:36), Schubert (1988:40), and Mel’čuk (2003:200) for discerning syntactic dependencies.

[19] Concerning the importance of the two ordering dimensions, see Tesnière (1959:16ff).

[20] See Osborne et al. (2012) concerning catenae.

[21] For discussion and examples of the labels for syntactic functions that are attached to dependency edges and arcs, see for instance Mel'cuk (1988:22, 69) and van Valin (2001:102ff.).

14.10 References

• Ágel, Vilmos; Eichinger, Ludwig M.; Eroms, Hans Werner; Hellwig, Peter; Heringer, Hans Jürgen; Lobin, Henning, eds. (2003). Dependenz und Valenz: Ein internationales Handbuch der zeitgenössischen Forschung [Dependency and Valency: An International Handbook of Contemporary Research] (in German). Berlin: de Gruyter. ISBN 978-3110141900. Retrieved 24 August 2012.

• Coseriu, E. 1980. Un précurseur méconnu de la syntaxe structurale: H. Tiktin. In Recherches de Linguistique : Hommage à Maurice Leroy. Éditions de l’Université de Bruxelles, 48–62.

• Engel, U. 1994. Syntax der deutschen Sprache, 3rd edition. Berlin: Erich Schmidt Verlag.

• Eroms, Hans-Werner (2000). Syntax der deutschen Sprache. Berlin [u.a.]: de Gruyter. ISBN 978-3110156669. Retrieved 24 August 2012.

• Groß, T. 2011. Clitics in dependency morphology. Depling 2011 Proceedings, 58–68.

• Helbig, Gerhard; Buscha, Joachim (2007). Deutsche Grammatik: ein Handbuch für den Ausländerunterricht [German Grammar: A Handbook for Teaching Foreigners] (6th ed.). Berlin: Langenscheidt. ISBN 978-3-468-49493-2. Retrieved 24 August 2012.

• Heringer, H. 1996. Deutsche Syntax dependentiell. Tübingen: Stauffenburg.

• Hays, D. 1960. Grouping and dependency theories. P-1910, RAND Corporation.

• Hays, D. 1964. Dependency theory: A formalism and some observations. Language, 40: 511-525. Reprinted in Syntactic Theory 1, Structuralist, edited by Fred W. Householder. Penguin, 1972.

• Hudson, Richard (1984). Word grammar (1st publ. ed.). Oxford, OX, England: B. Blackwell. ISBN 978-0631131861.

• Hudson, R. 1990. An English Word Grammar. Oxford: Basil Blackwell.

• Hudson, R. 2007. Language Networks: The New Word Grammar. Oxford University Press.

• Imrényi, A. 2013. Constituency or dependency? Notes on Sámuel Brassai’s syntactic model of Hungarian. In Szigetvári, Péter (ed.), VLlxx. Papers Presented to László Varga on his 70th Birthday. Budapest: Tinta. 167–182.

• Liu, H. 2009. Dependency Grammar: from Theory to Practice. Beijing: Science Press.

• Lobin, H. 2003. Koordinationssyntax als prozedurales Phänomen. Tübingen: Gunter Narr-Verlag.

• Matthews, P. H. (2007). Syntactic Relations: a critical survey (1. publ. ed.). Cambridge: Cambridge University Press. ISBN 9780521608299. Retrieved 24 August 2012.

• Melʹc̆uk, Igor A. (1987). Dependency syntax: theory and practice. Albany: State University of New York Press. ISBN 978-0-88706-450-0. Retrieved 24 August 2012.

• Melʹc̆uk, I. 2003. Levels of dependency in linguistic description: Concepts and problems. In Ágel et al., 170–187.

• Miller, J. 2011. A critical introduction to syntax. London: continuum.

• Nichols, J. 1986. Head-marking and dependent-marking languages. Language 62, 56–119.

• Ninio, A. 2006. Language and the learning curve: A new theory of syntactic development. Oxford: Oxford University Press.

• Osborne, T., M. Putnam, and T. Groß 2011. Bare phrase structure, label-less trees, and specifier-less syntax: Is Minimalism becoming a dependency grammar? The Linguistic Review 28, 315–364.

• Osborne, T., M. Putnam, and T. Groß 2012. Catenae: Introducing a novel unit of syntactic analysis. Syntax 15, 4, 354–396.

• Owens, J. 1984. On getting a head: A problem in dependency grammar. Lingua 66, 25–42. 14.11. EXTERNAL LINKS 63

• Percival, K. 1976. On the historical source of immediate-constituent analysis. In: Notes from the linguistic underground, James McCawley (ed.), Syntax and Semantics 7, 229–242. New York: Academic Press.

• Percival, K. 1990. Reflections on the history of dependency notions in linguistics. Historiographia Linguistica 17, 29–47.

• Robinson, J. 1970. Dependency structures and transformational rules. Language 46, 259–285.

• Schubert, K. 1988. Metataxis: Contrastive dependency syntax for machine translation. Dordrecht: Foris.

• Sgall, P., E. Hajičová, and J. Panevová 1986. The meaning of the sentence in its semantic and pragmatic aspects. Dordrecht: D. Reidel Publishing Company.

• Starosta, S. 1988. The case for lexicase. London: Pinter Publishers.

• Tesnière, L. 1959. Éléments de syntaxe structurale. Paris: Klincksieck.

• Tesnière, L. 1966. Éléments de syntaxe structurale, 2nd edition. Paris: Klincksieck.

• Tesnière, L. 2015. Elements of structural syntax [English translation of Tesnière 1966]. Amsterdam: John Benjamins.

• van Valin, R. 2001. An introduction to syntax. Cambridge, UK: Cambridge University Press.

14.11 External links

• Universal Dependencies – a set of treebanks in a harmonized dependency grammar representation

Chapter 15

Phrase structure grammar

The term phrase structure grammar was originally introduced by Noam Chomsky as the term for grammars as defined by phrase structure rules,[1] i.e. rewrite rules of the type studied previously by Emil Post and Axel Thue (Post canonical systems). Some authors, however, reserve the term for more restricted grammars in the Chomsky hierarchy: context-sensitive grammars, or context-free grammars. In a broader sense, phrase structure grammars are also known as constituency grammars. The defining trait of phrase structure grammars is thus their adherence to the constituency relation, as opposed to the dependency relation of dependency grammars.

15.1 Constituency relation

In linguistics, phrase structure grammars are all those grammars that are based on the constituency relation, as opposed to the dependency relation associated with dependency grammars; hence phrase structure grammars are also known as constituency grammars.[2] Any of several related theories for the parsing of natural language qualify as constituency grammars, and most of them have been developed from Chomsky’s work, including

• Government and Binding Theory,

• Generalized Phrase Structure Grammar,

• Head-Driven Phrase Structure Grammar,

• Lexical Functional Grammar,

• The Minimalist Program, and

• Nanosyntax.

Further grammar frameworks and formalisms also qualify as constituency-based, although they may not think of themselves as having spawned from Chomsky’s work, e.g.

• Arc Pair Grammar.

The fundamental trait that these frameworks all share is that they view sentence structure in terms of the constituency relation. The constituency relation derives from the subject-predicate division of Latin and Greek grammars that is based on term logic and reaches back to Aristotle in antiquity. Basic clause structure is understood in terms of a binary division of the clause into subject (noun phrase NP) and predicate (verb phrase VP). The binary division of the clause results in a one-to-one-or-more correspondence. For each element in a sentence, there are one or more nodes in the tree structure that one assumes for that sentence. A two-word sentence such as Luke laughed necessarily implies three (or more) nodes in the syntactic structure: one for the noun Luke (subject NP), one for the verb laughed (predicate VP), and one for the entirety Luke laughed (sentence S). The constituency grammars listed above all view sentence structure in terms of this one-to-one-or-more correspondence.

64 15.2. DEPENDENCY RELATION 65

15.2 Dependency relation

By the time of Gottlob Frege, a competing understanding of the logic of sentences had arisen. Frege rejected the binary division of the sentence and replaced it with an understanding of sentence logic in terms of predicates and their arguments. On this alternative conception of sentence logic, the binary division of the clause into subject and predicate was not possible. It therefore opened the door to the dependency relation (although the dependency relation had also existed in a less obvious form in traditional grammars long before Frege). The dependency relation was first acknowledged concretely and developed as the basis for a comprehensive theory of syntax and grammar by Lucien Tesnière in his posthumously published work Éléments de syntaxe structurale (Elements of Structural Syntax).[3] The dependency relation is a one-to-one correspondence: for every element (word or morph) in a sentence, there is just one node in the syntactic structure. The distinction is thus a graph-theoretical distinction. The dependency relation restricts the number of nodes in the syntactic structure of a sentence to the exact number of syntactic units (usually words) that that sentence contains. Thus the two-word sentence Luke laughed implies just two syntactic nodes, one for Luke and one for laughed. Some prominent dependency grammars are listed here:

• Algebraic Syntax
• Functional Generative Description
• Lexicase
• Meaning-Text Theory
• Operator Grammar
• Word Grammar

Since these grammars are all based on the dependency relation, they are by definition not phrase structure grammars.

15.3 Non-descript grammars

Other grammars generally avoid attempts to group syntactic units into clusters in a manner that would allow classification in terms of the constituency vs. dependency distinction. In this respect, the following grammar frameworks do not come down solidly on either side of the dividing line:

• Construction grammar
• Cognitive grammar

15.4 See also

• Dependency grammar

• Gottlob Frege

• Lucien Tesnière
• Predicate

• Subject
• Verb phrase

15.5 Notes

[1] See Chomsky (1957).

[2] Matthews (1981:71ff.) provides an insightful discussion of the distinction between constituency- and dependency-based grammars. See also Allerton (1979:238f.), McCawley (1988:13), Mel'cuk (1988:12-14), Borsley (1991:30f.), Sag and Wasow (1999:421f.), van Valin (2001:86ff.).

[3] See Tesnière (1959).

15.6 References

• Allerton, D. 1979. Essentials of grammatical theory. London: Routledge & Kegan Paul.
• Borsley, R. 1991. Syntactic theory: A unified approach. London: Edward Arnold.

• Chomsky, Noam. 1957. Syntactic structures. The Hague/Paris: Mouton.
• Matthews, P. 1981. Syntax. Cambridge, UK: Cambridge University Press. ISBN 978-0521297097.

• McCawley, J. 1988. The syntactic phenomena of English, Vol. 1. Chicago: The University of Chicago Press.
• Mel'cuk, I. 1988. Dependency syntax: Theory and practice. Albany: SUNY Press.

• Sag, I. and T. Wasow. 1999. Syntactic theory: A formal introduction. Stanford, CA: CSLI Publications.
• Tesnière, Lucien. 1959. Éléments de syntaxe structurale. Paris: Klincksieck.

• van Valin, R. 2001. An introduction to syntax. Cambridge, UK: Cambridge University Press.

Chapter 16

Verb phrase

In linguistics, a verb phrase (VP) is a syntactic unit composed of at least one verb and its dependents—objects, complements and other modifiers—but not always including the subject. Thus in the sentence A fat man put the money quickly in the box, the words put the money quickly in the box are a verb phrase; it consists of the verb put and its dependents, but not the subject a fat man. A verb phrase is similar to what is considered a predicate in more traditional grammars. Verb phrases are generally divided into two types: finite, in which the head of the phrase is a finite verb; and nonfinite, where the head is a nonfinite verb, such as an infinitive, participle or gerund. Phrase structure grammars acknowledge both types, but dependency grammars treat the subject as just another verbal dependent, and they do not recognize the finite verbal phrase constituent. Understanding verb phrase analysis thus depends on knowing which theory is in play in a given context.

16.1 Verb phrases in phrase structure grammars

In phrase structure grammars such as generative grammar, the verb phrase is one headed by a verb. It may be composed of only a single verb, but typically it consists of combinations of main and auxiliary verbs, plus optional specifiers, complements (not including subject complements), and adjuncts. For example:

Yankee batters hit the ball well enough to win their first World Series since 2000.

Mary saw the man through the window.

David gave Mary a book.

The first example contains the long verb phrase hit the ball well enough to win their first World Series since 2000; the second is a verb phrase composed of the main verb saw, the complement phrase the man (a noun phrase), and the adjunct phrase through the window (a prepositional phrase). The third example presents three elements, the main verb gave, the noun Mary, and the noun phrase a book, all of which make up the verb phrase. Note that the verb phrase described here corresponds to the predicate of traditional grammar. Current views vary on whether all languages have a verb phrase; some schools of generative grammar (such as Principles and Parameters) hold that all languages have a verb phrase, while others (such as Lexical Functional Grammar) take the view that at least some languages lack a verb phrase constituent, including those languages with a very free word order (the so-called non-configurational languages, such as Japanese, Hungarian, or Australian aboriginal languages), and some languages with a default VSO order (several Celtic and Oceanic languages). Phrase structure grammars view both finite and nonfinite verb phrases as constituent phrases and, consequently, do not draw any key distinction between them. Dependency grammars (described below) are much different in this regard.


16.2 Verb phrases in dependency grammars

While phrase structure grammars (constituency grammars) acknowledge both finite and non-finite VPs as constituents (complete subtrees), dependency grammars reject the former. That is, dependency grammars acknowledge only non-finite VPs as constituents; finite VPs do not qualify as constituents in dependency grammars. For example:

John [has finished the work]. – Finite VP in brackets
John has [finished the work]. – Non-finite VP in brackets

Since has finished the work contains the finite verb has, it is a finite VP, and since finished the work contains the non-finite verb finished but lacks a finite verb, it is a non-finite VP. Similar examples:

They [do not want to try that]. – Finite VP in brackets
They do not [want to try that]. – One non-finite VP in brackets
They do not want [to try that]. – Another non-finite VP in brackets

These examples illustrate well that many clauses can contain more than one non-finite VP, but they generally contain only one finite VP. Starting with Lucien Tesnière (1959),[1] dependency grammars challenge the validity of the initial binary division of the clause into subject (NP) and predicate (VP), which means they reject the notion that the second half of this binary division, i.e. the finite VP, is a constituent. They do, however, readily acknowledge the existence of non-finite VPs as constituents. The two competing views of verb phrases are visible in the following trees:

The constituency tree on the left shows the finite VP has finished the work as a constituent, since it corresponds to a complete subtree. The dependency tree on the right, in contrast, does not acknowledge a finite VP constituent, since there is no complete subtree there that corresponds to has finished the work. Note that the analyses agree concerning the non-finite VP finished the work; both see it as a constituent (complete subtree). Dependency grammars point to the results of many standard constituency tests to back up their stance.[2] For instance, topicalization, pseudoclefting, and answer ellipsis suggest that non-finite VP does, but finite VP does not, exist as a constituent:

*...and [has finished the work], John. – Topicalization

*What John has done is [has finished the work]. – Pseudoclefting

What has John done? – *[Has finished the work]. – Answer ellipsis

The * indicates that the sentence is bad. These data must be compared to the results for non-finite VP:

...and [finished the work], John (certainly) has. – Topicalization

What John has done is [finished the work]. – Pseudoclefting

What has John done? – [Finished the work]. – Answer ellipsis

The strings in brackets are the ones in focus. Attempts to isolate the finite VP in this way fail, but the same attempts with the non-finite VP succeed.[3]

16.3 Verb phrases narrowly defined

Verb phrases are sometimes defined more narrowly, admitting only strictly verbal elements; on this narrower view, a verb phrase consists only of main and auxiliary verbs, plus infinitive or participle constructions.[4] For example, in the following sentences only the words in brackets would be used in forming the verb phrase:

John [has given] Mary a book.
The picnickers [were being eaten] alive by mosquitos.
She [kept screaming] like a football maniac.
Thou [shalt] not [kill].

This narrower definition is often applied in functionalist frameworks and traditional European reference grammars. It is incompatible with the phrase structure model, because the strings in brackets are not constituents under that analysis. It is, however, compatible with dependency grammars and other grammars that view the verb catena (verb chain) as the fundamental unit of syntactic structure, as opposed to the constituent. Furthermore, the verbal elements in brackets are syntactic units consistent with the understanding of predicates in the tradition of predicate calculus.

16.4 See also

• Auxiliary verb

• Constituent

• Dependency grammar

• Finite verb

• Non-configurational language

• Non-finite verb

• Phrase

• Phrase structure grammar

• Predicate (grammar)

16.5 Notes

[1] Concerning Tesnière’s rejection of a finite VP constituent, see Tesnière (1959:103–105).

[2] For a discussion of the evidence for and against a finite VP constituent, see Matthews (2007:17ff.), Miller (2011:54ff.), and Osborne et al. (2011:323f.).

[3] Attempts to motivate the existence of a finite VP constituent tend to confuse the distinction between finite and non-finite VPs. They mistakenly take evidence for a non-finite VP constituent as support for the existence of a finite VP constituent. See for instance Akmajian and Heny (1980:29f., 257ff.), Finch (2000:112), van Valin (2001:111ff.), Kroeger (2004:32ff.), Sobin (2011:30ff.).

[4] Klammer and Schulz (1996:157ff.), for instance, pursue this narrow understanding of verb phrases.

16.6 References

• Akmajian, A. and F. Heny. 1980. An introduction to the principles of transformational syntax. Cambridge, MA: The MIT Press.

• Finch, G. 2000. Linguistic terms and concepts. New York: St. Martin’s Press.
• Klammer, T. and M. Schulz. 1996. Analyzing English grammar. Boston: Allyn and Bacon.

• Kroeger, P. 2004. Analyzing syntax: A lexical-functional approach. Cambridge, UK: Cambridge University Press.

• Matthews, P. 2007. Syntactic relations: A critical survey. Cambridge, UK: Cambridge University Press.
• Miller, J. 2011. A critical introduction to syntax. London: Continuum.

• Osborne, T., M. Putnam, and T. Groß 2011. Bare phrase structure, label-less structures, and specifier-less syntax: Is Minimalism becoming a dependency grammar? The Linguistic Review 28: 315–364.

• Sobin, N. 2011. Syntactic analysis: The basics. Malden, MA: Wiley–Blackwell.
• Tesnière, Lucien. 1959. Éléments de syntaxe structurale. Paris: Klincksieck.

• van Valin, R. 2001. An introduction to syntax. Cambridge, UK: Cambridge University Press.

Chapter 17

Information retrieval

Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on full-text or other content-based indexing. Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.

17.1 Overview

An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy. An object is an entity that is represented by information in a content collection or database. User queries are matched against the database information. However, as opposed to classical SQL queries of a database, in information retrieval the results returned may or may not match the query, so results are typically ranked. This ranking of results is a key difference of information retrieval searching compared to database searching.[1] Depending on the application the data objects may be, for example, text documents, images,[2] audio,[3] mind maps[4] or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata. Most IR systems compute a numeric score on how well each object in the database matches the query, and rank the objects according to this value. The top ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.[5]

17.2 History

The idea of using computers to search for relevant pieces of information was popularized in the article As We May Think by Vannevar Bush in 1945.[6] It would appear that Bush was inspired by patents for a 'statistical machine' - filed by Emanuel Goldberg in the 1920s and '30s - that searched for documents stored on film.[7] The first description of a computer searching for information was given by Holmstrom in 1948,[8] which contains an early mention of the Univac computer. Automated information retrieval systems were introduced in the 1950s; one even featured in the 1957 romantic comedy Desk Set. In the 1960s, the first large information retrieval research group was formed by Gerard Salton at Cornell. By the 1970s several different retrieval techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents).[6] Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.
In 1992, the US Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. The aim of this was to support research within the information retrieval community by supplying the infrastructure that was needed for evaluation of text retrieval methodologies on a very large text collection. This catalyzed research on methods that scale to huge corpora. The introduction of web search engines has boosted the need for very large scale retrieval systems even further.

17.3 Model types

Categorization of IR-models (translated from German entry, original source Dominik Kuropka).

For effectively retrieving relevant documents by IR strategies, the documents are typically transformed into a suitable representation. Each retrieval strategy incorporates a specific model for its document representation purposes. The figure captioned above illustrates the relationship of some common models, categorized according to two dimensions: the mathematical basis and the properties of the model.

17.3.1 First dimension: mathematical basis

• Set-theoretic models represent documents as sets of words or phrases. Similarities are usually derived from set-theoretic operations on those sets. Common models are:
  • Standard Boolean model
  • Extended Boolean model
  • Fuzzy retrieval
• Algebraic models represent documents and queries usually as vectors, matrices, or tuples. The similarity of the query vector and document vector is represented as a scalar value.
  • Vector space model
  • Generalized vector space model
  • (Enhanced) Topic-based Vector Space Model
  • Extended Boolean model
  • Latent semantic indexing, a.k.a. latent semantic analysis
• Probabilistic models treat the process of document retrieval as a probabilistic inference. Similarities are computed as probabilities that a document is relevant for a given query. Probabilistic theorems like Bayes’ theorem are often used in these models.
  • Binary Independence Model
  • Probabilistic relevance model, on which the Okapi (BM25) relevance function is based
  • Uncertain inference
  • Language models
  • Divergence-from-randomness model
  • Latent Dirichlet allocation

• Feature-based retrieval models view documents as vectors of values of feature functions (or just features) and seek the best way to combine these features into a single relevance score, typically by learning to rank methods. Feature functions are arbitrary functions of document and query, and as such can easily incorporate almost any other retrieval model as just another feature.

17.3.2 Second dimension: properties of the model

• Models without term-interdependencies treat different terms/words as independent. This fact is usually represented in vector space models by the orthogonality assumption of term vectors or in probabilistic models by an independency assumption for term variables.

• Models with immanent term interdependencies allow a representation of interdependencies between terms. However, the degree of the interdependency between two terms is defined by the model itself. It is usually directly or indirectly derived (e.g. by dimensional reduction) from the co-occurrence of those terms in the whole set of documents.

• Models with transcendent term interdependencies allow a representation of interdependencies between terms, but they do not allege how the interdependency between two terms is defined. They rely on an external source for the degree of interdependency between two terms (for example, a human assessor or a sophisticated algorithm).

17.4 Performance and correctness measures

Further information: Evaluation measures (information retrieval)

The evaluation of an information retrieval system is the process of assessing how well a system meets the infor- mation needs of its users. Traditional evaluation metrics, designed for Boolean retrieval or top-k retrieval, include precision and recall. Many more measures for evaluating the performance of information retrieval systems have also been proposed. In general, measurement considers a collection of documents to be searched and a search query. All common measures described here assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query. In practice, queries may be ill-posed and there may be different shades of relevancy. Virtually all modern evaluation metrics (e.g., mean average precision, discounted cumulative gain) are designed for ranked retrieval without any explicit rank cutoff, taking into account the relative order of the documents retrieved by the search engines and giving more weight to documents returned at higher ranks. The mathematical symbols used in the formulas below mean:

• X ∩ Y - Intersection - in this case, specifying the documents in both sets X and Y
• |X| - Cardinality - in this case, the number of documents in set X
• ∫ - Integral
• ∑ - Summation
• Δ - Symmetric difference

17.4.1 Precision

Main article: Precision and recall

Precision is the fraction of the documents retrieved that are relevant to the user’s information need.

\[
\text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}
\]

In binary classification, precision is analogous to positive predictive value. Precision takes all retrieved documents into account. It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. This measure is called precision at n or P@n.
Note that the meaning and usage of “precision” in the field of information retrieval differs from the definition of accuracy and precision within other branches of science and statistics.

17.4.2 Recall

Main article: Precision and recall

Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.

\[
\text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}
\]

In binary classification, recall is often called sensitivity. It can thus be looked at as the probability that a relevant document is retrieved by the query.
It is trivial to achieve recall of 100% by returning all documents in response to any query. Therefore, recall alone is not enough: one also needs to measure the number of non-relevant documents retrieved, for example by computing the precision.

17.4.3 Fall-out

The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:

\[
\text{fall-out} = \frac{|\{\text{non-relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{non-relevant documents}\}|}
\]

In binary classification, fall-out is closely related to specificity and is equal to (1 − specificity). It can be looked at as the probability that a non-relevant document is retrieved by the query.
It is trivial to achieve fall-out of 0% by returning zero documents in response to any query.
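As a rough illustration of these three set-based measures, here is a minimal Python sketch (the function and variable names are illustrative, not from any established library; non-empty retrieved and relevant sets are assumed):

```python
def precision(relevant: set, retrieved: set) -> float:
    """Fraction of the retrieved documents that are relevant (retrieved assumed non-empty)."""
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant: set, retrieved: set) -> float:
    """Fraction of the relevant documents that are retrieved (relevant assumed non-empty)."""
    return len(relevant & retrieved) / len(relevant)

def fallout(relevant: set, retrieved: set, collection: set) -> float:
    """Fraction of the non-relevant documents that are retrieved."""
    non_relevant = collection - relevant
    return len(non_relevant & retrieved) / len(non_relevant)

# Toy collection of ten document ids
collection = set(range(10))
relevant = {0, 1, 2, 3}
retrieved = {0, 1, 5, 6}

print(precision(relevant, retrieved))            # 2/4 = 0.5
print(recall(relevant, retrieved))               # 2/4 = 0.5
print(fallout(relevant, retrieved, collection))  # 2/6 ≈ 0.333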

17.4.4 F-score / F-measure

Main article: F-score

The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score is:

\[
F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]

This is also known as the F1 measure, because recall and precision are evenly weighted. The general formula for non-negative real β is:

\[
F_\beta = \frac{(1 + \beta^2) \cdot (\text{precision} \cdot \text{recall})}{\beta^2 \cdot \text{precision} + \text{recall}}
\]

Two other commonly used F measures are the F2 measure, which weights recall twice as much as precision, and the F0.5 measure, which weights precision twice as much as recall.

The F-measure was derived by van Rijsbergen (1979) so that Fβ “measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision”. It is based on van Rijsbergen’s effectiveness measure

\[
E = 1 - \frac{1}{\frac{\alpha}{P} + \frac{1 - \alpha}{R}}
\]

Their relationship is

\[
F_\beta = 1 - E \quad \text{where} \quad \alpha = \frac{1}{1 + \beta^2}
\]

The F-measure can be a better single metric than precision or recall alone: precision and recall give different, complementary information, and the F-measure combines them, so an imbalance between the two is reflected in the score.
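A minimal sketch of the general Fβ formula in Python (names are illustrative; the guard for the all-zero case is an added assumption, since the formula is undefined there):

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.
    beta > 1 weights recall more heavily; beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0  # assumption: define Fβ as 0 when both inputs are 0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.5, 0.5))          # F1 = 0.5
print(f_beta(0.4, 0.8, beta=2))  # F2 weights recall twice as much ≈ 0.667
```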

17.4.5 Average precision

Precision and recall are single-value metrics based on the whole list of documents returned by the system. For systems that return a ranked sequence of documents, it is desirable to also consider the order in which the returned documents are presented. By computing a precision and recall at every position in the ranked sequence of documents, one can plot a precision-recall curve, plotting precision p(r) as a function of recall r . Average precision computes the average value of p(r) over the interval from r = 0 to r = 1 :[9]

\[
\text{AveP} = \int_0^1 p(r)\,dr
\]

That is the area under the precision-recall curve. This integral is in practice replaced with a finite sum over every position in the ranked sequence of documents:

\[
\text{AveP} = \sum_{k=1}^{n} P(k)\,\Delta r(k)
\]

where k is the rank in the sequence of retrieved documents, n is the number of retrieved documents, P(k) is the precision at cut-off k in the list, and Δr(k) is the change in recall from items k − 1 to k.[9] This finite sum is equivalent to:

\[
\text{AveP} = \frac{\sum_{k=1}^{n} \big(P(k) \times \text{rel}(k)\big)}{\text{number of relevant documents}}
\]

where rel(k) is an indicator function equaling 1 if the item at rank k is a relevant document, and zero otherwise.[10] Note that the average is over all relevant documents, so relevant documents that are not retrieved get a precision score of zero.
Some authors choose to interpolate the p(r) function to reduce the impact of “wiggles” in the curve.[11][12] For example, the PASCAL Visual Object Classes challenge (a benchmark for computer vision object detection) computes average precision by averaging the precision over a set of evenly spaced recall levels {0, 0.1, 0.2, ..., 1.0}:[11][12]

\[
\text{AveP} = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1.0\}} p_{\text{interp}}(r)
\]

where p_interp(r) is an interpolated precision that takes the maximum precision over all recalls greater than or equal to r:

\[
p_{\text{interp}}(r) = \max_{\tilde{r} : \tilde{r} \geq r} p(\tilde{r})
\]

An alternative is to derive an analytical p(r) function by assuming a particular parametric distribution for the underlying decision values. For example, a binormal precision-recall curve can be obtained by assuming decision values in both classes to follow a Gaussian distribution.[13]
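The finite-sum form of average precision translates directly into code. A sketch, assuming relevance judgments are given as a 0/1 list in retrieved order; the optional total count of relevant documents implements the convention that relevant documents never retrieved contribute zero:

```python
def average_precision(ranking, n_relevant=None):
    """ranking: 0/1 relevance judgments in retrieved order.
    n_relevant: total number of relevant documents for the query; relevant
    documents that were never retrieved then contribute a precision of zero."""
    hits, score = 0, 0.0
    for k, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision at this relevant rank
    total = n_relevant if n_relevant is not None else hits
    return score / total if total else 0.0

# Relevant documents retrieved at ranks 1, 3 and 6 (all 3 relevant docs found)
print(average_precision([1, 0, 1, 0, 0, 1]))  # (1/1 + 2/3 + 3/6) / 3 ≈ 0.72
```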

17.4.6 Precision at K

For modern (Web-scale) information retrieval, recall is no longer a meaningful metric, as many queries have thousands of relevant documents, and few users will be interested in reading all of them. Precision at k documents (P@k) is still a useful metric (e.g., P@10 or “Precision at 10” corresponds to the number of relevant results on the first search results page), but fails to take into account the positions of the relevant documents among the top k. Another shortcoming is that on a query with fewer relevant results than k, even a perfect system will have a score less than 1.[14] It is easier to score manually since only the top k results need to be examined to determine if they are relevant or not.

17.4.7 R-Precision

R-precision requires knowing all documents that are relevant to a query. The number of relevant documents, R, is used as the cutoff for calculation, and this varies from query to query. For example, if there are 15 documents relevant to “red” in a corpus (R = 15), R-precision for “red” looks at the top 15 documents returned, counts the number that are relevant (r), and turns that into a relevancy fraction: r/R = r/15.[15] Precision is equal to recall at the R-th position.[14] Empirically, this measure is often highly correlated to mean average precision.[14]
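Both P@k and R-precision reduce to counting relevant items in a prefix of the ranking. A sketch under the same 0/1-list convention as above (function names are illustrative):

```python
def precision_at_k(ranking, k):
    """Fraction of the top-k results that are relevant (ranking: 0/1 list)."""
    return sum(ranking[:k]) / k

def r_precision(ranking, n_relevant):
    """Precision at rank R, where R is the total number of relevant documents."""
    return precision_at_k(ranking, n_relevant)

ranking = [1, 0, 1, 1, 0, 0, 1, 0]
print(precision_at_k(ranking, 5))  # 3/5 = 0.6
print(r_precision(ranking, 4))     # top 4 contain 3 relevant -> 0.75
```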

17.4.8 Mean average precision

Mean average precision for a set of queries is the mean of the average precision scores for each query.

\[
\text{MAP} = \frac{\sum_{q=1}^{Q} \text{AveP}(q)}{Q}
\]

where Q is the number of queries.
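MAP is then just the arithmetic mean of per-query average precision. A small sketch reusing the average_precision helper from above:

```python
def mean_average_precision(rankings, n_relevant):
    """rankings: one 0/1 relevance list per query, in retrieved order;
    n_relevant: total number of relevant documents per query.
    Reuses the average_precision helper sketched earlier."""
    aps = [average_precision(r, n) for r, n in zip(rankings, n_relevant)]
    return sum(aps) / len(aps)

# Two queries: AP = 1.0 and AP = (1/2 + 2/3) / 2 ≈ 0.583 -> MAP ≈ 0.79
print(mean_average_precision([[1, 1, 0], [0, 1, 1]], [2, 2]))
```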

17.4.9 Discounted cumulative gain

Main article: Discounted cumulative gain

DCG uses a graded relevance scale of documents from the result set to evaluate the usefulness, or gain, of a document based on its position in the result list. The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized as the graded relevance value is reduced logarithmically proportional to the position of the result. The DCG accumulated at a particular rank position p is defined as:

\[
\text{DCG}_p = \text{rel}_1 + \sum_{i=2}^{p} \frac{\text{rel}_i}{\log_2 i}.
\]

Since result sets may vary in size among different queries or systems, to compare performances the normalised version of DCG uses an ideal DCG. To this end, it sorts the documents of a result list by relevance, producing an ideal DCG at position p (IDCG_p), which normalizes the score:

\[
\text{nDCG}_p = \frac{\text{DCG}_p}{\text{IDCG}_p}.
\]

The nDCG values for all queries can be averaged to obtain a measure of the average performance of a ranking algorithm. Note that for a perfect ranking algorithm, DCG_p will be the same as IDCG_p, producing an nDCG of 1.0. All nDCG calculations are then relative values on the interval 0.0 to 1.0 and so are cross-query comparable.
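A sketch of DCG and nDCG as defined above, assuming graded relevance values in ranked order and at least one non-zero grade (otherwise the normalization would divide by zero):

```python
import math

def dcg(rels):
    """DCG_p = rel_1 + sum_{i=2..p} rel_i / log2(i), as defined above."""
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

def ndcg(rels):
    """Normalize by the DCG of the ideal (relevance-sorted) ordering."""
    ideal = sorted(rels, reverse=True)
    return dcg(rels) / dcg(ideal)  # assumes at least one non-zero grade

# Graded relevance of six results in ranked order
print(ndcg([3, 2, 3, 0, 1, 2]))  # ≈ 0.93
```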

17.4.10 Other measures

• Mean reciprocal rank
• Spearman’s rank correlation coefficient
• bpref - a summation-based measure of how many relevant documents are ranked before irrelevant documents[15]
• GMAP - geometric mean of (per-topic) average precision[15]
• Measures based on marginal relevance and document diversity - see Relevance (information retrieval) § Problems and alternatives

17.4.11 Visualization

Visualizations of information retrieval performance include:

• Graphs which chart precision on one axis and recall on the other[15]
• Histograms of average precision over various topics[15]
• Receiver operating characteristic (ROC curve)
• Confusion matrix

17.5 Timeline

• Before the 1900s
  1801: Joseph Marie Jacquard invents the Jacquard loom, the first machine to use punched cards to control a sequence of operations.
  1880s: Herman Hollerith invents an electro-mechanical data tabulator using punch cards as a machine-readable medium.
  1890: Hollerith cards, keypunches and tabulators used to process the 1890 US Census data.
• 1920s-1930s
  Emanuel Goldberg submits patents for his “Statistical Machine”, a document search engine that used photoelectric cells and pattern recognition to search the metadata on rolls of microfilmed documents.
• 1940s–1950s
  late 1940s: The US military confronted problems of indexing and retrieval of wartime scientific research documents captured from the Germans.
  1945: Vannevar Bush's As We May Think appeared in Atlantic Monthly.
  1947: Hans Peter Luhn (research engineer at IBM since 1941) began work on a mechanized punch card-based system for searching chemical compounds.
  1950s: Growing concern in the US for a “science gap” with the USSR motivated, encouraged funding and provided a backdrop for mechanized literature searching systems (Allen Kent et al.) and the invention of citation indexing (Eugene Garfield).
  1950: The term “information retrieval” was coined by Calvin Mooers.[19]
  1951: Philip Bagley conducted the earliest experiment in computerized document retrieval in a master thesis at MIT.[20]
  1955: Allen Kent joined Case Western Reserve University, and eventually became associate director of the Center for Documentation and Communications Research. That same year, Kent and colleagues published a paper in American Documentation describing the precision and recall measures as well as detailing a proposed “framework” for evaluating an IR system which included statistical sampling methods for determining the number of relevant documents not retrieved.[21]

  1958: International Conference on Scientific Information, Washington DC, included consideration of IR systems as a solution to problems identified. See: Proceedings of the International Conference on Scientific Information, 1958 (National Academy of Sciences, Washington, DC, 1959)
  1959: Hans Peter Luhn published “Auto-encoding of documents for information retrieval.”

• 1960s:

  early 1960s: Gerard Salton began work on IR at Harvard, later moved to Cornell.
  1960: Melvin Earl Maron and John Lary Kuhns[22] published “On relevance, probabilistic indexing, and information retrieval” in the Journal of the ACM 7(3):216–244, July 1960.
  1962:
    • Cyril W. Cleverdon published early findings of the Cranfield studies, developing a model for IR system evaluation. See: Cyril W. Cleverdon, “Report on the Testing and Analysis of an Investigation into the Comparative Efficiency of Indexing Systems”. Cranfield Collection of Aeronautics, Cranfield, England, 1962.
    • Kent published Information Analysis and Retrieval.
  1963:
    • Weinberg report “Science, Government and Information” gave a full articulation of the idea of a “crisis of scientific information.” The report was named after Dr. Alvin Weinberg.
    • Joseph Becker and Robert M. Hayes published a text on information retrieval: Becker, Joseph; Hayes, Robert Mayo. Information storage and retrieval: tools, elements, theories. New York, Wiley (1963).
  1964:
    • Karen Spärck Jones finished her thesis at Cambridge, Synonymy and Semantic Classification, and continued work on computational linguistics as it applies to IR.
    • The National Bureau of Standards sponsored a symposium titled “Statistical Association Methods for Mechanized Documentation.” Several highly significant papers, including G. Salton’s first published reference (we believe) to the SMART system.
  mid-1960s:
    • National Library of Medicine developed MEDLARS Medical Literature Analysis and Retrieval System, the first major machine-readable database and batch-retrieval system.
    • Project Intrex at MIT.
  1965: J. C. R. Licklider published Libraries of the Future.
  1966: Don Swanson was involved in studies at University of Chicago on Requirements for Future Catalogs.
  late 1960s: F. Wilfrid Lancaster completed evaluation studies of the MEDLARS system and published the first edition of his text on information retrieval.
  1968:
    • Gerard Salton published Automatic Information Organization and Retrieval.
    • John W. Sammon, Jr.'s RADC Tech report “Some Mathematics of Information Storage and Retrieval...” outlined the vector model.
  1969: Sammon’s “A nonlinear mapping for data structure analysis” (IEEE Transactions on Computers) was the first proposal for a visualization interface to an IR system.

• 1970s

  early 1970s:
    • First online systems—NLM’s AIM-TWX, MEDLINE; Lockheed’s Dialog; SDC’s ORBIT.
    • Theodor Nelson, promoting the concept of hypertext, published Computer Lib/Dream Machines.

  1971: Nicholas Jardine and Cornelis J. van Rijsbergen published “The use of hierarchic clustering in information retrieval”, which articulated the “cluster hypothesis.”[23]
  1975: Three highly influential publications by Salton fully articulated his vector processing framework and term discrimination model:
    • A Theory of Indexing (Society for Industrial and Applied Mathematics)
    • A Theory of Term Importance in Automatic Text Analysis (JASIS v. 26)
    • A Vector Space Model for Automatic Indexing (CACM 18:11)
  1978: The First ACM SIGIR conference.
  1979: C. J. van Rijsbergen published Information Retrieval (Butterworths). Heavy emphasis on probabilistic models.
  1979: Tamas Doszkocs implemented the CITE natural language user interface for MEDLINE at the National Library of Medicine. The CITE system supported free form query input, ranked output and relevance feedback.[24]
• 1980s
  1980: First international ACM SIGIR conference, joint with British Computer Society IR group in Cambridge.
  1982: Nicholas J. Belkin, Robert N. Oddy, and Helen M. Brooks proposed the ASK (Anomalous State of Knowledge) viewpoint for information retrieval. This was an important concept, though their automated analysis tool proved ultimately disappointing.
  1983: Salton (and Michael J. McGill) published Introduction to Modern Information Retrieval (McGraw-Hill), with heavy emphasis on vector models.
  1985: David Blair and Bill Maron publish An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System.
  mid-1980s: Efforts to develop end-user versions of commercial IR systems.
  1985–1993: Key papers on and experimental systems for visualization interfaces. Work by Donald B. Crouch, Robert R. Korfhage, Matthew Chalmers, Anselm Spoerri and others.
  1989: First World Wide Web proposals by Tim Berners-Lee at CERN.
• 1990s
  1992: First TREC conference.
  1997: Publication of Korfhage's Information Storage and Retrieval[25] with emphasis on visualization and multi-reference point systems.
  late 1990s: Web search engines implement many features formerly found only in experimental IR systems. Search engines become the most common and maybe best instantiation of IR models.

17.6 Awards in the field

• Tony Kent Strix award
• Gerard Salton Award

17.7 Leading IR Research Groups

• Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts Amherst[26]
• Information Retrieval Group at the University of Glasgow[27]
• Information and Language Processing Systems (ILPS) at the University of Amsterdam[28]
• Language Technologies Institute (LTI) at Carnegie Mellon University
• Text Information Management and Analysis Group (TIMAN) at the University of Illinois at Urbana-Champaign

17.8 See also

• Adversarial information retrieval

• Collaborative information seeking

• Controlled vocabulary

• Cross-language information retrieval

• Data mining

• European Summer School in Information Retrieval

• Human–computer information retrieval (HCIR)

• Information extraction

• Information Retrieval Facility

• Knowledge visualization

• Multimedia information retrieval

• Personal information management

• Relevance (Information Retrieval)

• Relevance feedback

• Rocchio Classification

• Search index

• Social information seeking

• Special Interest Group on Information Retrieval

• Subject indexing

• Temporal information retrieval

• tf-idf

• XML-Retrieval

17.9 References

[1] Jansen, B. J. and Rieh, S. (2010) The Seventeen Theoretical Constructs of Information Searching and Information Retrieval. Journal of the American Society for Information Sciences and Technology. 61(8), 1517-1534.

[2] Goodrum, Abby A. (2000). “Image Information Retrieval: An Overview of Current Research”. Informing Science. 3 (2).

[3] Foote, (1999). “An overview of audio information retrieval”. Multimedia Systems. Springer.

[4] Beel, Jöran; Gipp, Bela; Stiller, Jan-Olaf (2009). Information Retrieval On Mind Maps - What Could It Be Good For?. Proceedings of the 5th International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom'09). Washington, DC: IEEE.

[5] Frakes, William B. (1992). Information Retrieval Data Structures & Algorithms. Prentice-Hall, Inc. ISBN 0-13-463837-9.

[6] Singhal, Amit (2001). “Modern Information Retrieval: A Brief Overview” (PDF). Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 24 (4): 35–43.

[7] Mark Sanderson & W. Bruce Croft (2012). “The History of Information Retrieval Research”. Proceedings of the IEEE. 100: 1444–1451. doi:10.1109/jproc.2012.2189916. 17.10. FURTHER READING 81

[8] JE Holmstrom (1948). “Section III. Opening Plenary Session”. The Royal Society Scientific Information Conference, 21 June–2 July 1948: report and papers submitted: 85.

[9] Zhu, Mu (2004). “Recall, Precision and Average Precision” (PDF).

[10] Turpin, Andrew; Scholer, Falk (2006). “User performance versus precision measures for simple search tasks”. Proceedings of the 29th Annual international ACM SIGIR Conference on Research and Development in information Retrieval (Seattle, WA, August 06–11, 2006). New York, NY: ACM: 11–18. doi:10.1145/1148170.1148176. ISBN 1-59593-369-7.

[11] Everingham, Mark; Van Gool, Luc; Williams, Christopher K. I.; Winn, John; Zisserman, Andrew (June 2010). “The PASCAL Visual Object Classes (VOC) Challenge” (PDF). International Journal of Computer Vision. Springer. 88 (2): 303–338. doi:10.1007/s11263-009-0275-4. Retrieved 2011-08-29.

[12] Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2008). Introduction to Information Retrieval. Cambridge University Press.

[13] K.H. Brodersen, C.S. Ong, K.E. Stephan, J.M. Buhmann (2010). The binormal assumption on precision-recall curves. Proceedings of the 20th International Conference on Pattern Recognition, 4263-4266.

[14] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (2009). “Chapter 8: Evaluation in information re- trieval” (PDF). Retrieved 2015-06-14. Part of Introduction to Information Retrieval

[15] http://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf

[16] Fawcett, Tom (2006). “An Introduction to ROC Analysis” (PDF). Pattern Recognition Letters. 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010.

[17] Powers, David M W (2011). “Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation” (PDF). Journal of Machine Learning Technologies. 2 (1): 37–63.

[18] Ting, Kai Ming (2011). Encyclopedia of machine learning. Springer. ISBN 978-0-387-30164-8.

[19] Mooers, Calvin N.; The Theory of Digital Handling of Non-numerical Information and its Implications to Machine Economics (Zator Technical Bulletin No. 48), cited in Fairthorne, R. A. (1958). “Automatic Retrieval of Recorded Information”. The Computer Journal. 1 (1): 37. doi:10.1093/comjnl/1.1.36.

[20] Doyle, Lauren; Becker, Joseph (1975). Information Retrieval and Processing. Melville. pp. 410 pp. ISBN 0-471-22151-1.

[21] “Machine literature searching X. Machine language; factors underlying its design and development”. doi:10.1002/asi.5090060411.

[22] Maron, Melvin E. (2008). “An Historical Note on the Origins of Probabilistic Indexing” (PDF). Information Processing and Management. 44 (2): 971–972. doi:10.1016/j.ipm.2007.02.012.

[23] N. Jardine, C.J. van Rijsbergen (December 1971). “The use of hierarchic clustering in information retrieval”. Information Storage and Retrieval. 7 (5): 217–240. doi:10.1016/0020-0271(71)90051-9.

[24] Doszkocs, T.E. & Rapp, B.A. (1979). “Searching MEDLINE in English: a Prototype User Interface with Natural Language Query, Ranked Output, and Relevance Feedback,” In: Proceedings of the ASIS Annual Meeting, 16: 131-139.

[25] Korfhage, Robert R. (1997). Information Storage and Retrieval. Wiley. pp. 368 pp. ISBN 978-0-471-14338-3.

[26] “Center for Intelligent Information Retrieval | UMass Amherst”. ciir.cs.umass.edu. Retrieved 2016-07-29.

[27] “University of Glasgow - Schools - School of Computing Science - Research - Research overview - Information Retrieval”. www.gla.ac.uk. Retrieved 2016-07-29.

[28] “ILPS - information and language processing systems”. ILPS. Retrieved 2016-07-29.

17.10 Further reading

• Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

• Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge, Mass., 2010.

17.11 External links

• ACM SIGIR: Information Retrieval Special Interest Group

• BCS IRSG: British Computer Society - Information Retrieval Specialist Group
• Text Retrieval Conference (TREC)

• Forum for Information Retrieval Evaluation (FIRE)

• Information Retrieval (online book) by C. J. van Rijsbergen
• Information Retrieval Wiki

• Information Retrieval Facility
• Information Retrieval @ DUTH

• TREC report on information retrieval evaluation techniques
• How eBay measures search relevance

• Information retrieval performance evaluation tool @ Athena Research Centre

Chapter 18

Vector space model

Vector space model or term vector model is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for example, index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.

18.1 Definitions

Documents and queries are represented as vectors.

\[
d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})
\]
\[
q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})
\]

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting (see the example below).
The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus).
Vector operations can be used to compare documents with queries.

18.2 Applications

Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document sim- ilarities theory, by comparing the deviation of angles between each document vector and the original query vector where the query is represented as the same kind of vector as the documents. In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself:

\[
\cos\theta = \frac{d_2 \cdot q}{\|d_2\|\,\|q\|}
\]

where $d_2 \cdot q$ is the intersection (i.e. the dot product) of the document vector $d_2$ and the query vector $q$, $\|d_2\|$ is the norm of vector $d_2$, and $\|q\|$ is the norm of vector $q$. The norm of a vector is calculated as:

\[
\|q\| = \sqrt{\sum_{i=1}^{n} q_i^2}
\]

83 84 CHAPTER 18. VECTOR SPACE MODEL

As all vectors under consideration by this model are elementwise nonnegative, a cosine value of zero means that the query and document vector are orthogonal and have no match (i.e. the query term does not exist in the document being considered). See cosine similarity for further information.
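A minimal sketch of this cosine computation over plain weight vectors (pure Python, no libraries; names are illustrative):

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two term-weight vectors (lists of floats)."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_d * norm_q)

doc = [0.0, 1.2, 0.0, 3.4]
query = [0.0, 1.0, 0.0, 1.0]
print(cosine_similarity(doc, query))  # ≈ 0.90
```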

18.3 Example: tf-idf weights

In the classic vector space model proposed by Salton, Wong and Yang[1] the term-specific weights in the document vectors are products of local and global parameters. The model is known as the term frequency–inverse document frequency model. The weight vector for document d is $v_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$, where

\[
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{|D|}{|\{d' \in D \mid t \in d'\}|}
\]

and

• $\mathrm{tf}_{t,d}$ is the term frequency of term t in document d (a local parameter)
• $\log \frac{|D|}{|\{d' \in D \mid t \in d'\}|}$ is the inverse document frequency (a global parameter); $|D|$ is the total number of documents in the document set, and $|\{d' \in D \mid t \in d'\}|$ is the number of documents containing the term t.

Using the cosine, the similarity between document $d_j$ and query $q$ can be calculated as:

\[
\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{\|d_j\|\,\|q\|} = \frac{\sum_{i=1}^{N} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^2}\, \sqrt{\sum_{i=1}^{N} w_{i,q}^2}}
\]
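Putting the pieces together, the following sketch builds tf–idf weight vectors for a toy corpus (as sparse dicts rather than dense lists) and ranks documents against a query by cosine similarity. It follows the w = tf · log(|D|/df) scheme above; the helper names and the toy corpus are illustrative:

```python
import math
from collections import Counter

def idf(docs):
    """Global parameter: log(|D| / df(t)) per term, from the whole corpus."""
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(N / df[t]) for t in df}

def weight(doc, idf_):
    """Local * global: tf(t,d) * idf(t) for each term in the document."""
    tf = Counter(doc)
    return {t: tf[t] * idf_.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["the", "brown", "cow"],
        ["the", "brown", "fox"],
        ["the", "lazy", "dog"]]
idf_ = idf(docs)
doc_vecs = [weight(d, idf_) for d in docs]
q_vec = weight(["brown", "cow"], idf_)
# Rank document indices by cosine similarity to the query
print(sorted(range(len(docs)), key=lambda j: cosine(doc_vecs[j], q_vec), reverse=True))
```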

18.4 Advantages

The vector space model has the following advantages over the Standard Boolean model:

1. Simple model based on linear algebra
2. Term weights not binary
3. Allows computing a continuous degree of similarity between queries and documents
4. Allows ranking documents according to their possible relevance
5. Allows partial matching

Most of these advantages are a consequence of the difference in the density of the document collection representation between Boolean and tf-idf approaches. When using Boolean weights, any document lies in a vertex of an n-dimensional hypercube. Therefore, the number of possible document representations is $2^n$ and the maximum Euclidean distance between pairs is $\sqrt{n}$. As documents are added to the document collection, the region defined by the hypercube's vertices becomes more populated and hence denser. Unlike Boolean, when a document is added using tf-idf weights, the idfs of the terms in the new document decrease while those of the remaining terms increase. On average, as documents are added, the region where documents lie expands, regulating the density of the entire collection representation. This behavior models the original motivation of Salton and his colleagues that a document collection represented in a low density region could yield better retrieval results.

18.5 Limitations

The vector space model has the following limitations:

1. Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)
2. Search keywords must precisely match document terms; word substrings might result in a "false positive match"
3. Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match"
4. The order in which the terms appear in the document is lost in the vector space representation
5. The model theoretically assumes terms are statistically independent
6. Weighting is intuitive but not very formal

Many of these difficulties can, however, be overcome by the integration of various tools, including mathematical techniques such as singular value decomposition and lexical databases such as WordNet.

18.6 Models based on and extending the vector space model

Models based on and extending the vector space model include:

• Generalized vector space model

• Latent semantic analysis
• Term Discrimination
• Rocchio Classification
• Random Indexing

18.7 Software that implements the vector space model

The following software packages may be of interest to those wishing to experiment with vector models and implement search services based upon them.

18.7.1 Free open source software

• Apache Lucene. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
• Gensim. Gensim is a Python+NumPy framework for Vector Space modelling. It contains incremental (memory-efficient) algorithms for tf–idf, Latent Semantic Indexing, Random Projections and Latent Dirichlet Allocation.
• Weka. Weka is a popular data mining package for Java including WordVectors and Bag Of Words models.

18.8 Further reading

• G. Salton, A. Wong, and C. S. Yang (1975), “A Vector Space Model for Automatic Indexing,” Communications of the ACM, vol. 18, nr. 11, pages 613–620. (Article in which a vector space model was presented)
• David Dubin (2004), The Most Influential Paper Gerard Salton Never Wrote (Explains the history of the Vector Space Model and the non-existence of a frequently cited publication)
• Description of the vector space model
• Description of the classic vector space model by Dr E. Garcia
• Relationship of vector space search to the “k-Nearest Neighbor” search

18.9 See also

• Bag-of-words model
• Compound term processing
• Conceptual space
• Eigenvalues and eigenvectors
• Inverted index
• Nearest neighbor search
• Sparse distributed memory
• w-shingling

18.10 References

[1] G. Salton, A. Wong, C. S. Yang, A vector space model for automatic indexing, Communications of the ACM, v.18 n.11, p.613-620, Nov. 1975

Chapter 19

tf–idf

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.[1] It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes; for instance, 83% of text-based recommender systems in the domain of digital libraries use tf-idf.[2]
Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification.
One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.

19.1 Motivation

19.1.1 Term frequency

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query “the brown cow”. A simple way to start out is by eliminating documents that do not contain all three words “the”, “brown”, and “cow”, but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency. The first form of term weighting is due to Hans Peter Luhn (1957) and is based on the Luhn Assumption:

• The weight of a term that occurs in a document is simply proportional to the term frequency. [3]

19.1.2 Inverse document frequency

Because the term “the” is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word “the” more frequently, without giving enough weight to the more meaningful terms “brown” and “cow”. The term “the” is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less common words “brown” and “cow”. Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
Karen Spärck Jones (1972) conceived a statistical interpretation of term specificity called Inverse Document Frequency (IDF), which became a cornerstone of term weighting:

• The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs. [4]

87 88 CHAPTER 19. TF–IDF

19.2 Definition

tf–idf is the product of two statistics, term frequency and inverse document frequency. Various ways for determining the exact values of both statistics exist.

19.2.1 Term frequency

In the case of the term frequency tf(t,d), the simplest choice is to use the raw frequency of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw frequency of t by $f_{t,d}$, then the simple tf scheme is tf(t,d) = $f_{t,d}$. Other possibilities, sketched in code after the list, include:[5]:128

• Boolean “frequencies": tf(t,d) = 1 if t occurs in d and 0 otherwise;

• logarithmically scaled frequency: tf(t,d) = 1 + log ft,d, or zero if ft,d is zero;

• augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the maximum raw frequency of any term in the document:

\[
\mathrm{tf}(t,d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\{f_{t',d} : t' \in d\}}
\]
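As a rough sketch of these weighting variants (the natural logarithm is an assumption; the text does not fix a base, and the function names are illustrative):

```python
import math
from collections import Counter

def tf_raw(count):
    """Raw frequency: tf(t,d) = f_{t,d}."""
    return count

def tf_boolean(count):
    """Boolean 'frequency': 1 if the term occurs in the document, 0 otherwise."""
    return 1 if count > 0 else 0

def tf_log(count):
    """Logarithmically scaled frequency: 1 + log f_{t,d}, or 0 if absent."""
    return 1 + math.log(count) if count > 0 else 0

def tf_augmented(count, counts):
    """Raw frequency divided by the maximum raw frequency in the document."""
    return 0.5 + 0.5 * count / max(counts.values())

counts = Counter("this is a a sample".split())
print(tf_raw(counts["a"]))                   # 2
print(tf_log(counts["a"]))                   # 1 + ln 2 ≈ 1.69
print(tf_augmented(counts["a"], counts))     # 0.5 + 0.5*2/2 = 1.0
print(tf_augmented(counts["this"], counts))  # 0.5 + 0.5*1/2 = 0.75
```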

19.2.2 Inverse document frequency

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

\[
\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}
\]

with

• N : total number of documents in the corpus N = |D|

• |{d ∈ D : t ∈ d}| : number of documents where the term t appears (i.e., tf(t, d) ≠ 0 ). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator to 1 + |{d ∈ D : t ∈ d}| .

19.2.3 Term frequency–Inverse document frequency

Then tf–idf is calculated as

tfidf(t, d, D) = tf(t, d) · idf(t, D)

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf’s log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0. 19.3. JUSTIFICATION OF IDF 89

19.3 Justification of idf

Idf was introduced, as “term specificity”, by Karen Spärck Jones in a 1972 paper. Although it has worked well as a heuristic, its theoretical foundations have been troublesome for at least three decades afterward, with many researchers trying to find information theoretic justifications for it.[6] Spärck Jones’s own explanation did not propose much theory, aside from a connection to Zipf’s law.[6] Attempts have been made to put idf on a probabilistic footing,[7] by estimating the probability that a given document d contains a term t as the relative document frequency,

\[
P(t|d) = \frac{|\{d \in D : t \in d\}|}{N},
\]

so that we can define idf as

\[
\mathrm{idf} = -\log P(t|d) = \log \frac{1}{P(t|d)} = \log \frac{N}{|\{d \in D : t \in d\}|}
\]

Namely, the inverse document frequency is the logarithm of the “inverse” relative document frequency.
This probabilistic interpretation in turn takes the same form as that of self-information. However, applying such information-theoretic notions to problems in information retrieval leads to problems when trying to define the appropriate event spaces for the required probability distributions: not only documents need to be taken into account, but also queries and terms.[6]

19.4 Example of tf–idf

Suppose that we have the term counts of a corpus consisting of only two documents: document 1 contains 5 terms, with “this” occurring once; document 2 contains 7 terms, with “this” occurring once and “example” occurring three times. The calculation of tf–idf for the term “this” is performed as follows: In its raw frequency form, tf is just the frequency of “this” for each document. In each document, the word “this” appears once; but as document 2 has more words, its relative frequency is smaller.

\[
\mathrm{tf}(\text{“this”}, d_1) = \frac{1}{5} = 0.2
\]
\[
\mathrm{tf}(\text{“this”}, d_2) = \frac{1}{7} \approx 0.14
\]

An idf is constant per corpus, and accounts for the ratio of documents that include the word “this”. In this case, we have a corpus of two documents and all of them include the word “this”.

\[
\mathrm{idf}(\text{“this”}, D) = \log\left(\frac{2}{2}\right) = 0
\]

So tf–idf is zero for the word “this”, which implies that the word is not very informative as it appears in all documents.

\[
\mathrm{tfidf}(\text{“this”}, d_1) = 0.2 \times 0 = 0
\]
\[
\mathrm{tfidf}(\text{“this”}, d_2) = 0.14 \times 0 = 0
\]

A slightly more interesting example arises from the word “example”, which occurs three times but only in the second document:

\[
\mathrm{tf}(\text{“example”}, d_1) = \frac{0}{5} = 0
\]
\[
\mathrm{tf}(\text{“example”}, d_2) = \frac{3}{7} \approx 0.429
\]
\[
\mathrm{idf}(\text{“example”}, D) = \log\left(\frac{2}{1}\right) = 0.301
\]

Finally,

\[
\mathrm{tfidf}(\text{“example”}, d_1) = \mathrm{tf}(\text{“example”}, d_1) \times \mathrm{idf}(\text{“example”}, D) = 0 \times 0.301 = 0
\]
\[
\mathrm{tfidf}(\text{“example”}, d_2) = \mathrm{tf}(\text{“example”}, d_2) \times \mathrm{idf}(\text{“example”}, D) = 0.429 \times 0.301 \approx 0.13
\]

(using the base 10 logarithm).
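The whole worked example can be reproduced in a few lines. The original term-count tables are not shown in this text, so the two toy documents below are reconstructions consistent with the counts used above (5 terms with “this” once in d1; 7 terms with “this” once and “example” three times in d2):

```python
import math

# Reconstructed toy documents matching the counts in the worked example
d1 = ["this", "is", "a", "a", "sample"]
d2 = ["this", "is", "another", "another", "example", "example", "example"]
corpus = [d1, d2]

def tf(term, doc):
    return doc.count(term) / len(doc)  # relative frequency, as in the example

def idf(term, corpus):
    # Note: a term absent from every document would divide by zero here;
    # the text suggests adjusting the denominator to 1 + df in that case.
    n_containing = sum(term in doc for doc in corpus)
    return math.log10(len(corpus) / n_containing)  # base-10 log, as in the text

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tfidf("this", d1, corpus))     # 0.2   * 0     = 0
print(tfidf("this", d2, corpus))     # 0.14  * 0     = 0
print(tfidf("example", d2, corpus))  # 0.429 * 0.301 ≈ 0.13
```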

19.5 tf-idf Beyond Terms

The idea behind tf–idf has also been applied to entities other than terms. In 1998, the concept of idf was applied to citations.[8] The authors argued that “if a very uncommon citation is shared by two documents, this should be weighted more highly than a citation made by a large number of documents”. In addition, tf–idf was applied to “visual words” for the purpose of object matching in videos,[9] and to entire sentences.[10] However, the concept of tf–idf did not prove to be more effective in all cases than a plain tf scheme (without idf). When tf–idf was applied to citations, researchers could find no improvement over a simple citation-count weight that had no idf component.[11]

19.6 tf–idf derivatives

A number of term-weighting schemes have been derived from tf–idf. One of them is TF–PDF (term frequency * proportional document frequency).[12] TF–PDF was introduced in 2001 in the context of identifying emerging topics in the media. The PDF component measures the difference of how often a term occurs in different domains. Another derivative is TF–IDuF.[13] In TF–IDuF, idf is not calculated based on the document corpus that is to be searched or recommended; instead, idf is calculated based on users’ personal document collections. The authors report that TF–IDuF was as effective as tf–idf but could also be applied in situations when, e.g., a user-modeling system has no access to a global document corpus.
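Since the text describes TF–IDuF only at a high level, the following Python sketch should be read as an illustration of the stated idea - computing idf over a user's personal document collection instead of the global corpus - rather than as the authors' exact formulation; all names are mine:

import math

def idf_over_user_collection(term, user_docs):
    # Standard idf, but computed over the user's personal collection
    # (sets of terms) rather than the global corpus; df must be > 0.
    df = sum(1 for d in user_docs if term in d)
    return math.log10(len(user_docs) / df)

def tf_iduf(term, doc_tokens, user_docs):
    # doc_tokens: list of tokens of the document being scored.
    tf = doc_tokens.count(term) / len(doc_tokens)
    return tf * idf_over_user_collection(term, user_docs)

personal_docs = [{"retrieval", "ranking"}, {"ranking", "evaluation"}]
print(tf_iduf("retrieval", ["ranking", "models", "for", "retrieval"], personal_docs))  # ≈ 0.075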

19.7 See also

• Okapi BM25

• Noun phrase

• Word count

• Vector space model

• PageRank

• Kullback–Leibler divergence

• Mutual information

• Latent semantic analysis

• Latent semantic indexing

• Latent Dirichlet allocation

19.8 References

[1] Rajaraman, A.; Ullman, J. D. (2011). “Data Mining”. Mining of Massive Datasets (PDF). pp. 1–17. doi:10.1017/CBO9781139058452.002. ISBN 9781139058452.

[2] Breitinger, Corinna; Gipp, Bela; Langer, Stefan (2015-07-26). “Research-paper recommender systems: a literature survey”. International Journal on Digital Libraries. 17 (4): 305–338. doi:10.1007/s00799-015-0156-0. ISSN 1432-5012.

[3] Luhn, Hans Peter (1957). “A Statistical Approach to Mechanized Encoding and Searching of Literary Information” (PDF). IBM Journal of Research and Development. IBM. 1 (4): 315. doi:10.1147/rd.14.0309. Retrieved 2 March 2015. There is also the probability that the more frequently a notion and combination of notions occur, the more importance the author attaches to them as reflecting the essence of his overall idea.

[4] Spärck Jones, K. (1972). “A Statistical Interpretation of Term Specificity and Its Application in Retrieval”. Journal of Documentation. 28: 11–21. doi:10.1108/eb026526.

[5] Manning, C. D.; Raghavan, P.; Schutze, H. (2008). “Scoring, term weighting, and the vector space model”. Introduction to Information Retrieval (PDF). p. 100. doi:10.1017/CBO9780511809071.007. ISBN 9780511809071.

[6] Robertson, S. (2004). “Understanding inverse document frequency: On theoretical arguments for IDF”. Journal of Documentation. 60 (5): 503–520. doi:10.1108/00220410410560582.

[7] See also Probability estimates in practice in Introduction to Information Retrieval.

[8] Bollacker, Kurt D.; Lawrence, Steve; Giles, C. Lee (1998-01-01). “CiteSeer: An Autonomous Web Agent for Automatic Retrieval and Identification of Interesting Publications”. Proceedings of the Second International Conference on Autonomous Agents. AGENTS '98. New York, NY, USA: ACM: 116–123. doi:10.1145/280765.280786. ISBN 0897919831.

[9] Sivic, Josef; Zisserman, Andrew (2003-01-01). “Video Google: A Text Retrieval Approach to Object Matching in Videos”. Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2. ICCV '03. Washington, DC, USA: IEEE Computer Society: 1470–. ISBN 0769519504.

[10] Seki, Yohei. “Sentence Extraction by tf/idf and Position Weighting from Newspaper Articles” (PDF). National Institute of Informatics. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings3/NTCIR3-TSC-SekiY.pdf

[11] Beel, Joeran; Breitinger, Corinna (2017). “Evaluating the CC-IDF citation-weighting scheme - How effectively can 'Inverse Document Frequency' (IDF) be applied to references?" (PDF). Proceedings of the 12th iConference.

[12] Khoo Khyou Bun; Bun, Khoo Khyou; Ishizuka, M. “Emerging Topic Tracking System”. Proceedings Third International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems. WECWIS 2001. doi:10.1109/wecwis.2001.933900.

[13] Langer, Stefan; Gipp, Bela (2017). “TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections” (PDF). iConference.

• Salton, G; McGill, M. J. (1986). Introduction to modern information retrieval. McGraw-Hill. ISBN 978-0070544840.

• Salton, G.; Fox, E. A.; Wu, H. (1983). “Extended Boolean information retrieval”. Communications of the ACM. 26 (11): 1022–1036. doi:10.1145/182.358466.

• Salton, G.; Buckley, C. (1988). “Term-weighting approaches in automatic text retrieval”. Information Processing & Management. 24 (5): 513–523. doi:10.1016/0306-4573(88)90021-0.

• Wu, H. C.; Luk, R. W. P.; Wong, K. F.; Kwok, K. L. (2008). “Interpreting TF-IDF term weights as making relevance decisions”. ACM Transactions on Information Systems. 26 (3): 1. doi:10.1145/1361684.1361686.

19.9 External links and suggested reading

• TFxIDF Repository: A definitive guide to the variants and their evolution.

• Gensim is a Python library for vector space modeling and includes tf–idf weighting.

• Robust Hyperlinking: An application of tf–idf for stable document addressability.

• A demo of using tf–idf with PHP and Euclidean distance for Classification

• Anatomy of a search engine

• tf–idf and related definitions as used in Lucene

• TfidfTransformer in scikit-learn

• Text to Matrix Generator (TMG): a MATLAB toolbox that can be used for various tasks in text mining (TM), specifically (i) indexing, (ii) retrieval, (iii) dimensionality reduction, (iv) clustering, (v) classification. The indexing step offers the user the ability to apply local and global weighting methods, including tf–idf.

• Pyevolve: A tutorial series explaining the tf–idf calculation.

• TF/IDF with Google n-Grams and POS Tags

Chapter 20

Synonym

This article is about the general meaning of “synonym”. For its use in biology, see Synonym (taxonomy).

A synonym is a word or phrase that means exactly or nearly the same as another word or phrase in the same language. Words that are synonyms are said to be synonymous, and the state of being a synonym is called synonymy. The word comes from Ancient Greek sýn (σύν; “with”) and ónoma (ὄνομα; “name”). Examples of synonyms are the words begin, start, commence, and initiate. Words can be synonymous when meant in certain senses, even if they are not synonymous in all of their senses. For example, if one talks about a long time or an extended time, long and extended are synonymous within that context. Synonyms with exact meaning share a seme or denotational sememe, whereas those with inexactly similar meanings share a broader denotational or connotational sememe and thus overlap within a semantic field. Some academics call the former type cognitive synonyms to distinguish them from the latter type, which they call near-synonyms.[2]

Some lexicographers claim that no synonyms have exactly the same meaning (in all contexts or social levels of language) because etymology, orthography, phonic qualities, ambiguous meanings, usage, etc. make them unique. Different words that are similar in meaning usually differ for a reason: feline is more formal than cat; long and extended are only synonyms in one usage and not in others (for example, a long arm is not the same as an extended arm). Synonyms are also a source of euphemisms.

In the figurative sense, two words are sometimes said to be synonymous if they have the same connotation:

...a widespread impression that ... Hollywood was synonymous with immorality...[3] — Doris Kearns Goodwin

Metonymy can sometimes be a form of synonymy, as when, for example, the White House is used as a synonym of the administration in referring to the U.S. executive branch under a specific president. Thus a metonym is a type of synonym, and the word metonym is a hyponym of the word synonym. The analysis of synonymy, polysemy, hyponymy, and hypernymy is inherent to taxonomy and ontology in the information- science senses of those terms. It has applications in pedagogy and machine learning, because they rely on word-sense disambiguation and schema.

20.1 Examples

Synonyms can be any part of speech (such as nouns, verbs, adjectives, adverbs or prepositions), as long as both words belong to the same part of speech. Examples:

• verb

• buy and purchase

• adjective

• big and large


Synonym list in cuneiform on a clay tablet, Neo-Assyrian period[1]

• adverb

• quickly and speedily

• preposition

• on and upon

Note that synonyms are defined with respect to certain senses of words; for instance, pupil as the aperture in the iris of the eye is not synonymous with student. Likewise, he expired means the same as he died, yet my passport has expired cannot be replaced by my passport has died.

In English, many synonyms emerged in the Middle Ages, after the Norman conquest of England. While England's new ruling class spoke Norman French, the lower classes continued to speak Old English (Anglo-Saxon). Thus, today we have synonyms like the Norman-derived people, liberty and archer, and the Saxon-derived folk, freedom and bowman. For more examples, see the list of Germanic and Latinate equivalents in English.

The purpose of a thesaurus is to offer the user a listing of similar or related words; these are often, but not always, synonyms.

• The word poecilonym is a rare synonym of the word synonym. It is not entered in most major dictionaries and is a curiosity or piece of trivia for being an autological word because of its meta quality as a synonym of synonym.

• Antonyms are words with opposite or nearly opposite meanings. For example: hot ↔ cold, large ↔ small, thick ↔ thin, synonym ↔ antonym

• Hypernyms and hyponyms are words that refer to, respectively, a general category and a specific instance of that category. For example, vehicle is a hypernym of car, and car is a hyponym of vehicle.

• Homophones are words that have the same pronunciation, but different meanings. For example, witch and which are homophones in most accents (because they are pronounced the same).

• Homographs are words that have the same spelling, but have different pronunciations. For example, one can record a song or keep a record of documents.

• Homonyms are words that have the same pronunciation and spelling, but have different meanings. For exam- ple, rose (a type of flower) and rose (past tense of rise) are homonyms.

20.2 See also

• -onym

• Synonym ring

• Cognitive synonymy

• Elegant variation, the gratuitous use of a synonym in prose

20.3 References

[1] K.4375

[2] Stanojević, Maja (2009), “Cognitive synonymy: a general overview” (PDF), Facta Universitatis, Linguistics and Literature series, 7 (2): 193–200.

[3] The Fitzgeralds and the Kennedys. Macmillan. 1991. p. 370. ISBN 9780312063542. Retrieved 27 May 2014.

20.4 External links

Tools which graph word relations:

• Graph Words - Online tool for visualizing word relations

• Synonyms.net - Online reference resource that provides instant synonyms and antonyms definitions, including visualizations and voice pronunciations

• English/French Semantic Atlas - Graph words relations in English, French and gives cross representations for translations - offers 500 searches per user per day.

Plain-word synonym finders:

• Synonym Finder - Synonym finder including hypernyms in search results

• Thesaurus - Online synonyms in English, Italian, French and German

• Woxikon Synonyms - Over 1 million synonyms - English, German, Spanish, French, Italian, Portuguese, Swedish and Dutch

• Power Thesaurus - Thesaurus with synonyms ordered by rating

• FindMeWords Synonyms - Online Synonym Dictionary with definitions

Chapter 21

Relevance

Relevance is the concept of one topic being connected to another topic in a way that makes it useful to consider the first topic when considering the second. The concept of relevance is studied in many different fields, including cognitive sciences, logic, and library and information science. Most fundamentally, however, it is studied in epistemology (the theory of knowledge). Different theories of knowledge have different implications for what is considered relevant and these fundamental views have implications for all other fields as well.

21.1 Definition

“Something (A) is relevant to a task (T) if it increases the likelihood of accomplishing the goal (G), which is implied by T.” (Hjørland & Sejer Christensen, 2002).[1] A thing might be relevant; a document or a piece of information may be relevant. The basic understanding of relevance does not depend on whether we speak of “things” or “information”. For example, the Gandhian principles are of great relevance in today’s world.

21.2 Epistemology

If you believe that schizophrenia is caused by bad communication between mother and child, then family interaction studies become relevant. If, on the other hand, you subscribe to a genetic theory of schizophrenia, then the study of genes becomes relevant. If you subscribe to the epistemology of empiricism, then only intersubjectively controlled observations are relevant. If, on the other hand, you subscribe to feminist epistemology, then the sex of the observer becomes relevant. Epistemology is not just one domain among others. Epistemological views are always at play in any domain. Those views determine or influence what is regarded as relevant.

21.3 Relevance logic

In formal reasoning, relevance has proved an important but elusive concept. It is important because the solution of any problem requires the prior identification of the relevant elements from which a solution can be constructed. It is elusive because the meaning of relevance appears to be difficult or impossible to capture within conventional logical systems. The obvious suggestion that q is relevant to p if q is implied by p breaks down because under standard definitions of material implication, a false proposition implies all other propositions. However, though 'iron is a metal' may be implied by 'cats lay eggs', it doesn't seem to be relevant to it in the way in which 'cats are mammals' and 'mammals give birth to living young' are relevant to each other.

If one states “I love ice cream,” and another person responds “I have a friend named Brad Cook,” then these statements are not relevant. However, if one states “I love ice cream,” and another person responds “I have a friend named Brad Cook who also likes ice cream,” this statement now becomes relevant because it relates to the first person’s idea.

97 98 CHAPTER 21. RELEVANCE

Graphic of relevance in digital ecosystems

More recently a number of theorists have sought to account for relevance in terms of “possible world logics” in intensional logic. Roughly, the idea is that necessary truths are true in all possible worlds, contradictions (logical falsehoods) are true in no possible worlds, and contingent propositions can be ordered in terms of the number of possible worlds in which they are true. Relevance is argued to depend upon the “remoteness relationship” between an actual world in which relevance is being evaluated and the set of possible worlds within which it is true.

21.4 Application

21.4.1 Politics

During the 1960s, relevance became a fashionable buzzword, meaning roughly 'relevance to social concerns’, such as racial equality, poverty, social justice, world hunger, world economic development, and so on. The implication was that some subjects, e.g., the study of medieval poetry and the practice of corporate law, were not worthwhile because they did not address pressing social issues.

21.4.2 Economics

The economist John Maynard Keynes saw the importance of defining relevance to the problem of calculating risk in economic decision-making. He suggested that the relevance of a piece of evidence, such as a true proposition, should be defined in terms of the changes it produces in estimates of the probability of future events. Specifically, Keynes proposed that new evidence e is irrelevant to a proposition p, given old evidence q, if and only if p/(q & e) = p/q, and relevant otherwise (where, in Keynes’s notation, p/q denotes the probability of p given q).

There are technical problems with this definition, for example, the relevance of a piece of evidence can be sensitive to the order in which other pieces of evidence are received.

21.4.3 Cognitive science and pragmatics

Further information: Relevance theory

In 1986, Dan Sperber and Deirdre Wilson drew attention to the central importance of relevance decisions in reasoning and communication. They proposed an account of the process of inferring relevant information from any given utterance. To do this work, they used what they called the “Principle of Relevance": namely, the position that any utterance addressed to someone automatically conveys the presumption of its own optimal relevance. The central idea of Sperber and Wilson’s theory is that all utterances are encountered in some context, and the correct interpretation of a particular utterance is the one that allows most new implications to be made in that context on the basis of the least amount of information necessary to convey it.

For Sperber and Wilson, relevance is conceived as relative or subjective, as it depends upon the state of knowledge of a hearer when they encounter an utterance. Sperber and Wilson stress that this theory is not intended to account for every intuitive application of the English word “relevance”. Relevance, as a technical term, is restricted to relationships between utterances and interpretations, and so the theory cannot account for intuitions such as the one that relevance relationships obtain in problems involving physical objects. If a plumber needs to fix a leaky faucet, for example, some objects and tools are relevant (e.g., a wrench) and others are not (e.g., a waffle iron). And, moreover, the latter seems to be irrelevant in a manner which does not depend upon the plumber’s knowledge, or the utterances used to describe the problem.

A theory of relevance that seems to be more readily applicable to such instances of physical problem solving has been suggested by Gorayska and Lindsay in a series of articles published during the 1990s. The key feature of their theory is the idea that relevance is goal-dependent. An item (e.g., an utterance or object) is relevant to a goal if and only if it can be an essential element of some plan capable of achieving the desired goal. This theory embraces both propositional reasoning and the problem-solving activities of people such as plumbers, and defines relevance in such a way that what is relevant is determined by the real world (because what plans will work is a matter of empirical fact) rather than the state of knowledge or belief of a particular problem solver.

21.4.4 Law

Main article: Relevance (law)

The meaning of “relevance” in U.S. law is reflected in Rule 401 of the Federal Rules of Evidence. That rule defines relevance as “having any tendency to make the existence of any fact that is of consequence to the determination of the action more probable or less probable than it would be without the evidence.” In other words, if a fact were to have no bearing on the truth or falsity of a conclusion, it would be legally irrelevant.

21.4.5 Library and information science

Main article: Relevance (information retrieval)

This field has considered when documents (or document representations) retrieved from databases are relevant or non-relevant. Given a conception of relevance, two measures have been applied:

Recall = a / (a + c) × 100%, where a = number of retrieved, relevant documents and c = number of non-retrieved, relevant documents (sometimes termed “silence”). Recall is thus an expression of how exhaustive a search for documents is.

Precision = a / (a + b) × 100%, where b = number of retrieved, non-relevant documents (often termed “noise”). Precision is thus a measure of the amount of noise in document retrieval.

Relevance itself has in the literature often been based on what is termed “the system’s view” and “the user’s view”. Hjørland (2010) criticizes these two views and defends a “subject knowledge view of relevance”.
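In code, the two measures are a direct transcription, assuming the retrieved and relevant documents are available as sets of identifiers (a minimal sketch, not drawn from the text):

def recall(retrieved, relevant):
    # a = retrieved & relevant; c = relevant - retrieved (the "silence")
    return len(retrieved & relevant) / len(relevant) * 100

def precision(retrieved, relevant):
    # a = retrieved & relevant; b = retrieved - relevant (the "noise")
    return len(retrieved & relevant) / len(retrieved) * 100

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(recall(retrieved, relevant))     # ≈ 66.7: two of the three relevant documents were found
print(precision(retrieved, relevant))  # 50.0: two of the four retrieved documents are relevant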

21.5 See also

• Source criticism

• Description

• Distraction

• Information-action ratio

• Information overload

• Intention

• Relevance theory

21.6 References

[1] Hjørland, B. & Sejer Christensen, F. (2002). Work tasks and socio-cognitive relevance: a specific example. Journal of the American Society for Information Science and Technology, 53(11), 960-965.

• Gorayska B. & R. O. Lindsay (1993). The Roots of Relevance. Journal of Pragmatics 19, 301–323. Los Alamitos: IEEE Computer Society Press.

• Hjørland, Birger (2010). The foundation of the concept of relevance. Journal of the American Society for Information Science and Technology, 61(2), 217-237.

• Keynes, J. M. (1921). Treatise on Probability. London: MacMillan

• Lindsay, R. & Gorayska, B. (2002). Relevance, Goals and Cognitive Technology. International Journal of Cognitive Technology, 1(2), 187–232

• Sperber, D. & D. Wilson (1986/1995). Relevance: Communication and Cognition. 2nd edition. Oxford: Blackwell.

• Sperber, D. & D. Wilson (1987). Précis of Relevance: Communication and Cognition. Behavioral and Brain Science, 10, 697–754.

• Sperber, D. & D. Wilson (2004). Relevance Theory. In Horn, L.R. & Ward, G. (eds.) 2004 The Handbook of Pragmatics. Oxford: Blackwell, 607-632. http://www.dan.sperber.fr/?p=93

• Zhang, X. H. (1993). A Goal-Based Relevance Model and its Application to Intelligent Systems. Ph.D. Thesis, Oxford Brookes University, Department of Mathematics and Computer Science, October, 1993.

21.7 External links

• Malcolm Gladwell - Blink - full show: TVOntario interview regarding “snap judgements” and Blink

Chapter 22

Library and information science

Library and information science (LIS) (sometimes given as the plural library and information sciences)[1][2] or as "library and information studies"[3] is a merging of library science and information science. The joint term is associated with schools of library and information science (abbreviated to “SLIS”). In the last part of the 1960s, schools of librarianship, which generally developed from professional training programs (not academic disciplines) to university institutions during the second half of the 20th century, began to add the term “information science” to their names. The first school to do this was at the University of Pittsburgh in 1964.[4] More schools followed during the 1970s and 1980s, and by the 1990s almost all library schools in the USA had added information science to their names. Although there are exceptions, similar developments have taken place in other parts of the world. In Denmark, for example, the 'Royal School of Librarianship' changed its English name to The Royal School of Library and Information Science in 1997.

Exceptions include Tromsø, Norway, where the term documentation science is the preferred name of the field; France, where information science and communication studies form one interdiscipline;[5] and Sweden, where the fields of Archival science, Library science and Museology have been integrated as Archival, Library and Museum studies.

In spite of various trends to merge the two fields, some consider the two original disciplines, library science and information science, to be separate.[6][7] However, the tendency today is to use the terms as synonyms or to drop the term “library” and to speak about information departments or I-schools. There have also been attempts to revive the concept of documentation and to speak of Library, information and documentation studies (or science).[8]

22.1 Relations between library science, information science and LIS

Tefko Saracevic (1992, p. 13)[6] argued that library science and information science are separate fields:

“The common ground between library science and information science, which is a strong one, is in the sharing of their social role and in their general concern with the problems of effective utilization of graphic records. But there are also very significant differences in several critical respects, among them in: (1) selection of problems addressed and in the way they were defined; (2) theoretical questions asked and frameworks established; (3) the nature and degree of experimentation and empirical development and the resulting practical knowledge/competencies derived; (4) tools and approaches used; and (5) the nature and strength of interdisciplinary relations established and the dependence of the progress and evolution of interdisciplinary approaches. All of these differences warrant the conclusion that librarianship and information science are two different fields in a strong interdisciplinary relation, rather than one and the same field, or one being a special case of the other.”

Another indication of the different uses of the two terms is the indexing in UMI’s Dissertations Abstracts. In Dissertations Abstracts Online in November 2011, 4,888 dissertations were indexed with the descriptor LIBRARY SCIENCE and 9,053 with the descriptor INFORMATION SCIENCE. For the year 2009 the numbers were 104 LIBRARY SCIENCE and 514 INFORMATION SCIENCE. 891 dissertations were indexed with both terms (36 in 2009). It should be considered that information science grew out of documentation science and therefore has a tradition of considering scientific and scholarly communication, bibliographic databases, subject knowledge and terminology,


etc. Library science, on the other hand, has mostly concentrated on libraries and their internal processes and best practices. It is also relevant to consider that information science used to be done by scientists, while librarianship has been split between public libraries and scholarly research libraries. Library schools have mainly educated librarians for public libraries and not shown much interest in scientific communication and documentation. When information scientists entered library schools from 1964 onwards, they brought with them competencies in relation to information retrieval in subject databases, including concepts such as recall and precision, Boolean search techniques, query formulation and related issues. Subject bibliographic databases and citation indexes provided a major step forward in information dissemination - and also in the curriculum at library schools.

Julian Warner (2010)[9] suggests that the information and computer science tradition in information retrieval may broadly be characterized as query transformation, with the query articulated verbally by the user in advance of searching and then transformed by a system into a set of records. From librarianship and indexing, on the other hand, there has been an implicit stress on selection power, enabling the user to make relevant selections.

22.2 Difficulties defining LIS

“The question, 'What is library and information science?' does not elicit responses of the same internal conceptual coherence as similar inquiries as to the nature of other fields, e.g., 'What is chemistry?', 'What is economics?', 'What is medicine?' Each of those fields, though broad in scope, has clear ties to basic concerns of their field. [...] Neither LIS theory nor practice is perceived to be monolithic nor unified by a common literature or set of professional skills. Occasionally, LIS scholars (many of whom do not self-identify as members of an interreading LIS community, or prefer names other than LIS), attempt, but are unable, to find core concepts in common. Some believe that computing and internetworking concepts and skills underlie virtually every important aspect of LIS, indeed see LIS as a sub-field of computer science! [Footnote III.1] Others claim that LIS is principally a social science accompanied by practical skills such as ethnography and interviewing. Historically, traditions of public service, bibliography, documentalism, and information science have viewed their mission, their philosophical toolsets, and their domain of research differently. Still others deny the existence of a greater metropolitan LIS, viewing LIS instead as a loosely organized collection of specialized interests often unified by nothing more than their shared (and fought-over) use of the descriptor information. Indeed, claims occasionally arise to the effect that the field even has no theory of its own.” (Konrad, 2007, p. 652-653).

22.2.1 A multidisciplinary, interdisciplinary or monodisciplinary field?

The Swedish researcher Emin Tengström (1993)[10] described cross-disciplinary research as a process, not a state or structure. He differentiates three levels of ambition regarding cross-disciplinary research:

• The "Pluridisciplinary" or "multidisciplinarity" level

• The genuine cross-disciplinary level: "interdisciplinarity"

• The discipline-forming level "transdisciplinarity"

What is described here is a view of social fields as dynamic and changing. Library and information science is viewed as a field that started as a multidisciplinary field based on literature, psychology, sociology, management, computer science etc., which is developing towards an academic discipline in its own right. However, the following quote seems to indicate that LIS is actually developing in the opposite direction:

Chua & Yang (2008)[11] studied papers published in the Journal of the American Society for Information Science and Technology in the period 1988-1997 and found, among other things: “Top authors have grown in diversity from those being affiliated predominantly with library/information-related departments to include those from information systems management, information technology, business, and the humanities. Amid heterogeneous clusters of collaboration among top authors, strongly connected crossdisciplinary coauthor pairs have become more prevalent. Correspondingly, the distribution of top keywords’ occurrences that leans heavily on core information science has shifted towards other subdisciplines such as information technology and sociobehavioral science.”

As a field with its own body of interrelated concepts, techniques, journals, and professional associations, LIS is clearly a discipline. But by the nature of its subject matter and methods LIS is just as clearly an interdiscipline, drawing on many adjacent fields (see below).

22.2.2 A fragmented adhocracy

Richard Whitley (1984,[12] 2000)[13] classified scientific fields according to their intellectual and social organization and described management studies as a ‘fragmented adhocracy’, a field with a low level of coordination around a diffuse set of goals and a non-specialized terminology; but with strong connections to the practice in the business sector. Åström (2006)[14] applied this conception to the description of LIS.

22.2.3 Scattering of the literature

Meho & Spurgin (2005)[15] found that in a list of 2,625 items published between 1982 and 2002 by 68 faculty members of 18 schools of library and information science, only 10 databases provided significant coverage of the LIS literature. Results also show that restricting the data sources to one, two, or even three databases leads to inaccurate rankings and erroneous conclusions. Because no database provides comprehensive coverage of the LIS literature, researchers must rely on a wide range of disciplinary and multidisciplinary databases for ranking and other research purposes. Even when the nine most comprehensive databases in LIS were searched and combined, 27.0% (or 710 of 2,635) of the publications remained unfound.

“The study confirms earlier research that LIS literature is highly scattered and is not limited to standard LIS databases. What was not known or verified before, however, is that a significant amount of this literature is indexed in the interdisciplinary or multidisciplinary databases of Inside Conferences and INSPEC. Other interdisciplinary databases, such as America: History and Life, were also found to be very useful and complementary to traditional LIS databases, particularly in the areas of archives and library history."(Meho & Spurgin, 2005, p.1329).

22.3 The unique concern of library and information science

“Concern for people becoming informed is not unique to LIS, and thus is insufficient to differentiate LIS from other fields. LIS are a part of a larger enterprise.” (Konrad, 2007, p. 655).[16]

“The unique concern of LIS is recognized as: Statement of the core concern of LIS: Humans becoming informed (constructing meaning) via intermediation between inquirers and instrumented records. No other field has this as its concern.” (Konrad, 2007, p. 660)

“Note that the promiscuous term information does not appear in the above statement circumscribing the field’s central concerns: The detrimental effects of the ambiguity this term provokes are discussed above (Part III). Furner [Furner 2004, 427] has shown that discourse in the field is improved where specific terms are utilized in place of the i-word for specific senses of that term.” (Konrad, 2007, p. 661).

Michael Buckland wrote: “Educational programs in library, information and documentation are concerned with what people know, are not limited to technology, and require wide-ranging expertise. They differ fundamentally and importantly from computer science programs and from the information systems programs found in business schools.”[17]

22.4 LIS theories

Julian Warner (2010, p. 4-5)[9] suggests that

"Two paradigms, the cognitive and the physical, have been distinguished in information retrieval research, but they share the assumption of the value of delivering relevant records (Ellis 1984, 19;[18] Belkin and Vickery 1985, 114[19]). For the purpose of discussion here, they can be considered a single heterogeneous paradigm, linked but not united by this common assumption. The value placed on query transformation is dissonant with common practice, where users may prefer to explore an area and may value fully informed exploration. Some dissenting research discussions have been more congruent with practice, advocating explorative capability - the ability to explore and make discriminations between representations of objects - as the fundamental design principle for information retrieval systems”.

The domain analytic approach (e.g., Hjørland 2010[20]) suggests that the relevant criteria for making discriminations in information retrieval are scientific and scholarly criteria. In some fields (e.g. evidence-based medicine)[21] the relevant distinctions are very explicit. In other cases they are implicit or unclear. At the basic level, the relevance of bibliographical records is determined by epistemological criteria of what constitutes knowledge.

Among other approaches, Evidence Based Library and Information Practice should also be mentioned.

22.5 Journals

(see also List of LIS Journals in India page, Category:Library science journals and Journal Citation Reports for listing according to Impact factor) Some core journals in LIS are:

• Annual Review of Information Science and Technology (ARIST) (1966–2011)

• El Profesional de la Información (es) (EPI) (1992-) (Formerly Information World en Español)

• Information Processing and Management

• Information Research: An international electronic journal (IR) (1995-)

• Italian Journal of Library and Information Studies (JLIS.it)

• Journal of Documentation (JDoc) (1945-)

• Journal of Information Science (JIS) (1979-)

• Journal of the Association for Information Science and Technology (Formerly Journal of the American Society for Information Science and Technology) (JASIST) (1950-)

• Knowledge Organization (journal)

• The Library Quarterly (LQ) (1931-)

• Library Trends (1952-)

• Scientometrics (journal) (1978-)

• Library Literature and Information Science Retrospective (1901-1983)

Important bibliographical databases in LIS are, among others, Social Sciences Citation Index and Library and Information Science Abstracts.

22.6 Conferences

This is a list of some of the major conferences in the field.

• Annual meeting of the American Society for Information Science and Technology

• Conceptions of Library and Information Science

• i-Schools’ iConferences

• ISIC - the Information Behaviour Conference http://informationr.net/isic/index.html

• The International Federation of Library Associations and Institutions (IFLA): World Library and Information Congress, http://web.archive.org/web/20150706164140/http://conference.ifla.org/

• The international conferences of the International Society for Knowledge Organization (ISKO), http://www. isko.org/events.html 22.7. COMMON SUBFIELDS 105

22.7 Common subfields

An advertisement for a full Professor in information science at the Royal School of Library and Information Science, spring 2011, provides one view of which subdisciplines are well-established:[22] “The research and teaching/supervision must be within some (and at least one) of these well-established information science areas

• a. Knowledge organization

• b. Library studies

• c.

• d. Information behavior

• e. Interactive information retrieval

• f. Information systems

• g. Scholarly communication

• h. Digital literacy (cf information literacy)

• i. Bibliometrics or scientometrics

• j. Interaction design and user experience"

• k.

There are other ways to identify subfields within LIS, for example bibliometric mapping and comparative studies of curricula. Bibliometric maps of LIS have been produced by, among others, Vickery & Vickery (1987, frontispiece),[23] White & McCain (1998),[24] Åström (2002,[25] 2006) and Hassan-Montero & Herrero-Solana (2007).[26] An example of a curriculum study is Kajberg & Lørring, 2005.[27] In this publication the following data are reported (p. 234): “Degree of overlap of the ten curricular themes with subject areas in the current curricula of responding LIS schools

• Information seeking and Information retrieval 100%

• Library management and promotion 96%

86%

• Knowledge organization 82%

• Information literacy and learning 76%

• Library and society in a historical perspective (Library history) 66%

• The Information society: Barriers to the free access to information 64%

• Cultural heritage and digitisation of the cultural heritage (Digital preservation) 62%

• The library in the multi-cultural information society: International and intercultural communication 42%

• Mediation of culture in a special European context 26% "

There is often an overlap between these subfields of LIS and other fields of study. Most information retrieval research, for example, belongs to computer science. Knowledge management is considered a subfield of management or organizational studies.[28]

22.8 See also

• Archival science

• Authority control

• Bibliography

• Digital Asset Management (DAM)

• Documentation science

• Education for librarianship

• Glossary of library and information science

• I-school

• Information history

• Information systems

• Knowledge management

• Library and information scientist

• Museology

• Museum informatics

• Records Management

22.9 References

[1] Bates, M.J. and Maack, M.N. (eds.). (2010). Encyclopedia of Library and Information Sciences. Vol. 1-7. CRC Press, Boca Raton, USA. Also available as an electronic source.

[2] Library and Information Sciences is the name used in the Dewey Decimal Classification for class 20 from the 18th edition (1971) to the 22nd edition (2003)

[3] “Canada Library School University Programs”. www.canadian-universities.net. Retrieved 23 November 2014.

[4] Galvin, T. J. (1977). Pittsburgh. University of Pittsburgh Graduate School of Library and Information Sciences. IN: Encyclopedia of Library and Information Science (Vol. 22). Ed. by A. Kent, H. Lancour & J.E.Daily. New York: Marcel Dekker, Inc. (pp. 280–291)

[5] Mucchielli, A., (2000), La nouvelle communication : épistémologie des sciences de l’information-communication. Paris, Armand Colin, 2000. Collection U. Sciences de la communication

[6] Saracevic, Tefko (1992). Information science: origin, evolution and relations. In: Conceptions of library and information science. Historical, empirical and theoretical perspectives. Edited by Pertti Vakkari & Blaise Cronin. London: Taylor Graham (pp. 5-27).

[7] Miksa, Francis L. (1992). Library and information science: two paradigms. In: In: Conceptions of library and information science. Historical, empirical and theoretical perspectives. Edited by Pertti Vakkari & Blaise Cronin. London: Taylor Graham (pp. 229-252).

[8] Rayward, W. B. (Ed.) (2004). Aware and responsible. Papers of the Nordic-International Colloquium on Social and Cultural Awareness and responsibility in Library, Information, and Documentation Studies (SCARLID). Lanham, MD: Scarecrow Press.

[9] Warner, Julian (2010). Human information retrieval. Cambridge, MA: The MIT Press

[10] Tengström, E. (1993). Biblioteks- och informationsvetenskapen - ett fler- eller tvär-vetenskapligt område? Svensk Bib- lioteksforskning,(1), 9-20.

[11] Chua, A. & Yang, C.C. (2008). The shift towards multi-disciplinarity in information science, Journal of the American Society for Information Science and Technology, 59(13), 2156–2170. 22.10. FURTHER READING 107

[12] Whitley, R. (1984). The fragmented state of management studies: Reasons and consequences. Journal of management studies, 21(3), 331-348.

[13] Whitley, R. (2000). The intellectual and social organization of the sciences. Oxford University Press, Oxford.

[14] Åström, F. (2006). The social and intellectual development of library and information science. Doctoral theses at the Department of Sociology, Umeå University, No. 48, 2006. http://www.diva-portal.org/smash/get/diva2:145144/FULLTEXT01

[15] Meho, Lokman I. & Spurgin, Kristina M. (2005). Ranking the Research Productivity of Library and Information Sci- ence Faculty and Schools: An Evaluation of Data Sources and Research Methods. Journal of the American Society for Information Science and Technology, 56(12), 1314–1331.

[16] Konrad, A. (2007). On inquiry: Human concept formation and construction of meaning through library and information science intermediation (Unpublished doctoral dissertation). University of California, Berkeley. Retrieved from http:// escholarship.org/uc/item/1s76b6hp

[17] Buckland, Michael K. (2004). Reflections on social and cultural awareness and responsibility in library, information and documentation - Commentary on the SCARLID colloquium. In: Rayward, W. B. (Ed.). Aware and responsible. Papers of the Nordic- International Colloquium on Social and Cultural Awareness and responsibility in Library, Information, and Documentation Studies (SCARLID). Lanham, MD: Scarecrow Press. (pp. 169-175).

[18] Ellis, David (1984). Theory and explanation in information retrieval research. Journal of Information Science, 8, 25-38

[19] Belkin, N. J. & Vickery, A. (1985)- Interaction in information systems: A review of research from document retrieval to knowledge-based systems. London: British Library (Library and Information Research Report 35).

[20] Hjørland, Birger (2010). The foundation of the concept of relevance. Journal of the American Society for Information Science and Technology. 61(2), 217-237.

[21] Hjørland, Birger (2011). Evidence based practice: An analysis based on the philosophy of science. Journal of the American Society for Information Science and Technology, 62(7), 1301-1310.

[22] Advertisement for a full Professor in information science at the Royal School of Library and Information Science, spring 2011: http://www.job-i-staten.dk/SearchResults/position-as-full-professor-in-information-science-lja-3723916.aspx?jobId= LJA-3723916&list=SearchResultsJobsIds&index=6&querydesc=SearchJobQueryDescription&viewedfrom=1

[23] Vickery, Brian & Vickery, Alina (1987). Information science in theory and practice. London: Bowker-Saur.

[24] White, H. D., & McCain, K. W. (1998). Visualizing a discipline: An author co-citation analysis of information science, 1972-1995. Journal of the American Society for Information Science, 49(4), 327-355.

[25] Åström, Fredrik (2002) Visualizing Library and Information Science concept spaces through keyword and citation based maps and clusters. In: Bruce, Fidel, Ingwersen & Vakkari (Eds.). Emerging frameworks and methods: Proceedings of the fourth international conference on conceptions of Library and Information Science (CoLIS4), pp 185-197. Greenwood Village: Libraries unlimited.

[26] Hassan-Montero, Y., Herrero-Solana, V. (2007). Visualizing Library and Information Science from the practitioner’s perspective. 11th International Conference of the International Society for Scientometrics and Informetrics, June 25–27, 2007, Madrid (Spain). http://yusef.es/Visualizing_LIS.pdf

[27] Kajberg, Leif & Lørring, Leif (eds.). (2005). European Curriculum Reflections on Library and Information Science Edu- cation. Copenhagen: The Royal School of Library and Information Science. http://library.upt.ro/LIS_Bologna.pdf

[28] Clegg, Stewart; Bailey, James R., eds. (2008). International Encyclopedia of Organizational Studies. Los Angeles, Calif.: Sage Publications Inc. pp. 758–762. ISBN 978-1-4129-5390-0.

22.10 Further reading

• Hjørland, B. (2000). Library and Information Science: Practice, theory, and philosophical basis. Information Processing and Management, 36(3), 501-531.

• Hjørland, B. (2013). Information science and its core concepts: Levels of disagreement. In lbekwe-SanJuan, F., & Dousa, T.(ed.), Fundamental notions of information communication and knowledge (pp. 205–235). Dordrecht: Springer Science+Business Media B.V. 108 CHAPTER 22. LIBRARY AND INFORMATION SCIENCE

• Järvelin, K. & Vakkari, P. (1993). The Evolution of Library and Information Science 1965-1985: A Content Analysis of Journal Articles. Information Processing & Management, 29(1), 129-144.

• Kajberg, L. (1992). Library and Information Science Research in Denmark 1965-1989: A Content Analysis of R&D Publications. IN: Teknologi och kompetens. Proceedings. 8:de Nordiska konferencen för Information och Dokumentation 19-21/5 1992 i Helsingborg. Stockholm: Tekniska Litteratursällskapet, 233-237.

• McNicol, S. (2003). LIS: The Interdisciplinary Research Landscape. Journal of Librarianship and Information Science, 35(1), 23-30.

• McClure, C. R. & Hernon, P. (eds.). (1991). Library and Information Science Research: Perspectives and Strategies for Improvement. Norwood, N.J.: Ablex.

• Åström, Fredrik (2008). Formalizing a discipline: The institutionalization of library and information science research in the Nordic countries. Journal of Documentation, Vol. 64, Iss: 5, 721-737.

Chapter 23

Relevance (information retrieval)

For other uses, see Relevance (disambiguation).

In information science and information retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user. Relevance may include concerns such as timeliness, authority or novelty of the result.

23.1 History

The concern with the problem of finding relevant information dates back at least to the first publication of scientific journals in the 17th century. The formal study of relevance began in the 20th century with the study of what would later be called bibliometrics. In the 1930s and 1940s, S. C. Bradford used the term “relevant” to characterize articles relevant to a subject (cf., Bradford’s law). In the 1950s, the first information retrieval systems emerged, and researchers noted the retrieval of irrelevant articles as a significant concern. In 1958, B. C. Vickery made the concept of relevance explicit in an address at the International Conference on Scientific Information.[1] Since 1958, information scientists have explored and debated definitions of relevance. A particular focus of the debate was the distinction between “relevance to a subject” or “topical relevance” and “user relevance”.

23.2 Evaluation

Main article: Information retrieval § Performance and correctness measures

The information retrieval community has emphasized the use of test collections and benchmark tasks to measure topical relevance, starting with the Cranfield Experiments of the early 1960s and culminating in the TREC evaluations that continue to this day as the main evaluation framework for information retrieval research. In order to evaluate how well an information retrieval system retrieved topically relevant results, the relevance of retrieved results must be quantified. In Cranfield-style evaluations, this typically involves assigning a relevance level to each retrieved result, a process known as relevance assessment. Relevance levels can be binary (indicating a result is relevant or that it is not relevant), or graded (indicating results have a varying degree of match between the topic of the result and the information need). Once relevance levels have been assigned to the retrieved results, information retrieval performance measures can be used to assess the quality of a retrieval system’s output. In contrast to this focus solely on topical relevance, the information science community has emphasized user studies that consider user relevance. These studies often focus on aspects of human-computer interaction (see also human-computer information retrieval).
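The text does not name a particular performance measure, but discounted cumulative gain (DCG) is one widely used way to score a ranking against graded relevance assessments; the following Python sketch is illustrative rather than prescriptive:

import math

def dcg(relevance_levels):
    # relevance_levels: graded assessments for ranked results, e.g. [3, 2, 0, 1],
    # with higher grades for better matches between result and information need.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance_levels, start=1))

def ndcg(relevance_levels):
    # Normalize against the ideal (best possible) ordering of the same grades.
    ideal = dcg(sorted(relevance_levels, reverse=True))
    return dcg(relevance_levels) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 0, 1]))  # ≈ 0.985; 1.0 would mean the ranking matched the ideal order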


23.3 Clustering and relevance

The cluster hypothesis, proposed by C. J. van Rijsbergen in 1979, asserts that two documents that are similar to each other have a high likelihood of being relevant to the same information need. With respect to the embedding similarity space, the cluster hypothesis can be interpreted globally or locally.[2] The global interpretation assumes that there exists some fixed set of underlying topics derived from inter-document similarity. These global clusters or their representatives can then be used to relate the relevance of two documents (e.g. two documents in the same cluster should both be relevant to the same request). Methods in this spirit include:

• cluster-based information retrieval[3][4]

• cluster-based document expansion, such as latent semantic analysis or its language modeling equivalents.[5] It is important to ensure that clusters – either in isolation or combination – successfully model the set of possible relevant documents.

A second interpretation, most notably advanced by Ellen Voorhees,[6] focuses on the local relationships between documents. The local interpretation avoids having to model the number or size of clusters in the collection and allows relevance at multiple scales. Methods in this spirit include:

• multiple cluster retrieval[4][6]

• spreading activation[7] and relevance propagation[8] methods

• local document expansion[9]

• score regularization[10]

Local methods require an accurate and appropriate document similarity measure.
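The text leaves the choice of similarity measure open; cosine similarity over term-weight vectors (for example, tf-idf scores) is one common choice, sketched here in Python as an assumption rather than a prescription:

import math

def cosine(u, v):
    # u, v: dicts mapping terms to weights (e.g. tf-idf scores).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Under the cluster hypothesis, a high cosine(d1, d2) suggests that d1 and d2
# are likely to be relevant to the same information need.
print(cosine({"eclipse": 0.5, "lunar": 0.3}, {"eclipse": 0.4, "solar": 0.2}))  # ≈ 0.77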

23.4 Problems and alternatives

The documents which are most relevant are not necessarily those which are most useful to display in the first page of search results. For example, two duplicate documents might be individually considered quite relevant, but it is only useful to display one of them. A measure called “maximal marginal relevance” (MMR) has been proposed to overcome this shortcoming. It considers the relevance of each document only in terms of how much new information it brings given the previous results.[11] In some cases, a query may have an ambiguous interpretation, or a variety of potential responses. Providing a diversity of results can be a consideration when evaluating the utility of a result set.[12]
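A minimal Python sketch of the MMR idea described above, assuming the surrounding retrieval system supplies a query-relevance function and a document-similarity function (both names are mine, and lam is the usual relevance/novelty trade-off parameter):

def mmr_rerank(candidates, query_sim, doc_sim, lam=0.7, k=5):
    # candidates: document ids ranked by the base system;
    # query_sim(d): relevance of d to the query;
    # doc_sim(d1, d2): similarity between two documents.
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(d):
            # Penalize documents that resemble something already selected.
            redundancy = max((doc_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim(d) - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

With lam = 1 this reduces to ranking purely by relevance; lowering lam penalizes documents that repeat information already shown, which is how the duplicate-document problem above is avoided.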

23.5 References

[1] Mizzaro, S. (1997). Relevance: The Whole History. Journal of the American Society for Information Science. 48, 810‐832.

[2] F. Diaz, Autocorrelation and Regularization of Query-Based Retrieval Scores. PhD thesis, University of Massachusetts Amherst, Amherst, MA, February 2008, Chapter 3.

[3] W. B. Croft, “A model of cluster searching based on classification,” Information Systems, vol. 5, pp. 189–195, 1980.

[4] A. Griffiths, H. C. Luckhurst, and P. Willett, “Using interdocument similarity information in document retrieval systems,” Journal of the American Society for Information Science, vol. 37, no. 1, pp. 3–11, 1986.

[5] X. Liu and W. B. Croft, “Cluster-based retrieval using language models,” in SIGIR ’04: Proceedings of the 27th annual international conference on Research and development in information retrieval, (New York, NY, USA), pp. 186–193, ACM Press, 2004.

[6] E. M. Voorhees, “The cluster hypothesis revisited,” in SIGIR ’85: Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval, (New York, NY, USA), pp. 188–196, ACM Press, 1985. 23.6. ADDITIONAL READING 111

[7] S. Preece, A spreading activation network model for information retrieval. PhD thesis, University of Illinois, Urbana- Champaign, 1981.

[8] T. Qin, T.-Y. Liu, X.-D. Zhang, Z. Chen, and W.-Y. Ma, “A study of relevance propagation for web search,” in SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, (New York, NY, USA), pp. 408–415, ACM Press, 2005.

[9] A. Singhal and F. Pereira, “Document expansion for speech retrieval,” in SIGIR ’99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, (New York, NY, USA), pp. 34–41, ACM Press, 1999.

[10] F. Diaz, “Regularizing query-based retrieval scores,” Information Retrieval, vol. 10, pp. 531–562, December 2007.

[11] Carbonell, Jaime; Goldstein, Jade (1998). “The use of MMR, diversity-based reranking for reordering documents and producing summaries”. Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. doi:10.1145/290941.291025.

[12] http://www.dcs.gla.ac.uk/workshops/ddr2012/

23.6 Additional reading

• Hjørland, B. (2010). The foundation of the concept of relevance. Journal of the American Society for Infor- mation Science and Technology, 61(2), 217-237.

• Relevance : communication and cognition. by Dan Sperber; Deirdre Wilson. 2nd ed. Oxford; Cambridge, MA: Blackwell Publishers, 2001. ISBN 978-0-631-19878-9

• Saracevic, T. (2007). Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance. Journal of the American Society for Information Science and Technology, 58(3), 1915-1933. (pdf)

• Saracevic, T. (2007). Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance. Journal of the American Society for Information Science and Technology, 58(13), 2126-2144. (pdf)

• Saracevic, T. (2007). Relevance in information science. Invited Annual Thomson Scientific Lazerow Memorial Lecture at School of Information Sciences, University of Tennessee. September 19, 2007. (video)

• Introduction to Information Retrieval: Evaluation. Stanford. (presentation in PDF)

Chapter 24

Web search engine

“Search engine” redirects here. For other uses, see Search engine (disambiguation).

A web search engine is a software system that is designed to search for information on the World Wide Web. The search results are generally presented in a line of results often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler.

The results of a search for the term “lunar eclipse” in a web-based image search engine

24.1 History

Further information: Timeline of web search engines

Internet search engines themselves predate the debut of the Web in December 1990. The Whois user search dates back to 1982[1] and the Knowbot Information Service multi-network user search was first implemented in 1989.[2]


The first well documented search engine that searched content files, namely FTP files, was Archie, which debuted on 10 September 1990.

Prior to September 1993 the World Wide Web was entirely indexed by hand. There was a list of webservers edited by Tim Berners-Lee and hosted on the CERN webserver. One historical snapshot of the list in 1992 remains,[3] but as more and more web servers went online the central list could no longer keep up. On the NCSA site, new servers were announced under the title “What’s New!"[4]

The first tool used for searching content (as opposed to users) on the Internet was Archie.[5] The name stands for “archive” without the “v”. It was created by Alan Emtage, Bill Heelan and J. Peter Deutsch, computer science students at McGill University in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names; however, Archie did not index the contents of these sites since the amount of data was so limited it could be readily searched manually.

The rise of Gopher (created in 1991 by Mark McCahill at the University of Minnesota) led to two new search programs, Veronica and Jughead. Like Archie, they searched the file names and titles stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead (Jonzy’s Universal Gopher Hierarchy Excavation And Display) was a tool for obtaining menu information from specific Gopher servers. While the name of the search engine “Archie” was not a reference to the Archie comic book series, “Veronica” and “Jughead” are characters in the series, thus referencing their predecessor.

In the summer of 1993, no search engine existed for the web, though numerous specialized catalogues were maintained by hand. Oscar Nierstrasz at the University of Geneva wrote a series of Perl scripts that periodically mirrored these pages and rewrote them into a standard format. This formed the basis for W3Catalog, the web’s first primitive search engine, released on September 2, 1993.[6]

In June 1993, Matthew Gray, then at MIT, produced what was probably the first web robot, the Perl-based World Wide Web Wanderer, and used it to generate an index called 'Wandex'. The purpose of the Wanderer was to measure the size of the World Wide Web, which it did until late 1995. The web’s second search engine, Aliweb, appeared in November 1993. Aliweb did not use a web robot, but instead depended on being notified by website administrators of the existence at each site of an index file in a particular format.

NCSA’s Mosaic was not the first Web browser, but it was the first to make a major splash. In November 1993, Mosaic v 1.0 broke away from the small pack of existing browsers by including features - like icons, bookmarks, a more attractive interface, and pictures - that made the software easy to use and appealing to “non-geeks”.

JumpStation (created in December 1993[7] by Jonathon Fletcher) used a web robot to find web pages and to build its index, and used a web form as the interface to its query program. It was thus the first WWW resource-discovery tool to combine the three essential features of a web search engine (crawling, indexing, and searching) as described below.
Because of the limited resources available on the platform it ran on, JumpStation’s indexing, and hence its searching, was limited to the titles and headings found in the web pages its crawler encountered.

One of the first “all text” crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it allowed users to search for any word in any webpage, which has become the standard for all major search engines since. It was also the first search engine widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor.

Soon after, many search engines appeared and vied for popularity. These included Magellan, Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Yahoo! was among the most popular ways for people to find web pages of interest, but its search function operated on its web directory rather than on full-text copies of web pages. Information seekers could also browse the directory instead of doing a keyword-based search.

In 1996, Netscape was looking to give a single search engine an exclusive deal as the featured search engine on Netscape’s web browser. There was so much interest that Netscape instead struck deals with five of the major search engines: for $5 million a year, each search engine would be in rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.[8][9]

Google adopted the idea of selling search terms in 1998 from a small search engine company named goto.com. This move had a significant effect on the search engine business, which went from struggling to one of the most profitable businesses on the Internet. Search engines were also known as some of the brightest stars in the Internet investing frenzy of the late 1990s.[10] Several companies entered the market spectacularly, receiving record gains during their initial public offerings. Some, such as Northern Light, have since taken down their public search engines and market enterprise-only editions. Many search engine companies were caught up in the dot-com bubble, a speculation-driven market boom that peaked in 1999 and ended in 2001.

Around 2000, Google’s search engine rose to prominence.[11] The company achieved better results for many searches with an innovation called PageRank, as explained in the paper Anatomy of a Search Engine written by Sergey Brin and Larry Page, the eventual founders of Google.[12] This iterative algorithm ranks web pages based on the number and PageRank of other web sites and pages that link there, on the premise that good or desirable pages are linked to more than others (a toy sketch of the iteration appears at the end of this section). Google also maintained a minimalist interface to its search engine; in contrast, many of its competitors embedded a search engine in a web portal. The Google search engine became so popular, in fact, that spoof engines emerged, such as Mystery Seeker.

By 2000, Yahoo! was providing search services based on Inktomi’s search engine. Yahoo! acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in 2003. Yahoo! relied on Google’s search engine until 2004, when it launched its own search engine based on the combined technologies of its acquisitions.

Microsoft first launched MSN Search in the fall of 1998 using search results from Inktomi. In early 1999 the site began to display listings from Looksmart, blended with results from Inktomi; for a short time in 1999, MSN Search used results from AltaVista instead. In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler (called msnbot). Microsoft’s rebranded search engine, Bing, was launched on June 1, 2009. On July 29, 2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search would be powered by Microsoft Bing technology.
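To make the PageRank idea concrete, here is a minimal sketch of the iteration on a toy three-page link graph, written in Python. The damping factor, graph, and fixed iteration count are illustrative assumptions, not Google’s production algorithm.

```python
# Minimal sketch of PageRank power iteration on a toy link graph
# (hypothetical pages and damping factor; not Google's production code).
damping = 0.85
links = {           # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}  # start with uniform rank

for _ in range(50):  # iterate until the ranks (approximately) converge
    new_rank = {}
    for p in pages:
        # rank flowing into p from every page q that links to p,
        # shared equally among q's outgoing links
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print(rank)  # "C" ends up highest: both "A" and "B" link to it
```

Each pass redistributes rank along the links; after enough passes the values stop changing, and the page with the most (and best-ranked) inbound links ends up ranked highest.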

24.2 How web search engines work

A search engine maintains the following processes in near real time:

1. Web crawling (see the sketch after this list)

2. Indexing

3. Searching[13]
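A toy sketch of the crawling stage may help fix ideas. In the Python sketch below, an in-memory dictionary stands in for HTTP fetches, and all URLs and names are hypothetical; a real spider would fetch pages over HTTP and consult robots.txt, as described next.

```python
# Toy breadth-first crawl over a simulated in-memory "web".
# The `web` dictionary stands in for HTTP fetches; URLs are hypothetical.
from collections import deque

web = {  # url -> (page text, outgoing links)
    "http://example.org/":  ("home page", ["http://example.org/a",
                                           "http://example.org/b"]),
    "http://example.org/a": ("page a",    ["http://example.org/"]),
    "http://example.org/b": ("page b",    []),
}

def crawl(seed):
    """Visit every page reachable from `seed` once; return url -> text."""
    queue, seen, store = deque([seed]), {seed}, {}
    while queue:
        url = queue.popleft()
        text, outlinks = web[url]   # a real spider would fetch over HTTP
                                    # and check robots.txt first
        store[url] = text           # hand the content to the indexer
        for link in outlinks:
            if link not in seen:    # never enqueue a page twice
                seen.add(link)
                queue.append(link)
    return store

print(crawl("http://example.org/"))  # all three pages, each fetched once
```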

Web search engines get their information by web crawling from site to site. The “spider” checks for the standard filename robots.txt, addressed to it, before sending certain information back to be indexed, depending on many factors, such as the titles, page content, JavaScript, Cascading Style Sheets (CSS), and headings, as evidenced by the standard HTML markup of the informational content, or its metadata in HTML meta tags.

Indexing means associating words and other definable tokens found on web pages to their domain names and HTML-based fields (a toy inverted-index sketch follows this passage). The associations are made in a public database, made available for web search queries. A query from a user can be a single word. The index helps find information relating to the query as quickly as possible.[13] Some of the techniques for indexing and caching are trade secrets, whereas web crawling is a straightforward process of visiting all sites on a systematic basis.

Between visits by the spider, the cached version of a page (some or all of the content needed to render it) stored in the search engine’s working memory is quickly sent to an inquirer. If a visit is overdue, the search engine can act as a web proxy instead; in that case the page may differ from the search terms indexed.[13] The cached page holds the appearance of the version whose words were indexed, so a cached version of a page can be useful when the actual page has been lost; this situation is also considered a mild form of linkrot.

Typically, when a user enters a query into a search engine, it is a few keywords.[14] The index already has the names of the sites containing the keywords, and these are instantly obtained from the index. The real processing load is in generating the web pages that make up the search results list: every page in the entire list must be weighted according to information in the indexes.[13] The top search result item then requires the lookup, reconstruction, and markup of the snippets showing the context of the keywords matched. These are only part of the processing each search results web page requires, and further pages (next to the top) require more of this post-processing.

Beyond simple keyword lookups, search engines offer their own GUI- or command-driven operators and search parameters to refine the search results. These provide the necessary controls for the feedback loop users create by filtering and weighting while refining the results, given the initial pages of the first search results.
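Continuing the sketch, the indexing and searching stages can be illustrated the same way: a hypothetical tokenizer fills an inverted index mapping each word to the set of pages that contain it, and a keyword query is answered by intersecting those sets. This is only a minimal illustration of the idea, not any engine’s actual index layout.

```python
# Toy indexing and keyword lookup: an inverted index maps each word
# to the set of pages containing it. Corpus and tokenizer are
# illustrative assumptions.
import re
from collections import defaultdict

pages = {
    "http://example.org/a": "Search engines crawl the web and index pages.",
    "http://example.org/b": "An inverted index maps each word to pages.",
    "http://example.org/c": "Crawlers check robots.txt before fetching pages.",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

index = defaultdict(set)            # word -> set of URLs
for url, text in pages.items():
    for word in tokenize(text):
        index[word].add(url)

def search(query):
    """Return the pages containing every keyword in the query."""
    terms = tokenize(query)
    results = index.get(terms[0], set()).copy() if terms else set()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

print(search("index pages"))        # matches pages a and b
```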

[Figure: High-level architecture of a standard Web crawler. A scheduler feeds URLs from a queue to a multi-threaded downloader; the downloader fetches Web pages from the World Wide Web, sends text and metadata to storage, and returns newly discovered URLs to the queue.]

For example, since 2007 the Google.com search engine has allowed one to filter by date by clicking “Show search tools” in the leftmost column of the initial search results page and then selecting the desired date range.[15] It is also possible to weight by date, because each page has a modification time.

Most search engines support the use of the boolean operators AND, OR and NOT to help end users refine the search query (a set-based sketch of these operators appears at the end of this section). Boolean operators are for literal searches that allow the user to refine and extend the terms of the search; the engine looks for the words or phrases exactly as entered. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords.[13] There is also concept-based searching, where the research involves using statistical analysis on pages containing the words or phrases you search for. As well, natural language queries allow the user to type a question in the same form one would ask it of a human;[16] a site like this is ask.com.[17]

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results so as to provide the “best” results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another.[13] The methods also change over time as Internet usage changes and new techniques evolve.

Two main types of search engine have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively; the other is a system that generates an “inverted index” by analyzing the texts it locates. The second form relies much more heavily on the computer itself to do the bulk of the work.

Most web search engines are commercial ventures supported by advertising revenue, and thus some of them allow advertisers to have their listings ranked higher in search results for a fee. Search engines that do not accept money for their search results make money by running search-related ads alongside the regular search engine results. The search engines make money every time someone clicks on one of these ads.[18]
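The boolean operators described above map naturally onto set algebra over an inverted index. The following minimal sketch (with hypothetical index contents) shows AND, OR and NOT as set intersection, union, and complement:

```python
# Boolean operators as set algebra over a toy inverted index.
# The index contents below are illustrative assumptions.
index = {
    "apple":  {"page1", "page2", "page4"},
    "banana": {"page2", "page3"},
    "cherry": {"page3", "page4"},
}
all_pages = {"page1", "page2", "page3", "page4"}

def AND(a, b):  return a & b           # pages matching both terms
def OR(a, b):   return a | b           # pages matching either term
def NOT(a):     return all_pages - a   # pages not matching the term

# "apple AND NOT banana": pages mentioning apple but not banana
print(AND(index["apple"], NOT(index["banana"])))   # {'page1', 'page4'}
```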

24.3 Market share

Google is the world’s most popular search engine, with a market share of 75.97 percent as of December 2016.[19] The world’s most popular search engines (those with more than 1% market share) are:

24.3.1 East Asia and Russia

In some East Asian countries and Russia, Google is not the most popular search engine, since its algorithmic search applies regional filtering and hides most results. Yandex commands a market share of 61.9 percent in Russia, compared to Google’s 28.3 percent.[20] In China, Baidu is the most popular search engine.[21] South Korea’s homegrown search portal, Naver, is used for 70 percent of online searches in the country.[22] Yahoo! Japan and Yahoo! Taiwan are the most popular avenues for internet search in Japan and Taiwan, respectively.[23]

24.4 Search engine bias

Although search engines are programmed to rank websites based on some combination of their popularity and relevancy, empirical studies indicate various political, economic, and social biases in the information they provide[24][25] and in the underlying assumptions about the technology.[26] These biases can be a direct result of economic and commercial processes (e.g., companies that advertise with a search engine can also become more popular in its organic search results) and of political processes (e.g., the removal of search results to comply with local laws).[27] For example, Google will not surface certain Neo-Nazi websites in France and Germany, where Holocaust denial is illegal.

Biases can also be a result of social processes, as search engine algorithms are frequently designed to exclude non-normative viewpoints in favor of more “popular” results.[28] The indexing algorithms of major search engines skew towards coverage of U.S.-based sites rather than websites from non-U.S. countries.[25] Google bombing is one example of an attempt to manipulate search results for political, social or commercial reasons.

Several scholars have studied the cultural changes triggered by search engines,[29] and the representation of certain controversial topics in their results, such as terrorism in Ireland[30] and conspiracy theories.[31]

24.5 Customized results and filter bubbles

Many search engines, such as Google and Bing, provide customized results based on the user’s activity history. This leads to an effect that has been called a filter bubble. The term describes a phenomenon in which websites use algorithms to selectively guess what information a user would like to see, based on information about the user (such as location, past click behaviour and search history). As a result, websites tend to show only information that agrees with the user’s past viewpoint, effectively isolating the user in a bubble that tends to exclude contrary information. Prime examples are Google’s personalized search results and Facebook's personalized news stream.

According to Eli Pariser, who coined the term, users get less exposure to conflicting viewpoints and are isolated intellectually in their own informational bubble. Pariser related an example in which one user searched Google for “BP” and got investment news about British Petroleum, while another searcher got information about the Deepwater Horizon oil spill; the two search results pages were “strikingly different”.[32][33][34] The bubble effect may have negative implications for civic discourse, according to Pariser.[35] Since this problem was identified, competing search engines have emerged that seek to avoid it by not tracking or “bubbling” users, such as DuckDuckGo. Other scholars do not share Pariser’s view, finding the evidence in support of his thesis unconvincing.[36]

24.6 Christian, Islamic and Jewish search engines

The global growth of the Internet and electronic media in the Arab and Muslim world during the last decade has encouraged Islamic adherents in the Middle East and the Asian subcontinent to attempt their own search engines: filtered search portals that enable users to perform safe searches.

Going beyond the usual safe-search filters, these Islamic web portals categorize websites as being either “halal” or “haram”, based on a modern, expert interpretation of Islamic law. I’mHalal came online in September 2011, and Halalgoogling came online in July 2013; these apply haram filters to the collections from Google, Bing, and others.[37]

While a lack of investment and the slow pace of technological development in the Muslim world have hindered progress and thwarted the success of an Islamic search engine targeting Islamic adherents as its main consumers, projects like Muxlim, a Muslim lifestyle site, did receive millions of dollars from investors like Rite Internet Ventures, and it also faltered.

Other religion-oriented search engines are Jewgle, the Jewish version of Google, and SeekFind.org, which is Christian. SeekFind filters sites that attack or degrade their faith.[38]

24.7 Search engine submission

Search engine submission is a process in which a webmaster submits a website directly to a search engine. While submission is sometimes presented as a way to promote a website, it generally is not necessary, because the major search engines use web crawlers that will eventually find most web sites on the Internet without assistance. Webmasters can either submit one web page at a time or submit the entire site using a sitemap, but it is normally only necessary to submit the home page, as search engines are able to crawl a well-designed website. There are two remaining reasons to submit a web site or web page to a search engine: to add an entirely new web site without waiting for a search engine to discover it, and to have a web site’s record updated after a substantial redesign.

Some search engine submission software not only submits websites to multiple search engines, but also adds links to those websites from its own pages. This could appear helpful in increasing a website’s ranking, because external links are one of the most important factors determining a website’s ranking. However, John Mueller of Google has stated that this “can lead to a tremendous number of unnatural links for your site”, with a negative impact on site ranking.[39]

24.8 See also

• Comparison of web search engines
• Information retrieval
• Question answering
• Google effect
• Use of web search engines in libraries
• Semantic Web
• Web development tools
• Search engine manipulation effect

24.9 References

[1] “RFC 812 - NICNAME/WHOIS”. ietf.org.

[2] http://ftp.sunet.se/pub/Internet-documents/matrix/services/KIS-id.txt

[3] “World-Wide Web Servers”. W3.org. Retrieved 2012-05-14.

[4] “What’s New! February 1994”. Home.mcom.com. Retrieved 2012-05-14.

[5] “Internet History - Search Engines” (from Search Engine Watch), Universiteit Leiden, Netherlands, September 2001, web: LeidenU-Archie.

[6] Oscar Nierstrasz (2 September 1993). “Searchable Catalog of WWW Resources (experimental)".

[7] “Archive of NCSA what’s new in December 1993 page”. Web.archive.org. 2001-06-20. Archived from the original on 2001-06-20. Retrieved 2012-05-14.

[8] “Yahoo! And Netscape Ink International Distribution Deal” (PDF)

[9] “Browser Deals Push Netscape Stock Up 7.8%". Los Angeles Times. 1 April 1996

[10] Gandal, Neil (2001). “The dynamics of competition in the internet search engine market”. International Journal of Industrial Organization. 19 (7): 1103–1117. doi:10.1016/S0167-7187(01)00065-0.

[11] “Our History in depth”. W3.org. Retrieved 2012-10-31.

[12] Brin, Sergey; Page, Larry. “The Anatomy of a Large-Scale Hypertextual Web Search Engine” (PDF).

[13] Jawadekar, Waman S (2011), “8. Knowledge Management: Tools and Technology”, Knowledge Management: Text & Cases, New Delhi: Tata McGraw-Hill Education Private Ltd, p. 278, ISBN 978-0-07-07-0086-4, retrieved November 23, 2012

[14] Jansen, B. J., Spink, A., and Saracevic, T. 2000. Real life, real users, and real needs: A study and analysis of user queries on the web. Information Processing & Management. 36(2), 207-227.

[15] Chitu, Alex (August 30, 2007). “Easy Way to Find Recent Web Pages”. Google Operating System. Retrieved 22 February 2015.

[16] "Versatile question answering systems: seeing in synthesis", Mittal et al., IJIIDS, 5(2), 119-142, 2011.

[17] http://www.ask.com. Retrieved 10 September 2015.

[18] “FAQ”. RankStar. Retrieved 19 June 2013.

[19] “Desktop Search Engine Market Share”. NetMarketShare. Retrieved 30 December 2016.

[20] “Live Internet - Site Statistics”. Live Internet. Retrieved 2014-06-04.

[21] Arthur, Charles (2014-06-03). “The Chinese technology companies poised to dominate the world”. The Guardian. Retrieved 2014-06-04.

[22] “How Naver Hurts Companies’ Productivity”. The Wall Street Journal. 2014-05-21. Retrieved 2014-06-04.

[23] “Age of Internet Empires”. Oxford Internet Institute. Retrieved 2014-06-04.

[24] Segev, Elad (2010). Google and the Digital Divide: The Biases of Online Knowledge. Oxford: Chandos Publishing.

[25] Vaughan, Liwen; Mike Thelwall (2004). “Search engine coverage bias: evidence and possible causes”. Information Processing & Management. 40 (4): 693–707. doi:10.1016/S0306-4573(03)00063-3.

[26] Jansen, B. J. and Rieh, S. (2010). The Seventeen Theoretical Constructs of Information Searching and Information Retrieval. Journal of the American Society for Information Science and Technology. 61(8), 1517-1534.

[27] Berkman Center for Internet & Society (2002), “Replacement of Google with Alternative Search Systems in China: Documentation and Screen Shots”, Harvard Law School.

[28] Introna, Lucas; Helen Nissenbaum (2000). “Shaping the Web: Why the Politics of Search Engines Matters”. The Information Society: An International Journal. 16 (3). doi:10.1080/01972240050133634.

[29] Hillis, Ken; Petit, Michael; Jarrett, Kylie (2012-10-12). Google and the Culture of Search. Routledge. ISBN 9781136933066.

[30] Reilly, P. (2008-01-01). Spink, Prof Dr Amanda; Zimmer, Michael, eds. ‘Googling’ Terrorists: Are Northern Irish Terrorists Visible on Internet Search Engines?. Information Science and Knowledge Management. Springer Berlin Heidelberg. pp. 151–175. doi:10.1007/978-3-540-75829-7_10. ISBN 978-3-540-75828-0.

[31] Ballatore, A. “Google chemtrails: A methodology to analyze topic representation in search engines”. First Monday.

[32] Parramore, Lynn (10 October 2010). “The Filter Bubble”. The Atlantic. Retrieved 2011-04-20. Since Dec. 4, 2009, Google has been personalized for everyone. So when I had two friends this spring Google “BP,” one of them got a set of links that was about investment opportunities in BP. The other one got information about the oil spill....

[33] Weisberg, Jacob (10 June 2011). “Bubble Trouble: Is Web personalization turning us into solipsistic twits?". Slate. Retrieved 2011-08-15.

[34] Gross, Doug (May 19, 2011). “What the Internet is hiding from you”. CNN. Retrieved 2011-08-15. I had friends Google BP when the oil spill was happening. These are two women who were quite similar in a lot of ways. One got a lot of results about the environmental consequences of what was happening and the spill. The other one just got investment information and nothing about the spill at all.

[35] Zhang, Yuan Cao; Séaghdha, Diarmuid Ó; Quercia, Daniele; Jambor, Tamas (February 2012). “Auralist: Introducing Serendipity into Music Recommendation” (PDF). ACM WSDM.

[36] O'Hara, K. (2014-07-01). “In Worship of an Echo”. IEEE Internet Computing. 18 (4): 79–83. doi:10.1109/MIC.2014.71. ISSN 1089-7801.

[37] “New Islam-approved search engine for Muslims”. News.msn.com. Retrieved 2013-07-11.

[38] “Halalgoogling: Muslims Get Their Own “sin free” Google; Should Christians Have Christian Google? - Christian Blog”. Christian Blog.

[39] Schwartz, Barry (2012-10-29). “Google: Search Engine Submission Services Can Be Harmful”. Search Engine Roundtable. Retrieved 2016-04-04.

24.10 Further reading

• Steve Lawrence; C. Lee Giles (1999). “Accessibility of information on the web”. Nature. 400 (6740): 107–9. doi:10.1038/21987. PMID 10428673.

• Bing Liu (2007), Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer. ISBN 3-540-37881-2

• Bar-Ilan, J. (2004). The use of Web search engines in information science research. ARIST, 38, 231-288.
• Levene, Mark (2005). An Introduction to Search Engines and Web Navigation. Pearson.

• Hock, Randolph (2007). The Extreme Searcher’s Handbook. ISBN 978-0-910965-76-7
• Javed Mostafa (February 2005). “Seeking Better Web Searches”. Scientific American.

• Ross, Nancy; Wolfram, Dietmar (2000). “End user searching on the Internet: An analysis of term pair topics submitted to the Excite search engine”. Journal of the American Society for Information Science. 51 (10): 949–958. doi:10.1002/1097-4571(2000)51:10<949::AID-ASI70>3.0.CO;2-5.
• Xie, M.; et al. (1998). “Quality dimensions of Internet search engines”. Journal of Information Science. 24 (5): 365–372. doi:10.1177/016555159802400509.
• Information Retrieval: Implementing and Evaluating Search Engines. MIT Press. 2010.

24.11 External links

• Search Engines at DMOZ

24.12 Text and image sources, contributors, and licenses

24.12.1 Text • Information extraction Source: https://en.wikipedia.org/wiki/Information_extraction?oldid=761844477 Contributors: The Anome, Ed- ward, Michael Hardy, Kku, MichaelJanich, Ronz, Geraki, Dbabbitt, Owen, Phil Boswell, Dmolla, Solipsist, Beland, Robertbowerman, Mike Schwartz, Leondz, ScottDavis, Bkkbrad, Gmelli, Intgr, Spencerk, Sfrancoeur, Cedar101, Dongxun~enwiki, SmackBot, Mladi- filozof, JonHarder, Natecull, Will Beback, Kuru, Dreftymac, Searchtools, MaxEnt, Alexander Wilks, JAnDbot, The Transhumanist, David Eppstein, Ronbarak, DomBot, Falazar, Francesco sclano, Nevalicori, Jamelan, HamishCunningham, Sebastjanmm, Gdupont, Alle- borgoBot, Omerod, Icognitiva, Dasterner, Jojalozzo, CharlesGillingham, Disooqi, Bidoll, Plastikspork, Niceguyedc, Dtunkelang, Pixel- Bot, Pablomendes, Ost316, Duffbeerforme, Texterp, Addbot, Belmond, OlEnglish, Incola, Yobot, Fraggle81, Tiffany9027, George1975, Fran.sansalone, SebastianHellmann, Al Maghi, FrescoBot, Mark Renier, Hosszuka, Jandalhandler, Supersun511, Khazakistyle, Bangla11, John of Reading, Animorphus, Lawrykid, DaTribe, Shaalank, ClueBot NG, Lawrence87, Pushpinder12, Astronautguo, Rubengra, DBigXray, BG19bot, Yiyeguhu, Lucyinthesky45, Khazar2, Dexbot, Pintoch, Brandon Bertelsen, Me, Myself, and I are Here, Robyvd, Lisa Beck, Aasasd, Hajasu, Preetansh9, H.dryad, Daniel kenneth, Blane from Cinbar and Anonymous: 88 • Named-entity recognition Source: https://en.wikipedia.org/wiki/Named-entity_recognition?oldid=761065837 Contributors: Paul A, Ronz, Jogloran, Dmolla, T0m, Macrakis, Beland, Powdahound, Echuck215, Leondz, Sandius, Bkkbrad, Apokrif, Simsong, Qwertyus, Rjwilmsi, Feydey, Dmccreary, Spencerk, Msbmsb, RussBot, Rjlabs, Tony1, Cedar101, Gabr~enwiki, Moquist, Ckatz, Chwalker, Cm- drObot, Megannnn, Chrisahn, Cs california, Kevin.cohen, MER-C, Activelink, Cander0000, Jfroelich, Francis Tyers, Ttague, Erikt~enwiki, Pythonner, Davidmakovoz, Rhhender, Synthebot, Legoktm, Chaotix63, Icognitiva, Jojalozzo, Strife911, JBrookeAker, UKoch, Schreiber- Bike, Carriearchdale, Mpawel, Texterp, Addbot, Favonian, Luckas-bot, Yobot, Themfromspace, Ptbotgourou, Sumail, AnomieBOT, Sdmonroe, Brightgalrs, Vuongvina, FrescoBot, Nainawalli, Kwiki, DrilBot, Danyaljj, Mean as custard, EmausBot, Goldwas1, ZéroBot, Iropark, Shaalank, ClueBot NG, Karkand23, Ngocminh.oss, Ae David, BG19bot, Compfreak7, Ratinov, ChrisGualtieri, TextMech, Pe- tecog, Melonkelon, Ocky7, Monkbot, Fake ones, Brokkolie, Iwasaki hirofumi, Hughesonline, Lemborio, ChrisManning and Anonymous: 101 • Part-of-speech tagging Source: https://en.wikipedia.org/wiki/Part-of-speech_tagging?oldid=756514674 Contributors: Stevertigo, Michael Hardy, Kku, Dino, Furrykef, Arkuat, Babbage, BenFrantzDale, Ds13, Khalid hassani, Dfrankow, Beland, Cagri, Venu62, Vacindak, 4pq1injbok, Rama, Cmdrjameson, Grutness, Facopad, Woohookitty, Mumpitz~enwiki, FBarber, Marudubshinki, Qwertyus, Koavf, Dmccreary, Hermione1980, Brendan642, Sderose, Spencerk, Msbmsb, Wavelength, Taejo, Philopedia, Ritchy, Bkil, Thnidu, Closed- mouth, Kostmo, JorisvS, IvanLanin, Cydebot, Skittleys, Cs california, Qwyrxian, Handicapper, PhilKnight, Magioladitis, Brett, Mar- tinBot, R'n'B, Francis Tyers, Katalaveno, AntiSpamBot, Serge925, Bonadea, Soshial, Sandman2007, TXiKiBoT, Paladin Artix, Tlieu, Enviroboy, RaseaC, Kehrbykid, Legoktm, Matt Gerber, SieBot, Poi dog pondering, AlanUS, CharlesGillingham, NastalgicCam, ClueBot, Mild Bill Hiccup, UKoch, Goodvac, DumZiBoT, Addbot, Fluffernutter, Yobot, TaBOT-zerem, Legobot II, THEN WHO WAS PHONE?, 
AnomieBOT, DemocraticLuntz, Sdmonroe, Jim1138, Maxis ftw, Xqbot, Farazv, 10metreh, EmausBot, Yatsko, LWG, Mozzy66, Elaz85, ClueBot NG, Kasirbot, Granadajose, Karkand23, Verhoevenben, BG19bot, Soheila 3155, Chmarkine, Davebs, Murhaff, ChrisGualtieri, YFdyh-bot, Me, Myself, and I are Here, Terrance26, Loraof, LingLass and Anonymous: 87 • Phrase chunking Source: https://en.wikipedia.org/wiki/Phrase_chunking?oldid=574661004 Contributors: Thorwald, Spencerk, Wave- length, SmackBot, Francesco sclano, Legoktm, CharlesGillingham, Dana boomer, Lam Kin Keung and Anonymous: 2 • Relationship extraction Source: https://en.wikipedia.org/wiki/Relationship_extraction?oldid=723833388 Contributors: Kku, Scott, Sandius, Gmelli, RussBot, Breno, Legoktm, SimonTrew, Rcartic, DumZiBoT, DOI bot, Yobot, Fortdj33, Rlistou, Citation bot 1, RjwilmsiBot, Dcirovic and Anonymous: 4 • Sentence boundary disambiguation Source: https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation?oldid=751822675 Con- tributors: Benwing, DragonflySixtyseven, TreveX, BD2412, Sderose, Spencerk, Cedar101, MER-C, Dlwh, Musically ut, Truthanado, Legoktm, Ve4ernik, Niceguyedc, AnomieBOT, Unready, Dcirovic, Shaddim, Karkand23, BonzaiThePenguin, Flinter and Anonymous: 10 • Shallow parsing Source: https://en.wikipedia.org/wiki/Shallow_parsing?oldid=721226487 Contributors: Michael Hardy, Kku, Ronz, Charles Matthews, Grm wnr, Rama, AdamAtlas, Jonsafari, Ish ishwar, Banazir, Spencerk, SmackBot, Uthbrian, Aranduil, Alaibot, Head- bomb, Francis Tyers, Jamelan, Legoktm, Niceguyedc, Addbot, Yobot, GrouchoBot, Piotrks, AJCham, Erik9bot, Jonesey95, RjwilmsiBot, Goldwas1, Karkand23, Ngocminh.oss, Terrance26, Robo-Kyon, Raidex.sym, Rilinger and Anonymous: 11 • Stemming Source: https://en.wikipedia.org/wiki/Stemming?oldid=763558125 Contributors: Maury Markowitz, Mrwojo, Edward, Michael Hardy, Zeno Gantner, Nohat, Altenmann, Babbage, Stewartadcock, KellyCoinGuy, Diberri, Macrakis, Gdm, Beland, Urhixidur, Kurisu, Rich Farmbrough, ESkog, Kwamikagami, Aaronbrick, Spalding, Jonsafari, Shabble, Gothick, Ruud Koot, Byronknoll, Gwil, Qwer- tyus, Rjwilmsi, Salix alba, Mazzmn, Nihiltres, Mendicott, Fmccown, 2over0, SmackBot, Moralis, Sundaryourfriend, Ewok Slayer, Mirokado, Acmeacme, Salamurai, Nemonemo~enwiki, CapitalR, ILikeThings, CRGreathouse, Searchtools, Cs california, Malleus Fatuo- rum, Thijs!bot, Plausible deniability, Vtcondo, KP Botany, Alphachimpbot, JAnDbot, David Eppstein, Jim Carnicelli, R'n'B, Jfroelich, TottyBot, ChesMartin, Maghnus, Legoktm, AlanUS, Disooqi, Vacio, Mild Bill Hiccup, Ray3055, Xodarap00, Addbot, Lon of Oakdale, Halloleo, Lightbot, Teles, Moderngirllive, Luckas-bot, Yobot, AnomieBOT, Xqbot, Kracekumar, FrescoBot, Felix.middendorf, Dian- naa, Luismsgomes, Uanfala, EmausBot, John of Reading, Yatsko, Donner60, Sotnyk, ClueBot NG, Stevenxlead, Frietjes, DBigXray, Doszkocs, BG19bot, Tensorylabs, BattyBot, Khazar2, Xmu YHLiu, Lade271, Faizan, SpaceScape, Cato The Censor, Nahden, Fj2c, JonaathanKatz and Anonymous: 81 • Text segmentation Source: https://en.wikipedia.org/wiki/Text_segmentation?oldid=744684004 Contributors: Fnielsen, BenKovitz, Bab- bage, Jorge Stolfi, Scode, Serapio, Querent, Statusquo, Leondz, Woohookitty, Ruud Koot, BD2412, Rjwilmsi, Thangalin, Spencerk, Trondtr, Daniel Mietchen, Tony1, Sandwich, SmackBot, Took, Nbarth, Whomp, IvanLanin, Jausel, Alexamies, Alaibot, Cs cali- fornia, David Eppstein, Soshial, Rei-bot, Jamelan, Cnilep, Legoktm, Mark l watson, Niceguyedc, DragonBot, Leontios, PixelBot, Addbot, Quercus 
solaris, Jarble, Legobot II, , VernoWhitney, Born2bgratis, Helpful Pixie Bot, Hieukieng~enwiki, Impsswoon, Grantmjenks, Metasyn and Anonymous: 18 • Tokenization (lexical analysis) Source: https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)?oldid=759379009 Contributors: Beorhast, Alvestrand, Rich Farmbrough, Foolip, Sleske, Leondz, Sderose, Mahahahaneapneap, Malcolma, Laser2k, King Mir, Ulfmatts- son, Maghnus, DonBarredora, Legoktm, BearMachine, Chininazu12, Addbot, Luckas-bot, Yobot, AnomieBOT, Artnowo, Suasysar, Ben- zolBot, LittleWink, Yacht Travler, Wei2912, BG19bot, Fadirra, ChrisGualtieri, StevenRRusso, Yanis ahmed, Impsswoon, Hughesonline, Yijisoo and Anonymous: 22 24.12. TEXT AND IMAGE SOURCES, CONTRIBUTORS, AND LICENSES 121

• Parsing Source: https://en.wikipedia.org/wiki/Parsing?oldid=759112319 Contributors: Damian Yerrick, Vicki Rosenzweig, The Anome, -Vt-aoe, Jaredwf, Shoesfullofdust, Mu ,דוד ,K.lee, Michael Hardy, TakuyaMurata, CesarB, Ahoerstemeier, Mac, Glenn, Ralesk, Furrykef lukhiyya, Martin Hampl~enwiki, Pps, Tea2min, Giftlite, Tom harrison, Dmb000006, Jason Quinn, Macrakis, Neilc, Gadfium, Beland, MarkSweep, Billposer, Irrelevant, Paulbmann, Zeman, Sylvain Schmitz~enwiki, Spayrard, Liberatus, Richard W.M. Jones, Bobo192, Slomo~enwiki, Larryv, Zetawoof, Jonsafari, Obradovic Goran, Alansohn, Liao, Seans Potato Business, Chrisjohnson, Pekinensis, Ruud Koot, Slike2, Knuckles, Wikiklrsc, BD2412, Knudvaneeden, Nekobasu, MarSch, Jimworm, Salix alba, RobertG, Gurch, Quuxplusone, GreyCat, BradBeattie, Chobot, Hairy Dude, Arado, Epolk, Van der Hoorn, Yrithinnd, Mccready, Jstrater, Mikeblas, Maunus, Cadillac, Pietdesomere, Modify, TuukkaH, SmackBot, NickyMcLean, Sebesta, Gilliam, Chris the speller, Nbarth, Sephiroth BCR, Fchaumartin, Nixeagle, Sommers, SundarBot, Flyguy649, Downwards, Andrei Stroe, Derek farn, John, HVS, Ourai, SilkTork, Catapult, IronGargoyle, Pelzi~enwiki, 16@r, Rkmlai, Alatius, MrDolomite, Clarityfiend, Paul Foxworthy, RekishiEJ, Vanisaac, FatalError, Ahy1, E-boy, Devin- Cook, MarsRover, HenkeB, Myasuda, Cydebot, Valodzka, Agentilini, Farzaneh, Msnicki, QTJ, 218, Zalgo, Thijs!bot, JEBrown87544, AntiVandalBot, Seaphoto, FedericoMP, Natelewis, Hermel, JAnDbot, MER-C, Gavia immer, AlmostReadytoFly, Tedickey, Usien6, Aakin, Cic, Bubba hotep, CapnPrep, Nicsterr, Scruffy323, Nathanfunk, Mmh, Dantek, W3stfa11, Magicjazz, Cometstyles, Kjmjds, DorganBot, Brvman, CardinalDan, Vipinhari, Pulsar.co.nr, Technopat, Sylvia.wei, Lbmarshall, DonBarredora, SieBot, YoCmos~enwiki, Timhowardriley, Til Eulenspiegel, Flyer22 Reborn, Diego Grez-Cañete, AlanUS, FghIJklm, Gabilaw, Denisarona, Stokito, Benoît Sagot, Thisisnotapipe, DragonBot, Estirabot, Nikolasmorton, Roxy the dog, AngelHaf, Libcub, Addbot, Download, Dougbast, OlEnglish, Jar- ble, TaBOT-zerem, Wonderfl, LucidFool, AnomieBOT, Jim1138, Пика Пика, Materialscientist, Citation bot, Xtremejames183, Xqbot, St.nerol, J04n, WordsAndNumbers, Thehelpfulbot, FrescoBot, Borsotti, OgreBot, Robinlrandall, I dream of horses, HRoestBot, Romek1, Lotje, Callanecc, Ruebencampbell, Spakin, RjwilmsiBot, GDBarry, Garfieldnate, J36miles, EmausBot, Oliverlyc, WikitanvirBot, Eek- erz, ZxxZxxZ, Hpvpp, Sirthias, Ashowen1701, Peterh5322, Jeroendv, Donner60, MainFrame, Sebbes333, ChuispastonBot, ClueBot NG, Satellizer, Jeana7, Frietjes, DBSand, MerlIwBot, Leonxlin, Architectual, Claytoncarney, ChrisGualtieri, Steamerandy, Pintoch, Samlan- ning, JohnZofSydney, Watchforever, YarLucebith, Romavikt, AddWittyNameHere, Akshaynexus, Rooke, Aasasd, Equinox, ProprioMe OW, Wrath abyss, Imadeluz, Bender the Bot and Anonymous: 244 • Parse tree Source: https://en.wikipedia.org/wiki/Parse_tree?oldid=759757439 Contributors: Bryan Derksen, The Anome, Smelialichu, Cadr, Emperorbma, Dysprosia, Fredrik, Pps, Ruakh, Beorhast, Tamur, Giftlite, Falstaft, BenFrantzDale, Beland, Ganymead, Mathi- asl26, Spayrard, Zscout370, Jonsafari, Alansohn, RJFJR, Angr, LOL, Qwertyus, Chobot, YurikBot, Wavelength, RussBot, David Pierce, Modify, Frigoris, Donhalcon, TuukkaH, Frap, Ioscius, Dono, BrainMagMo, Jafet, FatalError, Cydebot, Egriffin, Stannered, Wootery, Ryan Postlethwaite, Alan U. 
Kennington, VanishedUserABC, AlleborgoBot, EmxBot, AHMartin, BotMultichill, Flyer22 Reborn, Bgal- itsky, OKBot, Denisarona, RMFan1, Addbot, Yakiv Gluck, JakobVoss, Luckas-bot, AnomieBOT, Arjun G. Menon, The High Fin Sperm Whale, SassoBot, Fetchmaster, LucienBOT, Tjo3ya, Extra999, Arabismo, EmausBot, Bollyjeff, Chris857, ClueBot NG, Joefromrandb, ChrisGualtieri, Steamerandy, Jochen Burghardt, Theo’s Little Bot, François Robere, Rtran, Buffbills7701, W. P. Uzer, Kimberly.Ling300, Jenny129, Cherisec, Wafa Al-Ali, Johnathan jones, Some Gadget Geek, Bender the Bot and Anonymous: 58 • Constituent (linguistics) Source: https://en.wikipedia.org/wiki/Constituent_(linguistics)?oldid=713560228 Contributors: Fransvannes, Ruakh, GreatWhiteNortherner, Andycjp, Discospinster, Woohookitty, TaivoLinguist, Polina Khabina, Dorothea~enwiki, YurikBot, Russ- Bot, DanMS, Donald Albury, Antonielly, Doug Weller, Garik, Jsteph, Thijs!bot, JustAGal, Comhreir, Wizymon, Anaxial, Dbraasch, VolkovBot, Toddy1, Ddxc, Alexbot, Darkicebot, Addbot, LaaknorBot, Zorrobot, Zien3, Luckas-bot, Yobot, Fraggle81, 4th-otaku, AnomieBOT, Rjanag, 90 Auto, MauritsBot, HRoestBot, Tjo3ya, Skomakar'n, EmausBot, Socialservice, ClueBot NG, Pacerier, Stjep, Dnmaxwell, Sykling, YiFeiBot, W. P. Uzer, Chelseanne and Anonymous: 30 • Dependency grammar Source: https://en.wikipedia.org/wiki/Dependency_grammar?oldid=743621571 Contributors: Michael Hardy, Peak, Waltpohl, Jason Quinn, Jonsafari, Linas, Uncle G, Pitan, Chobot, YurikBot, Tony1, Bbenzon, Trickstar, Gilliam, RichardHudson, Byelf2007, U-571, Chenli, Alaibot, JamesAM, Informatician, MezzoMezzo, Linguistlist, Ddxc, UKoch, MystBot, Addbot, Metavivo, Yobot, Pcap, KamikazeBot, AnomieBOT, JackieBot, The Wiki ghost, Attardi, FrescoBot, Tjo3ya, Kielbasa1, Arabismo, Ripchip Bot, EfGee, Cmfraser, John of Reading, Qsdfqsdf, Zuky79, ClueBot NG, Fortelle65, Frietjes, BG19bot, Pacerier, Whym, Dnmaxwell, Erans- gran, GabeIglesia, MaiyaH78, Jamesmcmahon0, Bohnetbd, Dough34, Christian Nassif-Haynes, Odysseus71, Malves98, Sunmist, JMP EAX, Engulfing and Anonymous: 33 • Phrase structure grammar Source: https://en.wikipedia.org/wiki/Phrase_structure_grammar?oldid=733503443 Contributors: Domi- nus, Pnm, Rp, Kku, Cadr, Burschik, Saccade, Dr Zen, Linas, SmackBot, Eskimbot, Gregbard, Alaibot, Vantelimus, Nxavar, Ganna24, MenoBot, Libcub, Addbot, AlexandrDmitri, Luckas-bot, Yobot, Rubinbot, Oliverbeatson, Erik9bot, Tjo3ya, Kyoakoa, ChrisGualtieri, Mjshusain, Some Gadget Geek and Anonymous: 7 • Verb phrase Source: https://en.wikipedia.org/wiki/Verb_phrase?oldid=763003916 Contributors: Waveguy, AdamRaizen, Angela, Cadr, Dduck, Tallus, Ruakh, Beland, OverlordQ, Burschik, Szyslak, Bobo192, TACD, Alansohn, Thryduulf, Nivix, Malhonen, DTOx, Cac- tus.man, KEJ, Ste1n, PaulGarner, Jcvamp, Wotnarg, Nikkimaria, Pb30, SmackBot, Mm100100, Gilliam, Mazeface, Haplology, Sjock, Ergative rlt, A. 
Parrot, Courcelles, FilipeS, Alaibot, JamesAM, Thijs!bot, Epbr123, Jobber, Bobblehead, Widefox, PhilKnight, Kir- rages, Yehuda Falk, Vokaler, Learningnerd, CapnPrep, R'n'B, Toon05, Juliancolton, RJASE1, VolkovBot, Fences and windows, Ma- linaccier, Hqb, Anna Lincoln, Dendodge, Pishogue, BigDunc, Koldito, Logan, Jauerback, Keilana, BenoniBot~enwiki, Mygerardro- mance, Xiaq, Atif.t2, ClueBot, SuperHamster, GoRight, Estirabot, Thingg, Aitias, SoxBot III, Vinceducut, Jbeans, Addbot, Cuaxdon, Glane23, West.andrew.g, Tide rolls, Luckas-bot, Rjanag, Materialscientist, Cureden, RibotBOT, The Wiki ghost, Ebalder, Griffinofwales, MarkkuP, HRoestBot, Tjo3ya, Richsiffer, Reach Out to the Truth, Shabidoo, Tommy2010, K6ka, Jordantrew, Basketball4998, Osh- iokhaienega, Sonicyouth86, ClueBot NG, Sitka1000, MusikAnimal, Clavaine, Victor Yus, Katealli and Anonymous: 165 • Information retrieval Source: https://en.wikipedia.org/wiki/Information_retrieval?oldid=763392430 Contributors: The Anome, LA2, Marian, Michael Hardy, Kku, Ronz, Notheruser, Nichtich~enwiki, Hike395, Charles Matthews, Nickg, Greenrd, Silvonen, DJ Clay- worth, Espertus, Ggrefenstette, AaronSw, Robbot, Altenmann, Psychonaut, Dmolla, Masao, Smb1001, Enochlau, Giftlite, Christopher Parham, Sepreece, Andris, AlistairMcMillan, Macrakis, SWAdair, Worldguy~enwiki, Decoy, Utcursch, Pgan002, Beland, MarkSweep, ChaTo, Urhixidur, Rich Farmbrough, Rama, Kaisershatner, Flyskippy1, Serapio, Wikinaut, Themindset, Jonsafari, Mdd, Kessler, Stephen Turner, Nkour, MIT Trekkie, Dominik Kuropka~enwiki, Ceyockey, Oleg Alexandrov, Linas, Apokrif, Burkhard~enwiki, Male1979, KKramer~enwiki, Stoni, Graham87, BD2412, Qwertyus, LanguageMan, Rjwilmsi, Gmelli, KYPark, Runarb, Intgr, Sderose, Bmicomp, Planetneutral, Chobot, Vmenkov, Msbmsb, YurikBot, Wavelength, Borgx, Laurentius, Waitak, Appler~enwiki, Wimt, Fmccown, Modify, GraemeL, Allens, Tobi Kellner, NeilN, Marregui, DSiv, That Guy, From That Show!, Lwives~enwiki, Chrissi~enwiki, Unyoyega, Eskim- bot, Pfaff9, London25, Bluebot, JackyR, EncMstr, Nbarth, Srchvrs, AntiVan, Gabr~enwiki, JonHarder, Cache22, Vina-iwbot~enwiki, Spiritia, NewTestLeper79, ThomasHofmann, Accurizer, Ckatz, Clark Mobarry, Packerliu, RichardF, SimonD, B7T, GerryWolff, Cm- drObot, Tamarkot, Indigo1300, Myasuda, Krauss, Evenmadderjon, Thijs!bot, Andyjsmith, CharlotteWebb, Niduzzi, AnAj, LazyEditor, 122 CHAPTER 24. WEB SEARCH ENGINE

Clamster5, Barek, The Transhumanist, Sanchom, Ph.eyes, Herr blaschke, Magioladitis, Anþony, Buettcher, U608854, Jodi.a.schneider, Gwern, MartinBot, R'n'B, Jfroelich, Thirdright, Rbrewer42, Mderijke, Theo Mark, Yannick56, Textminer, Falazar, AKA MBG, Neil Dodgson, Dominich01, Funandtrvl, ShahChirag, VolkovBot, Rodrigoluk, Mjbinfo, Bobareann, Drrprasath, Sebastjanmm, Gdupont, PhysPhD, Wavehunter, AlleborgoBot, SieBot, Gorpik, Sonermanc, Tiptoety, Artod, Maynelaw, Disooqi, Pinkadelica, Vanished user qkqknjitkcse45u3, WakingLili, Shodanium, UKoch, Dtunkelang, Erahana, Ray3055, Glendac, Gavinsam1994, Hiemstra, Rainmannn, Drmeier8, Armando49, Johnuniq, Puvar, Boleyn, XLinkBot, Fastily, Chickensquare, SilvonenBot, PL290, DOI bot, Fgnievinski, St73ir, Hdez.maria~enwiki, Erhard002, Tanhabot, OZJ, Josevellezcaldas, MrOllie, Favonian, Torla42, Prashantmore 1, Zorrobot, Johnchal- lis, Yobot, WikiDan61, Ptbotgourou, Anypodetos, AnomieBOT, Rodrigobartels, Ciphers, Citation bot, Devantheryv, Awesomeness, Xqbot, StuffyProf, Ameliablue, Vuongvina, Aragor~enwiki, Rami ghorab, LatentDrK, FrescoBot, Spadarabdon, Trimaine, Mark Renier, Hosszuka, X7q, Moiencore, Citation bot 1, PrincessofLlyr, ErinM, C messier, Eupraxis, Aoidh, Schubi87, Gregman2, Mean as cus- tard, RjwilmsiBot, TigerHokieFan, Helwr, EmausBot, Baseball1015, John of Reading, Zollerriia, Primefac, Riclas, Custard Pie Tarlet, Slightsmile, Pacung, Summertime30, Sue Myburgh, Erianna, Pintaio, ClueBot NG, Marek.rei, Helpful Pixie Bot, Arraycom, Doszkocs, Nigel V Thomas, Eidenberger, Nprieve, ChrisGualtieri, Helensol, TwoTwoHello, Abtin.zo, Hawlkeye1997, Me, Myself, and I are Here, Phamnhatkhanh, Michipedian, Benjamin Großmann, Param Mudgal, FooCow, Somipam r shimray, Kandreyev, MRD2014, Cynulliad, Deepakagrawal075, KasparBot, CAPTAIN RAJU, Akoutsou77, Polm23, Researcher9999, Theodorelaporie, Udomxsor and Anonymous: 279 • Vector space model Source: https://en.wikipedia.org/wiki/Vector_space_model?oldid=744792705 Contributors: Michael Hardy, Kku, Dcljr, Stan Shebs, Jitse Niesen, Gdm, Beland, Thorwald, Rama, Mykhal, ESkog, Aaronbrick, .:Ajvol:., Jonsafari, Gary, Dominik Kuropka~enwiki, Bjh~enwiki, Ruud Koot, GregorB, Qwertyus, LanguageMan, Rjwilmsi, Gmelli, YurikBot, Conscious, Fmccown, Mike Dillon, SmackBot, MalafayaBot, Morecore~enwiki, JohnWhitlock, Stiang, Hankat, Padvi~enwiki, Thijs!bot, Oliver202, Remaire, AnAj, SamatJain, Jone- merson, Destynova, Ezani, Unkx80, Cometstyles, Dominich01, VolkovBot, Amroamroamro, Philip Trueman, InformationSpace, Synthe- bot, Luc.denys, Disooqi, Maxalbanese, Dspattison, UKoch, PixelBot, Sir Tobek, Dwiddows, XLinkBot, Addbot, Favonian, Vuongvina, LatentDrK, Hyju, Suffusion of Yellow, TigerHokieFan, Riclas, Boraas, ZéroBot, Donner60, Tbear1234, PenelopeKit, Justincheng12345- bot, Lxcythian, Biogeographist, SergioJimenez, Alenrooni, Vítor and Anonymous: 54 • Tf–idf Source: https://en.wikipedia.org/wiki/Tf%E2%80%93idf?oldid=762551426 Contributors: Damian Yerrick, AxelBoldt, Fnielsen, Kku, Dcoetzee, Greenrd, Topbanana, Metasquares, Psychonaut, Beorhast, Schmmd, Sepreece, Beland, Sam Hocevar, Urhixidur, Thor- wald, Rich Farmbrough, Rama, Syp, Mkosmul, Araste, Jasonzhuocn, .:Ajvol:., Physicistjedi, Jonsafari, Pearle, CyberSkull, Rickyp, Ruud Koot, Triddle, GregorB, Qwertyus, Nat5an, Rjwilmsi, Jehochman, Winterstein, Sderose, RussBot, Gareth Jones, Mugunth Ku- mar, Fmccown, Thnidu, Jingjun, Cedar101, Mcld, Bluebot, Alexdow, Colonies Chris, Talia ali, P199, Woodshed, Only2sea, Farzaneh, Rkrish67, Yellowdesk, Leedude, 
Pax:Vobiscum, Absurdburger, Unkx80, Ranboii, Jsundram, RichardSocher~enwiki, Amroamroamro, VVVBot, K-nakayama, Disooqi, Melcombe, DonAByrd, Pkalmar, Dsimic, Addbot, DOI bot, Josevellezcaldas, MrOllie, Halloleo, Ebban- dari, O76923, Yobot, SwisterTwister, Eric-Wester, AnomieBOT, Xqbot, GrouchoBot, Tarantulae, Kyng, LatentDrK, Scott A Herbert, Ndudeja, Citation bot 1, Rickyphyllis, Kmels, Thái Nhi, Ursula Huggenbichler, Dinamik-bot, ThaddeusB-public, Ripchip Bot, Kien- jakenobi, Dixtosa, Cskudzu, Tashuhka, Yatsko, Julienhamonic, EdwardLas, Mjbmrbot, Integr8e, ClueBot NG, Mataglap, DrDooBig, Pankajb64, Rezabot, Helpful Pixie Bot, Chafe66, Intervallic, Ibid17, Saturdayswiki, Dexbot, Yissel Espinosa, Yinlongzhao, Monkbot, Kcgoo, CosineP, Velvel2, Xiaoming online, Leopeng1995, Kaleida, Svgspnr, Fmadd, Kakkeshyor and Anonymous: 102 • Synonym Source: https://en.wikipedia.org/wiki/Synonym?oldid=764129490 Contributors: XJaM, Ortolan88, Ben-Zin~enwiki, Dieter Simon, Jaknouse, Stevertigo, DennisDaniels, Patrick, RTC, Michael Hardy, GTBacchus, Cyp, Mac, TUF-KAT, Jebba, Александър, Glenn, Nikai, Raven in Orbit, Hashar, Nohat, Hydnjo, Haukurth, Paul-L~enwiki, Shizhao, Robbot, Psmith, Halthecomputer, Academic Challenger, Borislav, Adam78, Marc Venot, Sethoeph, Aphaia, NeoJustin, Mboverload, Khalid hassani, Jackol, Alexf, Wleman, Gdr, Noe, JoJan, Icairns, Tail, Burschik, Wyllium, Trevor MacInnis, Chepry, Discospinster, Rich Farmbrough, KillerChihuahua, Bender235, Bobo192, Circeus, Richi, Greenleaf~enwiki, Numerousfalx, Nsaa, Jumbuck, Alansohn, Blahma, Duffman~enwiki, AzaToth, Lightdark- ness, Bart133, Velella, Ringbang, Toby D, Abanima, Camw, Tbc2, Macaddct1984, Gimboid13, Stefanomione, HappyCamper, Dou- bleBlue, FlaBot, Ian Pitchford, Alphachimp, Chobot, Deyyaz, Bgwhite, Roboto de Ajvol, YurikBot, RobotE, Lissoy, Stephenb, Mike Young, Dysmorodrepanis~enwiki, Wiki alf, Haoie, Moe Epsilon, Nescio, Siyavash, ArielGold, GrinBot~enwiki, DVD R W, Sintonak.X, SmackBot, Brya, Prodego, Hydrogen Iodide, Bomac, Jacek Kendysz, EncycloPetey, Ricadus, Xaosflux, Gilliam, Ohnoitsjamie, Keegan, LinguistAtLarge, MalafayaBot, Gracenotes, Jahiegel, Crboyer, Evlekis, SofieElisBexter, SashatoBot, Valfontis, Kuru, Mr.K., DIEGO RICARDO PEREIRA, Ckatz, 16@r, InedibleHulk, Nehrams2020, Tawkerbot2, Jh12, Rouseaubade, Cydebot, Eu.stefan, Fifo, Naudefj, Chrislk02, Jalen~enwiki, Zalgo, Epbr123, Olahus, HappyInGeneral, James086, TXiKi, Whoda, AntiVandalBot, Hjherbert~enwiki, Luna Santin, Cchhrriiss, Nancy Vandal, Dreaded Walrus, JAnDbot, Plantsurfer, Andonic, Hut 8.5, Mladen.adamovic, Yahel Guhan, James- BWatson, Singularity, Studios, Lošmi, Bugtrio, Vssun, DerHexer, MartinBot, Arjun01, Anaxial, Hasanisawi, J.delanoy, Pharaoh of the Wizards, Kimse, Trusilver, Belovedfreak, Johnmccrae, NewEnglandYankee, Shoessss, Bonadea, SoCalSuperEagle, 28bytes, Tolone, ABF, Satani, Locamomof5, Philip Trueman, TXiKiBoT, Zidonuke, Asarlaí, Drake Redcrest, Saber girl08, Qxz, Anna Lincoln, Atelaes, RandomXYZb, Rjgodoy, Wolfrock, Carinemily, Synthebot, RaseaC, WatermelonPotion, Weirdalfan1, Newbyguesses, Regregex, Dan Polansky, SieBot, BotMultichill, Gerakibot, Yintan, Jessdingding, Georgi87, Chridd, Rhanyeia, Allmightyduck, Oxymoron83, Antonio Lopez, Nuttycoconut, Crisis, Techman224, Seaniedan, ClueBot, The Thing That Should Not Be, Mild Bill Hiccup, Alexbot, Razor- flame, INTERSTREAMER, Otr500, XLinkBot, SilvonenBot, MystBot, Addbot, Proofreader77, Vakeger~enwiki, Basilicofresco, Willk- ing1979, Lofty2, Betterusername, Download, LaaknorBot, Favonian, 
Quercus solaris, Tide rolls, BrianKnez, Nguyễn Thanh Quang, JackieMoon2, Luckas-bot, KamikazeBot, Las vegas12, Synchronism, AnomieBOT, Sonia, Jim1138, Neptune5000, Glenfarclas, Materi- alscientist, Alexsheksna, Neurolysis, Xqbot, Sionus, Addihockey10, Tomdo08, Omnipaedista, Backpackadam, Mark Schierbecker, Ribot- BOT, Wikieditor1988, Shadowjams, Sesu Prime, 13alexander, LucienBOT, Paine Ellsworth, Wikieditor754, Jamesooders, Pinethicket, DARTH SIDIOUS 2, EmausBot, RA0808, K6ka, Fæ, EdEColbert, Kilopi, Sahim, Donner60, Chuck3r, ClueBot NG, Gareth Griffith- Jones, O.Koslowski, Alexhangartner, Bear030702, Widr, Rkrgwergto, Trans2011, BG19bot, Murphyc65, Hashem sfarim, Bolatbek, Sylvain.maurin, MusikAnimal, Davidiad, Altaïr, Snow Blizzard, YVSREDDY, Verbcatcher, Amitswarup, David.moreno72, The Illu- sive Man, EuroCarGT, JYBot, Cwobeel, Lugia2453, Frosty, Fox2k11, Trollerboi203, Cadillac000, Faizan, Epicgenius, Caveman12, F12X21, Talkjohn, BDawgonnit, Supriya Desai, Everymorning, DavidLeighEllis, Ugog Nizdast, Quenhitran, Jianhui67, AddWittyName- Here, Kwicbaez, Ilopez0000, JaconaFrere, Sherrond28, Scarbom2014, Thewickedkid, KH-1, Silentkhajiit, Hhhhhhhhjjjjjjjj, Supdiop, KasparBot, CLCStudent, Fuortu, Shaneicemaldonado, Es1326, Dusade, Whynot99, WU TAN CLAN FAN and Anonymous: 438 • Relevance Source: https://en.wikipedia.org/wiki/Relevance?oldid=755224399 Contributors: Edward, Ihcoyc, Ahoerstemeier, Scott, Charles Matthews, Hyacinth, Metasquares, Pingveno, Micru, Macrakis, Lucidish, Rich Farmbrough, Pmsyyz, Aecis, EmilJ, Stesmo, Smalljim, Foobaz, Adrian~enwiki, SpeedyGonsales, PWilkinson, Runner1928, John Quiggin, RainbowOfLight, Brookie, Tabletop, Magister Math- ematicae, BD2412, Tommy Kronkvist, FlaBot, Nihiltres, YurikBot, Hairy Dude, RL0919, Roger Lindsay, Paul Erik, GrinBot~enwiki, 24.12. TEXT AND IMAGE SOURCES, CONTRIBUTORS, AND LICENSES 123

SmackBot, McGeddon, WillAndrews, Gilliam, Silly rabbit, Rklawton, JorisvS, Physis, Dreftymac, Mpoulshock, Megatronium, Gregbard, Themightyquill, Thijs!bot, AntiVandalBot, JAnDbot, Ecurrey, Arno Matthias, Father Goose, Cpl Syx, Mcfar54, Dan Pelleg, MartinBot, Arjun01, AstroHurricane001, Yonidebot, SimDarthMaul, Vranak, Zmnsr1, Fences and windows, Nedelisky, Michaeldsuarez, Maran- lar, Neparis, Flyer22 Reborn, JSpung, Mr. Stradivarius, ClueBot, Mike Klaassen, Blanchardb, RenamedUser jaskldjslak903, Awick- ert, Excirial, Jusdafax, PixelBot, BirgerH, Rebele, BarretB, Noctibus, Gunnex, Addbot, Ezekiel 7:19, Ccacsmss, West.andrew.g, Tide rolls, Lightbot, OlEnglish, Zorrobot, Jarble, Luckas-bot, Yobot, THEN WHO WAS PHONE?, AnomieBOT, Jim1138, Materialscien- tist, ArthurBot, Shadowjams, Wissling, Pinethicket, Lotje, Reach Out to the Truth, John of Reading, Tommy2010, Stefania75~enwiki, Bertman3, L Kensington, ClueBot NG, Anmccaff, Fauzan, MerlIwBot, Helpful Pixie Bot, HMSSolent, Leonxlin, Marcocapelle, Brian- condron, Sriharsh1234, New worl, WikiEnthusiastNumberTwenty-Two, Grey.dreyk, Qwertyxp2000, QKsu, XLSXANDER24, Layla, the remover, Lilybaizer, John “Hannibal” Smith, Bender the Bot, -- and Anonymous: 102 • Library and information science Source: https://en.wikipedia.org/wiki/Library_and_information_science?oldid=760445740 Contrib- utors: Ijon, Scott, Discospinster, BDD, Ruud Koot, Quiddity, Kmccook, Bgwhite, Wavelength, Themightyquill, TonyBrooke, R'n'B, Funandtrvl, Niceguyedc, Research84, BirgerH, Royksprekk, SchreiberBike, WikHead, Addbot, Yobot, Amirobot, DisillusionedBit- terAndKnackered, AnomieBOT, Gutam2000, Tomwsulcer, Omnipaedista, ChanakaW, FrescoBot, , Marchitelli, RA0808, Clue- Bot NG, ClaretAsh, Baiget, Widr, Lawsonstu, Strike Eagle, Cseanburns, BG19bot, Mark Arsten, Auteny, Tabrezalamalig, Azad li- brary, Rainanaina, Ghazala yasmeen, Consider42, Wildtoast, Hassan.zamir, Khanparveen, Tabiveed5, Cyberathenaeum, Achalamunigal, Prakashjyotibharti, KasparBot, Sweepy, Mungopark, Rasomu, InternetArchiveBot, Mmaximov1986nnov, Angelamcreynolds, 123456789toooooo, Wikishovel and Anonymous: 21 • Relevance (information retrieval) Source: https://en.wikipedia.org/wiki/Relevance_(information_retrieval)?oldid=758504944 Con- tributors: Kku, Nickg, Greenrd, Metasquares, Twang, Filip nohe, Pgan002, Beland, Karol Langner, E090, Jehochman, Nihiltres, RexNL, YurikBot, Gaius Cornelius, Bbbozzz, SmackBot, Floridi~enwiki, John254, Wbuchan, VictorAnyakin, Hut 8.5, Dan Pelleg, Jodi.a.schneider, DGG, Mycroft7, ShlomoS, Jamelan, Jludwig, Dtunkelang, DragonBot, Igorberger, BirgerH, Gjnaasaa, Addbot, Nobunobu, Josevellezcal- das, Jelsas, Bddavison, Yobot, AnomieBOT, Lezhao, John of Reading, ClueBot NG, CaroleHenson, Helpful Pixie Bot, BG19bot, New worl, Bejvisek and Anonymous: 21 • Web search engine Source: https://en.wikipedia.org/wiki/Web_search_engine?oldid=764666751 Contributors: Lquilter, Haakon, Mac, Ronz, Xcohen, Tpbradbury, Chuunen Baka, ZimZalaBim, Nurg, Smb1001, Plandu, Alan Liefting, Giftlite, Chris Wood, Macrakis, Beland, James A. 
Donald, Bumm13, Oknazevad, Andreas Kaufmann, Mvuijlst, Discospinster, Bender235, ESkog, MBisanz, Bjelli, EurekaLott, Vipul, Bobo192, Smalljim, John Vandenberg, Blakkandekka, NeonLego, Elipongo, Espoo, Alansohn, Gary, Smarteralec, SnowFire, Arthena, Steele~enwiki, Wtmitchell, Velella, Geraldshields11, Tomlzz1, Bsadowski1, Versageek, Brookie, BryanStrome, Woohookitty, Waldir, Toussaint, Mandarax, Rjwilmsi, Koavf, Strait, Bruce1ee, Mitul0520, Vegaswikian, Bhadani, Yoursvivek, Gurch, Chobot, Benlisquare, DVdm, Bgwhite, Banaticus, Wavelength, Sceptre, StuffOfInterest, Phantomsteve, Stephenb, CambridgeBayWeather, Rsrikanth05, NawlinWiki, Arichnad, Porthugh, LodeRunner, Klutzy, Elkman, Fmccown, Zzuuzz, Ketsuekigata, Carlosguitar, Cmglee, SmackBot, Samdutton, Ma8thew, Hydrogen Iodide, McGeddon, Edgar181, Gilliam, Jdfoote, Ohnoitsjamie, MalafayaBot, Deli nk, Jf- samper, MercZ, A. B., Милан Јелисавчић, Frap, Kazastankas, SundarBot, Popsup, Runefurb, Makemi, Jiddisch~enwiki, Legalea- gle86, Spotworks, CristianoMacaluso, Mwtoews, DMacks, Janhoy, Wikiolap, Valfontis, Kuru, General Ization, Francis Irving, Silk- Tork, Accurizer, Bjankuloski06en~enwiki, Ckatz, 16@r, Hvn0413, Optakeover, TastyPoutine, Caiaffa, Hu12, Levineps, BranStark, Iri- descent, Plenderj, Blehfu, INkubusse, ^, JForget, Jonathan A Jones, Leevanjackson, Dgw, Alandavidson, WeggeBot, Gogo Dodo, Nick2253, DumbBOT, Headbomb, EdJohnston, Nick Number, Seaphoto, Aliweb, Lfstevens, ClassicSC, JAnDbot, Leuko, Barek, MER- C, Rothorpe, Kerotan, Freshacconci, Magioladitis, Xangis, Andropod, VoABot II, Carlwev, JNW, JamesBWatson, Think outside the box, Buettcher, Tedickey, Jatkins, Midgrid, Theroadislong, Elinruby, Hoverfish, Kgfleischmann, Thompson.matthew, DGG, DRogers, Cotton2, Poeloq, Yegg13, Ggrefen, Jfroelich, J.delanoy, ChrisfromHouston, Terrek, Athaenara, Tdadamemd, Scurless, Cpiral, Ajmint, McSly, Gurchzilla, Kmmhasan, KylieTastic, WJBscribe, Jamesontai, Janderie, Tagus, Bonadea, JavierMC, Halmstad, Inas, Idioma-bot, Ruukasu2005, VolkovBot, DSRH, Jeff G., Maghnus, Mathiaslylo, Fences and windows, Philip Trueman, TXiKiBoT, Oshwah, Zidonuke, Newtown11, Dbenford, Gihangamos, CoJaBo, Nexus501, Martin451, Jackfork, Wiae, Larklight, Enigmaman, Synthebot, CoolKid1993, Coldmachine, Vchimpanzee, Gepcsirke, Cnilep, Insanity Incarnate, LittleBenW, Thunderbird2, Logan, Biscuittin, SieBot, Account- ing4Taste, ATS, Rlendog, Aep itah, Josconklin, Dawn Bard, GoHuskies990411, Srushe, SiegeLord, Simulacrum01, Bentogoa, Happy- sailor, Flyer22 Reborn, Radon210, Oscar.nierstrasz, Edward Elric 1308, Yerpo, Steven Crossin, RW Marloe, Chansonh, PhoenixLight- Inc, UncleMartin, IdreamofJeanie, Benaya, DancingPhilosopher, Rathee, Mattmnelson, Searchmaven, HPJoker, Ggallucci, Francvs, De- marie, Doxin45, Alfons Åberg, Afnecors, ClueBot, Caffeinejolt, Professorbond, Schwarzenneger, The Thing That Should Not Be, PLA y Grande Covián, MIDI, Ndenison, Unbuttered Parsnip, Saddhiyama, Drmies, VQuakr, SuperHamster, TarzanASG, Trivialist, Shantu123, Puchiko, Accl.news, Ray3055, K4m1y4, Excirial, Anvilmedia, Resoru, Rhododendrites, Sonicdrewdriver, NuclearWarfare, ClashThe- Bunny, Aseld, Titustimuli, Foogus, Rui Gabriel Correia, 7, Qwfp, Johnuniq, Apparition11, SF007, Classicrockfan42, DumZiBoT, Tem- plarion, Crazy Boris with a red beard, Brethvoice, XLinkBot, Boyd Reimer, PseudoOne, Pnm123, Pgallert, Avoided, Jingle bigballs, Drmadskills, Badgernet, Another-sailor, SDSandecki, Wmartin08, Rajesh.patchala, DOI bot, Tcncv, Fyrael, Captain-tucker, 123b, 123c, 123f, Kiranoush, 
Skapoor007, Ronhjones, Cut Bravo, Cst17, Wikipedian314, MrOllie, Chamal N, Jreconomy, Foreigner82, Favonian, West.andrew.g, 84user, Ehrenkater, Apteva, Nurasko, Teles, Gail, Capone7722, Ben Ben, Gemirates, Yobot, WikiDan61, Fraggle81, ZeeknayTzfat, DisillusionedBitterAndKnackered, Steve.bassey, Mbelaunde, Jose Gervasio, JeanCaffou, Bugnot, Amrikbhat, Anspar, Manwichosu, AnomieBOT, Mhha, Jim1138, Dwayne, Piano non troppo, BIGGOOGIES, ChristopheS, Ddemetrios5, Materialscientist, Kc03, Loderuner, Citation bot, Srinivas, Arctic Fox, ArthurBot, Ambassador29, MauritsBot, Xqbot, StuffyProf, Vuongvina, Sman24, Jozef.kutej, Capricorn42, Nasnema, Regisbates, George.boeck, Connorthecat, Maximus2000, Hi878, Ceramic catfish, JanDeFietser, Wizardist, Mark Schierbecker, Seeleschneider, Aragor~enwiki, Dan6hell66, Nenya17, PakRise, Prari, FrescoBot, Kiransarv99, Kjpocon- nor, Credibly Witless, X7q, Rosariomorgan, Mangaman27, Searchman2, Nainawalli, Bebo77, Haeinous, Tegel, Llamafirst, Semio7, Car- tel7, Hillwilliam6, DivineAlpha, Jakesyl, Citation bot 1, Biker Biker, Pinethicket, I dream of horses, Epipkin, Vicenarian, HRoestBot, 10metreh, MJ94, Skapoor 92, Nadeem12345, Xiaoshuang, RedBot, Blogger11, XDaniX, Serols, Fixer88, Ltkmerlini, Beao, Crows1985, Bloxxy, Brat22, Cnwilliams, Mlo0352, Nubicsearch, FoxBot, Mjs1991, ConcernedVancouverite, HFadeel, TobeBot, Wotnow, Dgiul, GrantGD, Xlxfjh, Heavyweight Gamer, Lotje, Vancouver Outlaw, Ginadavis, Aoidh, Fzamith, Richard31415, K-ray913, David Hedlund, Reaper Eternal, Kendalfong, Ddloe, Gregman2, Luv len, Xin0427, Suzukiboy04, Yamaha07, Dillonpg1, G-Yenn123, MoeenKhurshid, Moeenkhurshids, Inetmonster, Codename.venice, Likmo123, Thinktdub, RazorXX8, Rz1115, Sharon08tam, Hkreiger, Lsolan, Xmark- manx, Nono-1966, Mooglesearch, Mstrehlke, Ooyyo, Onel5969, Mean as custard, Cac united, Qnxkuba, Searchprochina, Kvasilev, Ajkovacs, Indian2493, चंद्रकांत धुतडमल, Rollins83, DASHBot, EmausBot, Meemore, Thomas humphrey12, Akjar13, Dewritech, Racerx11, GoingBatty, RA0808, Pincerr, Dem1995, AlanS1951, Moswento, Tommy2010, Dcirovic, Entalpia2, Thecheesykid, InfoS- ources, Shuipzv3, Elandy2009, Mnhweb, Pickuptha'Musket, Uniltìranyu, HarleyULTRArider, Friendocity, Jimbo16454, Appledandy, 124 CHAPTER 24. WEB SEARCH ENGINE

Erianna, Lokpest, W163, Sfoske70, Mulva111, Schnoatbrax, Champion, Kapil.xerox, Gsarwa, Ajit garga, Nlyte.Software, Nimoegra, Orange Suede Sofa, Gaganmasoun, Fronier, LScriv, Mapelpark, Cmcardle720, Llightex, DASHBotAV, Rakeitin, Danieltabak, Sllim jon, Calvinklein911, ClueBot NG, Angeld89, Hashim2010, ES IRM, Patience2, Backtous2012, Hagreyman, MelbourneStar, Achugg, Satellizer, Cirsam, Griffbo, Willonthemove, Loginnigol, Grablev~enwiki, Tch5416339, Lostzenfound, Dhua315, Lesley.latham, Frietjes, Ty27rv, Miladz7560, S2009qw, Riveravaldez, Widr, Antiqueight, Jim the Techie, Coolaij, Nowo11, Helpful Pixie Bot, Alemafut, Rozbif, Saha zapaat, Thistrackted, BG19bot, 321ylzzirg, Ltcoconut88, Jackr1909, Ccpedia, Northamerica1000, Luriflax, Outlinekiller, Red- dogsix, MusikAnimal, AvocatoBot, RikkiAaron, Compfreak7, Dephnit45, Cncmaster, CitationCleanerBot, A0tv23, Crh23, Harizotoh9, Rachell36, United States Man, FrankyFrank101, Robertnettleton, Sandbergja, Bigbluebeaver, Klilidiplomus, Kanggotan, Gustavo Destro, Dubleeble, Williamjhonson45, Esv123, Aneesprince, Benhall2121, ChrisGualtieri, Roggerladislau, Cklein1209, Barnabas321, Pepdeal, Quant18, Coolblue75759, Scofield190, Misty fungus, L;kasd;fweiotr4, Mogism, 331dot, Thokara, NOnash61, MiguelAraujoS, Number- maniac, Lugia2453, SFK2, Jc86035, Chanuka25, Awp9633, VanishedUser 2313214sad1, Dave Braunschweig, Purbitaditecha, Krushialk, Zalunardo8, Tcarnes2, Taniki122, SEVAGIRI REAL ESTATE CONSULTANTS, Ajit.tripathy14, Ggwine, Antar Fathy Antar Amer, Asrosen, Kolophon, JohnJohnson5, Cphisher, E1510sf, Thevideodrome, Roma shah, Darkesthoursoflife, MDavid.me, Agasthya12345, Lornefade, Rekowo, Menelaosc, Crispit, Sharma Hrishi, 9k7kq3, J3ts9ij, Majid661, Pilgrimnet, JaconaFrere, Skr15081997, John D’silva, Dbsseven, Concord hioz, Monkbot, Likhary, Thecorbaman, Klaaskizito, Perpetualuche, Kaytav, Alvandria, Sights40, Vineydhiman, Lucky1620, Kinetic37, Kentthegreats, Lilred234, Employerspain15, Pradeeprv123, Wikiinfosub, Azirann, Evolutionvisions, Elloge6, P2g4k, Jakebohall, Renjithrajeevvk, Asdklf;, Buzzdarkmatter, Elimash, KH-1, Broweuli, Ryanopoku123, Royt75, Krishnachaitan, Jack- pison, Jozefsanders, Bullwinkle2003, StudiousStudent, Areeshanoor2020, Eml.web.search, Pixillated, BlueFire25, Sdxu, Chrislolololol- lolol, Tony y stark, Anwenparrott, Megalegit, Mp3wallet, Broido, Maryfrench, Stillalivelong, 2911ashish, Sves lab, Newwikieditor678, Elenctic, Eric0928, Anil bhatiwal, Bookaccount, Nksp20z, CAPTAIN RAJU, Feminist, Kts sports cars, Alarbash, Samira zeynali, Be- lajarhebat, Thareshkum, Pinny house, Nvmemory, Micmactictac, StraboVarenius, Hipkik, Researcher9999, RSR CABS, Serversulti- .log, Seo Doktorum, Expert Computers, Kavitha reddy b, Pizzarollsxx, Vivekavardhanou, Ibrahim loknathpur, Cr7 wikitech, Omni Flames, Shashibharanger, Gopykamrai, Sinamalleki, Damar Ramadhan, Thomas , John “Hannibal” Smith, Doctormukisa, Seosefi, Sagarahmed0172, StrayKitty29, Hedayatyazdani, Buyshop corp, Bender the Bot, Vlady000, MoshiKun, Richard614, Imminent77, Sel- vankalai, Jrjohn2012, John smith web, Vinay7737, Abhijeet saawant, Stikkyy, Florarlk, Glixx express, Zingyi512, Raybrighton2016, Arhajati, Joe Wiz, Jeos149, Ravigupta.winworld and Anonymous: 835

24.12.2 Images

• File:Ambox_important.svg Source: https://upload.wikimedia.org/wikipedia/commons/b/b4/Ambox_important.svg License: Public domain Contributors: Own work, based off of Image:Ambox scales.svg Original artist: Dsmurat (talk · contribs)
• File:Commons-logo.svg Source: https://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg License: PD Contributors: ? Original artist: ?
• File:Conventions.jpg Source: https://upload.wikimedia.org/wikipedia/commons/9/9e/Conventions.jpg License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:Dg-new-1.jpg Source: https://upload.wikimedia.org/wikipedia/commons/3/39/Dg-new-1.jpg License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:Dg-new-2.jpg Source: https://upload.wikimedia.org/wikipedia/commons/6/6d/Dg-new-2.jpg License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:Edit-clear.svg Source: https://upload.wikimedia.org/wikipedia/en/f/f2/Edit-clear.svg License: Public domain Contributors: The Tango! Desktop Project. Original artist: The people from the Tango! project. And according to the meta-data in the file, specifically: "Andreas Nilsson, and Jakub Steiner (although minimally)."
• File:Emoji_u1f4bb.svg Source: https://upload.wikimedia.org/wikipedia/commons/d/d7/Emoji_u1f4bb.svg License: Apache License 2.0 Contributors: https://code.google.com/p/noto/ Original artist: Google
• File:Information-Retrieval-Models.png Source: https://upload.wikimedia.org/wikipedia/commons/c/c3/Information-Retrieval-Models.png License: CC-BY-SA-3.0 Contributors: ? Original artist: ?
• File:Johnhasfinishedthework-1.jpg Source: https://upload.wikimedia.org/wikipedia/commons/e/e7/Johnhasfinishedthework-1.jpg License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:LampFlowchart.svg Source: https://upload.wikimedia.org/wikipedia/commons/9/91/LampFlowchart.svg License: CC-BY-SA-3.0 Contributors: vector version of Image:LampFlowchart.png Original artist: svg by Booyabazooka

• File:Library-logo.svg Source: https://upload.wikimedia.org/wikipedia/commons/5/53/Library-logo.svg License: CC0 Contributors: Own work Original artist: Mononomic
• File:Library_of_Ashurbanipal_synonym_list_tablet.jpg Source: https://upload.wikimedia.org/wikipedia/commons/6/64/Library_of_Ashurbanipal_synonym_list_tablet.jpg License: CC BY-SA 3.0 Contributors: Fæ (Own work) Original artist: ?
• File:Linguistics_stub.svg Source: https://upload.wikimedia.org/wikipedia/commons/d/dc/Linguistics_stub.svg License: Public domain Contributors: ? Original artist: ?
• File:Lock-green.svg Source: https://upload.wikimedia.org/wikipedia/commons/6/65/Lock-green.svg License: CC0 Contributors: en:File:Free-to-read_lock_75.svg Original artist: User:Trappist the monk
• File:Mayflower_Wikimedia_Commons_image_search_engine_screenshot.png Source: https://upload.wikimedia.org/wikipedia/commons/b/ba/Mayflower_Wikimedia_Commons_image_search_engine_screenshot.png License: GPL Contributors: Screenshot of a search for lunar eclipse. Original artist: Mayflower was written by User:Tangotango.
• File:Merge-arrow.svg Source: https://upload.wikimedia.org/wikipedia/commons/a/aa/Merge-arrow.svg License: Public domain Contributors: ? Original artist: ?

• File:Mophological_dependencies_1.png Source: https://upload.wikimedia.org/wikipedia/commons/6/6b/Mophological_dependencies_1.png License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:Morphological_dependencies_2'.png Source: https://upload.wikimedia.org/wikipedia/commons/e/e5/Morphological_dependencies_2%27.png License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:Parse2.jpg Source: https://upload.wikimedia.org/wikipedia/commons/8/8c/Parse2.jpg License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:Parse_tree_1.jpg Source: https://upload.wikimedia.org/wikipedia/commons/5/54/Parse_tree_1.jpg License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:Parser_Flowո.gif Source: https://upload.wikimedia.org/wikipedia/commons/d/d6/Parser_Flow%D5%B8.gif License: Public domain Contributors: Aho, Sethi, Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1986. ISBN 0-201-10088-6 Original artist: DevinCook at English Wikipedia
• File:Prosodic_dependencies'.png Source: https://upload.wikimedia.org/wikipedia/commons/5/55/Prosodic_dependencies%27.png License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:Question_book-new.svg Source: https://upload.wikimedia.org/wikipedia/en/9/99/Question_book-new.svg License: Cc-by-sa-3.0 Contributors: Created from scratch in Adobe Illustrator. Based on Image:Question book.png created by User:Equazcion Original artist: Tkgd2007
• File:Quranic-arabic-corpus.png Source: https://upload.wikimedia.org/wikipedia/commons/5/5c/Quranic-arabic-corpus.png License: CC BY 3.0 Contributors: Own work Original artist: Arabismo
• File:Relevance.jpg Source: https://upload.wikimedia.org/wikipedia/commons/7/77/Relevance.jpg License: CC BY-SA 3.0 Contributors: Own work Original artist: GinsuText
• File:Semantic_dependencies.png Source: https://upload.wikimedia.org/wikipedia/commons/a/ac/Semantic_dependencies.png License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:Split-arrows.svg Source: https://upload.wikimedia.org/wikipedia/commons/a/a7/Split-arrows.svg License: Public domain Contributors: ? Original artist: ?
• File:Syntactic_functions_1.png Source: https://upload.wikimedia.org/wikipedia/commons/c/c3/Syntactic_functions_1.png License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:Text_document_with_red_question_mark.svg Source: https://upload.wikimedia.org/wikipedia/commons/a/a4/Text_document_with_red_question_mark.svg License: Public domain Contributors: Created by bdesham with Inkscape; based upon Text-x-generic.svg from the Tango project. Original artist: Benjamin D. Esham (bdesham)
• File:Theykilledthemanwithagun-1b.jpg Source: https://upload.wikimedia.org/wikipedia/commons/7/74/Theykilledthemanwithagun-1b.jpg License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:Thistreeisillustratingtherelation(PSG).png Source: https://upload.wikimedia.org/wikipedia/commons/8/8e/Thistreeisillustratingtherelation%28PSG%29.png License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:Vector_space_model.jpg Source: https://upload.wikimedia.org/wikipedia/commons/f/ff/Vector_space_model.jpg License: CC BY 3.0 Contributors: Own work Original artist: Riclas
• File:Wearetryingtounderstandthedifference_(2).jpg Source: https://upload.wikimedia.org/wikipedia/commons/0/0d/Wearetryingtounderstandthedifference_%282%29.jpg License: CC BY-SA 3.0 Contributors: Own work Original artist: Tjo3ya
• File:WebCrawlerArchitecture.svg Source: https://upload.wikimedia.org/wikipedia/commons/d/df/WebCrawlerArchitecture.svg License: CC-BY-SA-3.0 Contributors: self-made, based on image from PhD. Thesis of Carlos Castillo, image released to public domain by the original author. Original artist: Vector version by dnet based on image by User:ChaTo
• File:Wikiquote-logo.svg Source: https://upload.wikimedia.org/wikipedia/commons/f/fa/Wikiquote-logo.svg License: Public domain Contributors: Own work Original artist: Rei-artur
• File:Wikiversity-logo.svg Source: https://upload.wikimedia.org/wikipedia/commons/9/91/Wikiversity-logo.svg License: CC BY-SA 3.0 Contributors: Snorky (optimized and cleaned up by verdy_p) Original artist: Snorky (optimized and cleaned up by verdy_p)
• File:Wiktionary-logo-v2.svg Source: https://upload.wikimedia.org/wikipedia/commons/0/06/Wiktionary-logo-v2.svg License: CC BY-SA 4.0 Contributors: Own work Original artist: Dan Polansky based on work currently attributed to Wikimedia Foundation but originally created by Smurrayinchester

24.12.3 Content license

• Creative Commons Attribution-Share Alike 3.0