Chapter 1

Week01

1.1 Natural language processing

This article is about language processing by computers. For the processing of language by the human brain, see Language processing in the brain.

Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input; others involve natural language generation.

1.1.1 History

Main article: History of natural language processing

The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the “patient” exceeded its very small knowledge base, ELIZA might provide a generic response, for example, responding to “My head hurts” with “Why do you say your head hurts?".

During the 1970s many programmers began to write “conceptual ontologies”, which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky. Up to the 1980s, most NLP systems were based on complex sets of hand-written rules.
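ELIZA's pattern-to-response behavior can be sketched in a few lines. The rules and fallback reply below are illustrative inventions for this sketch, not Weizenbaum's original script:

```python
import re

# A minimal ELIZA-style responder: each rule pairs a regex with a response
# template; matched text is echoed back inside the reply.
RULES = [
    (re.compile(r"my (.+) hurts", re.IGNORECASE), "Why do you say your {0} hurts?"),
    (re.compile(r"i am (.+)", re.IGNORECASE), "How long have you been {0}?"),
]

GENERIC = "Please tell me more."  # fallback when no rule matches

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return GENERIC
```

With these two rules, "My head hurts" produces the echoed reply from the article, and any unmatched input falls through to the generic response, mirroring ELIZA's behavior outside its knowledge base.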
Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power (see Moore’s Law) and to the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3]


An automated online assistant providing customer service on a web page, an example of an application where natural language processing is a major component.[1]

Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. However, part-of-speech tagging introduced the use of hidden Markov models to NLP, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.

Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.

Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or using a combination of annotated and non-annotated data.
Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.

1.1.2 Using machine learning

Modern NLP algorithms are based on machine learning, especially statistical machine learning. The paradigm of machine learning is different from that of most prior attempts at language processing. Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules. The machine-learning paradigm calls instead for using general learning algorithms — often, although not always, grounded in statistical inference — to automatically learn such rules through the analysis of large corpora of typical real-world examples. A corpus (plural, “corpora”) is a set of documents (or sometimes, individual sentences) that have been hand-annotated with the correct values to be learned.

Many different classes of machine learning algorithms have been applied to NLP tasks. These algorithms take as input a large set of “features” that are generated from the input data. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.

Systems based on machine-learning algorithms have many advantages over hand-produced rules:

• The learning procedures used during machine learning automatically focus on the most common cases, whereas when writing rules by hand it is often not at all obvious where the effort should be directed.

• Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted). Generally, handling such input gracefully with hand-written rules — or more generally, creating systems of hand-written rules that make soft decisions — is extremely difficult, error-prone and time-consuming.

• Systems based on automatically learning the rules can be made more accurate simply by supplying more in- put data. However, systems based on hand-written rules can only be made more accurate by increasing the complexity of the rules, which is a much more difficult task. In particular, there is a limit to the complexity of systems based on hand-crafted rules, beyond which the systems become more and more unmanageable. However, creating more data to input to machine-learning systems simply requires a corresponding increase in the number of man-hours worked, generally without significant increases in the complexity of the annotation process.
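The “soft, probabilistic decisions” described above can be illustrated at toy scale: each candidate tag receives a real-valued score from feature weights, and the scores are normalized into probabilities. The features, tags, and weights below are invented for illustration, not learned from any corpus:

```python
import math

# Toy statistical model: sum real-valued feature weights per candidate tag,
# then normalize the scores into a probability distribution (softmax).
WEIGHTS = {
    ("ends_with_s", "VERB"): 0.9,
    ("ends_with_s", "NOUN"): 1.4,
    ("follows_the", "NOUN"): 2.0,
    ("follows_the", "VERB"): -1.0,
}

def tag_probabilities(features, tags=("NOUN", "VERB")):
    scores = {t: sum(WEIGHTS.get((f, t), 0.0) for f in features) for t in tags}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}

# A word ending in "s" that follows "the" is soft-classified, not hard-ruled:
probs = tag_probabilities(["ends_with_s", "follows_the"])
```

Unlike a hard if-then rule, the model reports the relative certainty of every candidate answer, which is exactly what lets a downstream component weigh alternatives.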

The subfield of NLP devoted to learning approaches is known as natural language learning (NLL), and its conference CoNLL[4] and peak body SIGNLL[5] are sponsored by ACL, recognizing also their links with computational linguistics and language acquisition. When the aim of computational language learning research is to understand more about human language acquisition, or psycholinguistics, NLL overlaps into the related field of computational psycholinguistics.

1.1.3 Major tasks

The following is a list of some of the most commonly researched tasks in NLP. Note that some of these tasks have direct real-world applications, while others more commonly serve as subtasks that are used to aid in solving larger tasks. What distinguishes these tasks from other potential and actual NLP tasks is not only the volume of research devoted to them but the fact that for each one there is typically a well-defined problem setting, a standard metric for evaluating the task, standard corpora on which the task can be evaluated, and competitions devoted to the specific task.

Automatic summarization Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.

Coreference resolution Given a sentence or larger chunk of text, determine which words (“mentions”) refer to the same objects (“entities”). Anaphora resolution is a specific example of this task, and is specifically concerned with matching up pronouns with the nouns or names to which they refer. The more general task of coreference resolution also includes identifying so-called “bridging relationships” involving referring expressions. For example, in a sentence such as “He entered John’s house through the front door”, “the front door” is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John’s house (rather than of some other structure that might also be referred to).

Discourse analysis This rubric includes a number of related tasks. One task is identifying the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g. yes-no question, content question, statement, assertion, etc.).

Machine translation Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) in order to solve properly.

Morphological segmentation Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e. the structure of words) of the language being considered.
English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. “open, opens, opened, opening”) as separate words. In languages such as Turkish or Meitei,[6] a highly agglutinated Indian language, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms.

Named entity recognition (NER) Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Note that, although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they refer to names, and French and Spanish do not capitalize names that serve as adjectives.

Natural language generation Convert information from computer databases or semantic intents into readable human language.

Natural language understanding Convert chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate. Natural language understanding involves the identification of the intended semantics from the multiple possible semantics that can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts.
The introduction and creation of a language metamodel and ontology are efficient, however empirical, solutions. An explicit formalization of natural language semantics, without confusion with implicit assumptions such as the closed-world assumption (CWA) vs. the open-world assumption, or subjective Yes/No vs. objective True/False, is expected for the construction of a basis of semantics formalization.[7]

Optical character recognition (OCR) Given an image representing printed text, determine the corresponding text.

Part-of-speech tagging Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, “book” can be a noun (“the book on the table”) or verb (“to book a flight”); “set” can be a noun, verb or adjective; and “out” can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is prone to such ambiguity because it is a tonal language during verbalization, and such inflection is not readily conveyed via the entities employed within the orthography to convey the intended meaning.
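A common baseline for part-of-speech tagging simply assigns each word the tag it carries most often in an annotated corpus. The counts below are invented stand-ins for corpus tallies, not real frequencies:

```python
# Toy "most frequent tag" baseline for POS tagging. Per-word tag counts
# here are invented; a real system would tally them from a tagged corpus.
TAG_COUNTS = {
    "book": {"NOUN": 41, "VERB": 7},
    "set": {"NOUN": 20, "VERB": 25, "ADJ": 5},
    "the": {"DET": 100},
}

def baseline_tag(word: str) -> str:
    counts = TAG_COUNTS.get(word.lower())
    if counts is None:
        return "NOUN"  # back off to the most common open-class tag
    return max(counts, key=counts.get)
```

Such a baseline resolves the "book"/"set" ambiguity described above in favor of the most frequent use, ignoring context entirely; contextual models (e.g. hidden Markov models) improve on exactly this weakness.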

Parsing Determine the parse tree (grammatical analysis) of a given sentence. The grammar for natural languages is ambiguous and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human).
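One way to see how fast the number of analyses grows: the number of distinct binary-branching bracketings of a sentence follows the Catalan numbers. Note this counts all bracketings rather than only those licensed by a grammar, so it is an upper bound, not a parser:

```python
from math import comb

def catalan(n: int) -> int:
    """Number of distinct binary-branching bracketings over n+1 leaves."""
    return comb(2 * n, n) // (n + 1)

# A 12-word sentence already admits tens of thousands of bracketings.
bracketings = catalan(11)
```

This combinatorial explosion is why practical parsers rely on dynamic programming and on probabilistic scoring to rank the analyses rather than enumerating them.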

Question answering Given a human-language question, determine its answer. Typical questions have a specific right answer (such as “What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as “What is the meaning of life?"). Recent works have looked at even more complex questions.[8]

Relationship extraction Given a chunk of text, identify the relationships among named entities (e.g. who is married to whom).

Sentence breaking (also known as sentence boundary disambiguation) Given a chunk of text, find the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g. marking abbreviations).
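A naive sentence breaker built on this observation might split on sentence-final punctuation while protecting a small list of abbreviations. The abbreviation list below is a toy; real systems use much larger lists and additional heuristics:

```python
import re

# Naive sentence breaker: split on ., ! or ? followed by whitespace and an
# uppercase letter, unless the preceding token is a known abbreviation.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text: str):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        tokens = candidate.split()
        if tokens and tokens[-1] in ABBREVIATIONS:
            continue  # the period marks an abbreviation, not a boundary
        sentences.append(candidate)
        start = match.end()
    sentences.append(text[start:].strip())
    return [s for s in sentences if s]
```

On "Dr. Smith arrived. He sat down." the period after "Dr." is correctly skipped, illustrating why sentence breaking is a disambiguation task rather than a simple split.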

Sentiment analysis Extract subjective information, usually from a set of documents, often using online reviews to determine “polarity” about specific objects. It is especially useful for identifying trends of public opinion in social media, for the purpose of marketing.
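At its simplest, polarity can be estimated by summing word-level scores from a sentiment lexicon. The lexicon below is a small invented sample; real systems also handle negation, intensifiers, and domain-specific vocabulary:

```python
# Bare-bones lexicon-based polarity scorer: sum per-word weights and
# report the sign of the total. Word weights are invented for this sketch.
POLARITY = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

def polarity(text: str) -> str:
    score = sum(POLARITY.get(w.strip(".,!?").lower(), 0) for w in text.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```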

Speech recognition Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text to speech and is one of the extremely difficult problems colloquially termed "AI-complete" (see above). In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition (see below). Note also that in most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process.

Speech segmentation Given a sound clip of a person or people speaking, separate it into words. A subtask of speech recognition and typically grouped with it.

Topic segmentation and recognition Given a chunk of text, separate it into segments each of which is devoted to a topic, and identify the topic of the segment.

Word segmentation Separate a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries in such a fashion, and in those languages word segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language.
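A classic baseline for unspaced text is greedy longest-match ("maximal munch") segmentation against a dictionary. The toy dictionary below stands in for a real lexicon, and greedy matching can mis-segment, which is why it is only a baseline:

```python
# Greedy longest-match segmentation: at each position, take the longest
# dictionary word that matches; fall back to a single character.
DICTIONARY = {"the", "them", "theme", "me", "men", "now", "here", "nowhere"}

def segment(text: str, max_word_len: int = 7):
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in DICTIONARY or length == 1:
                words.append(candidate)
                i += length
                break
    return words
```

Statistical segmenters replace the greedy choice with a search over all splits scored by word probabilities, which is where the vocabulary and morphology knowledge mentioned above comes in.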

Word sense disambiguation Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet.
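A simplified Lesk-style disambiguator picks the sense whose gloss shares the most words with the surrounding context. The two glosses for "bank" below are paraphrases written for this sketch, not entries taken from WordNet:

```python
# Simplified Lesk algorithm: score each sense by the overlap between its
# gloss and the context words, and return the best-scoring sense.
SENSES = {
    "bank": {
        "financial": "an institution that accepts deposits and lends money",
        "river": "sloping land beside a body of water",
    }
}

def disambiguate(word: str, context: str) -> str:
    context_words = set(context.lower().split())
    best, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & set(gloss.split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best
```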

In some cases, sets of related tasks are grouped into subfields of NLP that are often considered separately from NLP as a whole. Examples include:

Information retrieval (IR) This is concerned with storing, searching and retrieving information. It is a separate field within computer science (closer to databases), but IR relies on some NLP methods (for example, stemming). Some current research and applications seek to bridge the gap between IR and NLP.

Information extraction (IE) This is concerned in general with the extraction of semantic information from text. This covers tasks such as named entity recognition, Coreference resolution, relationship extraction, etc.

Speech processing This covers speech recognition, text-to-speech and related tasks.

Other tasks include:

• Native-language identification

• Stemming

• Text simplification

• Text-to-speech

• Text-proofing

• Natural language search

• Query expansion

1.1.4 Statistical NLP

Main article: Stochastic grammar

Statistical natural-language processing uses stochastic, probabilistic, and statistical methods to resolve some of the difficulties discussed above, especially those which arise because longer sentences are highly ambiguous when processed with realistic grammars, yielding thousands or millions of possible analyses. Methods for disambiguation often involve the use of corpora and Markov models. The ESPRIT Project P26 (1984–1988), led by CSELT, explored the problem of speech recognition, comparing knowledge-based approaches and statistical ones; the chosen result was a completely statistical model.[9] One of the first models of statistical natural language understanding was introduced in 1991 by Roberto Pieraccini, Esther Levin, and Chin-Hui Lee from Bell Laboratories.[10] Statistical NLP comprises all quantitative approaches to automated language processing, including probabilistic modeling, information theory, and linear algebra.[11] The technology for statistical NLP comes mainly from machine learning and data mining, both of which are fields of artificial intelligence that involve learning from data.
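The Markov models mentioned above can be illustrated with a miniature bigram language model using add-one (Laplace) smoothing. The two-sentence "corpus" is, of course, only for illustration:

```python
from collections import Counter

# Toy first-order Markov (bigram) language model with add-one smoothing.
corpus = ["<s> the dog barks </s>", "<s> the cat sleeps </s>"]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens[:-1])           # counts of bigram contexts
    bigrams.update(zip(tokens, tokens[1:]))  # counts of adjacent pairs

vocab = {w for s in corpus for w in s.split()}

def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev) with add-one smoothing over the vocabulary."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))
```

Even at this scale the model makes the characteristic soft decision: frequent transitions get higher probability, and unseen transitions get small but nonzero probability thanks to smoothing.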

1.1.5 Evaluation

Objectives

The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system, in order to determine whether the algorithm answers the goals of its designers, or whether the system meets the needs of its users. Research in NLP evaluation has received considerable attention, because the definition of proper evaluation criteria is one way to specify precisely an NLP problem. The metric of NLP evaluation on an algorithmic system allows for the integration of language understanding and language generation. A precise set of evaluation criteria, which includes mainly evaluation data and evaluation metrics, can enable several teams to compare their solutions to a given NLP problem.

Timeline of evaluation

• In 1983, start of the ESPRIT P26 project, which evaluated speech technologies (including general topics such as syntactic and semantic analysis, etc.), comparing rule-based and statistical approaches.[12]

• In 1987, the first evaluation campaign on written texts seems to be a campaign dedicated to message understanding (Pallet 1998).

• The Parseval/GEIG project compared phrase-structure grammars (Black 1991).

• There was a series of campaigns within the Tipster project on tasks like summarization, translation, and searching (Hirschman 1998).

• In 1994, in Germany, the Morpholympics compared German morphological taggers.

• The Senseval & Romanseval campaigns were conducted with the objectives of semantic disambiguation.

• In 1996, the Sparkle campaign compared syntactic parsers in four different languages (English, French, German and Italian).

• In France, the Grace project compared a set of 21 taggers for French in 1997 (Adda 1999).

• In 2004, during the Technolangue/Easy project, 13 parsers for French were compared.

• Large-scale evaluations of dependency parsers were performed in the context of the CoNLL shared tasks in 2006 and 2007.

• In France, within the ANR-Passage project (end of 2007), 10 parsers for French were compared – passage web site.

• In Italy, the EVALITA campaign was conducted in 2007,[13] 2009, 2011, and 2014[14] to compare various NLP and speech tools for Italian – EVALITA web site.

Different types of evaluation

Depending on the evaluation procedures, a number of distinctions are traditionally made in NLP evaluation.

Intrinsic v. extrinsic evaluation Intrinsic evaluation considers an isolated NLP system and characterizes its performance with respect to a gold standard result as defined by the evaluators. Extrinsic evaluation, also called evaluation in use, considers the NLP system in a more complex setting as either an embedded system or a precise function for a human user. The extrinsic performance of the system is then characterized in terms of utility with respect to the overall task of the extraneous system or the human user. For example, consider a syntactic parser which is based on the output of some part of speech (POS) tagger. An intrinsic evaluation would run the POS tagger on structured data, and compare the system output of the POS tagger to the gold standard output. An extrinsic evaluation would run the parser with some other POS tagger, and then with the novel POS tagger, and compare the parsing accuracy.
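Intrinsic evaluation of, say, a POS tagger reduces to comparing its output against gold-standard tags. The tag sequences below are invented for illustration:

```python
# Intrinsic evaluation as tag-level accuracy against a gold standard.
def accuracy(predicted, gold):
    assert len(predicted) == len(gold), "sequences must be aligned"
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
predicted = ["DET", "NOUN", "VERB", "DET", "VERB"]
score = accuracy(predicted, gold)  # 4 of 5 tags correct
```

An extrinsic evaluation would instead leave the tagger's output unscored and measure the change in downstream parsing accuracy, as described above.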

Black-box v. glass-box evaluation Black-box evaluation requires someone to run an NLP system on a sample data set and to measure a number of parameters related to the quality of the process (such as speed, reliability, resource consumption) and, most importantly, the quality of the result (such as the accuracy of data annotation or the fidelity of a translation). Glass-box evaluation looks at the design of the system, the algorithms that are implemented, and the linguistic resources it uses, like vocabulary size or expression set cardinality. Given the complexity of NLP problems, it is often difficult to predict performance only on the basis of glass-box evaluation; but this type of evaluation is more informative with respect to error analysis or future developments of a system.

Automatic v. manual evaluation In many cases, automatic procedures can be defined to evaluate an NLP system by comparing its output with the gold standard one. Although the cost of producing the gold standard can be quite high, automatic evaluation on the same input data can be repeated as often as needed without inordinate additional costs. However, for many NLP problems the precise definition of a gold standard is a complex task, and it can prove impossible when inter-annotator agreement is insufficient. Manual evaluation is performed by human judges instructed to estimate the quality of a system, or most often of a sample of its output, based on a number of criteria. Although, thanks to their linguistic competence, human judges can be considered the reference for a number of language processing tasks, there is considerable variation across their ratings. That is why automatic evaluation is sometimes referred to as objective evaluation, while human evaluation is subjective.

1.1.6 Standardization

An ISO subcommittee is working to ease interoperability between lexical resources and NLP programs. The subcommittee is part of ISO/TC37 and is called ISO/TC37/SC4. Some ISO standards are already published, but most of them are under construction, mainly on lexicon representation (see LMF), annotation, and data category registry.

1.1.7 See also

• Biomedical term processing

• Computer-assisted reviewing

• Controlled natural language

• Deep linguistic processing

• Foreign language reading aid

• Foreign language writing aid

• Language technology

• Latent Dirichlet allocation (LDA)

• Latent semantic indexing

• List of natural language processing toolkits

• LRE Map

• Natural language programming

• Reification (linguistics)

• Semantic folding

• Spoken dialogue system

• Thought vector

• Transderivational search

1.1.8 References

[1] Implementing an online help desk system based on conversational agent Authors: Alisa Kongthon, Chatchawal Sang- keettrakarn, Sarawoot Kongyoung and Choochart Haruechaiyasak. Published by ACM 2009 Article, Bibliometrics Data Bibliometrics. Published in: Proceeding, MEDES '09 Proceedings of the International Conference on Management of Emergent Digital EcoSystems, ACM New York, NY, USA. ISBN 978-1-60558-829-2, doi:10.1145/1643823.1643908

[2] Hutchins, J. (2005). “The history of machine translation in a nutshell”.

[3] Chomskyan linguistics encourages the investigation of "corner cases" that stress the limits of its theoretical models (compa- rable to pathological phenomena in mathematics), typically created using thought experiments, rather than the systematic investigation of typical phenomena that occur in real-world data, as is the case in corpus linguistics. The creation and use of such corpora of real-world data is a fundamental part of machine-learning algorithms for NLP. In addition, theoreti- cal underpinnings of Chomskyan linguistics such as the so-called "poverty of the stimulus" argument entail that general learning algorithms, as are typically used in machine learning, cannot be successful in language processing. As a result, the Chomskyan paradigm discouraged the application of such models to language processing.

[4] CoNLL

[5] SIGNLL

[6] Kishorjit, N., Vidya Raj RK., Nirmal Y., and Sivaji B. (2012) "Manipuri Morpheme Identification", Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP),pages 95–108, COLING 2012, Mumbai, December 2012

[7] Yucong Duan, Christophe Cruz (2011), Formalizing Semantic of Natural Language through Conceptualization from Existence. International Journal of Innovation, Management and Technology (2011) 2 (1), pp. 37–42.

[8] "Versatile systems: seeing in synthesis", Mittal et al., IJIIDS, 5(2), 119–142, 2011.

[9] Ciaramella, Alberto, Giancarlo Pirani, and Claudio Rullent. “Conclusions and Future Developments.” Advanced Algorithms and Architectures for Speech Understanding. Springer Berlin Heidelberg, 1990. 265–274.

[10] Roberto Pieraccini, Esther Levin, Chin-Hui Lee: Stochastic representation of conceptual structure in the ATIS task, Proc. Fourth Joint DARPA Speech and Natural Lang. Workshop, Pacific Grove, CA, Feb. 1991.

[11] Christopher D. Manning, Hinrich Schütze: Foundations of Statistical Natural Language Processing, MIT Press (1999), ISBN 978-0-262-13360-9, p. xxxi.

[12] Pirani, Giancarlo, ed. Advanced Algorithms and Architectures for Speech Understanding. Vol. 1. Springer Science & Business Media, 2013.

[13] Magnini, B., Cappelli, A., Tamburini, F., Bosco, C., Mazzei, A., Lombardo, V., Bertagna, F., Toral, A., Bartalesi Lenzi, V., Sprugnoli, R. & Speranza, M. (2008, May). Evaluation of Natural Language Tools for Italian: EVALITA 2007. In Proceedings of LREC 2008.

[14] Attardi, G., Basile, V., Bosco, C., Caselli, T., Dell’Orletta, F., Montemagni, S., Patti, V., Simi, M. & Sprugnoli, R. (2015). State of the Art Language Technologies for Italian: The EVALITA 2014 Perspective. Intelligenza Artificiale, 9(1), 43–61.

1.1.9 Further reading

• Bates, M (1995). “Models of natural language understanding”. Proceedings of the National Academy of Sciences of the United States of America. 92 (22): 9977–9982. doi:10.1073/pnas.92.22.9977.

• Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O'Reilly Media. ISBN 978-0-596-51649-9.

• Daniel Jurafsky and James H. Martin (2008). Speech and Language Processing, 2nd edition. Pearson Prentice Hall. ISBN 978-0-13-187321-6.

• Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (2008). Introduction to Information Retrieval. Cambridge University Press. ISBN 978-0-521-86571-5. Official html and pdf versions available without charge.

• Christopher D. Manning and Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. The MIT Press. ISBN 978-0-262-13360-9.

• David M. W. Powers and Christopher C. R. Turk (1989). Machine Learning of Natural Language. Springer-Verlag. ISBN 978-0-387-19557-5.

1.2 Computational linguistics

This article is about the scientific field. For the journal, see Computational Linguistics (journal).

Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective. Traditionally, computational linguistics was performed by computer scientists who had specialized in the application of computers to the processing of a natural language, but this met with little success. Computational linguists often work as members of interdisciplinary teams, which can include regular linguists, experts in the target language, and computer scientists. In general, computational linguistics draws upon the involvement of linguists, computer scientists, experts in artificial intelligence, mathematicians, logicians, philosophers, cognitive scientists, cognitive psychologists, psycholinguists, anthropologists and neuroscientists, among others.

Computational linguistics has theoretical and applied components. Theoretical computational linguistics focuses on issues in theoretical linguistics and cognitive science, and applied computational linguistics focuses on the practical outcome of modeling human language use.[1] The Association for Computational Linguistics defines computational linguistics as:

...the scientific study of language from a computational perspective. Computational linguists are interested in providing computational models of various kinds of linguistic phenomena.[2]

1.2.1 Origins

Computational linguistics is often grouped within the field of artificial intelligence, but it was present before the development of artificial intelligence. Computational linguistics originated with efforts in the United States in the 1950s to use computers to automatically translate texts from foreign languages, particularly Russian scientific journals, into English.[3] Since computers can make arithmetic calculations much faster and more accurately than humans, it was thought to be only a short matter of time before they could also begin to process language.[4]

Computational and quantitative methods are also used historically in the attempted reconstruction of earlier forms of modern languages and the subgrouping of modern languages into language families. Earlier methods, such as lexicostatistics and glottochronology, have been proven to be premature and inaccurate. However, recent interdisciplinary studies that borrow concepts from biological studies, especially gene mapping, have proved to produce more sophisticated analytical tools and more trustworthy results.[5]

When machine translation (also known as mechanical translation) failed to yield accurate translations right away, automated processing of human languages was recognized as far more complex than had originally been assumed. Computational linguistics was born as the name of the new field of study devoted to developing algorithms and software for intelligently processing language data. When artificial intelligence came into existence in the 1960s, the field of computational linguistics became that sub-division of artificial intelligence dealing with human-level comprehension and production of natural languages. In order to translate one language into another, it was observed that one had to understand the grammar of both languages, including both morphology (the grammar of word forms) and syntax (the grammar of sentence structure).
In order to understand syntax, one had to also understand the semantics and the lexicon (or 'vocabulary'), and even something of the pragmatics of language use. Thus, what started as an effort to translate between languages evolved into an entire discipline devoted to understanding how to represent and process natural languages using computers.[6]

Nowadays research within the scope of computational linguistics is done at computational linguistics departments,[7] computational linguistics laboratories,[8] computer science departments,[9] and linguistics departments.[10][11] Some research in the field of computational linguistics aims to create working speech or text systems, while other research aims to create systems allowing human-machine interaction. Programs meant for human-machine communication are called conversational agents.[12]

1.2.2 Approaches

Just as computational linguistics can be performed by experts in a variety of fields and through a wide assortment of departments, so too does research in the field broach a diverse range of topics. The following sections discuss some of the literature available across the entire field, broken into four main areas of discourse: developmental linguistics, structural linguistics, linguistic production, and linguistic comprehension.

Developmental approaches

Language is a cognitive skill which develops throughout the life of an individual. This developmental process has been examined using a number of techniques, and a computational approach is one of them. Human language development does provide some constraints which make it harder to apply a computational method to understanding it. For instance, during language acquisition, human children are largely only exposed to positive evidence.[13] This means that during the linguistic development of an individual, only evidence for what is a correct form is provided, and not evidence for what is not correct. This is insufficient information for a simple hypothesis testing procedure for information as complex as language,[14] and so provides certain boundaries for a computational approach to modeling language development and acquisition in an individual.

Attempts have been made to model the developmental process of language acquisition in children from a computational angle, leading to both statistical grammars and connectionist models.[15] Work in this realm has also been proposed as a method to explain the evolution of language through history. Using models, it has been shown that languages can be learned with a combination of simple input presented incrementally as the child develops better memory and longer attention span.[16] This was simultaneously posed as a reason for the long developmental period of human children.[16] Both conclusions were drawn because of the strength of the neural network which the project created.

The ability of infants to develop language has also been modeled using robots[17] in order to test linguistic theories. Enabled to learn as children might, a model was created based on an affordance model in which mappings between actions, perceptions, and effects were created and linked to spoken words. Crucially, these robots were able to acquire functioning word-to-meaning mappings without needing grammatical structure, vastly simplifying the learning process and shedding light on information which furthers the current understanding of linguistic development. It is important to note that this information could only have been empirically tested using a computational approach.

As our understanding of the linguistic development of an individual within a lifetime is continually improved using neural networks and learning robotic systems, it is also important to keep in mind that languages themselves change and develop through time. Computational approaches to understanding this phenomenon have unearthed very interesting information. Using the Price Equation and Pólya urn dynamics, researchers have created a system which not only predicts future linguistic evolution, but also gives insight into the evolutionary history of modern-day languages.[18] This modeling effort achieved, through computational linguistics, what would otherwise have been impossible.

It is clear that the understanding of linguistic development in humans as well as throughout evolutionary time has been fantastically improved because of advances in computational linguistics. The ability to model and modify systems at will affords science an ethical method of testing hypotheses that would otherwise be intractable.

Structural approaches

In order to create better computational models of language, an understanding of language's structure is crucial. To this end, the English language has been meticulously studied using computational approaches to better understand how the language works on a structural level. One of the most important prerequisites for studying linguistic structure is the availability of large linguistic corpora. These grant computational linguists the raw data necessary to run their models and gain a better understanding of the underlying structures present in the vast amount of data which is contained in any single language. One of the most cited English linguistic corpora is the Penn Treebank.[19] Containing over 4.5 million words of American English, this corpus has been annotated for part-of-speech information. This type of annotated corpus allows other researchers to apply hypotheses and measures that would otherwise be impossible to perform.

Theoretical approaches to the structure of languages have also been developed. These works allow computational linguistics to have a framework within which to work out hypotheses that will further the understanding of the language in a myriad of ways. One of the original theoretical theses on the internalization of grammar and structure of language proposed two types of models.[14] In these models, rules or patterns learned increase in strength with the frequency of their encounter.[14] The work also created a question for computational linguists to answer: how does an infant learn a specific and non-normal grammar (Chomsky Normal Form) without learning an overgeneralized version and getting stuck?[14] Theoretical efforts like these set the direction for research early in the lifetime of a field of study, and are crucial to the growth of the field.
Structural information about languages allows for the discovery and implementation of similarity recognition between pairs of text utterances.[20] For instance, it has recently been proven that based on the structural information present in patterns of human discourse, conceptual recurrence plots can be used to model and visualize trends in data and create reliable measures of similarity between natural textual utterances.[20] This technique is a strong tool for further probing the structure of human discourse. Without the computational approach to this question, the vastly complex information present in discourse data would have remained inaccessible to scientists.

Information regarding the structural data of a language is available for English as well as other languages, such as Japanese.[21] Using computational methods, Japanese sentence corpora were analyzed and a pattern of log-normality was found in relation to sentence length.[21] Though the exact cause of this lognormality remains unknown, it is precisely this sort of intriguing information which computational linguistics is designed to uncover. This information could lead to further important discoveries regarding the underlying structure of Japanese, and could have any number of effects on the understanding of Japanese as a language. Computational linguistics allows for very exciting additions to the scientific knowledge base to happen quickly and with very little room for doubt.

Without a computational approach to the structure of linguistic data, much of the information that is available now would still be hidden under the vastness of data within any single language. Computational linguistics allows scientists to parse huge amounts of data reliably and efficiently, creating the possibility for discoveries unlike any seen in most other approaches.

Production approaches

The production of language is equally complex, both in the information it provides and in the skills which a fluent producer must have. That is to say, comprehension is only half the problem of communication. The other half is how a system produces language, and computational linguistics has made some very interesting discoveries in this area.

Alan Turing: computer scientist and namesake developer of the Turing Test as a method of measuring the intelligence of a machine.

In a now-famous paper published in 1950, Alan Turing proposed the possibility that machines might one day have the ability to "think". As a thought experiment for what might define the concept of thought in machines, he proposed an "imitation test" in which a human subject has two text-only conversations, one with a fellow human and another with a machine attempting to respond like a human. Turing proposed that if the subject cannot tell the difference between the human and the machine, it may be concluded that the machine is capable of thought.[22] Today this test is known as the Turing test, and it remains an influential idea in the area of artificial intelligence.

Joseph Weizenbaum: former MIT professor and computer scientist who developed ELIZA, a primitive computer program utilizing natural language processing.

One of the earliest and best known examples of a computer program designed to converse naturally with humans is the ELIZA program developed by Joseph Weizenbaum at MIT in 1966. The program emulated a Rogerian psychotherapist when responding to written statements and questions posed by a user. It appeared capable of understanding what was said to it and responding intelligently, but in truth it simply followed a pattern-matching routine that relied on understanding only a few keywords in each sentence. Its responses were generated by recombining the unknown parts of the sentence around properly translated versions of the known words. For example, in the phrase "It seems that you hate me", ELIZA understands "you" and "me", which match the general pattern "you [some words] me", allowing ELIZA to swap the words "you" and "me" to "I" and "you" and reply "What makes you think I hate you?". In this example ELIZA has no understanding of the word "hate", but that is not required for a logical response in the context of this type of psychotherapy.[23]

Some projects are still trying to solve the problem which first established computational linguistics as its own field. However, the methods have become more refined, and consequently the results generated by computational linguists have become more enlightening. In an effort to improve computer translation, several models have been compared, including hidden Markov models, smoothing techniques, and specific refinements of those to apply them to verb translation.[24] The model found to produce the most natural translations of German and French words was a refined alignment model with a first-order dependence and a fertility model.[24] The authors also provide efficient training algorithms for the models presented, which can give other scientists the ability to improve further on their results. This type of work is specific to computational linguistics, and has applications which could vastly improve understanding of how language is produced and comprehended by computers.

Work has also been done in making computers produce language in a more naturalistic manner. Using linguistic input from humans, algorithms have been constructed which are able to modify a system's style of production based on a factor such as linguistic input from a human, or more abstract factors like politeness or any of the five main dimensions of personality.[25] This work takes a computational approach via parameter estimation models to categorize the vast array of linguistic styles seen across individuals and simplify it for a computer to work in the same way, making human-computer interaction much more natural.

Text-based interactive approach

Many of the earliest and simplest models of human-computer interaction, such as ELIZA, involve text-based input from the user to generate a response from the computer. By this method, words typed by a user trigger the computer to recognize specific patterns and reply accordingly, through a process known as keyword spotting.
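A keyword-spotting responder of this kind can be sketched in a few lines. The patterns and canned replies below are invented for illustration and are not Weizenbaum's actual ELIZA script; they merely show how a handful of regular-expression rules plus a generic fallback can produce seemingly intelligent replies:

```python
import re

# Hypothetical minimal keyword-spotting rules in the spirit of ELIZA.
# Each rule pairs a pattern with a response template; {0} receives the
# captured "unknown" part of the user's sentence.
RULES = [
    (re.compile(r"\byou (.+) me\b", re.I), "What makes you think I {0} you?"),
    (re.compile(r"\bi am (.+)", re.I), "Why do you say you are {0}?"),
    (re.compile(r"\bmy (.+) hurts\b", re.I), "Why do you say your {0} hurts?"),
]

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."  # generic fallback when no keyword matches

print(respond("It seems that you hate me"))  # What makes you think I hate you?
print(respond("My head hurts"))              # Why do you say your head hurts?
```

Note that, exactly as described above, the program has no understanding of the captured words; it only reassembles them around the keywords it does recognize.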

Speech-based interactive approach

Recent technologies have placed more of an emphasis on speech-based interactive systems. These systems, such as Siri on the iOS operating system, operate on a similar pattern-recognizing technique as that of text-based systems, but with the former, the user input is conducted through speech recognition. This branch of linguistics involves the processing of the user's speech as sound waves and the interpreting of the acoustics and language patterns in order for the computer to recognize the input.[26]

Comprehension approaches

Much of the focus of modern computational linguistics is on comprehension. With the proliferation of the internet and the abundance of easily accessible written human language, the ability to create a program capable of understanding human language would have many broad and exciting possibilities, including improved search engines, automated customer service, and online education.

Early work in comprehension included applying Bayesian statistics to the task of optical character recognition, as illustrated by Bledsoe and Browning in 1959, in which a large dictionary of possible letters was generated by "learning" from example letters, and the probability that any one of those learned examples matched the new input was then combined to make a final decision.[27] Other attempts at applying Bayesian statistics to language analysis included the work of Mosteller and Wallace (1963), in which an analysis of the words used in The Federalist Papers was used to attempt to determine their authorship (concluding that Madison most likely authored the majority of the papers).[28]

In 1971 Terry Winograd developed an early natural language processing engine capable of interpreting naturally written commands within a simple rule-governed environment. The primary language parsing program in this project was called SHRDLU, which was capable of carrying out a somewhat natural conversation with the user giving it commands, but only within the scope of the toy environment designed for the task. This environment consisted of differently shaped and colored blocks, and SHRDLU was capable of interpreting commands such as "Find a block which is taller than the one you are holding and put it into the box." and asking questions such as "I don't understand which pyramid you mean." in response to the user's input.[29] While impressive, this kind of natural language processing has proven much more difficult outside the limited scope of the toy environment.
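The Bayesian reasoning behind these early studies can be sketched, very loosely, as a word-frequency classifier: build a probability profile from each candidate author's known text and score a disputed text against each profile. The miniature "corpora" below are invented, and the add-one smoothing is a common textbook device, not the scheme Mosteller and Wallace actually used:

```python
import math
from collections import Counter

# Invented miniature writing samples for two candidate authors. Mosteller and
# Wallace famously relied on function words such as "whilst" vs. "while".
known = {
    "Madison":  "whilst the powers of the government whilst on the whole",
    "Hamilton": "while the powers of the government while upon the whole",
}

def profile(text):
    """Return a smoothed word-probability function learned from `text`."""
    counts = Counter(text.split())
    total = sum(counts.values())
    vocab = len(counts)
    # add-one smoothing so unseen words do not zero out the score
    return lambda w: (counts[w] + 1) / (total + vocab + 1)

def classify(disputed, profiles):
    """Pick the author whose profile gives the disputed text the highest
    log-probability (a naive Bayes decision with uniform priors)."""
    scores = {author: sum(math.log(p(w)) for w in disputed.split())
              for author, p in profiles.items()}
    return max(scores, key=scores.get)

profiles = {author: profile(text) for author, text in known.items()}
print(classify("whilst the government", profiles))  # Madison
```

The same combine-the-evidence structure underlies the letter-recognition work of Bledsoe and Browning, with learned letter shapes in place of word frequencies.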
Similarly, a project developed by NASA called LUNAR was designed to provide answers to naturally written questions about the geological analysis of lunar rocks returned by the Apollo missions.[30] These kinds of problems are referred to as question answering.

Initial attempts at understanding spoken language were based on work done in the 1960s and 70s in signal modeling, where an unknown signal is analyzed to look for patterns and to make predictions based on its history. An initial and somewhat successful approach to applying this kind of signal modeling to language was achieved with the use of hidden Markov models, as detailed by Rabiner in 1989.[31] This approach attempts to determine probabilities for the arbitrary number of models that could be being used in generating speech as well as modeling the probabilities for various words generated from each of these possible models. Similar approaches were employed in early speech recognition attempts starting in the late 70s at IBM using word/part-of-speech pair probabilities.[32]
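The hidden-Markov-model decoding that Rabiner's tutorial covers can be illustrated with a minimal Viterbi decoder: given start, transition, and emission probabilities, it recovers the most likely hidden-state sequence behind a sequence of observations. The two-state "vowel/consonant" model and all of its probabilities below are invented for illustration:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for `obs`."""
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            best[t][s], back[t][s] = prob, prev
    # trace the best path backwards from the most probable final state
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy model: two hidden states emitting two symbols ("a" and "t").
states = ["vowel", "consonant"]
start = {"vowel": 0.5, "consonant": 0.5}
trans = {"vowel": {"vowel": 0.3, "consonant": 0.7},
         "consonant": {"vowel": 0.7, "consonant": 0.3}}
emit = {"vowel": {"a": 0.8, "t": 0.2},
        "consonant": {"a": 0.1, "t": 0.9}}

print(viterbi(["t", "a", "t"], states, start, trans, emit))
# ['consonant', 'vowel', 'consonant']
```

Real speech systems apply the same dynamic program to acoustic feature vectors and phone or word states, at vastly larger scale.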

More recently these kinds of statistical approaches have been applied to more difficult tasks such as topic identification using Bayesian parameter estimation to infer topic probabilities in text documents.[33]

1.2.3 Applications

Modern computational linguistics is often a combination of studies in computer science and programming, math (particularly statistics), language structures, and natural language processing. Combined, these fields most often lead to the development of systems that can recognize speech and perform some task based on that speech. Examples include speech recognition software, such as Apple's Siri feature; spellcheck tools; speech synthesis programs, which are often used to demonstrate pronunciation or help the disabled; and machine translation programs and websites, such as Google Translate and Word Reference.[34]

Computational linguistics can be especially helpful in situations involving social media and the Internet. For example, filters in chatrooms or on website searches require computational linguistics. Chat operators often use filters to identify certain words or phrases and deem them inappropriate so that users cannot submit them.[34] Another example of using filters is on websites. Schools use filters so that websites with certain keywords are blocked from children's view. There are also many programs with which parents can put content filters in place through parental controls. Computational linguists can also develop programs that group and organize content through social media mining. An example of this is Twitter, in which programs can group tweets by subject or keywords.[35]

1.2.4 Subfields

Computational linguistics can be divided into major areas depending upon the medium of the language being processed, whether spoken or textual, and upon the task being performed, whether analyzing language (recognition) or synthesizing language (generation).

Speech recognition and speech synthesis deal with how spoken language can be understood or created using computers. Parsing and generation are sub-divisions of computational linguistics dealing respectively with taking language apart and putting it together. Machine translation remains the sub-division of computational linguistics dealing with having computers translate between languages. The possibility of fully automatic language translation, however, has yet to be realized, and it remains a notoriously difficult branch of computational linguistics.[36]

Some of the areas of research that are studied by computational linguistics include:

• Computational complexity of natural language, largely modeled on automata theory, with the application of context-sensitive grammar and linearly bounded Turing machines

• Computational semantics, which comprises defining suitable logics for linguistic meaning representation, automatically constructing them, and reasoning with them

• Computer-aided corpus linguistics, which has been used since the 1970s as a way to make detailed advances in the field of discourse analysis[37]

• Design of parsers or chunkers for natural languages

• Design of taggers like POS-taggers (part-of-speech taggers)

• Machine translation, one of the earliest and most difficult applications of computational linguistics, which draws on many subfields

• Simulation and study of language evolution in historical linguistics/glottochronology
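As a concrete taste of the tagger design mentioned in the list above, a unigram part-of-speech tagger can be sketched in a few lines: each word receives its most frequent tag in a training corpus. The tiny tagged corpus and the noun fallback for unknown words are invented simplifications of what real taggers do:

```python
from collections import Counter, defaultdict

# Invented miniature training corpus of (word, tag) pairs, using
# Penn-Treebank-style tag names (DT, NN, VBZ, RB).
tagged_corpus = [
    ("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
    ("the", "DT"), ("cat", "NN"), ("runs", "VBZ"), ("fast", "RB"),
]

# Count how often each word carries each tag.
counts = defaultdict(Counter)
for word, tag_ in tagged_corpus:
    counts[word][tag_] += 1

def tag(sentence):
    """Assign each word its most frequent training tag; guess NN otherwise."""
    return [(w, counts[w].most_common(1)[0][0] if w in counts else "NN")
            for w in sentence.split()]

print(tag("the dog runs"))  # [('the', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')]
```

Production taggers add context (e.g. tag n-grams or neural models), but this frequency baseline already performs surprisingly well on unambiguous words.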

1.2.5 Legacy

The subject of computational linguistics has had a recurring impact on popular culture:

• The 1983 film WarGames features a young computer hacker who interacts with an artificially intelligent supercomputer.[38]

• A 1997 film, Conceiving Ada, focuses on Ada Lovelace, considered one of the first computer scientists, as well as themes of computational linguistics.[39]

• Her, a 2013 film, depicts a man’s interactions with the “world’s first artificially intelligent operating system.”[40]

• The 2014 film The Imitation Game follows the life of computer scientist Alan Turing, developer of the Turing Test.[41]

• The 2015 film Ex Machina centers around human interaction with artificial intelligence.[42]

1.2.6 See also


1.2.7 References

[1] Hans Uszkoreit. What Is Computational Linguistics? Department of Computational Linguistics and Phonetics of Saarland University

[2] The Association for Computational Linguistics What is Computational Linguistics? Published online, Feb, 2005.

[3] John Hutchins: Retrospect and prospect in computer-based translation. Proceedings of MT Summit VII, 1999, pp. 30–44.

[4] Arnold B. Barach: Translating Machine 1975: And the Changes To Come.

[5] T. Crowley., C. Bowern. An Introduction to Historical Linguistics. Auckland, N.Z.: Oxford UP, 1992. Print.

[6] Natural Language Processing by Liz Liddy, Eduard Hovy, Jimmy Lin, John Prager, Dragomir Radev, Lucy Vanderwende, Ralph Weischedel

[7] “Computational Linguistics and Phonetics”.

[8] “Yatsko’s Computational Linguistics Laboratory”.

[9] “CLIP”.

[10] Computational Linguistics – Department of Linguistics – Georgetown College

[11] “UPenn Linguistics: Computational Linguistics”.

[12] Jurafsky, D., & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, N.J: Pearson Prentice Hall.

[13] Bowerman, M. (1988). The “no negative evidence” problem: How do children avoid constructing an overly general grammar. Explaining language universals.

[14] Braine, M.D.S. (1971). On two types of models of the internalization of grammars. In D.I. Slobin (Ed.), The ontogenesis of grammar: A theoretical perspective. New York: Academic Press.

[15] Powers, D.M.W. & Turk, C.C.R. (1989). Machine Learning of Natural Language. Springer-Verlag. ISBN 978-0-387-19557-5.

[16] Elman, J. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 71-99.

[17] Salvi, G., Montesano, L., Bernardino, A., & Santos-Victor, J. (2012). Language bootstrapping: learning word meanings from perception-action association. IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics : a publication of the IEEE Systems, Man, and Cybernetics Society, 42(3), 660-71. doi:10.1109/TSMCB.2011.2172420

[18] Gong, T.; Shuai, L.; Tamariz, M. & Jäger, G. (2012). E. Scalas, ed. “Studying Language Change Using Price Equation and Pólya-urn Dynamics”. PLoS ONE. 7 (3): e33171. doi:10.1371/journal.pone.0033171.

[19] Marcus, M. & Marcinkiewicz, M. (1993). “Building a large annotated corpus of English: The Penn Treebank” (PDF). Computational linguistics. 19 (2): 313–330.

[20] Angus, D.; Smith, A. & Wiles, J. (2012). “Conceptual recurrence plots: revealing patterns in human discourse”. IEEE transactions on visualization and computer graphics. 18 (6): 988–97. doi:10.1109/TVCG.2011.100.

[21] Furuhashi, S. & Hayakawa, Y. (2012). “Lognormality of the Distribution of Japanese Sentence Lengths”. Journal of the Physical Society of Japan. 81 (3): 034004. doi:10.1143/JPSJ.81.034004.

[22] Turing, A. M. (1950). “Computing machinery and intelligence”. Mind. 59 (236): 433–460. doi:10.1093/mind/lix.236.433. JSTOR 10.2307/2251299.

[23] Weizenbaum, J. (1966). “ELIZA—a computer program for the study of natural language communication between man and machine”. Communications of the ACM. 9 (1): 36–45. doi:10.1145/365153.365168.

[24] Och, F. J.; Ney, H. (2003). “A Systematic Comparison of Various Statistical Alignment Models”. Computational Linguistics. 29 (1): 19–51. doi:10.1162/089120103321337421.

[25] Mairesse, F. (2011). “Controlling user perceptions of linguistic style: Trainable generation of personality traits”. Computational Linguistics. 37 (3): 455–488. doi:10.1162/COLI_a_00063.

[26] Language Files. The Ohio State University Department of Linguistics. 2011. pp. 624–634. ISBN 9780814251799.

[27] Bledsoe, W. W. & Browning, I. (1959). Pattern recognition and reading by machine. Papers presented at the December 1–3, 1959, eastern joint IRE-AIEE-ACM computer conference on - IRE-AIEE-ACM ’59 (Eastern). New York, New York, USA: ACM Press. pp. 225–232. doi:10.1145/1460299.1460326.

[28] Mosteller, F. (1963). “Inference in an authorship problem”. Journal of the American Statistical Association. 58 (302): 275–309. doi:10.2307/2283270. JSTOR 10.2307/2283270.

[29] Winograd, T. (1971). “Procedures as a Representation for Data in a Computer Program for Understanding Natural Lan- guage” (Report).

[30] Woods, W.; Kaplan, R. & Nash-Webber, B. (1972). “The lunar sciences natural language information system” (Report).

[31] Rabiner, L. (1989). “A tutorial on hidden Markov models and selected applications in speech recognition”. Proceedings of the IEEE.

[32] Bahl, L.; Baker, J.; Cohen, P.; Jelinek, F. (1978). “Recognition of continuously read natural corpus”. Acoustics, Speech, and Signal Processing. 3: 422–424. doi:10.1109/ICASSP.1978.1170402.

[33] Blei, D. & Ng, A. (2003). “Latent Dirichlet Allocation”. The Journal of Machine Learning Research. 3: 993–1022.

[34] “Careers in Computational Linguistics”. California State University. Retrieved 19 September 2016.

[35] Marujo, Luís et al. “Automatic Keyword Extraction on Twitter.” Language Technologies Institute, Carnegie Mellon University, n.d. Web. 19 Sept. 2016.

[36] Oettinger, A. G. (1965). Computational Linguistics. The American Mathematical Monthly, Vol. 72, No. 2, Part 2: Computers and Computing, pp. 147-150.

[37] McEnery, Thomas (1996). Corpus Linguistics: An Introduction. Edinburgh: Edinburgh University Press. p. 114. ISBN 0748611657.

[38] Badham, John (1983-06-03), WarGames, retrieved 2016-02-22

[39] Hershman-Leeson, Lynn (1999-02-19), Conceiving Ada, retrieved 2016-02-22

[40] Jonze, Spike (2014-01-10), Her, retrieved 2016-02-18

[41] Tyldum, Morten (2014-12-25), The Imitation Game, retrieved 2016-02-18

[42] Garland, Alex (2015-04-24), Ex Machina, retrieved 2016-02-18

1.2.8 External links

• Association for Computational Linguistics (ACL)

• ACL Anthology of research papers

• ACL Wiki for Computational Linguistics

• CICLing annual conferences on Computational Linguistics

• Computational Linguistics – Applications workshop

• Free online introductory book on Computational Linguistics at the Wayback Machine (archived January 25, 2008)

• Language Technology World

• Resources for Text, Speech and Language Processing

• The Research Group in Computational Linguistics

1.3 Text mining

Text mining, also referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness.

Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.

A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted.
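The structure-then-derive-patterns pipeline described above can be sketched at toy scale: tokenize the raw text, drop stop words (structuring), then rank term frequencies (pattern derivation). The document contents and stop-word list below are invented:

```python
import re
from collections import Counter

# Invented sample documents standing in for a real corpus.
docs = [
    "The treaty was signed in Geneva. The treaty ends the dispute.",
    "Negotiators met in Geneva to draft the treaty text.",
]

# A deliberately tiny stop-word list; real systems use much larger ones.
stop = {"the", "was", "in", "to", "ends"}

def top_terms(texts, n=3):
    """Structure the input (lowercase, tokenize, remove stop words),
    then derive a simple pattern: the n most frequent terms."""
    tokens = []
    for text in texts:
        tokens += [w for w in re.findall(r"[a-z]+", text.lower())
                   if w not in stop]
    return [term for term, _ in Counter(tokens).most_common(n)]

print(top_terms(docs))
```

Real text-mining systems replace the frequency count with richer statistics (tf-idf, association rules, classifiers), but the structure-first, derive-second shape of the pipeline is the same.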

1.3.1 Text analytics

The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.[1] The term is roughly synonymous with text mining; indeed, Ronen Feldman modified a 2000 description of “text mining”[2] in 2004 to describe “text analytics.”[3] The latter term is now used more frequently in business settings, while “text mining” is used in some of the earliest application areas, dating to the 1980s,[4] notably life-sciences research and government intelligence.

The term text analytics also describes the application of text analytics to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text.[5] These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing.

1.3.2 History

Labor-intensive manual text mining approaches first surfaced in the mid-1980s,[6] but technological advances have enabled the field to advance during the past decade. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (common estimates say over 80%)[5] is currently stored as text, text mining is believed to have a high commercial potential value. Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning. The challenge of exploiting the large proportion of enterprise information that originates in “unstructured” form has been recognized for decades.[7] It is recognized in the earliest definition of business intelligence (BI), in an October 1958 IBM Journal article by H.P. Luhn, A Business Intelligence System, which describes a system that will:

"...utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the 'action points’ in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points.”

Yet as management information systems developed starting in the 1960s, and as BI emerged in the '80s and '90s as a software category and field of practice, the emphasis was on numerical data stored in relational databases. This is not surprising: text in “unstructured” documents is hard to process. The emergence of text analytics in its current form stems from a refocusing of research in the late 1990s from algorithm development to application, as described by Prof. Marti A. Hearst in the paper Untangling Text Data Mining:[8]

For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collections to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.

Hearst’s 1999 statement of need fairly well describes the state of text analytics technology and practice a decade later.

1.3.3 Text analysis processes

Subtasks—components of a larger text-analytics effort—typically include:

• Information retrieval or identification of a corpus is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content corpus manager, for analysis.

• Although some text analytics systems rely exclusively on advanced statistical methods, many others apply more extensive natural language processing, such as part-of-speech tagging, syntactic parsing, and other types of linguistic analysis.

• Named entity recognition is the use of gazetteers or statistical techniques to identify named text features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so on. Disambiguation, the use of contextual clues, may be required to decide whether, for instance, "Ford" refers to a former U.S. president, a vehicle manufacturer, a movie star, a river crossing, or some other entity.

• Recognition of Pattern Identified Entities: Features such as telephone numbers, e-mail addresses, and quantities (with units) can be discerned via regular expression or other pattern matches.

• Coreference: identification of noun phrases and other terms that refer to the same object.

• Relationship, fact, and event extraction: identification of associations among entities and other information in text.

• Sentiment analysis involves discerning subjective (as opposed to factual) material and extracting various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder from opinion object.[9]

• Quantitative text analysis is a set of techniques stemming from the social sciences in which either a human judge or a computer extracts semantic or grammatical relationships between words in order to find the meaning or stylistic patterns of, usually, casual personal text, for purposes such as psychological profiling.[10]
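The "Recognition of Pattern Identified Entities" subtask above can be sketched with regular expressions. The patterns below are simplified illustrations, not production-grade validators (real-world phone and e-mail formats vary far more widely):

```python
import re

# Simplified illustrative patterns; real-world formats vary far more widely.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def extract_pattern_entities(text):
    """Return e-mail addresses and phone numbers matched in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }

sample = "Reach ann@example.com or call 555-123-4567 for details."
entities = extract_pattern_entities(sample)
# entities["emails"] → ["ann@example.com"], entities["phones"] → ["555-123-4567"]
```

Unlike named entity recognition, which typically needs gazetteers or statistical models, these entities are fully determined by their surface pattern, which is why simple matching suffices.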

1.3.4 Applications

The technology is now broadly applied for a wide variety of government, research, and business needs. Applications can be sorted into a number of categories by analysis type or by business function. Using this approach to classifying solutions, application categories include:

• Enterprise Business Intelligence/Data Mining, Competitive Intelligence

• E-Discovery, Records Management

• National Security/Intelligence

• Scientific discovery, especially Life Sciences

• Sentiment Analysis Tools, Listening Platforms

• Natural Language/Semantic Toolkit or Service

• Publishing

• Automated ad placement

• Search/Information Access

• Social media monitoring

Security applications

Many text mining software packages are marketed for security applications, especially monitoring and analysis of online plain text sources such as Internet news, blogs, etc. for national security purposes.[11] It is also involved in the study of text encryption/decryption.

Biomedical applications

Main article: Biomedical text mining

A range of text mining applications in the biomedical literature has been described.[12] One online text mining application in the biomedical literature is PubGene, which combines biomedical text mining with network visualization as an Internet service.[13][14] GoPubMed is a knowledge-based search engine for biomedical texts.

Software applications

Text mining methods and software are also being researched and developed by major firms, including IBM and Microsoft, to further automate the mining and analysis processes, and by different firms working in the area of search and indexing in general as a way to improve their results. Within the public sector, much effort has been concentrated on creating software for tracking and monitoring terrorist activities.[15]

Online media applications

Text mining is being used by large media companies, such as the Tribune Company, to clarify information and to provide readers with better search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors benefit by being able to share, associate, and package news across properties, significantly increasing opportunities to monetize content.

Marketing applications

Text mining is starting to be used in marketing as well, more specifically in analytical customer relationship management.[16] Coussement and Van den Poel (2008)[17][18] apply it to improve predictive analytics models for customer churn (customer attrition).[17]

Sentiment analysis

Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie.[19] Such an analysis may need a labeled data set or labeling of the affectivity of words. Resources for the affectivity of words and concepts have been made for WordNet[20] and ConceptNet,[21] respectively. Text has been used to detect emotions in the related area of affective computing.[22] Text-based approaches to affective computing have been used on multiple corpora such as student evaluations, children's stories, and news stories.
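The word-affectivity approach described above can be sketched minimally as lexicon lookup plus averaging. The affect scores in this lexicon are invented purely for illustration; real systems draw their scores from resources such as the WordNet and ConceptNet affect resources cited above:

```python
# Minimal lexicon-based sentiment sketch. The affect scores below are
# invented for illustration; real systems use curated affect resources.
AFFECT_LEXICON = {"great": 1.0, "wonderful": 0.8, "dull": -0.6, "awful": -1.0}

def review_score(text):
    """Average affect score of the known words; 0.0 if none are known."""
    words = text.lower().split()
    scores = [AFFECT_LEXICON[w] for w in words if w in AFFECT_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

positive = review_score("a great and wonderful film")   # (1.0 + 0.8) / 2 = 0.9
negative = review_score("dull and awful")               # (-0.6 + -1.0) / 2 = -0.8
```

This ignores negation, intensifiers, and context, which is precisely why the entity- and topic-level techniques discussed in the subtasks section are needed for serious sentiment work.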

Academic applications

The issue of text mining is of importance to publishers who hold large databases of information needing indexing for retrieval. This is especially true in scientific disciplines, in which highly specific information is often contained within written text. Therefore, initiatives have been taken such as Nature’s proposal for an Open Text Mining Interface (OTMI) and the National Institutes of Health's common Journal Publishing Document Type Definition (DTD) that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access. Academic institutions have also become involved in the text mining initiative:

• The National Centre for Text Mining (NaCTeM) is the first publicly funded text mining centre in the world. NaCTeM is operated by the University of Manchester[23] in close collaboration with the Tsujii Lab,[24] University of Tokyo.[25] NaCTeM provides customised tools and research facilities and offers advice to the academic community. They are funded by the Joint Information Systems Committee (JISC) and two of the UK Research Councils (EPSRC & BBSRC). With an initial focus on text mining in the biological and biomedical sciences, research has since expanded into the social sciences.

• In the United States, the School of Information at University of California, Berkeley is developing a program called BioText to assist biology researchers in text mining and analysis.

• The Text Analysis Portal for Research (TAPoR), currently housed at the University of Alberta, is a scholarly project to catalogue text analysis applications and create a gateway for researchers new to the practice.

Digital humanities and computational sociology

The automatic analysis of vast textual corpora has created the possibility for scholars to analyse millions of documents in multiple languages with very limited manual intervention. Key enabling technologies have been parsing, machine translation, topic categorization, and machine learning. The automatic parsing of textual corpora has enabled the extraction of actors and their relational networks on a vast scale, turning textual data into network data. The resulting networks, which can contain thousands of nodes, are then analysed using tools from network theory to identify the key actors, the key communities or parties, and general properties such as robustness or structural stability of the overall network, or centrality of certain nodes.[27] This automates the approach introduced by quantitative narrative analysis,[28] whereby subject-verb-object triplets are identified with pairs of actors linked by an action, or pairs formed by actor-object.[26]

Content analysis has been a traditional part of social sciences and media studies for a long time. The automation of content analysis has allowed a "big data" revolution to take place in that field, with studies in social media and newspaper content that include millions of news items. Gender bias, readability, content similarity, reader preferences, and even mood have been analyzed based on text mining methods over millions of documents.[29][30][31][32][33] The analysis of readability, gender bias and topic bias was demonstrated in Flaounas et al.,[34] showing how different topics have different gender biases and levels of readability; the possibility to detect mood shifts in a vast population by analysing Twitter content was demonstrated as well.[35]

1.3.5 Software

Text mining computer programs are available from many commercial and open source companies and sources. See List of text mining software.

Narrative network of US Elections 2012[26]

1.3.6 Intellectual property law

Situation in Europe

Because of a lack of flexibilities in European copyright and database law, the mining of in-copyright works (such as web mining) without the permission of the copyright owner is illegal. In the UK in 2014, on the recommendation of the Hargreaves review, the government amended copyright law[36] to allow text mining as a limitation and exception. The UK was only the second country in the world to do so, following Japan, which introduced a mining-specific exception in 2009. However, owing to the restrictions of the Copyright Directive, the UK exception only allows content mining for non-commercial purposes. UK copyright law does not allow this provision to be overridden by contractual terms and conditions. The European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licences for Europe.[37] The fact that the proposed solution to this legal issue was licences, rather than limitations and exceptions to copyright law, led representatives of universities, researchers, libraries, civil society groups and open access publishers to leave the stakeholder dialogue in May 2013.[38]

Video by the Fix Copyright campaign explaining TDM and its copyright issues in the EU, 2016 (3:52).

Situation in the United States

In contrast to Europe, the flexible nature of US copyright law, and in particular fair use, means that text mining in America, as well as in other fair use countries such as Israel, Taiwan and South Korea, is viewed as legal. As text mining is transformative, meaning that it does not supplant the original work, it is viewed as lawful under fair use. For example, as part of the Google Book settlement the presiding judge on the case ruled that Google's digitisation project of in-copyright books was lawful, in part because of the transformative uses that the digitisation project displayed, one such use being text and data mining.[39]

1.3.7 Implications

Until recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through use of a semantic web, text mining can find content based on meaning and context (rather than just by a specific word). Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, large datasets based on data extracted from news reports can be built to facilitate social network analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis. Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material. Text mining plays an important role in determining financial market sentiment.

1.3.8 See also

• Document processing

• Full text search

• List of text mining software

• Market sentiment

• Name resolution (semantics and text extraction)

• Named entity recognition

• News analytics

• Record linkage

• Sequential pattern mining (string and sequence mining)

• w-shingling

• Web mining, a task that may involve text mining (e.g. first find appropriate web pages by classifying crawled web pages, then extract the desired information from the text content of the pages considered relevant)

1.3.9 References

Citations

[1] Archived November 29, 2009, at the Wayback Machine.

[2] “KDD-2000 Workshop on Text Mining - Call for Papers”. Cs.cmu.edu. Retrieved 2015-02-23.

[3] Archived March 3, 2012, at the Wayback Machine.

[4] Hobbs, Jerry R.; Walker, Donald E.; Amsler, Robert A. (1982). “Proceedings of the 9th conference on Computational linguistics”. 1: 127–32. doi:10.3115/991813.991833.

[5] “Unstructured Data and the 80 Percent Rule”. Breakthrough Analysis. Retrieved 2015-02-23.

[6] “Content Analysis of Verbatim Explanations”. Ppc.sas.upenn.edu. Retrieved 2015-02-23.

[7] “A Brief History of Text Analytics by Seth Grimes”. Beyenetwork. 2007-10-30. Retrieved 2015-02-23.

[8] Hearst, Marti A. (1999). “Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics”: 3–10. doi:10.3115/1034678.1034679. ISBN 1-55860-609-2.

[9] “Full Circle Sentiment Analysis”. Breakthrough Analysis. Retrieved 2015-02-23.

[10] Mehl, Matthias R. (2006). “Handbook of multimethod measurement in psychology": 141. doi:10.1037/11383-011. ISBN 1-59147-318-7.

[11] Zanasi, Alessandro (2009). "Proceedings of the International Workshop on Computational Intelligence in Security for Information Systems CISIS'08". Advances in Soft Computing. 53: 53. doi:10.1007/978-3-540-88181-0_7. ISBN 978-3-540-88180-3.

[12] Cohen, K. Bretonnel; Hunter, Lawrence (2008). “Getting Started in Text Mining”. PLoS Computational Biology. 4 (1): e20. doi:10.1371/journal.pcbi.0040020. PMC 2217579 . PMID 18225946.

[13] Jenssen, Tor-Kristian; Lægreid, Astrid; Komorowski, Jan; Hovig, Eivind (2001). “A literature network of human genes for high-throughput analysis of gene expression”. Nature Genetics. 28 (1): 21–8. doi:10.1038/ng0501-21. PMID 11326270.

[14] Masys, Daniel R. (2001). "Linking microarray data to the literature". Nature Genetics. 28 (1): 9–10. doi:10.1038/ng0501-9. PMID 11326264.

[15] Archived October 4, 2013, at the Wayback Machine.

[16] “Text Analytics”. Medallia. Retrieved 2015-02-23.

[17] Coussement, Kristof; Van Den Poel, Dirk (2008). "Integrating the voice of customers through call center emails into a decision support system for churn prediction". Information & Management. 45 (3): 164–74. doi:10.1016/j.im.2008.01.005.

[18] Coussement, Kristof; Van Den Poel, Dirk (2008). "Improving customer complaint management by automatic email classification using linguistic style features as predictors". Decision Support Systems. 44 (4): 870–82. doi:10.1016/j.dss.2007.10.010.

[19] Pang, Bo; Lee, Lillian; Vaithyanathan, Shivakumar (2002). “Proceedings of the ACL-02 conference on Empirical methods in natural language processing”. 10: 79–86. doi:10.3115/1118693.1118704.

[20] Alessandro Valitutti; Carlo Strapparava; Oliviero Stock (2005). "Developing Affective Lexical Resources" (PDF). Psychology Journal. 2 (1): 61–83.

[21] Erik Cambria; Robert Speer; Catherine Havasi; Amir Hussain (2010). “SenticNet: a Publicly Available Semantic Resource for Opinion Mining” (PDF). Proceedings of AAAI CSK. pp. 14–18.

[22] Calvo, Rafael A; d'Mello, Sidney (2010). “Affect Detection: An Interdisciplinary Review of Models, Methods, and Their Applications”. IEEE Transactions on Affective Computing. 1 (1): 18–37. doi:10.1109/T-AFFC.2010.1.

[23] “The University of Manchester”. Manchester.ac.uk. Retrieved 2015-02-23.

[24] “Tsujii Laboratory”. Tsujii.is.s.u-tokyo.ac.jp. Retrieved 2015-02-23.

[25] “The University of Tokyo”. UTokyo. Retrieved 2015-02-23.

[26] Automated analysis of the US presidential elections using Big Data and network analysis; S Sudhahar, GA Veltri, N Cristianini; Big Data & Society 2 (1), 1-28, 2015

[27] Network analysis of narrative content in large corpora; S Sudhahar, G De Fazio, R Franzosi, N Cristianini; Natural Language Engineering, 1-32, 2013

[28] Quantitative Narrative Analysis; Roberto Franzosi; Emory University © 2010

[29] Lansdall-Welfare, Thomas; Sudhahar, Saatviga; Thompson, James; Lewis, Justin; Team, FindMyPast Newspaper; Cristianini, Nello (2017-01-09). "Content analysis of 150 years of British periodicals". Proceedings of the National Academy of Sciences: 201606380. doi:10.1073/pnas.1606380114. ISSN 0027-8424. PMID 28069962.

[30] I. Flaounas, M. Turchi, O. Ali, N. Fyson, T. De Bie, N. Mosdell, J. Lewis, N. Cristianini, The Structure of EU Mediasphere, PLoS ONE, Vol. 5(12), pp. e14243, 2010.

[31] Nowcasting Events from the Social Web with Statistical Learning; V Lampos, N Cristianini; ACM Transactions on Intelligent Systems and Technology (TIST) 3 (4), 72

[32] NOAM: news outlets analysis and monitoring system; I Flaounas, O Ali, M Turchi, T Snowsill, F Nicart, T De Bie, N Cristianini Proc. of the 2011 ACM SIGMOD international conference on Management of data

[33] Automatic discovery of patterns in media content, N Cristianini, Combinatorial Pattern Matching, 2-13, 2011

[34] I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, N. Cristianini, RESEARCH METHODS IN THE AGE OF DIGITAL JOURNALISM, Digital Journalism, Routledge, 2012

[35] Effects of the Recession on Public Mood in the UK; T Lansdall-Welfare, V Lampos, N Cristianini; Mining Social Network Dynamics (MSND) session on Social Media Applications

[36] Archived June 9, 2014, at the Wayback Machine.

[37] “Licences for Europe - Structured Stakeholder Dialogue 2013”. European Commission. Retrieved 14 November 2014.

[38] “Text and Data Mining:Its importance and the need for change in Europe”. Association of European Research Libraries. Retrieved 14 November 2014.

[39] “Judge grants summary judgment in favor of Google Books — a fair use victory”. Lexology.com. Antonelli Law Ltd. Retrieved 14 November 2014.

Sources

• Ananiadou, S. and McNaught, J. (Editors) (2006). Text Mining for Biology and Biomedicine. Artech House Books. ISBN 978-1-58053-984-5

• Bilisoly, R. (2008). Practical Text Mining with Perl. New York: John Wiley & Sons. ISBN 978-0-470-17643-6

• Feldman, R., and Sanger, J. (2006). The Text Mining Handbook. New York: Cambridge University Press. ISBN 978-0-521-83657-9

• Indurkhya, N., and Damerau, F. (2010). Handbook of Natural Language Processing, 2nd Edition. Boca Raton, FL: CRC Press. ISBN 978-1-4200-8592-1

• Kao, A., and Poteet, S. (Editors). Natural Language Processing and Text Mining. Springer. ISBN 1-84628-175-X

• Konchady, M. Text Mining Application Programming (Programming Series). Charles River Media. ISBN 1-58450-460-9

• Manning, C., and Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. ISBN 978-0-262-13360-9

• Miner, G., Elder, J., Hill. T, Nisbet, R., Delen, D. and Fast, A. (2012). Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Elsevier Academic Press. ISBN 978-0-12-386979-1

• McKnight, W. (2005). “Building business intelligence: Text data mining in business intelligence”. DM Review, 21-22.

• Srivastava, A., and Sahami, M. (2009). Text Mining: Classification, Clustering, and Applications. Boca Raton, FL: CRC Press. ISBN 978-1-4200-5940-3

• Zanasi, A. (Editor) (2007). Text Mining and its Applications to Intelligence, CRM and Knowledge Management. WIT Press. ISBN 978-1-84564-131-3

1.3.10 External links

• Marti Hearst: What Is Text Mining? (October, 2003)

• Automatic Content Extraction, Linguistic Data Consortium

• Automatic Content Extraction, NIST

• Research work and applications of Text Mining (for instance AgroNLP)

• Text Mining Free Tools

1.4 Precision and recall

In pattern recognition and information retrieval with binary classification, precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. Both precision and recall are therefore based on an understanding and measure of relevance.

Suppose a computer program for recognizing dogs in scenes from a video identifies 7 dogs in a scene containing 9 dogs and some cats. If 4 of the identifications are correct but 3 are actually cats, the program's precision is 4/7 while its recall is 4/9. When a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3 while its recall is 20/60 = 1/3. So, in this case, precision is "how useful the search results are", and recall is "how complete the results are".

In statistics, if the null hypothesis is that all and only the relevant items are retrieved, absence of type I and type II errors corresponds respectively to maximum precision (no false positives) and maximum recall (no false negatives). The above pattern recognition example contained 7 − 4 = 3 type I errors and 9 − 4 = 5 type II errors. Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity. In simple terms, high precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.
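The worked examples above can be checked directly; this short sketch just recomputes the fractions from the counts given in the text:

```python
from fractions import Fraction

# Dog-recognition example from the text: 7 identifications, 4 correct,
# and the scene contains 9 dogs.
tp = 4          # correct identifications
fp = 7 - 4      # the 3 cats mistaken for dogs (type I errors)
fn = 9 - 4      # the 5 dogs that were missed (type II errors)

precision = Fraction(tp, tp + fp)   # 4/7
recall = Fraction(tp, tp + fn)      # 4/9

# Search-engine example: 30 pages returned, 20 relevant, 40 relevant missed.
assert Fraction(20, 30) == Fraction(2, 3)        # precision
assert Fraction(20, 20 + 40) == Fraction(1, 3)   # recall
```

Using exact fractions rather than floats keeps the arithmetic identical to the ratios quoted in the prose.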

1.4.1 Introduction

In an information retrieval scenario, the instances are documents and the task is to return a set of relevant documents given a search term; or equivalently, to assign each document to one of two categories, "relevant" and "not relevant". In this case, the "relevant" documents are simply those that belong to the "relevant" category. Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents, while precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search.

In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class).

Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been).

In information retrieval, a perfect precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant documents were retrieved), whereas a perfect recall score of 1.0 means that all relevant documents were retrieved by the search (but says nothing about how many irrelevant documents were also retrieved). In a classification task, a precision score of 1.0 for a class C means that every item labeled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labeled correctly), whereas a recall of 1.0 means that every item from class C was labeled as belonging to class C (but says nothing about how many other items were incorrectly also labeled as belonging to class C).

Often, there is an inverse relationship between precision and recall, where it is possible to increase one at the cost of reducing the other. Brain surgery provides an obvious example of the tradeoff. Consider a brain surgeon tasked with removing a cancerous tumor from a patient's brain. The surgeon needs to remove all of the tumor cells, since any remaining cancer cells will regenerate the tumor. Conversely, the surgeon must not remove healthy brain cells, since that would leave the patient with impaired brain function. The surgeon may be more liberal in the area of the brain she removes to ensure she has extracted all the cancer cells. This decision increases recall but reduces precision. On the other hand, the surgeon may be more conservative in the area of the brain she removes to ensure she extracts only cancer cells. This decision increases precision but reduces recall.
That is to say, greater recall increases the chances of removing healthy cells (negative outcome) and increases the chances of removing all cancer cells (positive outcome). Greater precision decreases the chances of removing healthy cells (positive outcome) but also decreases the chances of removing all cancer cells (negative outcome).

Usually, precision and recall scores are not discussed in isolation. Instead, either values for one measure are compared for a fixed level of the other measure (e.g. precision at a recall level of 0.75) or both are combined into a single measure. Examples of measures that combine precision and recall are the F-measure (the weighted harmonic mean of precision and recall) and the Matthews correlation coefficient, which is a geometric mean of the chance-corrected variants: the regression coefficients Informedness (DeltaP') and Markedness (DeltaP).[1][2] Accuracy is a weighted arithmetic mean of Precision and Inverse Precision (weighted by Bias) as well as a weighted arithmetic mean of Recall and Inverse Recall (weighted by Prevalence).[1] Inverse Precision and Inverse Recall are simply the Precision and Recall of the inverse problem where positive and negative labels are exchanged (for both real classes and prediction labels). Recall and Inverse Recall, or equivalently true positive rate and false positive rate, are frequently plotted against each other as ROC curves and provide a principled mechanism to explore operating point tradeoffs.

Outside of Information Retrieval, the application of Recall, Precision and F-measure is argued to be flawed, as they ignore the true negative cell of the contingency table and are easily manipulated by biasing the predictions.[1] The first problem is 'solved' by using Accuracy and the second problem is 'solved' by discounting the chance component and renormalizing to Cohen's kappa, but this no longer affords the opportunity to explore tradeoffs graphically.
However, Informedness and Markedness are Kappa-like renormalizations of Recall and Precision,[3] and their geometric mean Matthews correlation coefficient thus acts like a debiased F-measure.

1.4.2 Definition (information retrieval context)

In information retrieval contexts, precision and recall are defined in terms of a set of retrieved documents (e.g. the list of documents produced by a web search engine for a query) and a set of relevant documents (e.g. the list of all documents on the internet that are relevant for a certain topic), cf. relevance. The measures were defined in Perry, Kent & Berry (1955).

Precision

In the field of information retrieval, precision is the fraction of retrieved documents that are relevant to the query:

precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|

Precision takes all retrieved documents into account, but it can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. This measure is called precision at n or P@n. For example, for a text search on a set of documents, precision is the number of correct results divided by the number of all returned results.

Precision is also used with recall, the percentage of all relevant documents that is returned by the search. The two measures are sometimes used together in the F1 score (or F-measure) to provide a single measurement for a system. Note that the meaning and usage of "precision" in the field of information retrieval differs from the definition of accuracy and precision within other branches of science and technology.
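The cut-off variant P@n described above can be sketched as a small helper; the ranking and relevance set here are hypothetical, invented just to exercise the definition:

```python
def precision_at_n(ranked_results, relevant, n):
    """P@n: fraction of the top-n ranked results that are relevant."""
    top = ranked_results[:n]
    return sum(1 for doc in top if doc in relevant) / n

# Hypothetical ranked result list and relevance judgments.
ranking = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d5"}

p_at_3 = precision_at_n(ranking, relevant, 3)   # top 3 = d7, d2, d9 → 1/3
p_at_5 = precision_at_n(ranking, relevant, 5)   # top 5 contains d2, d4 → 2/5
```

Because P@n looks only at the head of the ranking, it rewards systems that put relevant documents first, which plain set-based precision cannot distinguish.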

Recall

Recall in information retrieval is the fraction of the documents that are relevant to the query that are successfully retrieved.

recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|

For example, for text search on a set of documents, recall is the number of correct results divided by the number of results that should have been returned. In binary classification, recall is called sensitivity, so it can be read as the probability that a relevant document is retrieved by the query. It is trivial to achieve recall of 100% by returning all documents in response to any query. Therefore, recall alone is not enough; one also needs to measure the number of non-relevant documents, for example by computing the precision.
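The set-based definitions of precision and recall above translate directly into code; the document sets below are hypothetical, chosen only to illustrate the intersection:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall, following the definitions above."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: four documents returned, three documents are relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d3", "d5"}
p, r = precision_recall(retrieved, relevant)   # p = 2/4, r = 2/3
```

Note how returning every document drives recall to 1.0 while precision collapses, which is exactly the degenerate strategy the text warns about.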

1.4.3 Definition (classification context)

For classification tasks, the terms true positives, true negatives, false positives, and false negatives (see Type I and type II errors for definitions) compare the results of the classifier under test with trusted external judgments. The terms positive and negative refer to the classifier's prediction (sometimes known as the expectation), and the terms true and false refer to whether that prediction corresponds to the external judgment (sometimes known as the observation). Let us define an experiment from P positive instances and N negative instances for some condition. The four outcomes can be formulated in a 2×2 contingency table or confusion matrix. Precision and recall are then defined as:[6]

Precision = tp / (tp + fp)

Recall = tp / (tp + fn)

Recall in this context is also referred to as the true positive rate or sensitivity, and precision is also referred to as positive predictive value (PPV); other related measures used in classification include true negative rate and accuracy.[6] True negative rate is also called specificity.

True negative rate = tn / (tn + fp)

Accuracy = (tp + tn) / (tp + tn + fp + fn)
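The four confusion-matrix measures above can be bundled into one helper; the counts passed in at the end are chosen for illustration only:

```python
def classification_metrics(tp, fp, fn, tn):
    """Measures defined from the 2x2 confusion matrix."""
    return {
        "precision": tp / (tp + fp),            # positive predictive value
        "recall": tp / (tp + fn),               # true positive rate / sensitivity
        "true_negative_rate": tn / (tn + fp),   # specificity
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Counts chosen for illustration only.
m = classification_metrics(tp=4, fp=3, fn=5, tn=8)
# m["precision"] = 4/7, m["recall"] = 4/9
```

Note that accuracy is the only one of the four that uses all four cells of the matrix, which is why it behaves differently from precision and recall under class imbalance.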

1.4.4 Probabilistic interpretation

It is possible to interpret precision and recall not as ratios but as probabilities:

• Precision is the probability that a (randomly selected) retrieved document is relevant.

• Recall is the probability that a (randomly selected) relevant document is retrieved in a search.

Note that the random selection refers to a uniform distribution over the appropriate pool of documents; i.e. by randomly selected retrieved document, we mean selecting a document from the set of retrieved documents in a random fashion. The random selection should be such that all documents in the set are equally likely to be selected. Note that, in a typical classification system, the probability that a retrieved document is relevant depends on the document; the above interpretation extends to that scenario as well.

Another interpretation of precision and recall is as follows: precision is the average probability of relevant retrieval, and recall is the average probability of complete retrieval, where the average is taken over multiple retrieval queries.

1.4.5 F-measure

Main article: F1 score

A measure that combines precision and recall is the harmonic mean of precision and recall, the traditional F-measure or balanced F-score:

F = 2 · (precision · recall) / (precision + recall)

This measure is approximately the average of the two when they are close, and is more generally the harmonic mean, which, for the case of two numbers, coincides with the square of the geometric mean divided by the arithmetic mean. There are several reasons that the F-score can be criticized in particular circumstances due to its bias as an evaluation metric.[1] This is also known as the F1 measure, because recall and precision are evenly weighted.

It is a special case of the general Fβ measure (for non-negative real values of β):

Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

Two other commonly used F measures are the F2 measure, which weights recall higher than precision, and the F0.5 measure, which puts more emphasis on precision than recall.
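A small sketch makes the effect of β concrete (the helper name f_beta and the scores are our own, not from the text): with low precision and perfect recall, F2 rewards the high recall while F0.5 penalizes the low precision.

```python
def f_beta(precision, recall, beta):
    """General F_beta: weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.5, 1.0  # hypothetical scores: low precision, perfect recall
print(f_beta(p, r, 1.0))  # ≈ 0.667 (balanced F1)
print(f_beta(p, r, 2.0))  # ≈ 0.833 (recall weighted higher)
print(f_beta(p, r, 0.5))  # ≈ 0.556 (precision weighted higher)
```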

The F-measure was derived by van Rijsbergen (1979) so that Fβ “measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision”. It is based on van Rijsbergen’s effectiveness measure

Eα = 1 − 1 / (α/P + (1 − α)/R),

the second term being the weighted harmonic mean of precision and recall with weights (α, 1 − α). Their relationship is Fβ = 1 − Eα where α = 1/(1 + β²).
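The stated relationship between Fβ and the effectiveness measure can be verified numerically; the function names and sample values below are our own:

```python
def e_measure(p, r, alpha):
    # van Rijsbergen's effectiveness measure E_alpha.
    return 1 - 1 / (alpha / p + (1 - alpha) / r)

def f_beta(p, r, beta):
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = 0.7, 0.4  # arbitrary example values
for beta in (0.5, 1.0, 2.0):
    alpha = 1 / (1 + beta ** 2)
    # F_beta = 1 - E_alpha when alpha = 1 / (1 + beta^2)
    assert abs(f_beta(p, r, beta) - (1 - e_measure(p, r, alpha))) < 1e-12
print("identity holds")
```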

1.4.6 Limitations as goals

There are other parameters and strategies for measuring the performance of an information retrieval system, such as the area under the precision-recall curve (AUC).[7] For web document retrieval, if the user’s objectives are not clear, precision and recall cannot be optimized. As summarized by Lopresti,[8]

• Browsing is a comfortable and powerful paradigm (the serendipity effect).
• Search results don't have to be very good.
• Recall? Not important (as long as you get at least some good hits).
• Precision? Not important (as long as at least some of the hits on the first page you return are good).

1.4.7 See also

• Uncertainty coefficient, also called proficiency
• Sensitivity and specificity

1.4.8 References

[1] Powers, David M W (2011). “Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation” (PDF). Journal of Machine Learning Technologies. 2 (1): 37–63.

[2] Perruchet, P.; Peereman, R. (2004). “The exploitation of distributional information in syllable processing”. Journal of Neurolinguistics. 17: 97–119. doi:10.1016/s0911-6044(03)00059-9.

[3] Powers, David M. W. (2012). “The Problem with Kappa”. Conference of the European Chapter of the Association for Computational Linguistics (EACL2012) Joint ROBUS-UNSUP Workshop.

[4] Fawcett, Tom (2006). “An Introduction to ROC Analysis” (PDF). Pattern Recognition Letters. 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010.

[5] Ting, Kai Ming (2011). Encyclopedia of machine learning. Springer. ISBN 978-0-387-30164-8.

[6] Olson, David L.; Delen, Dursun (2008). Advanced Data Mining Techniques. Springer, 1st edition (February 1, 2008), page 138. ISBN 3-540-76916-1.

[7] Zygmunt Zając. What you wanted to know about AUC. http://fastml.com/what-you-wanted-to-know-about-auc/

[8] Lopresti, Daniel (2001); WDA 2001 panel

• Baeza-Yates, Ricardo; Ribeiro-Neto, Berthier (1999). Modern Information Retrieval. New York, NY: ACM Press, Addison-Wesley, pp. 75 ff. ISBN 0-201-39829-X
• Hjørland, Birger (2010). “The foundation of the concept of relevance”. Journal of the American Society for Information Science and Technology, 61(2), 217–237
• Makhoul, John; Kubala, Francis; Schwartz, Richard; Weischedel, Ralph (1999). “Performance measures for information extraction”. In Proceedings of DARPA Broadcast News Workshop, Herndon, VA, February 1999
• “Machine literature searching X. Machine language; factors underlying its design and development” (1955). doi:10.1002/asi.5090060411
• van Rijsbergen, Cornelis Joost “Keith” (1979). Information Retrieval. London, GB; Boston, MA: Butterworth, 2nd Edition. ISBN 0-408-70929-4

1.4.9 External links

• Information Retrieval – C. J. van Rijsbergen 1979

• Computing Precision and Recall for a Multi-class Classification Problem

[Figure: Precision and recall. A diagram of selected elements versus relevant elements: true positives are selected and relevant, false positives are selected but not relevant, false negatives are relevant but not selected, and true negatives are neither. Precision asks “how many selected items are relevant?”; recall asks “how many relevant items are selected?”]

1.5 F1 score

“F score” redirects here. For the significance test, see F-test.

In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test’s accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results, and r is the number of correct positive results divided by the number of positive results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst at 0.

The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall; multiplying by the constant 2 scales the score to 1 when both recall and precision are 1:

F1 = 2 / (1/recall + 1/precision) = 2 · (precision · recall) / (precision + recall)

The general formula for positive real β is:

Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

The formula in terms of Type I and type II errors:

Fβ = (1 + β²) · true positive / ((1 + β²) · true positive + β² · false negative + false positive)
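Under the stated definitions of precision and recall, the count-based formula agrees with the rate-based one; a quick check with hypothetical confusion-matrix counts:

```python
tp, fp, fn = 30, 10, 20  # hypothetical counts

precision = tp / (tp + fp)  # 0.75
recall = tp / (tp + fn)     # 0.6

beta = 2.0
b2 = beta ** 2
# F_beta from precision and recall...
from_rates = (1 + b2) * precision * recall / (b2 * precision + recall)
# ...and directly from the counts; both should coincide.
from_counts = (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)
print(from_rates, from_counts)  # both 0.625
```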

Two other commonly used F measures are the F2 measure, which weighs recall higher than precision (by placing more emphasis on false negatives), and the F0.5 measure, which weighs recall lower than precision (by attenuating the influence of false negatives).

The F-measure was derived so that Fβ “measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision”.[1] It is based on Van Rijsbergen's effectiveness measure

E = 1 − 1 / (α/P + (1 − α)/R)

Their relationship is Fβ = 1 − E where α = 1/(1 + β²).

1.5.1 Diagnostic Testing

This is related to the field of binary classification, where recall is often termed sensitivity. There are several reasons that the F1 score can be criticized in particular circumstances.[2]

1.5.2 Applications

The F-score is often used in the field of information retrieval for measuring search, document classification, and query classification performance.[3] Earlier works focused primarily on the F1 score, but with the proliferation of large-scale search engines, performance goals changed to place more emphasis on either precision or recall,[4] and so Fβ is seen in wide application. The F-score is also used in machine learning.[5] Note, however, that the F-measures do not take the true negatives into account, and that measures such as the Phi coefficient, Matthews correlation coefficient, informedness, or Cohen’s kappa may be preferable for assessing the performance of a binary classifier.[2] The F-score has been widely used in the natural language processing literature, for example in the evaluation of named entity recognition and word segmentation.

1.5.3 G-measure

While the F-measure is the harmonic mean of Recall and Precision, the G-measure is the geometric mean.[2]

G = √(precision · recall) = √(PPV · TPR)[6]

This is also known as the Fowlkes–Mallows index.
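A short comparison (hypothetical scores, helper names our own) shows how the two means differ when precision and recall are unbalanced:

```python
import math

def f_measure(p, r):
    return 2 * p * r / (p + r)  # harmonic mean

def g_measure(p, r):
    return math.sqrt(p * r)     # geometric mean (Fowlkes–Mallows index)

p, r = 0.9, 0.1  # hypothetical, deliberately unbalanced scores
print(f_measure(p, r))  # ≈ 0.18
print(g_measure(p, r))  # ≈ 0.30
# The harmonic mean never exceeds the geometric mean, so the F-measure
# punishes an imbalance between precision and recall more severely.
```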

1.5.4 See also

• Precision and recall
• BLEU
• NIST (metric)
• METEOR
• ROUGE (metric)
• Word Error Rate (WER)
• Receiver operating characteristic
• Matthews correlation coefficient
• Uncertainty coefficient, also called proficiency

1.5.5 References

[1] Van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). Butterworth.

[2] Powers, David M W (2011). “Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation” (PDF). Journal of Machine Learning Technologies. 2 (1): 37–63.

[3] Beitzel, Steven M. (2006). On Understanding and Classifying Web Queries (Ph.D. thesis). IIT. CiteSeerX 10.1.1.127.634.

[4] X. Li; Y.-Y. Wang; A. Acero (July 2008). Learning query intent from regularized click graphs. Proceedings of the 31st SIGIR Conference.

[5] See, e.g., the evaluation of the CoNLL 2002 shared task.

[6] Li, Guo-Zheng; et al. (2012). “Inquiry diagnosis of coronary heart disease in Chinese medicine based on symptom-syndrome interactions”. Chinese Medicine. 7 (1): 1.

Chapter 2

Text and image sources, contributors, and licenses

2.1 Text

• Natural language processing Source: https://en.wikipedia.org/wiki/Natural_language_processing?oldid=758389294 Contributors: see article history
• Computational linguistics Source: https://en.wikipedia.org/wiki/Computational_linguistics?oldid=760148102 Contributors: see article history
• Text mining Source: https://en.wikipedia.org/wiki/Text_mining?oldid=759468728 Contributors: see article history
• Precision and recall Source: https://en.wikipedia.org/wiki/Precision_and_recall?oldid=755633573 Contributors: see article history
• F1 score Source: https://en.wikipedia.org/wiki/F1_score?oldid=747315814 Contributors: see article history

2.2 Images

• The image files listed in the original export are hosted on Wikimedia Commons; see each file’s description page for its source, license, and artist.

2.3 Content license

• Creative Commons Attribution-Share Alike 3.0