Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2020. All rights reserved. Draft of December 30, 2020.

CHAPTER 8  Sequence Labeling for Parts of Speech and Named Entities

To each word a warbling note
    A Midsummer Night's Dream, V.I

Dionysius Thrax of Alexandria (c. 100 B.C.), or perhaps someone else (it was a long time ago), wrote a grammatical sketch of Greek (a "technē") that summarized the linguistic knowledge of his day. This work is the source of an astonishing proportion of modern linguistic vocabulary, including the words syntax, diphthong, clitic, and analogy. Also included is a description of eight parts of speech: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article. Although earlier scholars (including Aristotle as well as the Stoics) had their own lists of parts of speech, it was Thrax's set of eight that became the basis for descriptions of European languages for the next 2000 years. (All the way to the Schoolhouse Rock educational television shows of our childhood, which had songs about 8 parts of speech, like the late great Bob Dorough's Conjunction Junction.) The durability of parts of speech through two millennia speaks to their centrality in models of human language.

Proper names are another important and anciently studied linguistic category. While parts of speech are generally assigned to individual words or morphemes, a proper name is often an entire multiword phrase, like the name "Marie Curie", the location "New York City", or the organization "Stanford University". We'll use the term named entity for, roughly speaking, anything that can be referred to with a proper name: a person, a location, an organization, although as we'll see the term is commonly extended to include things that aren't entities per se.

Parts of speech (also known as POS) and named entities are useful clues to sentence structure and meaning. Knowing whether a word is a noun or a verb tells us about likely neighboring words (nouns in English are preceded by determiners and adjectives, verbs by nouns) and syntactic structure (verbs have dependency links to nouns), making part-of-speech tagging a key aspect of parsing. Knowing if a named entity like Washington is a name of a person, a place, or a university is important to many natural language understanding tasks like question answering, stance detection, or information extraction.

In this chapter we'll introduce the task of part-of-speech tagging, taking a sequence of words and assigning each word a tag like NOUN or VERB, and the task of named entity recognition (NER), assigning words or phrases tags like PERSON, LOCATION, or ORGANIZATION. Such tasks, in which we assign to each word xi in an input word sequence a label yi, so that the output sequence Y has the same length as the input sequence X, are called sequence labeling tasks. We'll introduce classic sequence labeling algorithms, one generative, the Hidden Markov Model (HMM), and one discriminative, the Conditional Random Field (CRF). In following chapters we'll introduce modern sequence labelers based on RNNs and Transformers.

8.1 (Mostly) English Word Classes

Until now we have been using part-of-speech terms like noun and verb rather freely. In this section we give more complete definitions. While word classes do have semantic tendencies (adjectives, for example, often describe properties, and nouns often denote people or things), parts of speech are defined instead based on their grammatical relationship with neighboring words or the morphological properties of their affixes.

Open class:
ADJ    Adjective: noun modifiers describing properties                  red, young, awesome
ADV    Adverb: verb modifiers of time, place, manner                    very, slowly, home, yesterday
NOUN   Noun: words for persons, places, things, etc.                    algorithm, cat, mango, beauty
VERB   Verb: words for actions and processes                            draw, provide, go
PROPN  Proper noun: name of a person, organization, place, etc.         Regina, IBM, Colorado
INTJ   Interjection: exclamation, greeting, yes/no response, etc.       oh, um, yes, hello

Closed class:
ADP    Adposition (Preposition/Postposition): marks a noun's spatial,   in, on, by, under
       temporal, or other relation
AUX    Auxiliary: helping verb marking tense, aspect, mood, etc.        can, may, should, are
CCONJ  Coordinating conjunction: joins two phrases/clauses              and, or, but
DET    Determiner: marks noun phrase properties                         a, an, the, this
NUM    Numeral                                                          one, two, first, second
PART   Particle: a preposition-like form used together with a verb      up, down, on, off, in, out, at, by
PRON   Pronoun: a shorthand for referring to an entity or event         she, who, I, others
SCONJ  Subordinating conjunction: joins a main clause with a            that, which
       subordinate clause such as a sentential complement

Other:
PUNCT  Punctuation                                                      , . ( )
SYM    Symbols like $ or emoji                                          $, %
X      Other                                                            asdf, qwfg

Figure 8.1 The 17 parts of speech in the Universal Dependencies tagset (Nivre et al., 2016a). Features can be added to make finer-grained distinctions (with properties like number, case, definiteness, and so on).

Parts of speech fall into two broad categories: closed class and open class. Closed classes are those with relatively fixed membership, such as prepositions: new prepositions are rarely coined. By contrast, nouns and verbs are open classes: new nouns and verbs like iPhone or to fax are continually being created or borrowed. Closed class words are generally function words like of, it, and, or you, which tend to be very short, occur frequently, and often have structuring uses in grammar.

Four major open classes occur in the languages of the world: nouns (including proper nouns), verbs, adjectives, and adverbs, as well as the smaller open class of interjections. English has all five, although not every language does.

Nouns are words for people, places, or things, but include others as well. Common nouns include concrete terms like cat and mango, abstractions like algorithm and beauty, and verb-like terms like pacing as in His pacing to and fro became quite annoying. Nouns in English can occur with determiners (a goat, its bandwidth), take possessives (IBM's annual revenue), and may occur in the plural (goats, abaci).

Many languages, including English, divide common nouns into count nouns and mass nouns. Count nouns can occur in the singular and plural (goat/goats, relationship/relationships) and can be counted (one goat, two goats). Mass nouns are used when something is conceptualized as a homogeneous group. So snow, salt, and communism are not counted (i.e., *two snows or *two communisms). Proper nouns, like Regina, Colorado, and IBM, are names of specific persons or entities.

Verbs refer to actions and processes, including main verbs like draw, provide, and go. English verbs have inflections (non-third-person-singular (eat), third-person-singular (eats), progressive (eating), past participle (eaten)). While many scholars believe that all human languages have the categories of noun and verb, others have argued that some languages, such as Riau Indonesian and Tongan, don't even make this distinction (Broschart 1997; Evans 2000; Gil 2000).

Adjectives often describe properties or qualities of nouns, like color (white, black), age (old, young), and value (good, bad), but there are languages without adjectives. In Korean, for example, the words corresponding to English adjectives act as a subclass of verbs, so what is in English an adjective "beautiful" acts in Korean like a verb meaning "to be beautiful".

Adverbs are a hodge-podge. All the italicized words in this example are adverbs: Actually, I ran home extremely quickly yesterday. Adverbs generally modify something (often verbs, hence the name "adverb", but also other adverbs and entire verb phrases). Directional adverbs or locative adverbs (home, here, downhill) specify the direction or location of some action; degree adverbs (extremely, very, somewhat) specify the extent of some action, process, or property; manner adverbs (slowly, slinkily, delicately) describe the manner of some action or process; and temporal adverbs describe the time that some action or event took place (yesterday, Monday).

Interjections (oh, hey, alas, uh, um) are a smaller open class that also includes greetings (hello, goodbye) and question responses (yes, no, uh-huh).

English adpositions occur before nouns, hence are called prepositions. They can indicate spatial or temporal relations, whether literal (on it, before then, by the house) or metaphorical (on time, with gusto, beside herself), and relations like marking the agent in Hamlet was written by Shakespeare.

A particle resembles a preposition or an adverb and is used in combination with a verb. Particles often have extended meanings that aren't quite the same as the prepositions they resemble, as in the particle over in she turned the paper over. A verb and a particle acting as a single unit is called a phrasal verb. The meaning of phrasal verbs is often non-compositional, that is, not predictable from the individual meanings of the verb and the particle. Thus, turn down means 'reject', rule out 'eliminate', and go on 'continue'.

Determiners like this and that (this chapter, that page) can mark the start of an English noun phrase. Articles like a, an, and the are a type of determiner that mark discourse properties of the noun and are quite frequent; the is the most common word in written English, with a and an right behind.

Conjunctions join two phrases, clauses, or sentences. Coordinating conjunctions like and, or, and but join two elements of equal status. Subordinating conjunctions are used when one of the elements has some embedded status. For example, the subordinating conjunction that in "I thought that you might like some milk" links the main clause I thought with the subordinate clause you might like some milk. This clause is called subordinate because this entire clause is the "content" of the main verb thought. Subordinating conjunctions like that which link a verb to its argument in this way are also called complementizers.
Pronouns act as a shorthand for referring to an entity or event. Personal pronouns refer to persons or entities (you, she, I, it, me, etc.). Possessive pronouns are forms of personal pronouns that indicate either actual possession or more often just an abstract relation between the person and some object (my, your, his, her, its, one's, our, their). Wh-pronouns (what, who, whom, whoever) are used in certain

question forms, or act as complementizers (Frida, who married Diego. . . ).

Auxiliary verbs mark semantic features of a main verb such as its tense, whether it is completed (aspect), whether it is negated (polarity), and whether an action is necessary, possible, suggested, or desired (mood). English auxiliaries include the copula verb be, the two verbs do and have, along with their inflected forms, as well as modal verbs used to mark the mood associated with the event depicted by the main verb: can indicates ability or possibility, may permission or possibility, must necessity.

An English-specific tagset, the 45-tag Penn Treebank tagset (Marcus et al., 1993), shown in Fig. 8.2, has been used to label many syntactically annotated corpora like the Penn Treebank corpora, so is worth knowing about.

Tag   Description                          Examples
CC    coordinating conjunction             and, but, or
CD    cardinal number                      one, two
DT    determiner                           a, the
EX    existential 'there'                  there
FW    foreign word                         mea culpa
IN    preposition/subordinating conj.      of, in, by
JJ    adjective                            yellow
JJR   comparative adjective                bigger
JJS   superlative adjective                wildest
LS    list item marker                     1, 2, One
MD    modal                                can, should
NN    singular or mass noun                llama
NNP   proper noun, singular                IBM
NNPS  proper noun, plural                  Carolinas
NNS   noun, plural                         llamas
PDT   predeterminer                        all, both
POS   possessive ending                    's
PRP   personal pronoun                     I, you, he
PRP$  possessive pronoun                   your, one's
RB    adverb                               quickly
RBR   comparative adverb                   faster
RBS   superlative adverb                   fastest
RP    particle                             up, off
SYM   symbol                               +, %, &
TO    "to"                                 to
UH    interjection                         ah, oops
VB    verb base form                       eat
VBD   verb past tense                      ate
VBG   verb gerund                          eating
VBN   verb past participle                 eaten
VBP   verb non-3sg present                 eat
VBZ   verb 3sg present                     eats
WDT   wh-determiner                        which, that
WP    wh-pronoun                           what, who
WP$   wh-possessive                        whose
WRB   wh-adverb                            how, where

Figure 8.2 Penn Treebank part-of-speech tags.

Below we show some examples with each word tagged according to both the UD and Penn tagsets. Notice that the Penn tagset distinguishes tense and participles on verbs, and has a special tag for the existential there construction in English. Note that since New England Journal of Medicine is a proper noun, both tagsets mark its component nouns as NNP, including journal and medicine, which might otherwise be labeled as common nouns (NOUN/NN).

(8.1) There/PRO/EX are/VERB/VBP 70/NUM/CD children/NOUN/NNS there/ADV/RB ./PUNC/.
(8.2) Preliminary/ADJ/JJ findings/NOUN/NNS were/AUX/VBD reported/VERB/VBN in/ADP/IN today/NOUN/NN 's/PART/POS New/PROPN/NNP England/PROPN/NNP Journal/PROPN/NNP of/ADP/IN Medicine/PROPN/NNP
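Tagged output in the Penn style can be produced with off-the-shelf tools. The following is a minimal sketch (an illustration, not from the text) using NLTK's default English tagger, which emits Penn Treebank tags; it assumes NLTK is installed and that the punkt and averaged_perceptron_tagger resources have been downloaded.

```python
# Minimal sketch: tag a sentence with Penn Treebank tags using NLTK's default
# English tagger.  Assumes: pip install nltk, plus nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger") have been run once.
import nltk

tokens = nltk.word_tokenize("There are 70 children there.")
print(nltk.pos_tag(tokens))
# Typical output (may vary slightly by model version):
# [('There', 'EX'), ('are', 'VBP'), ('70', 'CD'), ('children', 'NNS'),
#  ('there', 'RB'), ('.', '.')]
```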

8.2 Part-of-Speech Tagging

Part-of-speech tagging is the process of assigning a part-of-speech to each word in a text. The input is a sequence x1, x2, ..., xn of (tokenized) words and a tagset, and the output is a sequence y1, y2, ..., yn of tags, each output yi corresponding exactly to one input xi, as shown in the intuition in Fig. 8.3.

Tagging is a disambiguation task; words are ambiguous (they have more than one possible part of speech) and the goal is to find the correct tag for the situation. For example, book can be a verb (book that flight) or a noun (hand me that book). That can be a determiner (Does that flight serve dinner) or a complementizer (I thought that your flight was earlier). The goal of POS-tagging is to resolve these ambiguities, choosing the proper tag for the context.

[Figure 8.3 here: the input words x1, ..., x5 = Janet, will, back, the, bill are mapped by the part-of-speech tagger to the output tags y1, ..., y5 = NOUN, AUX, VERB, DET, NOUN.]

Figure 8.3 The task of part-of-speech tagging: mapping from input words x1, x2, ..., xn to output POS tags y1, y2, ..., yn.

The accuracy of part-of-speech tagging algorithms (the percentage of test set tags that match human gold labels) is extremely high. One study found accuracies over 97% across 15 languages from the Universal Dependency (UD) treebank (Wu and Dredze, 2019). Accuracies on various English treebanks are also 97% (no matter the algorithm; HMMs, CRFs, and BERT perform similarly). This 97% number is also about the human performance on this task, at least for English (Manning, 2011).

Types:                     WSJ              Brown
  Unambiguous (1 tag)      44,432 (86%)     45,799 (85%)
  Ambiguous (2+ tags)       7,025 (14%)      8,050 (15%)
Tokens:
  Unambiguous (1 tag)     577,421 (45%)    384,349 (33%)
  Ambiguous (2+ tags)     711,780 (55%)    786,646 (67%)

Figure 8.4 Tag ambiguity in the Brown and WSJ corpora (Treebank-3 45-tag tagset).

We'll introduce algorithms for the task in the next few sections, but first let's explore the task. Exactly how hard is it? Fig. 8.4 shows that most word types (85-86%) are unambiguous (Janet is always NNP, hesitantly is always RB). But the ambiguous words, though accounting for only 14-15% of the vocabulary, are very common, and 55-67% of word tokens in running text are ambiguous. Particularly ambiguous common words include that, back, down, put and set; here are some examples of the 6 different parts of speech for the word back:

earnings growth took a back/JJ seat
a small building in the back/NN
a clear majority of senators back/VBP the bill
Dave began to back/VB toward the door
enable the country to buy back/RP debt
I was twenty-one back/RB then

Nonetheless, many words are easy to disambiguate, because their different tags aren't equally likely. For example, a can be a determiner or the letter a, but the determiner sense is much more likely. This idea suggests a useful baseline: given an ambiguous word, choose the tag which is most frequent in the training corpus. This is a key concept:

Most Frequent Class Baseline: Always compare a classifier against a baseline at least as good as the most frequent class baseline (assigning each token to the class it occurred in most often in the training set).

The most-frequent-tag baseline has an accuracy of about 92%.¹ The baseline thus differs from the state-of-the-art and human ceiling (97%) by only 5%.
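As a concrete illustration, here is a minimal sketch of the most frequent class baseline (function and variable names are ours, not from the text); it assumes the training data is available as (word, tag) pairs.

```python
# Most frequent class baseline: tag each known word with the tag it received
# most often in training; unknown words fall back to a default tag.
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus, default_tag="NN"):
    """tagged_corpus: iterable of (word, tag) pairs from a hand-labeled training set."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # For each word type, remember the single tag it occurred with most often.
    best_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(words):
        return [(w, best_tag.get(w, default_tag)) for w in words]
    return tag

# Usage sketch:
#   tagger = train_most_frequent_tag(training_pairs)
#   tagger(["Janet", "will", "back", "the", "bill"])
```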

8.3 Named Entities and Named Entity Tagging

Part of speech tagging can tell us that words like Janet, Stanford University, and Colorado are all proper nouns; being a proper noun is a grammatical property of these words. But viewed from a semantic perspective, these proper nouns refer to different kinds of entities: Janet is a person, Stanford University is an organization, and Colorado is a location.

A named entity is, roughly speaking, anything that can be referred to with a proper name: a person, a location, an organization. The task of named entity recognition (NER) is to find spans of text that constitute proper names and tag the type of the entity. Four entity tags are most common: PER (person), LOC (location), ORG (organization), or GPE (geo-political entity). However, the term named entity is commonly extended to include things that aren't entities per se, including dates, times, and other kinds of temporal expressions, and even numerical expressions like prices. Here's an example of the output of an NER tagger:

Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

The text contains 13 mentions of named entities including 5 organizations, 4 locations, 2 times, 1 person, and 1 mention of money. Figure 8.5 shows typical generic named entity types. Many applications will also need to use specific entity types like proteins, genes, commercial products, or works of art.

Type                  Tag  Sample Categories         Example sentences
People                PER  people, characters        Turing is a giant of computer science.
Organization          ORG  companies, sports teams   The IPCC warned about the cyclone.
Location              LOC  regions, mountains, seas  Mt. Sanitas is in Sunshine Canyon.
Geo-Political Entity  GPE  countries, states         Palo Alto is raising the fees for parking.

Figure 8.5 A list of generic named entity types with the kinds of entities they refer to.
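Output like the tagged example above can be produced with an off-the-shelf NER system. The following is a minimal sketch using spaCy's pretrained English pipeline (an assumption for illustration; it is not the system that produced the example above), and note that spaCy's label set (PERSON, ORG, GPE, DATE, MONEY, and so on) differs slightly from the tags in Fig. 8.5.

```python
# Minimal sketch: run a pretrained NER pipeline over one sentence.
# Assumes: pip install spacy, then python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("American Airlines, a unit of AMR Corp., immediately matched "
          "the move, spokesman Tim Wagner said.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output (may vary by model version):
#   American Airlines ORG
#   AMR Corp. ORG
#   Tim Wagner PERSON
```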

Named entity tagging is a useful first step in lots of natural language understanding tasks. In sentiment analysis we might want to know a consumer's sentiment toward a particular entity. Entities are a useful first stage in question answering, or for linking text to information in structured knowledge sources like Wikipedia. And named entity tagging is also central to natural language understanding tasks of building semantic representations, like extracting events and the relationship between participants.

¹ In English, on the WSJ corpus, tested on sections 22-24.

Unlike part-of-speech tagging, where there is no segmentation problem since each word gets one tag, the task of named entity recognition is to find and label spans of text, and is difficult partly because of the ambiguity of segmentation; we need to decide what's an entity and what isn't, and where the boundaries are. Indeed, most words in a text will not be named entities. Another difficulty is caused by type ambiguity. The mention JFK can refer to a person, the airport in New York, or any number of schools, bridges, and streets around the United States. Some examples of this kind of cross-type confusion are given in Figure 8.6.

[PER Washington] was born into slavery on the farm of James Burroughs.
[ORG Washington] went up 2 games to 1 in the four-game series.
Blair arrived in [LOC Washington] for what may well be his last state visit.
In June, [GPE Washington] passed a primary seatbelt law.

Figure 8.6 Examples of type ambiguities in the use of the name Washington.

The standard approach to sequence labeling for a span-recognition problem like NER is BIO tagging (Ramshaw and Marcus, 1995). This is a method that allows us to treat NER like a word-by-word sequence labeling task, via tags that capture both the boundary and the named entity type. Consider the following sentence:

[PER Jane Villanueva] of [ORG United], a unit of [ORG United Airlines Holding], said the fare applies to the [LOC Chicago] route.

Figure 8.7 shows the same excerpt represented with BIO tagging, as well as variants called IO tagging and BIOES tagging. In BIO tagging we label any token that begins a span of interest with the label B, tokens that occur inside a span are tagged with an I, and any tokens outside of any span of interest are labeled O. While there is only one O tag, we'll have distinct B and I tags for each named entity class. The number of tags is thus 2n + 1, where n is the number of entity types. BIO tagging can represent exactly the same information as the bracketed notation, but has the advantage that we can represent the task in the same simple sequence modeling way as part-of-speech tagging: assigning a single label yi to each input word xi:

Words       IO Label   BIO Label   BIOES Label
Jane        I-PER      B-PER       B-PER
Villanueva  I-PER      I-PER       E-PER
of          O          O           O
United      I-ORG      B-ORG       B-ORG
Airlines    I-ORG      I-ORG       I-ORG
Holding     I-ORG      I-ORG       E-ORG
discussed   O          O           O
the         O          O           O
Chicago     I-LOC      B-LOC       S-LOC
route       O          O           O
.           O          O           O

Figure 8.7 NER as a sequence model, showing IO, BIO, and BIOES taggings.

We've also shown two variant tagging schemes: IO tagging, which loses some information by eliminating the B tag, and BIOES tagging, which adds an end tag E for the end of a span, and a span tag S for a span consisting of only one word. A sequence labeler (HMM, CRF, RNN, Transformer, etc.) is trained to label each token in a text with tags that indicate the presence (or absence) of particular kinds of named entities.
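The mapping from bracketed entity spans to BIO tags is mechanical. Here is a minimal sketch (the helper name and span representation are ours, assuming entities are given as (start, end, type) token offsets with the end exclusive) that reproduces the BIO column of Fig. 8.7.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    tokens: list of tokens.
    spans:  list of (start, end, entity_type) with token offsets, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"              # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"              # tokens inside the span
    return tags

tokens = ["Jane", "Villanueva", "of", "United", "Airlines", "Holding",
          "discussed", "the", "Chicago", "route", "."]
spans = [(0, 2, "PER"), (3, 6, "ORG"), (8, 9, "LOC")]
print(spans_to_bio(tokens, spans))
# ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'B-LOC', 'O', 'O']
```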

The requisite changes from HMM Viterbi have to do only with how we fill each cell. Recall from Eq. 8.19 that the recursive step of the Viterbi equation computes the Viterbi value of time t for state j as

v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t); \quad 1 \le j \le N,\ 1 < t \le T    (8.31)

which is the HMM implementation of

v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, P(s_j \mid s_i)\, P(o_t \mid s_j) \quad 1 \le j \le N,\ 1 < t \le T    (8.32)

The CRF requires only a slight change to this latter formula, replacing the a and b prior and likelihood probabilities with the CRF features:

v_t(j) = \max_{i=1}^{N} v_{t-1}(i) + \sum_{k=1}^{K} w_k f_k(y_{t-1}, y_t, X, t) \quad 1 \le j \le N,\ 1 < t \le T    (8.33)

Learning in CRFs relies on the same supervised learning algorithms we presented for logistic regression. Given a sequence of observations, feature functions, and corresponding outputs, we use stochastic gradient descent to train the weights to maximize the log-likelihood of the training corpus. The local nature of linear-chain CRFs means that a CRF version of the forward-backward algorithm (see Appendix A) can be used to efficiently compute the necessary derivatives. As with logistic regression, L1 or L2 regularization is important.
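To make Eq. 8.33 concrete, here is a minimal Viterbi decoding sketch for a linear-chain model. The tag inventory and the score function (assumed to return the summed weighted features for a given previous tag, current tag, and position) are placeholders supplied by the caller, not part of any particular CRF library; scores add, as in log space.

```python
def viterbi_decode(tags, n_steps, score):
    """Viterbi decoding for a linear-chain model (schematic version of Eq. 8.33).

    tags:    list of possible output tags (states).
    n_steps: length of the observation sequence.
    score(prev_tag, tag, t): summed weighted features sum_k w_k f_k(...) for this
        transition at position t (prev_tag is None at t = 0); a placeholder.
    """
    v = [{t: score(None, t, 0) for t in tags}]    # Viterbi values at each step
    back = [{}]                                   # backpointers
    for step in range(1, n_steps):
        v.append({})
        back.append({})
        for tag in tags:
            # max over previous tags of v_{t-1}(i) + score(i, j, t)
            prev = max(tags, key=lambda p: v[step - 1][p] + score(p, tag, step))
            v[step][tag] = v[step - 1][prev] + score(prev, tag, step)
            back[step][tag] = prev
    # Follow backpointers from the best final tag.
    best = max(tags, key=lambda t: v[-1][t])
    path = [best]
    for step in range(n_steps - 1, 0, -1):
        path.append(back[step][path[-1]])
    return list(reversed(path))
```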

8.6 Evaluation of Named Entity Recognition

Part-of-speech taggers are evaluated by the standard metric of accuracy. Named entity recognizers are evaluated by recall, precision, and F1 measure. Recall that recall is the ratio of the number of correctly labeled responses to the total that should have been labeled; precision is the ratio of the number of correctly labeled responses to the total labeled; and F-measure is the harmonic mean of the two. To know if the difference between the F1 scores of two NER systems is a significant difference, we use the paired bootstrap test, or the similar randomization test (Section ??).

For named entities, the entity rather than the word is the unit of response. Thus in the example in Fig. 8.16, the two entities Jane Villanueva and United Airlines Holding and the non-entity discussed would each count as a single response.

The fact that named entity tagging has a segmentation component which is not present in tasks like text categorization or part-of-speech tagging causes some problems with evaluation. For example, a system that labeled Jane but not Jane Villanueva as a person would cause two errors, a false positive for O and a false negative for I-PER. In addition, using entities as the unit of response but words as the unit of training means that there is a mismatch between the training and test conditions.
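Entity-level scoring can be implemented by comparing sets of typed spans. The sketch below (the representation is ours, assuming gold and predicted entities are given as (start, end, type) triples) computes entity-level precision, recall, and F1.

```python
def entity_prf(gold_spans, pred_spans):
    """Entity-level precision/recall/F1: an entity counts as correct only if
    both its boundaries and its type match a gold entity exactly."""
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A system that finds only "Jane" instead of "Jane Villanueva" gets no credit
# for that entity: (0, 1, "PER") does not match the gold span (0, 2, "PER").
```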

8.7 Further Details

In this section we summarize a few remaining details of the data and models, beginning with data. Since the algorithms we have presented are supervised, having a hand-labeled training corpus is essential.

8.7.3 POS Tagging for Morphologically Rich Languages

Augmentations to tagging algorithms become necessary when dealing with languages with rich morphology like Czech, Hungarian and Turkish. These productive word-formation processes result in a large vocabulary for these languages: a 250,000 word token corpus of Hungarian has more than twice as many word types as a similarly sized corpus of English (Oravecz and Dienes, 2002), while a 10 million word token corpus of Turkish contains four times as many word types as a similarly sized English corpus (Hakkani-Tür et al., 2002). Large vocabularies mean many unknown words, and these unknown words cause significant performance degradations in a wide variety of languages (including Czech, Slovene, Estonian, and Romanian) (Hajič, 2000).

Highly inflectional languages also have much more information than English coded in word morphology, like case (nominative, accusative, genitive) or gender (masculine, feminine). Because this information is important for tasks like parsing and coreference resolution, part-of-speech taggers for morphologically rich languages need to label words with case and gender information. Tagsets for morphologically rich languages are therefore sequences of morphological tags rather than a single primitive tag. Here's a Turkish example, in which the word izin has three possible morphological/part-of-speech tags and meanings (Hakkani-Tür et al., 2002):

1. Yerdeki izin temizlenmesi gerek.           iz + Noun+A3sg+Pnon+Gen
   The trace on the floor should be cleaned.
2. Üzerinde parmak izin kalmış.               iz + Noun+A3sg+P2sg+Nom
   Your fingerprint is left on (it).
3. İçeri girmek için izin alman gerekiyor.    izin + Noun+A3sg+Pnon+Nom
   You need permission to enter.

Using a morphological parse sequence like Noun+A3sg+Pnon+Gen as the part-of-speech tag greatly increases the number of parts of speech, and so tagsets can be 4 to 10 times larger than the 50-100 tags we have seen for English. With such large tagsets, each word needs to be morphologically analyzed to generate the list of possible morphological tag sequences (part-of-speech tags) for the word. The role of the tagger is then to disambiguate among these tags. This method also helps with unknown words since morphological parsers can accept unknown stems and still segment the affixes properly.
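The two-stage approach just described (morphological analysis to propose candidate tag sequences, then disambiguation among them) can be sketched as follows; the analyzer and the sequence-scoring model are placeholders for illustration, since real systems use finite-state morphological analyzers and trained HMM/CRF-style scorers.

```python
from itertools import product

def tag_rich_language(tokens, analyze, score_sequence):
    """Schematic two-stage tagging for a morphologically rich language.

    analyze(word): returns candidate morphological tag strings for a word,
        e.g. ["iz+Noun+A3sg+Pnon+Gen", "izin+Noun+A3sg+Pnon+Nom"] (placeholder).
    score_sequence(tag_sequence): score for a full candidate tag sequence,
        e.g. from a trained sequence model (placeholder).
    """
    candidates = [analyze(w) for w in tokens]
    # Exhaustive search over combinations is fine for a sketch; real taggers
    # use Viterbi-style dynamic programming over the candidate lattice.
    best = max(product(*candidates), key=score_sequence)
    return list(zip(tokens, best))
```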

8.8 Summary

This chapter introduced parts of speech and named entities, and the tasks of part-of-speech tagging and named entity recognition:

• Languages generally have a small set of closed class words that are highly frequent, ambiguous, and act as function words, and open-class words like nouns, verbs, adjectives. Various part-of-speech tagsets exist, of between 40 and 200 tags.
• Part-of-speech tagging is the process of assigning a part-of-speech label to each of a sequence of words.
• Named entities are words for proper nouns referring mainly to people, places, and organizations, but extended to many other types that aren't strictly entities or even proper nouns.

• Two common approaches to sequence modeling are a generative approach, HMM tagging, and a discriminative approach, CRF tagging. We will see a neural approach in following chapters.
• The probabilities in HMM taggers are estimated by maximum likelihood estimation on tag-labeled training corpora. The Viterbi algorithm is used for decoding, finding the most likely tag sequence.
• Conditional Random Fields or CRF taggers train a log-linear model that can choose the best tag sequence given an observation sequence, based on features that condition on the output tag, the prior output tag, the entire input sequence, and the current timestep. They use the Viterbi algorithm for inference, to choose the best sequence of tags, and a version of the forward-backward algorithm (see Appendix A) for training.

Bibliographical and Historical Notes

What is probably the earliest part-of-speech tagger was part of the parser in Zellig Harris's Transformations and Discourse Analysis Project (TDAP), implemented between June 1958 and July 1959 at the University of Pennsylvania (Harris, 1962), although earlier systems had used part-of-speech dictionaries. TDAP used 14 hand-written rules for part-of-speech disambiguation; the use of part-of-speech tag sequences and the relative frequency of tags for a word prefigures modern algorithms. The parser was implemented essentially as a cascade of finite-state transducers; see Joshi and Hopely (1999) and Karttunen (1999) for a reimplementation.

The Computational Grammar Coder (CGC) of Klein and Simmons (1963) had three components: a lexicon, a morphological analyzer, and a context disambiguator. The small 1500-word lexicon listed only function words and other irregular words. The morphological analyzer used inflectional and derivational suffixes to assign part-of-speech classes. These were run over words to produce candidate parts of speech which were then disambiguated by a set of 500 context rules by relying on surrounding islands of unambiguous words. For example, one rule said that between an ARTICLE and a VERB, the only allowable sequences were ADJ-NOUN, NOUN-ADVERB, or NOUN-NOUN. The TAGGIT tagger (Greene and Rubin, 1971) used the same architecture as Klein and Simmons (1963), with a bigger dictionary and more tags (87). TAGGIT was applied to the Brown corpus and, according to Francis and Kučera (1982, p. 9), accurately tagged 77% of the corpus; the remainder of the Brown corpus was then tagged by hand. All these early algorithms were based on a two-stage architecture in which a dictionary was first used to assign each word a set of potential parts of speech, and then lists of handwritten disambiguation rules winnowed the set down to a single part of speech per word.

Probabilities were used in tagging by Stolz et al. (1965) and a complete probabilistic tagger with Viterbi decoding was sketched by Bahl and Mercer (1976). The Lancaster-Oslo/Bergen (LOB) corpus, a British English equivalent of the Brown corpus, was tagged in the early 1980's with the CLAWS tagger (Marshall 1983; Marshall 1987; Garside 1987), a probabilistic algorithm that approximated a simplified HMM tagger. The algorithm used tag bigram probabilities, but instead of storing the word likelihood of each tag, the algorithm marked tags either as rare (P(tag|word) < .01), infrequent (P(tag|word) < .10), or normally frequent (P(tag|word) > .10).

DeRose (1988) developed a quasi-HMM algorithm, including the use of dynamic programming, although computing P(t|w)P(w) instead of P(w|t)P(w). The same year, the probabilistic PARTS tagger of Church (1988), (1989) was probably

the first implemented HMM tagger, described correctly in Church (1989), although Church (1988) also described the computation incorrectly as P(t|w)P(w) instead of P(w|t)P(w). Church (p.c.) explained that he had simplified for pedagogical purposes because using the probability P(t|w) made the idea seem more understandable as "storing a lexicon in an almost standard form".

Later taggers explicitly introduced the use of the hidden Markov model (Kupiec 1992; Weischedel et al. 1993; Schütze and Singer 1994). Merialdo (1994) showed that fully unsupervised EM didn't work well for the tagging task and that reliance on hand-labeled data was important. Charniak et al. (1993) showed the importance of the most frequent tag baseline; the 92.3% number we give above was from Abney et al. (1999). See Brants (2000) for HMM tagger implementation details, including the extension to trigram contexts, and the use of sophisticated unknown word features; its performance is still close to state of the art taggers.

Log-linear models for POS tagging were introduced by Ratnaparkhi (1996), who introduced a system called MXPOST which implemented a maximum entropy Markov model (MEMM), a slightly simpler version of a CRF. Around the same time, sequence labelers were applied to the task of named entity tagging, first with HMMs (Bikel et al., 1997) and MEMMs (McCallum et al., 2000), and then once CRFs were developed (Lafferty et al. 2001), they were also applied to NER (McCallum and Li, 2003). A wide exploration of features followed (Zhou et al., 2005). Neural approaches to NER mainly follow from the pioneering results of Collobert et al. (2011), who applied a CRF on top of a convolutional net. BiLSTMs with word and character-based embeddings as input followed shortly and became a standard neural algorithm for NER (Huang et al. 2015, Ma and Hovy 2016, Lample et al. 2016), followed by the more recent use of Transformers and BERT.

The idea of using letter suffixes for unknown words is quite old; the early Klein and Simmons (1963) system checked all final letter suffixes of lengths 1-5. The unknown word features described on page 17 come mainly from Ratnaparkhi (1996), with augmentations from Toutanova et al. (2003) and Manning (2011).

State of the art POS taggers use neural algorithms, either bidirectional RNNs or Transformers like BERT; see Chapter 9 and Chapter 10. HMM (Brants 2000; Thede and Harper 1999) and CRF tagger accuracies are likely just a tad lower. Manning (2011) investigates the remaining 2.7% of errors in a high-performing tagger (Toutanova et al., 2003). He suggests that a third or half of these remaining errors are due to errors or inconsistencies in the training data, a third might be solvable with richer linguistic models, and for the remainder the task is underspecified or unclear.

Supervised tagging relies heavily on in-domain training data hand-labeled by experts. Ways to relax this assumption include unsupervised algorithms for clustering words into part-of-speech-like classes, summarized in Christodoulopoulos et al. (2010), and ways to combine labeled and unlabeled data, for example by co-training (Clark et al. 2003; Søgaard 2010).

See Householder (1995) for historical notes on parts of speech, and Sampson (1987) and Garside et al. (1997) on the provenance of the Brown and other tagsets.

Exercises

8.1 Find one tagging error in each of the following sentences that are tagged with the Penn Treebank tagset:
    1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN

    2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS
    3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP
    4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

8.2 Use the Penn Treebank tagset to tag each word in the following sentences from Damon Runyon's short stories. You may ignore punctuation. Some of these are quite difficult; do your best.
    1. It is a nice night.
    2. This crap game is over a garage in Fifty-second Street. . .
    3. . . . Nobody ever takes the newspapers she sells . . .
    4. He is a tall, skinny guy with a long, sad, mean-looking kisser, and a mournful voice.
    5. . . . I am sitting in Mindy's restaurant putting on the gefillte fish, which is a dish I am very fond of, . . .
    6. When a guy and a doll get to taking peeks back and forth at each other, why there you are indeed.

8.3 Now compare your tags from the previous exercise with one or two friends' answers. On which words did you disagree the most? Why?

8.4 Implement the "most likely tag" baseline. Find a POS-tagged training set, and use it to compute for each word the tag that maximizes p(t|w). You will need to implement a simple tokenizer to deal with sentence boundaries. Start by assuming that all unknown words are NN and compute your error rate on known and unknown words. Now write at least five rules to do a better job of tagging unknown words, and show the difference in error rates.

8.5 Build a bigram HMM tagger. You will need a part-of-speech-tagged corpus. First split the corpus into a training set and test set. From the labeled training set, train the transition and observation probabilities of the HMM tagger directly on the hand-tagged data. Then implement the Viterbi algorithm so you can decode a test sentence. Now run your algorithm on the test set. Report its error rate and compare its performance to the most frequent tag baseline.

8.6 Do an error analysis of your tagger. Build a confusion matrix and investigate the most frequent errors. Propose some features for improving the performance of your tagger on these errors.

8.7 Develop a set of regular expressions to recognize the character shape features described on page 17.

8.8 The BIO and other labeling schemes given in this chapter aren't the only possible ones. For example, the B tag can be reserved only for those situations where an ambiguity exists between adjacent entities. Propose a new set of BIO tags for use with your NER system. Experiment with it and compare its performance with the schemes presented in this chapter.

8.9 Names of works of art (books, movies, video games, etc.) are quite different from the kinds of named entities we've discussed in this chapter. Collect a list of names of works of art from a particular category from a Web-based source (e.g., gutenberg.org, amazon.com, imdb.com, etc.). Analyze your list and give examples of ways that the names in it are likely to be problematic for the techniques described in this chapter.

8.10 Develop an NER system specific to the category of names that you collected in the last exercise. Evaluate your system on a collection of text likely to contain instances of these named entities.

Abney, S. P., Schapire, R. E., and Singer, Y. (1999). Boosting applied to tagging and PP attachment. EMNLP/VLC.
Bada, M., Eckert, M., Evans, D., Garcia, K., Shipley, K., Sitnikov, D., Baumgartner, W. A., Cohen, K. B., Verspoor, K., Blake, J. A., and Hunter, L. E. (2012). Concept annotation in the CRAFT corpus. BMC Bioinformatics 13(1), 161.
Bahl, L. R. and Mercer, R. L. (1976). Part of speech assignment by a statistical decision algorithm. Proceedings IEEE International Symposium on Information Theory.
Bamman, D., Popat, S., and Shen, S. (2019). An annotated dataset of literary entities. NAACL HLT.
Bikel, D. M., Miller, S., Schwartz, R., and Weischedel, R. (1997). Nymble: A high-performance learning name-finder. ANLP.
Brants, T. (2000). TnT: A statistical part-of-speech tagger. ANLP.
Broschart, J. (1997). Why Tongan does it differently. Linguistic Typology 1, 123–165.
Charniak, E., Hendrickson, C., Jacobson, N., and Perkowitz, M. (1993). Equations for part-of-speech tagging. AAAI.
Chiticariu, L., Danilevsky, M., Li, Y., Reiss, F., and Zhu, H. (2018). SystemT: Declarative text understanding for enterprise. NAACL HLT, Vol. 3.
Chiticariu, L., Li, Y., and Reiss, F. R. (2013). Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!. EMNLP.
Christodoulopoulos, C., Goldwater, S., and Steedman, M. (2010). Two decades of unsupervised POS induction: How far have we come?. EMNLP.
Church, K. W. (1988). A stochastic parts program and noun phrase parser for unrestricted text. ANLP.
Church, K. W. (1989). A stochastic parts program and noun phrase parser for unrestricted text. ICASSP.
Clark, S., Curran, J. R., and Osborne, M. (2003). Bootstrapping POS taggers using unlabelled data. CoNLL.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. JMLR 12, 2493–2537.
DeRose, S. J. (1988). Grammatical category disambiguation by statistical optimization. Computational Linguistics 14, 31–39.
Evans, N. (2000). Word classes in the world's languages. Booij, G., Lehmann, C., and Mugdan, J. (Eds.), Morphology: A Handbook on Inflection and Word Formation, 708–732. Mouton.
Francis, W. N. and Kučera, H. (1982). Frequency Analysis of English Usage. Houghton Mifflin, Boston.
Garside, R. (1987). The CLAWS word-tagging system. Garside, R., Leech, G., and Sampson, G. (Eds.), The Computational Analysis of English, 30–41. Longman.
Garside, R., Leech, G., and McEnery, A. (1997). Corpus Annotation. Longman.
Gil, D. (2000). Syntactic categories, cross-linguistic variation and universal grammar. Vogel, P. M. and Comrie, B. (Eds.), Approaches to the Typology of Word Classes, 173–216. Mouton.
Greene, B. B. and Rubin, G. M. (1971). Automatic grammatical tagging of English. Department of Linguistics, Brown University, Providence, Rhode Island.
Hajič, J. (2000). Morphological tagging: Data vs. dictionaries. NAACL. Seattle.
Hakkani-Tür, D., Oflazer, K., and Tür, G. (2002). Statistical morphological disambiguation for agglutinative languages. Journal of Computers and Humanities 36(4), 381–410.
Harris, Z. S. (1962). String Analysis of Sentence Structure. Mouton, The Hague.
Householder, F. W. (1995). Dionysius Thrax, the technai, and Sextus Empiricus. Koerner, E. F. K. and Asher, R. E. (Eds.), Concise History of the Language Sciences, 99–103. Elsevier Science.
Hovy, E. H., Marcus, M. P., Palmer, M., Ramshaw, L. A., and Weischedel, R. (2006). OntoNotes: The 90% solution. HLT-NAACL.
Huang, Z., Xu, W., and Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
Joshi, A. K. and Hopely, P. (1999). A parser from antiquity. Kornai, A. (Ed.), Extended Finite State Models of Language, 6–15. Cambridge University Press.
Karlsson, F., Voutilainen, A., Heikkilä, J., and Anttila, A. (Eds.). (1995). Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter.
Karttunen, L. (1999). Comments on Joshi. Kornai, A. (Ed.), Extended Finite State Models of Language, 16–18. Cambridge University Press.
Klein, S. and Simmons, R. F. (1963). A computational approach to grammatical coding of English words. Journal of the ACM 10(3), 334–347.
Kupiec, J. (1992). Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language 6, 225–242.
Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition. NAACL HLT.
Ma, X. and Hovy, E. H. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. ACL.
Manning, C. D. (2011). Part-of-speech tagging from 97% to 100%: Is it time for some linguistics?. CICLing 2011.
Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn treebank. Computational Linguistics 19(2), 313–330.
Marshall, I. (1983). Choice of grammatical word-class without global syntactic analysis: Tagging words in the LOB corpus. Computers and the Humanities 17, 139–150.
Marshall, I. (1987). Tag selection using probabilistic methods. Garside, R., Leech, G., and Sampson, G. (Eds.), The Computational Analysis of English, 42–56. Longman.
McCallum, A., Freitag, D., and Pereira, F. C. N. (2000). Maximum entropy Markov models for information extraction and segmentation. ICML.
McCallum, A. and Li, W. (2003). Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. CoNLL.
Merialdo, B. (1994). Tagging English text with a probabilistic model. Computational Linguistics 20(2), 155–172.
Mikheev, A., Moens, M., and Grover, C. (1999). Named entity recognition without gazetteers. EACL.
Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., and Zeman, D. (2016a). Universal Dependencies v1: A multilingual treebank collection. LREC.
Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., and Zeman, D. (2016b). Universal Dependencies v1: A multilingual treebank collection. LREC.
Oravecz, C. and Dienes, P. (2002). Efficient stochastic part-of-speech tagging for Hungarian. LREC.
Ramshaw, L. A. and Marcus, M. P. (1995). Text chunking using transformation-based learning. Proceedings of the 3rd Annual Workshop on Very Large Corpora.
Ratnaparkhi, A. (1996). A maximum entropy part-of-speech tagger. EMNLP.
Sampson, G. (1987). Alternative grammatical coding systems. Garside, R., Leech, G., and Sampson, G. (Eds.), The Computational Analysis of English, 165–183. Longman.
Schütze, H. and Singer, Y. (1994). Part-of-speech tagging using a variable memory Markov model. ACL.
Søgaard, A. (2010). Simple semi-supervised training of part-of-speech taggers. ACL.
Stolz, W. S., Tannenbaum, P. H., and Carstensen, F. V. (1965). A stochastic approach to the grammatical coding of English. CACM 8(6), 399–405.
Thede, S. M. and Harper, M. P. (1999). A second-order hidden Markov model for part-of-speech tagging. ACL.
Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. HLT-NAACL.
Voutilainen, A. (1999). Handcrafted rules. van Halteren, H. (Ed.), Syntactic Wordclass Tagging, 217–246. Kluwer.
Weischedel, R., Meteer, M., Schwartz, R., Ramshaw, L. A., and Palmucci, J. (1993). Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics 19(2), 359–382.
Wu, S. and Dredze, M. (2019). Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. EMNLP.
Zhou, G., Su, J., Zhang, J., and Zhang, M. (2005). Exploring various knowledge in relation extraction. ACL.