Sequence Labeling for Part of Speech and Named Entities
Part of Speech Tagging
Parts of Speech

From the earliest linguistic traditions (Yaska and Panini, 5th c. BCE; Aristotle, 4th c. BCE), the idea that words can be classified into grammatical categories
• part of speech, word classes, POS, POS tags
Eight parts of speech attributed to Dionysius Thrax of Alexandria (c. 1st c. BCE):
• noun, verb, pronoun, preposition, adverb, conjunction, participle, article
• These categories are still relevant for NLP today.
Two classes of words: Open vs. Closed

Closed class words
• Relatively fixed membership
• Usually function words: short, frequent words with grammatical function
• determiners: a, an, the
• pronouns: she, he, I
• prepositions: on, under, over, near, by, …
Open class words
• Usually content words: Nouns, Verbs, Adjectives, Adverbs
• Plus interjections: oh, ouch, uh-huh, yes, hello
• New nouns and verbs like iPhone or to fax

Open class ("content") words
• Nouns: Proper (Janet, Italy), Common (cat, cats, mango)
• Verbs: Main (eat, went)
• Adjectives: old, green, tasty
• Adverbs: slowly, yesterday
• Interjections: Ow, hello
• Numbers: 122,312, one
• … more
Closed class ("function") words
• Determiners: the, some
• Auxiliary verbs: can, had
• Prepositions: to, with
• Conjunctions: and, or
• Particles: off, up
• Pronouns: they, its
• … more

Part-of-Speech Tagging
Assigning a part of speech to each word in a text. Words often have more than one POS. For example, book:
• VERB: (Book that flight)
• NOUN: (Hand me that book)

Map from a sequence x1,…,xn of words to a sequence y1,…,yn of POS tags

y1      y2      y3      y4      y5
NOUN    AUX     VERB    DET     NOUN
        [ Part of Speech Tagger ]
Janet   will    back    the     bill
x1      x2      x3      x4      x5

Figure 8.3 The task of part-of-speech tagging: mapping from input words x1, x2, ..., xn to output POS tags y1, y2, ..., yn.
Tagging is a disambiguation task: words are ambiguous, i.e., have more than one possible part of speech. For example, that can be a determiner (Does that flight serve dinner) or a complementizer (I thought that your flight was earlier). The goal of POS tagging is to resolve these ambiguities, choosing the proper tag for the context.
The accuracy of part-of-speech tagging algorithms (the percentage of test set tags that match gold labels) is extremely high. One study found accuracies over 97% across 15 languages from the Universal Dependency (UD) treebank (Wu and Dredze, 2019). Accuracies on various English treebanks are also 97% (no matter the algorithm; HMMs, CRFs, and BERT perform similarly). This 97% number is also about the human performance on this task, at least for English (Manning, 2011).

                         WSJ              Brown
Types:
  Unambiguous (1 tag)     44,432 (86%)     45,799 (85%)
  Ambiguous (2+ tags)      7,025 (14%)      8,050 (15%)
Tokens:
  Unambiguous (1 tag)    577,421 (45%)    384,349 (33%)
  Ambiguous (2+ tags)    711,780 (55%)    786,646 (67%)
Figure 8.4 Tag ambiguity in the Brown and WSJ corpora (Treebank-3 45-tag tagset).

We'll introduce algorithms for the task in the next few sections, but first let's explore the task. Exactly how hard is it? Fig. 8.4 shows that most word types (85-86%) are unambiguous (Janet is always NNP, hesitantly is always RB). But the ambiguous words, though accounting for only 14-15% of the vocabulary, are very common, and 55-67% of word tokens in running text are ambiguous. Particularly ambiguous common words include that, back, down, put and set; here are some examples of the 6 different parts of speech for the word back:
earnings growth took a back/JJ seat
a small building in the back/NN
a clear majority of senators back/VBP the bill
Dave began to back/VB toward the door
enable the country to buy back/RP debt
I was twenty-one back/RB then
Nonetheless, many words are easy to disambiguate, because their different tags aren't equally likely. For example, a can be a determiner or the letter a, but the determiner sense is much more likely. This idea suggests a useful baseline: given an ambiguous word, choose the tag which is most frequent in the training corpus. This is a key concept:
Most Frequent Class Baseline: Always compare a classifier against a baseline at least as good as the most frequent class baseline (assigning each token to the class it occurred in most often in the training set).
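As a concrete illustration, here is a minimal Python sketch of the most frequent class baseline, assuming the training data is available as a list of (word, tag) pairs. The function names and the toy corpus are illustrative, not from the chapter.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """For each word, record the tag it received most often in training."""
    counts = defaultdict(Counter)
    tag_totals = Counter()
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
        tag_totals[tag] += 1
    most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = tag_totals.most_common(1)[0][0]  # fallback for unknown words (or simply use NOUN)
    return most_frequent, default_tag

def baseline_tag(words, most_frequent, default_tag):
    """Tag every token with its most frequent training tag; unknown words get the default."""
    return [most_frequent.get(w, default_tag) for w in words]

# Toy usage with UD-style tags:
train = [("Janet", "PROPN"), ("will", "AUX"), ("back", "VERB"), ("the", "DET"),
         ("bill", "NOUN"), ("will", "AUX"), ("back", "VERB"), ("back", "ADV")]
mft, default = train_most_frequent_tag(train)
print(baseline_tag("Janet will back the bill".split(), mft, default))
# ['PROPN', 'AUX', 'VERB', 'DET', 'NOUN']
```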

8.1 (Mostly) English Word Classes
Until now we have been using part-of-speech terms like noun and verb rather freely. In this section we give more complete definitions. While word classes do have semantic tendencies—adjectives, for example, often describe properties and nouns people—parts of speech are defined instead based on their grammatical relationship with neighboring words or the morphological properties of their affixes.
The "Universal Dependencies" Tagset (Nivre et al., 2016)
Tag    Description                                                   Example
Open class:
ADJ    Adjective: noun modifiers describing properties               red, young, awesome
ADV    Adverb: verb modifiers of time, place, manner                 very, slowly, home, yesterday
NOUN   Words for persons, places, things, etc.                       algorithm, cat, mango, beauty
VERB   Words for actions and processes                               draw, provide, go
PROPN  Proper noun: name of a person, organization, place, etc.      Regina, IBM, Colorado
INTJ   Interjection: exclamation, greeting, yes/no response, etc.    oh, um, yes, hello
Closed class:
ADP    Adposition (Preposition/Postposition): marks a noun's
       spatial, temporal, or other relation                          in, on, by, under
AUX    Auxiliary: helping verb marking tense, aspect, mood, etc.     can, may, should, are
CCONJ  Coordinating Conjunction: joins two phrases/clauses           and, or, but
DET    Determiner: marks noun phrase properties                      a, an, the, this
NUM    Numeral                                                       one, two, first, second
PART   Particle: a preposition-like form used together with a verb   up, down, on, off, in, out, at, by
PRON   Pronoun: a shorthand for referring to an entity or event      she, who, I, others
SCONJ  Subordinating Conjunction: joins a main clause with a
       subordinate clause such as a sentential complement            that, which
Other:
PUNCT  Punctuation                                                   . , ( )
SYM    Symbols like $ or emoji                                       $, %
X      Other                                                         asdf, qwfg
Figure 8.1 The 17 parts of speech in the Universal Dependencies tagset (Nivre et al., 2016a). Features can be added to make finer-grained distinctions (with properties like number, case, definiteness, and so on).

Parts of speech fall into two broad categories: closed class and open class. Closed classes are those with relatively fixed membership, such as prepositions—new prepositions are rarely coined. By contrast, nouns and verbs are open classes—new nouns and verbs like iPhone or to fax are continually being created or borrowed. Closed class words are generally function words like of, it, and, or you, which tend to be very short, occur frequently, and often have structuring uses in grammar.
Four major open classes occur in the languages of the world: nouns (including proper nouns), verbs, adjectives, and adverbs, as well as the smaller open class of interjections. English has all five, although not every language does.
Nouns are words for people, places, or things, but include others as well. Common nouns include concrete terms like cat and mango, abstractions like algorithm and beauty, and verb-like terms like pacing as in His pacing to and fro became quite annoying. Nouns in English can occur with determiners (a goat, its bandwidth), take possessives (IBM's annual revenue), and may occur in the plural (goats, abaci).
Many languages, including English, divide common nouns into count nouns and mass nouns. Count nouns can occur in the singular and plural (goat/goats, relationship/relationships) and can be counted (one goat, two goats). Mass nouns are used when something is conceptualized as a homogeneous group. So snow, salt, and communism are not counted (i.e., *two snows or *two communisms). Proper nouns, like Regina, Colorado, and IBM, are names of specific persons or entities.
Sample "Tagged" English sentences

There/PRON were/VERB 70/NUM children/NOUN there/ADV ./PUNCT
Preliminary/ADJ findings/NOUN were/AUX reported/VERB in/ADP today/NOUN 's/PART New/PROPN England/PROPN Journal/PROPN of/ADP Medicine/PROPN
Why Part of Speech Tagging?

◦ Can be useful for other NLP tasks
  ◦ Parsing: POS tagging can improve syntactic parsing
  ◦ MT: reordering of adjectives and nouns (say from Spanish to English)
  ◦ Sentiment or affective tasks: may want to distinguish adjectives or other POS
  ◦ Text-to-speech (how do we pronounce "lead" or "object"?)
◦ Or linguistic or language-analytic computational tasks
  ◦ Need to control for POS when studying linguistic change like creation of new words, or meaning shift
  ◦ Or control for POS in measuring meaning similarity or difference
How difficult is POS tagging in English?

Roughly 15% of word types are ambiguous
• Hence 85% of word types are unambiguous
• Janet is always PROPN, hesitantly is always ADV
But those 15% tend to be very common, so ~60% of word tokens are ambiguous.
E.g., back:
earnings growth took a back/ADJ seat
a small building in the back/NOUN
a clear majority of senators back/VERB the bill
enable the country to buy back/PART debt
I was twenty-one back/ADV then
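These type/token percentages (and the numbers in Fig. 8.4) come from counting how many distinct tags each word type receives in a hand-tagged corpus. A small sketch of that computation, again assuming the corpus is a list of (word, tag) pairs; the function name is hypothetical.

```python
from collections import Counter, defaultdict

def ambiguity_stats(tagged_corpus):
    """Fraction of word types, and of word tokens, that take 2+ distinct tags
    in a hand-tagged corpus given as (word, tag) pairs."""
    tags_per_type = defaultdict(set)
    tokens_per_type = Counter()
    for word, tag in tagged_corpus:
        tags_per_type[word].add(tag)
        tokens_per_type[word] += 1
    ambiguous = [w for w, tags in tags_per_type.items() if len(tags) > 1]
    type_fraction = len(ambiguous) / len(tags_per_type)
    token_fraction = sum(tokens_per_type[w] for w in ambiguous) / sum(tokens_per_type.values())
    return type_fraction, token_fraction

# On WSJ/Brown (Fig. 8.4) this kind of count gives roughly 14-15% of types
# and 55-67% of tokens as ambiguous.
```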

POS tagging performance in English
How many tags are correct? (Tag accuracy)
◦ About 97%
◦ Hasn't changed in the last 10+ years
◦ HMMs, CRFs, BERT perform similarly
◦ Human accuracy about the same
But baseline is 92%!
◦ Baseline is the performance of the stupidest possible method
◦ "Most frequent class baseline" is an important baseline for many tasks
◦ Tag every word with its most frequent tag
◦ (and tag unknown words as nouns)
Partly easy because
◦ Many words are unambiguous
Sources of information for POS tagging

Janet will back the bill
AUX/NOUN/VERB?   NOUN/VERB?
Prior probabilities of word/tag
• "will" is usually an AUX
Identity of neighboring words
• "the" means the next word is probably not a verb
Morphology and wordshape:
◦ Prefixes        unable: un- → ADJ
◦ Suffixes        importantly: -ly → ADV
◦ Capitalization  Janet: CAP → PROPN
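For feature-based taggers such as CRFs, cues like these are typically encoded as per-token features. Here is a rough sketch of what such a feature extractor might look like; the feature names and the exact feature set are illustrative assumptions, not a prescribed recipe.

```python
def word_features(words, i):
    """Per-token features combining the cues above: the word itself,
    its neighbors, and morphology/word-shape cues."""
    w = words[i]
    return {
        "word": w.lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "</s>",
        "prefix_2": w[:2].lower(),              # e.g. "un-" suggests ADJ
        "suffix_2": w[-2:].lower(),             # e.g. "-ly" suggests ADV
        "is_capitalized": w[0].isupper(),       # mid-sentence capitals suggest PROPN
        "has_digit": any(ch.isdigit() for ch in w),
    }

print(word_features("Janet will back the bill".split(), 2))
# {'word': 'back', 'prev_word': 'will', 'next_word': 'the', 'prefix_2': 'ba',
#  'suffix_2': 'ck', 'is_capitalized': False, 'has_digit': False}
```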

Standard algorithms for POS tagging
Supervised Machine Learning Algorithms:
• Hidden Markov Models
• Conditional Random Fields (CRF) / Maximum Entropy Markov Models (MEMM)
• Neural sequence models (RNNs or Transformers)
• Large Language Models (like BERT), finetuned
All require a hand-labeled training set; all have about equal performance (97% on English).
All make use of the information sources we discussed:
• Via human-created features: HMMs and CRFs
• Via representation learning: Neural LMs
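To give a flavor of the classical approach, here is a compact, simplified sketch of HMM tagging with Viterbi decoding. The counting scheme, add-alpha smoothing, and function names are illustrative simplifications, not the full HMM formulation developed later in the chapter.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Count HMM transitions (tag given previous tag) and emissions (word given tag)
    from sentences given as lists of (word, tag) pairs."""
    trans = defaultdict(Counter)   # trans[prev_tag][tag]
    emit = defaultdict(Counter)    # emit[tag][word]
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            prev = tag
    return trans, emit

def _logprob(counter, key, alpha=0.1, size=1000):
    # crude add-alpha smoothing so unseen words/transitions don't zero out a path
    return math.log((counter[key] + alpha) / (sum(counter.values()) + alpha * size))

def viterbi(words, trans, emit):
    """Return the most probable tag sequence for `words` under the counted HMM."""
    tags = list(emit.keys())
    # best[i][t] = (log-prob of best path ending in tag t at position i, backpointer)
    best = [{} for _ in words]
    for t in tags:
        best[0][t] = (_logprob(trans["<s>"], t) + _logprob(emit[t], words[0].lower()), None)
    for i in range(1, len(words)):
        for t in tags:
            score, prev = max((best[i - 1][p][0] + _logprob(trans[p], t), p) for p in tags)
            best[i][t] = (score + _logprob(emit[t], words[i].lower()), prev)
    # follow backpointers from the best final tag
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

# Usage: tags = viterbi("Janet will back the bill".split(), *train_hmm(tagged_sentences))
```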

Sequence Labeling for Part of Speech and Named Entities
Named Entity Recognition (NER)
Named Entities
◦ Named entity, in its core usage, means anything that can be referred to with a proper name. Most common 4 tags:
  ◦ PER (Person): "Marie Curie"
  ◦ LOC (Location): "New York City"
  ◦ ORG (Organization): "Stanford University"
  ◦ GPE (Geo-Political Entity): "Boulder, Colorado"
◦ Often multi-word phrases
◦ But the term is also extended to things that aren't entities:
  ◦ dates, times, prices
Named Entity tagging

The task of named entity recognition (NER):
• find spans of text that constitute proper names
• tag the type of the entity

The most-frequent-tag baseline has an accuracy of about 92%.¹ The baseline thus differs from the state-of-the-art and human ceiling (97%) by only 5%.

8.3 Named Entities and Named Entity Tagging

Part of speech tagging can tell us that words like Janet, Stanford University, and Colorado are all proper nouns; being a proper noun is a grammatical property of these words. But viewed from a semantic perspective, these proper nouns refer to different kinds of entities: Janet is a person, Stanford University is an organization, and Colorado is a location.
A named entity is, roughly speaking, anything that can be referred to with a proper name: a person, a location, an organization. The task of named entity recognition (NER) is to find spans of text that constitute proper names and tag the type of the entity. Four entity tags are most common: PER (person), LOC (location), ORG (organization), or GPE (geo-political entity). However, the term named entity is commonly extended to include things that aren't entities per se, including dates, times, and other kinds of temporal expressions, and even numerical expressions like prices. Here's an example of the output of an NER tagger:

Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
The text contains 13 mentions of named entities including 5 organizations, 4 locations, 2 times, 1 person, and 1 mention of money. Figure 8.5 shows typical generic named entity types. Many applications will also need to use specific entity types like proteins, genes, commercial products, or works of art.

Type                  Tag  Sample Categories         Example sentences
People                PER  people, characters        Turing is a giant of computer science.
Organization          ORG  companies, sports teams   The IPCC warned about the cyclone.
Location              LOC  regions, mountains, seas  Mt. Sanitas is in Sunshine Canyon.
Geo-Political Entity  GPE  countries, states         Palo Alto is raising the fees for parking.
Figure 8.5 A list of generic named entity types with the kinds of entities they refer to.

Named entity tagging is a useful first step in lots of natural language understanding tasks. In sentiment analysis we might want to know a consumer's sentiment toward a particular entity. Entities are a useful first stage in question answering, or for linking text to information in structured knowledge sources like Wikipedia. And named entity tagging is also central to natural language understanding tasks of building semantic representations, like extracting events and the relationships between participants.

¹ In English, on the WSJ corpus, tested on sections 22-24.
Why NER?

Sentiment analysis: consumer's sentiment toward a particular company or person?
Question Answering: answer questions about an entity?
Information Extraction: extracting facts about entities from text.
Why NER is hard

1) Segmentation
• In POS tagging, no segmentation problem since each word gets one tag.
• In NER we have to find and segment the entities!
2) Type ambiguity
Unlike part-of-speech tagging, where there is no segmentation problem since each word gets one tag, the task of named entity recognition is to find and label spans of text, and is difficult partly because of the ambiguity of segmentation; we need to decide what's an entity and what isn't, and where the boundaries are. Indeed, most words in a text will not be named entities. Another difficulty is caused by type ambiguity. The mention JFK can refer to a person, the airport in New York, or any number of schools, bridges, and streets around the United States. Some examples of this kind of cross-type confusion are given in Figure 8.6.

[PER Washington] was born into slavery on the farm of James Burroughs. [ORG Washington] went up 2 games to 1 in the four-game series. Blair arrived in [LOC Washington] for what may well be his last state visit. In June, [GPE Washington] passed a primary seatbelt law. Figure 8.6 Examples of type ambiguities in the use of the name Washington.

The standard approach to sequence labeling for a span-recognition problem like NER is BIO tagging (Ramshaw and Marcus, 1995). This is a method that allows us to treat NER like a word-by-word sequence labeling task, via tags that capture both the boundary and the named entity type. Consider the following sentence:

[PER Jane Villanueva] of [ORG United], a unit of [ORG United Airlines Holding], said the fare applies to the [LOC Chicago] route.
Figure 8.7 shows the same excerpt represented with BIO tagging, as well as variants called IO tagging and BIOES tagging. In BIO tagging we label any token that begins a span of interest with the label B, tokens that occur inside a span are tagged with an I, and any tokens outside of any span of interest are labeled O. While there is only one O tag, we'll have distinct B and I tags for each named entity class. The number of tags is thus 2n + 1, where n is the number of entity types. BIO tagging can represent exactly the same information as the bracketed notation, but has the advantage that we can represent the task in the same simple sequence modeling way as part-of-speech tagging: assigning a single label yi to each input word xi:

Words       IO Label  BIO Label  BIOES Label
Jane        I-PER     B-PER      B-PER
Villanueva  I-PER     I-PER      E-PER
of          O         O          O
United      I-ORG     B-ORG      B-ORG
Airlines    I-ORG     I-ORG      I-ORG
Holding     I-ORG     I-ORG      E-ORG
discussed   O         O          O
the         O         O          O
Chicago     I-LOC     B-LOC      S-LOC
route       O         O          O
.           O         O          O
Figure 8.7 NER as a sequence model, showing IO, BIO, and BIOES taggings.

We've also shown two variant tagging schemes: IO tagging, which loses some information by eliminating the B tag, and BIOES tagging, which adds an end tag E for the end of a span, and a span tag S for a span consisting of only one word. A sequence labeler (HMM, CRF, RNN, Transformer, etc.) is trained to label each token in a text with tags that indicate the presence (or absence) of particular kinds of named entities.
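To see how the bracketed notation maps onto per-token tags, here is a small sketch that converts labeled spans into BIO labels. The span format (start, end-exclusive, type) and the helper name are assumptions for illustration.

```python
def spans_to_bio(tokens, spans):
    """Encode labeled entity spans as per-token BIO tags.
    `spans` is a list of (start, end, type) with `end` exclusive --
    this span format is an assumption for illustration."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

tokens = "Jane Villanueva of United Airlines Holding discussed the Chicago route .".split()
spans = [(0, 2, "PER"), (3, 6, "ORG"), (8, 9, "LOC")]
print(list(zip(tokens, spans_to_bio(tokens, spans))))
# Jane/B-PER Villanueva/I-PER of/O United/B-ORG Airlines/I-ORG Holding/I-ORG
# discussed/O the/O Chicago/B-LOC route/O ./O   (matches the BIO column of Fig. 8.7)
```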

BIO Tagging
How can we turn this structured problem into a sequence problem like POS tagging, with one label per word?

[PER Jane Villanueva] of [ORG United], a unit of [ORG United Airlines Holding], said the fare applies to the [LOC Chicago] route.

Now we have one tag per token! (See the BIO column of Fig. 8.7.)

BIO Tagging
B: token that begins a span
I: tokens inside a span
O: tokens outside of any span
Number of tags (where n is the number of entity types): 1 O tag, n B tags, n I tags, for a total of 2n+1 tags.
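The 2n+1 count is easy to check programmatically; here is a tiny sketch (the helper name is illustrative).

```python
def bio_tagset(entity_types):
    """The BIO tag inventory: one O tag plus B- and I- tags per entity type (2n+1 tags)."""
    return ["O"] + [prefix + "-" + t for t in entity_types for prefix in ("B", "I")]

print(bio_tagset(["PER", "LOC", "ORG", "GPE"]))
# ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-GPE', 'I-GPE']  -> 9 = 2*4 + 1 tags
```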

BIO Tagging variants: IO and BIOES (see Fig. 8.7)

Standard algorithms for NER

Supervised Machine Learning, given a human-labeled training set of text annotated with tags:
• Hidden Markov Models
• Conditional Random Fields (CRF) / Maximum Entropy Markov Models (MEMM)
• Neural sequence models (RNNs or Transformers)
• Large Language Models (like BERT), finetuned
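Whatever the model, its per-token BIO output usually has to be decoded back into entity spans for downstream use. Here is a small sketch of that decoding, using the same assumed (start, end-exclusive, type) span format as before; the lenient handling of stray I- tags is one common convention, not the only one.

```python
def bio_to_spans(tags):
    """Decode per-token BIO tags back into (start, end, type) spans, end exclusive.
    A stray I- tag that does not continue the current entity starts a new span."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if start is not None:              # close the span in progress
                spans.append((start, i, etype))
            start, etype = i, tag[2:]          # open a new span
        elif tag == "O":
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:                      # close a span that runs to the end
        spans.append((start, len(tags), etype))
    return spans

tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "B-LOC", "O", "O"]
tokens = "Jane Villanueva of United Airlines Holding discussed the Chicago route .".split()
print([(" ".join(tokens[s:e]), t) for s, e, t in bio_to_spans(tags)])
# [('Jane Villanueva', 'PER'), ('United Airlines Holding', 'ORG'), ('Chicago', 'LOC')]
```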