Toward Automatic Processing of English Metalanguage

Shomir Wilson*

School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK
School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213 USA

* This research was performed during a prior affiliation with the University of Maryland at College Park.

International Joint Conference on Natural Language Processing, pages 760–766, Nagoya, Japan, 14-18 October 2013.

Abstract

The metalinguistic facilities of natural language are crucial to our ability to communicate, but the patterns behind the appearance of metalanguage, and thus the clues for how we may instruct computers to detect it, have remained relatively unknown. This paper describes the first results on the feasibility of automatically identifying metalanguage in English text. A core metalinguistic vocabulary has been identified, supporting intuitions about the phenomenon and aiding in its detection and delineation. These results open the door to applications that can extract the direct, salient information that metalanguage encodes.

1 Introduction

In linguistic communication it is sometimes necessary to refer to features of language, such as orthography, vocabulary, structure, pragmatics, or meaning. Metalanguage enables a speaker to select a linguistically-relevant referent over (or in addition to) other typical referents (Audi, 1995). Metalanguage is illustrated in sentences such as

(1) Graupel refers to a kind of precipitation.
(2) The name is actually Rolla.
(3) Keep tabs on is a colloquial phrase.
(4) He wrote “All gone” and nothing more.

The roles of the bold substrings in the above sentences contrast with those in (5)-(8) below:

(5) Graupel fell on the weary hikers.
(6) Rolla is a small town.
(7) Keep tabs on him, will you?
(8) They were all gone when I returned.

Conventional stylistic cues, such as italics in (1), (2), and (3) and quotation marks in (4), sometimes help the audience to recognize metalinguistic statements in written language. In spoken language or in written contexts where stylistic cues are not used, the audience is expected to identify metalinguistic statements using paralinguistic cues (such as intonation, when speaking) or context and meaning.

Metalanguage is both pervasive and, paradoxically, the subject of limited attention in research on language technologies. The ability to produce and understand metalanguage is a core linguistic competence that allows humans to converse flexibly, unrestricted by domain (Anderson et al., 2002). Humans use it to establish grounding, verify audience understanding, and maintain communication channels (Anderson et al., 2004). Metalanguage encodes direct and salient information about language, but many typical examples thwart parsers with novel word usage or arrangement (Wilson, 2011a). Metalanguage is difficult to classify through the interpretive lens of word senses, given that conventional word senses have little relevance when a word appears chiefly “as a word”. The roles of metalanguage in L2 language acquisition (Hu, 2010), expression of sentiment toward others’ utterances (Jaworski et al., 2004), and irony (Sperber and Wilson, 1981) have also been noted.

This paper describes the results of the first effort to automatically identify instances of metalanguage in English text. Mentioned language, a common variety of metalanguage, is focused upon for its explicit, direct nature, which makes its structure and meaning easily accessible once an instance is identified. Section 2 reviews a prior project by Wilson (2012) to create a corpus of instances of metalanguage, a necessary resource for the present effort. Section 3 describes an approach to distinguishing sentences that contain metalanguage from those that do not, a task referred to as detection for brevity. Results show that the performance of this approach roughly matches an implied performance ceiling of inter-annotator agreement. Section 4 describes an approach to delineate sequences of words that are directly mentioned by a metalinguistic statement; although the results are preliminary, its accuracy shows promise for future development. Together, these results on detection and delineation show the feasibility of enabling language technologies to extract the salient information about language that metalanguage contains.

2 Background

The reader is likely to be familiar with the concept of metalanguage, but a discussion is appropriate to ground the concept and connect to previous work. Section 2.1 summarizes a prior study (Wilson, 2012) to collect instances of metalanguage, and 2.2 reviews some related efforts.

2.1 Prior Work

A diverse variety of phenomena in natural language satisfy the intuitive criteria that we associate with metalanguage. The prior study focused on identifying sentences that contained mentioned language, a phenomenon defined below:

Definition: For T a token or a set of tokens in a sentence, if T is produced to draw attention to a property of the token T or the type of T, then T is an instance of mentioned language.¹

Here, a token is an instantiation of a linguistic entity (e.g., a letter, symbol, sound, word, phrase, or other related entity), and a property is an ostension of language (García-Carpintero, 2004; Saka, 2006), such as spelling, pronunciation, meaning (for a variety of interpretations of meaning), structure, connotation, or quotative source. Generally attention is drawn to the type of T (for example, in Sentences (1)-(4)), but it can be drawn to the token of T for self-reference, as in Sentence (9):

(9) “The” appears between quote marks.

Although constructions like (9) are unusual and carry less practical value, the definition accommodates them for completeness.

Mentioned language is a common form of metalanguage, used to perform the full variety of language tasks discussed in the introduction. However, other metalinguistic constructions draw attention to tokens outside of the referring sentence. Some examples of this are (10)-(12) below. Supporting contexts are not shown for these sentences, though such contexts are easily imagined:

(10) Disregard the last thing I said.
(11) That spelling, is it correct?
(12) People don’t use those words lightly.

In each of the above three sentences, a linguistic entity (an utterance, a sequence of letters, and a sequence of words, respectively) is referred to, but the referent is contained in a separate sentence. The referent may have been produced by a different utterer or appeared in a different medium (e.g., speaking aloud while referring to written text). These “extra-sentential” forms of metalanguage have clear value to understanding discourse and coreference. The focus on mentioned language is a limitation to the present work, to utilize an existing corpus and to apply tractable boundaries to the identification tasks.

The mentioned language corpus of the prior study² was constructed by filtering a large volume of sentences with a heuristic, followed by annotation by a human reader. A randomly selected subset of articles from English Wikipedia was chosen as a source for text because of its representation of a large sample of English writers (Adler et al., 2008), the rich frequency of mentioned language in its text, and the frequent use of stylistic cues in its text that delimit mentioned language (i.e., bold text, italic text, and quotation marks). Sentences were sought that contained at least one of these stylistic cues and a mention-significant word in close proximity. Mention-significant words were a set of 8,735 words and collocations with potential metalinguistic significance (e.g., word, symbol, call), extracted from the WordNet lexical ontology (Fellbaum, 1998). Phrases highlighted by the stylistic cues were considered candidate instances, and these were labeled by a human reader, who determined that 629 sentences were mention sentences (i.e., containing instances of mentioned language) and the remaining 1,764 were not. Mention sentences were categorized based on functional properties that emerged during categorization. Table 1 shows some examples of collected mention sentences in each category.

¹ This definition was introduced by Wilson (2011a) along with a practical rubric for evaluating candidate sentences. For brevity, its full justification is not reproduced here.
² The corpus is available at http://www.cs.cmu.edu/~shomir/um_corpus.html.

Category: Examples

Words as Words
    The IP Multimedia Subsystem architecture uses the term transport plane to describe a function roughly equivalent to the routing control plane.
    The material was a heavy canvas known as duck, and the brothers began making work pants and shirts out of the strong material.
Names as Names
    Digeri is the name of a Thracian tribe mentioned by Pliny the Elder, in The Natural History.
    Hazrat Syed Jalaluddin Bukhari's descendants are also called Naqvi al-Bukhari.
Spelling or Pronunciation
    The French changed the spelling to bataillon, whereupon it directly entered into German.
    Welles insisted on pronouncing

es in conversational English. A lack of phrase-level annotations in their corpus as well as substantial noise made it suboptimal for the present effort. However, it is possible (if not likely) that the indicators of metalanguage differ between written and spoken English, lending importance to the Anderson corpus as a resource.

Metalanguage has a long history of theoretical treatments, which chiefly explained the mechanics of selected examples of the phenomenon. Many addressed it through the related topic of quotation (Cappelen and Lepore, 1997; Davidson, 1979; Maier, 2007; Quine, 1940; Tarski, 1933), and others previously cited in this paper discussed it directly as metalanguage or the use-mention distinction. The definition of mentioned language in Section 2.1 was a synthesis of the most empirically-compatible theoretical treatments, and the present effort to automatically identify metalanguage builds on that synthesis.
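To make the candidate-filtering heuristic of Section 2.1 concrete (a stylistic cue with a mention-significant word in close proximity), the following is a minimal sketch and not the prior study's implementation. Several details are assumptions made for illustration: the cue markup (Wikipedia-style '' and ''' delimiters plus double quotes), the three-token window standing in for "close proximity", the tiny vocabulary standing in for the 8,735 WordNet-derived entries, and the names MENTION_WORDS, CUE_PATTERN, WINDOW, and candidate_instances.

```python
import re

# Hypothetical stand-in for the 8,735 mention-significant words and collocations.
# Only "word", "symbol", and "call" are named in the text; the rest are illustrative.
MENTION_WORDS = {"word", "symbol", "call", "called", "term", "name", "phrase"}

# Assumed cue markup: '''bold''', ''italic'', or "double quotes".
CUE_PATTERN = re.compile(r"'''(.+?)'''|''(.+?)''|\"(.+?)\"")

# Assumed reading of "close proximity": this many tokens on either side of the cue.
WINDOW = 3


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text.lower())


def candidate_instances(sentence):
    """Return cue-delimited phrases that have a mention-significant word nearby."""
    candidates = []
    for match in CUE_PATTERN.finditer(sentence):
        # Exactly one of the three groups is populated for any match.
        phrase = next(g for g in match.groups() if g is not None)
        before = tokenize(sentence[:match.start()])[-WINDOW:]
        after = tokenize(sentence[match.end():])[:WINDOW]
        if MENTION_WORDS.intersection(before + after):
            candidates.append(phrase)
    return candidates
```

A filter of this kind only proposes candidates. In the prior study each candidate was then labeled by a human reader, which is why 1,764 of the filtered sentences were ultimately judged not to be mention sentences.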