How to Distinguish a Kidney Theft from a Death Car? Experiments in Clustering Urban-Legend Texts

Roman Grundkiewicz, Filip Graliński
Adam Mickiewicz University
Faculty of Mathematics and Computer Science
Poznań, Poland

Proceedings of the Workshop on Information Extraction and Knowledge Acquisition, pages 29–36, Hissar, Bulgaria, 16 September 2011.

Abstract

This paper discusses a system for automatic clustering of urban-legend texts. An urban legend (UL) is a short story set in the present day, believed by its tellers to be true and spreading spontaneously from person to person. A corpus of Polish UL texts was collected from message boards and blogs. Each text was manually assigned to one story type. The aim of the presented system is to reconstruct the manual grouping of texts. It turned out that automatic clustering of UL texts is feasible, but it requires techniques different from the ones used for clustering e.g. news articles.

1 Introduction

An urban legend is a short story set in the present day, believed by its tellers to be true and spreading spontaneously from person to person, often including elements of humour, moralizing or horror. Urban legends are a form of modern folklore, just as traditional folk tales were a form of traditional folklore, so it is no wonder that they are of great interest to folklorists and other social scientists (Brunvand, 1981). As urban legends (and related rumours) often convey misinformation with regard to controversial issues, they can draw the attention of the general public as well¹.

The traditional way of collecting urban legends was to interview informants, to tape-record their narrations and to transcribe the recordings (Brunvand, 1981). With the exponential growth of the Internet, more and more urban-legend texts turn up on message boards or blogs and in social media in general. As the web circulation of legends is much easier to tap than the oral one, it becomes feasible to envisage a system for the machine identification and collection of urban-legend texts. In this paper, we discuss the first steps towards the creation of such a system. We concentrate on the task of clustering urban legends, trying to reproduce automatically the results of the manual categorisation of urban-legend texts done by folklorists.

In Sec. 2 a corpus of 697 Polish urban-legend texts is presented. The techniques used in pre-processing the corpus texts are discussed in Sec. 3, and the clustering process in Sec. 4. We present the clustering experiment in Sec. 5 and the results in Sec. 6.

¹ Snopes.com, an urban legends reference web-site, is ranked #2,650 worldwide by Alexa.com (as of May 3, 2011), see http://www.alexa.com/siteinfo/snopes.com

2 Corpus

The corpus of N = 697 Polish urban-legend texts was manually collected from the Web, mainly from message boards and blogs². The corpus was not gathered with the experiments of this study in mind, but rather for the purposes of a web-site dedicated to the collection and documentation of the Polish web folklore³. The following techniques were used for the extraction of web pages with urban legends:

1. querying the Google search engine with formulaic expressions typical of the genre of urban legends, like znajomy znajomego (= friend of a friend), słyszałem taką opowieść (= I heard this story) (Graliński, 2009),

2. given a particular story type and its examples, querying the search engine with various combinations of the keywords specific for the story type, their synonyms and paraphrases,

3. collecting texts and links submitted by readers of the web-site mentioned above,

4. analysing backlinks to the web-site mentioned above (a message board post containing an urban legend is sometimes accompanied by a reply "it was debunked here: [link]" or similar),

5. collecting urban-legend texts occurring on the same web page as a text found using the methods (1)-(4), e.g. a substantial number of threads like "tell an interesting story", "which urban legends do you know?" or similar were found on various message boards.

The web pages containing urban legends were saved and organised with the Firefox Scrapbook add-on⁴.

Admittedly, method (2) may be favourable to clustering algorithms. This method, however, accounted for about 30% of texts⁵ and, what's more, the synonyms and paraphrases were prepared manually (without using any lexicons), some of them being probably difficult to track by clustering algorithms.

Urban-legend texts were manually categorised into 62 story types, mostly according to (Brunvand, 2002) with the exception of a few Polish legends unknown in the United States. The number of texts in each group varied from 1 to 37.

For the purposes of the experiments described in this paper, each urban-legend text was manually delimited and marked. Usually a whole message board or blog post was marked, but sometimes (e.g. when an urban legend was just quoted in a longer post) the selection had to be narrowed down to one or two paragraphs. The problems of automatic text delimitation are disregarded in this paper.

Two sample urban-legend texts (both classified as the kidney theft story type) translated into English are provided below. The typographical and spelling errors of the original Polish texts are preserved in translation.

    Well yeah... the story maybe not too real, but that kids' organs are being stolen it's actually true!! More and more crimes of this kind have been reported, for example in the Lodz IKEA there was such a crime, to be more precse a girl was kidnapped from this place where parents leave their kids and go shopping (a mini kids play paradise).Yes the gril was brought back home but without a kidney... She is about 5 years old. And I know that becase this girl is my mom's colleague family... We cannot panic and hide our kids in the corners, or pass on such info via GG [Polish instant messenger] cause no this one's going believe, but we shold talk about such stuf, just as a warning...

    I don't know if you heard about it or not, but I will tell you one story which happened recently in Koszalin. In this city a very large shopping centre- Forum was opened some time ago. And as you know there are lots of people, commotion in such places. And it so happened that a couple with a kid (a girl, I guess she was 5,6 years old I don't know exactly) went missing. And you know they searched the shops etc themselves until in the end they called the police. And they thought was kidnapping, they say they waited for an information on a ransom when the girl was found barely alive without a kidney near Forum. Horrible...; It was a shock to me especially cause I live near Koszalin. I'm 15 myself and I have a little sister of a similar age and I don't know what I'd do if such a thing happened to her... Horrible...⁶

The two texts represent basically the same story of a kidney theft, but they differ considerably in detail and wording.

A smaller subcorpus of 11 story types and 83 legend texts was used during development.

Note that the corpus of urban-legend texts can be used for other purposes, e.g. as a story-level paraphrase corpus, as each time a given story is re-told by various people in their own words.

² The corpus is available at http://amu.edu.pl/~filipg/uls.tar.gz
³ See http://atrapa.net
⁴ http://amb.vis.ne.jp/mozilla/scrapbook/
⁵ The exact percentage is difficult to establish, as information on which particular method led to which text was not saved.
⁶ The original Polish text: http://www.samosia.pl/pokaz/341160/jestem_przerazona_jak_mozna_komus_podwedzic_dziecko_i_wyciac_mu_nerke_w_ch

3 Document Representation

For our experiments, we used the standard Vector Space Model (VSM), in which each document is represented as a vector in a multidimensional space. The selection of terms, each corresponding to a dimension, often depends on the distinctive nature of the texts. In this section, we describe some text processing methods, the aim of which is to increase the similarity of documents of related topics (i.e. urban legends of the same story type) and decrease the similarity of documents of different topics.

3.1 Stop Words

A list of standard Polish stop words (function words, most frequent words etc.) was used during the normalization. The stop list was obtained by combining various resources available on the Internet⁷. The final stop list contained 642 words.

We decided to expand the stop list with some domain-specific and non-standard types of words.

[...] example in a mathematical text one can expect words like function, theorem, equation, etc. to occur, no matter which topic, i.e. branch of mathematics (algebra, geometry or mathematical analysis), is involved.

Constructing the list of domain keywords from word frequencies in the collection of documents alone may be insufficient. External human knowledge might be used for specifying such words. We decided to add the words specific to the genre of urban legends, such as:

• words expressing family and interpersonal relations, such as znajomy (= friend), kolega (= colleague), kuzyn (= cousin) (urban legends are usually claimed to happen to a friend of a friend, a cousin of a colleague etc.),