Content Extraction and Lexical Analysis from Customer-Agent Interactions
Lexical Analysis and Content Extraction from Customer-Agent Interactions

Sergiu Nisioi, Anca Bucur, Liviu P. Dinu
Human Language Technologies Research Center, Center of Excellence in Image Studies, University of Bucharest
{sergiu.nisioi,ldinu}@fmi.unibuc.ro, [email protected]

Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 132-136, Brussels, Belgium, November 1, 2018. (c) 2018 Association for Computational Linguistics

Abstract

In this paper, we provide a lexical comparative analysis of the vocabulary used by customers and agents in an Enterprise Resource Planning (ERP) environment, and a potential solution to clean the data and extract relevant content for NLP. As a result, we demonstrate that the actual vocabulary of the language that prevails in ERP conversations is highly divergent from the standardized dictionary, and further different from general language usage as extracted from the Common Crawl corpus. Moreover, in specific business communication circumstances, where a high usage of standardized language would be expected, code switching and non-standard expressions are predominant, emphasizing once more the discrepancy between the day-to-day use of language and the standardized one.

1 Introduction

It is often the case for companies that use customer relationship management software to collect large amounts of noisy data from the interactions of their customers with human agents. The customer-agent communication can take place over a wide range of channels, from speech, live chat, and email to some other application-level protocol wrapped over SMTP. If such data is stored in a structured manner, companies can use it to optimize procedures, retrieve information quickly, and decrease redundancy, which overall can prove beneficial for their customers and, maybe more importantly, for the well-being of their employees working as agents, who can use technology to ease their day-to-day job.

In our paper, we work with email exchanges that have been previously stored as raw text or HTML dumps in a database. We attempt to bring up some possible issues in dealing with this kind of data lexically, from an NLP perspective, but also to put forward a solution for cleaning and extracting useful content from raw text. Given the large amounts of unstructured data that are being collected as email exchanges, we believe that our proposed method can be a viable solution for content extraction and cleanup as a preprocessing step for indexing and search, near-duplicate detection, accurate classification by categories, user intent extraction, or automatic reply generation.

We carry out our analysis for Romanian (ISO 639-1: ro), a Romance language spoken by almost 24 million people, but with a relatively limited number of NLP resources. The purpose of our approach is twofold: to provide a comparative analysis between how words are used in question-answer interactions between customers and call center agents (at the corpus level) and the language as it is standardized in an official dictionary, and to provide a possible solution to extract meaningful content that can be used in natural language processing pipelines. Last but not least, our hope is to increase the amount of digital resources available for Romanian by releasing parts of our data.

2 Data

While acknowledging the limits of a dictionary, we consider it a model of standardized words, and for this we make use of every morphological form defined in the Romanian Explicative Dictionary DEX[1] - an electronic resource containing both user-generated content and words normed by the Romanian Academy. We extract from the database a total of over 1.3 million words, including all the morphologically inflected forms. It is important to note here that the user-generated content is curated by volunteers, and that not every word appearing in the dictionary goes through an official normative process for the language. In consequence, this resource may contain various region-specific word forms, low-frequency or old terms, and other technical neologisms.

While a dictionary can provide the list of words, it certainly lacks context and the way language is used in a large written corpus. One of the largest corpora of Romanian texts consists of news articles extracted from Common Crawl[2]; recently, it has been considered (Bojar et al., 2016) a reliable resource for training a generic language model for modern standard Romanian, as part of the News task of the Workshop on Machine Translation 2016. This corpus contains 54 million words, covers general content not related to a specific topic, and, since it has been scraped from public websites, it is reasonable to assume it contains standard official Romanian text, grammatically and lexically correct.

The question-answer corpus consists of interactions saved from an Enterprise Resource Planning (ERP) environment within a private Romanian company.[3] All data has been anonymized before usage and personally identifiable information has been removed. The topics are highly business-specific, covering processes such as sales, human resources, inventory, marketing, and finance. The data consists of interactions in the form of tasks, requests, or questions (Q) and activities, responses, or answers (A). One question may have multiple answers, and the documents may contain email headers, footers, disclaimers, or even automatic messages. To alleviate the effect of noise on our analysis, we have implemented heuristics to remove automatic messages, signatures, disclaimers, and headers from emails.

Table 1: Question answering corpus size

                       questions    answers
  # tokens             7,297,400    11,370,417
  # types              4,425,651    4,439,299
  type / token ratio   0.6065       0.3904
  total tokens         18,667,817

The statistics regarding the size of the corpus are rendered in Table 1. We observe that the number of types (unique words) is quite similar for both questions and answers; however, the total number of words used in the responses is a magnitude larger than the one corresponding to questions. Considering that the type to token ratio is a reasonable indicator of lexical richness (Read, 2000), customers use a rich vocabulary to describe their problems, with a considerably high probability for new words to appear in the received queries, while agents show a more standardized, smaller vocabulary to formulate their replies.

3 Quantitative Lexical Analysis

We carry out a comparison at the lexical level, in particular by looking at the size and variety of the vocabulary with respect to a standard Romanian dictionary. We extract word2vec embeddings using CBOW with negative sampling (Mikolov et al., 2013; Řehůřek and Sojka, 2010) for three corpora: Common Crawl, the corpus of Questions, and the one containing Answers. The models are trained to prune out words with a frequency smaller than 5, shrinking the vocabulary to ensure that the included words have good vectorial representations. From those vocabularies, we discard numbers, punctuation, and other elements that are not contiguous sequences of characters.

We then use the vocabulary from the trained models and compare it against the entire Romanian dictionary of inflected forms. For the latter, we build two versions: one which contains diacritics, and a second one which contains words both with and without diacritics.

For each vocabulary at hand we perform two simple measurements:

1. overlap - the percentage of overlap between one vocabulary and another;

2. diff English - the percentage of differences between one vocabulary and another that are part of an English WordNet synset (Fellbaum, 1998).

These basic measurements should give an indicator of how much of the vocabulary used in our ERP data is covered by the generic resources available for Romanian, and how important domain adaptation is for being able to correctly process the texts.

Table 2: Comparison of overlapping dictionaries

                            Questions   Answers   Common Crawl
  Vocabulary size           21,914      25,493    148,980
  Dict diacr. overlap       41.75       40.65     42.22
  Dict no diacr. overlap    55.51       52.96     60.87
  Answers overlap           67.87       -         10.59
  Answers diff English      4.83        -         7.28
  Questions overlap         -           58.34     8.96
  Questions diff English    -           20.95     8.04
  C. Crawl overlap          60.9        61.86     -
  C. Crawl diff English     7.13        13.17     -

Table 2 contains the values for these measurements in a pair-wise fashion between each vocabulary: the dictionary with and without diacritics, the questions and answers vocabularies, and the Common Crawl model vocabulary. We also compare the vocabularies extracted from our corpora against the dictionary with diacritics removed, as informal Romanian is often written without diacritics. The second and third rows show an increase in overlapping percentage, regardless of the vocabulary, when the diacritics are ignored, which indicates that even official news articles contain non-standard words and/or omissions of diacritics.

... abbreviations of specific terms (e.g., exemplu (example) - ex; factura (invoice) - fact, fc), being more robust to noise and free-form lexical variations in language.

Table 3: Average number of features / question or answer

                    questions   answers
  function words    17.22       16.47
  pronouns          5.11        4.78
  sentences         14.81       11.29
  token length      4.91        5.27

At last, in Table 3, we count the average number of content-independent features (function words, pronouns, number of sentences, and average token length) that appear in both questions and answers. These features can provide information regarding the style of a text (Chung and Pennebaker, 2007), being extensively used in previous research

[1] https://dexonline.ro
[2] http://commoncrawl.org
[3] The resources are released at https://github.com/senisioi/ro_resources
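The three measurements discussed above (type/token ratio, pairwise vocabulary overlap, and the diff-English share) reduce to simple set operations. A minimal sketch, with toy corpora and a small stand-in word set instead of the English WordNet synsets used in the paper:

```python
# Sketch of the paper's vocabulary measurements. The example corpora and
# the toy "english" set are illustrative assumptions, not the actual data.

def type_token_ratio(tokens):
    """Number of distinct words (types) divided by total words (tokens)."""
    return len(set(tokens)) / len(tokens)

def overlap_pct(vocab_a, vocab_b):
    """Percentage of vocab_a that also appears in vocab_b."""
    return 100.0 * len(vocab_a & vocab_b) / len(vocab_a)

def diff_english_pct(vocab_a, vocab_b, english_words):
    """Percentage of vocab_a \\ vocab_b found in an English word list
    (the paper checks membership in an English WordNet synset)."""
    diff = vocab_a - vocab_b
    return 100.0 * len(diff & english_words) / len(diff)

questions = "va rog sa verificati factura pentru factura gresita".split()
answers = "am verificat factura invoice ok va rog reveniti".split()
english = {"invoice", "ok", "update"}

q_vocab, a_vocab = set(questions), set(answers)
print(round(type_token_ratio(questions), 3))          # 0.875 (7 types / 8 tokens)
print(round(overlap_pct(q_vocab, a_vocab), 2))        # 42.86 (3 of 7 shared)
print(round(diff_english_pct(a_vocab, q_vocab, english), 2))  # 40.0
```

On the real corpora, a lower type/token ratio for answers than for questions is what supports the paper's claim that agents reply from a smaller, more standardized vocabulary.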
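The "no diacritics" dictionary variant described in Section 3 can be built by mapping each of the five Romanian diacritics to its base letter and keeping both forms. A sketch (the legacy cedilla forms ş/ţ, still common in older Romanian resources, are included; the sample words are illustrative):

```python
# Build a dictionary variant containing words both with and without
# diacritics, mirroring the paper's second comparison dictionary.
STRIP = str.maketrans({
    "ă": "a", "â": "a", "î": "i", "ș": "s", "ț": "t",
    "ş": "s", "ţ": "t",  # legacy cedilla variants
    "Ă": "A", "Â": "A", "Î": "I", "Ș": "S", "Ț": "T",
    "Ş": "S", "Ţ": "T",
})

def strip_diacritics(word):
    """Return the diacritic-free spelling of a Romanian word."""
    return word.translate(STRIP)

def with_and_without_diacritics(dictionary):
    """Union of the original entries and their diacritic-free forms."""
    return set(dictionary) | {strip_diacritics(w) for w in dictionary}

dex_sample = {"mașină", "țară", "înțeles"}
print(sorted(with_and_without_diacritics(dex_sample)))
```

Matching corpus words against this expanded set is what raises the overlap figures in rows two and three of Table 2 once diacritics are ignored.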
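The paper's cleanup heuristics for removing automatic messages, signatures, disclaimers, and headers from emails are not specified in this excerpt; the sketch below shows one plausible line-based approach. The regular expressions and the sample message are assumptions for illustration, not the authors' actual rules:

```python
import re

# Hypothetical line-level heuristics: drop quoted-reply metadata lines,
# stop at the "--" signature delimiter, and skip disclaimer boilerplate.
REPLY_HEADER = re.compile(r"^(From|Sent|To|Subject|De la|Trimis):", re.I)
DISCLAIMER = re.compile(r"(confidential|disclaimer|think before printing)", re.I)

def clean_email(text):
    kept = []
    for line in text.splitlines():
        if line.strip() == "--":            # signature delimiter: drop the rest
            break
        if REPLY_HEADER.match(line.strip()):  # quoted-reply metadata
            continue
        if DISCLAIMER.search(line):           # boilerplate disclaimer
            continue
        kept.append(line)
    return "\n".join(kept).strip()

raw = """Buna ziua,
Factura nu apare in sistem.
From: [email protected]
This message is confidential.
--
Ion Popescu
Senior Agent"""
print(clean_email(raw))  # keeps only the two content lines
```

In a real pipeline, such rules would be tuned per mail client and language, since signature and disclaimer conventions vary widely across senders.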