Australasian Language Technology Association Workshop 2018
Proceedings of the Workshop
Editors: Sunghwan Mac Kim and Xiuzhen (Jenny) Zhang
10–12 December 2018
The University of Otago, Dunedin, New Zealand

Australasian Language Technology Association Workshop 2018
(ALTA 2018)
http://alta2018.alta.asn.au
Online Proceedings: http://alta2018.alta.asn.au/proceedings
Gold Sponsors:
Silver Sponsor:
Bronze Sponsors:
Volume 16, 2018 ISSN: 1834-7037
ALTA 2018 Workshop Committees
Workshop Co-Chairs
• Xiuzhen (Jenny) Zhang (RMIT University)
• Sunghwan Mac Kim (CSIRO Data61)
Publication Co-chairs
• Sunghwan Mac Kim (CSIRO Data61)
• Xiuzhen (Jenny) Zhang (RMIT University)
Programme Committee
• Abeed Sarker, University of Pennsylvania
• Alistair Knott, University of Otago
• Andrea Schalley, Karlstads Universitet
• Ben Hachey, The University of Sydney and Digital Health CRC
• Benjamin Borschinger, Google
• Bo Han, Accenture
• Daniel Angus, University of Queensland
• Diego Mollá, Macquarie University
• Dominique Estival, Western Sydney University
• Gabriela Ferraro, CSIRO Data61
• Gholamreza Haffari, Monash University
• Hamed Hassanzadeh, Australian e-Health Research Centre CSIRO
• Hanna Suominen, Australian National University and Data61/CSIRO
• Jennifer Biggs, Defence Science Technology
• Jey-Han Lau, IBM Research
• Jojo Wong, Monash University
• Karin Verspoor, The University of Melbourne
• Kristin Stock, Massey University
• Laurianne Sitbon, Queensland University of Technology
• Lawrence Cavedon, RMIT University
• Lizhen Qu, CSIRO Data61
• Long Duong, Voicebox Technology Australia
• Maria Kim, Defence Science Technology
• Massimo Piccardi, University of Technology Sydney
• Ming Zhou, Microsoft Research Asia
• Nitin Indurkhya, University of New South Wales
• Rolf Schwitter, Macquarie University
• Sarvnaz Karimi, CSIRO Data61
• Scott Nowson, Accenture
• Shervin Malmasi, Macquarie University and Harvard Medical School
• Shunichi Ishihara, Australian National University
• Stephen Wan, CSIRO Data61
• Teresa Lynn, Dublin City University
• Timothy Baldwin, The University of Melbourne
• Wei Gao, Victoria University of Wellington
• Wei Liu, University of Western Australia
• Will Radford, Canva
• Wray Buntine, Monash University
Preface
This volume contains the papers accepted for presentation at the Australasian Language Technology Association Workshop (ALTA) 2018, held at The University of Otago in Dunedin, New Zealand, on 10–12 December 2018.

The goals of the workshop are to:

• bring together the Language Technology (LT) community in the Australasian region and encourage interactions and collaboration;
• foster interaction between academic and industrial researchers, to encourage dissemination of research results;
• provide a forum for students and young researchers to present their research;
• facilitate the discussion of new and ongoing research and projects;
• increase visibility of LT research in Australasia and overseas and encourage interactions with the wider international LT community.

This year's ALTA Workshop presents 10 peer-reviewed papers: 6 long papers and 4 short papers. We received a total of 17 long and short paper submissions. Each paper was reviewed by three members of the programme committee using a double-blind protocol, and great care was taken to avoid all conflicts of interest.

ALTA 2018 includes a presentations track, following the workshops since 2015, when the track was first introduced. The track aims to encourage broader participation and to facilitate local socialisation of international results, including work in progress and work submitted or published elsewhere. Presentations were lightly reviewed by the ALTA chairs, to gauge the overall quality of the work and whether it would be of interest to the ALTA community. Offering both archival and presentation tracks allows us to raise the standard of work at ALTA and to better showcase the excellent research being done locally.

ALTA 2018 continues the tradition of including a shared task, this year on classifying patent applications. Participation is summarised in an overview paper by the organisers, Diego Mollá and Dilesha Seneviratne.
Participants were invited to submit system description papers, which are included in this volume without review.

We would like to thank, in no particular order: all of the authors who submitted papers; the programme committee, for the time and effort they put into maintaining the high standards of our reviewing process; the shared task organisers, Diego Mollá and Dilesha Seneviratne; our keynote speakers, Alistair Knott and Kristin Stock, for agreeing to share their perspectives on the state of the field; and the tutorial presenter, Phil Cohen, for his tutorial on collaborative dialogue. We would also like to acknowledge the constant support and advice of the ALTA Executive Committee on matters such as budgets and sponsorship.

Finally, we gratefully recognise our sponsors: CSIRO/Data61, Soul Machines, Google, IBM, Seek and the ARC Centre of Excellence for the Dynamics of Language. Importantly, their generous support enabled us to offer travel subsidies to all students presenting at ALTA, and helped to subsidise conference catering costs and student paper awards.
Xiuzhen (Jenny) Zhang and Sunghwan Mac Kim
ALTA Programme Chairs
ALTA 2018 Programme
Monday, 10 December 2018

Tutorial Session (Room 1.19)
14:00–17:00 Tutorial: Phil Cohen. Towards Collaborative Dialogue
17:00 End of Tutorial
Tuesday, 11 December 2018

Opening & Keynote (Room 1.17)
9:00–9:15 Opening
9:15–10:15 Keynote 1 (from ADCS): Jon Degenhardt. An Industry Perspective on Search and Search Applications
10:15–10:45 Morning tea

Session A: Text Mining & Applications (Room 1.19)
10:45–11:05 Paper: Rolando Coto Solano, Sally Akevai Nicholas and Samantha Wray. Development of Natural Language Processing Tools for Cook Islands Māori
11:05–11:25 Paper: Bayzid Ashik Hossain and Rolf Schwitter. Specifying Conceptual Models Using Restricted Natural Language
11:25–11:45 Presentation: Jenny McDonald and Adon Moskal. Quantext: a text analysis tool for teachers
11:45–11:55 Paper: Xuanli He, Quan Tran, William Havard, Laurent Besacier, Ingrid Zukerman and Gholamreza Haffari. Exploring Textual and Speech information in Dialogue Act Classification with Speaker Domain Adaptation
11:55–13:00 Lunch
13:00–14:00 Keynote 2: Alistair Knott (Room 1.17). Learning to talk like a baby
14:00–14:15 Break
14:15–15:15 Poster Session 1 (ALTA & ADCS)
15:15–15:45 Afternoon tea
15:45–16:50 Poster Session 2 (ALTA & ADCS)
16:50 End of Day 1
Wednesday, 12 December 2018

9:00–10:15 Keynote 3 (from ADCS): David Bainbridge (Room 1.17). Can You Really Do That? Exploring new ways to interact with Web content and the desktop
10:15–10:45 Morning tea

Session B: Machine Translation & Speech (Room 1.19)
10:45–11:05 Paper: Cong Duy Vu Hoang, Gholamreza Haffari and Trevor Cohn. Improved Neural Machine Translation using Side Information
11:05–11:25 Presentation: Qiongkai Xu, Lizhen Qu and Jiawei Wang. Decoupling Stylistic Language Generation
11:25–11:45 Paper: Satoru Tsuge and Shunichi Ishihara. Text-dependent Forensic Voice Comparison: Likelihood Ratio Estimation with the Hidden Markov Model (HMM) and Gaussian Mixture Model – Universal Background Model (GMM-UBM) Approaches
11:45–11:55 Paper: Nitika Mathur, Timothy Baldwin and Trevor Cohn. Towards Efficient Machine Translation Evaluation by Modelling Annotators
11:55–12:55 Lunch
12:55–13:55 Keynote 4: Kristin Stock (Room 1.17). "Where am I, and what am I doing here?" Extracting geographic information from natural language text
13:55–14:05 Break

Session C: Shared session with ADCS (Room 1.17)
14:05–14:30 Paper (ADCS long paper): Alfan Farizki Wicaksono and Alistair Moffat. Exploring Interaction Patterns in Job Search
14:30–14:50 Paper: Xavier Holt and Andrew Chisholm. Extracting structured data from invoices
14:50–15:05 Paper (ADCS short paper): Bevan Koopman, Anthony Nguyen, Danica Cossio, Mary-Jane Courage and Gary Francois. Extracting Cancer Mortality Statistics from Free-text Death Certificates: A View from the Trenches
15:05–15:15 Paper: Hanieh Poostchi and Massimo Piccardi. Cluster Labeling by Word Embeddings and WordNet's Hypernymy
15:15–15:35 Afternoon tea

Session D: Word Semantics (Room 1.19)
15:35–15:55 Paper: Lance De Vine, Shlomo Geva and Peter Bruza. Unsupervised Mining of Analogical Frames by Constraint Satisfaction
15:55–16:05 Paper: Navnita Nandakumar, Bahar Salehi and Timothy Baldwin. A Comparative Study of Embedding Models in Predicting the Compositionality of Multiword Expressions

Shared Task Session (Room 1.19)
16:05–16:15 Paper: Diego Mollá-Aliod and Dilesha Seneviratne. Overview of the 2018 ALTA Shared Task: Classifying Patent Applications
16:15–16:25 Paper: Fernando Benites, Shervin Malmasi and Marcos Zampieri. Classifying Patent Applications with Ensemble Methods
16:25–16:35 Paper: Jason Hepburn. Universal Language Model Fine-tuning for Patent Classification
16:35–16:45 Break
16:45–16:55 Best Paper Awards
16:55–17:20 Business Meeting & ALTA Closing
17:20 End of Day 2
Contents
Invited talks 1
Tutorials 3
Long papers 5
Improved Neural Machine Translation using Side Information Cong Duy Vu Hoang, Gholamreza Haffari and Trevor Cohn 6
Text-dependent Forensic Voice Comparison: Likelihood Ratio Estimation with the Hidden Markov Model (HMM) and Gaussian Mixture Model – Universal Background Model (GMM-UBM) Approaches Satoru Tsuge and Shunichi Ishihara 17
Development of Natural Language Processing Tools for Cook Islands Māori Rolando Coto Solano, Sally Akevai Nicholas and Samantha Wray 26
Unsupervised Mining of Analogical Frames by Constraint Satisfaction Lance De Vine, Shlomo Geva and Peter Bruza 34
Specifying Conceptual Models Using Restricted Natural Language Bayzid Ashik Hossain and Rolf Schwitter 44
Extracting structured data from invoices Xavier Holt and Andrew Chisholm 53
Short papers 60
Exploring Textual and Speech information in Dialogue Act Classification with Speaker Domain Adaptation Xuanli He, Quan Tran, William Havard, Laurent Besacier, Ingrid Zukerman and Gholamreza Haffari 61
Cluster Labeling by Word Embeddings and WordNet’s Hypernymy Hanieh Poostchi and Massimo Piccardi 66
A Comparative Study of Embedding Models in Predicting the Compositionality of Multiword Expressions Navnita Nandakumar, Bahar Salehi and Timothy Baldwin 71
Towards Efficient Machine Translation Evaluation by Modelling Annotators Nitika Mathur, Timothy Baldwin and Trevor Cohn 77
ALTA Shared Task papers 83
Overview of the 2018 ALTA Shared Task: Classifying Patent Applications Diego Mollá and Dilesha Seneviratne 84
Classifying Patent Applications with Ensemble Methods Fernando Benites, Shervin Malmasi, Marcos Zampieri 89
Universal Language Model Fine-tuning for Patent Classification Jason Hepburn 93
Invited keynotes
Alistair Knott (University of Otago & Soul Machines)
Learning to talk like a baby

In recent years, computational linguists have embraced neural network models, and the vector-based representations of words and meanings they use. But while computational linguists have readily adopted the machinery of neural network models, they have been slower to embrace the original aim of neural network research, which was to understand how brains work. A large community of neural network researchers continues to pursue this "cognitive modelling" aim, with very interesting results, but the work of these more cognitively minded modellers has not yet percolated deeply into computational linguistics. In my talk, I will argue that the cognitive modelling tradition of neural networks has much to offer computational linguistics. I will outline a research programme that situates language modelling in a broader cognitive context. The programme is distinctive in two ways. Firstly, the initial object of study is a baby, rather than an adult. Computational linguistics models typically aim to reproduce adult linguistic competence in a single training process that presents an "empty" network with a corpus of mature language. I will argue that this training process doesn't correspond to anything in human experience, and that we should instead aim to model a more gradual developmental process that first achieves babylike language, then childlike language, and so on. Secondly, the new programme studies the baby's language system as it interfaces with her other cognitive systems, rather than by itself. It pays particular attention to the sensory and motor systems through which a baby engages with the physical world, which are the primary means by which she activates semantic representations. I will argue that the structure of these sensorimotor systems, as expressed in neural network models, offers interesting insights into certain aspects of linguistic structure.
I will conclude by demoing a model of the interface between language and the sensorimotor system, as it operates in a baby at an early stage of language learning.
Kristin Stock (Massey University)
“Where am I, and what am I doing here?” Extracting geographic information from natural language text

The extraction of place names (toponyms) from natural language text has received a lot of attention in recent years, but location is frequently described in more complex ways, often using other objects as reference points. Examples include: ‘The accident occurred opposite the Orewa Post Office, near the pedestrian crossing’ or ‘the sample was collected on the west bank of the Waikato River, about 3km upstream from Huntly’. These expressions can be vague, imprecise or underspecified; they may rely on access to information about other objects in the environment; and the semantics of spatial relations like ‘opposite’ and ‘on’ are still far from clear. Furthermore, many of these kinds of expressions are context sensitive, and aspects such as scale, geometry and type of geographic feature may influence the way the expression is understood. Both machine learning and rule-based approaches have been developed to try, firstly, to parse expressions of this kind, and secondly, to determine the geographic location that the expression refers to. Several relevant projects will be discussed, including: the development of a semantic rather than syntactic approach to parsing geographic location descriptions; the creation of a manually annotated training set of geographic language; the challenges highlighted by human descriptions of location in the emergency services context; the interpretation and geocoding of descriptions of flora and fauna specimen collections; the development of models of spatial relations using social media data; and the use of instance-based learning to interpret complex location descriptions.
Tutorials
Towards Collaborative Dialogue
Phil Cohen (Monash University)

This tutorial will discuss a program of research for building collaborative dialogue systems, which are a core part of virtual assistants. I will briefly discuss the strengths and limitations of current approaches to dialogue, including neural network-based and slot-filling approaches, but then concentrate on approaches that treat conversation as planned collaborative behaviour. Collaborative interaction involves recognizing someone's goals, intentions, and plans, and then performing actions to facilitate them. People learn this basic capability at a very young age and are expected to be helpful as part of ordinary social interaction. In general, people's plans involve both speech acts (such as requests, questions and confirmations) and physical acts. When collaborative behaviour is applied to speech acts, people infer the reasons behind their interlocutor's utterances and attempt to ensure their success. Such reasoning is apparent when an information agent answers the question "Do you know where the Sydney flight leaves?" with "Yes, Gate 8, and it's running 20 minutes late." It is also apparent when one asks "Where is the nearest petrol station?" and the interlocutor answers "2 kilometers to your right", even though that station is not the closest one, but rather the closest one that is open. In this latter case, the respondent has inferred that you want to buy petrol, not just to know the location of the station. In both cases, the literal and truthful answer is not cooperative. In order to build systems that collaborate with humans or other artificial agents, a system needs components for planning, plan recognition, and reasoning about agents' mental states (beliefs, desires, goals, intentions, obligations, etc.). In this tutorial, I will discuss current theory and practice of such collaborative belief-desire-intention architectures, and demonstrate how they can form the basis for an advanced collaborative dialogue manager.
In such an approach, systems reason about what they plan to say, and why the user said what s/he did. Because there is a plan standing behind the system’s utterances, it is able to explain its reasoning. Finally, we will discuss potential methods for incorporating such a plan-based approach with machine-learned approaches.
Long papers
Improved Neural Machine Translation using Side Information
Cong Duy Vu Hoang†, Gholamreza Haffari‡ and Trevor Cohn†
† University of Melbourne, Melbourne, VIC, Australia
‡ Monash University, Clayton, VIC, Australia
[email protected], [email protected], [email protected]
Abstract vagueness, out-of-vocabulary terms, and topic rel- evancy. In this work, we investigate whether side Inspired from the idea of multi-modal transla- information is helpful in the context of tion, in our work, we propose the use of another neural machine translation (NMT). We modality, namely metadata or side information. study various kinds of side information, Previously, Hoang et al.(2016a) have shown the including topical information and personal usefulness of side information for neural language traits, and then propose different ways models. This work will investigate the potential of incorporating these information sources usefulness of side information for NMT models. into existing NMT models. Our experi- In our work, we target towards unstructured and mental results show the benefits of side in- heterogeneous side information which potentially formation in improving the NMT models. can be found in practical applications. Specifi- 1 Introduction cally, we investigate different kinds of side infor- mation, including topic keywords, personality in- Neural machine translation is the task of gener- formation and topic classification. Then we study ating a target language sequence given a source different methods with minimal efforts for incor- language sequence, framed as a neural network porating such side information into existing NMT (Sutskever et al., 2014; Bahdanau et al., 2015, in- models. ter alia). Most research efforts focus on inducing more prior knowledge (Cohn et al., 2016; Zhang 2 Machine Translation Data with Side et al., 2017; Mi et al., 2016, inter alia), incor- Information porating linguistics factors (Hoang et al., 2016b; First, let’s explore some realistic scenarios in Sennrich and Haddow, 2016; Garc´ıa-Mart´ınez which the side information is potentially useful for et al., 2017) or changing the network architecture NMT. 
(Gehring et al., 2017b,a; Vaswani et al., 2017; El- bayad et al., 2018) in order to better exploit the TED Talks The TED Talks website2 hosts tech- source representation. Consider a different direc- nical videos from influential speakers around the tion, situations in which there exists other modal- world on various topics or domains, such as: ed- ity other than the text of the source sentence. For ucation, business, science, technology, creativity, instance, the WMT 2017 campaign1 proposed to etc. Thanks to users’ contributions, most of such use additional information obtained from images videos are subtitled in multiple languages. Based to enrich the neural MT models, as in (Calixto on this website, Cettolo et al.(2012) created a et al., 2017; Matusov et al., 2017; Calixto and parallel corpus for the MT research community. Liu, 2017). This task, also known as multi-modal Inspired by this, Chen et al.(2016) further cus- translation, seeks to leverage images which can tomised this dataset and included an additional contain cues representing the perception of the sentence-level topic information.3 We consider image in source text, and potentially can con- such topic information as side information. Fig- tribute to resolve ambiguity (e.g., lexical, gender), 2https://www.ted.com/talks 1http://www.statmt.org/wmt17/ 3https://github.com/wenhuchen/ multimodal-task.html iwslt-2015-de-en-topics
Cong Duy Vu Hoang, Gholamreza Haffari and Trevor Cohn. 2018. Improved Neural Machine Translation using Side Information. In Proceedings of Australasian Language Technology Association Workshop, pages 6−16. ure1 illustrates some examples of this dataset. As codes explicitly provide a hierarchical classifica- can be seen, the keywords (second column, treated tion of patents according to various different ar- as side information) contain additional contextual eas of technology to which they pertain. This kind information that can provide complementary cues of side information can provide a useful signal for so as to better guide the translation process. Let’s MT task – which has not yet been fully exploited. take an example in Figure1 (TED Video Id 172), Figure3 gives us an illustrating excerpt of this cor- the keyword “art” provides cues for words and pus. We can see that each of sentence pair in this phrases in target sequence such as: “place, de- corpus is associated with any number of IPC la- sign”; whereas the keyword “tech” refers to “Me- bel(s) as well as other metadata, e.g., patent ID, dia Lab, computer science”. patent family ID, publication date. In this work, we consider only the IPC labels. The full mean- Personalised Europarl For the second dataset, ing of all IPC labels can be found on the official we evaluate our proposed idea in the context of IPC website,6 however we provide in Figure3 the personality-aware MT. Mirkin et al.(2015) ex- glossess for each referenced label. Note that those plored whether translation preserves personality IPC labels form a WordNet style hierarchy (Fell- information (e.g., demographic and psychometric baum, 1998), and accordingly may be useful in traits) in statistical MT (SMT); and further Rabi- many other deep models of NLP. 
novich et al.(2017) found that personality infor- mation like author’s gender is an obvious signal in 3 NMT with Side Information source text, but it is less clear in human and ma- chine translated texts. As a result, they created a We investigate different ways of incorporating new dataset for personalised MT4 partially based side information into the NMT model(s). on the original Europarl. The personality such as author’s gender will be regarded as side informa- 3.1 Encoding of Side Information tion in our setup. An excerpt of this dataset is In this work, we propose the use of unstructured shown in Figure2. As can be seen from the figure, heterogeneous side information, which is often there exist many kinds of side information pertain- available in practical datasets. Due the hetero- ing to authors’ traits, including identification (ID, geneity of side information, our techniques are name), native language, gender, date of birth/age, based on a bag-of-words (BOW) representation and plenary session date. Here, we will focus on of the side information, an approach which was the “gender” trait and evaluate whether it can have shown to be effective in our prior work (Hoang any benefits in the context of NMT complement- et al., 2016a). Each element of the side informa- ing the work of Rabinovich et al.(2017) attempted tion (a label, or word) is embedded using a matrix a similar idea as part of a SMT, rather than NMT, s Hs×|Vs| W e ∈ R , where |Vs| is the vocabulary of system. side information and Hs the dimensionality of the Patent MT Collection Another interesting data hidden space. These embedding vectors are used is patent translation which includes rich side in- for the input to several different neural architec- formation. PatTR5 is a sentence-parallel corpus tures, which we now outline. which is a subset of the MAREC Patent Cor- 3.2 NMT Model Formulation pus (Waschle¨ and Riezler, 2012a). 
In general, PatTR contains millions of parallel sentences col- Recall the general formulation of NMT (Sutskever lected from all patent text sections (e.g., title, ab- et al., 2014; Bahdanau et al., 2015, inter alia) as stract, claims, description) in multiple languages a conditional language model in which the gen- (English, French, German) (Waschle¨ and Riezler, eration of target sequence is conditioned on the 2012b; Simianer and Riezler, 2013). An appeal- source sequence (Sutskever et al., 2014; Bahdanau ing feature of this corpus is that it provides a la- et al., 2015, inter alia), formulated as: belling at a sentence level, in the form of IPC (In- ternational Patent Classification) codes. The IPC yt+1 ∼ pΘ (yt+1|y 4http://cl.haifa.ac.il/projects/pmt/ = softmax (fΘ (y 7 TED Video Id Keywords German Sentence English Sentence Aber das Media Lab ist ein interessanter Ort, und But the Media Lab is an interesting place, es ist wichtig für mich, denn ich studierte and it's important to me because as a 172 arts,tech ursprünglich Computerwissenschaften und erst student, I was a computer science später in meinem Leben habe ich Design undergrad, and I discovered design later on entdeckt. in my life. In anderen Worten, ich glaube, dass der So in other words, I believe that President 645 politics,issues,business französische Präsident Sarkozy recht hat, wenn Sarkozy of France is right when he talks er über eine Mittelmeer Union spricht. about a Mediterranean union. Eine andere Welt tat sich ungefähr zu dieser Zeit Another world was opening up around this 1193 recreation,arts,issues auf: Auftreten und Tanzen. time: performance and dancing. Dieses Gebäude beinhaltet die Weltgrößte This building contains the world's largest Kollektion von Sammlungen und Artefakten die collection of documents and artifacts 692 politics,arts,issues,env der Rolle der USA im Kampf auf der commemorating the U.S. role in fighting on Chinesischen Seite gedenken. 
In diesem langen the Chinese side in that long war -- the Flying Krieg -- die "fliegenden Tiger". Tigers. Es erlaubt uns, Kunst, Biotechnologie, Software It allows us to do the art, the biotechnology, 1087 politics,education und all solch wunderbaren Dinge zu schaffen. the software and all those magic things. Ich liebe Bartóks Musik, so wie Herr Teszler, und I love Bartok's music, as did Mr. Teszler, and 208 recreation,education,arts,issues er hatte wirklich jede Aufnahme Bartóks die es he had virtually every recording of Bartok's gab. music ever issued. Fig. 1 An example with side information (e.g., keywords) for MT with TED Talks dataset. English Sentence: Accordingly , I consider it essential that both the identification of cattle and the labelling of beef be introduced as quickly as possible on a compulsory basis . German Sentence: Entsprechend halte ich es auch für notwendig , daß die Kennzeichnung möglichst schnell und verpflichtend eingeführt wird , und zwar für Rinder und für Rindfleisch . Meta Info: EUROID="2209" NAME="Schierhuber" LANGUAGE="DE" GENDER="FEMALE" DATE_OF_BIRTH="31 May 1946" SESSION_DATE="97-02-19" AGE="50" English Sentence: Can the Commission say that it will seek to have sugar declared a sensitive product ? German Sentence: Kann die Kommission sagen , dass sie danach streben wird , Zucker zu einem sensiblen Produkt erklären zu lassen ? Meta Info: EUROID="22861" NAME="Ó Neachtain (UEN)." LANGUAGE="EN" GENDER="MALE" DATE_OF_BIRTH="22 May 1947" SESSION_DATE="03-09-02" AGE="56" English Sentence: For example , Brazil has huge concerns about the proposals because the poor and landless there will suffer if sugar production expands massively , as is predicted . German Sentence: So hegt beispielsweise Brasilien bezüglich der Vorschläge enorme Bedenken , denn wenn die Zuckerproduktion , wie vorhergesagt , massiv expandiert , wird das die Not der Armen und Landlosen dort noch verstärken . Meta Info: EUROID="28115" NAME="McGuinness (PPE-DE )." 
Fig. 2 An example with side information (e.g., the author's gender, highlighted in red in the original) for MT with the personalised Europarl dataset: a sentence pair ("The European citizens' initiative should be seen as an opportunity to involve people more closely in the EU's decision-making process." / "Die Europäische Bürgerinitiative ist als Chance zu werten, um die Menschen stärker in den Entscheidungsprozess der EU miteinzubeziehen.") accompanied by speaker metadata such as NAME="Ernst Strasser", LANGUAGE="DE", GENDER="MALE", and AGE="54".

Fig. 3 An example with side information (e.g., the IPC code, highlighted in red in the original) for MT with the PatTR dataset.

…where the probability p_Θ(·) of generating the next target word y_{t+1} is conditioned on the previously generated target words y_{1:t} and the source sequence.

…honorifics) as a suffix of the source sequence for controlling the politeness of the translated outputs.

…improve the main NMT task. We can define a generative model, formulated as:

    p(y, e | x) := p(y | x, e) · p(e | x),    (4)

where the first factor is the translation model and the second the classification model: p(y | x, e) is a translation model conditioned on the side information, as explained earlier, and p(e | x) can be regarded as a classification model which predicts the side information given the source sentence. Note that side information can often be represented as individual words, which can be treated as labels, making the classification model feasible.

Importantly, the above formulation of a generative model would require summing over e at test/decode time, which might be done by decoding for all possible label combinations and then reporting the sentence with the highest model score. This may be computationally infeasible in practice. We address this by approximating the NMT model as p(y | x, e) ≈ p(y | x), resulting in

    p(y, e | x) ≈ p(y | x) · p(e | x),    (5)

and thus forcing the model to encode the shared information in the encoder states.

Our formulation in Equation 5 gives rise to multi-task learning (MTL). Here, we propose the joint learning of two different but related tasks: NMT and multi-label classification (MLC), where the MLC task refers to predicting the labels that possibly represent the words of the given side information. This is interesting in the sense that the model is capable of not only generating the translated outputs, but also explicitly predicting what the side information is. We adopt a simple instance of MTL, called soft parameter sharing, similar to Duong et al. (2015) and Yang and Hospedales (2016). In our MTL version, the NMT and MLC tasks share the parameters of the encoders; the difference between the two lies in the decoder part. In the NMT task, the decoder is kept unchanged. For the MLC task, we define the objective function (or loss)

    L_MLC := − Σ_{m=1}^{M} 1_{s_{w_m}}^T log p_s,    (6)

where p_s is the probability of predicting the presence or absence of each element in the side information, formulated as

    p_s = sigmoid( W_s [ (1/|x|) Σ_i g'(x_i) ] + b_s ),    (7)

where x is the source sequence, comprising the words x_1, …, x_i, …, x_|x|. Here, g'(·) denotes a generic function which returns a vectorised representation of a specific word, depending on the design of the network architecture: e.g., stacked bidirectional (forward and backward) networks over the source sequence (Bahdanau et al., 2015; Luong et al., 2015), a convolutional encoder (Gehring et al., 2017b,a), or a transformer encoder (Vaswani et al., 2017). (To avoid repeating the material, we do not elaborate their formulations here.) Further, W_s ∈ R^{|V_s|×H_x} and b_s ∈ R^{|V_s|} are two additional model parameters for the linear transformation of the source sequence representation, where H_x is the dimension of the output of the g'(·) function and differs across the network architectures discussed earlier.

We now have two objective functions at the training stage: the NMT loss L_NMT and the MLC loss L_MLC. The total objective function of our joint learning is

    L := L_NMT + λ L_MLC,    (8)

where λ is a coefficient balancing the two task objectives, whose value is fine-tuned on the development data to optimise NMT accuracy measured with BLEU (Papineni et al., 2002).

The idea of MTL applied to NLP was first explored by Collobert and Weston (2008) and has since attracted increasing attention from the NLP community (Ruder, 2017). Specifically, the idea behind MTL is to leverage related tasks which can be learned jointly, potentially introducing an inductive bias (Feinman and Lake, 2018). An alternative explanation of the benefits of MTL is that joint training with multiple tasks acts as an additional regulariser for the model, reducing the risk of overfitting (Collobert and Weston, 2008; Ruder, 2017, inter alia).

4 Experiments

4.1 Datasets

As discussed earlier, we conducted our experiments using three different datasets: TED Talks (Chen et al., 2016), Personalised Europarl (Rabinovich et al., 2017), and PatTR (Wäschle and Riezler, 2012b; Simianer and Riezler, 2013), translating from German (de) to English (en).

Table 1 Side information statistics for the three datasets, showing the number of types of side-information label and the set of tokens (display truncated for PatTR-1 (deep)).

                        No. of labels  Examples
TED Talks               11             tech business arts issues education health env recreation politics others
Personalised Europarl   2              male female
PatTR-1 (deep)          651            G01G G01L G01N A47F F25D C01B . . .
PatTR-2 (shallow)       8              G A F C H B E D
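One common way to exploit labels such as those in Table 1, used by the si-src-prefix and si-src-suffix variants, is to attach them to the tokenised source sentence as special tokens. A minimal sketch (the angle-bracket tag convention and the function name are our own illustration, not prescribed by the paper):

```python
def attach_side_info(source_tokens, labels, mode="prefix"):
    # Wrap each side-information label in angle brackets so that the
    # tags cannot collide with ordinary vocabulary items, then attach
    # them to the tokenised source sentence as prefix or suffix tokens.
    tags = ["<{}>".format(label) for label in labels]
    if mode == "prefix":
        return tags + source_tokens
    return source_tokens + tags
```

For the si-trg-prefix variants, the same tags would instead be attached to the target side (and, in the "-p" setting, stripped from the model output before evaluation).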
The statistics of the training and evaluation sets are shown in Table 2. For the TED Talks and Personalised Europarl datasets, we followed the same data splits, since they are made available on the authors' github repository and website. For the PatTR dataset, we use the Abstract sections of patents from 2008 or later, and the development and test sets are constructed to have 2000 sentences each, similar to Wäschle and Riezler (2012b) and Simianer and Riezler (2013).

It is important to note how the side information is labelled. We extracted all kinds of side information from the three aforementioned datasets in the form of individual words or labels, which makes the label embeddings much easier. Their relevant statistics and examples can be found in Table 1.

We preprocessed all the data using Moses's training scripts (https://github.com/moses-smt/mosesdecoder/tree/master/scripts) with the standard steps: punctuation normalisation, tokenisation, and truecasing. For the training sets, we set word-based length thresholds for filtering out long sentences, since they are not useful when training seq2seq models, as suggested in the NMT literature (Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015, inter alia). We chose length thresholds of 80, 80, and 150 for the TED Talks, Personalised Europarl, and PatTR datasets, respectively. Note that the 150 threshold indicates that sentences in the PatTR dataset are on average much longer than in the other datasets. To better handle the OOV problem, we segmented all the preprocessed data into subword units using the byte-pair encoding (BPE) method proposed by Sennrich et al. (2016b). Since languages such as English and German share an alphabet (Sennrich et al., 2016b), learning BPE on the concatenation of the source and target languages (hence called shared BPE) increases the consistency of the segmentation. We applied 32000 operations for learning the shared BPE using the open-source toolkit (https://github.com/rsennrich/subword-nmt). We also used the dev sets for tuning model parameters and for early stopping of the NMT models based on perplexity. Table 2 shows the resulting vocabulary sizes after subword segmentation for all datasets.

Table 2 Statistics of the training & evaluation sets of the TED Talks, Personalised Europarl, and PatTR datasets, showing in each cell the count for the source language (left) and the target language (right); "# types" refers to subword-segmented vocabulary sizes; "n.a." (not applicable) marks the development and test sets, which were not length-filtered. All "# tokens" and "# types" counts are approximate.

dataset                        # tokens (M)   # types (K)    # sents  length limit
TED Talks de→en        train   3.73 / 3.75    19.78 / 14.23  163653   80
                       dev     0.02 / 0.02    4.03 / 3.15    567      n.a.
                       test    0.03 / 0.03    6.07 / 4.68    1100     n.a.
Personalised Europarl  train   8.46 / 8.39    21.15 / 14.04  278629   80
de→en                  dev     0.16 / 0.16    14.67 / 9.83   5000     n.a.
                       test    0.16 / 0.16    14.76 / 9.88   5000     n.a.
PatTR de→en            train   33.07 / 32.52  24.97 / 13.28  656352   150
                       dev     0.13 / 0.13    13.50 / 6.88   2000     n.a.
                       test    0.13 / 0.12    13.35 / 6.89   2000     n.a.

4.2 Baselines and Setups

Recall that our method for incorporating additional side information into NMT models is generic; hence, it is applicable to any NMT architecture. We chose the transformer architecture (Vaswani et al., 2017) for all our experiments, since it is arguably the most robust NMT model currently available compared to RNN- and convolution-based architectures. We re-implemented the transformer-based NMT system using the C++ neural network library DyNet (https://github.com/clab/dynet/) as our deep-learning backend, and release our re-implementation as the open-source toolkit Transformer-DyNet (https://github.com/duyvuleo/Transformer-DyNet/).

In our experiments, we use the same configuration for all transformer models and datasets: 2 encoder and 2 decoder layers; 512-dimensional input embeddings and hidden layers; sinusoidal positional encoding; dropout with probability 0.1 for the source and target embeddings and the sub-layers (attention + feedforward), as well as attentional dropout; and label smoothing with weight 0.1. For training our neural models, we used early stopping based on development perplexity, which usually occurs after 20-30 epochs. (The training process of transformer models is much faster than that of RNN- and convolution-based ones, but requires more epochs for convergence.)

We conducted our experiments with the various incorporation methods discussed in Section 3. We denote the system variants as follows:

base refers to the baseline NMT system using the transformer without any side information.

si-src-prefix and si-src-suffix refer to the NMT system using the side information as prefix or suffix of the source sequence, respectively (Jehl and Riezler, 2018), applied in both training and decoding/inference.

si-trg-prefix refers to the NMT system using the side information as prefix of the target sequence. There are two variants: "si-trg-prefix-p", in which the side information is generated by the model itself and then used for decoding/inference; and "si-trg-prefix-h", in which the side information is given at decoding/inference time.

output-layer refers to the method of incorporating the side information in the final output layer.

mtl refers to the multi-task learning method.

It is worth noting that the dimension used in the output-layer method was fine-tuned on the development set over the value range {64, 128, 256, 512}. Similarly, the balancing weight λ in the mtl method was fine-tuned over the value range {0.001, 0.01, 0.1, 1.0}. For evaluation, we measured the end translation quality with case-sensitive BLEU (Papineni et al., 2002). We averaged 2 runs for each of the method variants.

4.3 Results and Analysis

The experimental results can be seen in Table 3.

Table 3 Evaluation results (BLEU scores) of the various incorporation variants against the baseline; bold: better than the baseline; †: statistically significantly better than the baseline. (The baseline does not use the side information, so its PatTR score applies to both label sets.)

Method            TED Talks  Personalised Europarl  PatTR-1  PatTR-2
base              29.48      31.12                  45.86    45.86
si-src-prefix     29.28      30.87                  45.99    45.97
si-src-suffix     29.36      31.03                  46.01    45.83
si-trg-prefix-p   29.06      30.89                  45.97    45.85
si-trg-prefix-h   29.28      30.93                  46.03    45.92
output-layer      29.99†     31.22                  46.32†   46.09
  w/o side info   29.62      31.10                  46.14    45.99
mtl               29.86†     31.12                  46.14    46.01

Overall, we obtained limited success with the method of adding the side information as prefix or suffix on the TED Talks and Personalised Europarl datasets. On the PatTR dataset, small improvements (0.1-0.2 BLEU) are observed. We experimented with two sets of side information in the PatTR dataset: PatTR-1 (651 deep labels) and PatTR-2 (8 shallow labels). (The shallow setting takes the first character of each label code, which denotes the highest-level concept in the type hierarchy, e.g., G01P (measuring speed) → G (physics), with definitions as shown in Fig. 3.) One possible reason for this phenomenon is that the multi-head attention mechanism in the transformer may be confused by the added side information, whether in the source or the target sequences: in some ambiguous cases, the multi-head attention may not know where it should pay more attention. Another possible reason is the implicit ambiguity of the side information that may exist in the data.

Contrary to these variants, the output-layer variant was more consistently successful, obtaining the best results across datasets. On the TED Talks and PatTR datasets, this method also yields statistically significant improvements over the baselines. Additionally, we conducted another experiment by splitting the TED Talks and coarse PatTR-2 datasets by their meta categories and observing the individual effects of incorporating the side information with the output-layer variant, as shown in Figures 4a and 4b. In the TED Talks dataset, we observed improvements for most categories, except for "business" and "education". In the coarse PatTR-2 dataset, improvements are obtained across all categories.

Fig. 4 Effects on the individual BLEU scores for each of the categories in (a) the TED Talks dataset (categories: arts, business, education, environment, health, issues, politics, recreation, technology, others) and (b) the coarse PatTR-2 dataset (categories A-H), comparing the baseline with the NMT model enhanced with the output-layer variant.

The key to the success of the output-layer variant is that the representation of the given side information is added in the final output layer and controlled by additional learnable model parameters; in that sense, it has a more direct effect on the lexical choice of the NMT model. This resembles its success in the context of language modelling, as presented in Hoang et al. (2016a). Further, we also obtained promising results for the mtl variant, even though we implemented a very simple instance of MTL, with a plain sharing mechanism and no side information given at test time.
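Concretely, the mtl objective of Equations 6-8 can be sketched numerically. The following NumPy sketch uses toy dimensions and our own variable names (the actual system is implemented in C++ with DyNet), and simplifies the MLC loss to score only the labels that are present:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def side_info_probs(enc_states, W_s, b_s):
    # Eq. (7): mean-pool the per-word representations g'(x_i) over the
    # source sentence, then map them to one presence probability per
    # side-information label via a linear layer plus a sigmoid.
    pooled = enc_states.mean(axis=0)        # shape (H_x,)
    return sigmoid(W_s @ pooled + b_s)      # shape (|V_s|,)

def mlc_loss(p_s, gold_label_ids):
    # Eq. (6): negative log-probability of the observed labels
    # (a simplification that scores only the labels that are present).
    return -float(sum(np.log(p_s[i]) for i in gold_label_ids))

def joint_loss(nmt_loss, p_s, gold_label_ids, lam=0.1):
    # Eq. (8): L = L_NMT + lambda * L_MLC.
    return nmt_loss + lam * mlc_loss(p_s, gold_label_ids)
```

In a real system, `enc_states` would be the encoder outputs of whichever architecture realises g'(·), and lambda would be tuned on the development set as described above.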
output-layer method, we added an additional ex- (2016a) who proposed the use of side informa- periment in which the output-layer method does tion for boosting the performance of recurrent neu- not have the access of side information. As ex- ral network language models. We further apply pected, its performance has been dropped, as can this idea for a downstream task in neural machine be seen in the second last row in Table3. In this translation. mtl case, the method without the side information We’ve adapted different methods in the litera- at a test time performs better. We believe that ture for this specific problem and evaluated using mtl more careful design of the variant can lead to different datasets with different kinds of side in- even better results. We also think that the hybrid formation. method combining the output-layer and mtl vari- Our methods for incorporating side informa- ants is also an interesting direction for future re- tion as suffix, prefix for either source or target se- search, e.g., relaxing the approximation as shown quences have been adapted from (Sennrich et al., in Equation5. 2016a; Johnson et al., 2017). Also working on Given the above results, we can find that the the same patent dataset, Jehl and Riezler(2018) characteristics of side information plays an impor- proposed to incorporate document meta informa- tant role in improving the NMT models. Our em- tion as special tokens, similar to our source pre- pirical experiments show that topical information fix/suffix method, or by concatenating the tag with (as in the TED Talks and PatTR datasets) is more each source word. They report an improvements, useful than the personal traits (as in the Person- consistent with our findings, although the changes 13 they observe are larger, of about 1 BLEU point, 2017 Conference on Empirical Methods in Natu- albeit from a lower baseline. 
ral Language Processing, EMNLP 2017, Copen- Also, Michel and Neubig(2018) proposed to hagen, Denmark, September 9-11, 2017. pages 992–1003. https://aclanthology.info/papers/D17- personalise neural MT systems by taking the vari- 1105/d17-1105. ance that each speaker speaks/writes on his own into consideration. They proposed the adaptation Iacer Calixto, Qun Liu, and Nick Campbell. 2017. Doubly-attentive decoder for multi-modal neural process which takes place in the “output” layer, machine translation. In Proceedings of the 55th An- similar to our output layer incorporation method. nual Meeting of the Association for Computational The benefit of the proposed MTL approach is Linguistics, ACL 2017, Vancouver, Canada, July 30 not surprising, resembling from existing works, - August 4, Volume 1: Long Papers. pages 1913– 1924. https://doi.org/10.18653/v1/P17-1175. e.g., jointly training translation models from/to multiple languages (Dong et al., 2015); jointly Mauro Cettolo, Christian Girardi, and Marcello Fed- 3 training the encoders (Zoph and Knight, 2016) or erico. 2012. WIT : Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th both encoders and decoders (Johnson et al., 2017). Conference of the European Association for Ma- chine Translation (EAMT). Trento, Italy, pages 261– 6 Conclusion 268. In this work, we have presented various situations Wenhu Chen, Evgeny Matusov, Shahram Khadivi, to which extent the side information can boost the and Jan-Thorsten Peter. 2016. Guided align- performance of the NMT models. We have stud- ment training for topic-aware neural ma- chine translation. CoRR abs/1607.01628. ied different kinds of side information (e.g. topic http://arxiv.org/abs/1607.01628. information, personal trait) as well as present dif- ferent ways of incorporating them into the existing Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vy- molova, Kaisheng Yao, Chris Dyer, and Gholam- NMT models. Though being simple, the idea of reza Haffari. 
2016. Incorporating Structural Align- utilising the side information for NMT is indeed ment Biases into an Attentional Neural Transla- feasible and we have proved it via our empirical tion Model. In Proceedings of the 2016 Confer- experiments. Our findings will encourage practi- ence of the North American Chapter of the Associ- ation for Computational Linguistics: Human Lan- tioners to pay more attention to the side informa- guage Technologies. Association for Computational tion if exists. Such side information can provide Linguistics, San Diego, California, pages 876–885. valuable external knowledge that compensates for http://www.aclweb.org/anthology/N16-1102. the learning models. Further, we believe that this Ronan Collobert and Jason Weston. 2008. A uni- idea is not limited to the context of neural LM or fied architecture for natural language process- NMT, but it may be applicable to other NLP tasks ing: Deep neural networks with multitask learn- such as summarisation, parsing, reading compre- ing. In Proceedings of the 25th International hension, and so on. Conference on Machine Learning. ACM, New York, NY, USA, ICML ’08, pages 160–167. https://doi.org/10.1145/1390156.1390177. Acknowledgments Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and We thank the reviewers for valuable feedbacks and Haifeng Wang. 2015. Multi-task learning for mul- discussions. Cong Duy Vu Hoang is supported tiple language translation. In ACL (1). The Associa- by Australian Government Research Training Pro- tion for Computer Linguistics, pages 1723–1732. gram Scholarships at the University of Melbourne, Long Duong, Trevor Cohn, Steven Bird, and Paul Australia. Cook. 2015. Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser. 
In Proceedings of the 53rd Annual Meet- References ing of the Association for Computational Linguistics and the 7th International Joint Conference on Natu- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- ral Language Processing (Volume 2: Short Papers). gio. 2015. Neural Machine Translation by Jointly Association for Computational Linguistics, pages Learning to Align and Translate. In Proc. of 3rd 845–850. https://doi.org/10.3115/v1/P15-2139. International Conference on Learning Representa- tions (ICLR2015). Maha Elbayad, Laurent Besacier, and Jakob Ver- beek. 2018. Pervasive attention: 2d con- Iacer Calixto and Qun Liu. 2017. Incorporating volutional neural networks for sequence-to- global visual features into attention-based neu- sequence prediction. CoRR abs/1808.03867. ral machine translation. In Proceedings of the http://arxiv.org/abs/1808.03867. 14 Reuben Feinman and Brenden M. Lake. Thang Luong, Hieu Pham, and Christopher D. Man- 2018. Learning inductive biases with sim- ning. 2015. Effective Approaches to Attention- ple neural networks. CoRR abs/1802.02745. based Neural Machine Translation. In Proceed- http://arxiv.org/abs/1802.02745. ings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Christiane Fellbaum, editor. 1998. WordNet: An Elec- Computational Linguistics, Lisbon, Portugal, pages tronic Lexical Database. Language, Speech, and 1412–1421. http://aclweb.org/anthology/D15-1166. Communication. MIT Press, Cambridge, MA. Evgeny Matusov, Andy Way, Iacer Calixto, Daniel Mercedes Garc´ıa-Mart´ınez, Lo¨ıc Barrault, and Fethi Stein, Pintu Lohar, and Sheila Castilho. 2017. Bougares. 2017. Neural machine translation by Using images to improve machine-translating e- generating multiple linguistic factors. CoRR commerce product listings. In Proceedings of abs/1712.01821. http://arxiv.org/abs/1712.01821. 
the 15th Conference of the European Chap- ter of the Association for Computational Lin- Jonas Gehring, Michael Auli, David Grangier, and guistics, EACL 2017, Valencia, Spain, April Yann Dauphin. 2017a. A convolutional en- 3-7, 2017, Volume 2: Short Papers. pages coder model for neural machine translation. In 637–643. https://aclanthology.info/papers/E17- Proceedings of the 55th Annual Meeting of the 2101/e17-2101. Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe 4, Volume 1: Long Papers. pages 123–135. Ittycheriah. 2016. Coverage Embedding Models for https://doi.org/10.18653/v1/P17-1012. Neural Machine Translation. In Proceedings of the 2016 Conference on Empirical Methods in Natu- Jonas Gehring, Michael Auli, David Grangier, De- ral Language Processing. Association for Compu- nis Yarats, and Yann N. Dauphin. 2017b. Con- tational Linguistics, Austin, Texas, pages 955–960. volutional sequence to sequence learning. In Proceedings of the 34th International Conference Paul Michel and Graham Neubig. 2018. Extreme on Machine Learning, ICML 2017, Sydney, NSW, adaptation for personalized neural machine trans- Australia, 6-11 August 2017. pages 1243–1252. lation. In Proceedings of the 56th Annual Meet- http://proceedings.mlr.press/v70/gehring17a.html. ing of the Association for Computational Lin- guistics (Volume 2: Short Papers). Association Cong Duy Vu Hoang, Trevor Cohn, and Gholam- for Computational Linguistics, pages 312–318. reza Haffari. 2016a. Incorporating Side Informa- http://aclweb.org/anthology/P18-2050. tion into Recurrent Neural Network Language Mod- els. In Proceedings of the 2016 Conference of Shachar Mirkin, Scott Nowson, Caroline Brun, and the North American Chapter of the Association Julien Perez. 2015. Motivating personality- for Computational Linguistics: Human Language aware machine translation. In Proceedings Technologies. 
Association for Computational Lin- of the 2015 Conference on Empirical Meth- guistics, San Diego, California, pages 1250–1255. ods in Natural Language Processing. Association http://www.aclweb.org/anthology/N16-1149. for Computational Linguistics, pages 1102–1108. https://doi.org/10.18653/v1/D15-1130. Cong Duy Vu Hoang, Reza Haffari, and Trevor Kishore Papineni, Salim Roukos, Todd Ward, and Cohn. 2016b. Improving Neural Translation Mod- Wei-Jing Zhu. 2002. BLEU: A Method for Au- els with Linguistic Factors. In Proceedings of tomatic Evaluation of Machine Translation. In the Australasian Language Technology Association Proceedings of the 40th Annual Meeting on As- Workshop 2016. Melbourne, Australia, pages 7–14. sociation for Computational Linguistics. Asso- http://www.aclweb.org/anthology/U16-1001. ciation for Computational Linguistics, Strouds- Laura Jehl and Stefan Riezler. 2018. Document-level burg, PA, USA, ACL ’02, pages 311–318. information as side constraints for improved neural https://doi.org/10.3115/1073083.1073135. patent translation. In Proceedings of the 13th Con- Ella Rabinovich, Raj Nath Patel, Shachar Mirkin, Lu- ference of the Association for Machine Translation cia Specia, and Shuly Wintner. 2017. Personal- in the Americas, AMTA 2018, Boston, MA, USA, ized machine translation: Preserving original author March 17-21, 2018 - Volume 1: Research Papers. traits. In Proceedings of the 15th Conference of the pages 1–12. https://aclanthology.info/papers/W18- European Chapter of the Association for Computa- 1802/w18-1802. tional Linguistics: Volume 1, Long Papers. Asso- ciation for Computational Linguistics, pages 1074– Melvin Johnson, Mike Schuster, Quoc Le, Maxim 1084. http://aclweb.org/anthology/E17-1101. Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernand a Vigas, Martin Wattenberg, Greg Corrado, Sebastian Ruder. 2017. An overview of multi- Macduff Hughes, and Jeffrey Dean. 2017. Google’s task learning in deep neural networks. 
CoRR multilingual neural machine translation system: En- abs/1706.05098. http://arxiv.org/abs/1706.05098. abling zero-shot translation. Transactions of the As- sociation for Computational Linguistics 5:339–351. Rico Sennrich and Barry Haddow. 2016. Linguistic https://transacl.org/ojs/index.php/tacl/article/view/1081. input features improve neural machine translation. 15 In Proceedings of the First Conference on Machine task learning. CoRR abs/1606.04038. Translation: Volume 1, Research Papers. Associ- http://arxiv.org/abs/1606.04038. ation for Computational Linguistics, pages 83–91. https://doi.org/10.18653/v1/W16-2209. Jiacheng Zhang, Yang Liu, Huanbo Luan, Jingfang Xu, and Maosong Sun. 2017. Prior knowledge inte- Rico Sennrich, Barry Haddow, and Alexandra Birch. gration for neural machine translation using poste- 2016a. Controlling politeness in neural machine rior regularization. In Proceedings of the 55th An- translation via side constraints. In Proceedings nual Meeting of the Association for Computational of the 2016 Conference of the North American Linguistics (Volume 1: Long Papers). Association Chapter of the Association for Computational Lin- for Computational Linguistics, pages 1514–1523. guistics: Human Language Technologies. Associ- https://doi.org/10.18653/v1/P17-1139. ation for Computational Linguistics, pages 35–40. https://doi.org/10.18653/v1/N16-1005. B. Zoph and K. Knight. 2016. Multi-source neural translation. In Proc. NAACL. Rico Sennrich, Barry Haddow, and Alexandra http://www.isi.edu/natural-language/mt/multi- Birch. 2016b. Neural machine translation of source-neural.pdf. rare words with subword units. In Proceed- ings of the 54th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1715–1725. http://www.aclweb.org/anthology/P16-1162. Patrick Simianer and Stefan Riezler. 2013. Multi- task learning for improved discriminative training in SMT. 
In Proceedings of the Eighth Workshop on Statistical Machine Translation. Association for Computational Linguistics, Sofia, Bulgaria, pages 292–300. http://www.aclweb.org/anthology/W13- 2236. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Net- works. In Proceedings of the 27th International Conference on Neural Information Processing Sys- tems. MIT Press, Cambridge, MA, USA, NIPS’14, pages 3104–3112. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Ben- gio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Informa- tion Processing Systems 30, Curran Associates, Inc., pages 5998–6008. http://papers.nips.cc/paper/7181- attention-is-all-you-need.pdf. Katharina Waschle¨ and Stefan Riezler. 2012a. Analyzing Parallelism and Domain Similar- ities in the MAREC Patent Corpus. Mul- tidisciplinary Information Retrieval pages 12–27. http://www.cl.uni-heidelberg.de/ rie- zler/publications/papers/IRF2012.pdf. Katharina Waschle¨ and Stefan Riezler. 2012b. Struc- tural and Topical Dimensions in Multi-Task Patent Translation. In Proceedings of the 13th Confer- ence of the European Chapter of the Association for Computational Linguistics (EACL 2012), Avi- gnon, France. http://www.aclweb.org/anthology- new/E/E12/E12-1083.pdf. Yongxin Yang and Timothy M. Hospedales. 2016. Trace norm regularised deep multi- 16 Text-dependent Forensic Voice Comparison: Likelihood Ratio Estima- tion with the Hidden Markov Model (HMM) and Gaussian Mixture Model – Universal Background Model (GMM-UBM) Approaches Satoru Tsuge Shunichi Ishihara Daido University, Japan Australian National University [email protected] [email protected] Abstract ly accepted in forensic voice comparison (FVC) as well (Morrison, 2009). 
Among the more typical forensic voice comparison (FVC) approaches, the acoustic-phonetic statistical approach is suitable for text-dependent FVC, but it does not fully exploit the available time-varying information of speech in its modelling. The automatic approach, on the other hand, essentially deals with text-independent cases, which means temporal information is not explicitly incorporated in the modelling. Text-dependent likelihood ratio (LR)-based FVC studies, in particular those that adopt the automatic approach, are few. This preliminary LR-based FVC study compares two statistical models, the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM), for the calculation of forensic LRs using the same speech data. FVC experiments were carried out using Japanese short words of different lengths under a forensically realistic, but challenging, condition: only two speech tokens for model training and LR estimation. The log-likelihood-ratio cost (Cllr) was used as the assessment metric. The study demonstrates that the HMM system consistently outperforms the GMM system in terms of average Cllr values. However, words longer than three morae are needed if the advantage of the HMM is to become evident. With a seven-mora word, for example, the HMM outperformed the GMM by a Cllr value of 0.073.

Satoru Tsuge and Shunichi Ishihara. 2018. Text-dependent Forensic Voice Comparison: Likelihood Ratio Estimation with the Hidden Markov Model (HMM) and Gaussian Mixture Model. In Proceedings of the Australasian Language Technology Association Workshop, pages 17-25.

1 Introduction

After the DNA success story, the likelihood ratio (LR)-based approach became the new paradigm for evaluating and presenting forensic evidence in court. The LR approach has also been applied to speech evidence (Rose, 2006), and it is increasingly accepted in forensic voice comparison (FVC) as well (Morrison, 2009).

There are two different approaches in FVC: the 'acoustic-phonetic statistical approach' and the 'automatic approach' (Morrison et al., 2018). The former usually works on comparable phonetic units that can be found in both the offender and suspect samples. In the latter, acoustic measurements are usually carried out over all portions of the available recordings, resulting in more detailed acoustic characteristics of the speakers. The common statistical models used in the automatic approach are the Gaussian mixture model - universal background model (GMM-UBM) (Reynolds et al., 2000) and i-vectors with probabilistic linear discriminant analysis (PLDA) (Burget et al., 2011). Due to its nature, the automatic approach is mainly used for text-'independent' FVC, and there is a good amount of research on this (Enzinger & Morrison, 2017; Enzinger et al., 2016). The acoustic-phonetic statistical approach is a type of text-'dependent' FVC because it tends to focus on particular linguistic units, such as phonemes, words, phrases, etc. Having said that, even if one is targeting a particular word or phrase, for example 'hello', not all obtainable features are exploited in the acoustic-phonetic statistical approach, because it still tends to focus on particular segments or phonemes of the word or phrase, e.g. the formant trajectories of the diphthong and the static spectral information of the fricative (Rose, 2017).

One of the advantages of text-dependent FVC is the availability of the time-varying characteristics of a speaker, which is information that can be explicitly included in the modelling.

There are a good number of LR-based text-independent FVC studies in the automatic approach (Enzinger & Morrison, 2017; Enzinger et al., 2016). However, although there are some studies in which text-independent models (e.g. GMM) were applied to text-dependent FVC scenarios (Morrison, 2011), to the best of our knowledge, studies on LR-based text-dependent FVC in the automatic approach are scarce.

In this study, a text-dependent LR-based FVC system built on the GMM-UBM (the GMM system) and one built on the hidden Markov model (the HMM system) are compared in their performance using the same data. The transitional characteristics of an individual's speech can be explicitly modelled in the latter system.

Words of various lengths are used for testing purposes, to see how word duration influences the performance of the systems. With the forensically realistic condition of data sparsity in mind, we used only two tokens of each word for modelling and testing.

It is naturally expected that, given a sufficient amount of data, the HMM system outperforms the GMM system. However, it is not so clear whether this expectation is realistic when the amount of data is limited. Even if the HMM system works better, it is important to establish how the HMM and GMM systems compare with respect to the calculation of the strength of the LR, and also how and under what conditions the former is more advantageous than the latter.

2 Likelihood Ratios

The LR framework has been advocated by many as the logically and legally correct framework for assessing forensic evidence and reporting the outcome in court (Aitken, 1995; Aitken & Stoney, 1991; Aitken & Taroni, 2004; Balding & Steele, 2015; Evett, 1998; Robertson & Vignaux, 1995). A substantial amount of fundamental research on FVC has been carried out since the late 1990s (Gonzalez-Rodriguez et al., 2007; Morrison, 2009; Rose, 2006), and it is now accepted in an increasing number of countries (Morrison et al., 2016).

In the LR framework, the task of the forensic … p(E|Hp) is the probability of the evidence E, given Hp, in other words the prosecution or same-speaker hypothesis; p(E|Hd) is the probability of E, given Hd, in other words the defence or different-speaker hypothesis (Robertson & Vignaux, 1995). The LR can be considered in terms of the ratio between similarity and typicality. Similarity here means the similarity of the evidence attributable to the offender and the suspect, respectively. Typicality means the typicality of that evidence against the relevant population.

The relative strength of the given evidence with respect to the competing hypotheses (Hp vs. Hd) is reflected in the magnitude of the LR. If the evidence is more likely to occur under the prosecution hypothesis than under the defence hypothesis, the LR will be higher than 1. If the evidence is more likely to occur under the defence hypothesis than under the prosecution hypothesis, the LR will be lower than 1. For example, LR = 30 means that the evidence is 30 times more likely to occur on the assumption that the evidence is from the same person than on the assumption that it is not.

The important point is that the LR is concerned with the probability of the evidence, given the hypothesis (either Hp or Hd). The probability of the evidence can be estimated by forensic scientists. They legally must not, and logically cannot, estimate the probability of the hypothesis, given the evidence. This is because the forensic scientist is not legally in a position to address the ultimate 'guilty vs. not guilty' question, i.e. the probability of the hypothesis given the evidence; that is the task of the trier-of-fact. Furthermore, the forensic scientist would need to refer to Bayes' theorem to estimate the probability of the hypothesis given the evidence, using prior information that is only accessible to the trier-of-fact; thus the forensic scientist cannot logically estimate the probability of the hypothesis.

3 Experimental Design

In this section, the nature of the database used for the experiments is explained first.
This is followed expert is to estimate strength of evidence and re- by an illustration as to how the speaker compari- port it to the court. LR is a measure of the quanti- sons were set up for the experiments. The acoustic tative strength of evidence, and is calculated using features used in this study will be explained to- the formula in 1). wards the end. p(E|Hp) LR= 1) 3.1 Database p(E|Hd) Our data were extracted from the National Re- In 1), E is the evidence, i.e. the measured prop- search Institute of Police Science (NRIPS) data- erties of the voice evidence; p(E|Hp) is the proba- 18 ze.ro ku.ru.ma ko.o.so.ku hya.ku ka.ne go.ze.n ‘zero’ ‘car’ ‘highway’ ‘hundred’ ‘money’;= ‘AM’ re.e de.n.wa ya.ku.so.ku sa.n.bya.ku da.i.jyo.o.bu wa.ta.shi ‘zero’ ‘telephone’ ‘promise’ ‘three hundred’ ‘fine’ ‘I’ i.chi ke.e.sa.tsu o.n.na ro.p.pya.ku ki.no.o ko.do.mo ‘one’ ‘police’ ‘woman’ ‘six hundred’ ‘yesterday’ ‘child’ sa.n do.ku o.ku.sa.n ha.p.pya.ku kyo.o ke.e.ta.i ‘three’ ‘poison’ ‘wife’ ‘eight hundred’ ‘today’ ‘mobile phone’ yo.n re.n.ra.ku re.su.to.ra.n se.n a.shi.ta ka.ji ‘four’ ‘contact’ ‘restaurant’ ‘thousand’ ‘tomorrow’ ‘fire’ ro.ku ba.ku.da.n po.su.to i.s.se.n ge.n.ki.n ko.n.bi.ni ‘six’ ‘bomb’ ‘post’ ‘one thousand’ ‘cash’ ‘store’ na.na gi.n.ko.o sa.a.bi.su.e.ri.a go.go a.no.o ta.ku.shi.i ‘seven’ ‘bank’ ‘road house’ ‘afternoon’ ‘well (filler)’ ‘taxi’ shi.chi ji.ka.n sa.n.ze.n e.ki ne.e i.n.ta.a ‘seven’ ‘time’ ‘three thousand’ ‘station’ ‘well (filler)’ ‘interchange’ ha.chi mo.shi.mo.shi ha.s.se.n o.ma.e a.no.ne.e me.e.ru ‘eight’ ‘hello (phone)’ ‘eight thousand’ ‘you’ ‘well (filler)’’ ‘mail’ kyu.u ha.i ma.n o.i na.ka.ma ba.n.go.o ‘nine’ ‘yes’ ‘ten thousand’ ‘hay’ ‘mate’ ‘number’ jyu.u o.re. o.ku ba.ku.ha.tsu ka.i.sha ko.o.za ‘ten’ ‘I’ ‘million’ ‘explosion’ ‘company’ ‘account’ Table 1: 66 target words with their glosses. Each mora is separated by a period. base (Makinae et al., 2007). The database consists was estimated for each of the comparisons. 
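To make the interpretation of formula 1) concrete, the following is a minimal illustrative sketch (not from the paper); the probability values are invented for illustration only:

```python
def likelihood_ratio(p_e_given_hp, p_e_given_hd):
    """LR = p(E|Hp) / p(E|Hd): the ratio of the probability of the
    evidence under the prosecution (same-speaker) hypothesis to its
    probability under the defence (different-speaker) hypothesis."""
    return p_e_given_hp / p_e_given_hd

# Invented likelihoods: the evidence is 30 times more likely under
# the same-speaker hypothesis, so LR is about 30 (supports Hp).
lr = likelihood_ratio(0.06, 0.002)

# An LR below 1 would instead support the defence hypothesis.
```

An LR of 30 does not say the speakers are the same; it says only how much more probable the evidence is under one hypothesis than the other.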
The database consists of recordings collected from 316 male and 323 female speakers. All utterances were read-out speech, consisting of single syllables, words, selected sentences and so on. The word-based recordings stored in the database provided the data used in this study.

Participants ranged in age from 18 to 76 years. The metadata provide information on the areas of Japan (or overseas in some cases) where they have resided, as well as their height, weight, and their health conditions on the day of recording. Only male speakers who completed the recordings in two different sessions separated by 2-3 months, without any mis-recordings for the target 66 words, were selected for the current study (resulting in 310 speakers). Each word was recorded only twice in each session.

The rhythmic unit of Japanese is the mora. Based on mora, the 66 words, all listed in Table 1, consist of 25 two-, 16 three-, 22 four-, 2 five- and 1 seven-mora words.

The 310 speakers were separated into six different, mutually exclusive groups: Gr1 (59 speakers), Gr2 (60), Gr3 (60), Gr4 (58), Gr5 (60) and Gr6 (13). Five different experiments were conducted using the six groups, as shown in Table 2.

Experiments  Test  Dev  Back
Exp1         Gr1   Gr2  Gr3,4,5,6
Exp2         Gr2   Gr3  Gr1,4,5,6
Exp3         Gr3   Gr4  Gr1,2,5,6
Exp4         Gr4   Gr5  Gr1,2,3,6
Exp5         Gr5   Gr1  Gr2,3,4,6

Table 2: Usage of Gr1~6 for the experiments (Exp). Test, Dev and Back refer to the test, development and background databases.

The test database was used for simulating two types of offender-suspect comparisons: same-speaker (SS) and different-speaker (DS). An LR was estimated for each of the comparisons. The development database was also called upon for simulating offender-suspect comparisons, but the derived scores (pre-calibration LRs) were specifically used to obtain the weights for calibration (refer to §4.4 for details on calibration). The background database was used to build the statistical model for typicality.

As mentioned earlier, there are two recordings per speaker for each word in each session. The suspect model was built using the two recordings taken from one session, and an LR was estimated for each of the two recordings of the other session (offender evidence). The same process was repeated by swapping the recordings of the sessions. In this way, 4 LRs were obtained for each SS comparison, and 8 LRs for each DS comparison. Thus, the number of comparisons is 4*n (n = number of speakers) for the SS comparisons, and 8*nC2 (C = combination) for the DS comparisons. Using the five different groups (Gr1~5) separately as a test database, it was possible, altogether, to carry out 1188 SS comparisons and 69392 DS comparisons. The breakdowns of the SS and DS comparisons are given in Table 3 for the five experiments (Exp1~5).

Experiments     SS    DS
Exp1: Gr1 (59)  236   13688
Exp2: Gr2 (60)  240   14160
Exp3: Gr3 (60)  240   14160
Exp4: Gr4 (58)  232   13224
Exp5: Gr5 (60)  240   14160
Total           1188  69392

Table 3: Numbers of SS and DS comparisons for each word.

The NRIPS database also contains recordings of 50 sentences that are based on the ATR phonetically balanced Japanese sentences (Kurematsu et al., 1990). These sentences were used to build the initial statistical models (refer to §4.1 and §4.2 for details).

3.2 Acoustic Features

Twelve mel-frequency cepstral coefficients (MFCCs), 12 ΔMFCCs and Δ log power (a feature vector of 25th order) were extracted with a 20 msec wide Hamming window shifting every 10 msec.

4 Estimation of Likelihood Ratios

In this section, the two different modelling techniques used in the current study are explained. This is followed by an exposition of the method for calculating scores with these models. The method used for converting the scores to LRs, namely calibration, is explained last.

For this study, the suspect model, rather than being based solely on the data of the suspect speaker, was generated by adapting a speaker-unspecific model (background model) by means of a maximum a posteriori (MAP) procedure. Three different numbers of Gaussians (4, 8 and 16) were tried in the models.

4.1 GMM Models

The following is the process of building a speaker-specific word-dependent GMM for each speaker:

1) To build a speaker-unspecific word-independent GMM using the recordings of the phonetically balanced utterances;
2) To build a speaker-unspecific word-dependent GMM for each word by training the speaker-unspecific word-independent GMM, which was generated in 1), with the relevant word recordings of the background database;
3) To build the speaker-specific word-dependent GMM (suspect model = λsus) for each word by training the speaker-unspecific word-dependent GMM, which was built in 2), with the speaker-specific data in the test database, while applying a MAP adaptation.

The speaker-unspecific word-dependent GMM, which was built in 2) for each word, was used as the background model (λbkg).

4.2 HMM Models

The following is the process of building a speaker-specific word-dependent HMM for each speaker:

1) To build speaker-unspecific phoneme-dependent HMMs using the recordings of the phonetically balanced utterances;
2) To build an initial speaker-unspecific word-dependent HMM for each word by concatenating the speaker-unspecific phoneme-dependent HMMs, which were built in 1);
3) To build a speaker-specific word-dependent HMM (suspect model = λsus) by training the initial speaker-unspecific word-dependent HMM, which was built in 2), with the speaker-specific data in the test database, while applying a MAP adaptation.

The initial speaker-unspecific word-dependent HMM, which was built in 2), was trained with the relevant word recordings of the background database, and the resultant model was used as the speaker-unspecific word-dependent background model (λbkg).

4.3 Score Calculations

The score of each comparison can be estimated using the equation given in 2), in which s = score, xt = an observation sequence of vectors of acoustic features constituting the offender data, of which there are a total of T, λsus = suspect model and λbkg = background model.

    s = (1/T) Σ_{t=1}^{T} [log p(xt|λsus) − log p(xt|λbkg)]    2)

A score is estimated as the mean of the relative values of the two probability density functions for the feature vectors extracted from the offender data, and was calculated for each of the SS and DS comparisons.

4.4 Scores to Likelihood Ratios

The outcomes of the GMM and HMM systems are not LRs, but are known as scores. The value of a score provides information about the degree of similarity between the two speech samples, i.e. the offender and suspect samples, having taken into account their typicality with respect to the relevant population; it is not directly interpretable as an LR (Morrison, 2013, p. 2). Thus, the scores need to be converted to LRs by means of a calibration process. As we will see in §6, calibration is an essential part of LR-based FVC.

Logistic-regression calibration (Brümmer & du Preez, 2006) is a commonly used method that converts scores to interpretable LRs by applying linear shifting and scaling in the log odds space. A logistic-regression line (e.g. y = ax + b; x = score; y = log10 LR) whose weights (i.e. a and b in y = ax + b) are estimated from the SS and DS scores of the development database is used to monotonically shift (by the amount of b) and scale (by the amount of a) the scores of the testing database to the log10 LRs.

5 Assessment Metrics

A common way of assessing the performance of a classification system is with reference to its correct- or incorrect-classification rate: for instance, how many of the SS comparisons were correctly assessed as coming from the same speakers, and how many of the DS comparisons were correctly assessed as coming from different speakers. In the context of LR-based FVC, an LR can be used as a classification function with LR = 1 as unity. However, the correct- or incorrect-classification rate is a binary decision (same speaker or different speakers), which refers to the ultimate issue of 'guilty vs. non-guilty'. As explained in §2, it is not the task of the forensic expert, but of the trier-of-fact, to make such a decision. Thus, any metrics based on binary decisions are not coherent with the LR framework.

As emphasised in §2, the task of the forensic expert is to estimate the strength of evidence as accurately as possible, and the strength of evidence, which can be quantified by means of an LR, is not binary in nature, but continuous. For example, both LR = 10 and LR = 20 support the correct hypothesis for the SS comparisons, but the latter supports the hypothesis more strongly than the former. The relative strength of the LR needs to be taken into account in the assessment.

Hence, in this study, the log-likelihood-ratio cost (Cllr), which is a gradient metric based on LR, was used as the metric for assessing the performance of the LR-based FVC system. The calculation of Cllr is given in 3) (Brümmer & du Preez, 2006).

    Cllr = (1/2) [ (1/NHp) Σ_{i for Hp=true} log2(1 + 1/LRi) + (1/NHd) Σ_{j for Hd=true} log2(1 + LRj) ]    3)

In 3), NHp and NHd are the numbers of SS and DS comparisons, and LRi and LRj are the linear LRs derived from the SS and DS comparisons, respectively. Under a perfect system, all SS comparisons should produce LRs greater than 1, since the origins are identical; as, in the case of DS comparisons, the origins are different, DS comparisons should produce LRs less than 1. Cllr takes into account the magnitude of the derived LR values, and assigns them appropriate penalties. In Cllr, LRs that support the counter-factual hypotheses or, in other words, contrary-to-fact LRs (LR < 1 for SS comparisons and LR > 1 for DS comparisons) are heavily penalised, and the magnitude of the penalty is proportional to how much the LRs deviate from unity. Optimum performance is achieved when Cllr = 0, and performance decreases as Cllr approaches and exceeds 1. Thus, the lower the Cllr value, the better the performance.

Cllr measures the overall performance of a system in terms of validity, based on a cost function in which there are two main components of loss: discrimination loss (Cllr^min) and calibration loss (Cllr^cal) (Brümmer & du Preez, 2006). The former is obtained after the application of the so-called pooled-adjacent-violators (PAV) transformation – an optimal non-parametric calibration procedure. The latter is obtained by subtracting the former from the Cllr. In this study, besides Cllr, Cllr^min and Cllr^cal are also referred to.

The magnitude of the derived LRs is visually presented using Tippett plots. Details on how to read a Tippett plot are explained in §6, when the plots are presented.

6 Experimental Results and Discussions

The average Cllr, Cllr^min and Cllr^cal values were calculated according to the mora numbers; they are plotted in Figure 1 as a function of word duration, separately for the HMM and GMM systems. The numerical values of Figure 1 are given in Table 4.

Mora            2      3      4      5      7
Cllr      G   0.309  0.239  0.182  0.146  0.136
          H   0.302  0.230  0.156  0.114  0.063
Cllr^min  G   0.270  0.206  0.152  0.118  0.099
          H   0.261  0.192  0.126  0.085  0.047
Cllr^cal  G   0.038  0.032  0.029  0.028  0.037
          H   0.040  0.037  0.030  0.028  0.016

Table 4: Numerical information of Figure 1. G = GMM and H = HMM.

Figure 1: Cllr (Panel a), Cllr^min (b) and Cllr^cal (c) values are plotted as a function of mora duration, separately for the GMM (empty circle) and HMM (filled circle) systems. Note that the Y-axis scale in Panel c is different from that in Panels a and b.

Although this was expected, it can be seen from Figure 1a and Table 4 that the overall performance (Cllr) of both systems improves as the words become longer in terms of mora, and also that the HMM system constantly outperforms the GMM system as far as average Cllr values are concerned. The performance gap between the two systems becomes wider as the number of mora increases, with the performance of the two systems being similar for words of two and three moras. For 12 out of the 25 two-mora words and 6 out of the 16 three-mora words, the GMM system performed better than the HMM system in terms of Cllr. In other words, the evidence suggests that the HMM may not be clearly advantageous for short words, e.g. two- or three-mora words. For the sake of reference, for only 6 out of the 22 four-mora words did the GMM system outperform the HMM system. For the five- and seven-mora words, the HMM system constantly outperformed the GMM system.

The discriminability of the systems (Cllr^min) (Figure 1b) also exhibits the same trend as the overall performance, in that discriminability improves with more moras, the HMM system constantly performed better than the GMM system, and the performance of the former improves at a faster rate than that of the latter. As a result, there is a larger gap in discriminability between the two systems with the seven-mora word (0.052 = 0.099 − 0.047) than there is with the two-mora words (0.009 = 0.270 − 0.261).

The calibration loss of both systems (Cllr^cal) (Figure 1c) is very similar for two-, three-, four- and five-mora words, and essentially the same for the two systems (2: 0.038 and 0.040; 3: 0.032 and 0.037; 4: 0.029 and 0.030; 5: 0.028 and 0.028). The calibration loss improves (albeit at a very small rate) as a function of word duration, except in the case of the GMM system with the seven-mora word.

As has been described by means of Figure 1 and Table 4, it is clearly advantageous to include temporal information in modelling in Japanese, even under the challenging condition of data sparsity. However, the difference in performance may not be evident with short, e.g. two- and three-mora, words. Put differently, if a forensic speech expert is working on a comparable word or phrase of relatively good length, the decision whether or not to include transitional information in the modelling is likely to substantially impact the outcome. For example, the HMM system outperformed the GMM system by a Cllr value of 0.073 (= 0.136 − 0.063) with the seven-mora word.

Three different numbers of Gaussians – 4, 8 and 16 – were used in the study. Table 5 shows which mixture number of Gaussians performed best for words of different mora duration according to the different systems. For example, out of the 25 two-mora words, the GMM system with a mixture number of 8 (M = 8) returned the best result for 11 words, and the HMM system with a mixture number of 4 (M = 4) yielded the lowest Cllr value for 13 words.

Mora    System  M = 4    M = 8     M = 16
2 (25)  G       0        11        14
        H       13       5         7
3 (16)  G       0        2         14
        H       13       3         0
4 (22)  G       0        1         21
        H       14       4         3
5 (2)   G       0        2         0
        H       1        1         1
7 (1)   G       0        0         1
        H       0        0         1
Total   G       0 (0%)   16 (24%)  50 (76%)
        H       41 (62%) 13 (20%)  12 (18%)

Table 5: Best-performing Gaussian numbers (M) for words with different mora numbers. G = GMM and H = HMM.

According to Table 5, there is a clear difference between the two systems with respect to the best-performing mixture number of Gaussians, in that the GMM tends to require a higher mixture number for optimal performance (overall, 76% of words worked best with a mixture number of 16), while the HMM generally does not require a higher mixture number (overall, 62% of words performed best with a mixture number of 4).
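The calibration and assessment steps described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the scores and the calibration weights a, b are invented (in the paper the weights are estimated by logistic regression on the development database), and the Cllr function follows formula 3) of Brümmer & du Preez (2006).

```python
import math

def calibrate(scores, a, b):
    """Map raw scores to log10 LRs by scaling (a) and shifting (b),
    i.e. y = a*x + b in the log odds space."""
    return [a * s + b for s in scores]

def cllr(ss_log10_lrs, ds_log10_lrs):
    """Log-likelihood-ratio cost: the mean log2 penalty for SS
    comparisons (penalising small LRs) and for DS comparisons
    (penalising large LRs), averaged over the two sets."""
    ss = [math.log2(1 + 1 / (10 ** llr)) for llr in ss_log10_lrs]
    ds = [math.log2(1 + 10 ** llr) for llr in ds_log10_lrs]
    return 0.5 * (sum(ss) / len(ss) + sum(ds) / len(ds))

# Invented scores: SS comparisons should score high, DS comparisons low.
ss_scores = [1.2, 0.8, 1.5, 0.9]
ds_scores = [-1.1, -0.7, -1.6, -0.3]
a, b = 2.0, 0.1  # calibration weights, fixed here for illustration
value = cllr(calibrate(ss_scores, a, b), calibrate(ds_scores, a, b))
# A well-performing system yields Cllr close to 0; contrary-to-fact LRs
# (SS with LR < 1, DS with LR > 1) drive Cllr towards and beyond 1.
```

Note how a single contrary-to-fact LR is penalised heavily: an SS comparison with log10 LR = −1 alone contributes log2(1 + 10) ≈ 3.46 to the SS term.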
For all Tippett plots, the cumulative propor- To investigate whether there are any differences tion of trails is plotted on the y-axis against the in the nature and magnitude of the derived LRs, a log10 LRs on the x-axis. Tippett plot was generated for each word in each As can be seen in Figure 2, the derived scores experiment, and this was done separately for the (pre-calibration LRs), which are given in dotted GMM and HMM systems. Figure 2 has Tippett curves, are uncalibrated in different ways for the plots of the five-mora word ‘daijyoobu’ with 16 GMM and HMM systems: the former (Figure 2a) Gaussians: Panel a) = GMM and Panel b) = is uncalibrated to the left and the latter (Figure 2b) HMM. The plots are fairly typical and illustrate is uncalibrated to the right. This indicates that cal- the differences between the two systems. ibration is essential in both systems. In fact, cali- Tippet plots show the cumulative proportion of brating system output is recommended as standard the LRs of the DS comparisons (DSLRs), which practice (Morrison, 2018). are plotted rising from the right, as well as of the The dotted curves are more widely apart in LRs of the SS comparisons (SSLRs), plotted ris- Figure 2a (GMM) than in Figure 2b (HMM). This ing from the left. The solid curves are for LRs and means that the magnitude of the derived scores is 23 greater with the GMM system than with the servative LRs than scores (Rose, 2013), while it HMM system. However, after calibration (solid contributes to stronger LRs for the automatic ap- curves), it can be seen that the magnitude of the proach (Morrison, 2018). However, it is not clear DSLRs is very similar between the two systems at this stage what the observed differences be- while the SSLRs are far stronger for the HMM tween the two systems with respect to the rela- system than for the GMM system. That is, the cal- tionship between the scores and LRs entail. 
This ibration causes different effects in the two sys- warrants further investigation. tems; it brings about more conservative LRs for Apart from the similar degree of magnitude of the GMM system, but enhanced LRs for the the DSLRs (including both consistent-with-fact HMM system. and contrary-to-fact LRs) that were obtained for the GMM and HMM systems, Figure 2 shows that the magnitude of the consistent-with-fact SSLRs is far greater for the HMM system (Figure 2b), and also that all of the SS comparisons were accurately classified as being from the same speakers for the HMM system. As a result, the HMM system is assessed to be better in Cllr than the GMM system (GMM: Cllr = 0.182 and HMM: Cllr = 0.156). 7 Conclusions This is a preliminary study investigating the use- a) GMM fulness of speaker-individuating information man- ifested in the time-varying aspect of speech in a text-dependent FVC system, in particular in the automatic FVC approach. In this study, perfor- mance of the GMM and HMM systems was com- pared using the same data under a forensically re- alistic, but challenging condition, which is sparsi- ty of data. Even with short durations of two-, three-, four-, five- and seven-mora words, the HMM system constantly outperformed the GMM system in terms of average Cllr values. However, the benefits of the transitional information become evident when the HMM system is built with words longer than two- or three mora. With a sev- en-mora word, for example, the HMM system performed better than the GMM system by a C b) HMM llr value of 0.073. This study also demonstrates that the outcomes (scores) of the GMM and HMM systems are not well-calibrated; thus calibration is an essential Figure 2: Tippett plots of the five-mora word part of the FVC if they are to be used as models in ‘daijyoobu’ (Exp5) with 16 Gaussians: Panel a) = the system. 
GMM and Panel b) = HMM Although calibration usually results in a better Acknowledgments performance, its impact on the magnitude of LRs The authors thank the reviewers for their valuable seems to be different depending on various fac- comments. tors, including the types of features and modelling techniques. Many FVC studies, in particular those based on the acoustic-phonetic statistical ap- proach, report that calibration results in more con- 24 References information and Communication Engineers, 97- 102. Aitken, C. G. G. (1995). Statistics and the Evaluation of Evidence for Forensic Scientists. Chichester: Morrison, G. S. (2009). Forensic voice comparison John Wiley. and the paradigm shift. Science & Justice, 49(4), 298-308. Aitken, C. G. G., & Stoney, D. A. (1991). The Use of Statistics in Forensic Science. New York; London: Morrison, G. S. (2011). A comparison of procedures Ellis Horwood. for the calculation of forensic likelihood ratios from acoustic-phonetic data: Multivariate kernel Aitken, C. G. G., & Taroni, F. (2004). Statistics and density (MVKD) versus Gaussian mixture model- the Evaluation of Evidence for Forensic Scientists. universal background model (GMM-UBM). Chichester: John Wiley & Sons. Speech Communication, 53(2), 242-256. Balding, D. J., & Steele, C. D. (2015). Weight-of- Morrison, G. S. (2013). Tutorial on logistic-regression evidence for Forensic DNA Profiles. Chichester: calibration and fusion: Converting a score to a John Wiley & Sons. likelihood ratio. Australian Journal of Forensic Brümmer, N., & du Preez, J. (2006). Application- Sciences, 45(2), 173-197. independent evaluation of speaker detection. Morrison, G. S. (2018). The impact in forensic voice Computer Speech and Language, 20(2-3), 230-275. comparison of lack of calibration and of Burget, L., Plchot, O., Cumani, S., Glembek, O., mismatched conditions between the known-speaker Matějka, P., & Brümmer, N. (2011). 
recording and the relevant-population sample Discriminatively trained probabilistic linear recordings. Forensic Science International, 283, discriminant analysis for speaker verification. E1-E7. Proceedings of the Acoustics, Speech and Signal Morrison, G. S., Enzinger, E., & Zhang, C. (2018). Processing (ICASSP), 2011 IEEE International Forensic Speech Science. In I. Freckelton & H. Conference on, 4832-4835. Selby (Eds.), Expert Evidence. Sydney, Australia: Enzinger, E., & Morrison, G. S. (2017). Empirical test Thomson Reuters. of the performance of an acoustic-phonetic Morrison, G. S., Sahito, F. H., Jardine, G., Djokic, D., approach to forensic voice comparison under Clavet, S., Berghs, S., & Dorny, C. G. (2016). conditions similar to those of a real case. Forensic INTERPOL survey of the use of speaker Science International, 277, 30-40. identification by law enforcement agencies. Enzinger, E., Morrison, G. S., & Ochoa, F. (2016). A Forensic Science International, 263, 92-100. demonstration of the application of the new Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. paradigm for the evaluation of forensic evidence (2000). Speaker verification using adapted under conditions reflecting those of a real forensic- Gaussian mixture models. Digital Signal voice-comparison case. Science & Justice, 56(1), Processing, 10(1-3), 19-41. 42-57. Robertson, B., & Vignaux, G. A. (1995). Interpreting Evett, I. W. (1998). Towards a uniform framework for Evidence: Evaluating Forensic Science in the reporting opinions in forensic science casework. Courtroom. Chichester: John Wiley. Science & Justice, 38(3), 198-202. Rose, P. (2006). Technical forensic speaker Gonzalez-Rodriguez, J., Rose, P., Ramos-Castro, D., recognition: Evaluation, types and testing of Toledano, D. T., & Ortega-Garcia, J. (2007). evidence. Computer Speech and Language, 20(2- Emulating DNA: Rigorous quantification of 3), 159-191. evidential weight in transparent and testable forensic speaker recognition. 
Ieee Transactions on Rose, P. (2013). More is better: Likelihood ratio- Audio Speech and Language Processing, 15(7), based forensic voice comparison with vocalic 2104-2115. segmental cepstra frontends. International Journal of Speech, Language and the Law, 20(1), 77-116. Kurematsu, A., Takeda, K., Sagisaka, Y., Katagiri, S., Kuwabara, H., & Shikano, K. (1990). Atr Japanese Rose, P. (2017). Likelihood ratio-based forensic voice Speech Database as a Tool of Speech Recognition comparison with higher level features: research and and Synthesis. Speech Communication, 9(4), 357- reality. Computer Speech and Language, 45, 475- 363. 502. Makinae, H., Osanai, T., Kamada, T., & Tanimoto, M. (2007). Construction and preliminary analysis of a large-scale bone-conducted speech database. Proceedings of the Institute of Electronics, 25 Development of Natural Language Processing Tools for Cook Islands Maori¯ Rolando Coto-Solano1, Sally Akevai Nicholas2, and Samantha Wray3 1School of Linguistics and Applied Language Studies Victoria University of Wellington Te Whare Wananga¯ o te Upoko¯ o te Ika a Maui¯ [email protected] 2School of Language and Culture Auckland University of Technology [email protected] 3Neuroscience of Language Lab New York University Abu Dhabi, United Arab Emirates [email protected] Abstract 1.1 Minority Languages and NLP Lack of resources makes it difficult to train data- This paper presents three ongoing projects driven NLP tools for smaller languages. This is for NLP in Cook Islands Maori:¯ Un- compounded by the difficulty in generating input trained Forced Alignment (approx. 
9% er- for Indigenous and endangered languages, where ror when detecting the center of words), dwindling numbers of speakers, non-standardized automatic speech recognition (37% WER writing systems and lack of resources to train spe- in the best trained models) and automatic cialist transcribers and analysts create a vicious part-of-speech tagging (92% accuracy for cycle that makes it even more difficult to take ad- the best performing model). These new re- vantage of NLP solutions. Amongst the hundreds sources fill existing gaps in NLP for the of languages of the Americas, for example, very language, including gold standard POS- few have large spoken and written corpora (e.g. tagged written corpora, transcribed speech Zapotec from Mexico, Guaran´ı from Paraguay and corpora, and time-aligned corpora down Quechua from Bolivia and Peru),´ some have spo- to the phoneme level. These are part of ken and written corpora, and only a handful pos- efforts to accelerate the documentation of sess more sophisticated tools like spell-checkers Cook Islands Maori¯ and to increase its vi- and machine translation (Mager et al., 2018). tality amongst its users. Overcoming these limitations is an important part of accelerating language documentation. Cre- 1 Introduction ating NLP resources also enhances the profile of endangered languages, creating a symbolic impact Cook Islands Maori¯ has been moderately well to generate positive associations towards the lan- documented with two dictionaries (Buse et al., guage and attract new learners, particularly young 1996; Savage, 1962), a comprehensive description members of the community who might not other- (Nicholas, 2017), and a corpus of audiovisual ma- wise see their language in a digital environment terials (Nicholas, 2012). However these materials (Aguilar Gil, 2014; Kornai, 2013). 
are not yet sufficiently organized or annotated so as to be machine readable and thus maximally useful for both scholarship and revitalization projects. The human resources needed to achieve the desired level of annotation are not available, which has encouraged us to take advantage of NLP methods to accelerate documentation and research.

Rolando Coto Solano, Sally Akevai Nicholas and Samantha Wray. 2018. Development of Natural Language Processing Tools for Cook Islands Māori. In Proceedings of the Australasian Language Technology Association Workshop, pages 26−33.

As for Polynesian languages, te reo Māori, the Indigenous language spoken in Aotearoa New Zealand, is the one that has received the most attention from the NLP community. It has multiple corpora, Google provides machine translations for it as part of Google Translate, and tools such as speech-to-text, text-to-speech and parsing are under development (Bagnall et al., 2017). Other languages in the family have also received some attention. For example, Johnson et al. (2018) have worked on forced alignment for Tongan.

1.2 Cook Islands Māori

Southern Cook Islands Māori¹ (ISO 639-3 rar, Glottolog raro1241) is an endangered East Polynesian language indigenous to the realm of New Zealand. It originates from the southern Cook Islands (see Figure 1). Today, however, most of its speakers reside in diaspora populations in Aotearoa New Zealand and Australia (Nicholas, 2018). The languages most closely related to Cook Islands Māori are the languages of Rakahanga/Manihiki (rkh, raka1237) and Penrhyn (pnh, penr1237), which originate from the northern Cook Islands. Te reo Māori (mri, maor1246) and Tahitian (tah, tahi1242), both East Polynesian languages, are also closely related. There is some degree of mutual intelligibility between these languages, but they are generally considered to be separate languages by linguists and community members alike.

[Figure 1: Cook Islands (CartoGIS Services et al., 2017)]

Cook Islands Māori is severely endangered (Nicholas, 2018, 46). Overall its vitality is between a 7 (shifting) and an 8a (moribund) on the Expanded Graded Intergenerational Disruption Scale (Lewis and Simons, 2010, 2). Among the diaspora population, and that on the island of Rarotonga, there has been a shift to English, and there is very little intergenerational transmission of Cook Islands Māori. The only contexts where the vitality is strong are within the small populations of the remaining islands of the southern Cook Islands, where Cook Islands Māori still serves as the lingua franca (see Nicholas (2018) for a full discussion).

1.3 Grammatical Structure

The following is a selection of grammatical features of Cook Islands Māori as described in Nicholas (2017). Cook Islands Māori is of the isolating type, with very few productive morphological processes. There is widespread polysemy, particularly within the grammatical particles. For example, in the following sentence, glossed using the Leipzig Glossing Rules (Bickel et al., 2008), there are four homophones of the particle i: the past tense marker, the cause preposition, the locative preposition and the temporal locative preposition.

(1) I mate a Mere i te mangō i roto i te moana i te Tapati.
    PST be-dead DET Mere CAUSE the shark LOC inside LOC the ocean LOC.TIME the Sunday
    'Mere was killed by the shark in the ocean on Sunday.'

Furthermore, nearly every lexical item that can occur in a verb phrase can also occur in a noun phrase without any overt derivation, making tasks like POS tagging more difficult (see section 2.3). The unmarked constituent order is predicate initial. There are verbal and non-verbal predicate types.
In sentences with verbal predicates the unmarked order is VSO. The phoneme paradigm is small, as is typical for Polynesian languages, with 9 consonants and 5 vowels, which have a phonemic length distinction.

¹ Southern Cook Islands Māori (henceforth Cook Islands Māori) has historically been called Rarotongan by non-Indigenous scholars. However, this name is disliked by the speech community and should not be used to refer to Southern Cook Islands Māori, but rather to the specific variety originating from the island of Rarotonga (Nicholas, 2018:36).

2 CIM NLP Projects Under Development

There is a need to accelerate the documentation of the endangered Cook Islands Māori language, so that it can be revitalized in Rarotonga and Aotearoa New Zealand and its usage domains can be expanded in the islands where it is still the lingua franca. We have begun working on this through three projects: (i) we have used untrained forced speech alignment to generate correspondences between transcriptions and their audio files, with the purpose of improving phonetic and phonological documentation; (ii) we are training speech-to-text models to automate the transcription of both legacy material and recordings made during linguistic fieldwork; and (iii) we are developing an interface-accompanied part-of-speech tagger as a first step towards full automatic parsing of the language. The following subsections provide details about the current state of each project.

2.1 Untrained Forced Alignment

Forced alignment is a technique that matches the sound wave with its transcription, down to the word and phoneme levels (Wightman and Talkin, 1997). This makes the work of generating alignment grids up to 30 times faster than manual processing (Labov et al., 2013). In theory, forced alignment needs a language-specific model to function (e.g. an English language model aligning English text). However, untrained forced alignment, where for example Cook Islands Māori audio recordings and transcriptions are processed using an English language model, has been proven to be useful for the alignment of text and audio in Indigenous and under-resourced languages (DiCanio et al., 2013). An English acoustic model was used in this way to align previously transcribed recordings of Cook Islands Māori speech. Figure 2 shows the output of this process, a time-aligned transcription of the audio in the Praat (Boersma et al., 2002) TextGrid format.

[Figure 2: Praat TextGrid for Cook Islands Māori force-aligned text]

After the automatic TextGrids were generated, 1045 phonemic tokens (628 vowels, 298 consonants, 119 glottal stops) and the words containing them were hand-corrected to verify the accuracy of the automatic system. Table 2 shows a summary of the results. The alignment showed error rates of 9% for the center of words and 20% for the center of vowels². This error is higher than that observed for other instances of untrained forced alignment (2%, 7% and 3% for the Central American languages Bribri, Cabécar and Malecu, respectively (Coto-Solano and Flores Solórzano, 2016, 2017)), but it provides significant improvements in speed over hand alignment.

Table 2: Error relative to the duration of the interval
    Words:  9% ± 12%
    Vowels: 20% ± 25%

In order to use this technique, a dictionary from Cook Islands Māori to English Arpabet was built, so that the Cook Islands Māori words could be introduced as new English words into the alignment tool. Some examples are shown in Table 1.

Table 1: Arpabet conversion of Cook Islands Māori data
    kite    K IY1 T EH1
    ngā‘ī   NG AE1 T IY1

The phones of Cook Islands Māori words were matched with the closest English phone in the Arpabet system (e.g. the T in kite 'to know').
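The conversion in Table 1 can be sketched as a small greedy grapheme-to-Arpabet converter. This is an illustrative sketch, not the authors' actual dictionary: only the two Table 1 entries and the glottal-stop and long-vowel substitutions used in the paper are taken from the source, and the rest of the phone table is my own assumption.

```python
# Sketch of a Cook Islands Maori -> Arpabet converter for untrained forced
# alignment. Only the Table 1 entries (kite, ngā‘ī), the glottal stop -> T
# substitution and the long-vowel treatment come from the paper; the rest of
# this phone table is an illustrative assumption, not the authors' dictionary.
CIM_TO_ARPABET = {
    "ng": "NG", "p": "P", "t": "T", "k": "K", "m": "M", "n": "N",
    "r": "R", "v": "V", "‘": "T",          # glottal stop mapped to Arpabet T
    "a": "AA1", "e": "EH1", "i": "IY1", "o": "AO1", "u": "UW1",
    "ā": "AE1", "ē": "EY1", "ī": "IY1", "ō": "OW1", "ū": "UW1",
}

def to_arpabet(word):
    """Greedily convert a word, trying the digraph 'ng' before single letters."""
    phones, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in CIM_TO_ARPABET:   # digraph match
            phones.append(CIM_TO_ARPABET[pair])
            i += 2
        elif word[i] in CIM_TO_ARPABET:                 # single-letter match
            phones.append(CIM_TO_ARPABET[word[i]])
            i += 1
        else:
            i += 1                                      # skip unmappable characters
    return " ".join(phones)

print(to_arpabet("kite"))   # K IY1 T EH1  (as in Table 1)
print(to_arpabet("ngā‘ī"))  # NG AE1 T IY1 (as in Table 1)
```

Running every word of a transcript through such a converter yields the pronouncing-dictionary entries that let an English aligner treat Cook Islands Māori words as out-of-vocabulary English words.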
Some Cook Islands Māori phones were not available in English: long Cook Islands Māori vowels were replaced by the equivalent accented vowel in English, and the glottal stop was replaced by the Arpabet phone T, as in ngā‘ī 'place'. The alignment itself was run with the English-language acoustic model from FAVE-align (Rosenfelder et al., 2014).

² We are currently documenting phonetic variation in vowels using data that was first force-aligned and then manually corrected. Because of this, we do not know at this point whether the 20% error is due to phonetic differences between English and Cook Islands Māori vowels, or whether it is due to the data itself.

Utilization of forced alignment as a method for documentation and phonetic research has already been demonstrated for Austronesian languages. In Nicholas and Coto-Solano (2018), Cook Islands Māori phonemic tokens including vowels, consonants and glottal stops were force-aligned and manually corrected to study the glottal stop phoneme. This work was able to show that in islands such as ‘Atiu, whose dialect is reported as having lost the glottal stop, the phoneme survives as a shorter stop or as creaky voice. The corrected Praat TextGrids are publicly available at PARADISEC (Thieberger and Barwick, 2012), in the collection of Nicholas (2012).

2.2 Automatic Speech Recognition

We have begun the training of an Automatic Speech Recognition (ASR) system using the Kaldi toolkit (Povey et al., 2011), both independently and through the ELPIS pipeline developed by CoEDL (Foley et al., 2018). While this work is still in progress, our preliminary results point to the need for custom models for each speaker. As can be seen in Table 3, our recordings produced models with very different per-speaker word error rates (the "all speakers" model has a cross-speaker test set). This, in addition to the paucity of data (approx. 80 minutes of speech for all speakers), makes the task extremely difficult.

Table 3: Word error rates of the ASR models (per speaker)
    Female, middle-aged, controlled environment:  37%
    Female, middle-aged, open environment:        55%
    Female, elderly, open environment:            62%
    Male, elderly, open environment:              68%
    All speakers:                                 64%

The recording with the best performance was made in a very controlled environment (a silent room with a TASCAM DR-1 recorder). The worst recordings were those of elderly speakers who were speaking in their living rooms with open windows. The main issue here is that it is precisely these kinds of recordings (open environments with elderly practitioners of traditional knowledge or tellers of traditional stories) that are of most interest to linguists and practitioners of language revitalization, and it is in those environments where we see our worst performance. More work in this area is needed (e.g. crowdsourcing the recording of fixed phrases from numerous speakers for more reliable training).

2.3 Part-of-Speech Tagging

We have begun developing automatic part-of-speech (POS) tagging to aid in linguistic research on the syntax of Cook Islands Māori, and to build towards a full parser of the language. To begin our work we hand-tagged 418 sentences (2916 words) using the part-of-speech tags from Nicholas (2017). This is the only POS-annotated corpus of Cook Islands Māori existing to date. The corpus is currently being prepared for public release.

The corpus is annotated using two levels of tagging: a shallow/broad level with 23 tags, and a second, narrower level with 70 tags. For example, the shallow level contains tags like v for verbs, n for nouns and tam for tense-aspect-mood particles. The narrower level provides further detail for each tag. For example, it separates the verbs into v:vt for transitive verbs, v:vi for intransitives, v:vstat for stative verbs, and v:vpass for passives.

Classification experiments were carried out in the WEKA environment for machine learning (Hall et al., 2009). We tested algorithms that we believed would cope with the sparseness of the data given the size of the corpus. These were: D. Tree (J48), an open-source Java-based extension of the C4.5 decision tree algorithm (Quinlan, 2014), and R. Forest, a merger of random decision trees (with 100 iterations). Included as a reference baseline is a zeroR classification algorithm, which predicts the majority POS class for all words. All algorithms were evaluated by splitting the entire corpus of 2916 words into a 90% set of sentences for training and a 10% set for testing.

The model used position-dependent word context features for classification of each word's POS. These included:

• the word (w)
• the previous word (w-1)
• the word before the previous word (w-2)
• the word two before the previous word (w-3)
• the following word (w+1)

A comparison of the classification algorithms using the features above is shown in Table 4.

Table 4: Accuracy of POS tagging models. Performance is reported in accuracy (per token).
    Shallow/broad POS tags
        D. Tree (J48):        87.59%
        R. Forest (I = 100):  92.41%
        zeroR (baseline):     21.03%
    Narrow POS tags
        D. Tree (J48):        80.00%
        R. Forest (I = 100):  82.41%
        zeroR (baseline):     15.52%

Table 5: Most common POS tagger errors for shallow/broad tags for the top-performing tagger (Random Forest)
    assigned tag ⇒ correct tag        % of errors
    n (noun) ⇒ v (verb)               23%
    tam (tense-aspect-mood) ⇒ prep     9%
    prep ⇒ tam (tense-aspect-mood)     5%
    v (verb) ⇒ n (noun)                5%

As seen in Table 4, the current top-performing classifier that we have identified is a Random Forest classifier. This algorithm performs best when the POS tags are less informative; that is, it performs best on the shallow/broad tags, with an accuracy of 92.41%.
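As an illustration only, the context-window features listed above and the zeroR baseline can be sketched in plain Python. The tiny tagged corpus below is invented (the real corpus is 418 sentences and 2916 words), the experiments themselves were run in WEKA, and a Random Forest over one-hot encodings of these same features would reproduce the main classifier.

```python
from collections import Counter

# Invented toy POS-tagged sentences; tag abbreviations follow the paper
# (v = verb, n = noun, tam = tense-aspect-mood, prep), with det assumed here.
TAGGED = [
    [("I", "tam"), ("mate", "v"), ("a", "det"), ("Mere", "n")],
    [("te", "det"), ("mangō", "n"), ("i", "prep"), ("roto", "n"),
     ("i", "prep"), ("te", "det"), ("moana", "n")],
]

def features(sent, i):
    """Position-dependent context features for the word at index i:
    w, w-1, w-2, w-3 and w+1, mirroring the feature list above."""
    words = [w for w, _ in sent]
    def pad(j):
        return words[j] if 0 <= j < len(words) else "<PAD>"
    return {"w": pad(i), "w-1": pad(i - 1), "w-2": pad(i - 2),
            "w-3": pad(i - 3), "w+1": pad(i + 1)}

# zeroR reference baseline: always predict the majority tag.
tags = [tag for sent in TAGGED for _, tag in sent]
majority_tag = Counter(tags).most_common(1)[0][0]
zeror_accuracy = sum(tag == majority_tag for tag in tags) / len(tags)

print(features(TAGGED[1], 1))        # context window around "mangō"
print(majority_tag, zeror_accuracy)
```

On a corpus this small the baseline is dominated by whichever tag happens to be most frequent; on the real corpus the same computation yields the 21.03% (shallow) and 15.52% (narrow) baselines of Table 4.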
Despite the fact that the narrow tags do not collapse across types and are therefore more difficult to classify, the best performer for the narrow tags is also the Random Forest classifier. This performance is comparable to other POS tagging tasks for under-resourced languages for which a new minimal dataset was manually tagged as the sole input for training, such as 71–78% for Kinyarwanda and Malagasy using a Hidden Markov Model (Garrette and Baldridge, 2013). It is also comparable to POS tagging for related languages with relatively larger corpora, such as Indonesian (94% accuracy, with 355,000 annotated tokens) (Fu et al., 2018).

To obtain an assessment of directions for future work aimed at improving the model, we also determined the most commonly confused tags by consulting a confusion matrix. The most common errors for the top performer (shallow/broad tags as classified by the Random Forest) are seen in Table 5. Recall from section 1.3 that, grammatically, lexical items which occur in a noun phrase can also occur in a verb phrase with no overt derivational marking. This explains the fact that the majority of errors occurred as the result of confusion between v and n.

After training the model, we built a JavaServer Pages (JSP) interface to demo the model and obtain POS-tagged versions of new, raw untagged sentences. This is illustrated in Figure 3 below. The interface is in the process of being prepared for public launch.

[Figure 3: Interface for the POS tagger]

The assignation of parts of speech is a very difficult task in Cook Islands Māori not only because of its data-driven nature, but also because the orthography of Cook Islands Māori has not been fixed until recently. As detailed in Nicholas (2013), historically and even today there are numerous variations in spelling. The glottal stop is frequently omitted in spelling, as are the macrons for vocalic length. As a result, words like ‘e 'nominal predicate marker' can be written as ‘ē or e, increasing the homography of the language. Figure 4 shows the JSP interface attempting to tag two variants of the greeting ‘Aere mai 'Welcome'. The first is spelled according to the current orthographic guidelines, with a glottal stop character at the beginning, and the first word gets tagged correctly as an intransitive verb. The second word, however, is spelled without the glottal stop, which leads the model to misidentify it as a noun. Future work includes experiments on how best to tackle this problem by utilizing naturally occurring variation written in spontaneous orthographies, looking at solutions already found for languages with non-standardized writing systems such as colloquial Arabic (Wray et al., 2015).

[Figure 4: Error when tagging words without a glottal stop]

2.4 Language Revitalization Applications

In addition to the scholarly advances these projects will support, they also have utility for the revitalization of Cook Islands Māori. The forced alignment work will support the teaching of the phonetics, phonology and 'pronunciation' of the language, as well as improve community understanding of variation. The consumption of audiovisual material with closed captions in the target language is known to be beneficial for language learning (Vanderplank, 2016), and automatic speech recognition will greatly increase the quantity of such material available to learners. The increased size of the searchable corpus of Cook Islands Māori will facilitate the production of corpus-based and multimedia pedagogical materials. Similarly, the POS-tagged data will provide further benefits for the accurate design of pedagogical materials and the linguistic training of teachers who speak Cook Islands Māori. Additional tools that can be developed using this well-annotated corpus include text-to-speech technology and chatbot applications.

3 Future work

There is much work to be done to move forward in these three projects. The next step for the POS tagging is to add resilience to cope with the non-standardized writing it will most commonly encounter. As for the speech recognition, crowdsourcing similar to that carried out by Bagnall et al. (2017) might help increase the amount of controlled, pre-transcribed audio available for speech recognition training. Furthermore, we are investigating the incorporation of algorithms for the treatment of audio data with a low signal-to-noise ratio to improve audio quality, which should lower the high word error rate for recordings conducted in an open, noisy environment.

4 Conclusions

This paper summarizes ongoing work in the application of NLP techniques to Cook Islands Māori, including untrained forced alignment for producing time-aligned transcriptions down to the phoneme, automatic speech recognition for the production of transcriptions from recordings of speech, and part-of-speech tagging for producing tagged text. We have given an overview of the state of these projects, and presented ideas for future work in this area. We believe that this interdisciplinary work can not only accelerate and enhance the documentation of the language, but can ultimately bring more of the Cook Islands community into contact with its language and help in its revitalization.

Acknowledgments

We wish to thank Dr. Ben Foley and Dr. Miriam Meyerhoff of CoEDL for their support in numerous aspects of this project, as well as three anonymous reviewers for their helpful comments. We also wish to thank Jean Mason, Director of the Rarotonga Library, Teata Ateriano, Principal of the School of Ma'uke, and Dr. Tyler Peterson from Arizona State University for their continued support in the documentation of Cook Islands Māori.

References

Yásnaya Elena Aguilar Gil. 2014. ¿Para qué publicar libros en lenguas indígenas si nadie los lee? E'px. http://archivo.estepais.com/site/2014/para-quepublicar-libros-en-lenguas-indigenas-si-nadie-los-lee/.

Douglas Bagnall, Keoni Mahelona, and Peter-Lucas Jones. 2017. Kōrero Māori: A serious attempt at speech recognition for te reo Māori. Paper presented at the 2017 conference of the NZ Linguistic Society, Auckland.

Balthasar Bickel, Bernard Comrie, and Martin Haspelmath. 2008. The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses. Revised version of February.

Paul Boersma et al. 2002. Praat, a system for doing phonetics by computer. Glot International 5.

Jasper E. Buse, Raututi Taringa, Bruce Biggs, and Rangi Moeka‘a. 1996. Cook Islands Maori Dictionary with English–Cook Islands Maori Finderlist. Pacific Linguistics, Research School of Pacific and Asian Studies, Australian National University, Canberra, ACT.

CartoGIS Services, College of Asia and the Pacific, The Australian National University. 2017. Cook Islands. http://asiapacific.anu.edu.au/mapsonline/base-maps/cook-islands [Accessed 2017-10-06].

Rolando Coto-Solano and Sofía Flores Solórzano. 2017. Comparison of two forced alignment systems for aligning Bribri speech. CLEI Electronic Journal 20(1):2–1.

Rolando Coto-Solano and Sofía Flores Solórzano. 2016. Alineación forzada sin entrenamiento para la anotación automática de corpus orales de las lenguas indígenas de Costa Rica. Káñina 40(4):175–199.

Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, and Ivan Meza. 2018. Challenges of language technologies for the indigenous languages of the Americas. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018).

Sally Nicholas. 2017. Ko te Karāma o te Reo Māori o te Pae Tonga o Te Kuki Airani: A Grammar of Southern Cook Islands Māori. Ph.D. thesis, University of Auckland.

Sally Akevai Nicholas. 2012. Te Vairanga Tuatua o te Te Reo Māori o te Pae Tonga: Cook Islands Māori (Southern dialects) (SN1).
Digital collection managed by PARADISEC. [Open Access] DOI: 10.4225/72/56E9793466307. http://catalog.paradisec.org.au/collections/SN1.

Christian DiCanio, Hosung Nam, Douglas H. Whalen, H. Timothy Bunnell, Jonathan D. Amith, and Rey Castillo García. 2013. Using automatic alignment to analyze endangered language data: Testing the viability of untrained alignment. The Journal of the Acoustical Society of America 134(3):2235–2246.

Ben Foley, Josh Arnold, Rolando Coto-Solano, Gautier Durantin, T. Mark Ellison, Daan van Esch, Scott Heath, Frantisek Kratochvil, Zara Maxwell-Smith, David Nash, Ola Olsson, Mark Richards, Nay San, Hywel Stoakes, Nick Thieberger, and Janet Wiles. 2018. Building speech recognition systems for language documentation: The CoEDL Endangered Language Pipeline and Inference System (ELPIS). In 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages.

Sihui Fu, Nankai Lin, Gangqin Zhu, and Shengyi Jiang. 2018. Towards Indonesian part-of-speech tagging: Corpus and models. In LREC.

Dan Garrette and Jason Baldridge. 2013. Learning a part-of-speech tagger from two hours of annotation. In Proceedings of NAACL 2013, Atlanta, USA, pages 138–147.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter 11(1):10–18.

Lisa M. Johnson, Marianna Di Paolo, and Adrian Bell. 2018. Forced alignment for understudied language varieties: Testing Prosodylab-Aligner with Tongan data. Language Documentation & Conservation.

András Kornai. 2013. Digital language death. PLoS ONE 8(10):e77056.

William Labov, Ingrid Rosenfelder, and Josef Fruehwald. 2013. One hundred years of sound change in Philadelphia: Linear incrementation, reversal, and reanalysis. Language 89(1):30–65.

M. Paul Lewis and Gary F. Simons. 2010. Making EGIDS assessments for the Ethnologue. www.icc.org.kh/download/Making EGIDS-Assessments English.pdf [Accessed 2017-07-20].

Sally Akevai Nicholas. 2018. Language contexts: Te Reo Māori o te Pae Tonga o te Kuki Airani, also known as Southern Cook Islands Māori. Language Documentation and Description 15:36–64.

Sally Akevai Nicholas and Rolando Coto-Solano. 2018. Using untrained forced alignment to study variation of glottalization in Cook Islands Māori. In NWAV-AP5. University of Brisbane.

Sally Akevai Te Namu Nicholas. 2013. Orthographic reform in Cook Islands Māori: Human considerations and language revitalisation implications. 3rd International Conference on Language Documentation and Conservation (ICLDC).

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, EPFL-CONF-192584.

J. Ross Quinlan. 2014. C4.5: Programs for Machine Learning. Elsevier.

Ingrid Rosenfelder, Josef Fruehwald, Keelan Evanini, Scott Seyfarth, Kyle Gorman, Hilary Prichard, and Jiahong Yuan. 2014. FAVE (Forced Alignment and Vowel Extraction) Program Suite.

Stephen Savage. 1962. A Dictionary of the Maori Language of Rarotonga. New Zealand Department of Island Territories, Wellington.

Nicholas Thieberger and Linda Barwick. 2012. Keeping records of language diversity in Melanesia: The Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC). In Melanesian Languages on the Edge of Asia: Challenges for the 21st Century, pages 239–53.

Colin W. Wightman and David T. Talkin. 1997. The aligner: Text-to-speech alignment using Markov models. In Progress in Speech Synthesis, Springer, pages 313–323.
Robert Vanderplank. 2016. Effects of and effects with captions: How exactly does watching a TV programme with same-language subtitles make a difference to language learners? Language Teaching 49(2):235.

Samantha Wray, Hamdy Mubarak, and Ahmed Ali. 2015. Best practices for crowdsourcing dialectal Arabic speech transcription. In Proceedings of the Workshop on Arabic Natural Language Processing.

Unsupervised Mining of Analogical Frames by Constraint Satisfaction

Lance De Vine    Shlomo Geva    Peter Bruza
Queensland University of Technology
Brisbane, Australia
{l.devine, s.geva, p.bruza}@qut.edu.au

Abstract

It has been demonstrated that vector-based representations of words trained on large text corpora encode linguistic regularities that may be exploited via the use of vector space arithmetic. This capability has been extensively explored and is generally measured via tasks which involve the automated completion of linguistic proportional analogies. The question remains, however, as to what extent it is possible to induce relations from word embeddings in a principled and systematic way, without the provision of exemplars or seed terms. In this paper we propose an extensible and efficient framework for inducing relations via the use of constraint satisfaction. The method is efficient, unsupervised and can be customized in various ways. We provide both quantitative and qualitative analysis of the results.

1 Introduction

The use and study of analogical inference and structure has a long history in linguistics, logic, cognitive psychology, scientific reasoning and education (Bartha, 2016), amongst others. The use of analogy has played an especially important role in the study of language, language change, and language acquisition and learning (Kiparsky, 1992). It has been a part of the study of phonology, morphology, orthography and syntactic grammar (Skousen, 1989), as well as the development of applications such as machine translation and paraphrasing (Lepage and Denoual, 2005).

Recent progress with the construction of vector space representations of words based on their distributional profiles has revealed that analogical structure can be discovered and operationalised via the use of vector space algebra (Mikolov et al., 2013b). There remain many questions regarding the extent to which word vectors encode analogical structure, and also the extent to which this structure can be uncovered. For example, we are not aware of any proposal or system that is focussed on the unsupervised and systematic discovery of analogies from word vectors that does not make use of exemplar relations, existing linguistic resources or seed terms. The automatic identification of linguistic analogies, however, offers many potential benefits for a diverse range of research and applications, including language learning and computational creativity.

Computational models of analogy have been studied since at least the 1960s (Hall, 1989; French, 2002) and have addressed tasks relating to both proportional and structural analogy. Many computational systems are built as part of investigations into how humans might perform analogical inference (Gentner and Forbus, 2011). Most make use of Structure Mapping Theory (SMT) (Gentner, 1983), or a variation thereof, which maps one relational system to another, generally using a symbolic representation. Other systems use vector space representations constructed from corpora of natural language text (Turney, 2013). The analogies that are computed using word embeddings have primarily been proportional analogies and are closely associated with the prediction of relations between words. For example, a valid semantic proportional analogy is "cat is to feline as dog is to canine", which can be written as "cat : feline :: dog : canine".

In linguistics, proportional analogies have been extensively studied in the context of both inflectional and derivational morphology (Blevins, 2016). Proportional analogies are used as part of an inference process to fill the cells/slots in a word paradigm. A paradigm is an array of morphological variations of a lexeme. For example, {cat, cats} is a simple singular-noun, plural-noun paradigm in English. Word paradigms exhibit inter-dependencies that facilitate the inference of new forms, and for this reason have been studied within the context of language change. The informativeness of a form correlates with the degree to which knowledge of the form reduces uncertainty about other forms within the same paradigm (Blevins et al., 2017).

Lance De Vine, Shlomo Geva and Peter Bruza. 2018. Unsupervised Mining of Analogical Frames by Constraint Satisfaction. In Proceedings of the Australasian Language Technology Association Workshop, pages 34−43.

In this paper we propose a construction which we call an analogical frame. It is intended to elicit associations with the terms semantic frame and proportional analogy. It is an extension of a linguistic analogy in which the elements satisfy certain constraints that allow them to be induced in an unsupervised manner from natural language text. We expect that analogical frames will be useful for a variety of purposes relating to the automated induction of syntactic and semantic relations and categories.

The primary contributions of this paper are twofold:

1. We introduce a generalization of proportional analogies with word embeddings which we call analogical frames.

2. We introduce an efficient constraint satisfaction based approach to inducing analogical frames from natural language embeddings in an unsupervised fashion.

In section 2 we present background and related research. In section 3 we present and explain the proposal of analogical frames. In section 4 we present methods implemented for ensuring search efficiency of analogical frames. In section 5 we present some analysis of empirical results. In section 6 we present discussion of the proposal, and in section 7 we conclude.

2 Background and Related Work

2.1 Proportional Analogies

A proportional analogy is a 4-tuple which we write as x1 : x2 :: x3 : x4 and read as "x1 is to x2 as x3 is to x4", with the elements of the analogy belonging to some domain X (we use this notation as it is helpful later). From here on we will use the term "analogy" to refer to proportional analogies unless indicated otherwise. Analogies can be defined over different types of domains, for example, strings, geometric figures, numbers, vector spaces, images, etc. Stroppa and Yvon (2005) propose a definition of proportional analogy over any domain which is equipped with an internal composition law ⊕ making it a semi-group (X, ⊕). This definition also applies to any richer algebraic structure such as groups or vector spaces. In R^n, given x1, x2 and x3, there is always only one point that can be assigned to x4 such that proportionality holds. Miclet et al. (2008) define a relaxed form of analogy which reads as "x1 is to x2 almost as x3 is to x4". To accompany this they introduce a measure of analogical dissimilarity (AD), a positive real value which takes the value 0 when the analogy holds perfectly. A set of four points in R^n can therefore be scored for analogical dissimilarity and ranked.

2.2 Word Vectors and Proportional Analogies

The background just mentioned provides a useful context within which to place the work on linguistic regularities in word vectors (Mikolov et al., 2013b; Levy et al., 2014). Mikolov et al. (2013b) showed that analogies can be completed using vector addition of word embeddings. This means that given x1, x2 and x3 it is possible to infer the value of x4. This is accomplished with the vector offset formula, or 3COSADD (Levy et al., 2014):

    argmax_{x4}  s(x4, x2 + x3 − x1)        (3COSADD)

The s in 3COSADD is a similarity measure. In practice, unit vectors are generally used with cosine similarity. Levy et al. (2014) introduced an expression, 3COSMUL, which tends to give a small improvement when evaluated on analogy completion tasks:

    argmax_{x4}  s(x4, x3) · s(x4, x2) / (s(x4, x1) + ε)        (3COSMUL)

where ε is a small constant that prevents division by zero. 3COSADD and 3COSMUL are effectively scoring functions that are used to judge the correctness of a value for x4 given values for x1, x2 and x3.

2.3 Finding Analogies

Given that it is possible to complete analogies with word vectors, it is natural to ask whether analogies can be identified without being given x1, x2 and x3. Stroppa and Yvon (2005) consider an analogy to be valid when analogical proportions hold between all terms in the analogy. They describe a finite-state solver which searches for formal analogies in the domain of strings and trees.
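The 3COSADD and 3COSMUL scoring functions defined in section 2.2 can be sketched in code. This is a toy sketch: the four-word vocabulary, the embedding values and the function names are all invented, and, as is common practice, the three query terms are excluded from the candidate set.

```python
import math

# Toy word vectors, invented purely for illustration.
EMB = {
    "cat":    [0.9, 0.1, 0.1],
    "feline": [0.8, 0.5, 0.1],
    "dog":    [0.1, 0.9, 0.1],
    "canine": [0.1, 0.8, 0.5],
}

def cos(u, v):
    """Cosine similarity, the similarity measure s used in both formulas."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def candidates(exclude):
    return (w for w in EMB if w not in exclude)

def three_cos_add(x1, x2, x3):
    """argmax_x4 s(x4, x2 + x3 - x1): vector-offset completion of x1:x2::x3:x4."""
    target = [b + c - a for a, b, c in zip(EMB[x1], EMB[x2], EMB[x3])]
    return max(candidates({x1, x2, x3}), key=lambda w: cos(EMB[w], target))

def three_cos_mul(x1, x2, x3, eps=0.001):
    """argmax_x4 s(x4, x3) * s(x4, x2) / (s(x4, x1) + eps)."""
    return max(candidates({x1, x2, x3}),
               key=lambda w: cos(EMB[w], EMB[x3]) * cos(EMB[w], EMB[x2])
                             / (cos(EMB[w], EMB[x1]) + eps))

print(three_cos_add("cat", "feline", "dog"))  # canine
print(three_cos_mul("cat", "feline", "dog"))  # canine
```

With a real embedding matrix these argmaxes are computed over the whole vocabulary at once with matrix–vector products, which is what makes the scoring functions cheap enough to use inside a larger search.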
Their computer program ACME (Analogical Constraint Mapping Engine) uses a connectionist network to balance x x 1 2 structural, semantic and pragmatic constraints for mapping relations. (Hummel and Holyoak, 1997) propose a computational model of analogical in- x3 x4 ference and schema induction using distributed patterns for representing objects and predicates. Figure 1: A Proportional Analogy template, with 4 (Doumas et al., 2008) propose a computational variables to be assigned appropriate values model which provides an account of how struc- tured relation representations can be learned from considered for making this problem tractable. The unstructured data. More specifically to language first observation is that symmetries of an analogy acquisition, (Bod, 2009) uses analogies over trees should not be recomputed (Lepage, 2014). This to derive and analyse new sentences by combin- can reduce the compute time by 8 for a single anal- ing fragments of previously seen sentences. The ogy as there are 8 symmetric configurations. proposed framework is able to replicate a range of Another proposed strategy is the construction of phenomena in language acquisition. a feature tree for the rapid computation and anal- From a more cognitive perspective (Kurtz et al., ysis of tuple differences over vectors with binary 2001) investigates how mutual alignment of two attributes (Lepage, 2014). This method was used situations can create better understanding of both. to discover analogies between images of Chinese Related to this (Gentner, 2010), and many others, characters. This has complexity O(n2). It com- argue that analogical ability is the key factor in hu- n(n+1) man cognitive development. putes the 2 vectors between pairs of tuples and collects them together into clusters contain- (Turney, 2006, 2013) makes extensive investi- ing the same difference vector. 
A variation of this gations of the use of corpus based methods for method was reported in (Fam and Lepage, 2016) determining relational similarities and predicting and (Fam and Lepage, 2017) for automatically dis- analogies. covering analogical grids of word paradigms us- (Miclet and Nicolas, 2015) propose the concept ing edit distances between word strings. It is not of an analogical complex which is a blend of ana- immediately obvious, however, how to extend this logical proportions and formal concept analysis. to the case of word embeddings where differences More specifically in relation to word embed- between word representations are real valued vec- dings, (Zhang et al., 2016) presents an unsu- tors. pervised approach for explaining the meaning of A related method is used in (Beltran et al., 2015) terms via word vector comparison. for identifying analogies in relational databases. In the next section we describe an approach It is less constrained as it uses analogical dis- which addresses the task of inducing analogies in similarity as a metric when determining valid an unsupervised fashion from word vectors and analogies. builds on existing work relating to word embed- (Langlais, 2016) extend methods from (Lepage, dings and linguistic regularities. 36 3 Analogical Frames of variables to the words which are within the nearest neighbourhood of the words to which they The primary task that we address in this paper is are connected. We define this as: the discovery of linguistic proportional analogies in an unsupervised fashion given only the distri- xi ∈ Neight(xj) and xj ∈ Neight(xi) butional profile of words. The approach that we take is to consider the problem as a constraint sat- where Neight(xi) is the nearest neighbourhood isfaction problem (CSP) (Rossi et al., 2006). 
We of t words of the the word assigned to xi, as increase the strength of constraints until we can measured in the vector space representation of the accurately decide when a given set of words and words. their embeddings forms a valid proportional anal- C4. Parallel Constraint. The Parallel Con- ogy. At this point we introduce some terminology: straint forces opposing difference vectors to have Constraint satisfaction problems are generally a minimal degree of parallelism. defined as a triple P = hX,D,Ci where x\2 − x1 · x\5 − x4 < pT hreshold X = {x1, ..., xn} is a set of variables. D = {d1, ..., dn} is the set of domains associated where pThreshold is a parameter. For the paral- with the variables. lel constraint we ensure that the difference vector C = {c , ..., c } is the set of constraints to be sat- 1 m x2 − x1 has a minimal cosine similarity to the isfied. difference vector x5 − x1. This constraint over- A solution to a constraint satisfaction prob- laps to some extent with the proportionality con- lem must assign a single value to each variable straint (below), however it serves a different pur- x1, ..., xn in the problem. There may be multiple pose, which is to eliminate low probability candi- solutions. In our problem formulation there is only date analogies. one domain which is the set of word types which C5. Proportionality Constraint. The Propor- comprise the vocabulary. Each variable must be tionality Constraint constrains the vector space assigned a word identifier and each word is as- representation of words to form approximate ge- sociated with one or more vector space represen- ometric proportional analogies. It is a quarternary tations. Constraints on the words and associated constraint. For any given 4-tuple, we use the con- vector space representation limit the values that cept of “inter-predictability” (Blevins et al., 2017) the variables can take. From here-on we will we to decide whether the 4-tuple is acceptable. 
We use the bolded symbol xi to indicate the vector enforce inter-predictability by requiring that each space value of the word assigned to the variable term in a 4-tuple is predicted by the other three xi. terms. This implies four analogy completion tasks In our proposal we use the following five which must be satisfied for the variable assign- constraints. ment to be accepted. x1 : x2 :: x3 : x ⇒ x = x4 C1. AllDiff constraint. The Alldiff constraint x2 : x1 :: x4 : x ⇒ x = x3 constrains all terms of the analogy to be distinct x3 : x4 :: x1 : x ⇒ x = x2 (The Alldiff constraint is a common constraint in x4 : x3 :: x2 : x ⇒ x = x1 CSPs), such that xi 6= xj for all 1 < i < j < n. C2. Asymmetry constraint. The asymmetry con- (Stroppa and Yvon, 2005) use a similar ap- straint (Meseguer and Torras, 2001) is used to proach with exact formal analogies. With our eliminate unnecessary searches in the search tree. approach, however, we complete analogies using It is defined as a partial ordering on the values of word vectors and analogy completion formulas a subset of the variables. In the case of a 2x3 ana- (eg. using 3COSADD or 3COSMUL, or a deriva- logical frame (figure2), for example, we define the tive). ordering as: The proportionality constraint is a relatively ex- x1 ≺ x2 ≺ x3, and x1 ≺ x4. pensive constraint to enforce as it requires many where the ordering is defined on the integer vector-vector operations and comparison against identifiers of the words in the vocabulary. all word vectors in the vocabulary. This constraint C3. Neighbourhood Constraint. The neigh- may be checked approximately which we discuss bourhood constraint is used to constrain the value in the next section. 37 3.1 The Insufficiency of 2x2 Analogies When we are not given any prior knowledge, or When we experiment with discovering 2x2 analo- exemplar, there is no privileged direction within gies using the constraints just described we find an analogical frame. 
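To make the completion formulas and the inter-predictability check concrete, they can be sketched in a few lines of NumPy. This is our own minimal illustration, not the paper's implementation; the matrix `V` (one unit-normalized embedding per vocabulary row) and the exclusion of the query words from the argmax are assumptions.

```python
import numpy as np

def cos_add(V, x1, x2, x3):
    """3COSADD: return the row index that best completes x1 : x2 :: x3 : ?
    V holds one unit-normalized word vector per row."""
    target = V[x2] + V[x3] - V[x1]
    scores = V @ (target / np.linalg.norm(target))
    scores[[x1, x2, x3]] = -np.inf        # exclude the query words themselves
    return int(np.argmax(scores))

def cos_mul(V, x1, x2, x3, eps=0.001):
    """3COSMUL: multiplicative combination, with cosines shifted into [0, 1]."""
    s = lambda x: (V @ V[x] + 1.0) / 2.0
    scores = s(x3) * s(x2) / (s(x1) + eps)
    scores[[x1, x2, x3]] = -np.inf
    return int(np.argmax(scores))

def inter_predictable(V, x1, x2, x3, x4, complete=cos_add):
    """Constraint C5: every term of the 4-tuple must be predicted
    by the other three (the four completion tasks above)."""
    return (complete(V, x1, x2, x3) == x4 and
            complete(V, x2, x1, x4) == x3 and
            complete(V, x3, x4, x1) == x2 and
            complete(V, x4, x3, x2) == x1)
```

Running `inter_predictable` naively costs four full passes over the vocabulary, which is why the approximate check of section 4 matters.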
3.1 The Insufficiency of 2x2 Analogies

When we experiment with discovering 2x2 analogies using the constraints just described, we find that we cannot easily set the parameters of the constraints such that only valid analogies are produced, without severely limiting the types of analogies which we accept. For example, the following analogy is produced: "oldest : old :: earliest : earlier". We find that these types of mistakes are common. We therefore take the step of expanding the number of variables so that the state space is larger (figure 2).

[Figure 2: A 2x3 Analogical Frame, with variables x1, x2, x3 in the top row and x4, x5, x6 in the bottom row; numbered edges indicate aligned vector differences]

The idea is to increase the inductive support for each discovered analogy by requiring that analogies be part of a larger system of analogies. We refer to the larger system of analogies as an Analogical Frame. It is important to note that in figure 2, x1 and x3 are connected. The numbers associated with the edges indicate aligned vector differences. It is intended that the analogical proportions hold according to the connectivity shown. For example, proportionality should hold such that x1 : x2 :: x4 : x5 and x1 : x3 :: x4 : x6 and x2 : x3 :: x5 : x6. It is also important to note that while we have added only two new variables, we have increased the number of constraints by almost 3 times. It is not exactly 3 because there is some redundancy in the proportions.

3.2 New Formulas

We now define a modified formula for completing analogies that are part of a larger system of analogies, such as the analogical frame in figure 2. It can be observed that 3COSADD and 3COSMUL effectively assign a score to vocabulary items and then select the item with the largest score. We do the same, but with a modified scoring function which is a hybrid of 3COSADD and 3COSMUL. Either 3COSADD or 3COSMUL could be used as part of our approach, but the formula which we propose better captures the intuition of larger systems of analogies, where the importance is placed on 1) symmetry, and 2) average offset vectors.

When we are not given any prior knowledge, or exemplar, there is no privileged direction within an analogical frame. For example, in figure 2, the pair (x2, x5) has the same importance as (x4, x5). We first construct difference vectors and then average those that are aligned:

    dif_2,1 = (x2 − x1)^    dif_4,1 = (x4 − x1)^
    dif_5,4 = (x5 − x4)^    dif_5,2 = (x5 − x2)^
    dif_6,3 = (x6 − x3)^

    difsum1 = dif_2,1 + dif_5,4
    difsum2 = dif_4,1 + dif_5,2 + dif_6,3

    dif1 = difsum1 / |difsum1|
    dif2 = difsum2 / |difsum2|

where ^ denotes a unit-normalized vector. The vector dif1 is the normalized average offset vector indicated with a 1 in figure 2. The vector dif2 is the normalized average offset vector indicated with a 4 in figure 2.

Using these normalized average difference vectors, we define the scoring function for selecting x5 given x1, x2 and x4 as:

    argmax_{x5} s(x5, x4) · s(x5, dif1) + s(x5, x2) · s(x5, dif2)    (1)

The formulation makes use of all information available in the analogical frame. Previous work has provided much evidence for the linear compositionality of word vectors as embodied by 3COSADD (Vylomova et al., 2015; Hakami et al., 2017). It has also been known since (Mikolov et al., 2013a) that averaging the difference vectors of pairs exhibiting the same relation results in a difference vector with better predictive power for that relation (Drozd et al., 2016).

Extrapolating from figure 2, larger analogical frames can be constructed, such as 2x4, 3x3, or 2x2x3. Each appropriately connected 4-tuple contained within the frame should satisfy the proportionality constraint.

4 Frame Discovery and Search Efficiency

The primary challenge in this proposal is to efficiently search the space of variable assignments. We use a depth first approach to cover the search space, as well as several other strategies to make this search efficient.
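As a concrete illustration of the averaged-offset scoring of equation 1, the following sketch selects x5 given x1, x2 and x4 in a partially assigned 2x3 frame. This is our own simplification: with only those three variables assigned, each "average" offset contains a single difference vector, and `V` is assumed to hold unit-normalized word vectors.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def avg_offset(V, pairs):
    """Normalized average of the normalized difference vectors of the
    given aligned (a, b) index pairs, as in section 3.2."""
    return normalize(sum(normalize(V[b] - V[a]) for a, b in pairs))

def select_x5(V, x1, x2, x4):
    """Equation 1: score candidates for x5 given x1, x2 and x4, using the
    average offsets dif1 (direction x1 -> x2) and dif2 (direction x1 -> x4)."""
    dif1 = avg_offset(V, [(x1, x2)])   # only one aligned pair is known so far
    dif2 = avg_offset(V, [(x1, x4)])
    scores = (V @ V[x4]) * (V @ dif1) + (V @ V[x2]) * (V @ dif2)
    scores[[x1, x2, x4]] = -np.inf     # exclude the already-assigned words
    return int(np.argmax(scores))
```

As a frame grows, `avg_offset` would be called with all aligned pairs (e.g. `[(x1, x2), (x4, x5)]`), which is how extending a frame sharpens its offset vectors.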
4.1 Word Embedding Neighbourhoods

The most important strategy is the concentration of compute resources on the nearest neighbourhoods of word embeddings. Most analogies involve terms that are within the nearest neighbourhood of each other when ranked according to similarity (Linzen, 2016). We therefore compute the nearest neighbour graph of every term in the vocabulary as a pre-processing step and store the result as an adjacency list for each vocabulary item. We do this efficiently by using binary representations of the word embeddings (Jurgovsky et al., 2016) and re-ranking using the full precision word vectors. The nearest neighbour list of each vocabulary entry is used in two different ways: 1) for exploring the search space of variable assignments, and 2) for efficiently eliminating variable assignments that do not satisfy the proportional analogy constraints (next section).

When traversing the search tree we use the adjacency list of an already assigned variable to assign a value to a nearby variable in the frame. The breadth of the search tree at each node is therefore parameterized by a global parameter t assigned by the user. The parameter t is one of the primary parameters of the algorithm and determines the length of the search and the size of the set of discovered frames.

4.2 Elimination of Candidates by Sampling

A general principle used in most CSP solving is the quick elimination of improbable solutions. When we assign values to variables in a frame we need to make sure that the proportional analogy constraint is satisfied. For four given variable assignments w1, w2, w3 and w4 this involves making sure that w4 is the best candidate for completing the proportional analogy involving w1, w2 and w3. This is equivalent to knowing whether there are any better vocabulary entries than w4 for completing the analogy. Instead of checking all vocabulary entries, we can limit the list of entries to the m nearest neighbours of w2, w3 and w4 (assuming w1 is farthest away from w4, so that the nearest neighbours of w1 are not likely to help check the correctness of w4) and check whether any score higher than w4. If any of them score higher, the proportional analogy constraint is violated and the variable assignment is discarded. The advantage of this is that we only need to check 3 × m entries to approximately test the correctness of w4. We find that if we set m to 10 then most incorrect analogies are eliminated. An exhaustive check for correctness can be made as a post-processing step.

4.3 Other Methods

Another important method for increasing efficiency is testing the degree to which opposing difference vectors in a candidate proportional analogy are parallel to each other. This is encapsulated in constraint C4. While using parallelism as a parameter for solving word analogy problems has not been successful (Levy et al., 2014), we have found that the degree of parallelism is a good indicator of the confidence that we can have in the analogy once other constraints have been satisfied. Other methods employed to improve efficiency include the use of Bloom filters to test neighbourhood membership, and the indexing of discovered proportions so as not to repeat searching.

4.4 Extending Frames

Frames can be extended by extending the initial base frame in one or more directions and searching for additional variable assignments that satisfy all constraints. For example, a 2x3 frame can be extended to a 2x4 frame by assigning values to two new variables x7 and x8 (figure 3).

[Figure 3: Extending a 2x3 frame by adding a column with new variables x7 and x8]

As frames are extended, the average offset vectors (equation 1) are recomputed so that the offset vectors become better predictors for computing proportional analogies.
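The neighbourhood pre-computation (section 4.1) and the sampled elimination check (section 4.2) can be sketched as follows. This is a brute-force NumPy illustration of the idea only; the paper's implementation uses binarized vectors, re-ranking and Bloom filters, all omitted here.

```python
import numpy as np

def nearest_neighbours(V, m):
    """Adjacency lists: for each word, the indices of its m nearest
    neighbours by cosine similarity (V rows are unit vectors)."""
    sims = V @ V.T
    np.fill_diagonal(sims, -np.inf)          # a word is not its own neighbour
    return np.argsort(-sims, axis=1)[:, :m]

def passes_sampled_check(V, nn, w1, w2, w3, w4):
    """Approximate proportionality test: w4 must outscore (under 3COSADD)
    every candidate drawn from the m nearest neighbours of w2, w3 and w4,
    so only about 3 x m entries are examined instead of the whole
    vocabulary."""
    target = V[w2] + V[w3] - V[w1]
    target = target / np.linalg.norm(target)
    best = V[w4] @ target
    candidates = (set(nn[w2]) | set(nn[w3]) | set(nn[w4])) - {w1, w2, w3, w4}
    return all(V[c] @ target <= best for c in candidates)
```

An assignment that passes this sampled check can still be verified exhaustively as a post-processing step, as described above.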
5 Experimental Setup and Results

The two criteria we use to measure our proposal are a) accuracy of analogy discovery, and b) compute scalability. Other criteria are also possible, such as relation diversity and interestingness of relations. The primary algorithm parameters are 1) the number of terms in the vocabulary to search, 2) the size of the nearest neighbourhood of each term to search, 3) the degree of parallelism required for opposing vector differences, and 4) whether to extend base frames.

5.1 Retrieving Analogical Completions from Frames

When frames are discovered they are stored in a Frame Store, a data structure used to efficiently store and retrieve frames. Frames are indexed using posting lists, similar to a document index. To complete an incomplete analogy, all frames which contain all terms in the incomplete analogy are retrieved. For each retrieved frame, the candidate term for completing the analogy is determined by cross-referencing the indices of the terms within the frame (figure 4). If diverse candidate terms are selected from multiple frames, voting is used to select the final analogy completion, or random selection in the case of tied counts.

[Figure 4: Determining an analogy completion from a larger frame: the query a_1,1 : a_1,3 :: a_3,1 : ? is answered by locating the terms within a 3x3 frame and reading off the cross-referenced cell]

We conducted experiments with embeddings constructed by ourselves as well as with publicly accessible embeddings from the fastText web site (https://fasttext.cc/docs/en/english-vectors.html) trained on 600B tokens of the Common Crawl (Mikolov et al., 2018).

We evaluated the accuracy of the frames by attempting to complete the analogies from the well known Google analogy test set (http://download.tensorflow.org/data/questions-words.txt). The greatest challenge in this type of evaluation is adequately covering the evaluation items. At least three of the terms in an analogy completion item need to be simultaneously present in a single frame for the item to be attempted. We report the results of a typical execution of the system using a nearest neighbourhood size of 25, a maximum vocabulary size of 50 000, and a minimal cosine similarity between opposing vector differences of 0.3. For this set of parameters, 8589 evaluation items were answered by retrieval from the frame store, covering approximately 44% of the evaluation items (table 1). Approximately 30% of these were from the semantic category, and 70% from the syntactic. We compared the accuracy of completing analogies using the frame store to the accuracy of both 3CosAdd and 3CosMul using the same embeddings as used to build the frames.

    Table 1: Analogy Completion on Google Subset

                 3CosAdd  3CosMul  Frames
    Sem. (2681)     2666     2660    2673
    Syn. (5908)     5565     5602    5655
    Tot. (8589)     8231     8262    8328

Results show that the frames are slightly more accurate than 3CosAdd and 3CosMul, achieving 96.9% on the 8589 evaluation items. It needs to be stressed, however, that the objective is not to outperform vector arithmetic based methods, but rather to verify that the frames have a high degree of accuracy.

To better determine the accuracy of the discovered frames, we also randomly sampled 1000 of the 21571 frames generated for the results shown in table 2 and manually checked them. The raw outputs are included in the online repository (https://github.com/ldevine/AFM). These frames cover many relations not included in the Google analogy test set. We found 9 frames with errors, giving an accuracy of 99.1%.

It should be noted that the accuracy of frames is influenced by the quality of the embeddings. However, even with embeddings trained on small corpora it is possible to discover analogies, provided that sufficient word embedding training epochs have been completed. The online repository contains further empirical evaluations and explanations regarding parameter choices, including raw outputs and errors made by the system.
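A minimal in-memory version of the Frame Store retrieval described above might look as follows. The class name, the grid representation of a frame, and the offset arithmetic are our own illustrative choices; the paper's C++ implementation with posting lists is more elaborate.

```python
from collections import Counter, defaultdict

class FrameStore:
    """Minimal in-memory sketch of a Frame Store: frames are indexed by
    the terms they contain, and incomplete analogies are completed by
    cross-referencing term positions. Terms are assumed unique per frame."""

    def __init__(self):
        self.frames = []                  # each frame: tuple of rows of terms
        self.postings = defaultdict(set)  # term -> ids of frames containing it

    def add(self, frame):
        fid = len(self.frames)
        self.frames.append(frame)
        for row in frame:
            for term in row:
                self.postings[term].add(fid)

    def complete(self, a, b, c):
        """Complete a : b :: c : ? by retrieving frames containing all three
        terms, reading off the fourth cell, and voting over candidates."""
        votes = Counter()
        for fid in self.postings[a] & self.postings[b] & self.postings[c]:
            frame = self.frames[fid]
            pos = {t: (i, j) for i, row in enumerate(frame)
                             for j, t in enumerate(row)}
            (ra, ca), (rb, cb), (rc, cc) = pos[a], pos[b], pos[c]
            # a : b defines an offset within the frame; apply it starting at c
            ri, cj = rc + (rb - ra), cc + (cb - ca)
            if 0 <= ri < len(frame) and 0 <= cj < len(frame[0]):
                votes[frame[ri][cj]] += 1
        return votes.most_common(1)[0][0] if votes else None
```

For example, after adding the frame ("old", "older", "oldest") / ("young", "younger", "youngest"), the query old : oldest :: young : ? resolves to "youngest".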
5.2 Scaling: Effect of Neighbourhood Size and pThreshold

Tables 2 and 3 show the number of frames discovered on typical executions of the software. The reported numbers are intended to give an indication of the relationship between neighbourhood size, the number of frames produced and execution time. The reported times are for 8 software threads.

    Table 2: Par. Thres. = 0.3 (Less Constrained)

    Near. Sz.      15     20     25     30
    Frames      13282  16785  19603  21571
    Time (sec)   20.1   29.3   38.3   45.9

    Table 3: Par. Thres. = 0.5 (More Constrained)

    Near. Sz.      15     20     25     30
    Frames       6344   7995   9286  10188
    Time (sec)   10.4   15.7   21.1   26.0

5.3 Qualitative Analysis

From inspection of the frames we see that a large part of the relations discovered are grammatical or morpho-syntactic, or are related to high frequency entities. However, we also observe a large number of other types of relations, such as synonyms, antonyms, domain alignments and various syntactic mappings. We provide but a small sample below.

    eyes        eye           blindness
    ears        ear           deafness

    Father      Son           Himself
    Mother      Daughter      Herself

    geschichte  gesellschaft  musik
    histoire    societe       musique

    aircraft    flew          skies   air
    ships       sailed        seas    naval

    always      constantly    constant
    often       frequently    frequent
    sometimes   occasionally  occasional

    brighter    stronger
    bright      strong
    fainter     weaker        (2 x 2 x 2 Frame)
    faint       weak

We have also observed reliable mappings with embeddings trained on non-English corpora. The geometry of analogical frames can also be explored via visualizations of projections onto 2D sub-spaces derived from the offset vectors (examples are provided in the online repository).

6 Discussion

We believe that the constraint satisfaction approach introduced in this paper is advantageous because it is systematic but flexible, and can make use of methods from the constraint satisfaction domain. We have only mentioned a few of the primary CSP concepts in this paper. Other constraints can be included in the formulation, such as set membership constraints, where sets may be clusters or documents.

One improvement that could be made to the proposed system is to facilitate the discovery of relations that are not one-to-one. While we found many isolated examples of one-to-many relations expressed in the frames, a strictly symmetrical proportional analogy does not seem ideal for capturing one-to-many relations.

As outlined by (Turney, 2006), there are many applications of automating the construction and/or discovery of analogical relations. Some of these include relation classification, metaphor detection, word sense disambiguation, information extraction, question answering and thesaurus generation. Analogical frames should also provide insight into the geometry of word embeddings and may provide an interesting way to measure their quality.

The most interesting application of the system is in the area of computational creativity with a human in the loop. For example, analogical frames could be chosen for their interestingness and then expanded.

6.1 Software and Online Repository

The software implementing the proposed system as a set of command line applications can be found in the online repository (https://github.com/ldevine/AFM). The software is implemented in portable C++11 and compiles on both Windows and Unix based systems without compiled dependencies. Example outputs of the system as well as parameter settings are provided in the online repository, including the outputs created from embeddings trained on a range of corpora.

7 Future Work and Conclusions

Further empirical evaluation is required. The establishment of more suitable empirical benchmarks for assessing the effectiveness of open analogy discovery is important. The most interesting potential application of this work is in the combination of automated discovery of analogies and human judgment. There is also the possibility of establishing a more open-ended compute architecture that could search continuously for analogical frames in an online fashion.
References

Paul Bartha. 2016. Analogy and analogical reasoning. The Stanford Encyclopedia of Philosophy.

William Correa Beltran, Hélène Jaudoin, and Olivier Pivert. 2015. A clustering-based approach to the mining of analogical proportions. In Tools with Artificial Intelligence (ICTAI), 2015 IEEE 27th International Conference on, pages 125–131. IEEE.

James P Blevins. 2016. Word and Paradigm Morphology. Oxford University Press.

James P Blevins, Farrell Ackerman, Robert Malouf, J Audring, and F Masini. 2017. Word and paradigm morphology.

Rens Bod. 2009. From exemplar to grammar: A probabilistic analogy-based model of language learning. Cognitive Science, 33(5):752–793.

Leonidas AA Doumas, John E Hummel, and Catherine M Sandhofer. 2008. A theory of the discovery and predication of relational concepts. Psychological Review, 115(1):1.

Aleksandr Drozd, Anna Gladkova, and Satoshi Matsuoka. 2016. Word embeddings, analogies, and machine learning: Beyond king − man + woman = queen. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3519–3530.

Rashel Fam and Yves Lepage. 2016. Morphological predictability of unseen words using computational analogy.

Rashel Fam and Yves Lepage. 2017. A study of the saturation of analogical grids agnostically extracted from texts.

Robert M French. 2002. The computational modeling of analogy-making. Trends in Cognitive Sciences, 6(5):200–205.

Dedre Gentner. 1983. Structure-mapping: A theoretical framework for analogy. Cognitive Science, 7(2):155–170.

Dedre Gentner. 2010. Bootstrapping the mind: Analogical processes and symbol systems. Cognitive Science, 34(5):752–775.

Dedre Gentner and Kenneth D Forbus. 2011. Computational models of analogy. Wiley Interdisciplinary Reviews: Cognitive Science, 2(3):266–276.

Huda Hakami, Kohei Hayashi, and Danushka Bollegala. 2017. An optimality proof for the PairDiff operator for representing relations between words. arXiv preprint arXiv:1709.06673.

Rogers P Hall. 1989. Computational approaches to analogical reasoning: A comparative analysis. Artificial Intelligence, 39(1):39–120.

Keith J Holyoak and Paul Thagard. 1989. Analogical mapping by constraint satisfaction. Cognitive Science, 13(3):295–355.

John E Hummel and Keith J Holyoak. 1997. Distributed representations of structure: A theory of analogical access and mapping. Psychological Review, 104(3):427.

Johannes Jurgovsky, Michael Granitzer, and Christin Seifert. 2016. Evaluating memory efficiency and robustness of word embeddings. In European Conference on Information Retrieval, pages 200–211. Springer.

Paul Kiparsky. 1992. Analogy. International Encyclopedia of Linguistics.

Kenneth J Kurtz, Chun-Hui Miao, and Dedre Gentner. 2001. Learning by analogical bootstrapping. The Journal of the Learning Sciences, 10(4):417–446.

Philippe Langlais. 2016. Efficient identification of formal analogies.

Jean-François Lavallée and Philippe Langlais. 2009. Unsupervised morphological analysis by formal analogy. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 617–624. Springer.

Yves Lepage. 2014. Analogies between binary images: Application to Chinese characters. In Computational Approaches to Analogical Reasoning: Current Trends, pages 25–57. Springer.

Yves Lepage and Etienne Denoual. 2005. Purest ever example-based machine translation: Detailed presentation and assessment. Machine Translation, 19(3-4):251–282.

Omer Levy, Yoav Goldberg, and Israel Ramat-Gan. 2014. Linguistic regularities in sparse and explicit word representations. In CoNLL, pages 171–180.

Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. arXiv preprint arXiv:1606.07736.

Pedro Meseguer and Carme Torras. 2001. Exploiting symmetries within constraint satisfaction search. Artificial Intelligence, 129(1-2):133–163.

Laurent Miclet, Sabri Bayoudh, and Arnaud Delhay. 2008. Analogical dissimilarity: definition, algorithms and two experiments in machine learning. Journal of Artificial Intelligence Research, 32:793–824.

Laurent Miclet and Jacques Nicolas. 2015. From formal concepts to analogical complexes. In CLA 2015, page 12.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In HLT-NAACL, volume 13, pages 746–751.

Francesca Rossi, Peter Van Beek, and Toby Walsh. 2006. Handbook of Constraint Programming. Elsevier.

Royal Skousen. 1989. Analogical Modeling of Language. Springer Science & Business Media.

Nicolas Stroppa and François Yvon. 2005. An analogical learner for morphological analysis. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 120–127. Association for Computational Linguistics.

Peter D Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32(3):379–416.

Peter D Turney. 2013. Distributional semantics beyond words: Supervised learning of analogy and paraphrase. arXiv preprint arXiv:1310.5042.

Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin. 2015. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. arXiv preprint arXiv:1509.01692.

Yating Zhang, Adam Jatowt, and Katsumi Tanaka. 2016. Towards understanding word embeddings: Automatically explaining similarity of terms. In Big Data (Big Data), 2016 IEEE International Conference on, pages 823–832. IEEE.
Towards understanding word embeddings: Automatically explaining similarity of terms. In Big Data (Big Data), 2016 IEEE International Confer- ence on, pages 823–832. IEEE. 43 Specifying Conceptual Models Using Restricted Natural Language Bayzid Ashik Hossain Rolf Schwitter Department of Computing Department of Computing Sydney, NSW Sydney, NSW Australia Australia [email protected] [email protected] Abstract havioral perceptions, and the actual database man- agement system (DBMS) that is used to imple- The key activity to design an infor- ment the design of the information system (Bernus mation system is conceptual modelling et al., 2013). The DBMS could be based on any of which brings out and describes the general the available data models. Designing a database knowledge that is required to build a sys- means constructing a formal model of the desired tem. In this paper we propose a novel ap- application domain which is often called the uni- proach to conceptual modelling where the verse of discourse (UOD). Conceptual modelling domain experts will be able to specify and involves different parties who sit together and de- construct a model using a restricted form fine the UOD. The process of conceptual mod- of natural language. A restricted natural elling starts with the collection of necessary infor- language is a subset of a natural language mation from the domain experts by the knowledge that has well-defined computational prop- engineers. The knowledge engineers then use tra- erties and therefore can be translated un- ditional modelling techniques to design the system ambiguously into a formal notation. We based on the collected information. will argue that a restricted natural lan- To design the database, a clear understanding guage is suitable for writing precise and of the application domain and an unambiguous in- consistent specifications that lead to exe- formation representation scheme is necessary. Ob- cutable conceptual models. 
Using a re- ject role modelling (ORM) (Halpin, 2009) makes stricted natural language will allow the the database design process simple by using nat- domain experts to describe a scenario in ural language for verbalization, as well as dia- the terminology of the application domain grams which can be populated with suitable ex- without the need to formally encode this amples and by adding the information in terms of scenario. The resulting textual specifica- simple facts. On the other hand, entity relation- tion can then be automatically translated ship modelling (ERM) (Richard, 1990; Frantiska, into the language of the desired conceptual 2018) does this by considering the UOD in terms modelling framework. of entities, attributes and relationships. Object- oriented modelling techniques such as the unified 1 Introduction modelling language (UML) (O´ Regan, 2017) pro- It is well-known that the quality of an information vide a wide variety of functionality for specifying system application depends on its design. To guar- a data model at an implementation level which is antee accurateness, adaptability, productivity and suitable for the detailed design of an object ori- clarity, information systems are best specified at ented system. UML can be used for database de- the conceptual level using a language with names sign in general because its class diagrams provide for individuals, concepts and relations that are a comprehensive entity-relationship notation that easily understandable by domain experts (Bernus can be annotated with database constructs. et al., 2013). Conceptual modelling is the most Alternatively, a Restricted Natural Language important part of requirements engineering and is (RNL) (Kuhn, 2014) can be used by the domain the first phase towards designing an information experts to specify system requirements for con- system (Olive´, 2007). The conceptual design pro- ceptual modelling. 
An RNL can be defined as a subset of a natural language that is acquired by constraining the grammar and the vocabulary in order to reduce or remove its ambiguity and complexity. These RNLs are also known as controlled natural languages (CNLs) (Schwitter, 2010). RNLs fall into two categories: 1. those that improve readability for human beings, especially for non-native speakers, and 2. those that facilitate automated translation into a formal target language. The main benefits of an RNL are that it is easy for humans to understand and easy for machines to process.

Bayzid Ashik Hossain and Rolf Schwitter. 2018. Specifying Conceptual Models Using Restricted Natural Language. In Proceedings of Australasian Language Technology Association Workshop, pages 44−52.

In this paper, we show how an RNL can be used to write a specification for an information system and how this specification can be processed to generate a conceptual diagram. The grammar of our RNL specifies and restricts the form of the input sentences. The language processor translates RNL sentences into a version of description logic. Note that the conceptual modelling process usually starts from scratch and therefore cannot rely on existing data that would make this process immediately suitable for machine learning techniques.

2 Related Work

There has been a number of works on formalizing conceptual models for verification purposes (Berardi et al., 2005; Calvanese, 2013; Lutz, 2002). This verification process includes consistency and redundancy checking. These approaches first represent the domain of interest as a conceptual model and then formalize the conceptual model using a formal language. The formal representation can be used to reason about the domain of interest during the design phase and can also be used to extract information at run time through query answering.

Traditional conceptual modelling diagrams such as entity relationship diagrams and unified modelling language diagrams are easy to generate and easily understandable for the knowledge engineers. These modelling techniques are well established. The problems with these conventional modelling approaches are: they have no precise semantics and no verification support; they are not machine comprehensible, and as a consequence automated reasoning on the conceptual diagrams is not possible. Previous approaches used logic to formally represent the diagrams and to overcome these problems. The description logic (DL) ALCQI is well suited for reasoning with entity relationship diagrams (Lutz, 2002), UML class diagrams (Berardi et al., 2005) and ORM diagrams (Franconi et al., 2012). The DL ALCQI is an extension of the basic propositionally closed description logic AL and includes complex concept negation, qualified number restriction, and inverse roles. Table 1 shows the constructs of the ALCQI description logic with suitable examples. It is reported that finite model reasoning with ALCQI is decidable and ExpTime-complete¹. Using logic to formally represent the conceptual diagrams introduces some problems too. For example, it is difficult to generate logical representations, in particular for domain experts; it is also difficult for them to understand these representations; and no well-established methodologies are available to represent the conceptual models formally. A solution to these problems is to use an RNL for the specification of conceptual models. There exist several ontology editing and authoring tools, such as AceWiki (Kuhn, 2008), CLOnE (Funk et al., 2007), RoundTrip Ontology Authoring (Davis et al., 2008), Rabbit (Denaux et al., 2009) and OWL Simplified English (Power, 2012), that already use RNLs for the specification of ontologies; they translate a specification into a formal notation. There are also works on mapping formal notations into conceptual models (Brockmans et al., 2004; Bagui, 2009).

3 Proposed Approach

Several approaches have been proposed to use logic with conventional modelling techniques to verify the models and to get the semantics of the domains. These approaches allow machines to understand the models and thus support automated reasoning. To overcome the disadvantages associated with these approaches, we propose to use an RNL as a language for specifying conceptual models. The benefits of an RNL are: 1.
the language is easy to write and understand for domain experts as it is a subset of a natural language, 2. the language gets its semantics via translation into a formal notation, and 3. the resulting formal notation can be used further to generate conceptual models.

Unlike previous approaches, we propose to write a specification of the conceptual model in RNL first and then translate this specification into description logic. Existing description logic reasoners²,³ can be used to check the consistency of the formal notation, and after that the desired conceptual models can be generated from this notation. Our approach is to derive the conceptual model from the specification, whereas in conventional approaches knowledge engineers first draw the model and then use programs to translate the model into a formal notation (Fillottrani et al., 2012). Figure 1 shows the proposed system architecture for conceptual modelling.

¹ http://www.cs.man.ac.uk/~ezolin/dl/
² https://franz.com/agraph/racer/
³ http://www.hermit-reasoner.com/

Table 1: The DL ALCQI.

Construct                      | Syntax    | Example
atomic concept                 | C         | Student
atomic role                    | P         | hasChild
atomic negation                | ¬C        | ¬Student
conjunction                    | C ⊓ D     | Student ⊓ Teacher
(unqual.) exist. restriction   | ∃R        | ∃hasChild
universal value restriction    | ∀R.C      | ∀hasChild.Male
full negation                  | ¬(C ⊓ D)  | ¬(Student ⊓ Teacher)
qual. cardinality restriction  | ≥ n R.C   | ≥ 2 hasChild.Female
inverse role                   | P⁻        | ∃hasChild⁻.Teacher

Figure 1: Proposed system architecture for conceptual modelling using restricted natural language.

3.1 Scenario

Let's consider an example scenario of a learning management system for a university stated below:

    A Learning Management System (LMS) keeps track of the units the students do during their undergraduate or graduate studies at a particular university. The university offers a number of programs and each program consists of a number of units. Each program has a program name and a program id. Each unit has a unit code and a unit name. A student can take a number of units whereas a unit has a number of students. A student must study at least one unit and at most four units. Every student can enrol into exactly one program. The system stores a student id and a student name for each student.

We reconstruct this scenario in RNL, and after that the language processor translates the RNL specification into description logic using a feature-based phrase structure grammar (Bird et al., 2009). Our RNL consists of function words and content words. Function words (e.g., determiners, quantifiers and operators) describe the structure of the RNL and their number is fixed. Content words (e.g., nouns and verbs) are domain specific and can be added to the lexicon during the writing process. The reconstruction process of this scenario in RNL is supported by a look-ahead text editor (Guy and Schwitter, 2017). The reconstructed scenario in RNL looks as follows:

(a) No student is a unit.
(b) No student is a program.
(c) Every student is enrolled in exactly one program.
(d) Every student studies at least one unit and at most four units.
(e) Every student has a student id and has a student name.
(f) No program is a unit and is a student.
(g) Every program is composed of a unit.
(h) Every program is enrolled by a student.
(i) Every program has a program id and has a program name.
(j) No unit is a student and is a program.
(k) Every unit is studied by a student.
(l) Every unit belongs to a program.
(m) Every unit has a unit code and has a unit name.

Additionally, we use the following terminological statements expressed in RNL:

(n) The verb studies is the inverse of the verb studied by.
(o) The verb composed of is the inverse of the verb belongs to.

3.2 Grammar

A feature-based phrase structure grammar has been built using the NLTK toolkit (Loper and Bird, 2002) to parse the above-mentioned specification. The resulting parse trees for these sentences are then translated by the language processor into their equivalent description logic statements. Below we show a scaffolding of the grammar rules with feature structures that we used for our case study:

    S ->
      NP[NUM=?n, FNC=subj] VP[NUM=?n] FS

    VP[NUM=?n] ->
      V[NUM=?n] NP[FNC=obj] |
      V[NUM=?n] Neg NP[NUM=?n, FNC=obj] |
      VP[NUM=?n] CC VP[NUM=?n]

    NP[NUM=?n, FNC=subj] ->
      UQ[NUM=?n] N[NUM=?n] |
      NQ[NUM=?n] N[NUM=?n] |
      Det[NUM=?n] N[NUM=?n] |
      KP[NUM=?n] VB[NUM=?n] |
      KP[NUM=?n] VBN[NUM=?n]

    NP[NUM=?n, FNC=obj] ->
      Det[NUM=?n] N[NUM=?n] |
      RB[NUM=?n] CD[NUM=?n] N[NUM=?n] |
      KP[NUM=?n] VB[NUM=?n] |
      KP[NUM=?n] VBN[NUM=?n]

    V[NUM=?n] ->
      Copula[NUM=?n] |
      VB[NUM=?n] |
      Copula[NUM=?n] VBN[NUM=?n] |
      Copula[NUM=?n] JJ[NUM=?n]

    VB[NUM=pl] -> "study" | ...
    VB[NUM=sg] -> "studies" | ...
    VBN -> "studied" "by" | ...
    Copula[NUM=sg] -> "is"
    Copula[NUM=pl] -> "are"
    JJ -> "inverse" "of"
    CC -> "and" | "or"
    Det[NUM=sg] -> "A" | "a" | ...
    Det -> "The" | "the"
    UQ[NUM=sg] -> "Every"
    NQ -> "No"
    Neg -> "not"
    N[NUM=sg] -> "student" | ...
    N[NUM=pl] -> "students" | ...
    RB -> "exactly" | ...
    CD[NUM=sg] -> "one"
    CD[NUM=pl] -> "two" | ... | "four"
    KP -> "The" "verb" | ...
    FS -> "."

In order to translate the resulting syntax trees into the description logic representation, we have used the OWL/XML syntax⁴ of the Web Ontology Language (OWL) as the formal target notation.

⁴ https://www.w3.org/TR/owl-xmlsyntax/

3.3 Case Study

The translation process starts by reconstructing the specification in RNL following the rules of the feature-based grammar. While writing the specification in RNL, we tried to use the same vocabulary … this inconsistency ("owl:Nothing") is reported after running the reasoner.

Algorithm 1: Mapping description logic representation to SQLite commands for generating …

Figure 2: Entity relationship diagram generated from the formal representation using SchemaCrawler.

… these formal representations can be mapped to different conceptual models. The proposed approach for conceptual modelling addresses two research challenges⁸: 1. providing the right set of modelling constructs at the right level of abstraction to enable successful communication among the stakeholders (i.e. domain experts, knowledge engineers, and application programmers); 2. preserving the ease of communication and enabling the generation of a database schema which is a part of the application software.

⁸ http://www.conceptualmodeling.org/ConceptualModeling.html

7 Future Work

We are planning to develop a fully-fledged restricted natural language for conceptual modelling of information systems. We want to use this language as a specification language that will help the domain experts to write the system requirements precisely. We are also planning to develop a conceptual modelling framework that will allow users to write specifications in RNL and will generate conceptual models from the specification. This tool will also facilitate the verbalization of the conceptual models and allow users to manipulate the models in a round-tripping fashion (from specification to conceptual models and from conceptual models to specifications). This approach has several advantages for the conceptual modelling process: firstly, it will use a common formal representation to generate different conceptual models; secondly, it will make the conceptual modelling process easy to understand by providing a framework to write specifications and generate visualizations and verbalizations; thirdly, it is machine-processable like other logical approaches and supports verification; furthermore, verbalization will facilitate better understanding of the modelling process, which is only available in limited forms in the current conceptual modelling frameworks.

8 Conclusion

In this paper we demonstrated that an RNL can serve as a high-level specification language for conceptual modelling, in particular for specifying entity-relationship models. We described an experiment that shows how we can support the proposed modelling approach. We translated a specification of a conceptual model written in RNL into an executable description logic program that is used to generate the entity-relationship model. Our RNL is supported by automatic consistency checking, and is therefore very suitable for formalizing and verifying conceptual models. The presented approach is not limited to a particular modelling framework and can be used, apart from entity-relationship models, also for object-oriented models and object role models. Our approach has the potential to bridge the gap between a seemingly informal specification and a formal representation in the domain of conceptual modelling.

References

Sikha Bagui. 2009. Mapping OWL to the entity relationship and extended entity relationship models. International Journal of Knowledge and Web Intelligence 1(1-2):125–149.

Daniela Berardi, Diego Calvanese, and Giuseppe De Giacomo. 2005. Reasoning on UML class diagrams. Artificial Intelligence 168(1):70–118.

Anders Berglund, Scott Boag, Don Chamberlin, Mary F. Fernández, Michael Kay, Jonathan Robie, and Jérôme Siméon. 2003. XML Path Language (XPath).
World Wide Web Consortium (W3C). https://www.w3.org/TR/xpath20/.

Peter Bernus, Kai Mertins, and Günter Schmidt. 2013. Handbook on Architectures of Information Systems. Springer Science & Business Media.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.

Sara Brockmans, Raphael Volz, Andreas Eberhart, and Peter Löffler. 2004. Visual modeling of OWL DL ontologies using UML. In International Semantic Web Conference. Springer, pages 198–213.

Diego Calvanese. 2013. Description Logics for Conceptual Modeling: Forms of Reasoning on UML Class Diagrams. EPCL Basic Training Camp 2012–2013.

Brian Davis, Ahmad Ali Iqbal, Adam Funk, Valentin Tablan, Kalina Bontcheva, Hamish Cunningham, and Siegfried Handschuh. 2008. RoundTrip ontology authoring. In International Semantic Web Conference. Springer, pages 50–65.

Ronald Denaux, Vania Dimitrova, Anthony G. Cohn, Catherine Dolbear, and Glen Hart. 2009. Rabbit to OWL: ontology authoring with a CNL-based tool. In International Workshop on Controlled Natural Language. Springer, pages 246–264.

Pablo R. Fillottrani, Enrico Franconi, and Sergio Tessaris. 2012. The ICOM 3.0 intelligent conceptual modelling tool and methodology. Semantic Web 3(3):293–306.

Enrico Franconi, Alessandro Mosca, and Dmitry Solomakhin. 2012. ORM2: formalisation and encoding in OWL2. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems". Springer, pages 368–378.

Joseph Frantiska. 2018. Entity-relationship diagrams. In Visualization Tools for Learning Environment Development. Springer, pages 21–30.

Adam Funk, Valentin Tablan, Kalina Bontcheva, Hamish Cunningham, Brian Davis, and Siegfried Handschuh. 2007. CLOnE: Controlled Language for Ontology Editing. In The Semantic Web. Springer, pages 142–155.

Birte Glimm, Ian Horrocks, Boris Motik, Giorgos Stoilos, and Zhe Wang. 2014. HermiT: An OWL 2 Reasoner. Journal of Automated Reasoning 53(3):245–269.

Stephen C. Guy and Rolf Schwitter. 2017. The PENG ASP system: architecture, language and authoring tool. Language Resources and Evaluation 51(1):67–92.

Terry Halpin. 2009. Object-role modeling. In Encyclopedia of Database Systems. Springer, pages 1941–1946.

Tobias Kuhn. 2008. AceWiki: A natural and expressive semantic wiki. arXiv preprint arXiv:0807.4618.

Tobias Kuhn. 2014. A survey and classification of controlled natural languages. Computational Linguistics 40(1):121–170.

Jean-Baptiste Lamy. 2017. Owlready: Ontology-oriented programming in Python with automatic classification and high level constructs for biomedical ontologies. Artificial Intelligence in Medicine 80:11–28.

Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Association for Computational Linguistics, pages 63–70.

Carsten Lutz. 2002. Reasoning about entity relationship diagrams with complex attribute dependencies. In Proceedings of the International Workshop on Description Logics (DL2002), number 53 in CEUR-WS (http://ceur-ws.org), pages 185–194.

Gerard O'Regan. 2017. Unified modelling language. In Concise Guide to Software Engineering. Springer, pages 225–238.

Antoni Olivé. 2007. Conceptual Modeling of Information Systems. Springer-Verlag, Berlin, Heidelberg.

Richard Power. 2012. OWL Simplified English: a finite-state language for ontology editing. In International Workshop on Controlled Natural Language. Springer, pages 44–60.

Barker Richard. 1990. CASE Method: Entity Relationship Modelling. Addison-Wesley Publishing Company, ORACLE Corporation UK Limited.

Rolf Schwitter. 2010. Controlled natural languages for knowledge representation. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Association for Computational Linguistics, pages 1113–1121.


Extracting structured data from invoices

Xavier Holt* and Andrew Chisholm*
Sypht
[email protected], [email protected]
* Authors contributed equally to this work.

Xavier Holt and Andrew Chisholm. 2018. Extracting structured data from invoices. In Proceedings of Australasian Language Technology Association Workshop, pages 53−59.

Abstract

Business documents encode a wealth of information in a format tailored to human consumption – i.e. aesthetically disbursed natural language text, graphics and tables. We address the task of extracting key fields (e.g. the amount due on an invoice) from a wide variety of potentially unseen document formats. In contrast to traditional template-driven extraction systems, we introduce a content-driven machine-learning approach which is both robust to noise and generalises to unseen document formats. In a comparison of our approach with alternative invoice extraction systems, we observe an absolute accuracy gain of 20% across compared fields, and a 25%–94% reduction in extraction latency.

Figure 1: Energy bill with extracted fields.

1 Introduction

To unlock the potential of data in documents we must first interpret, extract and structure their content. For bills and invoices, data extraction enables a wide variety of downstream applications. Extraction of fields such as the amount due and biller information enables the automation of invoice payment for businesses. Moreover, extraction of information such as the daily usage or supply charge found on an electricity bill (e.g. Figure 1) enables the aggregation of usage statistics over time and automated supplier-switching advice. Manual annotation of document content is a time-consuming, costly and error-prone process (Klein et al., 2004). For many organisations, processing accounts payable or expense claims requires ongoing manual transcription for verification of payment, supplier and pricing information. Template and regex driven extraction systems address this problem in part by shifting the burden of annotation from individual documents into the curation of extraction templates which cover a known document format. These approaches still necessitate ongoing human effort to produce reliable extraction templates as new supplier formats are observed and old formats change over time. This presents a significant challenge – Australian bill payments provider BPAY covers 26,000 different registered billers alone¹.

We introduce SYPHT – a scalable machine-learning solution to document field extraction. SYPHT combines OCR, heuristic filtering and a supervised ranking model conditioned on the content of a document to make field-level predictions that are robust to variations in image quality, skew, orientation and content layout. We evaluate system performance on unseen document formats and compare 3 alternative invoice extraction systems on a common subset of key fields. Our system achieves the best results with an average accuracy of 92% across field types on unseen documents and the fastest median prediction latency of 3.8 seconds. We make our system available as an API² – enabling low-latency key-field extraction scalable to hundreds of documents per second.

¹ www.bpay.com.au
² www.sypht.com

2 Background

Information Extraction (IE) deals broadly with the problem of extracting structured information from unstructured text. In the domain of invoice and bill field extraction, document input is often better represented as a sparse arrangement of multiple text blocks rather than a single contiguous body of text. As financial documents are often themselves machine-generated, there is broad redundancy in this spatial layout of key fields across instances in a corpus. Early approaches exploit this structure by extracting known fields based on their relative position to extracted lines (Tang et al., 1995) and detected forms (Cesarini et al., 1998). Subsequent work aims to better generalise extraction patterns by constructing formal descriptions of document structure (Coüasnon, 2006) and developing systems which allow non-expert end-users to dynamically build extraction templates ad hoc (Schuster et al., 2013). Similarly, the ITESOFT system (Rusiñol et al., 2013) fits a term-position based extraction model from a small sample of human-labeled examples which may be updated iteratively over time. More recently, D'Andecy et al. (2018) build upon this approach by incorporating an a-priori model of term positions into their iterative layout-specific extraction model, significantly boosting performance on difficult fields.

While these approaches deliver high-precision extraction on observed document formats, they cannot reliably or automatically generalise to unseen field layouts. Palm et al. (2017) present the closest work to our own with their CloudScan system for zero-shot field extraction from unseen invoice document forms. They train a recurrent neural network (RNN) model on a corpus of over 300K invoices to recognize 8 key fields, observing an aggregate F-score of 0.84 for fields extracted from held-out invoice layouts on their dataset. We consider a similar supervised approach but address the learning problem as one of value ranking in place of sequence tagging. As they note, system comparison is complicated by a lack of publicly available data for invoice extraction. Given the sensitive nature of invoices and the prevalence of personally identifiable information, well-founded privacy concerns constrain open publishing in this domain. We address this limitation in part by rigorously anonymising a diverse set of invoices and submitting them for evaluation to publicly available systems — without making public the data itself.

3 Task

We define the extraction task as follows: given a document and a set of fields to query, provide the value of each field as it appears in the document. If there is no value for a given field present, return null. This formulation is purely extractive – we do not consider implicit or inferred field values in our experiments or annotation. For example, while it may be possible to infer the value of tax paid with high confidence given the net and gross amount totals on an invoice, without this value being made explicit in text the correct system output is null. We do however consider inference over field names. Regardless of how a value is presented or labeled on a document, if it meets our query field definition systems must extract it. For example, valid invoice number values may be labeled as "Reference", "Document ID" or even have no explicit label present. This canonicalization of field expression across document types is the core challenge addressed by extraction systems.

To compare system extractions we first normalise the surface form of extracted values by type. For example, dates expressed under a variety of formats are transformed to yyyy-mm-dd, and numeric strings or reference number types (e.g. ABN, invoice number) have spaces and extraneous punctuation removed. We adopt the evaluation scheme common to IE tasks such as Slot Filling (McNamee et al., 2009) and relation extraction (Mintz et al., 2009). For a given field, predictions are judged true-positive if the predicted value matches the label; false-positive if the predicted value does not match the label; true-negative if both system output and label are null; and false-negative if the predicted value is null and the label is not null. In each instance we consider the type-specific normalised form for both value and label in comparisons. Standard metrics such as F-score or accuracy may then be applied to assess system performance.

Notably, we do not consider the position of output values emitted by a system. In practice it is common to find multiple valid expressions of the same field at different points on a document – in this instance, labeling each value explicitly is both laborious for annotators and generally redundant. This may however incorrectly assign credit to systems for missed predictions in rare cases, e.g. if both the net and gross totals normalise to the same value (i.e. no applicable tax) a system may be marked correct for predicting either token for each field.

3.1 Fields

SYPHT provides extraction on a range of fields. For the scope of this paper and the sake of comparison, we restrict ourselves to the following fields relevant to invoices and bill payments:

Supplier ABN — the Australian Business Number (ABN) of the invoice or bill supplier, e.g. 16 627 246 039.

Document date — the date at which the document was released or printed. Generally distinct from the due date for bills, and may be presented in a variety of formats, e.g. 11th December, 2018 or 11-12-2018.

Invoice number — a reference generated by the supplier which uniquely identifies a document, e.g. INV-1447. Customer account numbers are not considered invoice references.

Net amount — the total amount of new charges for goods and services, before taxes, discounts and other bill adjustments, e.g. $50.00.

GST — the amount of GST charged as it relates to the net amount of goods and services, e.g. $5.00.

Gross amount — the total gross cost of new charges for goods and services, including GST or any adjustments, e.g. $55.00.

4 SYPHT

In this section we describe our end-to-end system for key-field extraction from business documents. We introduce the pipeline for field extraction at a high level and describe the prediction model and field annotation components in detail. Although our system facilitates human-in-the-loop prediction validation, we do not utilise human-assisted predictions in our evaluation of system performance in Section 5.

Preprocessing — documents are uploaded in a variety of formats (e.g. PDF or image files) and normalised to a common form of one JPEG image per page. In development experiments we observed faster performance without degrading prediction accuracy by capping the rendered page resolution (∼8MP) and limiting document colour channels to black and white.

OCR — each page is independently parsed by an Optical Character Recognition (OCR) system in parallel, which extracts textual tokens and their corresponding in-document positions.

Filtering — for each query field we filter a subset of tokens as candidates in prediction based on the target field type. For example, we do not consider currency-denominated values as candidate fills for a date field.

Prediction — OCRed tokens and page images make up the input to our prediction model. For each field we rank the most likely value from the document for that field. If the most likely prediction falls below a tuned likelihood threshold, we emit null to indicate no field value is predicted.
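A minimal sketch of the type normalisation and credit-assignment scheme described in Section 3, with null handled as above. The date formats, field-type names and helper functions here are our own illustrative assumptions, not SYPHT internals.

```python
import re
from datetime import datetime

def normalise(value, field_type):
    """Type-specific surface normalisation (illustrative rules only)."""
    if value is None:
        return None
    if field_type == "date":
        # Map a few common surface formats to yyyy-mm-dd.
        for fmt in ("%d-%m-%Y", "%d %B %Y", "%Y-%m-%d"):
            try:
                return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
            except ValueError:
                pass
        return value
    if field_type == "reference":
        # ABNs, invoice numbers: drop spaces and extraneous punctuation.
        return re.sub(r"[^0-9A-Za-z]", "", value)
    if field_type == "currency":
        return value.replace("$", "").replace(",", "")
    return value

def judge(pred, label):
    """Classify one normalised (prediction, label) pair as tp/fp/tn/fn."""
    if pred is None and label is None:
        return "tn"
    if pred is None:
        return "fn"
    if label is None or pred != label:
        return "fp"
    return "tp"

def accuracy(pairs):
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for pred, label in pairs:
        counts[judge(pred, label)] += 1
    return (counts["tp"] + counts["tn"]) / sum(counts.values())
```

Under these assumptions, `normalise("16 627 246 039", "reference")` yields `16627246039` and `normalise("11-12-2018", "date")` yields `2018-12-11`, so surface variation between systems does not affect credit assignment.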
riety of formats, e.g. 11st December, 2018 We describe our model implementation and train- or 11-12-2018. ing in Section 4.1. Invoice number a reference generated by the Validation (optional) — consumers of the supplier which uniquely identifies a document, SYPHTAPI may specify a confidence threshold at e.g. INV-1447. Customer account numbers are which uncertain predictions are human validated not considered invoice references. before finalisation. We briefly describe our predic- Net amount the total amount of new charges for tion assisted annotation and verification work-flow goods and services, before taxes, discounts and system in Section 4.2. $50.00 other bill adjustments, e.g. . Output a JSON formatted object containing the GST the amount of GST charged as it relates to extracted field-value pairs, model confidence and the net amount of goods and services, e.g. $5.00. bounding-box information for each prediction is returned via an API call. Gross amount the total gross cost of new charges for goods and services, including GST or 4.1 Model and training $55.00 any adjustments, e.g. . Given an image and OCRed content as input, our model predicts the most likely value for a given 4 SYPHT query field. We use Spacy3 to tokenise the OCR In this section we describe our end-to-end system output. Each token is then represented through a for key-field extraction from business documents. wide range of features which describe the token’s We introduce a pipeline for field extraction at a syntactic, semantic, positional and visual content high level and describe the prediction model and and context. We utilise part-of-speech tags, word- field annotation components in detail. shape and other lexical features in conjuction with Although our system facilitates human-in-the- a sparse representation of the textual neighbour- loop prediction validation, we do not utilise hood around a token to capture local textual con- human-assisted predictions in our evaluation of text. 
In addition we capture a broad set of posi- system performance in Section5. tional features including the x and y coordinates, in-document page offset and relative position of a Preprocessing documents are uploaded in a va- token in relation to other predictions in the doc- riety of formats (e.g. PDF or image files) and nor- ument. Our model additionally includes a range malised to a common form of one-JPEG image per of proprietary engineered features tailored to field page. In development experiments we observe and document types of interest. faster performance without degrading prediction Field type information is incorporated into the accuracy by capping the rendered page resolution model through token-level filtering. Examples of (∼8MP) and limiting document colour channels to black and white. 3spacy.io/models/en#en_core_web_sm 55 Figure 2: Our annotation and prediction verification tool — SYPHT VALIDATE. Tasks are presented with fields to annotate on the left and the source document for extraction on the right. We display the top predictions for each target field as suggestions for the user. In this example the most likely Amount due has been selected and the position of this prediction in the source document has been highlighted for confirmation. field types which benefit from filtering are date, 4.2 Validation currency and integer fields; and fields with check- An ongoing human annotation effort is often cen- sum rules. To handle multi-token field outputs, tral to the training and evaluation of real-world we utilise a combination of heuristic token merg- machine learning systems. Well designed user- ing (e.g. pattern based string combination for experiences for a given annotation task can sig- Supplier ABNs) and greedy token aggregation nificant reduce the rate of manual-entry errors and under a minimum sequence likelihood threshold speed up data collection (e.g. Prodigy5). We de- from token level predictions (e.g. 
name and ad- signed a predication-assisted annotation and val- dress fields). idation tool for field extraction – SYPHT VALI- We train our model by sampling instances at the DATE. Figure2 shows a task during annotation. token level. Matcher functions perform normali- Our tool is used to both supplement the train- sation and comparison to annotated document la- ing set and optionally – where field-level confi- bels for both for single and multi-token fields. All dence does not meet a configurable threshold; pro- tokens which match the normalised form of the vide human-in-the-loop prediction verification in human-agreed value for a field are used to gen- real time. Suggestions are pre-populated through erate positive instances in a process analogous to SYPHT predictions, transforming an otherwise te- distant supervision (Mintz et al., 2009). Other to- dious manual entry task into a relatively simple kens in a document which match the field-type fil- decision confirmation problem. Usability features ter are randomly sampled as negative training in- such as token-highlighting and keyboard naviga- stances. Instances of labels and sparse features tion greatly decrease the time it takes to annotate are then used to train a gradient boosting decision a given document. tree model (LightGBM)4. To handle null predic- We utilise continuous active learning by priori- tions, we fit a threshold on token-level confidence tising the annotation of new documents from our which optimises a given performance metric; i.e. unlabeled corpus where the model is least confi- F-score for the models considered in this work. dent. Conversely we observe high-confidence pre- If the maximum likelihood value for a predicted dictions which disagree with past human annota- token-sequence falls below the threshold for that tions are good candidates for re-annotation; often field, a null prediction is returned instead. indicating the presence of annotation errors. 
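The token-level training-instance generation of Section 4.1 can be sketched roughly as below. The candidate filter, the matcher and the negative-sampling rate are simplified stand-ins of our own; the actual system combines these instances with proprietary features and a LightGBM ranker.

```python
import random

def is_candidate(token, field_type):
    """Cheap field-type filter (illustrative): currency fields keep only
    numeric-looking tokens; other field types pass everything through."""
    if field_type == "currency":
        return token.replace("$", "").replace(".", "").replace(",", "").isdigit()
    return True

def make_instances(tokens, label_value, field_type, norm, neg_rate=0.25, seed=0):
    """Distantly supervised labelling: every in-filter token whose normalised
    form matches the gold value becomes a positive instance; other in-filter
    tokens are randomly sampled as negative instances."""
    rng = random.Random(seed)
    instances = []
    for tok in tokens:
        if not is_candidate(tok, field_type):
            continue
        if norm(tok) == norm(label_value):
            instances.append((tok, 1))   # positive: matches gold value
        elif rng.random() < neg_rate:
            instances.append((tok, 0))   # sampled negative
    return instances

tokens = ["Total", "$55.00", "GST", "$5.00", "Net", "$50.00"]
pairs = make_instances(tokens, "$5.00", "currency",
                       norm=lambda v: v.replace("$", ""), neg_rate=1.0)
# positives for "$5.00", negatives for the other amounts; words are filtered out
```

In the real pipeline the (token, label) pairs would be expanded into lexical, positional and visual feature vectors before training, and the confidence threshold that turns low-ranked predictions into null outputs is then tuned on a held-out metric such as F-score.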
4.3 Service architecture

SYPHT has been developed with performance at scale as a primary requirement. We use a micro-service architecture to ensure that our system is robust to stochastic outages and that we can scale up individual pipeline components to meet demand. Services interact via a dedicated message queue, which increases fault tolerance and ensures consistent throughput. Our system is capable of scaling to service a throughput of hundreds of requests per second at low latency, supporting mobile and other near real-time prediction use-cases. We consider latency a core metric for real-world system performance and include it in our evaluation of comparable systems in Section 5.

5 Evaluation

In this section we describe our methodology for creating the experimental dataset and the system evaluation. We aim to understand how a variety of alternative extraction systems deal with various invoice formats. As a coarse representation of visual document structure, we compute a perceptual hash (Niu and Jiao, 2008) from the first page of each document in a sample of Australian invoices. Personally identifiable information (PII) was then manually removed from each invoice by a human reviewer. SYPHT VALIDATE was used to generate the labels for the task, with between two and four annotators per field depending on inter-annotator agreement. Annotators worked closely to ensure consistency between their labels and the data definitions listed in Section 3.1, with all fields having a sampled Cohen's kappa greater than 0.8, and all fields except net amount having a kappa greater than 0.9. During the annotation procedure four documents were flagged as low quality and excluded from the evaluation set, resulting in a final count of 129. In each of these cases annotators could not reliably determine field values due to poor image quality. We evaluated against our deployed system after ensuring that all documents in the evaluation set were excluded from the model's training set.

5.1 Compared systems

ABBYY. We ran ABBYY FlexiCapture 12 (www.abbyy.com/en-au/flexicapture/) in batch mode on a modern quad-core desktop computer. While ABBYY software provides tools for creating extraction templates by hand, we utilised the generic invoice extraction model for parity with the other comparison systems. By contrast with the other systems, which provided seamless API access, we operated the user interface manually and were unable to reliably record the distribution of prediction time per document. As such we only note the average extraction time aggregated over all test documents in Table 2.

EzzyBills (www.ezzybills.com/api/) automates data entry of invoices and accounts payable in business accounting systems. We utilised the EzzyBills REST API.

Rossum (www.rossum.ai) advertises a deep-learning driven data extraction API platform. We utilised their Python API (pypi.org/project/rossum) in our experiments.

6 Results

Table 1 presents accuracy results by field for each comparison system. SYPHT delivers the highest performance across measured fields, with a macro-averaged accuracy exceeding the comparable results by 23.7%, 22.8% and 20.2% (for Ezzy, ABBYY and Rossum respectively). Interestingly, we observe low scores across the board on the net amount field, with every system performing significantly worse than on the closely related gross amount. This field also obtained the lowest level of annotator agreement and was notoriously difficult to reliably assess; for example, the inclusion or exclusion of discounts, delivery costs and other adjustments to various subtotals on an invoice often complicates extraction.

Field           Ezzy   ABBYY  Rossum  Ours
Supplier ABN    76.7   80.6   -       99.2
Invoice Number  72.1   82.2   86.8    94.6
Document Date   67.4   45.0   90.7    96.1
Net Amount      53.5   51.2   55.8    80.6
GST Amount      69.8   72.1   45.0    90.7
Gross Amount    75.2   89.1   84.5    95.3
Avg.            69.1   70.0   72.6    92.8

Table 1: Prediction accuracy by field.

The next best system, Rossum, performed surprisingly well considering their coverage of the European market, excluding support for Australian-specific invoice fields such as ABN. Still, even after excluding ABN, net amount and GST, which may align to different field definitions, SYPHT maintains an 8 point accuracy advantage and more than 14 times lower median prediction latency.

We also see an exciting opportunity to provide self-service model development: the ability for a customer to use their own documents to generate a model tailored to their set of fields. This would allow us to offer SYPHT for use cases where either we cannot or would not collect the prerequisite data. SYPHT VALIDATE provides a straightforward method for bootstrapping extraction models by providing rapid data annotation and efficient use of annotator time through active learning.

        Avg.   25th   50th   75th
Rossum  67.06  47.7   54.4   91.0
Ezzy    27.9   20.6   26.9   34.5
ABBYY   5.6    -      -      -
Ours    4.2    3.3    3.8    4.8

Table 2: Prediction latency in seconds.

Table 2 summarises the average prediction latency in seconds for each system, alongside the times for documents at the 25th, 50th and 75th percentiles of the response time distribution. Under the constraint of batch processing within the desktop ABBYY extraction environment, we were unable to reliably record per-document prediction times and thus do not indicate their prediction response percentiles. SYPHT was faster than all comparison systems, and significantly faster relative to the other SaaS-based API services. Even with the lack of network overhead inherent to ABBYY's local extraction software, SYPHT maintains a 25% lower average prediction latency.
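The first-page perceptual hashing used in Section 5 to characterise visual invoice structure can be illustrated with a simple average-hash variant. This is a generic sketch over an already rasterised grayscale page, not the Niu and Jiao (2008) scheme the authors cite, and the function names and tiny 2x2 example grid are invented for illustration.

```python
# Sketch of an average-hash style perceptual hash (illustrative variant,
# not the scheme used in the paper). Input is a downsampled grayscale page.

def average_hash(pixels, size=8):
    """pixels: size*size grayscale values (0-255), row-major.
    Returns a bit string with 1 where a cell is brighter than the mean."""
    assert len(pixels) == size * size
    mean = sum(pixels) / len(pixels)
    return "".join("1" if p > mean else "0" for p in pixels)

def hamming(h1, h2):
    """Bit distance between two hashes; small distance = similar layout."""
    return sum(a != b for a, b in zip(h1, h2))

# Two visually similar 'pages', downsampled to 2x2 cells for the example:
a = average_hash([200, 40, 210, 35], size=2)
b = average_hash([190, 50, 205, 30], size=2)
print(hamming(a, b))  # 0: the two pages share the same layout signature
```

Grouping invoices by such hashes gives the coarse layout-diversity sampling described above without comparing raw images.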
In a direct comparison with the other API-based products we demonstrate stronger results still, with EzzyBills and Rossum being slower than SYPHT by factors of 6.6 and 15.9 respectively in terms of mean prediction time per document.

7 Discussion and future work

While it is not a primary component of our current system, we have developed and continue to develop a number of solutions based on neural network models. Models for sequence labelling, such as LSTM (Gers et al., 1999) or Transformer (Vaswani et al., 2017) networks, can be directly ensembled into the current system. We are also exploring the use of object classification and detection models to make use of the visual component of document data. Highly performant models such as YOLO (Redmon and Farhadi, 2018) are particularly interesting due to their ability to be used in real time. We expect sub-5 second response times to constitute a rough threshold for realistic deployment of extraction systems in real-time applications, making SYPHT, in contrast to the other two API-based services, the system best suited to such deployment.

8 Conclusion

We present SYPHT, a SaaS API for key-field extraction from business documents. Our comparison with alternative extraction systems demonstrates both higher accuracy and lower latency across extracted fields, enabling real-time applications in invoice and bill payment.

Acknowledgements

We would like to thank members of the SYPHT team for their contributions to the system, annotation and evaluation effort: Duane Allam, Farzan Maghami, Paarth Arora, Raya Saeed, Saskia Parker, Simon Mittag and Warren Billington.

References

Francesca Cesarini, Marco Gori, Simone Marinai, and Giovanni Soda. 1998. Informys: A flexible invoice-like form-reader system. IEEE Trans. Pattern Anal. Mach. Intell. 20(7):730–745.

Bertrand Coüasnon. 2006. DMOS, a generic document recognition method: application to table structure analysis in a general and in a specific way. International Journal of Document Analysis and Recognition (IJDAR) 8(2):111–122. https://doi.org/10.1007/s10032-005-0148-5.
Vincent Poulain D'Andecy, Emmanuel Hartmann, and Marçal Rusiñol. 2018. Field extraction by hybrid incremental and a-priori structural templates. In 13th IAPR International Workshop on Document Analysis Systems, DAS 2018, Vienna, Austria, April 24-27, 2018, pages 251–256. https://doi.org/10.1109/DAS.2018.29.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to forget: Continual prediction with LSTM.

Bertin Klein, Stevan Agne, and Andreas Dengel. 2004. Results of a study on invoice-reading systems in Germany. In Simone Marinai and Andreas R. Dengel, editors, Document Analysis Systems VI. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 451–462.

Paul McNamee, Heather Simpson, and Hoa Trang Dang. 2009. Overview of the TAC 2009 Knowledge Base Population Track. In Proceedings of the 2009 Text Analysis Conference.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA, volume 2, pages 1003–1011. http://dl.acm.org/citation.cfm?id=1690219.1690287.

Xia-mu Niu and Yu-hua Jiao. 2008. An overview of perceptual hashing. Acta Electronica Sinica 36(7):1405–1411.

Rasmus Berg Palm, Ole Winther, and Florian Laws. 2017. CloudScan - a configuration-free invoice analysis system using recurrent neural networks. In 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, November 9-15, 2017, pages 406–413. https://doi.org/10.1109/ICDAR.2017.74.

Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.

Marçal Rusiñol, Tayeb Benkhelfallah, and Vincent Poulain D'Andecy. 2013. Field extraction from administrative documents by incremental structural templates. In Proceedings of the 12th International Conference on Document Analysis and Recognition. IEEE Computer Society, pages 1100–1104.

Daniel Schuster, Klemens Muthmann, Daniel Esser, Alexander Schill, Michael Berger, Christoph Weidling, Kamil Aliyev, and Andreas Hofmeier. 2013. Intellix – end-user trained information extraction for document archiving. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition. IEEE Computer Society, Washington, DC, USA, ICDAR '13, pages 101–105. https://doi.org/10.1109/ICDAR.2013.28.

Y. Y. Tang, C. Y. Suen, Chang De Yan, and M. Cheriet. 1995. Financial document processing based on staff line and description language. IEEE Transactions on Systems, Man, and Cybernetics 25(5):738–754. https://doi.org/10.1109/21.376488.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Short papers

Exploring Textual and Speech Information in Dialogue Act Classification with Speaker Domain Adaptation

Xuanli He* (Monash University) [email protected]
Quan Hung Tran* (Monash University) [email protected]
William Havard (Univ. Grenoble Alpes) [email protected]
Laurent Besacier (Univ. Grenoble Alpes) [email protected]
Ingrid Zukerman (Monash University) [email protected]
Gholamreza Haffari (Monash University) [email protected]
* Equal contribution

Abstract

In spite of the recent success of Dialogue Act (DA) classification, the majority of prior works focus on text-based classification with oracle transcriptions, i.e. human transcriptions, instead of Automatic Speech Recognition (ASR) transcriptions. Moreover, because of speaker domain shift, the performance of this classification task may deteriorate. In this paper, we explore the effectiveness of using both acoustic and textual signals, with either oracle or ASR transcriptions, and investigate speaker domain adaptation for DA classification. Our multimodal model proves to be superior to the unimodal models, particularly when the oracle transcriptions are not available. We also propose an effective method for speaker domain adaptation, which achieves competitive results.

1 Introduction

Dialogue Act (DA) classification is a sequence-labelling task, mapping a sequence of utterances to their corresponding DAs. Since DA classification plays an important role in understanding spontaneous dialogue (Stolcke et al., 2000), numerous techniques have been proposed to capture the semantic correlation between utterances and DAs. Earlier on, statistical techniques such as Hidden Markov Models (HMMs) were widely used to recognise DAs (Stolcke et al., 2000; Julia et al., 2010). Recently, due to the enormous success of neural networks in sequence labeling/transduction tasks (Sutskever et al., 2014; Bahdanau et al., 2014; Popov, 2016), several recurrent neural network (RNN) based architectures have been proposed to conduct DA classification, resulting in promising outcomes (Ji et al., 2016; Shen and Lee, 2016; Tran et al., 2017a).

Despite the success of previous work in DA classification, there are still several fundamental issues. Firstly, most of the previous works rely on transcriptions (Ji et al., 2016; Shen and Lee, 2016; Tran et al., 2017a). Fewer of these focus on combining speech and textual signals (Julia et al., 2010), and even then, the textual signals in these works utilise the oracle transcriptions. We argue that in the context of a spoken dialog system, oracle transcriptions of utterances are usually not available, i.e. the agent does not have access to the human transcriptions. Speech and textual data complement each other, especially when textual data is from ASR systems rather than oracle transcripts. Furthermore, domain adaptation in text- or speech-based DA classification is relatively under-investigated. As shown in our experiments, DA classification models perform much worse when they are applied to new speakers.

In this paper, we explore the effectiveness of using both acoustic and textual signals, and investigate speaker domain adaptation for DA classification. We present a multimodal model to combine text and speech signals, which proves to be superior to the unimodal models, particularly when the oracle transcriptions are not available. Moreover, we propose an effective unsupervised method for speaker domain adaptation, which learns a suitable encoder for the new domain, giving rise to representations similar to those in the source domain.

2 Model Description

In this section, we describe the basic structure of our model, which combines the textual and speech modalities. We also introduce a representation learning approach using adversarial ideas to tackle the domain adaptation problem.

Xuanli He, Quan Tran, William Havard, Laurent Besacier, Ingrid Zukerman and Gholamreza Haffari. 2018. Exploring Textual and Speech information in Dialogue Act Classification with Speaker Domain Adaptation. In Proceedings of Australasian Language Technology Association Workshop, pages 61−65.

Figure 1: The multimodal model. For the utterance t, the left and right sides are encoded speech and text, respectively.

2.1 Our Multimodal Model

A conversation is comprised of a sequence of utterances u_1, ..., u_T, and each utterance u_t is labeled with a DA a_t. An utterance could include text, speech or both. We focus on online DA classification, and our classification model attempts to directly model the conditional probability p(a_{1:T} | u_{1:T}), decomposed as follows:

  p(a_{1:T} | u_{1:T}) = \prod_{t=1}^{T} p(a_t | a_{t-1}, u_t).   (1)

According to Eqn. 1, at training time the previous label comes from the ground-truth data, while at inference time this information comes from the model. This discrepancy, referred to as label bias, can result in error accumulation. To incorporate the previous DA information and mitigate the label-bias problem, we adopt the uncertainty propagation architecture (Tran et al., 2017b). The conditional probability term in Eqn. 1 is computed as follows:

  a_t | a_{t-1}, u_t ~ q_t
  q_t = softmax(W \cdot c(u_t) + b)
  W = \sum_a q_{t-1}(a) W^a,   b = \sum_a q_{t-1}(a) b^a

where W^a and b^a are DA-specific parameters gated on the DA a, c(u_t) is the encoding of the utterance u_t, and q_{t-1} represents the uncertainty distribution over the DAs at time step t-1.

Text Utterance. An utterance u_t includes a list of words w_t^1, ..., w_t^n. The word w_t^i is embedded by x_t^i = e(w_t^i), where e is an embedding table.

Speech Utterance. We apply a frequency-based transformation on raw speech signals to acquire Mel-frequency cepstral coefficients (MFCCs), which have been very effective in speech recognition (Mohamed et al., 2012). To learn the context-specific features of the speech signal, a convolutional neural network (CNN) is employed over the MFCCs:

  x'_t^1, ..., x'_t^m = CNN(s_t^1, ..., s_t^k)

where s_t^i is the MFCC feature vector at position i of the t-th utterance.

Encoding of Text+Speech. As illustrated in Figure 1, we employ two RNNs with LSTM units to encode the text and speech sequences of an utterance u_t:

  c(u_t)^{tx} = RNN_\theta(x_t^1, ..., x_t^n)
  c(u_t)^{sp} = RNN_{\theta'}(x'_t^1, ..., x'_t^m)

where the encodings of the text c(u_t)^{tx} and speech c(u_t)^{sp} are the last hidden states of the corresponding RNNs, whose parameters are denoted by \theta and \theta'. The distributed representation c(u_t) of the utterance u_t is then the concatenation of c(u_t)^{tx} and c(u_t)^{sp}.

2.2 Speaker Domain Adaptation

Different people tend to speak differently. This creates a problem for DA classification systems, as unfamiliar speech signals might not be recognised properly. In our preliminary experiments, the performance of DA classification on speakers that are unseen in the training set suffers dramatic degradation on the test set. This motivates us to explore the problem of speaker domain adaptation in DA classification.

We assume we have a large amount of labelled source data pairs {X_src, Y_src} and a small amount of unlabelled target data X_trg, where an utterance u ∈ X includes both speech and text parts. Inspired by Tzeng et al. (2017), our goal is to learn a target domain encoder which can fool a domain classifier C_\phi in distinguishing whether an utterance belongs to the source or target domain. Once the target encoder is trained to produce representations which look like those coming from the source domain, the target encoder can be used together with the other components of the source DA prediction model to predict DAs for the target domain (see Figure 2).

Figure 2: Overview of the discriminative model. Dashed lines indicate frozen parts.

We use a 1-layer feed-forward network as the domain classifier:

  C_\phi(r) = \sigma(W^C \cdot r + b^C)

where the classifier produces the probability of the input representation r belonging to the source domain, and \phi denotes the classifier parameters {W^C, b^C}. Let the target and source domain encoders be denoted by c_trg(u) and c_src(u), respectively. The training objective of the domain classifier is:

  \min_\phi L_1(X_src, X_trg, C_\phi) = - E_{u ~ X_src}[\log C_\phi(c_src(u))] - E_{u ~ X_trg}[\log(1 - C_\phi(c_trg(u)))].

As mentioned before, we keep the source encoder fixed and train the parameters of the target domain encoder. The training objective of the target domain encoder is:

  \min_{\theta'_trg} L_2(X_trg, C_\phi) = - E_{u ~ X_trg}[\log C_\phi(c_trg(u))]

where the optimisation is performed over the speech RNN parameters \theta'_trg of the target encoder. We also tried to optimise other parameters (i.e. CNN parameters, word embeddings and text RNN parameters), but the performance is similar to optimising the speech RNN only. This is possibly because the major difference between source and target domain data is due to the speech signals. We alternate between optimising L_1 and L_2 using Adam (Kingma and Ba, 2014) until a training condition is met.

3 Experiments

3.1 Datasets

We test our models on two datasets: the MapTask Dialog Act data (Anderson et al., 1991) and the Switchboard Dialogue Act data (Jurafsky et al., 1997).

MapTask dataset. This dataset consists of 128 conversations labelled with 13 DAs. We randomly partition this data into 80% training, 10% development and 10% test sets, having 103, 12 and 13 conversations respectively.

Switchboard dataset. There are 1155 transcriptions of telephone conversations in this dataset, and each utterance falls into one of 42 DAs. We follow the setup proposed by Stolcke et al. (2000): 1115 conversations for training, 21 for development and 19 for testing. Since we do not have access to the original recordings of the Switchboard dataset, we use synthetic speech generated by a text-to-speech (TTS) system from the oracle transcriptions.

3.2 Results

In-Domain Evaluation. Unlike most prior work (Ji et al., 2016; Shen and Lee, 2016; Tran et al., 2017a), we use ASR transcripts, produced by the CMUSphinx ASR system, rather than the oracle text. We argue that most dialogues in the real world are in the speech format, thus our setup is closer to the real-life scenario.

As shown in Tables 1 and 2, our multimodal model outperforms strong baselines on the Switchboard and MapTask datasets when using the ASR transcriptions. When using the oracle text, the information from the speech signal does not lead to further improvement, possibly due to the existence of acoustic features (such as tones, question markers etc.) in the high-quality transcriptions. On MapTask, there is a large gap between oracle-based and ASR-based models. This degradation is mainly caused by the poor-quality acoustic signals in MapTask, making ASR ineffective compared to directly predicting DAs from the speech signal.

Models                    Accuracy
Oracle text
  Stolcke et al. (2000)   71.00%
  Shen and Lee (2016)     72.60%
  Tran et al. (2017a)     74.50%
  Text only (ours)        74.97%
  Text+Speech (ours)      74.98%
Speech and ASR
  Speech only             59.71%
  Text only (ASR)         66.39%
  Text+Speech (ASR)       68.25%

Table 1: Results of different models on Switchboard data.

Models                    Accuracy
Oracle text
  Julia et al. (2010)     55.40%
  Tran et al. (2017a)     61.60%
  Text only (ours)        61.73%
  Text+Speech (ours)      61.67%
Speech and ASR
  Speech only             39.32%
  Text only (ASR)         38.10%
  Text+Speech (ASR)       39.39%

Table 2: Results of different models on MapTask data.

Out-of-Domain Evaluation. We evaluate our domain adaptation model on the out-of-domain data on Switchboard. Our training data comprises five known speakers, whereas the development and test sets include data from three new speakers. The speech for these 8 speakers is generated by a TTS system.

As described in Section 2.2, we pre-train our speech models on the labeled training data from the 5 known speakers, then train speech encoders for the new speakers using speech from both known and new speakers. During domain adaptation, the five known speakers are marked as the source domain, while the three new speakers are treated as the target domains. For domain adaptation with unlabelled data, the DA tags of both the source and target domains are removed. We test the source-only model and the domain adaptation models only on the three new speakers in the test data. As shown in Table 3, compared with the source-only model, the domain adaptation strategy improves the performance of the speech-only and text+speech models, consistently and substantially.

Methods               Speech    Text+Speech
Unadapted             48.73%    63.57%
Domain Adapted        54.37%    67.21%
Supervised Learning   56.19%    68.04%

Table 3: Experimental results of the unadapted (i.e. source-only) and domain adapted models using unlabeled data on Switchboard, as well as the supervised learning upper bound.

To assess the effectiveness of our domain adaptation architecture, we compare it with the supervised learning scenario, where the model has access to labeled data from all speakers during training. To do this, we randomly add two thirds of the labelled development data of the new speakers to the training set, and apply the trained model to the test set. The supervised learning scenario is an upper bound to our domain adaptation approach, as it makes use of labeled data; see the results in the last row of Table 3. However, the gap between supervised learning and domain adaptation is small compared to that between the adapted and unadapted models, showing that our domain adaptation technique has been effective.

4 Conclusion

In this paper, we have proposed a multimodal model combining textual and acoustic signals for DA prediction. We have demonstrated that our model exceeds unimodal models, especially when oracle transcriptions do not exist. In addition, we have proposed an effective domain adaptation technique in order to adapt our multimodal DA prediction model to new speakers.

References

Anne H. Anderson, Miles Bader, Ellen Gurman Bard, Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, Stephen Isard, Jacqueline Kowtko, Jan McAllister, Jim Miller, et al. 1991. The HCRC Map Task corpus. Language and Speech 34(4):351–366.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.

John J. Godfrey, Edward C. Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on. IEEE, volume 1, pages 517–520.

Yangfeng Ji, Gholamreza Haffari, and Jacob Eisenstein. 2016. A latent variable recurrent neural network for discourse relation language models. CoRR abs/1603.01913.

Fatema N. Julia, Khan M. Iftekharuddin, and Atiq U. Islam. 2010. Dialog act classification using acoustic and discourse information of MapTask data. International Journal of Computational Intelligence and Applications 09(04):289–311.

D. Jurafsky, E. Shriberg, and D. Biasca. 1997. Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual. Technical Report Draft 13, University of Colorado, Institute of Cognitive Science.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. CoRR abs/1409.3215.

Quan Hung Tran, Ingrid Zukerman, and Gholamreza Haffari. 2017a. A hierarchical neural model for learning sequences of dialogue acts. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. volume 1, pages 428–437.

Quan Hung Tran, Ingrid Zukerman, and Gholamreza Haffari. 2017b. Preserving distributional information in dialogue act classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pages 2141–2146.

Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. CoRR abs/1702.05464.
Abdel-rahman Mohamed, Geoffrey Hinton, and Ger- ald Penn. 2012. Understanding how deep belief net- works perform acoustic modelling. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, pages 4273– 4276. Ramesh Nallapati, Bing Xiang, and Bowen Zhou. 2016. Sequence-to-sequence rnns for text summa- rization. CoRR abs/1602.06023. Alexander Popov. 2016. Deep Learning Architec- ture for Part-of-Speech Tagging with Word and Suf- fix Embeddings, Springer International Publishing, Cham, pages 68–77. Sheng-syun Shen and Hung-yi Lee. 2016. Neural at- tention models for sequence classification: Analysis and application to key term extraction and dialogue act detection. CoRR abs/1604.00077. Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca A. Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and 65 Cluster Labeling by Word Embeddings and WordNet’s Hypernymy Hanieh Poostchi Massimo Piccardi University of Technology Sydney University of Technology Sydney Capital Markets CRC [email protected] [email protected] Abstract vidually. Cluster-internal labeling approaches in- clude computing the clusters’ centroids and using Cluster labeling is the assignment of rep- them as labels, or using lists of terms with high- resentative labels to clusters of documents est frequencies in the clusters. However, all these or words. Once assigned, the labels can approaches can only select cluster labels from the play an important role in applications such terms and phrases that explicitly appear in the doc- as navigation, search and document clas- uments, possibly failing to provide an appropri- sification. However, finding appropriately ate level of abstraction or description (Lau et al., descriptive labels is still a challenging 2011). As an example, a word cluster containing task. 
In this paper, we propose various words dog and wolf should not be labeled with ei- approaches for assigning labels to word ther word, but as canids. For this reason, in this clusters by leveraging word embeddings paper we explore several approaches for labeling and the synonymy and hypernymy rela- word clusters obtained from a document collec- tions in the WordNet lexical ontology. Ex- tion by leveraging the synonymy and hypernymy periments carried out using the WebAP relations in the WordNet taxonomy (Miller, 1995), document dataset have shown that one of together with word embeddings (Mikolov et al., the approaches stand out in the compari- 2013; Pennington et al., 2014). son and is capable of selecting labels that A hypernymy relation represents an asymmetric are reasonably aligned with those chosen relation between a class and each of its instances. by a pool of four human annotators. A hypernym (e.g., vertebrate) has a broader con- text than its hyponyms (bird, fishes, reptiles etc). 1 Introduction and Related Work Conversely, the contextual properties of the hy- Document collections are often organized into ponyms are usually a subset of those of their hy- clusters of either documents or words to facilitate pernym(s). Hypernymy has been used extensively applications such as navigation, search and classi- in natural language processing, including in re- fication. The organization can prove more useful cent works such as Yu et al. (2015) and HyperVec if its clusters are characterized by sets of represen- (Nguyen et al., 2017) that have proposed learning tative labels. The task of assigning a set of labels word embeddings that reflect the hypernymy rela- to each individual cluster in a document organi- tion. 
Based on this, we have decided to make use zation is known as cluster labeling (Wang et al., of available hypernym-hyponym data to propose 2014) and it can provide a useful description of an approach for labeling clusters of keywords by a the collection in addition to fundamental support representative selection of their hypernyms. for navigation and search. In the proposed approach, we first extract a set In Manning et al. (2008), cluster labeling ap- of keywords from the original document collec- proaches have been subdivided into i) differen- tion. We then apply a step of hierarchical cluster- tial cluster labeling and ii) cluster-internal label- ing on the keywords to partition them into a hier- ing. The former selects cluster labels by compar- archy of clusters. To this aim, we represent each ing the distribution of terms in one cluster with keyword as a real-valued vector using pre-trained those of the other clusters while the latter selects word embeddings (Pennington et al., 2014) and labels that are solely based on each cluster indi- repeatedly apply a standard clustering algorithm. Hanieh Poostchi and Massimo Piccardi. 2018. Cluster Labeling by Word Embeddings and WordNet’s Hypernymy. In Proceedings of Australasian Language Technology Association Workshop, pages 66−70. 4. Compute the word co-occurrence matrix X|W |×|W | for W . 5. Calculate word score score(w) = deg(w)/freq(w), where deg(w) = P Figure 1: The proposed cluster labeling pipeline. i∈{1,...,|W |} X[w, i] and freq(w) = P i∈{1,...,|W |}(X[w, i] 6= 0). For labeling the clusters, we first look up all the 6. Score each candidate keyword as the sum of synonyms of the keywords and, in turn, their hy- its member word scores. pernyms in the WordNet hierarchy. We then en- 7. Select the top T scoring candidates as key- code the hypernyms as word embeddings and use words for the document. various approaches to select them based on their distance from the clusters’ centers. 
The experimental results over a benchmark document collection have shown that such a distance-based selection is reasonably aligned with the hypernyms selected by four independent human annotators. As a side result, we show that the employed word embeddings spontaneously capture the hypernymy relation, offering a plausible justification for the effectiveness of the proposed method.

2 The Proposed Pipeline

The proposed pipeline of processing steps is shown in Figure 1. First, keywords are extracted from each document in turn and accumulated in an overall set of unique keywords. After mapping such keywords to pre-trained word embeddings, hierarchical clustering is applied in a top-down manner. The leaves of the constructed tree are considered as the clusters to be labeled. Finally, each cluster is labeled automatically by leveraging a combination of WordNet's hypernyms and synsets and word embeddings. The following subsections present each step in greater detail.

Figure 1: The proposed cluster labeling pipeline.

2.1 Keyword Extraction

For the keyword extraction, we have used the rapid automatic keyword extraction (RAKE) of Rose et al. (2010). This method extracts keywords (i.e., single words or very short word sequences) from a given document collection, and its main steps can be summarized as:

1. Split a document into sentences using a predefined set of sentence delimiters.
2. Split sentences into sequences of contiguous words at phrase delimiters to build the candidate set.
3. Collect the set of unique words (W) that appear in the candidate set.
4. Compute the word co-occurrence matrix X of size |W| × |W| for W.
5. Calculate the word score as score(w) = deg(w)/freq(w), where deg(w) = Σ_{i=1..|W|} X[w, i] and freq(w) = |{i : X[w, i] ≠ 0}|.
6. Score each candidate keyword as the sum of its member word scores.
7. Select the top T scoring candidates as keywords for the document.

Alternatively, RAKE can use other combinations of deg(w) and freq(w) as the word scoring function. The keywords extracted from all the documents are accumulated into a set, C, ensuring uniqueness.
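The scoring steps above (4–6) can be sketched as follows. This is a minimal illustration of the score definitions, not the original RAKE implementation; the function name and input format are ours, and candidate extraction (steps 1–3) is assumed to have been done already:

```python
from collections import defaultdict

def score_keywords(candidates):
    """Score candidate keywords as in RAKE steps 4-6.

    candidates: list of candidate keywords, each a list of member words.
    Returns a dict mapping each candidate (joined as a string) to its score.
    """
    # Step 4: word co-occurrence counts within each candidate.
    cooc = defaultdict(lambda: defaultdict(int))
    for cand in candidates:
        for w in cand:
            for v in cand:
                cooc[w][v] += 1
    # Step 5: score(w) = deg(w) / freq(w).
    word_score = {}
    for w, row in cooc.items():
        deg = sum(row.values())                         # total co-occurrence count
        freq = sum(1 for c in row.values() if c != 0)   # number of co-occurring words
        word_score[w] = deg / freq
    # Step 6: a candidate's score is the sum of its member word scores.
    return {" ".join(c): sum(word_score[w] for w in c) for c in candidates}
```

Step 7 then simply keeps the top T entries of the returned dictionary.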
2.2 Hierarchical Clustering of Keywords

A top-down approach is used to hierarchically cluster the keywords in C. First, each component word of each keyword is mapped onto a numerical vector using pre-trained GloVe50d [1] word embeddings (Pennington et al., 2014); missing words are mapped to zero vectors. Each keyword k is then represented with the average vector of its component words. Starting from set C as the root of the tree, we follow a branch-and-bound approach, where each tree node is clustered into c clusters using the k-means algorithm (Hartigan and Wong, 1979). A node is marked as a leaf if it contains fewer than n keywords or it belongs to level d, the tree's depth limit. The leaf nodes are the clusters to be named with a set of verbal terms.

[1] http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
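The top-down procedure of Section 2.2 can be sketched as below. This is an illustrative implementation under our own assumptions: the function names are ours, the `kmeans` helper is a minimal Lloyd-style stand-in for the Hartigan and Wong algorithm, and the defaults mirror the paper's c, n and d parameters:

```python
import numpy as np

def kmeans(X, c, iters=20, seed=0):
    """Minimal Lloyd's k-means; returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for k in range(c):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(0)
    return labels

def hierarchical_clusters(vectors, c=8, n=100, d=4, level=0):
    """Top-down clustering: split each node into c clusters with k-means,
    stopping when a node has fewer than n keywords or reaches depth d;
    returns the leaf clusters as arrays of indices into `vectors`."""
    X = np.asarray(vectors, dtype=float)
    idx = np.arange(len(X))
    if len(X) < n or level == d or len(X) < c:
        return [idx]
    labels = kmeans(X, c)
    leaves = []
    for k in range(c):
        members = idx[labels == k]
        if len(members):
            # Recurse on the sub-node; map local indices back to global ones.
            for sub in hierarchical_clusters(X[members], c, n, d, level + 1):
                leaves.append(members[sub])
    return leaves
```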
2.3 Cluster Labeling

As discussed in Section 1, we aim to label each cluster with descriptive terms. The labels should be more general than the cluster's members, so as to abstract the nature of the cluster. To this end, we leverage the hypernym-hyponym correspondences in the lexical ontology. First, for each cluster, we create a large set, L, of candidate labels by including the hypernyms [2] of the component words, expanded by their synonyms, of all the keywords. The synonyms are retrieved from WordNet's sets of synonyms, called synsets. Then, we apply the four following approaches to select l labels from set L:

• FreqKey: choose the l most frequent hypernyms of the l most frequent keywords.
• CentKey: choose the l most central hypernyms of the l most central keywords.
• FreqHyp: choose the l most frequent hypernyms.
• CentHyp: choose the l most central hypernyms.

Approaches FreqKey and FreqHyp are based on frequencies in the collection. For performance evaluation, we sort their selected labels in descending frequency order. In CentKey and CentHyp, the centrality is computed with respect to the cluster's center in the embedding space, defined as the average vector of all its keywords, K̄ = (1/|K|) Σ_{k∈K} k̄. The distance between hypernym h and the cluster's center is d(h̄, K̄) = ||h̄ − K̄||, where h̄ is the average vector of the hypernym's component words. The labels selected by these two approaches are sorted in ascending distance order.

[2] Nouns only (not verbs).
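Under these definitions, the CentHyp selection reduces to ranking candidate hypernyms by Euclidean distance from the cluster centre. A sketch (the function name and the toy vectors are ours; in the paper the vectors are GloVe embeddings averaged over component words):

```python
import numpy as np

def cent_hyp(keyword_vecs, hypernym_vecs, l=10):
    """Keep the l hypernyms closest to the cluster centre, where the
    centre is the mean of the keyword vectors."""
    center = np.mean(list(keyword_vecs.values()), axis=0)
    dist = {h: float(np.linalg.norm(np.asarray(v) - center))
            for h, v in hypernym_vecs.items()}
    # Ascending distance order, as in the paper.
    return sorted(dist, key=dist.get)[:l]
```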
3 Experiments and Results

For the experiments, we have used the WebAP dataset [3] (Keikha et al., 2014) as the document collection. This dataset contains 6,399 documents of diverse nature, with a total of 1,959,777 sentences. For the RAKE software [4], the hyper-parameters are the minimum number of characters of each keyword, the maximum number of words of each keyword, and the minimum number of times each keyword appears in the text; they have been left to their default values of 5, 3, and 4, respectively. Likewise, parameter T has been set to its default value of one third of the words in the co-occurrence matrix. For the hierarchical clustering, we have used c = 8, n = 100 and d = 4 based on our own subjective assessment.

[3] https://ciir.cs.umass.edu/downloads/WebAP/
[4] https://github.com/aneesha/RAKE

3.1 Human Annotation and Evaluation

For the evaluation, eight clusters (one from each sub-tree) were chosen to be labeled manually by four independent human annotators. For this purpose, for each cluster, we provided the list of its keywords, K, and the candidate labels, L, to the annotators, and asked them to select the best l = 10 terms from L to describe the cluster. Initially, we had considered asking the annotators to also select representative labels from K, but a preliminary analysis showed that these were unsuitable to describe the cluster as a whole (Table 1 shows an example). Although the annotators were asked to provide their selection as a ranked list, we did not make use of their ranking order in the evaluation.

To evaluate the prediction accuracy, for each cluster we have considered the union of the lists provided by the human annotators as the ground truth (since |L| was typically in the order of 150−200, the intersection of the lists was often empty or minimal). As performance figure, we report the well-known precision at k (P@k) for values of k between one and ten. We have not used the recall, since the ground truth had size 40 in most cases while the prediction's size was kept to l = 10 in all cases, resulting in a highest possible recall of 0.25.

Figure 2 compares the average P@k for k = 1,...,10 for the four proposed approaches. The two approaches based on minimum distance to the cluster center (CentKey and CentHyp) have outperformed the other two approaches based on frequencies (FreqKey and FreqHyp) for all values of k. This shows that the word embedding space is in good correspondence with the human judgement. Moreover, approach CentHyp has outperformed all other approaches for all values of k, showing that the hypernyms' centrality in the cluster is the key property for their effective selection.

Figure 2: Precision at k (P@k) for k = 1,...,10 averaged over the eight chosen clusters for the compared approaches.
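The evaluation protocol above — the union of the annotators' lists as ground truth, and precision at k over the ranked predictions — amounts to the following (a minimal sketch; the function name is ours):

```python
def precision_at_k(ranked_labels, annotator_lists, k):
    """P@k: fraction of the top-k predicted labels that appear in the
    union of the labels chosen by the human annotators."""
    truth = set().union(*[set(lst) for lst in annotator_lists])
    hits = sum(1 for label in ranked_labels[:k] if label in truth)
    return hits / k
```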
3.2 Visualization of Keywords and Hypernyms

Hypernyms are more general terms than the corresponding keywords; we therefore expect them to lie at a larger mutual distance in the word embedding space. To explore their distribution, we have used two-dimensional multidimensional scaling (MDS) visualizations (Borg and Groenen, 2005) of selected clusters. For each cluster, the keyword set K, the hypernym set L, and the cluster's center have all been aggregated as a single set before applying MDS. An example is shown in Figure 3. As can be seen, the hypernyms (blue dots) nicely distribute as a circular crown, external and concentric to the keywords (black dots), showing that the hypernymy relation corresponds empirically to a radial expansion away from the cluster's center. This likely stems from the embedding space's requirement to simultaneously enforce meaningful distances between the different keywords, between the keywords and the corresponding hypernyms, and between the hypernyms themselves. The hypernyms selected by the annotators (green and magenta dots) are among the closest to the cluster's center, and thus those selected by CentHyp (red and magenta dots) have the best correspondence (magenta dots alone) among the explored approaches.

Figure 3: Two-dimensional visualization of an example cluster (this figure should be viewed in color). The black and blue dots are the cluster's keywords and the keywords' hypernyms, respectively. The green dots are the hypernyms selected by the human annotators, the red dots are the hypernyms selected by CentHyp, and their intersection is recolored in magenta. The cluster's center is the turquoise star.

3.3 A Detailed Example

As a detailed example, Table 1 lists all the keywords of a sample cluster and the hypernyms selected by the four human annotators and CentHyp. Some of the hypernyms selected by more than one annotator (e.g., "electronic communication", "web page" and "computer file") have also been successfully identified by CentHyp. On the other hand, CentHyp has selected at least two terms ("commercial enterprise" and "reference book") that are unrelated to the cluster. Qualitatively, we deem the automated annotation as noticeably inferior to the human annotations, yet usable wherever manual annotation is infeasible or impractical.

Keywords: website www, clearinghouse, nih website, bulletin, websites, hotline, kbr publications, pfm file, syst publication, gov web site, dhhs publication, beta site, lexis nexis document, private http, national register bulletin, daily routines, data custodian, information, serc newsletter, certified mail, informational guide, dot complaint database, coverage edit followup, local update, mass mailing, ahrq web site, homepage, journal messenger, npl site, pdf private, htm centers, org website, web site address, telephone directory, service records, page layout program, service invocation, newsletter, card reader, advisory workgroup, library boards, full text online, usg publication, webpage, bulletin boards, fbis online, teleconference info, journal url, insert libraries, headquarters files, volunteer website http, bibliographic records, vch publishers, ptd web site, tsbp newsletter, electronic bulletin boards, email addresses, ecommerce, traveler, api service, intranet, website http, newsletter nps files, mail advertisement transmitted, subscribe, nna program, npci website, bulletin board, fais information, archiving, page attachment, nondriver id, mail etiquette, ip address, national directory, web page, pdq editorial boards, aml sites, dhs site, ptd website, directory ers web site, forums, digest, beta site management, directories, ccir papers, ieee press, fips publication, org web site, clearinghouse database, monterey database, hotlines, dslip description info, danish desk files, sos web site, bna program, newsletters, inspections portal page, letterhead, app ropri, image file directory, website, electronic mail notes, web site http, customized template page, mail addresses, health http, internet questionnaire assistance, electronic bulletin board, eos directly addresses, templates directory, beta site testers, informational, dataplot auxiliary directory, coverage edit, quarterly newsletter, distributed, reader, records service, web pages.
Annotator 1: electronic communication, computer network, web page, web site, mail, text file, computer file, protocol, software, electronic equipment
Annotator 2: computer network, telecommunication, computer, mail, web page, information, news, press, code, software
Annotator 3: news, informing, medium, web page, computer file, written record, document, press, article, essay
Annotator 4: communication, electronic communication, informing, press, medium, document, electronic equipment, computer network, transmission, record
CentHyp: electronic communication, information measure, text file, web page, informing, print media, web site, computer file, commercial enterprise, reference book

Table 1: An example cluster. The hypernyms selected by CentHyp and by at least one annotator are shown in boldface.
4 Conclusion

This paper has explored various approaches for labeling keyword clusters based on the hypernyms from the WordNet lexical ontology. The proposed approaches map both the keywords and their hypernyms to a word embedding space and leverage the notion of centrality in the cluster. Experiments carried out using the WebAP dataset have shown that one of the approaches (CentHyp) has outperformed all the others in terms of precision at k for all values of k, and has provided labels that are reasonably aligned with those of a pool of annotators. We plan to test the usefulness of the labels for tasks of search expansion in the near future.

Acknowledgments

This research has been funded by the Capital Markets Cooperative Research Centre in Australia and supported by Semantic Sciences Pty Ltd.
References

I. Borg and P. J. F. Groenen. 2005. Modern Multidimensional Scaling: Theory and Applications. Springer Series in Statistics. Springer, New York.
J. A. Hartigan and M. A. Wong. 1979. A K-Means Clustering Algorithm. Applied Statistics 28(1):100–108.
M. Keikha, J. H. Park, W. B. Croft, and M. Sanderson. 2014. Retrieving Passages and Finding Answers. In Proceedings of the 2014 Australasian Document Computing Symposium (ADCS), pages 81–84.
J. H. Lau, K. Grieser, D. Newman, and T. Baldwin. 2011. Automatic Labelling of Topic Models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL), volume 1, pages 1536–1545.
C. D. Manning, P. Raghavan, and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), volume 2, pages 3111–3119.
G. A. Miller. 1995. WordNet: A Lexical Database for English. Communications of the ACM 38(11):39–41.
K. A. Nguyen, M. Köper, S. Schulte im Walde, and N. T. Vu. 2017. Hierarchical Embeddings for Hypernymy Detection and Directionality. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 233–243.
J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 14, pages 1532–1543.
S. Rose, D. Engel, N. Cramer, and W. Cowley. 2010. Automatic Keyword Extraction from Individual Documents. In Text Mining: Applications and Theory, Wiley-Blackwell, chapter 1, pages 1–20.
J. Wang, C. Kang, Y. Chang, and J. Han. 2014. A Hierarchical Dirichlet Model for Taxonomy Expansion for Search Engines. In Proceedings of the 23rd International Conference on World Wide Web (WWW), pages 961–970.
Z. Yu, H. Wang, X. Lin, and M. Wang. 2015. Learning Term Embeddings for Hypernymy Identification. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), pages 1390–1397.

A Comparative Study of Embedding Models in Predicting the Compositionality of Multiword Expressions

Navnita Nandakumar, Bahar Salehi, Timothy Baldwin
School of Computing and Information Systems
The University of Melbourne
Victoria 3010, Australia
[email protected] {salehi.b,tbaldwin}@unimelb.edu.au

Abstract

In this paper, we perform a comparative evaluation of off-the-shelf embedding models over the task of compositionality prediction of multiword expressions ("MWEs"). Our experimental results suggest that character- and document-level models do capture some aspects of MWE compositionality and are effective at modelling varying levels of compositionality, but ultimately are not as effective as a simple word2vec baseline. However, they have the advantage over word-level models that they do not require token-level identification of MWEs in the training corpus.
1 Introduction

In recent years, the study of the semantic idiomaticity of multiword expressions ("MWEs": Baldwin and Kim (2010)) has focused on compositionality prediction, a regression task involving the mapping of an MWE onto a continuous scale, representing its compositionality either as a whole or for each of its component words (Reddy et al., 2011; Ramisch et al., 2016; Cordeiro et al., to appear). In the case of couch potato "an idler who spends much time on a couch (usually watching television)", e.g., on a scale of [0, 1] the overall compositionality may be judged to be 0.3, and the compositionality of couch and potato as 0.8 and 0.1, respectively. The main motivation for the study of compositionality is to better understand the semantics of the compound and the semantic relationships between the component words of the MWE, which has applications in various information retrieval and natural language processing tasks (Venkatapathy and Joshi, 2006; Acosta et al., 2011; Salehi et al., 2015b).

Separately, there has been burgeoning interest in learning distributed representations of words and their meanings, starting out with word embeddings (Mikolov et al., 2013; Pennington et al., 2014) and now also involving the study of character- and document-level models (Baroni et al., 2014; Le and Mikolov, 2014; Bojanowski et al., 2017; Conneau et al., 2017). This work has been applied in part to predicting the compositionality of MWEs (Salehi et al., 2015a; Hakimi Parizi and Cook, 2018), work that this paper builds on directly, in performing a comparative study of the performance of a range of off-the-shelf representation learning methods over the task of MWE compositionality prediction.

Our contributions are as follows: (1) we show that, despite their effectiveness over a range of other tasks, recent off-the-shelf character- and document-level embedding learning methods are inferior to simple word2vec at modelling MWE compositionality; and (2) we demonstrate the utility of using paraphrase data in addition to simple lemmas in predicting MWE compositionality.

Navnita Nandakumar, Bahar Salehi and Timothy Baldwin. 2018. A Comparative Study of Embedding Models in Predicting the Compositionality of Multiword Expressions. In Proceedings of Australasian Language Technology Association Workshop, pages 71−76.
2 Related work

The current state-of-the-art in compositionality prediction involves the use of word embeddings (Salehi et al., 2015a). The vector representations of each component word (e.g. couch and potato) and the overall MWE (e.g. couch potato) are taken as a proxy for their respective meanings, and the compositionality of the MWE is then assumed to be proportional to the relative similarity between each of the components and the overall MWE embedding. However, word-level embeddings require token-level identification of each MWE in the training corpus, meaning that if the set of MWEs changes, the model needs to be retrained. This limitation led to research on character-level models, since character-level models can implicitly handle an unbounded vocabulary of component words and MWEs (Hakimi Parizi and Cook, 2018). There has also been work on the extension of word embeddings to document embeddings that map entire sentences or documents to vectors (Le and Mikolov, 2014; Conneau et al., 2017).

3 Embedding Methods

We use two character-level embedding models (fastText and ELMo) and two document-level models (doc2vec and infersent) to compare with word-level word2vec, as used in the state-of-the-art method of Salehi et al. (2015a). In each case, we use canonical pre-trained models, with the exception of word2vec, which must be trained over data with appropriate tokenisation to be able to generate MWE embeddings, as it treats words atomically and cannot generate embeddings for OOV words.
3.1 Word-level Embeddings

Word embeddings are mappings of words to vectors of real numbers. This helps create a more compact (by means of dimensionality reduction) and expressive (by means of contextual similarity) word representation.

word2vec: We trained word2vec (Mikolov et al., 2013) over the latest English Wikipedia dump. [1] We first pre-processed the corpus, removing XML formatting, stop words and punctuation, to generate clean, plain text. We then iterated through 1% of the corpus (following Hakimi Parizi and Cook (2018)) to find every occurrence of each MWE in our datasets and concatenate them, assuming every occurrence of the component words in sequence to be the compound noun (e.g. every couch potato in the corpus becomes couchpotato). We do this because, owing to the space between them, word2vec otherwise generates separate embeddings for each of the component words instead of a single embedding for the MWE. If the model still fails to generate embeddings for either the MWE or its components (due to data sparseness), we assign the MWE a default compositionality score of 0.5 (neutral). In the case of paraphrases, we compute the element-wise average of the embeddings of each of the component words to generate the embedding of the phrase.

[1] Dated 02-Oct-2018, 07:23
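The corpus manipulation described for word2vec — rewriting each occurrence of an MWE's component words in sequence as a fused token — can be sketched as follows (illustrative only; the function name is ours and the original authors' tokenisation details may differ):

```python
import re

def fuse_mwes(text, mwes):
    """Rewrite every whitespace-separated occurrence of each MWE as a
    single fused token, e.g. 'couch potato' -> 'couchpotato'."""
    for mwe in mwes:
        # Match the component words in sequence, on word boundaries.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in mwe.split()) + r"\b"
        text = re.sub(pattern, mwe.replace(" ", ""), text)
    return text
```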
3.2 Character-level Embeddings

In a character embedding model, the vector for a word is constructed from the character n-grams that compose it. Since character n-grams are shared across words, assuming a closed-world alphabet, [2] these models can generate embeddings for OOV words, as well as words that occur infrequently. The two character-level embedding models we experiment with are fastText (Bojanowski et al., 2017) and ELMo (Peters et al., 2018), as detailed below.

fastText: We used the 300-dimensional model pre-trained on Common Crawl and Wikipedia using CBOW. fastText assumes that all words are whitespace delimited, so in order to generate a representation for the combined MWE, we remove any spaces and treat it as a fused compound (e.g. couch potato becomes couchpotato). In the case of paraphrases, we use the same word averaging technique as we did in word2vec.

ELMo: We used the ElmoEmbedder class in Python's allennlp library. [3] The model was pre-trained over SNLI and SQuAD, with a dimensionality of 1024. Note that the primary use case of ELMo is to generate embeddings in context, but we are not providing any context in the input, for consistency with the other models. As such, we are knowingly not harnessing the full potential of the model. However, this naive use of ELMo is not inappropriate, as the relative compositionality of a compound is often predictable from its component words only, even for novel compounds such as giraffe potato (which has a plausible compositional interpretation, as a potato shaped like a giraffe) vs. couch intelligence (where there is no natural interpretation, suggesting that it may be non-compositional).

[2] This is a safe assumption for languages with small-scale alphabetic writing systems such as English, but potentially problematic for languages with large orthographies such as Chinese (with over 10k ideograms in common use, and many more rarer characters) or Korean (assuming we treat each Hangul syllable as atomic).
[3] options file = https://bit.ly/2CInZPV, weight file = https://bit.ly/2PvNqHh

3.3 Document-level Embeddings

Document-level embeddings aim to learn vector representations of documents (sentences or even paragraphs), generating a representation of the overall content in the form of a fixed-dimensionality vector. The two document-level embeddings used in this research are doc2vec (Le and Mikolov, 2014) and infersent (Conneau et al., 2017), as detailed below.
doc2vec: We used the gensim implementation of doc2vec (Lau and Baldwin, 2016; Řehůřek and Sojka, 2010), pre-trained on Wikipedia data using the word2vec skip-gram models pre-trained on Wikipedia and AP News. [4]

infersent: We used two versions of infersent of 300 dimensions, using the inbuilt infersent.build_vocab_k_words function to train the model over the 100,000 most popular English words, using: (1) GloVe (Pennington et al., 2014) word embeddings ("infersentGloVe"); and (2) fastText word embeddings ("infersentfastText").

[4] https://github.com/jhlau/doc2vec/blob/master/README.md

4 Modelling Compositionality

In order to measure the overall compositionality of an MWE, we propose the following three broad approaches.

4.1 Direct Composition

Our first approach is to directly compare the embeddings of each of the component nouns with the embedding of the MWE via cosine similarity, in one of two ways: (1) pre-combine the embeddings for the component words via element-wise sum, and compare with the embedding for the MWE ("Direct_pre"); and (2) compare each individual component word with the embedding for the MWE, and post-hoc combine the scores via a weighted sum ("Direct_post").
Formally:

Direct_pre = cos(mwe, mwe_1 + mwe_2)
Direct_post = α cos(mwe, mwe_1) + (1 − α) cos(mwe, mwe_2)

where mwe, mwe_1, and mwe_2 are the embeddings for the combined MWE, first component and second component, respectively; [5] mwe_1 + mwe_2 is the element-wise sum of the vectors of each of the component words of the MWE; and α ∈ [0, 1] is a scalar which allows us to vary the weight of the respective components in predicting the compositionality of the compound. The intuition behind both of these methods is that if the MWE appears in similar contexts to its components, then it is compositional.

[5] Noting that all MWEs are binary in our experiments, but equally that the methods generalise trivially to larger MWEs.

4.2 Paraphrases

Our second approach is to calculate the similarity of the MWE embedding with that of its paraphrases, assuming that we have access to paraphrase data. [6] We achieve this using the following three formulae:

Para_first = cos(mwe, para_1)
Para_all_pre = cos(mwe, Σ_i para_i)
Para_all_post = (1/N) Σ_{i=1}^{N} cos(mwe, para_i)

where para_1 and para_i denote the embedding for the first (most popular) and i-th paraphrase, respectively. We apply this method to RAMISCH only, since REDDY does not have any paraphrase data (see Section 5.1 for details).

[6] Each paraphrase shows an interpretation of the compound semantics, e.g. olive oil is "oil from olive".

4.3 Combination

Our final approach ("Combined") is based on the combination of the direct composition and paraphrase methods, as follows:

Combined = β max(Direct_pre, Direct_post) + (1 − β) max(Para_first, Para_all_pre, Para_all_post)

where β ∈ [0, 1] is a scalar weighting factor to balance the effects of the two methods. The choice of the max operator to combine the sub-methods for each of the direct composition and paraphrase methods is motivated by the fact that all methods tend to underestimate the compositionality (and, empirically, it was superior to taking the mean).
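The formulae of Sections 4.1–4.3 can be put together as below. This is a sketch under our own naming, assuming the embeddings are already available as numpy vectors:

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def direct_pre(mwe, c1, c2):
    # Compare the MWE with the element-wise sum of its components.
    return cos(mwe, c1 + c2)

def direct_post(mwe, c1, c2, alpha):
    # Weighted sum of per-component similarities.
    return alpha * cos(mwe, c1) + (1 - alpha) * cos(mwe, c2)

def para_scores(mwe, paras):
    """Para_first, Para_all_pre and Para_all_post, for paraphrase
    embeddings ordered by decreasing popularity."""
    first = cos(mwe, paras[0])
    all_pre = cos(mwe, np.sum(paras, axis=0))
    all_post = sum(cos(mwe, p) for p in paras) / len(paras)
    return first, all_pre, all_post

def combined(mwe, c1, c2, paras, alpha, beta):
    direct = max(direct_pre(mwe, c1, c2), direct_post(mwe, c1, c2, alpha))
    return beta * direct + (1 - beta) * max(para_scores(mwe, paras))
```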
5 Experiments

5.1 Datasets

We evaluate the models on the following two datasets, each comprised of 90 English binary noun compounds rated for compositionality on a scale of 0 (non-compositional) to 5 (compositional). In each case, we evaluate model performance via the Pearson correlation coefficient (r).

REDDY: This dataset contains scores for the compositionality of the overall MWE, as well as that of each component word (Reddy et al., 2011); in this research, we use the overall compositionality score of the MWE only, and ignore the component scores.

RAMISCH: Similarly to REDDY, this dataset contains scores for the overall compositionality of the MWE as well as the relative compositionality of each of its component words, in addition to paraphrases suggested by the annotators, in decreasing order of popularity (Ramisch et al., 2016); in this research, we use the overall compositionality score and paraphrase data only.

5.2 Results and Discussion

The results of the experiments on REDDY and RAMISCH are presented in Tables 1 and 2, respectively. In this work, we simplistically present the results for the best α and β values for each method over a given dataset, meaning we are effectively peeking at our test data. Sensitivity of the α hyper-parameter is shown in Figures 1 and 2, for the REDDY and RAMISCH datasets, respectively.

Emb. method        Direct_pre  Direct_post
word2vec           0.684       0.710 (α = 0.3)
fastText           0.223       0.285 (α = 0.3)
ELMo               0.056       0.399 (α = 0.0)
doc2vec            −0.049      0.025 (α = 0.0)
infersentGloVe     0.413       0.500 (α = 0.5)
infersentfastText  0.557       0.610 (α = 0.5)

Table 1: Pearson correlation coefficient for compositionality prediction results on the REDDY dataset.

Emb. method        Direct_pre  Direct_post      Para_first  Para_all_pre  Para_all_post  Combined
word2vec           0.667       0.731 (α = 0.7)  0.714       0.822         0.880          0.880 (β = 0.0)
fastText           0.395       0.446 (α = 0.7)  0.569       0.662         0.704          0.704 (β = 0.0)
ELMo               0.139       0.295 (α = 0.0)  0.367       0.642         0.664          0.669 (β = 0.2)
doc2vec            −0.146      0.048 (α = 1.0)  0.405       0.372         0.401          0.419 (β = 0.3)
infersentGloVe     0.321       0.427 (α = 0.7)  0.639       0.704         0.741          0.774 (β = 0.5)
infersentfastText  0.274       0.380 (α = 0.8)  0.615       0.781         0.783          0.783 (β = 0.0)

Table 2: Pearson correlation coefficient for compositionality prediction results on the RAMISCH dataset.

Figure 1: Sensitivity analysis of α (REDDY).
Figure 2: Sensitivity analysis of α (RAMISCH).

The first observation to be made is that none of the pre-trained models match the state-of-the-art method based on word2vec, despite the simplicity of the method.
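Pearson's r between predicted scores and human compositionality ratings can be computed as below (a minimal sketch; a real evaluation would run over the full 90-compound lists of each dataset):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```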
ELMo and doc2vec in particular perform worse than expected, suggesting that their ability to model non-compositional language is limited. Recall, however, our comment about using ELMo naively, in not including any context when generating the embeddings for the component words and, more importantly, the overall MWE. The results show that doc2vec performs better when representing paraphrases, and struggles with compounds without sentential context.

In Table 1, we find Direct_post to produce a higher correlation in all cases, with α ranging from 0.0 to 0.5, suggesting that the second element (= head) contributes more to the overall compositionality of the MWE than the first element (= modifier); this is borne out in Figure 1. In Table 2, on the other hand, we find that, with the exception of ELMo, the α values favour the modifier of the MWE over the head (i.e. α > 0.5; also seen in Figure 2), implying that the modifier is more significant in predicting the compositionality of the MWE. The reason for the mismatch between the two datasets is not immediately clear, other than the obvious data sparsity.

We also see that the paraphrases achieve a higher correlation across all models, suggesting this is a promising direction for future study. The low β values for Combined also confirm that the paraphrase methods have greater predictive power than the direct composition methods. Among the paraphrase experiments, we find that Para_all_post, the average of the similarities of the MWE with each of its paraphrases, consistently achieves the best results. We hypothesize that the paraphrases provide additional information regarding the compounds that further helps determine their compositionality.

6 Conclusions and Future Work

This paper has investigated the application of a range of embedding generation methods to the task of predicting the compositionality of an MWE, either directly based on the MWE and its component words, or indirectly based on paraphrase data for the MWE. Our results show that modern character- and document-level embedding models are inferior to the simple word2vec approach at the task. We also show that paraphrase data captures valuable information regarding the compositionality of the MWE.

Since we have achieved such promising results with the paraphrase data, it might be interesting to consider other possible settings in future tests. While none of the other approaches could outperform word2vec, it is useful to note that they were pre-trained and, as such, did not require any manipulation of the training corpus in order to generate vector embeddings of the MWEs. This means they can be applied to new datasets without the need for retraining and are, therefore, more robust.

In future work, we intend to train the models used in our study on a fixed corpus, to compare their performance in a more controlled setting. We will also do proper tuning of the hyper-parameters over held-out data, and plan to experiment with other languages.
task of predicting the compositionality of an MWE, either directly based on the MWE and its Silvio Cordeiro, Aline Villavicencio, Marco Idiart, and Carlos Ramisch. to appear. Unsupervised composi- component words, or indirectly based on para- tionality prediction of nominal compounds. Compu- phrase data for the MWE. Our results show that tational Linguistics . modern character- and document-level embedding Ali Hakimi Parizi and Paul Cook. 2018. Do character- models are inferior to the simple word2vec ap- level neural network language models capture proach at the task. We also show that paraphrase knowledge of multiword expression compositional- data captures valuable data regarding the compo- ity? In Proceedings of the Joint Workshop on Lin- sitionality of the MWE. guistic Annotation, Multiword Expressions and Con- structions (LAW-MWE-CxG-2018). pages 185–192. Since we have achieved such promising results with the paraphrase data, it might be interesting Jey Han Lau and Timothy Baldwin. 2016. An empiri- cal evaluation of doc2vec with practical insights into to consider other possible settings in future tests. document embedding generation. In Proceedings While none of the other approaches could outper- of the 1st Workshop on Representation Learning for form word2vec, it is useful to note that they were NLP. pages 78–86. 75 Quoc Le and Tomas Mikolov. 2014. Distributed repre- sentations of sentences and documents. In Proceed- ings of the 31st International Conference on Ma- chine Learning (ICML 2014). Beijing, China, pages 1188–1196. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word represen- tations in vector space. In Proceedings of Workshop at the International Conference on Learning Repre- sentations. Jeffrey Pennington, Richard Socher, and Christo- pher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Nat- ural Language Processing (EMNLP). pages 1532– 1543. 
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word rep- resentations. In Proceedings of the 2018 Confer- ence of the North American Chapter of the Associ- ation for Computational Linguistics: Human Lan- guage Technologies, Volume 1 (Long Papers). pages 2227–2237. Carlos Ramisch, Silvio Cordeiro, Leonardo Zilio, Marco Idiart, and Aline Villavicencio. 2016. How naked is the naked truth? a multilingual lexicon of nominal compound compositionality. In Proceed- ings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Pa- pers). pages 156–161. Siva Reddy, Diana McCarthy, and Suresh Manandhar. 2011. An empirical study on compositionality in compound nouns. In Proceedings of 5th Interna- tional Joint Conference on Natural Language Pro- cessing. pages 210–218. Radim Rehˇ u˚rekˇ and Petr Sojka. 2010. Software frame- work for topic modelling with large corpora. In Pro- ceedings of the LREC 2010 Workshop on New Chal- lenges for NLP Frameworks. pages 45–50. Bahar Salehi, Paul Cook, and Timothy Baldwin. 2015a. A word embedding approach to predicting the com- positionality of multiword expressions. In Proceed- ings of the 2015 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 977–983. Bahar Salehi, Nitika Mathur, Paul Cook, and Timothy Baldwin. 2015b. The impact of multiword expres- sion compositionality on machine translation eval- uation. In Proceedings of the NAACL HLT 2015 Workshop on Multiword Expressions. Denver, USA, pages 54–59. Sriram Venkatapathy and Aravind K Joshi. 2006. Us- ing information about multi-word expressions for the word-alignment task. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties. pages 20–27. 
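As a concrete illustration of the direct composition approach evaluated above: compositionality is scored as the similarity between the MWE's own embedding and an α-weighted mix of its component embeddings. This is a minimal numpy sketch, not the authors' code; the function names and toy vectors are invented, and in practice the vectors would come from a pretrained model such as word2vec.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def direct_composition_score(mwe_vec, mod_vec, head_vec, alpha):
    """Predicted compositionality: similarity between the MWE's own
    embedding and an alpha-weighted combination of its components
    (alpha on the modifier, 1 - alpha on the head)."""
    return cosine(mwe_vec, alpha * mod_vec + (1 - alpha) * head_vec)

# Toy vectors: a compositional compound lies near the mix of its
# components, while an opaque one points away from them.
mod = np.array([1.0, 0.0, 0.2])
head = np.array([0.1, 1.0, 0.0])
compositional = 0.5 * mod + 0.5 * head
opaque = np.array([-0.3, -0.2, 1.0])

print(direct_composition_score(compositional, mod, head, 0.5))  # close to 1
print(direct_composition_score(opaque, mod, head, 0.5))         # negative
```

Sweeping α over [0, 1] and keeping the value that maximises correlation with gold compositionality judgements is what makes the reported α settings a form of peeking at the test data, as noted in the discussion.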
Towards Efficient Machine Translation Evaluation by Modelling Annotators

Nitika Mathur  Timothy Baldwin  Trevor Cohn
School of Computing and Information Systems
The University of Melbourne
Victoria 3010, Australia
[email protected]
{tbaldwin,tcohn}@unimelb.edu.au

Nitika Mathur, Timothy Baldwin and Trevor Cohn. 2018. Towards Efficient Machine Translation Evaluation by Modelling Annotators. In Proceedings of Australasian Language Technology Association Workshop, pages 77−82.

Abstract

Current machine translation evaluations use Direct Assessment, based on crowdsourced judgements from a large pool of workers, along with quality control checks, and a robust method for combining redundant judgements. In this paper we show that the quality control mechanism is overly conservative, increasing the time and expense of the evaluation. We propose a model that does not filter workers, and takes into account varying annotator reliabilities. Our model effectively weights each worker's scores based on the inferred precision of the worker, and is much more reliable than the mean of either the raw or standardised scores.

1 Introduction

Accurate evaluation is critical for measuring progress in machine translation (MT). Despite progress over the years, automatic metrics are still biased, and human evaluation is still a fundamental requirement for reliable evaluation. The process of collecting human annotations is time-consuming and expensive, and the data is always noisy. The question of how to efficiently collect this data has evolved over the years, but there is still scope for improvement. Furthermore, once the data has been collected, there is no consensus on the best way to reason about translation quality.

Direct Assessment ("DA": Graham et al. (2017)) is currently accepted as the best practice for human evaluation, and is the official method at the Conference for Machine Translation (Bojar et al., 2017a). Every annotator scores a set of translation pairs, which includes quality control items designed to filter out unreliable workers.

However, the quality control process has low recall for good workers: as demonstrated in Section 3, about one third of good data is discarded, increasing expense. Once good workers are identified, their outputs are simply averaged to produce the final 'true' score, despite their varying accuracy.

In this paper, we provide a detailed analysis of these shortcomings of DA and propose a Bayesian model to address these issues. Instead of standardising individual worker scores, our model can automatically infer worker offsets using the raw scores of all workers as input. In addition, by learning a worker-specific precision, each worker effectively has a differing magnitude of vote in the ensemble. When evaluated on the WMT 2016 Tr-En dataset, which has a high proportion of unskilled annotators, these models are more efficient than the mean of the standardised scores.

2 Background

The Conference on Machine Translation (WMT) annually collects human judgements to evaluate the MT systems and metrics submitted to the shared tasks. The evaluation methodology has evolved over the years, from 5-point adequacy and fluency rating, to relative rankings ("RR"), to DA. With RR, annotators are asked to rank translations of 5 different MT systems. In earlier years, the final score of a system was the expected number of times its translations score better than translations by other systems (expected wins). Bayesian models like Hopkins and May (Hopkins and May, 2013) and Trueskill (Sakaguchi et al., 2014) were then proposed to learn the relative ability of the MT systems. Trueskill was adopted by WMT in 2015 as it is more stable and efficient than the expected wins heuristic.

DA was trialled at WMT 2016 (Bojar et al., 2016a), and has replaced RR since 2017 (Bojar et al., 2017a). It is more scalable than RR as the number of systems increases (we need to obtain one annotation per system, instead of one annotation per system pair). Each translation is rated independently, minimising the risk of being influenced by the relative quality of other translations. Ideally, evaluations can also be compared across multiple datasets. For example, we can track the progress of MT systems for a given language pair over the years.

Another probabilistic model, EASL (Sakaguchi and Van Durme, 2018), has been proposed that combines some advantages of DA with Trueskill. Annotators score translations from 5 systems at the same time on a sliding scale, allowing users to explicitly specify the magnitude of difference between system translations. Active learning is used to select the systems in each comparison to increase efficiency. However, EASL does not model worker reliability, and is very likely not compatible with longitudinal evaluation, as the systems are effectively scored relative to each other.

In NLP, most other research on learning annotator bias and reliability has been on categorical data (Snow et al., 2008; Carpenter, 2008; Hovy et al., 2013; Passonneau and Carpenter, 2014).

3 Direct Assessment

To measure adequacy, in DA, annotators are asked to rate how adequately an MT output expresses the meaning of a reference translation using a continuous slider, which maps to an underlying scale of 0–100. These annotations are crowdsourced using Amazon Mechanical Turk, where "workers" complete "Human Intelligence Tasks" (HITs) in the form of one or more micro-tasks.

Each HIT consists of 70 MT system translations, along with an additional 30 control items:
1. degraded versions of 10 of these translations;
2. 10 reference translations by a human expert, corresponding to 10 system translations; and
3. repeats of another 10 translations.

The scores on the quality control items are used to filter out workers who either click randomly or on the same score continuously. A conscientious worker would give a near-perfect score to reference translations, give a lower score to degraded translations when compared to the corresponding MT system translation, and be consistent with scores for repeat translations.

The paired Wilcoxon rank-sum test is used to test whether the worker scored degraded translations worse than the corresponding system translations; the (arbitrary but customary) cutoff of p < 0.05 is used to determine good workers. The remaining workers are further tested to check that there is no significant difference between their scores for repeat pairs.

Worker scores are manually examined to filter out workers who obviously gave the same score to all translations, or scored translations at random. Only these workers are rejected payment. Thus, other workers who do not pass the quality control check are paid for their efforts, but their scores are unused, increasing the overall cost.

Some workers might have high standards and give consistently low scores for all translations, while others are more lenient. And some workers may only use the central part of the scale. Standardising individual workers' scores makes them more comparable, and reduces noise before calculating the mean.

The final score of an MT system is the mean standardised score of its translations after discarding scores that do not meet quality control criteria. The noise in worker scores is cancelled out when a large number of translations are averaged.

To obtain accurate scores of individual translations, multiple judgements are collected and averaged. As we increase the number of annotators per translation, there is greater consistency and reliability in the mean score. This was empirically tested by showing that there is high correlation between the mean of two independent sets of judgements, when the sample size is greater than 15 (Graham et al., 2015).

However, both these tests are based on a sample size of 10 items, and, as such, the first test has low power; we show that it filters out a large proportion of the total workers. One solution would be to increase the sample size of the degraded-reference pairs, but this would be at the expense of the number of useful worker annotations. It is better to come up with a model that would use the scores of all workers, and is more robust to low-quality scores.

Automatic metrics such as BLEU (Papineni et al., 2002) are generally evaluated using the Pearson correlation with the mean standardised score of the good workers. We similarly evaluate a worker's accuracy using the Pearson correlation of the worker's scores with this ground truth. Over all the data collected for WMT16, the group of good workers are, on average, more accurate than the group of workers who failed the significance test. However, as seen in Figure 1a, there is substantial overlap in the accuracies of the two groups. We can see that very few inaccurate workers were included. However, about a third of the total workers whose scores have a correlation greater than 0.6 were not approved. In particular, over the Tr-En dataset, the significance test was not very effective, as seen in Figure 1b.

[Figure 1: Accuracy of "good" vs "bad" workers in the WMT 2016 dataset; (a) all language pairs, (b) Tr-En language pair.]

Workers whose scores pass the quality control check are given equal weight, despite the variation in their reliability. Given that quality control is not always reliable (as with the Tr-En dataset, e.g.), this could include workers with scores as low as r = 0.2 correlation with the ground truth.

While worker standardisation succeeds in increasing inter-annotator consistency, this process discards information about the absolute quality of the translations in the evaluation set. When using the mean of standardised scores, we cannot compare MT systems across independent evaluations. In the evaluation of the WMT 17 Neural MT Training Task, the baseline system trained on 4GB GPU memory was evaluated separately from the baseline trained on 8GB GPU memory and the other submissions. In this setup of manual evaluation, Baseline-4GB scores slightly higher than Baseline-8GB when using raw scores, which is possibly due to chance. However, it scores significantly higher when using standardised scores, which goes against our expectations (Bojar et al., 2017b).

4 Models

We use a simple model, assuming that a worker score is normally distributed around the true quality of the translation. Each worker has a precision parameter τ that models their accuracy: workers with high τ are more accurate. In addition, we include a worker-specific offset β, which models their deviation from the true score.

For each translation i ∈ T, we draw the true quality µi from the standard normal distribution.¹ Then for each worker j ∈ W, we draw their accuracy τj from a gamma distribution with shape parameter k and rate parameter θ.² The offset βj is again drawn from the standard normal distribution. The worker's score rij is drawn from a normal distribution, with mean µi + βj and precision τj:

    rij = N(µi + βj, τj⁻¹)    (1)

¹ We first standardise scores (across all workers together) in the dataset.
² We use k = 2 and θ = 1 based on manual inspection of the distribution of worker precisions on a development dataset (WMT18 Cs-En).

To help the model, we add constraints on the quality control items: the true quality of the degraded translation is lower than the quality of the corresponding system translation. In addition, the true quality of the repeat items should be approximately equal.

We expect that the model will learn a high τ for good-quality workers, and give their scores higher weight when estimating the mean. We believe that the additional constraints will help the model to infer the worker precision.

[Figure 2: The proposed model, where worker j ∈ W has offset βj and precision τj, translation i ∈ T has quality µi, and worker j scores translation i with rij.]

[Figure 3: Pearson's r of the estimated true score with the "ground truth" as we increase the number of workers per translation.]

DA can be viewed as the Maximum Likelihood Estimate of this model, with the following substitutions in Equation (1): sij is the standardised score of worker j, βj is 0 for all workers, and τ is constant for all workers.
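The generative story behind Equation (1) can be made concrete with a small simulation. This is an illustrative numpy sketch under invented parameters, not the authors' inference code: the "weighted" estimate below assumes the worker offsets and precisions are known exactly (an oracle), whereas the paper infers them from the raw scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_workers = 200, 10

mu = rng.normal(size=n_items)                 # true translation quality
beta = rng.normal(scale=0.5, size=n_workers)  # per-worker offset
tau = np.array([4.0] * 5 + [0.05] * 5)        # 5 precise workers, 5 very noisy ones

# Worker scores following Equation (1): r_ij ~ N(mu_i + beta_j, 1/tau_j)
noise = rng.normal(size=(n_items, n_workers)) / np.sqrt(tau)
r = mu[:, None] + beta[None, :] + noise

# Baseline: plain mean of raw scores per translation
plain = r.mean(axis=1)

# Oracle estimate: remove offsets, weight each worker's score by tau_j
weighted = ((r - beta) * tau).sum(axis=1) / tau.sum()

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

print(corr(plain, mu), corr(weighted, mu))  # weighted tracks mu more closely
```

The point of the sketch is the weighting rule: a worker's effective vote is proportional to their precision τj, so a handful of very noisy annotators barely perturbs the estimate, instead of being either averaged in at full weight or discarded outright.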