Etytree: A Graphical and Interactive Etymology Dictionary Based on Wiktionary
Ester Pantaleo, Wikimedia Foundation grantee, Italy
Vito Walter Anelli, Politecnico di Bari, Italy
Tommaso Di Noia, Politecnico di Bari, Italy
Gilles Sérasset, Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France

ABSTRACT

We present etytree (from etymology + family tree): a new on-line multilingual tool to extract and visualize etymological relationships between words from the English Wiktionary. A first version of etytree is available at http://tools.wmflabs.org/etytree/.

With etytree users can search for a word and interactively explore etymologically related words (ancestors, descendants, cognates) in many languages using a graphical interface. The data is synchronised with the English Wiktionary dump at every new release, and can be queried via SPARQL from a Virtuoso endpoint.

Etytree is the first graphical etymology dictionary, which could be used to search specific etymological definitions as well as to discover new relations among words. Moreover, it can be effectively adopted by Wiktionary editors to identify inconsistencies or missing information in the data.

Keywords

etymology; Wiktionary; natural language processing; d3.js

© 2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License.
WWW'17 Companion, April 3–7, 2017, Perth, Australia.
ACM 978-1-4503-4913-0/17/04.
http://dx.doi.org/10.1145/3038912.3038914

1. INTRODUCTION

Etytree is a new tool to extract and visualize etymological relationships between lexemes (or words, for simplicity) using data coming from the English Wiktionary. The interest of this tool lies in its potential for Wiktionary users, editors, and researchers, or more generally people interested in languages and etymologies.

It is built on top of DBnary [1], which extracts word Definitions, Parts of Speech, Synonyms, and other lexical information from Wiktionary pages. Etytree extends DBnary with a new method¹ that parses Etymology, Derived terms, and Descendants sections, the namespace for Reconstructed Terms, and the etymtree template in Wiktionary.

With etytree, an RDF (Resource Description Framework) lexical database of etymological relationships, collecting all the extracted relationships and the lexical data attached to lexemes, has also been released. The database consists of triples, i.e., data entities composed of a subject, a predicate, and an object. For example, a triple may have a lexeme as subject, a lexeme as object, and "derivesFrom" or "etymologicallyEquivalentTo" as predicate. The RDF database has been exposed via a SPARQL endpoint and can be queried at http://etytree-virtuoso.wmflabs.org/sparql.

Etytree provides a graphical interface to the database: an intuitive and multilingual graphical etymology dictionary. The graphical etymology dictionary represents the extracted etymological relationships and the associated lexical information using graphs and tooltips, respectively. It uses d3.js², a JavaScript library for manipulating documents based on data, and infers the tree structure from the RDF database on the fly through specific queries to the Virtuoso³ SPARQL endpoint.

With etytree, users can discover new words when they search for a specific etymological definition, e.g., they can discover words that derive from the same ancestral word, both in their own language and in other languages. This happens in an intuitive way, without having to read fairly long and complex sentences that describe etymological relationships between words and without the need to navigate across multiple Wiktionary pages. Moreover, with the visualization of the etymological tree, editors can easily spot inconsistencies between etymological relationships described across multiple Wiktionary pages. Finally, researchers can use the database of etymological relationships to study etymologies on a large scale. Potentially, they could extend the database of etymological relationships to include semantics or pronunciations, to study how they evolved through time across etymological trees and across languages.

A project similar to etytree is Etymological Wordnet⁴ [2, 3] which is, unfortunately, neither publicly available nor maintained anymore.

¹ The extended version of DBnary is available at https://bitbucket.org/esterpantaleo/dbnary_etymology
² https://d3js.org/
³ https://virtuoso.openlinksw.com/
⁴ www1.icsi.berkeley.edu/~demelo/etymwn/

Figure 1: A screenshot of the interactive visualization produced by etytree for the English word "gorgeous". From the graph it is possible to see that the English words "gorgeous", "disgorge", "gorget", and archaic words "gorge" and "engorge" are etymologically related.

2. THE MODEL

The etytree extraction tool uses regular expressions and parsing of both Wiktionary templates and links. It assumes a standard structure for the different sections containing etymologies, i.e., the Etymology section, the Derived terms section, the Descendants section, the namespace with Reconstructed Terms (still in the works), and the etymtree template⁵.

⁵ See https://en.wiktionary.org/wiki/Template:etymtree

2.1 Etymology sections

Figure 2 presents a screenshot of the Etymology section of the English word "gorgeous" in the English Wiktionary. The same section in the xml dump (our data source), as well as in the edit tab of the online English Wiktionary, is:

    ===Etymology===
    From Early Modern English {{m|en|gorgious}}, {{m|en|gorgeouse}}, from {{etyl|frm|en}} {{m|frm|gorgias||elegant, fashionable}}, from {{etyl|fro|en}} {{m|fro|gourgias}}, {{m|fro|gorgias||gorgeous, gaudy, flaunting, gallant, fine}}, of uncertain formation, but apparently connected with {{cog|fro|gorgias||a gorget, ruffle for the neck}}, from {{etyl|fro|en}} {{m|fro|gorge||bosom, throat}}. See {{l|en|gorge}}. Sense evolution was probably that of "swelling of the throat or bosom due to pride, bridling up" to "assume an air of importance, flaunting".

After inspecting many different Etymology sections, we inferred a set of recurrent patterns, which we encoded as regular expressions. The most common pattern is⁶:

    (FROM )?(LANGUAGE LEMMA |LEMMA )(COMMA |DOT |OR )

⁶ where FROM can be any of the following:

Using this pattern plus a set of rules, we extract etymological relationships into an RDF database. In what follows we present some examples of the rules that we use.

If we find a match to the pattern above with DOT or OR in the last group, we ignore all the text following the match. We ignore anything after a dot (DOT) because Etymology sections generally start with a chain of etymological relationships followed by a dot, and then contain some descriptive text that is not easily parsable. We ignore anything following OR (alternative etymologies) because alternative etymologies are not presented in a standard format in the English Wiktionary. We also ignore anything that follows a match to SUPERSEDED BY⁷ or COGNATE TO⁸.

We also use a pattern to match compounds, i.e., sentences like

    {{m|en|door}}+{{m|en|bell}}

or

    Compound of {{m|en|door}} and {{m|en|bell}}

Whenever we find a match to a compound pattern, we ignore everything after the match, as there is no standard for the etymology of compound words.

While the selected patterns generally reflect real patterns correctly (as Etymology sections use very well defined standards⁹), some etymologies are written in non-standard ways, which implies that the corresponding extraction is incorrect (or partially incorrect). We are trying to interact with the community of editors of the English Wiktionary to better understand the standards they use and to encourage the use of more standards, which would reduce both data loss and the rate of incorrectly extracted etymological relationships.

One example of a non-standard Etymology section uses links instead of templates to represent words that are etymologically related (e.g. [[door]] instead of {{m|en|door}}). This is a major problem because in Etymology sections words with links often correspond to descriptive words or glossary terms. For example, the Etymology section of "Davidsen" is:

    ===Etymology===
    Originally a [[patronymic]] from {{suffix|David|sen|lang=da}}.

and clearly "patronymic" here is not etymologically related to "Davidsen". In this particular case, a standard that encourages the use of links to the glossary for words like "patronymic", i.e. [[Appendix:Glossary#patronymic|patronymic]] (and for "ablative", "zero-grade", etc.), in Etymology sections would help automatic data extraction.

Other lexemes that usually have non-standard Etymology sections are phrases. For example, "until the cows come home" has the following Etymology section:

    ===Etymology===
    Possibly from the fact that [[cattle]] let out to pasture may be only expected to return for milking the next morning; thus, for example, a party that goes on "until the cows come home" is a very long one. Alternatively, the phrase may have a Scottish origin,<ref>See, for example, {{cite-web|title=Till the cows come home |url=http://www.phrases.org.uk/meanings/382900.html |archiveurl=https://web.archive.org/web/20160611134612/http://www.phrases.org.uk/meanings/382900.html |archivedate=11 June 2016 |work=Phrase Finder |accessdate=30 March 2013}}.</ref> and may derive from the fact that cattle in the [[w:Scottish Highlands|Highlands]] are put out to graze