Concepticon: a Resource for the Linking of Concept Lists

Concepticon: A Resource for the Linking of Concept Lists Johann-Mattis List1, Michael Cysouw2, Robert Forkel3 1CRLAO/EHESS and AIRE/UPMC, Paris, 2Forschungszentrum Deutscher Sprachatlas, Marburg, 3Max Planck Institute for the Science of Human History, Jena [email protected], [email protected], [email protected] Abstract We present an attempt to link the large amount of different concept lists which are used in the linguistic literature, ranging from Swadesh lists in historical linguistics to naming tests in clinical studies and psycholinguistics. This resource, our Concepticon, links 30 222 concept labels from 160 conceptlists to 2495 concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining different relations between the concepts. The resource can be used for various purposes. Serving as a rich reference for new and existing databases in diachronic and synchronic linguistics, it allows researchers a quick access to studies on semantic change, cross-linguistic polysemies, and semantic associations. Keywords: concepts, concept list, Swadesh list, naming test, word norms, cross-linguistically linked data 1. Introduction in the languages they were working on, or that they turned In 1950, Morris Swadesh (1909 – 1967) proposed the idea out to be not as stable and universal as Swadesh had claimed that certain parts of the lexicon of human languages are uni- (Matisoff, 1978; Alpher and Nash, 1999). Up to today, versal, stable over time, and rather resistant to borrowing. dozens of different concept lists have been compiled for var- As a result, he claimed that this part of the lexicon, which ious purposes. They are used as heuristical tools for the de- was later called basic vocabulary, would be very useful to tection of deep genetic relationships among languages (Dol- address the problem of subgrouping in historical linguistics: gopolsky, 1964), as basic values for traditional lexicosta- tistical and glottochronological studies (Dyen et al., 1992; [...] it is a well known fact that certain types of Starostin, 1991), or as litmus test for dubious cases of lan- morphemes are relatively stable. Pronouns and guage relationship which might be due to inheritance or bor- numerals, for example, are occasionally replaced rowing (McMahon et al., 2005; Chén Bǎoyà 陈保亚, 1996; either by other forms from the same language or Wang and Wang, 2004). by borrowed elements, but such replacement is Apart from concept lists proposed for the application in rare. The same is more or less true of other ev- historical linguistics, there is a large amount of not explic- eryday expressions connected with concepts and itly diachronic data, including concept lists serving as the experiences common to all human groups or to basis for field work in specific linguistic areas (Kraft, 1981), the groups living in a given part of the world dur- concept lists which serve as the basis for large surveys on ing a given epoch. (Swadesh, 1950, 157) specific linguistic phenomena (Haspelmath and Tadmor, He illustrated this by proposing a first list of basic concepts, 2009), or concept lists which deal with the internal structur- which was, in fact, nothing else than a collection of concept ing of concepts, be it cognitive associations (Nelson et al., labels, as shown below:1 2004; Hill et al., 2014), cross-linguistic polysemies (List et al., 2014), or frequently recurring semantic shifts (Bulakh I, thou, he, we, ye, one, two, three, four, five, et al., 2013). Concept lists play also an important role in six, seven, eight, nine, ten, hundred, all, ani- education, where they are used to measure and aid learn- mal, ashes, back, bad, bark, belly, big, [...] this, ers’ progress (Dolch, 1936), in psycholinguistics, where dif- tongue, tooth, tree, warm, water, what, where, ferent kinds of word norm data, like frequency and con- white, who, wife, wind, woman, year, yellow. creteness, are needed to control for variables in experiments (Swadesh, 1950, 161) (Wilson, 1988), and in public health studies, where stan- dardized naming tests are used to assess the degree of apha- In the following years, Swadesh refined his original concept sia or language disturbance (Nicholas et al., 1989; Ardila, lists of basic vocabulary items, thereby reducing the origi- 2007). nal test list of 215 items first to 200 (Swadesh, 1952) and then to 100 items (Swadesh, 1955). Scholars working on Given the multitude of concept data that has been pub- different language families and different datasets provided lished in the past, it is surprising that no real attempt has further modifications, be it that the concepts which Swadesh been carried out to provide reliable standards which would had proposed were lacking proper translational equivalents help scholars to compare concepts across resources or to define consistently which concepts they would use in a given 1This list contains 123 items in total. According to Swadesh, these study. Apparently available resources like the Princeton items occurred both in his original test list of English items, and WordNet (Princeton University, 2010) or its counterparts in in the data on the Salishan languages, which he employed for his other languages are only partly useful for this purpose, since first glottochronological study. their synsets are supposed to reflect the concrete meaning 2393 of words, and not the denotation range of concepts. When models to handle their interoperability. This is specifically trying to link a given concept list to WordNet or BabelNet important in the context of multilingual language resources (Navigli and Ponzetto, 2012), like, for example, the famous and resources on less-well-studied languages. Language di- 100-item list by Swadesh (1955), one meets quickly insur- versity is often addressed with region- or language-specific mountable obstacles, since the exact definition can often questionnaires. This makes it difficult to integrate and com- only be captured by using two or more WordNet synsets, pare these resources on a greater scale. Despite the growing not to speak even of the cases where no corresponding con- body, the interoperability of language resources involving cept can be found.2 concepts and meanings, like wordlists and lexical datasets, It is for this reason that we decided to build a new re- has not yet been addressed in a systematic way. source that links between published and popular concept Our Concepticon is an attempt to overcome these dif- lists from scratch. Our resource was explicitly designed to ficulties by linking the many different concept lists which link existing concept lists and provide means to standardize are used in the linguistic literature. In order to do so, we future concept lists, and it is thus no lexical database like offer open, linked, and shared data and tools in open and WordNet, also no multilingual database, like BabelNet, but collaborative architectures. Our data is curated openly and an explicit meta-resource, in which we use concept sets to collaboratively on GitHub (https://github.com/clld/ link concepts across different concept lists, creating a true concepticon-data). The Concepticon itself is published concepticon in the spirit of Poornima and Good (2010). as Linked Open Data (http://concepticon.clld.org) within the CLLD framework, which allows us to reuse tools 2. Concept Lists built on top of the CLLD API, in particular the clldclient Concept lists are simply speaking collections of concepts package (https://github.com/clld/clldclient). which scholars decided to compile at some point. In an In our Concepticon, all entries from concept lists are ideal concept list, concepts would be described by a con- partitioned into sets of labels referring to the same concept label and a short definition. Most published concept cept – so called concept sets. Each concept set is given lists, however, only contain a concept label. On the other a unique identifier (Concepticon ID), a unique label (Con- hand, certain concept lists have been further expanded by cepticon Gloss), a human-readable definition (Concepticon adding structure, such as rankings, divisions, or relations. Definition), a rough semantic field, following the seman- Concept lists are compiled for a variety of different tic fields which are used in the World Loanword Database purposes. A major distinction can be made between those (Haspelmath and Tadmor, 2009), and a short description re- concept lists which have been compiled for the purpose of garding its ontological category which correlates roughly language comparison and those which have been compiled with the conception of parts of speech when dealing with for the purpose of concept comparison. Among the for- words in individual languages. Our Concepticon reflects mer, we can further distinguish those lists which are used the idea of metaglosses proposed by Cooper (2014), ex- to prove language relationship (Dolgopolsky, 1964), those ceeding it in application range while falling back in the on- which are used for linguistic subgrouping (Norman, 2003; tological grounding of concept sets, which we define and Starostin, 1991; Swadesh, 1955), and those which can be add on the basis of what we find in concept lists and what used to identify contact layers (Chén Bǎoyà 陈保亚, 1996). we deem to be important. Based on the availability of re- Among the latter, we can distinguish between concept lists sources, we further provide metadata for concept sets, in- with a primarily

Concepticon: a Resource for the Linking of Concept Lists

'Face' in Vietnamese

Modeling and Encoding Traditional Wordlists for Machine Applications

TEI and the Documentation of Mixtepec-Mixtec Jack Bowers

1 from Text to Thought How Analyzing Language Can Advance

Proceedings of the 55Th Annual Meeting of the Association For

Modeling Wordlists Via Semantic Web Technologies

Cross-Linguistic Data Formats, Advancing Data Sharing and Re-Use in Comparative Linguistics

Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing Etymdb 2.0 Clémentine Fourrier, Benoît Sagot

How Analyzing Language Can Advance Psychological Science

The Cross-Linguistic Linked Data Project

Teaching Forge to Verbalize Dbpedia Properties in Spanish

Repurposing Bible Translations for Grammar Sketches*