PanLex Empirical Concepticons
Draft in Progress 3 March 02014 Jonathan Pool The Long Now Foundation
Table of Contents
Abstract 2
Introduction 2
Concept discovery strategy 4
Expression evaluation 5
Meaning evaluation 7
Meaning-pair evaluation 8
Meaning consolidation 9
Expression validity evaluation 10
Expression selection 10
Ingestion 10
Observations 11
References 12
Appendix 13 Abstract Concepticons are lists of purportedly universal or basic concepts, labeled with words and phrases in one or more languages. They have applications in science, information management, and philosophy. They have generally been human- defined, but they could be derived algorithmically from empirical data. A possible source of such data is PanLex, a database of millions of lexemes from thousands of language varieties, with over a billion translation links among them. “PanLex Empirical Concepticon A” is derived from PanLex with an algorithm that selects groups of translated expressions and then consolidates them with other groups into clusters determined to be likely to share a meaning. The algorithm gives preference to expressions that appear in many and high-quality lexicographic sources, and expressions that are unambiguous. This concepticon contains 2,932 concepts, which are labeled with an average of 248 expressions each. Each label of each concept is assigned an estimated validity. Two versions of the concepticon have been made available for inspection: a “Complete” version, which includes all the labels, and a “Best” version, which discards lower-validity labels. A cursory inspection reveals that the concepts and their labels are generally plausible, but some anomalies appear, including multiple concepts that apparently should be combined and the omission of some concepts that seem much more significant than related ones that are included. Evaluation of these concepticons will lead to iterative algorithm refinements and revised editions.
Introduction What are the fundamental ideas universal to all of humankind? If civilization had to be restarted and languages reinvented, what most important concepts would new words begin to express? Creators of thesauri, wordnets, upper ontologies, subject headings, universal languages, Swadesh-type wordlists, and even encyclopedias are among those who have, arguably, proposed answers to this question, by listing, or organizing into taxonomies, hypothetically universal or basic concepts, either general or subject-specific. They do this for various purposes, including support for indexing, documentation, communication, translation, information retrieval, historical-linguistic research, and clear thinking. A term recently coined to describe such a concept list is concepticon (Poornima and Good, 02010). While some concepticons are hierarchical (containing relations such as broader than, type of, related to, and part of), here I ignore any internal organization and treat concepticons as if they were reduced to flat lists. Where do concepticons come from? Until now, they have generally been human- crafted. A single author or a committee decides which concepts should be included