Writing a Parser for an Artificial Language, Atasi Cnoba
Total Page:16
File Type:pdf, Size:1020Kb
Writing a parser for an artificial language, Atasi cnoba Johan Holmberg Department of Computer Science Institute of Technology, Lund University [email protected] Abstract merely being a lingua franca, or by even re- placing existing languages. This report describes an attempt to build a parser for a small and coherent artifi- Experimentation. Languages in this category are cal language. It presents a fairly com- usually created as a mean of trying out ideas, plete grammar denoted on the Panini- usually regarding logic and philosophy. Pro- Backus form, accomanied with a sample ponents of the Sapir–Whorf hypothesis com- thesaurus, all implemented using the Pro- monly uses this as a tool for exploring how log language. The project is part of the languages inflicts on the human mind. course Language Processing and Compu- Artistic ambitions. These are languages in- tational Linguistics given at Lund univer- tended to add an extra dimension to artistics sity in the autumn semester of 2006. works, such as books or movies. Famous 1 Introduction languages, such as Tolkien’s Quenya or Klingon of Star Trek fame are part of this While the World has a phletora of languages, group. some linguistics claim there to be 6500 languages thoughout the World, there still seems to be a Apart from these main reasons, there is another need for artificially created languages. Those one: obfuscation. Languages in this category are languages differ in their seriousness, longevity devised to hide information from a general public, and span, but they all fill a void needed to be while keeping it open to the speakers of the lan- filled. Most of those languages will not prevail, guage in question. A lot of informal languages are and the actual need of the languages can thus created this way, but since they are secretive by be questioned. Nonetheless do they exist, and nature, they tend to fall in oblivion as soon as their probably will for a long, long time. inventors and speakers drop them. This is an attempt to outline a possible way 3 The Atasi cnoba language to automatically parse and validate such a The Atasi cnoba language was originally con- language. structed for usage in refugee role playing games set up by Swedish scouts, as an integral part of 2 Rationale for an artificial language the act. The rationale for this was that the partic- There are mainly three different reasons for devel- ipants, who act as refugees, were to feel unsecure oping an artifical language. These are, in no par- and disorientated, but it was also meant to be a ticular order: useful tool for the game leader – staff communi- cation. As such, it has only been used a couple Auxiliarity. Languages, such as Interlingua, of times, due to the late date of maturity for the Volapuk¨ or Esperanto, resides in this cate- language. gory. They are constructed with the pur- The premises for developing a langauge like pose of being a common language, either by that, were clear and simple: The language had to be different enough to be percieved as a foreign seem to be homonyms, while they are in fact dis- language, yet simple enough to learn in a matter of tinct words on their own merits, and will con- days, or even hours, if needed. This was accom- sequently be treated as different entities by the plished by cutting back on grammatical nuances, parser. relying on the fact that the language will not ever Some simplifications had to be done in order for be used as a real world language. The develop- Atasi cnoba to be easy to grasp. This is especially ers were also keen of grouping words with similar true in regards to the somewhat perverse conso- meaning into more general words, relying on de- nant clusters used by the Georgians. Clusters of scriptors to further describe objects and phenom- up to six consonats are not uncommon, peaking at ena as needed. The result is a, within it’s domain, eight consecutive consonants, whereof some con- fairly useful language with no more than a hundred sonats would be rendered as two or more separate words, coupled with a handful of modifiers. Due sounds by a native Swedish speaker. The Swedish to the compactness and regularity of the language, language, as a comparison, features no more than it turned out to be well-suited for computational three consonants in a row, not counting possibly handling. appended suffixes. Atasi cnoba is to be stressed using trochee feet, 3.1 Vocabulary but as this has no consequences in regards to The vocabulary of Atasi cnoba is heavily influ- spelling, this fact can be easily dismissed in the enced by the author’s past experiences in the field scope of this project. of linguistics, coupled with the premise that the 3.3 Writing system language should be hard to grasp for an outsider. Hence, it’s main sources are the Romance, Ger- Since the sound system is effectively Georgian, man and Slavonic language families. The words we also use the Mkhedrulian alphabet, which is coming from these families are usually distorted in used for writing Georgian, as well as the other order to make them unintelligble by non-speakers. Kartvelian languages. The alphabet works flaw- lessy in conjucture with the Georgian sound sys- Some words, as in the case of the numerals, are tem. Furthermore, it has the property of looking direct loanwords from Georigan, a Kartvelian lan- radically different from the European scripts, al- guage. In fact, atasi cnoba is Georgian for a thou- though those alphabets share the same ancestor as sand words. Those words are usually not distorted, the Mkhedrulian: the classical Greek script. as a discreet homage to the language’s Georgian The Mkhedrulian alphabet provides a 1 : 1 roots. mapping for each sound in Georgian, and as a con- 3.2 Sound system sequence also for Atasi cnoba. Another useful trait of the Mhedrulian alphabet is the lack of capital In order to make the language sound foreign and letters. What this means in terms of this project, strange, the need of a distinct sound system was is that we don’t have to take conversion of upper needed. We chose the Georgian sound system as a case letters into lower case letters for lookups in basis to build on, mainly because the many conso- the thesaurus into account while developing the nant sounds in the language and the relative lack system. of vowels are radically different from Swedish. The Mkhedruli alphabet is covered by the Uni- The distinct sounds are however relatively easy to code standard, as discussed later. mimic for a non-native speaker, and are thus easy to comprehend. 3.4 Grammar Georgian features 33 consonant sounds and 5 The grammar of Atasi cnoba is, as previously vowel sounds. Since quite a few of the western mentioned, thought to be simplistic in it’s ap- European sounds doesn’t appear in the Georigan proach. It lacks ways of morphologically denote sound system, imported words and names has to the words’ roles in the sentence. Instead, it relies be retrofitted before being used. on an absolute subject-verb-object order through- Unlike most modern day European languages, out the languages. This is even true in ques- aspirated and unaspirated plosives and affricatives tions, different from, say English and Swedish, aren’t allophones in the Georgian language. As where questions usually are constructed in a verb- a result, some distinct words in Atasi cnoba may subject-object order. This is possible due to a type of question mark, which surrounds the whole sen- take on one out of two roles: an agent, using the tence, not unlike Portuguese or Spanish, the differ- -ari suffix, or an essence, using the -de suffix. ence here being that the question mark is actually Thus, the verb dokt’ (to instruct), would transform realized as a distinct phoneme. into the agent form dokt’ari (teacher), and into the essence form dokt’de (instruction, learning). 3.4.1 Morphology Atasi cnoba is an agglutinative language, mean- 3.4.3 Proper names ing that new words are created by the means of af- Proper names are denoted by adding an extra -i fixes added to a stem. By doing this, a verb might at the end of the word. This is a remnant from the turn into a noun, and a noun turn into a descriptor. Georgian nominative case. More importantly, affixes are used for verb conju- 3.4.4 Pronouns gation, descriptor comparison and noun inflection. Some affixes, eg. -ni- which acts as a negator, Atasi cnoba features nine pronouns, whereof are universal for the major lexical classes. seven are personal: 3.4.2 Nouns singular plural Atasi cnoba lacks both grammatical numbers 1st person ik ikek (inclusive) and genders, and thus keeping the different noun ikak (exclusive) forms to an absolute minimum. The nouns are, 2nd person ek ekek however, slightly inflicted in order to accompany 3rd person ak akak verbs, nouns and pronouns as descriptors. There The remaining pronouns are k’iak, which are three cases, not counting the nominative case, roughly translates into the English pronouns any- into which a noun can be inflicted. These are: body, somebody, and tot’ak, which roughly trans- lates into the English pronouns all, everybody. The possessive case is used to denote dependen- The inclusive and exclusive versions of the cies and possession. A simple example 1st person plural are used to denote whether the ia ioaniav Johan’s flower would be , meaning speaker includes it’s audience into the ”we collec- the flower of Johan or .