Context-Free Grammars for Macedonian Language
IJournals: International Journal of Software & Hardware Research in Engineering, ISSN-2347-4890, Volume 5 Issue 2, February 2017
© 2017, IJournals All Rights Reserved, www.ijournals.in

Authors: Kristina Bikoska; Slavco Chungurski; Emilija Kamceva
FON University, Skopje, Macedonia

ABSTRACT

Modern approaches to language are a provocation for research in different branches. The main idea that connects all of them, historically and theoretically, is universal grammar, or the formalization of natural languages, which studies the main characteristics of language. Such formalization depends on the nature and structure of the language. Natural language has an underlying structure usually referred to under the heading of syntax. The syntax of a language refers to the principles by which words are grouped together. This paper formally represents the grammar for one fragment of the Macedonian language as a context-free grammar (CFG). Using modern computational notations to express natural language constructs makes it easier to develop applications that parse natural language input. To demonstrate the feasibility of the presented CFG, we present a set of examples of parsing Macedonian sentences.

Keywords: Context-free grammar, syntax, natural language, Macedonian, processing.

1. INTRODUCTION

Macedonian is a South Slavic language spoken by about three million people. There are some two million speakers in Macedonia, and perhaps another million or so in other countries. Macedonian has a high degree of mutual intelligibility with Bulgarian, and to a lesser extent with Serbian. Literary Macedonian is based on the dialects of the West Central region.

This paper is about natural language processing and building computational models of natural language for analysis and generation. A commonly used mathematical system for modelling constituent structure in natural language is the context-free grammar (CFG), which was first defined for natural language in (Chomsky 1957) and was independently discovered for the description of the Algol programming language by Backus (1959) and Naur (1960). In this paper, one fragment of the Macedonian language is presented as a context-free grammar.

Context-free grammars belong to the realm of formal language theory (cf. Hopcroft and Ullman 1974 for a detailed overview), where a language (formal or natural) is viewed as a set of sentences, a sentence as a string of one or more words from the vocabulary of the language, and a grammar as a finite, formal specification of the (possibly infinite) set of sentences composing the language under study.

2. SYNTAX FOR ONE FRAGMENT OF MACEDONIAN LANGUAGE

Natural language has an underlying structure usually referred to under the heading of syntax. The syntax of a language refers to the principles by which words are grouped together. The fundamental idea of syntax is that words group together to form constituents, which are groups of words or phrases that behave as a single unit. These constituents can combine to form bigger constituents and eventually sentences. When the syntax for one fragment of a language is being created, two basic steps have to be carried out:
- Parsing the syntactic categories that are part of that language, with a group of main rules for every syntactic category;
- Creating the syntax rules.

A language (formal or natural) is considered as a set of sentences, a sentence as a string of one or more words which belong to the language, and a grammar as a finite, formal specification of the language. A context-free grammar consists of a lexicon of words and symbols and a set of rules which define how those words and symbols are grouped together. Generally, a CFG or phrase-structure grammar consists of four components:
• T, the class of terminal vocabulary: the symbols that correspond to words in the language;
• N, the class of non-terminal vocabulary: a set of symbols disjoint from T that express clusters or generalizations;
• P, a set of productions (rules), each of the form A → x, where A is a non-terminal and x is a string of symbols from the infinite set of strings (T U N)*;
• S, the start symbol, a member of N.

In context-free rules the element to the right of the arrow (→) is an ordered list of one or more elements of class N and class T, while to the left of the arrow is a single non-terminal symbol. The arrow (→) means “rewrite the symbol on the left with the string of symbols on the right”.

Every string in a language defined by a context-free grammar has to be derivable from the start symbol of that grammar. Every context-free grammar must have a start symbol, which is often called S. S represents the sentence node, and the set of strings that are derivable from S is the set of sentences in some version of a language.

This means that a language is defined via the concept of derivation. One string derives another if it can be rewritten as the second one via some series of productions. If x → y is a production, then any sequence of symbols which contains the symbol x can be rewritten by replacing the x with y. More formally, if X → Y is a production of P and x and y are any strings in the set (T U N)*, then xXy is said to directly derive xYy, written xXy => xYy.

Derivation is a generalization of direct derivation. Let x1, x2, …, xm be strings in (T U N)*, m >= 1, such that

x1 => x2, x2 => x3, …, xm-1 => xm

Then x1 is said to derive xm, written x1 =>* xm.

The language L generated by a grammar G can then be formally defined as the set of strings composed of terminal symbols which can be derived from the designated start symbol S:

L = { a | a is in T* and S =>* a }

Many sentence structures exist in the Macedonian language, but only some of them are shown here. Declarative sentences are among the most used sentence structures, and the usual element order in declarative sentences is subject-verb-object. This order is not strict and can be changed in some situations (for example in poetry).

3. SUBJECT

(Podmet) The subject of a sentence is the entity that performs the action in the sentence. The subject of a sentence is the person, place, thing, or idea that is doing or being something. The subject of a sentence in the Macedonian language can be a noun phrase, NP (a sequence of words surrounding at least one noun); a verbal noun, VN (a verbal noun is a form of a verb ending in -ing that acts as a noun; in the Macedonian language verbal nouns are formed by adding the suffix –ње to the verb); a to-construction, TC; a clause, CL; a direct quote, DQ; or a pronoun, PN (Macedonian pronouns decline for case ('падеж'), i.e., their function in a phrase as subject (Тој 'He'), direct object (него 'him'), or object of a preposition (од неа 'from her')). There are sentences which do not have a subject; those sentences are called impersonal sentences.

A simple context-free grammar rule for the subject of a sentence is shown in Figure 1:

Figure 1. Simple rule for context-free grammar for subject in a sentence

The or-symbol (|) indicates that a non-terminal has alternative possible expansions.

4. PREDICATE

(Prirok) The predicate provides information about the subject, such as what the subject is, what the subject is doing, or what the subject is like. The predicate is the main part of a sentence; without a predicate there is no sentence. In the Macedonian language the predicate can be verbal or noun-verbal.

A verbal predicate can be simple or complex. There are two types of simple verbal predicate:

Simple verbal predicate which consists of a simple verb form (there are no additional words). Simple verb forms in the Macedonian language are: Present tense (сегашно време), Imperfect (минато определено несвршено време, 'past definite incomplete tense'), Aorist (минато определено свршено време, 'past definite complete tense') and Imperative (заповеден начин).

Examples:
- Тој игра. (Toj igra. – He plays.) - Present tense (сегашно време)
- Тој играше. (Toj igrashe. – He was playing.) - Imperfect (минато определено несвршено време, 'past definite incomplete tense')
- Тој изигра. (Toj izigra. – He played.) - Aorist (минато определено свршено време, 'past definite complete tense')
- Ти, играј! (Ti, igraj! – You, play!) - Imperative (заповеден начин)

Simple verbal predicate which consists of …

… verb is an auxiliary verb, and the verb of the to-construction is the one that carries the meaning.

Examples:
- Ти треба да одиш во училиште. (Ti treba da odish vo uchilishte. – You have to go to school.)

A noun-verbal predicate consists of a verb-connection and a noun phrase. The verb-connection is mostly the verb to be (сум, sum), but other verbs can also serve as the verb-connection: стане (become), станува (becoming), остане (stay), останува (staying)… In the noun-verbal predicate the meaning is carried by the noun phrase (nouns, pronouns, adjectives, numbers) and the verb is auxiliary.

Examples:
- Таа е убава. (Taa e ubava. – She is beautiful.)
- Јас сум студент. (Jas sum student. – I am a student.)
- Тој остана буден. (Toj ostana buden. – He stayed awake.)

The most important feature for context-free grammars in the Macedonian language is transitivity. Transitivity is a property of verbs in the Macedonian language (and in many other languages) that relates to whether a verb can take direct objects and how many such objects a verb can take. It is also important to note that, when determining how many arguments a predicate has, only the obligatory noun phrases and prepositional phrases (PP) are considered.