IJournals: International Journal of Software & Hardware Research in Engineering ISSN-2347-4890 Volume 5 Issue 2 February, 2017 Context-free for

Author: Kristina Bikoska1; Slavco Chungurski2; Emilija Kamceva3

FON University Skopje Macedonia

E-mail: [email protected]; [email protected]; [email protected]

ABSTRACT

Modern approaches to language are a stimulus for research in different branches. The main idea that connects all of them, historically and theoretically, is the notion of universal grammar, or the formalization of natural languages, which studies the main characteristics of a language. Such formalization depends on the nature and structure of the language. Natural language has an underlying structure usually referred to under the heading of syntax. The syntax of a language refers to the principles by which words are grouped together. This paper formally represents the grammar of one fragment of the Macedonian language as a context-free grammar (CFG). Using modern computational notations to express natural language constructs makes it easier to develop applications that parse natural language input. To demonstrate the feasibility of the presented CFG, we work through a set of examples of parsing Macedonian sentences.

Keywords: Context-free grammar, syntax, natural language, Macedonian, processing.

1. INTRODUCTION

Macedonian is a South Slavic language spoken by about three million people. There are some two million speakers in Macedonia, and perhaps another million or so in other countries. Macedonian has a high degree of mutual intelligibility with Bulgarian, and to a lesser extent with Serbian. Literary Macedonian is based on the dialects of the West Central region. This paper is about natural language processing and building computational models of natural language for analysis and generation. The most commonly used mathematical system for modelling constituent structure in natural language is the context-free grammar (CFG), which was first defined for natural language in (Chomsky 1957) and was independently discovered for the description of the Algol programming language by Backus (1959) and Naur (1960). In this paper, one fragment of the Macedonian language will be presented as a context-free grammar. Context-free grammars belong to the realm of formal language theory (cf. Hopcroft and Ullman 1974 for a detailed overview), where a language (formal or natural) is viewed as a set of sentences, a sentence as a string of one or more words from the vocabulary of the language, and a grammar as a finite, formal specification of the (possibly infinite) set of sentences composing the language under study.

2. SYNTAX FOR ONE FRAGMENT OF MACEDONIAN LANGUAGE

Natural language has an underlying structure usually referred to under the heading of syntax. The syntax of a language refers to the principles by which words are grouped together. The fundamental idea of syntax is that words group together to form constituents, which are groups of words or phrases that behave as a single unit. These constituents can combine to form bigger constituents and eventually sentences. When the syntax for one fragment of a language is being created, two basic steps have to be carried out:
- Determining the syntactic categories which are part of that language, with a group of main rules for every syntactic category;
- Creating the syntax rules.
A language (formal or natural) is considered as a set of sentences, a sentence as a string of one or

© 2017, IJournals All Rights Reserved Page 34 www.ijournals.in


more words which belong to the language, and a grammar as a finite, formal specification of the language. A context-free grammar consists of a lexicon of words and symbols and a set of rules which define how those words and symbols are grouped together. Generally, a CFG, or phrase-structure grammar, consists of four components:

• T, the terminal vocabulary: the symbols that correspond to words in the language;

• N, the non-terminal vocabulary: a set of symbols disjoint from T that express clusters or generalizations;

• P, a set of productions (rules), each of the form A → x, where A is a non-terminal and x is a string of symbols from the infinite set of strings (T U N)*;

• S, the start symbol, a member of N.

In context-free rules the element to the right of the arrow (→) is an ordered list of one or more elements of class N and class T, while to the left of the arrow is a single non-terminal symbol. The arrow (→) means “rewrite the symbol on the left with the string of symbols on the right”. Every string in the language defined by a context-free grammar has to be derivable from the start symbol of that grammar. Every context-free grammar must have a start symbol, which is often called S. S represents the sentence node, and the set of strings that are derivable from S is the set of sentences in some version of the language.

This means that a language is defined via the concept of derivation. One string derives another if it can be rewritten as the second one via some series of productions. If X → Y is a production, then any sequence of symbols which contains the symbol X can be rewritten by replacing X with Y. More formally, if X → Y is a production of P and x and y are any strings in the set (T U N)*, then it is said that xXy directly derives xYy, or xXy => xYy. Derivation is a generalization of direct derivation. Let x1, x2, …, xm be strings in (T U N)*, m >= 1, such that

x1 => x2, x2 => x3, …, xm-1 => xm

Then it is said that x1 derives xm, or x1 =>* xm.

The language L generated by a grammar G can then be formally defined as the set of strings composed of terminal symbols which can be derived from the designated start symbol S:

L = { a | a is in T* and S =>* a }

Macedonian has many sentence structures, but only some of them are shown here. One of the most used sentence structures is the declarative sentence, and the usual element order in declarative sentences is subject-verb-object. This order is not strict and it can be changed in some situations (for example in ).

3. SUBJECT

(Podmet) The subject of a sentence is the entity that performs the action in the sentence: the person, place, thing, or idea that is doing or being something. The subject of a sentence in Macedonian can be:
- a noun phrase (NP), a sequence of words surrounding at least one noun;
- a verbal noun (VN), a form of a verb that acts as a noun, like the English forms ending in -ing; in Macedonian, verbal nouns are formed by adding the suffix –ње to the verb;
- a to-construction (TC);
- a clause (CL);
- a direct quote (DQ);
- a pronoun (PN); Macedonian pronouns decline for case ('падеж'), i.e., their function in a phrase as subject (Тој 'He'), direct object (него 'him'), or object of a preposition (од неа 'from her').
There are sentences which do not have a subject; those sentences are called impersonal sentences.
A simple context-free grammar rule for the subject of a sentence is shown in Figure 1:

Figure 1 Simple rule for context-free grammar for subject in a sentence

The or-symbol (|) indicates that a non-terminal has alternate possible expansions.
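The four-component definition of a CFG, the notion of derivation, and the alternatives for the subject can be illustrated with a minimal Python sketch. Figure 1 is not reproduced in this text, so the rule SUBJ → NP | VN | TC | CL | DQ | PN below is reconstructed from the list of categories given in Section 3; the names `Grammar`, `TOY`, `direct_derivations` and `derives`, as well as the toy rules for S, PN and VP, are assumptions made for illustration and are not the authors' grammar.

```python
from collections import deque
from typing import NamedTuple


class Grammar(NamedTuple):
    terminals: frozenset      # T: words of the language
    nonterminals: frozenset   # N: syntactic categories
    productions: dict         # P: non-terminal -> list of right-hand sides
    start: str                # S: start symbol


# Hypothetical toy grammar; only SUBJ mirrors the categories named in the text.
TOY = Grammar(
    terminals=frozenset({"Тој", "игра"}),
    nonterminals=frozenset(
        {"S", "SUBJ", "VP", "NP", "VN", "TC", "CL", "DQ", "PN"}
    ),
    productions={
        "S": [("SUBJ", "VP")],
        # The or-symbol (|) becomes a list of alternative right-hand sides.
        "SUBJ": [("NP",), ("VN",), ("TC",), ("CL",), ("DQ",), ("PN",)],
        "PN": [("Тој",)],
        "VP": [("игра",)],
    },
    start="S",
)


def direct_derivations(g, form):
    """Yield every string y such that form => y (one production, applied once)."""
    for i, symbol in enumerate(form):
        for rhs in g.productions.get(symbol, []):
            yield form[:i] + rhs + form[i + 1:]


def derives(g, sentence):
    """Return True if S =>* sentence, using breadth-first search.

    Assumes no production has an empty right-hand side, so intermediate
    strings never shrink and anything longer than the target can be pruned.
    """
    target = tuple(sentence)
    seen = {(g.start,)}
    queue = deque(seen)
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for nxt in direct_derivations(g, form):
            if len(nxt) <= len(target) and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

With this toy grammar, `derives(TOY, ["Тој", "игра"])` follows the derivation S => SUBJ VP => PN VP => Тој VP => Тој игра, while the reversed word order is not in the generated language.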


4. PREDICATE

(Prirok) The predicate provides information about the subject, such as what the subject is, what the subject is doing, or what the subject is like. The predicate is the main part of a sentence; without a predicate there is no sentence. In Macedonian the predicate can be verbal or noun-verbal.

A verbal predicate can be simple or complex. There are two types of simple verbal predicate:

• Simple verbal predicate which consists of a simple verb form (there are no additional words). Simple verb forms in Macedonian are: Present tense (сегашно време), Imperfect (минато определено несвршено време, 'past definite incomplete tense'), Aorist (минато определено свршено време, 'past definite complete tense') and Imperative (заповеден начин).
Examples:
- Тој игра. (Toj igra. – He plays.) - Present tense (сегашно време)
- Тој играше. (Toj igrashe. – He was playing.) - Imperfect (минато определено несвршено време, 'past definite incomplete tense')
- Тој изигра. (Toj izigra. – He played.) - Aorist (минато определено свршено време, 'past definite complete tense')
- Ти, играј! (Ti, igraj! – You, play!) - Imperative (заповеден начин)

• Simple verbal predicate which consists of a complex verb form. Complex verb forms in Macedonian are: Perfect of imperfective verbs (минато неопределено несвршено време, 'past indefinite incomplete tense'), Perfect of perfective verbs (минато неопределено свршено време, 'past indefinite complete tense'), Past perfect tense (предминато време), Future tense (идно време), Future-in-the-past (минато-идно време), Future perfect tense (идно прекажано), Potential mood (можен начин), Have-construction (има-конструкција), Be-construction (сум-конструкција).
Examples:
- Тој играл. (Toj igral. – He has played.) - Perfect of perfective verbs (минато неопределено свршено време, 'past indefinite complete tense')
- Тој има играно. (Toj ima igrano. – He has been playing.) - минато неопределено несвршено време, 'past indefinite incomplete tense'
- Тој ќе игра. (Toj kje igra. – He will play.)

A complex verbal predicate consists of two verbs, of which the second is in a to-construction. The first verb is an auxiliary verb, and the verb of the to-construction is the one that carries the meaning.
Examples:
- Ти треба да одиш во училиште. (Ti treba da odish vo uchilishte. – You have to go to school.)

A noun-verbal predicate consists of a verb-connection (linking verb) and a noun phrase. The verb-connection is mostly the verb to be (сум-sum), but it can also be another verb: стане (become), станува (becoming), остане (stay), останува (staying)… In the noun-verbal predicate the meaning is carried by the noun phrase (nouns, pronouns, adjectives, numbers) and the verb is auxiliary.
Examples:
- Таа е убава. (Taa e ubava. – She is beautiful.)
- Јас сум студент. (Jas sum student. – I am a student.)
- Тој остана буден. (Toj ostana buden. – He stayed awake.)

The most important feature for context-free grammars in Macedonian is transitivity. Transitivity is a property of verbs in Macedonian (and in many other languages) that relates to whether a verb can take direct objects and how many such objects a verb can take. It is also important to note that, when determining how many arguments a predicate has, only the obligatory noun phrases and prepositional phrases (PP) are considered. Obligatory elements are considered arguments, while optional ones are never counted in the list of arguments. There are three types of transitivity of verbs in Macedonian: intransitive verbs (IV), which cannot take a direct object; transitive verbs (TV), which take one direct object; and reflexive verbs (RV), where the subject and direct object are the same.
Examples:
- Снегот паѓа. (Snegot pagja. – The snow falls.) – intransitive verb
- Тие трчаат. (Tie trchaat. – They are running.) – intransitive verb
- Јас јадам торта. (Jas jadam torta. – I am eating a cake.) – transitive verb
- Марија пишува песна. (Marija pishuva pesna. – Maria is writing a song.) – transitive verb
- Ангела се шминка. (Angela se shminka. – Angela puts on her makeup.) – reflexive verb

The predicate is a verb phrase (VP). A simple rule for the verb phrase is shown in Figure 2.
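How transitivity constrains the verb phrase can be sketched in a few lines of Python. Figure 2 is not reproduced in this text, so the shape of the rule assumed here (VP → IV | TV NP | се RV) is an illustration based only on the three verb classes and the examples above; the names `VERB_CLASS`, `OBJECTS_REQUIRED` and `vp_is_well_formed` are hypothetical, and every token after the verb is treated as one object noun phrase for simplicity.

```python
# Hypothetical verb lexicon built from the example sentences:
# IV = intransitive, TV = transitive, RV = reflexive.
VERB_CLASS = {
    "трчаат": "IV",
    "јадам": "TV",
    "пишува": "TV",
    "шминка": "RV",
}

# Number of obligatory direct objects per class. Reflexive verbs take none,
# because their object coincides with the subject (marked by "се").
OBJECTS_REQUIRED = {"IV": 0, "TV": 1, "RV": 0}


def vp_is_well_formed(words):
    """Check an assumed rule VP -> IV | TV NP | "се" RV on a token list."""
    words = list(words)
    has_reflexive_marker = bool(words) and words[0] == "се"
    if has_reflexive_marker:
        words = words[1:]
    if not words or words[0] not in VERB_CLASS:
        return False
    verb_class = VERB_CLASS[words[0]]
    objects = words[1:]  # simplification: one token = one object NP
    if verb_class == "RV":
        return has_reflexive_marker and not objects
    if has_reflexive_marker:
        return False
    return len(objects) == OBJECTS_REQUIRED[verb_class]
```

Under these assumptions `["јадам", "торта"]` and `["се", "шминка"]` are accepted, while a transitive verb without an object or an intransitive verb followed by an object is rejected.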


Figure 2 Rule for verb phrase

5. OBJECT

The object in a sentence is the entity that is acted upon by the subject. The main verb in a sentence determines if and what objects are present. Transitive verbs require the presence of an object, whereas intransitive and reflexive verbs cannot take an object. The object can be taken as part of the predicate. Objects can belong to any syntactic category, but the most used are: noun, noun phrase, pronoun, clause (del recenica) or verbal noun. The rule for the object is shown in Figure 3.

Figure 3 Rule for object

6. PREPOSITIONS

Prepositions (предлози, predlozi) (PP) are part of the closed word class that is used to express the relationship between the words in a sentence. Since Macedonian lost the case system, prepositions are very important for the creation and expression of various grammatical categories. The most important Macedonian preposition is на (na, 'of', 'on' or 'to'). Regarding form, prepositions can be simple (vo, na, za, do, so, niz, pred, zad, etc.) or complex (zaradi, otkaj, nasproti, pomeǵu, etc.). Based on the meaning they express, prepositions can be divided into: prepositions of place, prepositions of time, prepositions of quantity and prepositions of manner. The rule for the preposition is shown in Figure 4.

Figure 4 Rule for preposition

Sentences in Macedonian are divided into simple and complex. Simple sentences are those which have only one predicate (only one verb phrase). Complex sentences have two or more verb phrases.

7. CREATING A SIMPLE LEXICON AND GRAMMAR

CFG has been widely used for defining programming languages rather than natural languages. A CFG involves the following four quantities:
1) Terminals: Terminals define the basic symbols of which strings in the language are composed.
2) Non-terminals: Non-terminals are special symbols that denote sets of strings of the language. Non-terminals are described recursively in terms of each other and terminals.
3) Productions: Productions are rules that define the ways in which non-terminals may be built from one another and from terminals. Production rules are represented as follows:
A → α
where A is a non-terminal and α is a string of terminals and non-terminals.
4) Start symbol: The start symbol is a special non-terminal from which all other strings are derived. It signifies the language being defined.
A context-free grammar only defines a language. It does not say how to determine whether a given string belongs to the language it defines. To do this,



a parser can be used, whose task is to map a string of words to its parse tree. This section presents a simple lexicon for Macedonian with some of the most used words (Figure 5) and the simple grammar rules formed in the previous sections (Figure 6). A convenient way to describe a parse is to show its parse tree, which is simply a graphical display of the parse. Figure 7 gives a simple parse tree example for a sentence according to the grammar in Figure 6.

Figure 5 Simple lexicon

Figure 6 Simple grammar rules

Figure 7 Parse tree for one simple sentence
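The step from a lexicon and grammar rules to a parse tree can be sketched with a small top-down parser in Python. Figures 5 to 7 are not reproduced in this text, so the lexicon and the rules below are assumptions assembled from the example sentences of the earlier sections, not the authors' actual figures; the names `LEXICON`, `RULES`, `parse`, `parse_sequence` and `parse_sentence` are likewise hypothetical. The approach is plain recursive descent with backtracking, which is adequate here only because the assumed rules are not left-recursive.

```python
# Hypothetical lexicon: word -> category (PN pronoun, N noun,
# IV intransitive verb, TV transitive verb).
LEXICON = {
    "Тој": "PN", "Јас": "PN",
    "Марија": "N", "торта": "N", "песна": "N",
    "игра": "IV",
    "јадам": "TV", "пишува": "TV",
}

# Hypothetical grammar rules: non-terminal -> alternative right-hand sides.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["PN"], ["N"]],
    "VP": [["TV", "NP"], ["IV"]],
}


def parse_sequence(symbols, words, pos):
    """Yield (list_of_trees, next_pos) for matching `symbols` from `pos`."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in parse(symbols[0], words, pos):
        for rest, end in parse_sequence(symbols[1:], words, mid):
            yield [tree] + rest, end


def parse(symbol, words, pos=0):
    """Yield (tree, next_pos) for every way `symbol` matches from `pos`."""
    if symbol not in RULES:
        # Word category: consume one word if the lexicon assigns it this tag.
        if pos < len(words) and LEXICON.get(words[pos]) == symbol:
            yield (symbol, words[pos]), pos + 1
        return
    for rhs in RULES[symbol]:
        for children, end in parse_sequence(rhs, words, pos):
            yield (symbol, *children), end


def parse_sentence(sentence):
    """Return the first parse tree covering the whole sentence, or None."""
    words = sentence.rstrip(".").split()
    for tree, end in parse("S", words):
        if end == len(words):
            return tree
    return None
```

A tree is returned as nested tuples, so `parse_sentence("Јас јадам торта.")` gives an S node with an NP (the pronoun) and a VP (transitive verb plus object NP), while a string that the assumed rules cannot derive, such as a transitive verb with no object, yields `None`.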

8. SUMMARY

A context-free grammar is a list of rules that define the set of all well-formed sentences in a language. Context-free grammars can be used to model various facts about the syntax of a language. When paired with parsers, such grammars constitute a critical component in many applications. The representation of Macedonian as a CFG points towards the possibility of representing more of natural language in a formal way. This formal representation establishes that some part of Macedonian grammar is highly structured. Extensive study of Macedonian grammar also reveals that the notations used in it are analogous to modern computational notations. This research on context-free grammars for Macedonian can be improved. The structured nature of the language should be exploited to completely formalize the language. If a CFG for the entire Macedonian grammar is written, applications such as semantic parsers for Macedonian-language word processors will become possible.

9. REFERENCES

[1]. “A Short Introduction to Regular Expressions and Context-Free Grammars” - Theodore Norvell. Software Engineering 7893 R. Nicole
[2]. “Three models for the description of language” - Noam Chomsky - Department of Modern Languages and Research Laboratory of Electronics, Massachusetts Institute of Technology
[3]. “Evidence against the context-freeness of natural language” - Stuart Shieber
[4]. https://en.wikipedia.org/wiki/Syntactic_Structures
[5]. “Context Free Grammars” - Klaus Sutner - Carnegie Mellon University
[6]. “Context-Free Grammars Formalism Derivations Backus-Naur Form Left- and Rightmost Derivations”
[7]. “Македонски правопис” (Macedonian Orthography) – prepared by the Commission for Language and Orthography at the Ministry of National Education.
