BRNO UNIVERSITY OF TECHNOLOGY
VYSOKÉ UČENÍ TECHNICKÉ V BRNĚ

FACULTY OF INFORMATION TECHNOLOGY
DEPARTMENT OF INFORMATION SYSTEMS
FAKULTA INFORMAČNÍCH TECHNOLOGIÍ
ÚSTAV INFORMAČNÍCH SYSTÉMŮ

FORMAL SYSTEMS BASED UPON AUTOMATA AND GRAMMARS
FORMÁLNÍ SYSTÉMY AUTOMATŮ A GRAMATIK

PHD THESIS

AUTHOR: MARTIN ČERMÁK

SUPERVISOR: prof. RNDr. ALEXANDER MEDUNA, CSc.

BRNO 2012


Abstract
The present thesis continues the study of grammar and automata systems. First of all, it deals with regularly controlled CD grammar systems with phrase-structure grammars as components. Three new derivation restrictions are placed on these systems, and their effect on the generative power of the systems is investigated. Thereafter, this thesis defines two automata counterparts of canonical multi-generative nonterminal- and rule-synchronized grammar systems, which generate vectors of strings, and it shows that all these investigated systems are mutually equivalent. Furthermore, this thesis generalizes the definitions of these systems and establishes a fundamental hierarchy of n-languages (sets of n-tuples of strings). In relation to these systems, an automaton-grammar translating system based upon a finite automaton and a context-free grammar is introduced and investigated as a mechanism for direct translation. At the end, the automata systems introduced in this thesis are used as the core of a parsing method based upon n-path-restricted tree-controlled grammars.


Keywords
automaton, grammar, automata system, grammar system, controlled rewriting, multi-generation, n-language


Citation
Martin Čermák: Formal Systems Based upon Automata and Grammars, PhD thesis, Brno, FIT BUT, 2012

Declaration
Hereby I declare that this thesis is my own work, created under the supervision of prof. RNDr. Alexander Meduna, CSc. Some parts of this thesis are based on papers written in collaboration with my colleagues RNDr. Tomáš Masopust, Ph.D., Ing. Jiří Koutný, and Ing. Petr Horáček. Specifically, the first two theorems in Chapter 4 are largely the work of prof. RNDr. Alexander Meduna, CSc. and RNDr. Tomáš Masopust, Ph.D. Chapter 8 and some parts of Chapter 6 are based on papers written in cooperation with Ing. Jiří Koutný. Chapter 7 is based on a paper whose part concerning applications of the investigated transducer was written in cooperation with Ing. Petr Horáček. All other results are an outcome of my own research. All sources, references, and literature used or excerpted during the elaboration of this thesis are properly cited with complete reference to the due source.

......
Martin Čermák
June 13, 2012

Acknowledgements
I would like to thank Alexander Meduna, my supervisor, for his inspiration, willingness, and friendly pedagogical and scientific leadership. Without his guidance, this work would not have been possible. I would also like to thank Erzsébet Csuhaj-Varjú and György Vaszil for consultations, and my colleagues Jiří Koutný, Zbyněk Křivka, Tomáš Masopust, and Petr Horáček for their inspiration and motivation. Last, but certainly not least, I would like to thank my wife Daniela for her infinite support.

© Martin Čermák, 2012.
This thesis was created as a school work at Brno University of Technology, Faculty of Information Technology. The thesis is protected by copyright law and its use without the author's permission is illegal, except for cases defined by law.

Contents

I Survey of Current State of Knowledge

1 Introduction
  1.1 Motivation
  1.2 Organization

2 Notation and Basic Definitions
  2.1 Sets and Relations
  2.2 Alphabets, Strings, and Languages
  2.3 Grammars
  2.4 Automata
  2.5 Regulated Formal Models

3 Systems of Formal Models
  3.1 Cooperating Distributed Grammar System
  3.2 Parallel Communicating Grammar Systems
  3.3 Automata Systems

II New Systems of Formal Models and Results

4 Restrictions on CD Grammar Systems
  4.1 Three Restrictions on Derivations in Grammar Systems
  4.2 Generative Power of Restricted Grammar Systems

5 n-Accepting Restricted Pushdown Automata Systems
  5.1 n-Accepting State-Restricted Automata Systems
  5.2 n-Accepting Move-Restricted Automata Systems

6 Classes of n-Languages: Hierarchy and Properties
  6.1 Definitions
  6.2 Class Hierarchy and Closure Properties

7 Rule-Restricted Automaton-Grammar Transducers
  7.1 Rule-Restricted Transducer
  7.2 Applications in Natural Language Translation

8 Parsing of n-Path-Restricted Tree-Controlled Languages
  8.1 n-Path-Restricted Tree-Controlled Grammars
  8.2 Examples
  8.3 Syntax Analysis of n-path-TC
    8.3.1 Top-Down Parsing of n-path-TC
    8.3.2 Bottom-Up Parsing of n-path-TC

9 Conclusion
  9.1 Summary and Further Investigation

Part I

Survey of Current State of Knowledge

Chapter 1

Introduction

1.1 Motivation

In the seventh century before Christ, the Egyptians believed that they were the oldest nation in the world. The king Psamtik I wanted to confirm this assumption. The confirmation was based on the idea that children who cannot learn to speak from adults will use an innate human language, and that language was supposed to be Egyptian. For this purpose, Psamtik I took two children from a poor family and let them grow up in the care of a shepherd, in an environment where nobody was allowed to speak with them. Although the test ultimately failed, it testifies that already in ancient Egypt people somehow felt the importance of languages (the whole story can be found in The Story of Psychology by Morton Hunt). In 1921, Ludwig Wittgenstein published a philosophical work (Logisch-philosophische Abhandlung) containing the claim that "The limits of my language mean the limits of my world". In computer science, this claim is doubly true. Languages are the way people express information and ideas in terms of computer science or information technology. In essence, any task or problem that a computer scientist is able to describe can be described by a language: the language represents the problem, and all sentences belonging to this language are its solutions. This limitation by languages led to the birth of a new research area, referred to as the theory of formal languages, which studies languages from a mathematical point of view. The main initiator was the linguist Noam Chomsky, who, in the late fifties, introduced a hierarchy of formal languages given by four types of language generators. By this work, Noam Chomsky inspired many mathematicians and computer scientists, who began to extend this fundamental hierarchy by adding new models for language definition. Because the theory of formal languages examines languages from a precise mathematical viewpoint, its results are significant for many areas of information technology.
The models studied by the theory are used in compilers, mathematical linguistics, bioinformatics (especially genetics and the simulation of natural biological processes), artificial intelligence, computer graphics, computer networks, and other areas. The classical theory uses three approaches to define formal languages.

1. Grammatical approach, where the languages are generated by grammars.

2. Automata approach, where the languages are recognized by automata.

3. Algebraic approach, where the languages are defined by some language operations.

To be more precise, in the grammatical approach, a grammar generates its language by the application of derivation steps, replacing sequences of symbols by other sequences according to its prescribed rules. The symbols can be terminal or nonterminal, and sequences of these symbols are called strings. In a single derivation step, the grammar, by applying one of its rules, replaces a part of the string by some other string. Any string which contains no nonterminal symbol and which can be generated from the start nonterminal by a sequence of derivation steps belongs to the language of the grammar. The language of the grammar is the set of all such generated strings. While a grammar generates a language, an automaton represents a formal algorithm by which the automaton can recognize correctly formed sequences of symbols belonging to the language it defines. More specifically, an automaton has a string written on its input tape. By application of prescribed rules, it processes the string symbol by symbol and changes its current state to determine whether the string belongs to the language represented by the automaton. If so, the string is accepted by the automaton. The set of all strings accepted by the automaton is the language the automaton defines. All models investigated in the theory of formal languages are designed to reflect the needs of information technology. Today, when task distribution, parallelism, and cooperating processes are extremely popular, the main attention is focused on controlled models and systems of models. The necessity of efficient data processing, computer networks, parallel architectures, parallel processing, and nature-motivated computing devices justifies studying these approaches in terms of the theory of formal models, where the mechanisms representing them are called systems of formal models.
The main motivation for the investigation of systems lies in the possibility to distribute a task into several smaller tasks, which are easier to solve and easier to describe. These tasks can be solved sequentially or in parallel, and usually, due to communication, the cooperating models are more efficient than the models alone. The present thesis concentrates on these modern approaches and brings new, or generalized, formal mechanisms and results into the theory. More specifically, this thesis mainly deals with systems of automata and grammars and studies their properties. This work, at first, continues the study of sequential grammar systems, known as cooperating distributed grammar systems (CD grammar systems for short). These were introduced in the late eighties as a model for blackboard problem solving. The main idea behind CD grammar systems is the cooperation of well-known simple grammars working on a shared string under a cooperation protocol. Unfortunately, the increased efficiency obtained from the cooperation comes with a higher degree of ambiguity and non-determinism, which is unpleasant for practical purposes. This thesis introduces several restrictions limiting the ambiguity or non-determinism and investigates their effect on the systems. The further investigation builds on the work of Roman Lukáš and Alexander Meduna, who, in 2006, introduced a new variant of parallel grammar systems named multi-generating grammar systems. In contrast with the classic, widely studied parallel communicating grammar systems, where the included grammars are used as supporting elements and the language of a parallel grammar system is generated by one predetermined grammar, these new systems take into account the strings from all their grammars. The final strings are obtained from all generated strings by a string operation. This thesis introduces two versions of automata counterparts to these grammar systems and proves their equivalence.
Thereafter, the investigated systems are generalized and a fundamental hierarchy of these systems is established. Finally, the thesis suggests systems based on the mentioned approaches as a direct translator of natural languages and a parser of languages generated by a specific type of controlled grammars.

1.2 Organization

The thesis is divided into two main parts. Part I, Survey of Current State of Knowledge, presents the basic mathematical concepts, the used terminology, a survey of currently studied systems of formal models, and several known important results relevant to this thesis. Part II, New Systems of Formal Models and Results, is the key part of this work; it introduces and investigates generalized and new variants of systems of formal models. The contents of the individual chapters are outlined in the following list.

• Chapter 1, Introduction, introduces readers to the field of formal languages and moti- vates the study of systems of formal models. Furthermore, it describes the structure of the document.

• Chapter 2, Notation and Basic Definitions, provides information about notation and mathematical background needed for the understanding of the thesis. Furthermore, it defines various formalisms used later in the text.

• Chapter 3, Systems of Formal Models, discusses current knowledge about serial and parallel grammar and automata systems, recalls CD grammar systems and the original versions of multi-generating grammar systems, and summarizes known results important for this publication.

• Chapter 4, Restrictions on CD Grammar Systems, introduces three new restrictions, limiting the degree of non-determinism, and places them on derivations in CD grammar systems with phrase-structure grammars. The used grammars can themselves generate any language that can be generated deterministically in some way. Although the restricted CD grammar system is enhanced by a control unit, two of the restrictions significantly decrease the generative power of the investigated systems.

• Chapter 5, n-Accepting Restricted Pushdown Automata Systems, introduces and investigates two new variants of automata systems as automata counterparts of the multi-generating grammar systems introduced by Lukáš and Meduna. The introduced automata systems consist of automata, each working on the acceptance of its own string. During a computation, the usability of their rules is limited by the other automata. In contrast with the automata systems studied in the classic theory of formal languages, the introduced systems accept vectors of interdependent strings instead of ordinary strings. Therefore, this chapter brings new terms, such as n-string and n-language, into the theory.

• Chapter 6, Classes of n-Languages: Hierarchy and Properties, generalizes multi-accepting grammar systems and n-accepting restricted automata systems by allowing the use of different types of grammars and automata in one system. On these systems, it establishes a fundamental hierarchy of n-languages and studies several closure properties.

• Chapter 7, Rule-Restricted Automaton-Grammar Transducers, deals with language-translation systems, transducers for short, composed of one automaton accepting an input string and one grammar generating a corresponding output string. Both models are synchronized by rules that can be used in the same time unit. First of all, this chapter shows how the leftmost derivation restriction placed on derivations in the grammar, and the introduction of rule priority into the grammar, affect the generative and accepting power of the considered transducers. After that, it discusses application-related perspectives of the studied systems in linguistics.

• Chapter 8, Parsing of n-Path-Restricted Tree-Controlled Languages, suggests an abstraction of the 3-accepting restricted automata system as a mechanism for a parsing method based upon a specific type of linguistically motivated controlled grammars, named n-path-restricted tree-controlled grammars. In addition, this chapter proves that these controlled grammars can be parsed in polynomial time.

• Finally, Chapter 9, Conclusion, summarizes all results obtained in Part II and outlines new areas for further investigation.

Chapter 2

Notation and Basic Definitions

This chapter briefly reviews the required terminology, notation, and fundamental terms from the areas of mathematics and formal language theory. Nevertheless, it is assumed that the reader is familiar with elementary algebra and proof techniques (see [56], [59], [69], [80], or [101]).

2.1 Sets and Relations

A set is a collection of elements (objects). For any element, we can unambiguously decide whether or not it belongs to the set. The empty set (the set with no elements) is denoted by ∅. A = {a, b} means that A is the set of the elements a and b. The fact that an object a belongs to A is denoted by a ∈ A. On the other hand, c ∉ A means that c does not belong to A. Note that each element can occur in a set only once. In general mathematics, there are many important sets. For example, N = {1, 2, ...} is the set of all natural numbers, N0 = {0, 1, 2, ...} is the set of all natural numbers with zero, Z = {..., −2, −1, 0, 1, 2, ...} is the set of all integers, etc. As one can notice, different sets can have different numbers of elements. In relation to the number of elements, we define the cardinality of a set A, |A| for short, as follows:

• |A| = 0 iff A = ∅,

• |A| = n iff A = {a1, a2, . . . , an} for some n ∈ N,

• |A| = ∞ iff the number of elements in A is infinite.

If |A| ∈ N0, we say that A is finite; otherwise, we say that A is infinite. Elements of A can be specified either by enumeration, e.g. A = {a1, a2, . . . , an}, where a1, a2, . . . , an are the elements of A, or by symbolic notation of the form A = {a | π(a)}, where a is an element and π(a) is a condition which has to hold; e.g. A = {i | i ∈ N is odd} is the set of all odd natural numbers. For two sets A and B, A = B and A ≠ B denote that A is equal to B and that A is not equal to B, respectively; A ⊆ B denotes that A is a subset of B; A ⊂ B denotes that A ⊆ B and A ≠ B, i.e. A is a proper subset of B. Sets of sets are almost always called classes or families of sets. One such kind of set is the power set.

Definition 2.1 (Power set)
Let A be a set. Then the power set of A, 2^A for short, is the set of all subsets of A. Formally, 2^A = {X | X ⊆ A}.

Probably the most common operations over sets are union, intersection, and difference, whose definitions are given next.

Definition 2.2 (Union, intersection and difference of sets) Let A and B be two sets.

• Union of A and B, denoted by A ∪ B, is A ∪ B = {x| x ∈ A or x ∈ B}.

• Intersection of A and B, denoted by A ∩ B, is A ∩ B = {x| x ∈ A and x ∈ B}.

• Difference of A and B, denoted by A − B, is A − B = {x| x ∈ A and x 6∈ B}.
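Although the thesis works purely mathematically, the three operations of Definition 2.2 can be illustrated directly with Python's built-in set type; the following sketch is only an illustration and is not part of the thesis.

```python
# Illustration of Definition 2.2 with Python's built-in set type.
A = {1, 2, 3}
B = {3, 4}

union = A | B          # {x | x in A or x in B}
intersection = A & B   # {x | x in A and x in B}
difference = A - B     # {x | x in A and x not in B}

print(sorted(union))         # [1, 2, 3, 4]
print(sorted(intersection))  # [3]
print(sorted(difference))    # [1, 2]
```

The operators `|`, `&`, and `-` correspond exactly to ∪, ∩, and − on finite sets.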

From the elements of a set, we can make a list of elements, called an (ordered) sequence, where a sequence can contain an element more than once and the elements appear in a certain order. Elements of sequences are usually separated by commas. The length of a sequence x, denoted by |x|, is the number of elements in x, and if |x| ∈ N0, we say that x is finite; otherwise, we say that x is infinite. Finite sequences are also called tuples.

Definition 2.3 (Cartesian product) Let A and B be two sets. The Cartesian product of A and B, A × B, is defined as A × B = {(a, b)| a ∈ A and b ∈ B}.

Note that the order of the objects in the elements of a Cartesian product is important.

Definition 2.4 (Binary relation)
Let A and B be two sets. A binary relation ρ, or relation for short, from A to B is any ρ ⊆ A × B.

Special cases of relations are functions.

Definition 2.5 (Function) Let A and B be two sets and φ ⊆ A × B be a binary relation from A to B. φ is a function from A to B, if for all a ∈ A, |{b| b ∈ B, (a, b) ∈ φ}| ≤ 1.

In the context of relations, the membership of a pair (a, b) in some relation ρ can be written as (a, b) ∈ ρ, b ∈ ρ(a), or aρb. Furthermore, if ρ is a function, b = ρ(a) and ρ(a) = b can be used. As one can see, the previous definitions do not prohibit A and B from being the same set, i.e. a relation ρ ⊆ A × A. In this case, the relation ρ is called a relation on A, and several properties of such relations can be studied.

Definition 2.6 (Basic properties of relations)
Let ρ ⊆ A × A be a relation. Then, ρ is

• reflexive iff (a, a) ∈ ρ for all a ∈ A,

• symmetric iff (a, b) ∈ ρ implies that (b, a) ∈ ρ, for all a, b ∈ A,

• transitive iff (a, b), (b, c) ∈ ρ implies that (a, c) ∈ ρ, for all a, b, c ∈ A,

• antisymmetric iff (a, b), (b, a) ∈ ρ implies that a = b, for all a, b ∈ A.
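For a finite relation, the four properties above can be checked mechanically. The following Python sketch (purely illustrative, not part of the thesis) represents a relation as a set of ordered pairs and tests each property by direct quantification over the pairs.

```python
# Checks of the four properties of Definition 2.6 for a finite
# relation rho ⊆ A × A, represented as a set of ordered pairs.

def is_reflexive(rho, A):
    return all((a, a) in rho for a in A)

def is_symmetric(rho):
    return all((b, a) in rho for (a, b) in rho)

def is_transitive(rho):
    return all((a, d) in rho
               for (a, b) in rho for (c, d) in rho if b == c)

def is_antisymmetric(rho):
    return all(a == b for (a, b) in rho if (b, a) in rho)

A = {1, 2, 3}
leq = {(a, b) for a in A for b in A if a <= b}  # the usual <= on A

print(is_reflexive(leq, A))   # True
print(is_symmetric(leq))      # False
print(is_transitive(leq))     # True
print(is_antisymmetric(leq))  # True
```

Since ≤ on {1, 2, 3} is reflexive, transitive, and antisymmetric, it is a partial order in the sense defined below.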

A reflexive, transitive, and antisymmetric relation ρ on a set A is called a partial order, and the pair (A, ρ) is called an ordered set. Let K ⊂ N0 be a finite set. Then, max(K) = k, where k ∈ K and k ≥ h for all h ∈ K; and min(K) = l, where l ∈ K and l ≤ h for all h ∈ K. Furthermore, let (X, ≤) be an ordered set and A ⊆ X. We say that x ∈ X is an upper or lower bound of A if, for all a ∈ A, a ≤ x or x ≤ a, respectively. The least upper bound is called the supremum, written as sup(A). Conversely, the greatest lower bound is known as the infimum, denoted inf(A). Over a relation ρ ⊆ A × A, we can study how many times the property of transitivity has to be applied. For this purpose, we use k-fold products and closures of ρ.

Definition 2.7 (k-Fold product and closures)
Let ρ ⊆ A × A be a relation on A. For k ≥ 0, the k-fold product of ρ, ρ^k, is defined recursively in the following way:

• aρ^0 b iff a = b,

• aρ^1 b iff aρb,

• aρ^n b iff aρc and cρ^(n−1) b for some c ∈ A.

We define the transitive closure of ρ, ρ+, as aρ+ b iff aρ^k b for some k ≥ 1; and the reflexive-transitive closure of ρ, ρ*, as aρ* b iff a = b or aρ+ b.
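For a finite relation, both closures of Definition 2.7 can be computed by repeatedly composing ρ with itself until no new pairs appear; a short illustrative Python sketch (not part of the thesis):

```python
# Transitive and reflexive-transitive closure of a finite relation
# (Definition 2.7): iterate the composition step until a fixed point.

def transitive_closure(rho):
    closure = set(rho)
    while True:
        # one composition step: (a, b) and (b, d) give (a, d)
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

def reflexive_transitive_closure(rho, A):
    # rho* adds the identity pairs (a, a) for all a in A
    return transitive_closure(rho) | {(a, a) for a in A}

rho = {(1, 2), (2, 3)}
print(sorted(transitive_closure(rho)))  # [(1, 2), (1, 3), (2, 3)]
```

The fixed-point loop terminates because a closure over a finite set A can contain at most |A|² pairs.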

2.2 Alphabets, Strings, and Languages

An alphabet is an arbitrary finite non-empty set of elements, which are called symbols. A finite sequence of symbols, x, is referred to as a string. For brevity, we simply juxtapose the symbols and omit all separating commas. If |x| = 0, we write x = ε and call x the empty string. Let T be an alphabet and x, y be two strings consisting of symbols from T. We say that x and y are strings over T, and we call xy the concatenation of x and y. The equation xε = εx = x is an immediate consequence of the definition. Over any string x, we define the set of prefixes pref(x) = {y | x = yα for some string α}, the set of suffixes suf(x) = {y | x = αy for some string α}, the set of substrings sub(x) = {y | x = αyβ for some strings α, β}, and the set of symbols occurring in x, alph(x) = {a | a is included in x}.
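The operations pref, suf, sub, and alph are easy to realize for concrete strings; the following Python sketch (illustrative only; the function names simply mirror the notation above) computes each set by slicing.

```python
# The string operations of Section 2.2 over Python strings;
# the empty Python string "" plays the role of ε.

def pref(x):
    return {x[:i] for i in range(len(x) + 1)}

def suf(x):
    return {x[i:] for i in range(len(x) + 1)}

def sub(x):
    return {x[i:j] for i in range(len(x) + 1)
                   for j in range(i, len(x) + 1)}

def alph(x):
    return set(x)

print(sorted(pref("abc")))  # ['', 'a', 'ab', 'abc']
print(sorted(suf("abc")))   # ['', 'abc', 'bc', 'c']
print(sorted(alph("aab")))  # ['a', 'b']
```

Note that ε and x itself belong to all three sets pref(x), suf(x), and sub(x), as the definitions require.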

Definition 2.8 (Occurrence of symbols)
Let Σ be an alphabet, W ⊆ Σ, and w a string over Σ. Then, occur(w, W) denotes the number of occurrences of symbols from W in w.

Definition 2.9 (Reversal of string)
Let x = a1a2 . . . an−1an with ai ∈ Σ for all i = 1, . . . , n. The reversal of x, denoted by (x)^R, is the string anan−1 . . . a2a1.

Definition 2.10 (Power of string)
Let x be a string over Σ. The power of x is defined as follows:

• x^0 = ε,

• x^i = xx^(i−1) for i ≥ 1.

Let Σ be an alphabet and Σ* be the set of all strings over Σ. A language L is defined as any subset of Σ*; symbolically, L ⊆ Σ*. The complement of L, written as L̄, is L̄ = Σ* − L.

Definition 2.11 (Concatenation of languages) Let L1 and L2 be two languages. Concatenation of L1 and L2 is the language L1L2 = {xy| x ∈ L1 and y ∈ L2}.

Definition 2.12 (Power of language)
Let L be a language. The power of L is defined in the following way:

• L^0 = {ε},

• L^i = LL^(i−1) for i ≥ 1.

Definition 2.13 (Iteration and positive iteration of language)
Let L be a language over Σ. Then,

• the iteration of the language L is defined as L* = ⋃_{i=0}^{∞} L^i, and

• the positive iteration of the language L is defined as L+ = ⋃_{i=1}^{∞} L^i (so L+ = L* − {ε} whenever ε ∉ L).
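Concatenation, power, and iteration can be illustrated on finite languages. Since L* is infinite whenever L contains a non-empty string, the following Python sketch (illustrative only) enumerates the union of powers only up to a chosen bound k.

```python
# Concatenation, power, and bounded iteration of finite languages
# (Definitions 2.11-2.13), with languages as Python sets of strings.

def concat(L1, L2):
    return {x + y for x in L1 for y in L2}

def power(L, i):
    result = {""}              # L^0 = {ε}
    for _ in range(i):
        result = concat(L, result)
    return result

def iterate(L, k):
    """Approximation of L*: the union of the powers L^0, ..., L^k."""
    result = set()
    for i in range(k + 1):
        result |= power(L, i)
    return result

L = {"a", "b"}
print(sorted(power(L, 2)))   # ['aa', 'ab', 'ba', 'bb']
print(sorted(iterate(L, 2)))  # all strings over {a, b} of length <= 2
```

For this L, `iterate(L, k)` contains exactly the strings over {a, b} of length at most k, so taking k to infinity would yield L* = Σ*.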

2.3 Grammars

In this section, we define the devices used for generating languages. Such devices are called grammars and play a central role in the theory of formal languages.

Definition 2.14 (Grammar)
A grammar is a quadruple G = (N, T, P, S), where

• N is an alphabet of nonterminal symbols (nonterminals),

• T is an alphabet of terminal symbols (terminals) and N ∩ T = ∅,

• P is a finite set of rules of the form α → β with α ∈ (T ∪ N)∗N(T ∪ N)∗ and β ∈ (T ∪ N)∗, and

• S ∈ N is the start symbol.

Any string x from (N ∪ T)* is called a sentential form of G, and if x contains no nonterminal symbols, then x is called a sentence. For two sentential forms uαv and uβv, if there is a rule r = α → β ∈ P, G can make a derivation step from uαv to uβv by rule r, mathematically written as uαv ⇒ uβv [r], or uαv ⇒ uβv for short. A rule of the form α → ε is called an epsilon rule.

Definition 2.15 (Sequence of derivations)
Consider a grammar G = (N, T, P, S) and sentential forms χ, χ′.

• χ ⇒^0 χ [ε], or shortly χ ⇒^0 χ.

• χ ⇒^n χ′ [π], or shortly χ ⇒^n χ′, for n ≥ 1, if there exist n + 1 sentential forms χ1, χ2, . . . , χn, χn+1 such that χ = χ1, χ′ = χn+1, χi ⇒ χi+1 [ri] for all i = 1, 2, . . . , n, and π = r1r2 . . . rn.

• χ ⇒* χ′ [π], or shortly χ ⇒* χ′, if χ ⇒^n χ′ [π] for some n ≥ 0.

• χ ⇒+ χ′ [π], or shortly χ ⇒+ χ′, if χ ⇒^n χ′ [π] for some n ≥ 1.

Mathematically, ⇒^k, ⇒*, and ⇒+ denote the k-fold product, reflexive-transitive closure, and transitive closure of ⇒, respectively.

Definition 2.16 (Language generated by grammar G)
Let G = (N, T, P, S) be a grammar. Then, the language generated by the grammar G, written as L(G), is the set of all sentences of G. Mathematically, L(G) = {w | w ∈ T*, S ⇒* w}.

Grammars G1 and G2 are said to be equivalent, if and only if they generate the same language, i.e. L(G1) = L(G2).
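Definition 2.16 can be illustrated by a brute-force enumeration of sentences. The following Python sketch is purely illustrative and assumes a context-free fragment of Definition 2.14: uppercase letters are nonterminals, lowercase letters are terminals, and every left-hand side is a single nonterminal. Sentential forms are explored breadth-first and pruned by their number of terminal symbols, which keeps the search finite for the sample grammar.

```python
from collections import deque

def generated_language(rules, start, max_len):
    """All sentences of length <= max_len, per Definition 2.16.

    A rule A -> beta is the pair (A, beta); nonterminals are
    uppercase letters, terminals lowercase letters.
    """
    seen, sentences = {start}, set()
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if not any(c.isupper() for c in form):
            sentences.add(form)      # no nonterminal left: a sentence
            continue
        for lhs, rhs in rules:
            i = form.find(lhs)
            while i != -1:           # rewrite every occurrence of lhs
                new = form[:i] + rhs + form[i + 1:]
                if sum(c.islower() for c in new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return sentences

# The grammar S -> aSb | ε generates { a^n b^n | n >= 0 }.
rules = [("S", "aSb"), ("S", "")]
print(sorted(generated_language(rules, "S", 6), key=len))
# ['', 'ab', 'aabb', 'aaabbb']
```

The pruning bound is adequate here because every sentential form of this grammar contains at most one nonterminal; a general enumerator would need a more careful bound.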

Chomsky Hierarchy of Languages

At the end of the 1950s, the linguist Noam Chomsky introduced an initial classification of grammars and the classes of languages they generate. Each of these grammars has a differently restricted set of rules (see [22]).

Definition 2.17 (Unrestricted and phrase-structure grammar)
A grammar G with no restriction on its rules is called an unrestricted grammar. An unrestricted grammar is called a phrase-structure grammar, PSG, if all its rules have the form α → β, where α ∈ N^+ and β ∈ (N ∪ T)*. The class of languages that both of these grammars define is referred to as the class of recursively enumerable languages, RE for short.

Definition 2.18 (Context-sensitive grammar)
An unrestricted grammar is called a context-sensitive grammar, CSG, if all its rules have the form α → β, where α ∈ (T ∪ N)*N(T ∪ N)*, β ∈ (N ∪ T)*, and either |α| ≤ |β|, or α = S and β = ε. The class of languages it describes is referred to as the class of context-sensitive languages, CS for short.

Definition 2.19 (Context-free and linear grammars)
An unrestricted grammar with all rules of the form A → β, where A ∈ N and β ∈ (N ∪ T)*, is called a context-free grammar, CFG; furthermore, if β ∈ T* ∪ T*NT*, we say that the grammar is linear, LNG for short. The classes of languages which can be generated by context-free and linear grammars are referred to as the class of context-free languages (CF) and the class of linear languages (LIN), respectively.

Definition 2.20 (Right-linear grammar)
An unrestricted grammar where all rules have the form A → aB, with A ∈ N, a ∈ T ∪ {ε}, and B ∈ N ∪ {ε}, is called a right-linear grammar, RLNG, and the class of languages generated by RLNGs is called the class of regular languages, REG for short.

Figure 2.1: Rule tree (left) and derivation tree (right)

For the classes of languages generated by regular, linear, context-free, context-sensitive, and unrestricted grammars, respectively, the following theorem holds (see [80]).

Theorem 2.21 REG ⊂ LIN ⊂ CF ⊂ CS ⊂ RE.

Sequences of derivation steps in a context-free grammar can be graphically expressed as a derivation tree representing the rules used during the generation.

Definition 2.22 (Derivation tree and ambiguity)
Let G = (N, T, P, S) be a CFG. For r = A → x ∈ P, the rule tree that represents r is denoted as A⟨x⟩. The derivation trees representing derivations in G are defined recursively as follows:

• A one-node tree X is the derivation tree corresponding to X ⇒^0 X in G, where X ∈ N ∪ T.

• Let d be the derivation tree representing A ⇒* uBv [ρ] with frontier(d) = uBv, and let l = B → z ∈ P. The derivation tree representing A ⇒* uBv [ρ] ⇒ uzv [l] is obtained by replacing the (|u|+1)st leaf (from the left) in d, B, with the rule tree corresponding to l, B⟨z⟩.

• A derivation tree in G is any tree t for which there is a derivation represented by t.

The set of all derivation trees in G with frontier x is denoted by GN(x). If there exists x ∈ L(G) such that |GN(x)| > 1, we say that the grammar G is ambiguous; otherwise, G is unambiguous. Furthermore, a context-free language L is inherently ambiguous if L is generated by no unambiguous grammar.

Figure 2.1 shows an example of the rule tree A⟨aAa⟩ (left) and a derivation tree representing the derivation S ⇒ aBCa ⇒ aaBcaa ⇒ aaaca (right).

2.4 Automata

In this section, basic devices for recognition of strings belonging to a given language are defined. The notation is based on [80].

Finite Automata

Definition 2.23 (Finite automaton)
A finite automaton, FA, is a quintuple M = (Q, Σ, R, q0, F), where

• Q is a finite set of states,

• Σ is an input alphabet,

• R is a finite set of transition rules of the form pa → q with p, q ∈ Q and a ∈ Σ ∪ {ε},

• q0 ∈ Q is the initial state, and

• F ⊆ Q is a set of final states.

Let M = (Q, Σ, R, q0, F) be an FA. A configuration of M is any word w ∈ QΣ*. For two configurations paw and qw, if there exists a rule r = pa → q ∈ R, then M can make a move from paw to qw by r, written as paw ⊢ qw [r], or simply paw ⊢ qw.

Definition 2.24 (Sequence of moves in finite automaton)
Let M = (Q, Σ, R, q0, F) be an FA and χ, χ′ be two configurations of M.

• χ ⊢^0 χ′ [ε], or simply χ ⊢^0 χ′, iff χ = χ′.

• χ ⊢^i χ′ [π], or simply χ ⊢^i χ′, iff there are i + 1 configurations χ1, . . . , χi+1 such that χ = χ1, χ′ = χi+1, χj ⊢ χj+1 [rj] for all 1 ≤ j ≤ i, and π = r1r2 . . . ri.

• χ ⊢* χ′ [π], or simply χ ⊢* χ′, iff there is an integer i ≥ 0 such that χ ⊢^i χ′ [π].

• χ ⊢+ χ′ [π], or simply χ ⊢+ χ′, iff there is an integer i ≥ 1 such that χ ⊢^i χ′ [π].

Mathematically, ⊢^k, ⊢+, and ⊢* denote the k-fold product, transitive closure, and reflexive-transitive closure of ⊢, respectively.

Definition 2.25 (Language defined by finite automaton)
Let M = (Q, Σ, R, q0, F) be an FA. The language of the FA M is defined as L(M) = {w | w ∈ Σ*, q0w ⊢* f, and f ∈ F}.
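Definition 2.25 suggests a direct simulation: track the set of states reachable after each input symbol, taking ε-rules into account. The following Python sketch is illustrative only; the encoding of a rule pa → q as a triple (p, a, q), with a = "" for ε, is an assumption of the sketch, not notation of the thesis.

```python
# Simulation of Definition 2.25: the automaton accepts w iff some
# final state is reachable after consuming all of w.

def eps_closure(states, rules):
    """All states reachable from `states` by ε-rules alone."""
    closure = set(states)
    changed = True
    while changed:
        changed = False
        for (p, a, q) in rules:
            if a == "" and p in closure and q not in closure:
                closure.add(q)
                changed = True
    return closure

def accepts(rules, q0, final, w):
    current = eps_closure({q0}, rules)
    for symbol in w:
        current = {q for (p, a, q) in rules if p in current and a == symbol}
        current = eps_closure(current, rules)
    return bool(current & final)

# Automaton for { w in {a,b}* | w ends with "ab" }.
rules = [("s", "a", "s"), ("s", "b", "s"), ("s", "a", "p"), ("p", "b", "f")]
print(accepts(rules, "s", {"f"}, "aab"))  # True
print(accepts(rules, "s", {"f"}, "aba"))  # False
```

The subset tracking handles non-determinism: the sample automaton guesses, on reading an a, whether it is the start of the final "ab".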

Pushdown Automata Definition 2.26 () Pushdown automaton, PDA, is a septuple M = (Q, Σ, Γ, δ, q0, z0,F ), where • Q is a finite set of states,

• Σ is an input alphabet,

• Γ is a pushdown alphabet,

• R is a finite set of transition rules of the form zpa → γq with p, q ∈ Q, a ∈ Σ ∪ {ε}, z ∈ Γ, and γ ∈ Γ∗,

• z0 ∈ Γ is the first pushdown symbol,

• q0 ∈ Q is the start state, and • F ⊆ Q is a set of final states.

Let M = (Q, Σ, Γ, R, q0, z0, F ) be a PDA. A configuration of M is any word w ∈ Γ∗QΣ∗. For two configurations of M, γApaw and γxqw, if there exists a rule r = Apa → xq ∈ R, then M can make a move from γApaw to γxqw, written as γApaw ` γxqw [r], or simply γApaw ` γxqw.

As usual, we define `k, `+, and `∗.

Definition 2.27 (Languages defined by pushdown automaton) Let M = (Q, Σ, Γ, R, q0, z0, F ) be a PDA. The language of M

• accepted by empty pushdown and final state is L(M) = {w| w ∈ Σ∗, z0q0w `∗ f, and f ∈ F };

• accepted by final state is Lf (M) = {w| w ∈ Σ∗, z0q0w `∗ γf, γ ∈ Γ∗, and f ∈ F };

• accepted by empty pushdown is Lε(M) = {w| w ∈ Σ∗, z0q0w `∗ q, and q ∈ Q}.

Definition 2.28 (k-Turn pushdown automaton) A k-turn PDA is a PDA in which the length of the pushdown alternately increases and decreases at most k times during any computation of the automaton.

Definition 2.29 (Stateless pushdown automaton) A pushdown automaton M is stateless, an SPDA, if it has exactly one state.

Turing Machine

Definition 2.30 (Turing machine) A Turing machine, TM, is a 6-tuple M = (Q, Σ, Γ, R, q0, F ), where

• Q is a finite set of states,

• Σ is an input alphabet,

• Γ is a tape alphabet with □ ∈ Γ and Σ ⊆ Γ,

• R is a finite set of rules of the form pa → qbt, where p, q ∈ Q, a, b ∈ Γ, and t ∈ {S, R, L},

• q0 ∈ Q is the start state, and

• F ⊆ Q is a set of final states.

Definition 2.31 (Configuration of Turing machine) Let M = (Q, Σ, Γ, R, q0, F ) be a TM. A configuration of M is a string χ = xpy, where x ∈ Γ∗, p ∈ Q, and y ∈ Γ∗(Γ − {□}) ∪ {ε}.

Definition 2.32 (Move in Turing machine) Let M = (Q, Σ, Γ, R, q0, F ) be a TM and let χ and χ′ be two configurations of M. Then,

• M makes a stationary move from χ to χ′ according to r, written as χ `S χ′ [r], or simply χ `S χ′, if χ = xpay, χ′ = xqby, and r = pa → qbS ∈ R.

• M makes a right move from χ to χ′ according to r, written as χ `R χ′ [r], or simply χ `R χ′, if χ = xpay, r = pa → qbR ∈ R, and

(1) χ′ = xbqy, y ≠ ε, or
(2) χ′ = xbq, y = ε.

• M makes a left move from χ to χ′ according to r, written as χ `L χ′ [r], or simply χ `L χ′, if

(1) χ = xcpay, χ′ = xqcby, y ≠ ε or b ≠ □, r = pa → qbL ∈ R, or
(2) χ = xcpa, χ′ = xqc, r = pa → q□L ∈ R.

• M makes a move from χ to χ′ according to r, written as χ ` χ′ [r], or simply χ ` χ′, if χ `X χ′ [r] for some X ∈ {S, R, L}.

`k, `+, and `∗ are defined as usual.

Definition 2.33 (Language accepted by Turing machine) Let M = (Q, Σ, Γ, R, q0, F ) be a TM. The language accepted by M, L(M), is defined as L(M) = {w| w ∈ Σ∗, q0w `∗ xfy; x, y ∈ Γ∗, f ∈ F } ∪ {ε| q0 `∗ xfy; x, y ∈ Γ∗, f ∈ F }.

Definition 2.34 (Time complexity) Let M be a Turing machine. M runs in O(n^k) time if there exists some constant c such that M makes at most cn^k moves on any input of length n, for k, n ≥ 0.

Definition 2.35 (Linear bounded automaton) A linear bounded automaton, LBA, is a TM that cannot extend its tape by any rule.

For the classes of languages accepted by the defined automata, the following theorem holds (see [80]).

Theorem 2.36 Let X ∈ {LBA, TM, FA, PDA, PDAf , PDAε, SPDA, k-turn PDA| k ≥ 1}, where PDA, PDAf , and PDAε denote PDA accepting by final state and empty pushdown, PDA accepting by final state, and PDA accepting by empty pushdown, respectively; and let L (X) denote the class of languages accepted by X. Then,

• REG = L (FA),

• LIN = L (1-turn PDA),

• CF = L (PDA) = L (PDAf ) = L (PDAε),

• CS = L (LBA), and

• RE = L (TM).

2.5 Regulated Formal Models

There is no doubt that, over the history of formal language theory, the most studied languages were the regular and the context-free languages. The main reason probably lies in the simplicity, the natural graphical representation, and the practical usefulness of the models defining these languages. However, for some practical applications, it was discovered that context-free languages are not sufficient. On the other hand, using models defining the whole class of context-sensitive languages was inefficient and needlessly powerful. Therefore, new ways to describe languages of such types were sought. In this section, we briefly recall several approaches, needed for this thesis, that increase the power of existing formal models by regulation. Furthermore, in Figure 2.2, where A → B and A ↔ B denote A ⊂ B and A = B, respectively, we show a hierarchy of the languages defined by these models. More details about regulation can be found in the books [37, 85] and the papers [4, 8, 35, 41, 45, 57, 77, 79, 78, 81, 65, 83].

Definition 2.37 (Programmed grammar) A programmed grammar (see [42]) is a septuple G = (N, T, S, P, Λ, σ, φ), where

• N and T are alphabets such that N ∩ T = ∅,

• S ∈ N,

• P is a finite set of productions of the form A → β, where A ∈ N and β ∈ (N ∪ T )∗,

• Λ is a finite set of labels for the productions in P ; Λ can be interpreted as a function which outputs a production when being given a label,

• σ and φ are functions from Λ into 2^Λ.

For (x, r1), (y, r2) ∈ (N ∪ T )∗ × Λ and Λ(r1) = (α → β), we write (x, r1) ⇒ (y, r2) iff either x = x1αx2, y = x1βx2, and r2 ∈ σ(r1), or x = y, the rule α → β is not applicable to x, and r2 ∈ φ(r1). The language of G is denoted by L(G) and defined as L(G) = {w| w ∈ T∗, (S, r1) ⇒∗ (w, r2), for some r1, r2 ∈ Λ}. Let L (P, ac) denote the class of languages generated by programmed grammars. If φ(r) = ∅, for each r ∈ Λ, we are led to the class L (P).
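To illustrate how the success field σ and the failure field φ steer a derivation, the following sketch (our illustration, not from the thesis; the grammar G is a textbook-style example with hypothetical labels) enumerates the language of a programmed grammar with appearance checking up to a length bound. The grammar generates {a^(2^n)| n ≥ 1}.

```python
from collections import deque

def programmed_language(rules, start, max_len):
    """Enumerate L(G) of a programmed grammar up to a length bound.

    `rules` maps a label r to (A, beta, sigma, phi): the production A -> beta,
    its success field sigma and its failure field phi, as in Definition 2.37.
    Nonterminals are upper-case letters, terminals lower-case."""
    todo = deque((start, r) for r in rules)
    seen, words = set(todo), set()
    while todo:
        x, r = todo.popleft()
        if x.islower():                    # x is a terminal string: collect it
            words.add(x)
            continue
        A, beta, sigma, phi = rules[r]
        if A in x:                         # apply A -> beta to any occurrence
            succ = [(x[:i] + beta + x[i+1:], r2)
                    for i in range(len(x)) if x[i] == A for r2 in sigma]
        else:                              # appearance checking: x unchanged
            succ = [(x, r2) for r2 in phi]
        for y, r2 in succ:
            if len(y) <= max_len and (y, r2) not in seen:
                seen.add((y, r2))
                todo.append((y, r2))
    return words

# A hypothetical programmed grammar with appearance checking for {a^(2^n)}:
# 1: A -> BB (double), 2: B -> A (reset), 3: B -> a (finish).
G = {1: ("A", "BB", {1}, {2, 3}),
     2: ("B", "A",  {2}, {1}),
     3: ("B", "a",  {3}, set())}
```

Because σ(1) = {1}, rule 1 must double every A before the failure field φ(1) allows switching to rules 2 or 3; this all-or-nothing behaviour is what yields the non-context-free language.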

Let G be a programmed grammar. For a derivation D : S = w1 ⇒ w2 ⇒ . . . ⇒ wn = w, w ∈ T∗, of G, ind(D, G) = max({occur(wi, N)| 1 ≤ i ≤ n}), and for w ∈ T∗, ind(w, G) = min({ind(D, G)| D is a derivation of w in G}). The index of G is ind(G) = sup({ind(w, G)| w ∈ L(G)}). For a language L in the class L (P) generated by programmed grammars, ind(L) = inf({ind(G)| L(G) = L}). For the class L (P), Ln(P) = {L| L ∈ L (P) and ind(L) ≤ n}, for n ≥ 1 (see [42]).

Definition 2.38 (Matrix grammar) A matrix grammar, MAT, is a pair H = (G, C), where G = (N, T, P, S) is a context-free grammar and C ⊂ P∗ is a finite set of strings called matrices. A sentential form of H is any string from (N ∪ T )∗. Let u, v be two sentential forms. Then, we say that H makes a derivation step from u to v according to m, written as u ⇒ v [m], or simply u ⇒ v, if m = p1 . . . ps ∈ C and there are v0, . . . , vs, where v0 = u, vs = v, and v0 ⇒ v1 [p1] ⇒ . . . ⇒ vs [ps] in G. Let ⇒∗ and ⇒+ denote the reflexive-transitive and transitive closure of ⇒, respectively. The language of H is defined as L(H) = {w| S ⇒ w1 [m1] ⇒ . . . ⇒ wn [mn], wn = w, m1, . . . , mn ∈ C, w ∈ T∗, n ≥ 0}. The class of languages generated by matrix grammars is denoted by L (MAT).

Definition 2.39 (Matrix grammar with appearance checking) A matrix grammar with appearance checking, MATac, is a pair H = (G, C), where G = (N, T, P, S) is a context-free grammar and C is a finite set of strings, matrices, of pairs (p, t) with p ∈ P and t ∈ {−, +}. A sentential form of H is any string from (N ∪ T )∗. Let u, v be two sentential forms. Then, we say that H makes a derivation step from u to v according to m, written as u ⇒ v [m], or simply u ⇒ v, if m = (p1, t1) . . . (ps, ts) ∈ C and there are v0, . . . , vs, where v0 = u, vs = v, and for all i = 0, . . . , s − 1, either vi ⇒ vi+1 [pi+1] in G, or ti+1 ∈ {−}, vi = vi+1, and pi+1 is not applicable to vi in G. Let ⇒∗ and ⇒+ denote the reflexive-transitive and transitive closure of ⇒, respectively. The language of H is defined as L(H) = {w| S ⇒ w1 [m1] ⇒ . . . ⇒ wn [mn], wn = w, m1, . . . , mn ∈ C, w ∈ T∗, n ≥ 0}. The class of languages generated by matrix grammars with appearance checking is denoted by L (MATac).

[Figure 2.2: Hierarchy of languages — relating LIN, CF, Ln(P), L (P), L (n-PBCA), L (MAT), CS, L (P, ac), L (MAT, ac), L (n-CA), L (2-CA), and RE.]

Definition 2.40 (Counter automaton) A k-counter automaton, k-CA, is a finite automaton M = (Q, Σ, δ, q0, F ) with k integers v = (v1, . . . , vk) in N0^k as an additional storage. Transition rules in δ are of the form pa → q(t1, . . . , tk), where p, q ∈ Q, a ∈ Σ ∪ {ε}, and ti ∈ {−} ∪ Z. A configuration of a k-CA is any string from QΣ∗N0^k. Let χ1 = paw(v1, . . . , vk) and χ2 = qw(v1′, . . . , vk′) be two configurations of M and let r = pa → q(t1, . . . , tk) ∈ δ, where the following holds: if ti ∈ Z, then vi′ = vi + ti; otherwise, vi = vi′ = 0. Then, M makes a move from configuration χ1 to χ2 according to r, written as χ1 ⇒ χ2 [r], or simply χ1 ⇒ χ2. ⇒∗ and ⇒+ represent the reflexive-transitive and transitive closure of ⇒, respectively. The language of M is defined as L(M) = {w| w ∈ Σ∗, q0w(0, . . . , 0) ⇒∗ f(0, . . . , 0), f ∈ F }.

Definition 2.41 (Partially blind k-counter automaton) A partially blind k-counter automaton, k-PBCA, is a finite automaton M = (Q, Σ, δ, q0, F ) with k integers v = (v1, . . . , vk) in N0^k as an additional storage. Transition rules in δ are of the form pa → qt, where p, q ∈ Q, a ∈ Σ ∪ {ε}, and t ∈ Z^k. As a configuration of a k-PBCA we understand any string from QΣ∗N0^k. Let χ1 = paw(v1, . . . , vk) and χ2 = qw(v1′, . . . , vk′) be two configurations of M and let r = pa → q(t1, . . . , tk) ∈ δ, where (v1 + t1, . . . , vk + tk) = (v1′, . . . , vk′). Then, M makes a move from χ1 to χ2 according to r, written as χ1 ⇒ χ2 [r], or simply χ1 ⇒ χ2. ⇒∗ and ⇒+ represent the reflexive-transitive and transitive closure of ⇒, respectively. The language of M is defined as L(M) = {w| w ∈ Σ∗, q0w(0, . . . , 0) ⇒∗ f(0, . . . , 0), f ∈ F }.
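Since configurations stay in QΣ∗N0^k, a move that would drive some counter below zero is simply undefined; that blocking is the only way a partially blind automaton "senses" its counters, and acceptance additionally requires every counter to return to zero in a final state. A minimal sketch (our illustration; the rule set R and state names are hypothetical):

```python
from collections import deque

def pbca_accepts(rules, q0, final, k, w):
    """Simulate a partially blind k-counter automaton on input w.

    `rules` contains tuples (p, a, q, t) with t a k-tuple of integers,
    mirroring pa -> q(t1, ..., tk); a is a symbol or "" for an epsilon-move.
    A move that would make any counter negative is blocked."""
    start = (q0, w, (0,) * k)
    todo, seen = deque([start]), {start}
    while todo:
        p, rest, v = todo.popleft()
        if p in final and rest == "" and all(x == 0 for x in v):
            return True                       # final state, counters at zero
        for (pp, a, q, t) in rules:
            if pp == p and rest.startswith(a):
                nv = tuple(x + y for x, y in zip(v, t))
                if all(x >= 0 for x in nv):   # stay inside N0^k
                    nxt = (q, rest[len(a):], nv)
                    if nxt not in seen:
                        seen.add(nxt)
                        todo.append(nxt)
    return False

# A hypothetical 1-PBCA for {a^n b^n | n >= 1}: count the a's up, the b's down.
R = [("p", "a", "p", (1,)), ("p", "b", "q", (-1,)), ("q", "b", "q", (-1,))]
```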

Chapter 3

Systems of Formal Models

Unlike the classic theory of formal languages and automata, which studies models accepting or generating a language by one automaton or grammar, modern computer science aims to distribute this computation. The main reasons follow from the necessities and possibilities of computer networks, distributed databases, parallel processors, etc., which give us new terms such as distribution, communication, concurrency, and parallelism. A formal system is defined as a set of formal models working together under a specified protocol. Such systems have many advantages. For example, the generative or accepting power of the used models usually increases, the (descriptional) complexity of a language decreases, there is a possibility of parallel cooperation, etc. The main role in the theory of formal systems is played by the cooperation protocols and the formal models used. This chapter considers four basic classes of systems of formal models: sequential grammar systems, parallel grammar systems, sequential automata systems, and parallel automata systems.

3.1 Cooperating Distributed Grammar System

A cooperating distributed grammar system, CD grammar system for short, was first introduced in [86] in relation to two-level grammars. Several years later, after this system had been investigated in relation to multi-agent systems and blackboard problem-solving architectures in [23], the study of CD grammar systems became an intense research area. A CD grammar system consists of a finite number of grammars, called components, which symbolize agents. The common sentential form, which the agents sequentially modify according to a mode given by a certain protocol, represents the current state of the problem to be solved. The authors of [23] considered five modes under which agents work: the ∗-mode, where the active agent works as long as it wants; the t-mode, where the active agent works as long as it is able to; and the ≥ k, ≤ k, and = k modes, which correspond to a time limitation of the agents' activity, where the active agent has to make i steps with i ≥ k, i ≤ k, and i = k, respectively. If a terminal string is generated, the problem is solved (see Definitions 3.1 through 3.4, which specify CD grammar systems in terms of formal languages).

Definition 3.1 (Cooperating distributed grammar system) A cooperating distributed grammar system, a CD grammar system for short, is an (n + 3)- tuple Γ = (N, T, S, P1,...,Pn), where N,T are alphabets such that N ∩ T = ∅, V = N ∪ T , S ∈ N, and Gi = (N,T,Pi,S), 1 ≤ i ≤ n, is a context-free grammar.

Definition 3.2 (Mode of derivation in CD grammar systems) Let Γ = (N, T, S, P1, . . . , Pn) be a CD grammar system.

• For every i = 1, 2, . . . , n, the terminating derivation by the ith component, written as ⇒t_Pi, is defined as

x ⇒t_Pi y iff x ⇒∗_Pi y and there is no z ∈ V∗ such that y ⇒_Pi z.

• For every i = 1, 2, . . . , n, the k-steps derivation by the ith component, written as ⇒=k_Pi, is defined as

x ⇒=k_Pi y iff there are x1, . . . , xk+1 such that x = x1, y = xk+1, and for every j = 1, . . . , k, xj ⇒_Pi xj+1.

• For every i = 1, 2, . . . , n, the at most k-steps derivation by the ith component, written as ⇒≤k_Pi, is defined as

x ⇒≤k_Pi y iff x ⇒=k′_Pi y for some k′ ≤ k.

• For every i = 1, 2, . . . , n, the at least k-steps derivation by the ith component, written as ⇒≥k_Pi, is defined as

x ⇒≥k_Pi y iff x ⇒=k′_Pi y for some k′ ≥ k.

Definition 3.3 (Language generated by a CD grammar system) Let Γ = (N, T, S, P1, . . . , Pn) be a CD grammar system and let f ∈ D be a mode of derivation, where D = {∗, t} ∪ {≤ k, = k, ≥ k| k ∈ N}. Then, the language generated by Γ, Lf (Γ), is

Lf (Γ) = {ω ∈ T∗| S ⇒f_Pi1 ω1 ⇒f_Pi2 . . . ⇒f_Pim ωm = ω, m ≥ 1, 1 ≤ j ≤ m, 1 ≤ ij ≤ n}.
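The t-mode can be simulated by a brute-force search: a t-mode phase of a component collects all sentential forms from which that component cannot make another step. The sketch below is our illustration, not from the thesis; the two-component system (P1, P2) is a hypothetical example with upper-case letters as nonterminals, and its t-mode language is {aⁿbⁿ | n ≥ 1}.

```python
from collections import deque

def one_step(form, component):
    """All forms obtained from `form` by one derivation step of a component,
    given as a list of context-free rules (A, beta)."""
    return {form[:i] + beta + form[i + 1:]
            for (A, beta) in component
            for i in range(len(form)) if form[i] == A}

def t_mode(form, component, max_len):
    """All y with form =>* y such that no rule of the component applies to y
    (the t-mode of Definition 3.2), explored up to a length bound."""
    todo, seen, results = deque([form]), {form}, set()
    while todo:
        x = todo.popleft()
        succ = one_step(x, component)
        if not succ:                       # the component can work no longer
            results.add(x)
        for y in succ:
            if len(y) <= max_len and y not in seen:
                seen.add(y)
                todo.append(y)
    return results

def language_t(components, start, max_len):
    """Terminal strings of L_t(Gamma), obtained by iterating t-mode phases
    of non-deterministically chosen components."""
    todo, seen, words = deque([start]), {start}, set()
    while todo:
        x = todo.popleft()
        for P in components:
            for y in t_mode(x, P, max_len):
                if y.islower():            # terminal string: collect it
                    words.add(y)
                elif y not in seen:
                    seen.add(y)
                    todo.append(y)
    return words

# A hypothetical two-component system with L_t = {a^n b^n | n >= 1}:
# P1 builds a^n S b^n and must finish its phase with S -> A; P2 ends with A -> ab.
P1 = [("S", "aSb"), ("S", "A")]
P2 = [("A", "ab")]
```

Note that P1 cannot stop while S is present, so each of its t-mode phases necessarily ends with the rule S → A; this is what the t-mode adds over the ∗-mode.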

Definition 3.4 (Classes of languages generated by CD grammar systems) We denote the classes of languages generated by CD grammar systems by L (CD, n, f), where f ∈ {∗, t} ∪ {= k, ≤ k, ≥ k| k ∈ N} is the mode of derivation and n ∈ N ∪ {∞} is the number of components.

By the following theorems, we summarize selected basic results regarding the power of CD grammar systems.

Theorem 3.5 CF = L (CD, 1, t) = L (CD, 2, t) ⊂ L (CD, 3, t) = L (CD, ∞, t) ⊂ CS.

Theorem 3.6 If f ∈ {= 1, ≥ 1, ∗} ∪ {≤ k| k ≥ 1}, then L (CD, ∞, f) = CF.

Theorem 3.7 CF = L (CD, 1, f) ⊂ L (CD, 2, f) ⊆ L (CD, r, f) ⊆ L (CD, ∞, f) ⊆ L (MAT), for all f ∈ {= k, ≥ k| k ≥ 2} and r ≥ 3.

Other Variants of CD Grammar Systems

The standard CD grammar systems defined above use only conditions saying when the enabled component can, or has to, stop working on a sentential form. The selection of the component to work is non-deterministic. However, in [26], [24], [33], [6], etc., one can find discussions of many variants of CD grammar systems with several approaches to selecting the working components.

As a natural extension of CD grammar systems, Mitrana and Păun introduced hybrid cooperating distributed grammar systems in [90] and [98]. In contrast with CD grammar systems, where all components work in the same mode, these systems consist of components working in different modes. The generative power of CD grammar systems can also be increased by teams. This idea was introduced and first investigated in [63]. Formally, a CD grammar system with teams is defined as a tuple Γ = (N, T, S, P1, . . . , Pn, R1, . . . , Rm), where (N, T, S, P1, . . . , Pn) is an ordinary CD grammar system and Ri ⊆ {P1, . . . , Pn} is a team, for all i = 1, . . . , m. At one moment, the components of a team simultaneously rewrite the corresponding parts of a shared sentential form. Precisely, x ⇒Ri y iff x = x1A1x2 . . . xsAsxs+1 and y = x1y1x2 . . . xsysxs+1, where xj ∈ (N ∪ T )∗ for all j = 1, . . . , s + 1, and Ak → yk ∈ P ∈ Ri for all k = 1, . . . , s. Based on this one-step derivation, the k-steps derivation, at most k-steps derivation, at least k-steps derivation, and derivation of any number of steps are defined as usual. Only the terminating derivation has three variants: the active team stops working if the team as a whole cannot perform any further step, if no component can apply any of its rules, or if at least one component cannot rewrite any symbol of the current sentential form (see [63, 46, 99]). Besides the mentioned variants, many others have appeared in the literature since the introduction of ordinary CD grammar systems in [86] and [23], e.g. CD grammar systems with external storage (see [38, 108, 40, 43]), CD grammar systems consisting of different components (see [110, 51, 74, 28]), hierarchical systems (see [3]), deterministic systems (see [88]), etc.

3.2 Parallel Communicating Grammar Systems

Parallel communicating grammar systems, PC grammar systems for short, were introduced in [100]. These systems consist of a finite number of grammars (components), each of which works on its own sentential form. The components are synchronized and make derivation steps concurrently. During a derivation, communication is performed through special query symbols. Whenever at least one component generates a query symbol, all components suspend generating and the grammar system makes a communication step—that is, for every component in the system, each occurrence of a query symbol in its sentential form is replaced by the sentential form of the component to which the query symbol points. One component of the system is called the master, and the language of the master is the language of the PC grammar system. Similarly to the case of CD grammar systems, the theory of formal languages studies different variants of PC grammar systems, such as

• returning PC grammar systems, where each component that has sent its sentential form to another one restarts from its start nonterminal;

• centralized PC grammar systems, where only the master can generate query symbols;

• non-synchronized PC grammar systems, where all components include rules of the form A → A for every nonterminal symbol A;

• PC grammar systems with communication by commands, where each component has a control language and, in a certain situation, all components send their current sentential form to the other components owning a control language to which the sentential form belongs;

• PC grammar systems with languages given by the concatenation of all the strings over terminal symbols after the end of the generation;

• PC grammar systems using query strings instead of query symbols, where the communication step is done after at least one component generates a query string pointing to another component;

• PC grammar systems where the components make different numbers of steps;

• and many others; see [26], [25], [60], [104], etc.

Probably the most important features of parallel communicating grammar systems are the communication protocol and the types of the components used, together with the way they work. A further important feature is synchronization. Habitually, the synchronization of the components is done by a universal clock (each component makes one derivation step in each time unit), but other synchronization mechanisms are also studied (see [26], [97], [30]). Two of the most natural variants are synchronization by rules, which can be applied simultaneously, and synchronization by nonterminals, which can be rewritten in the same time unit. Both of these approaches were used by Lukáš and Meduna in [82] and [70], where they investigated multi-generative grammar systems.

Multi-Generative Grammar Systems

Multi-generative grammar systems are a variant of parallel communicating grammar systems where the communication is provided only by synchronization. This synchronization restricts either the rules that can be used in each common derivation step, or the nonterminals that can be simultaneously rewritten. For a successful generation, all components have to produce sentences at the same time. Lukáš and Meduna have considered three types of languages defined by multi-generative grammar systems—languages consisting of all sentences produced by all components, languages consisting of the concatenations of all sentences produced by all components, and languages consisting of the sentences produced by the first component of a multi-generative grammar system.

Definition 3.8 (Multi-generative nonterminal synchronized grammar system) A multi-generative nonterminal synchronized grammar system, GN, is an (n + 1)-tuple

Γ = (G1,...,Gn,Q), where

• Gi = (Ni,Ti,Pi,Si) is a context-free grammar, for all i = 1, . . . , n,

• Q is a finite set of control n-tuples of the form (A1,...,An), where Ai ∈ Ni for all i = 1, . . . , n.

Definition 3.9 (Multi-generative rule synchronized grammar system) A multi-generative rule synchronized grammar system, GR, is an (n + 1)-tuple

Γ = (G1,...,Gn,Q), where

• Gi = (Ni,Ti,Pi,Si) is a context-free grammar for all i = 1, . . . , n,

• Q is a finite set of control n-tuples of the form (r1, . . . , rn), where ri ∈ Pi for all i = 1, . . . , n.

Definition 3.10 (Multi-sentential form) Let Γ = (G1, . . . , Gn, Q) be either a GN or a GR. Then a multi-sentential form is an n-tuple χ = (x1, . . . , xn), where xi ∈ (Ni ∪ Ti)∗ for all i = 1, . . . , n.

Definition 3.11 (Derivation step in GN) Let Γ = (G1, . . . , Gn, Q) be a GN and let χ = (u1A1v1, . . . , unAnvn), χ′ = (u1x1v1, . . . , unxnvn) be two multi-sentential forms, where Ai ∈ Ni and ui, vi, xi ∈ (Ni ∪ Ti)∗ for all i = 1, . . . , n. Let Ai → xi ∈ Pi for all i = 1, . . . , n, and (A1, . . . , An) ∈ Q. Then, χ directly derives χ′, written as χ ⇒ χ′.

Definition 3.12 (Derivation step in GR) Let Γ = (G1, . . . , Gn, Q) be a GR and let χ = (u1A1v1, . . . , unAnvn), χ′ = (u1x1v1, . . . , unxnvn) be two multi-sentential forms, where Ai ∈ Ni and ui, vi, xi ∈ (Ni ∪ Ti)∗ for all i = 1, . . . , n. Let ri = Ai → xi ∈ Pi for all i = 1, . . . , n, and (r1, . . . , rn) ∈ Q. Then, χ directly derives χ′, written as χ ⇒ χ′.

Definition 3.13 (Multi-language generated by GN and GR) Let Γ = (G1, . . . , Gn, Q) be either a GN or a GR. Then, the n-language generated by Γ, n-L(Γ), is

n-L(Γ) = {(w1, . . . , wn)| (S1, . . . , Sn) ⇒∗ (w1, . . . , wn), wi ∈ Ti∗ for all i = 1, . . . , n}.
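The synchronized steps of a GR can be simulated by choosing one control tuple per step and rewriting one occurrence in every component simultaneously (Definition 3.12). The sketch below is our illustration, not from the thesis; the control set Q describes a hypothetical 2-GR whose 2-language is {(aⁿ, bⁿ)| n ≥ 1}.

```python
from collections import deque

def gr_n_language(Q, starts, max_len):
    """Enumerate the n-language of a rule-synchronized system (GR): in each
    step one control tuple from Q is chosen and the ith component applies its
    rule (A, beta) of that tuple to one occurrence of A."""
    todo, seen = deque([tuple(starts)]), {tuple(starts)}
    result = set()
    while todo:
        forms = todo.popleft()
        if all(f.islower() for f in forms):    # every component is terminal
            result.add(forms)
            continue
        for tup in Q:
            # all ways each component can apply its rule of the control tuple
            options = [[f[:i] + beta + f[i + 1:]
                        for i in range(len(f)) if f[i] == A]
                       for f, (A, beta) in zip(forms, tup)]
            if not all(options):               # some component cannot move
                continue
            combos = [()]
            for opts in options:               # cartesian product of choices
                combos = [c + (o,) for c in combos for o in opts]
            for nxt in combos:
                if all(len(f) <= max_len for f in nxt) and nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    return result

# A hypothetical 2-GR whose 2-language is {(a^n, b^n) | n >= 1}.
Q = [(("S", "aS"), ("T", "bT")),   # grow both components in the same step
     (("S", "a"), ("T", "b"))]     # finish both components together
```

A GN is simulated analogously, except that a control tuple constrains only the nonterminals to be rewritten, so each component may pick any of its rules for that nonterminal.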

Definition 3.14 (Languages of GN and GR) Let Γ = (G1, . . . , Gn, Q) be either a GN or a GR. Then, we define

• the language generated by Γ in the union mode, L∪(Γ), as

L∪(Γ) = ⋃_{i=1}^{n} {wi| (w1, . . . , wn) ∈ n-L(Γ)},

• the language generated by Γ in the concatenation mode, L•(Γ), as

L•(Γ) = {w1 . . . wn| (w1, . . . , wn) ∈ n-L(Γ)},

• the language generated by Γ in the first-component-selection mode, L1(Γ), as

L1(Γ) = {w1| (w1, . . . , wn) ∈ n-L(Γ)}.

Definition 3.15 (Canonical multi-generative grammar systems) We say that GN and GR are canonical if all the components of GN and GR can make only the leftmost derivations, i.e. only the leftmost nonterminal can be rewritten in each sentential form. Canonical multi-generative rule synchronized grammar systems and canon- ical multi-generative nonterminal synchronized grammar systems are denoted by CGR and CGN, respectively.

[Figure 3.1: Hierarchy of languages (it is considered that k ≥ 2 and f ∈ {∪, •, 1}) — relating REG, LIN, CF, MAT, CS, and RE to L (k-f GN), L (k-f GR), L (k-f CGN), L (k-f CGR), and the classes L (NPC, CF), L (PC, CF), L (NPC, REG), L (PC, REG), L (NCPC, CF), L (CPC, CF), L (NCPC, REG), and L (CPC, REG).]

Convention 3.16 If attention is paid to the number of components in a multi-generative grammar system, we use the terms n-generative grammar system, n-GN, n-GR, n-CGR, n-CGN, sentential n-form, and n-language, for some positive integer n, rather than multi-generative grammar system, GN, GR, CGN, CGR, multi-sentential form, and multi-language, respectively.

Definition 3.17 (Classes of n-GN, n-CGN, n-GR, and n-CGR n-languages) Let X ∈ {GN, CGN, GR, CGR}. The class of n-languages of n-X, L (nX), is defined as L (nX) = {n-L| n-L is an n-language generated by n-X}.

Definition 3.18 (Classes of n-GN, n-GR, n-CGN, and n-CGR languages) Let X ∈ {GN, CGN, GR, CGR} and f ∈ {∪, •, 1}. The class of languages generated by an n-X in f-mode, L (n-f X), is defined as L (n-f X) = {L| L is a language generated in the f-mode by n-X}.

Let L (XPC, Y ) with X ∈ {ε, C, N, NC} and Y ∈ {REG, CF} denote the classes of languages generated by XPC grammar systems with an unlimited number of components, where N and C before PC mean that the PC grammar systems are non-returning and centralized, respectively; furthermore, Y = REG and Y = CF mean that the components of the systems are regular grammars and context-free grammars, respectively. In Figure 3.1, one can see several important relationships between the classes of languages defined by parallel grammar systems. The results are taken from [104], [82], and [70].

3.3 Automata Systems

Automata and automata systems are used in many areas of computer science. One can find them in computer networks, formal analysis and verification, pattern matching, parallel computers, DNA computing, artificial intelligence, etc. In this section, we briefly outline several important cooperating models in terms of the theory of formal languages. A multiprocessor automaton is based upon finite automata, called processors. These processors are coordinated by a central arbiter determining which processor is to become active or inactive (an inactive processor preserves its configuration) at a given step. The only information that the arbiter has for deciding about the automata's activities is the current state of each automaton and the number of steps performed by the active automata (see [10]). A system similar to the multiprocessor automaton allows sharing information about the current states of the processors. In such a system, each automaton makes a move with respect to the current input symbol and the states of all automata. If we reduce all these automata to a single one with multiple reading heads, we obtain an equivalent model called a multi-head automaton (see [102]). In relation to automata, [36] first investigated the idea of applying strategies akin to those used by cooperating distributed grammar systems. For this purpose, Mitrana and Dassow introduced special types of multi-stack pushdown automata. However, these do not form the automata counterpart of CD grammar systems. Such a counterpart was introduced and studied in [29] under the name of distributed pushdown automata systems. A distributed pushdown automata system contains a shared one-way input tape, one reading head, and a finite number of components having their own pushdowns and finite sets of states. At any moment, only one component is active. According to a cooperation protocol, the active component must perform exactly k, at least k, or at most k moves, for k ≥ 1, or it must work as long as it is able to perform a move.
Parallel communicating automata systems have been investigated both with finite automata and with pushdown automata as components. The first variant, the parallel communicating finite automata system, was introduced by Martín-Vide, Mateescu, and Mitrana in [72]. The finite automata in such systems work independently, but on request they communicate their states to each other. More precisely, the finite automata are entitled to request the current state of any other component. Several variants have been discussed in [72], where the contacted automaton after the communication is/is not reset to its initial state (returning/non-returning parallel communicating automata systems), or only one automaton has/all automata have the right to ask for the current state of the others (centralized/non-centralized parallel communicating automata systems). By applying these strategies to pushdown automata, the investigation was continued in [27], where the attention is focused especially on communication by stacks, i.e. on request, an asked automaton sends the contents of its pushdown to the requesting automata, which push it onto their pushdowns. In the same way as in the case of the grammar systems discussed above, one can find many other variants of automata systems in the literature (see [106, 107, 93, 87, 94, 32]). Generally, we can say that the theory of formal languages reflects the approaches used in grammar systems into automata systems and studies the accepting power of the given systems with respect to the components, represented by automata working in many different ways.

Part II

New Systems of Formal Models and Results

Chapter 4

Restrictions on CD Grammar Systems

Formal language theory has investigated various left restrictions placed on derivations in grammars working in a context-free way. In ordinary context-free grammars, these restrictions have no effect on the generative power. In terms of regulated context-free grammars, formal language theory has introduced a broad variety of leftmost derivation restrictions, many of which change the generative power (see [4, 8, 35, 37, 41, 44, 50, 57, 75, 76, 78, 79]). For grammars working in a context-sensitive way, significantly fewer left derivation restrictions have been discussed. Indirectly, the theory has placed some restrictions on the productions so that the resulting grammars derive only in a left way (see [4, 8]). It has also directly restricted derivations in the strictly leftmost way, so that the rewritten symbols are preceded only by terminals in the sentential form during every derivation step (see [75]). In essence, all these restrictions decrease the generative power to that of context-free grammars (see page 198 in [105]). This chapter generalizes the discussion of this topic by investigating regularly controlled cooperating distributed grammar systems (see Chapter 4 in [105]) whose components are phrase-structure grammars restricted in some new ways.

4.1 Three Restrictions on Derivations in Grammar Systems

First of all, we define the restrictions on derivations in phrase-structure grammars. In the following, we consider V as the total alphabet of G = (N,T,P,S), i.e. V = N ∪ T .

Definition 4.1 (Derivation-restriction of type I) Let l ∈ N and let G = (N, T, P, S) be a phrase-structure grammar. If there is α → β ∈ P , u = x0αx1, and v = x0βx1, where x0 ∈ T∗N∗, x1 ∈ V∗, and occur(x0α, N) ≤ l, then u l⇒ v [α → β] in G, or simply u l⇒ v.

The k-fold product of l⇒, where k ≥ 0, is denoted by l⇒k. The reflexive-transitive closure and transitive closure of l⇒ are denoted by l⇒∗ and l⇒+, respectively.

The restriction from Definition 4.1 requires that a rule of the form α → β can be used only if α lies within the first l nonterminals, for some l ∈ N. For instance, if there is a sentential form abABCDEFKL and l = 4, then BC → x1 is applicable to the sentential form while DE → x2 is not.
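The applicability test of Definition 4.1 is easy to mechanize. In the following sketch (our illustration; upper-case letters model nonterminals and lower-case letters terminals), applicable_I checks whether some occurrence of α has only terminals followed by nonterminals before it and lies within the first l nonterminals.

```python
def applicable_I(form, alpha, l):
    """Can a rule alpha -> beta be used on `form` under the type-I restriction?

    There must be a decomposition form = x0 alpha x1 with x0 in T* N*
    (terminals followed by nonterminals only) and occur(x0 alpha, N) <= l."""
    def nonterms(s):
        return sum(1 for c in s if c.isupper())
    for i in range(len(form) - len(alpha) + 1):
        if form[i:i + len(alpha)] != alpha:
            continue
        x0 = form[:i]
        # x0 in T* N*: after stripping leading terminals, only nonterminals remain
        stripped = x0.lstrip("abcdefghijklmnopqrstuvwxyz")
        if (stripped == "" or stripped.isupper()) and nonterms(x0 + alpha) <= l:
            return True
    return False
```

The assertions below reproduce the abABCDEFKL example with l = 4.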

Definition 4.2 (Derivation-restrictions of type II and III) Let m, h ∈ N. W(m) denotes the set of all strings x ∈ V∗ satisfying condition 1 given next. W(m, h) denotes the set of all strings x ∈ V∗ satisfying conditions 1 and 2.

1. x ∈ (T∗N∗)mT∗;

2. (y ∈ sub(x) and |y| > h) implies alph(y) ∩ T ≠ ∅.

Let u ∈ V∗N+V∗, v ∈ V∗, and u ⇒ v. Then, u m◦⇒h v in G if u, v ∈ W(m, h); and u m◦⇒ v in G if u, v ∈ W(m).

The k-fold products of m◦⇒h and m◦⇒, where k ≥ 0, are denoted by m◦⇒h,k and m◦⇒k, respectively. The reflexive-transitive closure and transitive closure of m◦⇒h are denoted by m◦⇒h,∗ and m◦⇒h,+, respectively; and the reflexive-transitive closure and transitive closure of m◦⇒ are denoted by m◦⇒∗ and m◦⇒+, respectively.

Informally, the second restriction permits only derivations in which every sentential form has m or fewer blocks of nonterminals, for m ∈ N; e.g. if m = 2, abABCaDEFKL ⇒ abABCaDEFKa [L → a], but E → a is not applicable to abABCaDEFKL. The third restriction extends the second one and additionally bounds the length of every block of nonterminals in every sentential form. That is, if m, h = 5, L → LLL cannot be used for a derivation step from abABCaDEFKL. As already stated, these restrictions are investigated in terms of CD grammar systems controlled by regular languages. See the following convention and definitions.
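Membership in W(m) and W(m, h) reduces to inspecting the maximal blocks of nonterminals: condition 1 bounds their number by m, and condition 2 says that no block may be longer than h. A small sketch (our illustration, with upper-case letters as nonterminals) checking the example steps above:

```python
import re

def in_W(x, m, h=None):
    """Membership in W(m) (at most m blocks of nonterminals) and, with h
    given, in W(m, h) (every block of nonterminals has length at most h)."""
    blocks = re.findall(r"[A-Z]+", x)      # maximal runs of nonterminals
    if len(blocks) > m:
        return False
    return h is None or all(len(b) <= h for b in blocks)

def step_allowed(u, v, m, h=None):
    """A derivation step u => v is permitted under restriction II (or III)
    only if both sentential forms lie in W(m) (or W(m, h))."""
    return in_W(u, m, h) and in_W(v, m, h)
```

The assertions check the text's examples: with m = 2, L → a is allowed but E → a creates a third block; with m, h = 5, L → LLL creates a block of seven nonterminals.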

Convention 4.3 Let Γ = (N, T, S, P1, . . . , Pn) be a CD grammar system with phrase-structure grammars as its components and let V = N ∪ T be the total alphabet of Γ. Furthermore, let u ∈ V∗N+V∗, v ∈ V∗, and k ≥ 0. Then, we write u l⇒k_Pi v, u m◦⇒h,k_Pi v, and u m◦⇒k_Pi v to denote that u l⇒k v, u m◦⇒h,k v, and u m◦⇒k v, respectively, was performed by Pi. Analogously, we write u l⇒∗_Pi v, u m◦⇒h,∗_Pi v, u m◦⇒∗_Pi v, u l⇒+_Pi v, u m◦⇒h,+_Pi v, u m◦⇒+_Pi v, u l⇒t_Pi v, u m◦⇒h,t_Pi v, and u m◦⇒t_Pi v.

Definition 4.4 (Languages of restricted and controlled CD grammar systems based upon phrase-structure grammars) Let Γ = (N, T, S, P1, . . . , Pn) be a CD grammar system with phrase-structure grammars as its components and let C be a control language. Then,

lL^C(Γ) = {w ∈ T∗| S l⇒t_Pi1 w1 l⇒t_Pi2 . . . l⇒t_Pip wp = w, p ≥ 1, 1 ≤ ij ≤ n, 1 ≤ j ≤ p, i1i2 . . . ip ∈ C},

NL^C(Γ, m, h) = {w ∈ T∗| S m◦⇒h,t_Pi1 w1 m◦⇒h,t_Pi2 . . . m◦⇒h,t_Pip wp = w, p ≥ 1, 1 ≤ ij ≤ n, 1 ≤ j ≤ p, i1i2 . . . ip ∈ C},

NL^C(Γ, m) = {w ∈ T∗| S m◦⇒t_Pi1 w1 m◦⇒t_Pi2 . . . m◦⇒t_Pip wp = w, p ≥ 1, 1 ≤ ij ≤ n, 1 ≤ j ≤ p, i1i2 . . . ip ∈ C}.

Definition 4.5 (Classes of languages)
Let $l, m, h \in \mathbb{N}$ and let $\Gamma = (N, T, S, P_1, \ldots, P_n)$ be a CD grammar system with phrase-structure grammars. We define the following classes of languages:
$L({}_lCD^{REG}) = \{{}_lL^{C}(\Gamma) \mid C \in REG\}$

$L({}_NCD^{REG}(m, h)) = \{{}_NL^{C}(\Gamma, m, h) \mid C \in REG\}$
$L({}_NCD^{REG}(m)) = \{{}_NL^{C}(\Gamma, m) \mid C \in REG\}$

4.2 Generative Power of Restricted Grammar Systems

At the beginning of this section, we show that for any language $L$ from $L({}_lCD^{REG})$ there exists a pushdown automaton $M$ such that $L = L(M)$, and that for every pushdown automaton $M'$, the language $L(M')$ is in $L({}_lCD^{REG})$.

Lemma 4.6 For every CD grammar system $\Gamma = (N, T, S, P_1, \ldots, P_n)$, every finite automaton $\bar{M}$, and every $l \in \mathbb{N}$, there is a pushdown automaton $M$ such that $L(M) = {}_lL^{L(\bar{M})}(\Gamma)$.

Proof of Lemma 4.6. Let $\Gamma = (N, T, S, P_1, \ldots, P_n)$, $\bar{M} = (\bar{Q}, \bar{\Sigma}, \bar{\delta}, \bar{s}_0, \bar{F})$, $l \geq 1$, and let $N_{left}(P) = \{\alpha \mid \alpha \to \beta \in P\}$. Consider the following pushdown automaton $M = (\{s_0, f\} \cup \{\langle\gamma, s, \bar{s}, i\rangle \mid \gamma \in N^*, |\gamma| \leq l, s \in \{q, r, e\}, \bar{s} \in \bar{Q}, i \in \{1, \ldots, n\}\}, T, T \cup N \cup \{Z\}, \delta, s_0, Z, \{f\})$, where $Z \notin T \cup N$ and $\delta$ contains rules of the following forms.

1. $s_0 \to \langle S, q, \bar{s}_0, i\rangle$ for $i \in \{1, \ldots, n\}$
2. $\langle\gamma, q, s, i\rangle \to (\gamma')^R\langle\varepsilon, r, s, i\rangle$ if $\gamma \in N^*$, $|\gamma| \leq l$, such that $\gamma \; {}_l{\Rightarrow}^{1}_{P_i} \gamma'$
3. $a\langle\varepsilon, r, s, i\rangle a \to \langle\varepsilon, r, s, i\rangle$ for $i \in \{1, \ldots, n\}$
4. $Z\langle\varepsilon, r, s, i\rangle \to f$ if $si \to s' \in \bar{\delta}$ for some $s' \in \bar{F}$
5. $A\langle A_1 \ldots A_o, r, s, i\rangle \to \langle A_1 \ldots A_oA, r, s, i\rangle$ if $A \in N$, $o < l$
6. $\langle A_1 \ldots A_l, r, s, i\rangle \to \langle A_1 \ldots A_l, e, s, i\rangle$ for $i \in \{1, \ldots, n\}$
7. $a\langle A_1 \ldots A_o, r, s, i\rangle \to a\langle A_1 \ldots A_o, e, s, i\rangle$ if $o < l$, $a \in T$
8. $Z\langle A_1 \ldots A_o, r, s, i\rangle \to Z\langle A_1 \ldots A_o, e, s, i\rangle$ if $o < l$
9. $\langle\gamma, e, s, i\rangle \to \langle\gamma, q, s', i'\rangle$ if $\operatorname{sub}(\gamma) \cap N_{left}(P_i) = \emptyset$, $si \to s' \in \bar{\delta}$, $i' \in \{1, \ldots, n\}$
10. $\langle\gamma, e, s, i\rangle \to \langle\gamma, q, s, i\rangle$ if $\operatorname{sub}(\gamma) \cap N_{left}(P_i) \neq \emptyset$

The proof of $L(M) = {}_lL^{L(\bar{M})}(\Gamma)$ is given by the following claims. First, we prove the following claim.

Claim 4.7 If $Z(\zeta)^R\langle\gamma, q, s, i_1\rangle w \vdash^* f$ in $M$, then $\gamma\zeta \; {}_l{\Rightarrow}^{t}_{P_{i_1}} w_1 \; {}_l{\Rightarrow}^{t}_{P_{i_2}} w_2 \cdots \; {}_l{\Rightarrow}^{t}_{P_{i_p}} w_p = w$, $p \geq 0$, in $\Gamma$ and $i_1 \ldots i_p \in \operatorname{suf}(L(\bar{M}))$.
Proof of Claim 4.7. By induction on the number of rules constructed in 2 used in a sequence of moves.
Basis: Only one rule constructed in 2 is used. Then,

$Z(\zeta)^R\langle\gamma, q, s, i_0\rangle w \vdash Z(\gamma'\zeta)^R\langle\varepsilon, r, s, i_0\rangle w \vdash^{|\gamma'\zeta|} Z\langle\varepsilon, r, s, i_0\rangle \vdash f$,

where $\gamma = \gamma_0\alpha\gamma_1$, $\gamma' = \gamma_0\beta\gamma_1$, $\alpha \to \beta \in P_{i_0}$, $\gamma \in N^+$, $\gamma'\zeta \in T^*$. Therefore, $\gamma_0 = \gamma_1 = \varepsilon$ and $\gamma'\zeta = w$. Then, $\gamma\zeta \; {}_l{\Rightarrow}_{P_{i_0}} w$.

By a rule constructed in 4, $i_0 \in \operatorname{suf}(L(\bar{M}))$, and the basis holds.

Induction hypothesis: Suppose that the claim holds for all sequences of moves containing no more than $j$ rules constructed in 2.
Induction step: Consider a sequence of moves containing $j + 1$ rules constructed in 2:
$Z(\zeta)^R\langle\gamma, q, s, i_0\rangle w$
$\vdash Z(\gamma'\zeta)^R\langle\varepsilon, r, s, i_0\rangle w$ (by a rule constructed in 2)
$\vdash^* Z(\zeta')^R\langle\varepsilon, r, s, i_0\rangle w'$ (by rules constructed in 3)
$\vdash^* Z(\zeta'')^R\langle\gamma'', r, s, i_0\rangle w'$ (by rules constructed in 5)
$\vdash Z(\zeta'')^R\langle\gamma'', e, s, i_0\rangle w'$ (by a rule constructed in 6, 7, or 8)
$\vdash Z(\zeta'')^R\langle\gamma'', q, s', i_1\rangle w'$ (by a rule constructed in 9 or 10)
$\vdash^* f$,
where $\gamma = \gamma_0\alpha\gamma_1$, $\gamma' = \gamma_0\beta\gamma_1$, $\alpha \to \beta \in P_{i_0}$, $\zeta' \in NV^* \cup \{\varepsilon\}$, $v \in T^*$, $\gamma'\zeta = v\zeta'$, $vw' = w$, $\zeta' = \gamma''\zeta''$, either $si \to s' \in \bar{\delta}$ or $s = s'$, and one of the following holds:
• $|\gamma''| = l$, or
• $|\gamma''| < l$ and $\zeta'' \in TV^* \cup \{\varepsilon\}$.
Then, by the rule $\alpha \to \beta$, $\gamma_0\alpha\gamma_1\zeta \; {}_l{\Rightarrow}_{P_{i_0}} \gamma_0\beta\gamma_1\zeta$, where $|\gamma_0\alpha\gamma_1| \leq l$, $\gamma_0\beta\gamma_1\zeta = v\zeta' = v\gamma''\zeta''$ and, by the induction hypothesis, $v\gamma''\zeta'' \; {}_l{\Rightarrow}^{t}_{P_{i_1}} vw_1 \; {}_l{\Rightarrow}^{t}_{P_{i_2}} vw_2 \cdots \; {}_l{\Rightarrow}^{t}_{P_{i_p}} vw_p = vw'$ and

$i_1 \ldots i_p \in \operatorname{suf}(L(\bar{M}))$, where $p \geq 0$.
If a rule constructed in 9 was used, $\gamma_0\alpha\gamma_1\zeta \; {}_l{\Rightarrow}^{t}_{P_{i_0}} \gamma_0\beta\gamma_1\zeta$ is a $t$-mode derivation step, $i_0i_1i_2 \ldots i_p \in \operatorname{suf}(L(\bar{M}))$, and the claim holds.
If a rule constructed in 10 was used, then $i_0 = i_1$, $\gamma_0\alpha\gamma_1\zeta \; {}_l{\Rightarrow}^{t}_{P_{i_1}} vw_1$, $i_1i_2 \ldots i_p \in \operatorname{suf}(L(\bar{M}))$, and the claim holds.

Let $Zs_0w \vdash Z\langle S, q, \bar{s}_0, i_1\rangle w$ by a rule constructed in 1. By Claim 4.7, $Z\langle S, q, \bar{s}_0, i_1\rangle w \vdash^* f$ implies $S \; {}_l{\Rightarrow}^{t}_{P_{i_1}} w_1 \; {}_l{\Rightarrow}^{t}_{P_{i_2}} w_2 \cdots \; {}_l{\Rightarrow}^{t}_{P_{i_p}} w_p = w$, $p \geq 0$, in $\Gamma$ and $i_1 \ldots i_p \in \operatorname{suf}(L(\bar{M}))$.

Claim 4.8 If $\tau_0x_0 \; {}_l{\Rightarrow}^{t}_{P_{i_1}} w_1 \; {}_l{\Rightarrow}^{t}_{P_{i_2}} w_2 \cdots \; {}_l{\Rightarrow}^{t}_{P_{i_p}} w_p = w$ in $\Gamma$, where $p \geq 0$, $\tau_0 \in N^+$, $x_0 \in TV^* \cup \{\varepsilon\}$, $w_i \in V^*$ for $i \in \{1, \ldots, p-1\}$, $w_p \in T^*$, and $i_1 \ldots i_p \in \operatorname{suf}(L(\bar{M}))$, then $Z(\tau_0^2x_0)^R\langle\tau_0^1, q, s, i_1\rangle w \vdash^* f$ for some $s \in \bar{Q}$, where $\tau_0 = \tau_0^1\tau_0^2$, $|\tau_0| \leq l$ implies $\tau_0^1 = \tau_0$, and $|\tau_0| > l$ implies $|\tau_0^1| = l$.
Proof of Claim 4.8. By induction on the length of derivations.
Basis: Let $\tau_0x_0 \; {}_l{\Rightarrow}_{P_{i_0}} \tau_0'x_0 = w$, where $\tau_0 = \gamma_0\alpha\gamma_1$, $\tau_0' = \gamma_0\beta\gamma_1$, $\alpha \to \beta \in P_{i_0}$. Therefore, $\gamma_0 = \gamma_1 = \tau_0^2 = \varepsilon$ and, for some $s' \in \bar{Q}$, $si_0 \to s' \in \bar{\delta}$, where $s' \in \bar{F}$. $M$ simulates this derivation step in the following way:

$Z(\tau_0^2x_0)^R\langle\tau_0^1, q, s, i_0\rangle w$
$\vdash Z(\tau_0'x_0)^R\langle\varepsilon, r, s, i_0\rangle w$ (by a rule constructed in 2)
$\vdash^{|\tau_0'x_0|} Z\langle\varepsilon, r, s, i_0\rangle$ (by rules constructed in 3)
$\vdash f$ (by a rule constructed in 4).

Therefore, the basis holds.
Induction hypothesis: Suppose that the claim holds for all derivations of length $j$ or less.
Induction step: Consider a derivation of length $j + 1$:

$\tau_0x_0 \; {}_l{\Rightarrow}_{P_{i_0}} \tau_0'x_0 = v_1\tau_1x_1 \; {}_l{\Rightarrow}^{t}_{P_{i_1}} v_1w_1 \; {}_l{\Rightarrow}^{t}_{P_{i_2}} v_1w_2 \cdots \; {}_l{\Rightarrow}^{t}_{P_{i_p}} v_1w_p = w = v_1w'$,

where $p \geq 0$, $v_1 \in T^*$, $\tau_0, \tau_1 \in N^+$, $\tau_0' \in V^*$, $x_0, x_1 \in TV^* \cup \{\varepsilon\}$, $w_i \in V^*$ for $i \in \{1, \ldots, p-1\}$, $w_p, w' \in T^*$. Then, $M$ simulates this derivation as follows:

$Z(\tau_0^2x_0)^R\langle\tau_0^1, q, s, i_0\rangle w$
$\vdash Z(\tau_0'x_0)^R\langle\varepsilon, r, s, i_0\rangle w$ (by a rule constructed in 2)
$= Z(v_1\tau_1x_1)^R\langle\varepsilon, r, s, i_0\rangle v_1w'$
$\vdash^{|v_1|} Z(\tau_1x_1)^R\langle\varepsilon, r, s, i_0\rangle w'$ (by rules constructed in 3)
$\vdash^{|\tau_1^1|} Z(\tau_1^2x_1)^R\langle\tau_1^1, r, s, i_0\rangle w'$ (by rules constructed in 5)
$\vdash Z(\tau_1^2x_1)^R\langle\tau_1^1, e, s, i_0\rangle w'$ (by a rule constructed in 6, 7, or 8)
$\vdash Z(\tau_1^2x_1)^R\langle\tau_1^1, q, s', i_1\rangle w'$ (by a rule constructed in 9 or 10)
$\vdash^* f$ (by the induction hypothesis)
If $\tau_0x_0 \Rightarrow \tau_0'x_0$ is a $t$-mode derivation step, a rule of type 9 is used during the simulation. Otherwise, a rule of type 10 is used (and therefore $i_0 = i_1$). Hence, the claim holds.

t t ∗ + Let S ⇒ uτ0x0 ⇒ ... ⇒ uw, where p ≥ 0, u, w ∈ T , τ0 ∈ N ∪ {ε}, l Pi0 l Pi1 l Pip ∗ ∗ x0 ∈ TV ∪ {ε} and i1 . . . ip ∈ L(M¯ ). If uτ0x0 6∈ T , M simulates this derivation in the following way:

$Zs_0uw \vdash Z\langle S, q, s, i_0\rangle uw$ (by a rule constructed in 1)
$\vdash Z(u\tau_0x_0)^R\langle\varepsilon, r, s, i_0\rangle uw$ (by a rule constructed in 2)
$\vdash^{|u|} Z(\tau_0x_0)^R\langle\varepsilon, r, s, i_0\rangle w$ (by rules constructed in 3)
$\vdash^{|\tau_0^1|} Z(\tau_0^2x_0)^R\langle\tau_0^1, r, s, i_0\rangle w$ (by rules constructed in 5)
$\vdash Z(\tau_0^2x_0)^R\langle\tau_0^1, e, s, i_0\rangle w$ (by a rule constructed in 6, 7, or 8)
$\vdash Z(\tau_0^2x_0)^R\langle\tau_0^1, q, s', i_1\rangle w$ (by a rule constructed in 9 or 10)
$\vdash^* f$ (by the previous claim)
If $u\tau_0x_0 \in T^*$, $M$ simulates this derivation in the following way:

$Zs_0uw \vdash Z\langle S, q, s, i_0\rangle uw$ (by a rule constructed in 1)
$\vdash Z(u\tau_0x_0)^R\langle\varepsilon, r, s, i_0\rangle uw$ (by a rule constructed in 2)
$\vdash^{|u\tau_0x_0|} Z\langle\varepsilon, r, s, i_0\rangle$ (by rules constructed in 3)
$\vdash f$ (by a rule constructed in 4)
Hence, and from the previous claims, it follows that the lemma holds. □

By Lemma 4.6, we have the following result.

Theorem 4.9 Let $l \in \mathbb{N}$. Then, $CF = L({}_lCD^{REG})$.

Proof of Theorem 4.9. One inclusion is clear, the other follows from Lemma 4.6. 

Theorem 4.9 says that grammar systems under the first restriction are much weaker than grammar systems without this restriction. Now, we prove that the second restriction of this chapter has no effect on the generative power.

Theorem 4.10 $RE = L({}_NCD^{REG}(1))$.
Proof of Theorem 4.10. It is well known (see [47] or [48]) that any recursively enumerable language $L$ is generated by a grammar $G$ in the Geffert normal form, i.e., by a grammar of the form $G = (\{S, A, B, C\}, T, P \cup \{ABC \to \varepsilon\}, S)$, where $P$ contains context-free productions of the form

$S \to uSa$, $S \to uSv$, $S \to uv$,
where $u \in \{A, AB\}^*$, $v \in \{BC, C\}^*$, and $a \in T$. In addition, any terminal derivation in $G$ is of the form $S \Rightarrow^* w_1w_2w$ by productions from $P$, where $w_1 \in \{A, AB\}^*$, $w_2 \in \{BC, C\}^*$, $w \in T^*$, and $w_1w_2w \Rightarrow^* w$ is derived by $ABC \to \varepsilon$. Observe that every sentential form in such a derivation contains a single block of nonterminals. Clearly, $G$ is a CD grammar system with only one component. Set the control language to $\{1\}^*$. Then, the theorem holds. □

The last theorem of this chapter says that the generative power of grammar systems under the third restriction is smaller than the generative power of grammar systems without any restriction.

Theorem 4.11 For any $m, h \geq 1$, $L_m(P) = L({}_NCD^{REG}(m, h))$.
Proof of Theorem 4.11. All strings in any derivation contain no more than $m$ blocks of nonterminals, and each of these blocks is of length no more than $h$. Hence, it is possible to represent each possible block by a single nonterminal and to construct an equivalent grammar system that contains only context-free productions. From this and from Theorem 7.10 in [42], Theorem 4.11 holds. □
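The finiteness argument behind this proof can be made explicit: with the block length bounded by $h$, only finitely many distinct nonterminal blocks exist, so each can be named by a fresh composite nonterminal. A small sketch (hypothetical helper name, single-character symbols) enumerates them:

```python
from itertools import product

def block_alphabet(nonterminals, h):
    """One composite nonterminal <w> for every nonempty string w over the
    nonterminal alphabet of length at most h; since every sentential form
    has at most m such blocks, finitely many composites suffice."""
    return ["<" + "".join(w) + ">"
            for length in range(1, h + 1)
            for w in product(sorted(nonterminals), repeat=length)]
```

For two nonterminals and $h = 2$ this yields $2 + 4 = 6$ composite symbols, each standing for one possible block.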

Chapter 5
n-Accepting Restricted Pushdown Automata Systems

In this chapter, we introduce two versions of n-accepting restricted pushdown automata systems that represent the automata counterparts of multi-generative grammar systems (see Section 3.2). First, we define n-accepting state-restricted pushdown automata systems. By using prescribed n-tuples of states, the restriction of these systems determines which of the components perform a move and which of them do not. Second, we define n-accepting move-restricted pushdown automata systems, where the restriction precisely determines which transition rule can be used in each of the n components. As the main result, we demonstrate that both restricted systems under discussion are equally powerful in the sense that an n-language is accepted by a state-restricted system if and only if the same n-language is accepted by a move-restricted system. In fact, this crucial result is proven effectively: we give a transformation algorithm that converts any state-restricted system into a move-restricted system so that the above equivalence holds. Furthermore, it is shown that both variants of n-accepting restricted pushdown automata systems define the same class of n-languages as n-CGR and n-CGN do.

Note. In the rest of this publication, we assume that $n \in \mathbb{N} - \{1\}$, and by a PDA we understand a pushdown automaton accepting by empty pushdown, unless stated otherwise.

5.1 n-Accepting State-Restricted Automata Systems
An n-accepting state-restricted automata system consists of a set of switch rules and n PDAs (see Definition 5.1), where each PDA computes on its own input string. During a computation, some automata can be suspended and, after some computation steps, resumed again. In this way, the system works on the n-string composed of these input strings, and it accepts the n-string if and only if all PDAs accept their inputs.

Definition 5.1 (n-Accepting state-restricted automata system)
Let $M_i = (Q_i, \Sigma, \Gamma_i, \delta_i, s_i, z_{i,0}, \emptyset)$ be a PDA for all $i = 1, \ldots, n$. Then, an n-accepting state-restricted automata system, n-SAS for short, is defined as $\vartheta = (M_1, \ldots, M_n, \Psi)$, where $\Psi$ is a finite set of switch rules of the form $(q_1, \ldots, q_n) \to (h_1, \ldots, h_n)$ with $q_i \in Q_i$ and $h_i \in \{e, d\}$ for all $i = 1, \ldots, n$. The symbols $e$ and $d$ are referred to as enabling and disabling a component of the automata system, respectively.

Definition 5.2 (n-Configuration of n-SAS)
Let $\vartheta = (M_1, \ldots, M_n, \Psi)$ be an n-SAS and $M_i = (Q_i, \Sigma, \Gamma_i, \delta_i, s_i, z_{i,0}, \emptyset)$ for all $i = 1, \ldots, n$. An n-configuration is defined as an n-tuple $\chi = ((x_1)^{h_1}, \ldots, (x_n)^{h_n})$, where for all $i = 1, \ldots, n$, $x_i \in \Gamma_i^*Q_i\Sigma^*$ is a configuration of $M_i$ and $h_i \in \{d, e\}$.

Definition 5.3 (Computation step in n-SAS)
Let $n \in \mathbb{N}$, $\vartheta = (M_1, \ldots, M_n, \Psi)$ be an n-SAS, and $M_i = (Q_i, \Sigma, \Gamma_i, \delta_i, s_i, z_{i,0}, \emptyset)$ for all $i = 1, \ldots, n$. Let

• $\chi = ((\gamma_1q_1\omega_1)^{h_1}, \ldots, (\gamma_nq_n\omega_n)^{h_n})$ and

• $\chi' = ((\gamma_1'q_1'\omega_1')^{h_1'}, \ldots, (\gamma_n'q_n'\omega_n')^{h_n'})$
be two n-configurations of the n-SAS with $\gamma_i, \gamma_i' \in \Gamma_i^*$, $q_i, q_i' \in Q_i$, and $\omega_i, \omega_i' \in \Sigma^*$. $\vartheta$ makes a computation step from n-configuration $\chi$ to $\chi'$, denoted $\chi \vdash \chi'$, if for every $i = 1, \ldots, n$ the following holds:

• if $h_i = d$, then $\gamma_i'q_i'\omega_i' = \gamma_iq_i\omega_i$;
• if $h_i = e$, then $\gamma_iq_i\omega_i \vdash \gamma_i'q_i'\omega_i'$ in $M_i$;
• if $(q_1', \ldots, q_n') \to (g_1, \ldots, g_n) \in \Psi$, then $h_i' = g_i$;
• if $(q_1', \ldots, q_n') \to (g_1, \ldots, g_n) \notin \Psi$ for all $(g_1, \ldots, g_n) \in \{e, d\}^n$, then $h_i' = h_i$.
In the standard way, $\vdash^*$ and $\vdash^+$ denote the reflexive-transitive and transitive closure of $\vdash$, respectively.

Informally, during a single computation step of an n-SAS, all enabled PDAs make a move with respect to their transition rules, while the other automata are frozen and do nothing. Thereafter, the system checks whether its set of switch rules contains a member for the current combination of states. If so, the activities of the PDAs in the system are changed accordingly; otherwise, the activities remain unchanged.

Definition 5.4 (n-Language of n-SAS)
Let $\vartheta = (M_1, \ldots, M_n, \Psi)$ be an n-SAS and, for every $i = 1, \ldots, n$, let $M_i = (Q_i, \Sigma, \Gamma_i, \delta_i, s_i, z_{i,0}, \emptyset)$ be a PDA accepting input strings by empty pushdown. Let

• $\chi_0 = ((z_{1,0}s_1\omega_1)^e, \ldots, (z_{n,0}s_n\omega_n)^e)$ be the start n-configuration and

• $\chi_f = ((q_1)^{h_1}, \ldots, (q_n)^{h_n})$ be a final n-configuration of the n-SAS,

where for all $i = 1, \ldots, n$, $q_i \in Q_i$, $h_i \in \{d, e\}$, and $\omega_i \in \Sigma^*$. The n-language of the n-SAS is defined as $n\text{-}L(\vartheta) = \{(\omega_1, \ldots, \omega_n) \mid \chi_0 \vdash^* \chi_f\}$.

Example 5.5 Consider the 2-SAS $\vartheta = (M_1, M_2, \Psi)$ given as follows.

• M1 = ({q0, q1, q2}, {a, b, c}, {., a}, δ1, q0, ., ∅),

• M2 = ({q0, q1}, {a, b}, {., a}, δ2, q0, ., ∅),

1st PDA          2nd PDA        Rule(s)    Ψ
(.q0aabbcc)^e    (.q0aabb)^e
(.aq0abbcc)^e    (.aq0abb)^e    (r1, r1)
(.aaq0bbcc)^e    (.aaq0bb)^e    (r2, r2)
(.aq1bcc)^e      (.aq1b)^d      (r3, r3)   (q1, q1) → (e, d) ∈ Ψ
(.q1cc)^e        (.aq1b)^d      (r4, −)
(.q2c)^e         (.aq1b)^e      (r5, −)    (q2, q1) → (e, e) ∈ Ψ
(.q2)^e          (.q1)^e        (r6, r4)
(q2)^e           (q1)^e         (r7, r5)

Table 5.1: Acceptance of (aabbcc, aabb) by 2-SAS from Example 5.5

• δ1 = { r1 = .q0a → .aq0, r2 = aq0a → aaq0, r3 = aq0b → q1, r4 = aq1b → q1, r5 = .q1c → .q2, r6 = .q2c → .q2, r7 = .q2 → q2 }
• δ2 = { r1 = .q0a → .aq0, r2 = aq0a → aaq0, r3 = aq0b → q1, r4 = aq1b → q1, r5 = .q1 → q1 }

• Ψ = {(q0, q1) → (d, d), (q1, q0) → (d, d), (q1, q1) → (e, d), (q2, q1) → (e, e)}
Both PDAs start by pushing a's onto their pushdowns. As soon as b's become the first input symbols in both automata, the PDAs remove a's from the pushdowns, move to q1, and, by Ψ, the system disables M2; that is, only M1 continues. Note that the PDAs have to move to the state q1 at the same time. Otherwise, the system freezes both automata and the inputs are not accepted. In the state q1, M1 compares the number of a's and b's by removing a's from the pushdown while reading b's from its input. After that, if the numbers of a's and b's coincide, M1 moves to q2 and M2 becomes enabled. By using the rules r6 and r4 in M1 and M2, respectively, M1 reads c's from its input while M2 compares the number of a's and b's. One can see that the last step of a successful computation must be done by the rules r7 in M1 and r5 in M2 at the same time. In this way, the 2-SAS accepts 2-strings from the 2-language 2-L(ϑ) = {(a^m b^m c^m, a^m b^m) | m ∈ ℕ}. The acceptance of the 2-string (aabbcc, aabb) is shown in Table 5.1.
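The computation of Table 5.1 can be replayed mechanically. The following sketch is a simulator for this particular 2-SAS; the encoding is our own: a rule is a tuple (pop, state, read, push, next-state), the pushdown top is the rightmost character, and the rule order happens to make this run deterministic (a faithful general simulator would search over nondeterministic choices):

```python
def step(rules, cfg):
    """One move of a single PDA; cfg = (pushdown, state, remaining input)."""
    stack, state, inp = cfg
    for pop, q, read, push, q2 in rules:
        if q == state and stack.endswith(pop) and inp.startswith(read):
            return (stack[: len(stack) - len(pop)] + push, q2, inp[len(read):])
    return None

def run_2sas(d1, d2, psi, w1, w2, limit=100):
    """Simulate the 2-SAS: enabled components move, disabled ones stay
    frozen; after each step the switch rules psi update the flags.
    Accepts iff both pushdowns and both inputs are empty."""
    cfgs = [(".", "q0", w1), (".", "q0", w2)]
    enabled = [True, True]
    for _ in range(limit):
        if all(s == "" and i == "" for s, _, i in cfgs):
            return True
        nxt = [step(d, c) if e else c
               for d, c, e in zip((d1, d2), cfgs, enabled)]
        if any(c is None for c in nxt):
            return False
        cfgs = nxt
        key = (cfgs[0][1], cfgs[1][1])
        if key in psi:
            enabled = [h == "e" for h in psi[key]]
    return False

# delta_1, delta_2 and Psi from Example 5.5
d1 = [(".", "q0", "a", ".a", "q0"), ("a", "q0", "a", "aa", "q0"),
      ("a", "q0", "b", "", "q1"), ("a", "q1", "b", "", "q1"),
      (".", "q1", "c", ".", "q2"), (".", "q2", "c", ".", "q2"),
      (".", "q2", "", "", "q2")]
d2 = [(".", "q0", "a", ".a", "q0"), ("a", "q0", "a", "aa", "q0"),
      ("a", "q0", "b", "", "q1"), ("a", "q1", "b", "", "q1"),
      (".", "q1", "", "", "q1")]
psi = {("q0", "q1"): "dd", ("q1", "q0"): "dd",
       ("q1", "q1"): "ed", ("q2", "q1"): "ee"}
```

Running `run_2sas(d1, d2, psi, "aabbcc", "aabb")` reproduces the eight rows of Table 5.1 and accepts, while mismatched inputs are rejected.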

For any n-CGN, we can construct an n-SAS that defines the same n-language.
Basic idea. With regard to the organization, for every CFG G in the n-CGN, the n-SAS has a PDA that simulates top-down parsing based on G (see [80]). The automaton contains one state for each nonterminal, in which that nonterminal can be expanded; in addition, it contains two other states for comparison and acceptance. Through the states of the automata, the n-SAS is able to control what its automata do. If some automata compare terminal symbols on their pushdowns with input symbols, the other automata have to wait until all automata are ready for the next expansion or until they accept their inputs. The automata in the n-SAS make their expansions at the same time, and the n-SAS allows the expansions if and only if the nonterminals on the tops of the pushdowns can be rewritten at the same time in the n-CGN. If an expansion is not allowed, the n-SAS suspends all its automata.

Algorithm 5.6 Construction of an n-SAS from an n-CGN

Input: An n-CGN $\hat{\Gamma} = (G_1, \ldots, G_n, \hat{Q})$, where $G_i = (\hat{N}_i, \hat{T}_i, \hat{P}_i, \hat{S}_i)$ is a context-free grammar, for $i = 1, \ldots, n$.

Output: An n-SAS $\vartheta = (M_1, \ldots, M_n, \Psi)$, where for every $i = 1, \ldots, n$, $M_i = (Q_i, \Sigma, \Gamma_i, \delta_i, s_{i,0}, z_{i,0}, \emptyset)$ is a pushdown automaton accepting by empty pushdown, and $n\text{-}L(\vartheta) = n\text{-}L(\hat{\Gamma})$.
Method:
• For every $i = 1, \ldots, n$, set $Q_i = \{\langle A\rangle \mid A \in \hat{N}_i\} \cup \{r_i, f_i\}$, $\Sigma = \bigcup_{i=1}^{n} \hat{T}_i$, $\Gamma_i = \Sigma \cup \hat{N}_i \cup \{A' \mid A \in \hat{N}_i\} \cup \{\Delta, \Delta'\}$, $s_{i,0} = \langle\hat{S}_i\rangle$, and $z_{i,0} = \Delta'$.

• For all i = 1, . . . , n, δi is constructed as follows:

0. set $\delta_i = \emptyset$,
1. add $\Delta'\langle\hat{S}_i\rangle \to \Delta\hat{S}_i\hat{S}_i'\langle\hat{S}_i\rangle$ to $\delta_i$,
2. add $a\langle A\rangle \to ar_i$ to $\delta_i$ for every $a \in \hat{T}_i$ and every $A \in \hat{N}_i$,
3. add $A'\langle B\rangle \to A'r_i$ to $\delta_i$ for every $A, B \in \hat{N}_i$,
4. add $ar_ia \to r_i$ to $\delta_i$ for every $a \in \hat{T}_i$,
5. add $A'r_i \to \langle A\rangle$ to $\delta_i$ for every $A \in \hat{N}_i$,
6. add $\Delta\langle A\rangle \to f_i$ to $\delta_i$ for every $A \in \hat{N}_i$,
7. add $\Delta r_i \to f_i$ to $\delta_i$,
8. for every $(A \to \alpha) \in \hat{P}_i$, add $A\langle A\rangle \to \theta(\alpha)\langle A\rangle$ to $\delta_i$, where $\theta$ is the projection from $(\hat{N}_i \cup \hat{T}_i)^*$ to $(\hat{N}_i\{A' \mid A \in \hat{N}_i\} \cup \hat{T}_i)^*$ such that $\theta(\omega)$ is obtained from $\omega^R$ by replacing each occurrence of a nonterminal $A \in \hat{N}_i$ with $AA'$. For example, if $a, b \in \hat{T}_i$ and $A \in \hat{N}_i$, then $\theta(aAb) = bAA'a$.

• $\Psi$ is constructed in the following way:

0. set Ψ = ∅

1. for every $(A_1, \ldots, A_n) \in \hat{Q}$, add $(\langle A_1\rangle, \ldots, \langle A_n\rangle) \to (e, \ldots, e)$ to $\Psi$,

2. add $(f_1, \ldots, f_n) \to (e, \ldots, e)$ to $\Psi$,

3. for every $(q_1, \ldots, q_n) \in Q_1 \times \cdots \times Q_n$ with $\{r_1, \ldots, r_n\} \cap \{q_1, \ldots, q_n\} \neq \emptyset$, add $(q_1, \ldots, q_n) \to (l_1, \ldots, l_n)$ to $\Psi$, where for all $o = 1, \ldots, n$, $q_o \in \{r_o\} \Leftrightarrow l_o = e$,

4. for all other $(q_1, \ldots, q_n) \in Q_1 \times \cdots \times Q_n$, add $(q_1, \ldots, q_n) \to (d, \ldots, d)$ to $\Psi$.
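The projection $\theta$ of step 8 is the only nontrivial piece of notation in the algorithm. A one-function sketch (our own helper name; single-character symbols, with $A'$ rendered as `"A'"`) may help:

```python
def theta(omega, nonterminals):
    """The projection of step 8 of Algorithm 5.6: reverse omega and
    replace every nonterminal A by AA', so that the plain copy ends up
    below the primed one on the pushdown."""
    out = []
    for sym in reversed(omega):
        out.append(sym)
        if sym in nonterminals:
            out.append(sym + "'")
    return "".join(out)
```

This reproduces the example given in step 8: `theta("aAb", {"A"})` yields `"bAA'a"`.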

Lemma 5.7 Algorithm 5.6 is correct.

Proof of Lemma 5.7. We first prove the following claims.

Claim 5.8 Let $(\hat{S}_1, \ldots, \hat{S}_n) \Rightarrow^* (u_1A_1v_1, \ldots, u_nA_nv_n)$ in $\hat{\Gamma}$, where, for all $i = 1, \ldots, n$, $A_i \in \hat{N}_i$, $v_i \in (\hat{N}_i \cup \hat{T}_i)^*$, $u_i\omega_i \in \hat{T}_i^*$, and $(u_1A_1v_1, \ldots, u_nA_nv_n) \Rightarrow^* (u_1\omega_1, \ldots, u_n\omega_n)$. Then
$((\Delta'\langle\hat{S}_1\rangle u_1\omega_1)^e, \ldots, (\Delta'\langle\hat{S}_n\rangle u_n\omega_n)^e) \vdash^* ((\Delta\theta(v_1)A_1\langle A_1\rangle\omega_1)^{h_1}, \ldots, (\Delta\theta(v_n)A_n\langle A_n\rangle\omega_n)^{h_n})$ in $\vartheta$.

Proof of Claim 5.8. By induction on the length of the derivation.
Basis: Let $(\hat{S}_1, \ldots, \hat{S}_n) \Rightarrow^0 (\hat{S}_1, \ldots, \hat{S}_n)$, where $(\hat{S}_1, \ldots, \hat{S}_n) \Rightarrow^* (\omega_1, \ldots, \omega_n)$ and $\omega_i \in \hat{T}_i^*$ for every $i = 1, \ldots, n$. Then,
$((\Delta'\langle\hat{S}_1\rangle\omega_1)^e, \ldots, (\Delta'\langle\hat{S}_n\rangle\omega_n)^e)$
$\vdash ((\Delta\hat{S}_1\hat{S}_1'\langle\hat{S}_1\rangle\omega_1)^e, \ldots, (\Delta\hat{S}_n\hat{S}_n'\langle\hat{S}_n\rangle\omega_n)^e)$ [by 1. of $\delta_i$ and 1. of $\Psi$]
$\vdash ((\Delta\hat{S}_1\hat{S}_1'r_1\omega_1)^e, \ldots, (\Delta\hat{S}_n\hat{S}_n'r_n\omega_n)^e)$ [by 3. of $\delta_i$ and 3. of $\Psi$]
$\vdash ((\Delta\hat{S}_1\langle\hat{S}_1\rangle\omega_1)^{h_1}, \ldots, (\Delta\hat{S}_n\langle\hat{S}_n\rangle\omega_n)^{h_n})$ [by 5. of $\delta_i$]
for $h_1, \ldots, h_n \in \{e, d\}$.
Induction hypothesis: Suppose that Claim 5.8 holds for $j$ or fewer derivation steps, where $j$ is a non-negative integer.
Induction step: Consider any derivation $(\hat{S}_1, \ldots, \hat{S}_n) \Rightarrow^{j+1} (u_1x_1v_1, \ldots, u_nx_nv_n)$, where for all $i = 1, \ldots, n$, $u_ix_iv_i = u_iu_i'B_iv_i'v_i$, $u_i, u_i' \in \hat{T}_i^*$, $B_i \in \hat{N}_i$, $v_i, v_i', x_i \in (\hat{N}_i \cup \hat{T}_i)^*$, and $(u_1u_1'B_1v_1'v_1, \ldots, u_nu_n'B_nv_n'v_n) \Rightarrow^* (u_1\omega_1, \ldots, u_n\omega_n)$. We can express this derivation as $(\hat{S}_1, \ldots, \hat{S}_n) \Rightarrow^j (u_1A_1v_1, \ldots, u_nA_nv_n) \Rightarrow (u_1x_1v_1, \ldots, u_nx_nv_n)$, where for all $i = 1, \ldots, n$, $A_i \in \hat{N}_i$. By the induction hypothesis,
$((\Delta'\langle\hat{S}_1\rangle u_1\omega_1)^e, \ldots, (\Delta'\langle\hat{S}_n\rangle u_n\omega_n)^e) \vdash^* ((\Delta\theta(v_1)A_1\langle A_1\rangle\omega_1)^{h_1}, \ldots, (\Delta\theta(v_n)A_n\langle A_n\rangle\omega_n)^{h_n})$.

Obviously, $(A_1, \ldots, A_n) \in \hat{Q}$. From Algorithm 5.6, $(\langle A_1\rangle, \ldots, \langle A_n\rangle) \to (e, \ldots, e) \in \Psi$ and, for every $A_i \to x_i \in \hat{P}_i$, there is $A_i\langle A_i\rangle \to \theta(x_i)\langle A_i\rangle$. Therefore, $h_1, \ldots, h_n = e$ and
$((\Delta\theta(v_1)A_1\langle A_1\rangle\omega_1)^e, \ldots, (\Delta\theta(v_n)A_n\langle A_n\rangle\omega_n)^e)$
$\vdash ((\Delta\theta(x_1v_1)\langle A_1\rangle\omega_1)^e, \ldots, (\Delta\theta(x_nv_n)\langle A_n\rangle\omega_n)^e)$
$= ((\Delta\theta(v_1'v_1)B_1B_1'(u_1')^R\langle A_1\rangle\omega_1)^e, \ldots, (\Delta\theta(v_n'v_n)B_nB_n'(u_n')^R\langle A_n\rangle\omega_n)^e)$.
There is a symbol $a_i \in \Gamma_i - (\hat{N}_i \cup \{\Delta, \Delta'\})$ on the top of the pushdown of each automaton. Therefore, from the rules of the forms 2. and 3. in $\delta_i$ and from 3. in $\Psi$, it follows that
$((\Delta\theta(v_1'v_1)B_1B_1'(u_1')^R\langle A_1\rangle\omega_1)^e, \ldots) \vdash ((\Delta\theta(v_1'v_1)B_1B_1'(u_1')^Rr_1\omega_1)^e, \ldots, (\Delta\theta(v_n'v_n)B_nB_n'(u_n')^Rr_n\omega_n)^e)$.
Because $u_iu_i'B_iv_i'v_i \Rightarrow^* u_i\omega_i$, we have $\omega_i = u_i'\omega_i'$ for some $\omega_i'$. For every component $M_i$ of $\vartheta$, there is a sequence of computation steps $\Delta\theta(v_i'v_i)B_iB_i'(u_i')^Rr_iu_i'\omega_i' \vdash^* \Delta\theta(v_i'v_i)B_iB_i'r_i\omega_i'$ (by 4. of $\delta_i$) and $\Delta\theta(v_i'v_i)B_iB_i'r_i\omega_i' \vdash \Delta\theta(v_i'v_i)B_i\langle B_i\rangle\omega_i'$ (by 5. of $\delta_i$). From 3. in $\Psi$, each automaton $M_i$ that has reached the state $\langle B_i\rangle$ is suspended until no automaton $M_j$ is in the state $r_j$. Hence,
$((\Delta\theta(v_1'v_1)B_1B_1'(u_1')^Rr_1u_1'\omega_1')^e, \ldots, (\Delta\theta(v_n'v_n)B_nB_n'(u_n')^Rr_nu_n'\omega_n')^e)$
$\vdash^* ((\Delta\theta(v_1'v_1)B_1\langle B_1\rangle\omega_1')^{l_1}, \ldots, (\Delta\theta(v_n'v_n)B_n\langle B_n\rangle\omega_n')^{l_n})$
for $l_1, \ldots, l_n \in \{e, d\}$. Claim 5.8 holds.

Claim 5.9 If $(\hat{S}_1, \ldots, \hat{S}_n) \Rightarrow^* (\omega_1, \ldots, \omega_n)$ in $\hat{\Gamma}$, where $\omega_i \in \hat{T}_i^*$ for every $i = 1, \ldots, n$, then there is a sequence of moves $((\Delta'\langle\hat{S}_1\rangle\omega_1)^e, \ldots, (\Delta'\langle\hat{S}_n\rangle\omega_n)^e) \vdash^* ((f_1)^e, \ldots, (f_n)^e)$ in $\vartheta$.

Proof of Claim 5.9. Consider any successful derivation $(\hat{S}_1, \ldots, \hat{S}_n) \Rightarrow^* (\omega_1, \ldots, \omega_n)$. There must be an n-form $(u_1A_1v_1, \ldots, u_nA_nv_n)$ such that $(\hat{S}_1, \ldots, \hat{S}_n) \Rightarrow^* (u_1A_1v_1, \ldots, u_nA_nv_n) \Rightarrow (u_1x_1v_1, \ldots, u_nx_nv_n) = (\omega_1, \ldots, \omega_n)$. By Claim 5.8,
$((\Delta'\langle\hat{S}_1\rangle\omega_1)^e, \ldots, (\Delta'\langle\hat{S}_n\rangle\omega_n)^e) \vdash^* ((\Delta\theta(v_1)A_1\langle A_1\rangle\omega_1')^{h_1}, \ldots, (\Delta\theta(v_n)A_n\langle A_n\rangle\omega_n')^{h_n})$
and, for all $i = 1, \ldots, n$, $\omega_i = u_i\omega_i'$. Because $(A_1, \ldots, A_n) \in \hat{Q}$, $(\langle A_1\rangle, \ldots, \langle A_n\rangle) \to (e, \ldots, e) \in \Psi$, so for every $i = 1, \ldots, n$, $h_i = e$ and $\vartheta$ moves to $((\Delta\theta(x_1v_1)\langle A_1\rangle\omega_1')^e, \ldots, (\Delta\theta(x_nv_n)\langle A_n\rangle\omega_n')^e)$. For every $i = 1, \ldots, n$, $\omega_i' = x_iv_i$, because $\omega_i = u_ix_iv_i = u_i\omega_i'$. Therefore, there are two possibilities for the top symbol of the pushdown in each automaton of $\vartheta$:

a) The top symbol is $\Delta$; then $x_iv_i = \varepsilon$ and $\omega_i' = \varepsilon$, that is, $M_i$ moves to $f_i$. Furthermore, if any other automaton $M_j$ is in the state $r_j$, automaton $M_i$ is blocked.

b) The top symbol is $a_i \in \hat{T}_i$; then $\Delta(x_iv_i)^R\langle A_i\rangle x_iv_i \vdash \Delta(x_iv_i)^Rr_ix_iv_i$ (by a rule of the form 2. of $\delta_i$), $\Delta(x_iv_i)^Rr_ix_iv_i \vdash^* \Delta r_i$ (by rules of the form 4. of $\delta_i$), and $\Delta r_i \vdash f_i$ (by 7.), and automaton $M_i$ is blocked until no other automaton $M_j$ of $\vartheta$ is in the state $r_j$.

In this way, $\vartheta$ moves to the n-configuration $((f_1)^e, \ldots, (f_n)^e)$, and Claim 5.9 holds.

Claim 5.10 If $((\Delta'\langle\hat{S}_1\rangle\omega_1)^e, \ldots, (\Delta'\langle\hat{S}_n\rangle\omega_n)^e) \vdash^* ((f_1)^e, \ldots, (f_n)^e)$ in $\vartheta$, then there is a sequence of derivation steps $(\hat{S}_1, \ldots, \hat{S}_n) \Rightarrow^* (\omega_1, \ldots, \omega_n)$.
Proof of Claim 5.10. Consider any successful acceptance:

$((\Delta'\langle\hat{S}_1\rangle\omega_1)^e, \ldots, (\Delta'\langle\hat{S}_n\rangle\omega_n)^e) \vdash^* ((f_1)^e, \ldots, (f_n)^e)$  (I)
in $\vartheta$. From Algorithm 5.6 (by 1. of $\delta_i$), the first step of $\vartheta$ must be
$((\Delta'\langle\hat{S}_1\rangle\omega_1)^e, \ldots, (\Delta'\langle\hat{S}_n\rangle\omega_n)^e) \vdash ((\Delta\hat{S}_1\hat{S}_1'\langle\hat{S}_1\rangle\omega_1)^{l_1}, \ldots, (\Delta\hat{S}_n\hat{S}_n'\langle\hat{S}_n\rangle\omega_n)^{l_n})$,
and from the construction of $\Psi$, there are two possibilities for the activities of the automata in $\vartheta$:

• $(\langle\hat{S}_1\rangle, \ldots, \langle\hat{S}_n\rangle) \to (d, \ldots, d) \in \Psi$; but then, for every $i = 1, \ldots, n$, $l_i = d$ and $\vartheta$ is not successful for any input.

• Therefore, $(\langle\hat{S}_1\rangle, \ldots, \langle\hat{S}_n\rangle) \to (e, \ldots, e) \in \Psi$ and, for every $i = 1, \ldots, n$, $l_i = e$.

From rules of the forms 3. and 5. in $\delta_i$, $\vartheta$ must make the two following computation steps:
$((\Delta\hat{S}_1\hat{S}_1'\langle\hat{S}_1\rangle\omega_1)^e, \ldots, (\Delta\hat{S}_n\hat{S}_n'\langle\hat{S}_n\rangle\omega_n)^e)$
$\vdash ((\Delta\hat{S}_1\hat{S}_1'r_1\omega_1)^e, \ldots, (\Delta\hat{S}_n\hat{S}_n'r_n\omega_n)^e)$
$\vdash ((\Delta\hat{S}_1\langle\hat{S}_1\rangle\omega_1)^e, \ldots, (\Delta\hat{S}_n\langle\hat{S}_n\rangle\omega_n)^e)$.
Consider any n-configuration of the form $((\Delta A_1\langle A_1\rangle\omega_1')^e, \ldots, (\Delta A_n\langle A_n\rangle\omega_n')^e)$, where $((\Delta A_1\langle A_1\rangle\omega_1')^e, \ldots, (\Delta A_n\langle A_n\rangle\omega_n')^e) \vdash^* ((f_1)^e, \ldots, (f_n)^e)$. $\vartheta$ makes these computation steps:
$((\Delta A_1\langle A_1\rangle\omega_1')^e, \ldots, (\Delta A_n\langle A_n\rangle\omega_n')^e)$
$\vdash ((\Delta\gamma_1\langle A_1\rangle\omega_1')^e, \ldots, (\Delta\gamma_n\langle A_n\rangle\omega_n')^e)$
$\vdash ((\gamma_1'q_1\omega_1'')^{l_1}, \ldots, (\gamma_n'q_n\omega_n'')^{l_n})$,

and (by 2., 3., and 6. of $\delta_i$ and by $\Psi$) for every $i = 1, \ldots, n$, the following holds:

• if $\gamma_i = \varepsilon$, then $\gamma_i' = \omega_i'' = \varepsilon$ and $q_i = f_i$; and if $\gamma_1, \ldots, \gamma_n = \varepsilon$ (i.e., for every $j = 1, \ldots, n$, $q_j = f_j$), then $l_i = e$; otherwise, $l_i = d$,

• otherwise, $\gamma_i' = \Delta\gamma_i$, $q_i = r_i$, and $l_i = e$.

Hence, there are only three applicable types of moves for each automaton $M_i$ that is in the state $r_i$:

• $ar_ia \to r_i$, where $a \in \hat{T}_i$, and $M_i$ remains active in the next computation step;

• $B_i'r_i \to \langle B_i\rangle$, where $B_i \in \hat{N}_i$; then the next top symbol of the pushdown must be $B_i$, and $M_i$ is blocked until no automaton $M_j$ of $\vartheta$ is in the state $r_j$;

• $\Delta r_i \to f_i$, and $M_i$ is blocked until no automaton $M_j$ of $\vartheta$ is in the state $r_j$. These three types of steps are repeatedly applied to the active components of $\vartheta$. Hence, and from the construction of $\Psi$, the following configuration of $\vartheta$ must be either

i. $((f_1)^e, \ldots, (f_n)^e)$ (the n-string is accepted), or

ii. $((\Delta\gamma_1''B_1\langle B_1\rangle\omega_1'')^e, \ldots, (\Delta\gamma_n''B_n\langle B_n\rangle\omega_n'')^e)$, where for every $i = 1, \ldots, n$, $B_i \in \hat{N}_i$ and $(\langle B_1\rangle, \ldots, \langle B_n\rangle) \to (e, \ldots, e) \in \Psi$. For other n-tuples of states, $\vartheta$ blocks all automata (by 4. of $\Psi$) and the n-string is not accepted. It is obvious that we can express (I) as
$((\Delta'\langle\hat{S}_1\rangle\omega_1)^e, \ldots, (\Delta'\langle\hat{S}_n\rangle\omega_n)^e)$
$\vdash^3 ((\Delta\hat{S}_1\langle\hat{S}_1\rangle\omega_1)^e, \ldots, (\Delta\hat{S}_n\langle\hat{S}_n\rangle\omega_n)^e)$
$\vdash^{m_1} ((\Delta\gamma_1^{(1)}A_1^{(1)}\langle A_1^{(1)}\rangle\omega_1^{(1)})^e, \ldots, (\Delta\gamma_n^{(1)}A_n^{(1)}\langle A_n^{(1)}\rangle\omega_n^{(1)})^e)$
$\vdash^{m_2} ((\Delta\gamma_1^{(2)}A_1^{(2)}\langle A_1^{(2)}\rangle\omega_1^{(2)})^e, \ldots, (\Delta\gamma_n^{(2)}A_n^{(2)}\langle A_n^{(2)}\rangle\omega_n^{(2)})^e)$
⋮

$\vdash^{m_k} ((\Delta\gamma_1^{(k)}A_1^{(k)}\langle A_1^{(k)}\rangle\omega_1^{(k)})^e, \ldots, (\Delta\gamma_n^{(k)}A_n^{(k)}\langle A_n^{(k)}\rangle\omega_n^{(k)})^e)$
$\vdash^{m_{k+1}} ((f_1)^e, \ldots, (f_n)^e)$
for all $i = 1, \ldots, k+1$, $m_i \geq 1$. The computation of $\vartheta$ can be simulated by $\hat{\Gamma}$ as

$(\hat{S}_1, \ldots, \hat{S}_n)$
$\Rightarrow (u_1^{(1)}A_1^{(1)}v_1^{(1)}, \ldots, u_n^{(1)}A_n^{(1)}v_n^{(1)})$
$\Rightarrow (u_1^{(2)}A_1^{(2)}v_1^{(2)}, \ldots, u_n^{(2)}A_n^{(2)}v_n^{(2)})$
⋮
$\Rightarrow (u_1^{(k)}A_1^{(k)}v_1^{(k)}, \ldots, u_n^{(k)}A_n^{(k)}v_n^{(k)})$
$\Rightarrow (u_1^{(k+1)}, \ldots, u_n^{(k+1)}) = (\omega_1, \ldots, \omega_n)$,
where for every $i = 1, \ldots, n$ and every $j = 1, \ldots, k+1$, $u_i^{(j)} \in \hat{T}_i^*$, $A_i^{(j)} \in \hat{N}_i$, $v_i^{(j)} \in (\hat{T}_i \cup \hat{N}_i)^*$, and $u_i^{(j)}\omega_i^{(j)} = \omega_i$. Hence, Claim 5.10 holds.

From Claims 5.8, 5.9, and 5.10, $n\text{-}L(\vartheta) = n\text{-}L(\hat{\Gamma})$. Therefore, Lemma 5.7 holds. □

5.2 n-Accepting Move-Restricted Automata Systems

The second variant of n-accepting automata systems studied in this publication is the n-accepting move-restricted automata system. In contrast to n-accepting state-restricted automata systems, all PDAs in an n-accepting move-restricted automata system work during the whole computation. The restriction is based on limiting which transition rules the PDAs can select for a common computation step.

Definition 5.11 (n-Accepting move-restricted automata system)
Let $M_i = (Q_i, \Sigma, \Gamma_i, \delta_i, s_i, z_{i,0}, \emptyset)$ be a PDA for all $i = 1, \ldots, n$. Then, an n-accepting move-restricted automata system, n-MAS for short, is defined as $\vartheta = (M_1, \ldots, M_n, \Psi)$, where $\Psi$ is a finite set of n-tuples of the form $(r_1, \ldots, r_n)$ with $r_j \in \delta_j$ for each $j = 1, \ldots, n$.

Definition 5.12 (n-Configuration of n-MAS)
Let $\vartheta = (M_1, \ldots, M_n, \Psi)$ be an n-MAS and $M_i = (Q_i, \Sigma, \Gamma_i, \delta_i, s_i, z_{i,0}, \emptyset)$ for all $i = 1, \ldots, n$. Then, an n-configuration is defined as an n-tuple $\chi = (x_1, \ldots, x_n)$, where for all $i = 1, \ldots, n$, $x_i \in \Gamma_i^*Q_i\Sigma^*$ is a configuration of $M_i$.

Definition 5.13 (Computation step in n-MAS) Let ϑ = (M1,...,Mn, Ψ) be an n-MAS and for each i = 1, . . . , n, Mi = (Qi, Σ, Γi, δi, si, zi,0, ∅). Let

• χ = (z1γ1q1a1ω1, . . . , znγnqnanωn) and

• $\chi' = (z_1\gamma_1'q_1'\omega_1, \ldots, z_n\gamma_n'q_n'\omega_n)$
be two n-configurations, where for all $i = 1, \ldots, n$, $q_i, q_i' \in Q_i$, $z_i \in \Gamma_i^*$, $\gamma_i \in \Gamma_i$, $\gamma_i' \in \Gamma_i^*$, $\omega_i \in \Sigma^*$, $a_i \in \Sigma \cup \{\varepsilon\}$, $r_i = \gamma_iq_ia_i \to \gamma_i'q_i' \in \delta_i$, and $(r_1, \ldots, r_n) \in \Psi$. Then, $\vartheta$ makes a computation step from n-configuration $\chi$ to $\chi'$, denoted $\chi \vdash \chi'$, and in the standard way, $\vdash^*$ and $\vdash^+$ denote the reflexive-transitive and the transitive closure of $\vdash$, respectively.

Informally, all PDAs in an n-MAS have to make a move at the same time, and the combination of transition rules concurrently applied by the PDAs has to be permitted by Ψ.

Definition 5.14 (n-Language of n-MAS) Let ϑ = (M1,...,Mn, Ψ) be an n-MAS and for every i = 1, . . . , n, Mi = (Qi, Σ, Γi, δi, si, zi,0, ∅) be a pushdown automaton accepting input strings by empty pushdown. Let

• χ0 = (z1,0s1ω1, . . . , zn,0snωn) be the start n-configuration and

• χf = (q1, . . . , qn) be a final n-configuration of n-MAS,

where for all $i = 1, \ldots, n$, $q_i \in Q_i$ and $\omega_i \in \Sigma^*$. The n-language of the n-MAS is defined as $n\text{-}L(\vartheta) = \{(\omega_1, \ldots, \omega_n) \mid \chi_0 \vdash^* \chi_f\}$.

Example 5.15 Consider a 2-MAS ϑ = (M1,M2, Ψ), where

• M1 = ({q0, q1, q2}, {a, b, c}, {., a}, δ1, q0, ., ∅),

• M2 = ({q0}, {a, b}, {., a}, δ2, q0, ., ∅),

1st PDA      2nd PDA    Rule(s)
.q0aabbcc    .q0aabb
.aq0abbcc    .aq0abb    (r1, r1)
.aaq0bbcc    .aaq0bb    (r2, r2)
.aq1bcc      .aq0b      (r3, r3)
.q1cc        .aq0b      (r4, r4)
.q2c         .aq0b      (r5, r4)
.q2          .q0        (r6, r3)
q2           q0         (r7, r7)

Table 5.2: Acceptance of (aabbcc, aabb) by 2-MAS from Example 5.15

• δ1 = { r1 = .q0a → .aq0, r2 = aq0a → aaq0, r3 = aq0b → q1, r4 = aq1b → q1, r5 = .q1c → .q2, r6 = .q2c → .q2, r7 = .q2 → q2 }
• δ2 = { r1 = .q0a → .aq0, r2 = aq0a → aaq0, r3 = aq0b → q0, r4 = aq0 → aq0, r5 = .q0 → .q0, r6 = aq0b → q0, r7 = .q0 → q0 }

• Ψ = {(r1, r1), (r2, r2), (r3, r3), (r4, r4), (r4, r5), (r5, r4), (r5, r5), (r6, r3), (r7, r7)}
The automata system works in a similar way as the 2-SAS in Example 5.5 does. Both PDAs start by pushing a's onto their pushdowns. As soon as b's become the first input symbols in both automata, the PDAs remove a's from the pushdowns and M1 moves to q1. In the state q1, M1 compares the number of a's and b's while M2 cycles by the transition rule r4 or r5. After that, if the numbers of a's and b's coincide, M1 moves to q2 while reading a c. By using the rules r6 and r3 in M1 and M2, respectively, M1 reads c's from its input while M2 compares the number of a's and b's. The last step of a successful computation is done by the rule r7 in M1 and in M2 at the same time. The 2-language that ϑ accepts is 2-L(ϑ) = {(a^m b^m c^m, a^m b^m) | m ∈ ℕ}. The acceptance of the 2-string (aabbcc, aabb) is shown in Table 5.2.
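The joint-step restriction can be simulated directly. The sketch below uses our own encoding: a rule is a tuple (pop, state, read, push, next-state) with the pushdown top at the right, rules carry names as in the example, and a breadth-first search looks for an accepting computation in which, at every step, the tuple of applied rule names belongs to Ψ:

```python
from collections import deque

def joint_steps(cfgs, deltas, psi):
    """All successor n-configurations: every component moves, and the
    tuple of applied rule names must be a member of psi."""
    succs = []
    for names in psi:
        nxt = []
        for (stack, state, inp), delta, name in zip(cfgs, deltas, names):
            pop, q, read, push, q2 = delta[name]
            if q == state and stack.endswith(pop) and inp.startswith(read):
                nxt.append((stack[: len(stack) - len(pop)] + push,
                            q2, inp[len(read):]))
            else:
                break
        if len(nxt) == len(cfgs):
            succs.append(tuple(nxt))
    return succs

def mas_accepts(deltas, psi, words):
    """Breadth-first search for a computation reaching empty pushdowns
    and empty inputs (initial pushdown symbol '.', start state q0)."""
    start = tuple((".", "q0", w) for w in words)
    seen, queue = {start}, deque([start])
    while queue:
        cfgs = queue.popleft()
        if all(s == "" and i == "" for s, _, i in cfgs):
            return True
        for nxt in joint_steps(cfgs, deltas, psi):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# The 2-MAS of Example 5.15
d1 = {"r1": (".", "q0", "a", ".a", "q0"), "r2": ("a", "q0", "a", "aa", "q0"),
      "r3": ("a", "q0", "b", "", "q1"), "r4": ("a", "q1", "b", "", "q1"),
      "r5": (".", "q1", "c", ".", "q2"), "r6": (".", "q2", "c", ".", "q2"),
      "r7": (".", "q2", "", "", "q2")}
d2 = {"r1": (".", "q0", "a", ".a", "q0"), "r2": ("a", "q0", "a", "aa", "q0"),
      "r3": ("a", "q0", "b", "", "q0"), "r4": ("a", "q0", "", "a", "q0"),
      "r5": (".", "q0", "", ".", "q0"), "r6": ("a", "q0", "b", "", "q0"),
      "r7": (".", "q0", "", "", "q0")}
psi = [("r1", "r1"), ("r2", "r2"), ("r3", "r3"), ("r4", "r4"),
       ("r4", "r5"), ("r5", "r4"), ("r5", "r5"), ("r6", "r3"), ("r7", "r7")]
```

The search accepts exactly the pairs (a^m b^m c^m, a^m b^m); note that the tuple (r5, r5) is what lets the m = 1 case through, since M2 then has to idle on the bottom symbol while M1 reads its single c.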

By Algorithm 5.16, it is shown that for any n-SAS, we are able to construct an n-MAS that defines the same n-language as the original n-SAS does. The pushdown automata of the n-MAS are defined in the same way as those of the n-SAS, but each of them additionally includes empty loops for all states. By the set Ψ, the n-MAS prescribes that an automaton has to use such a loop rule whenever it would be suspended in the n-SAS.

Algorithm 5.16 Construction of n-MAS from n-SAS

Input: An n-SAS $\vartheta = (M_1, \ldots, M_n, \Psi)$, where for all $i = 1, \ldots, n$, $M_i = (Q_i, \Sigma, \Gamma_i, \delta_i, s_i, z_i, \emptyset)$ is a PDA accepting by empty pushdown.

Output: An n-MAS $\hat{\vartheta} = (\hat{M}_1, \ldots, \hat{M}_n, \hat{\Psi})$, where for all $i = 1, \ldots, n$, $\hat{M}_i = (\hat{Q}_i, \Sigma, \hat{\Gamma}_i, \hat{\delta}_i, \hat{s}_i, \#, \emptyset)$ is a PDA accepting by empty pushdown and $n\text{-}L(\vartheta) = n\text{-}L(\hat{\vartheta})$.
Method:

• For all $i = 1, \ldots, n$, set $\hat{Q}_i = Q_i \cup \{\hat{s}_i\}$ and $\hat{\Gamma}_i = \Gamma_i \cup \{\#\}$.

• The sets $\hat{\delta}_i$ are constructed in the following way:

1. set $\hat{\delta}_i = \emptyset$,
2. add $\#\hat{s}_i \to \#z_is_i$ to $\hat{\delta}_i$, where $z_i$ is the initial pushdown symbol of the pushdown automaton $M_i$,

3. for every $r \in \delta_i$, add $r$ to $\hat{\delta}_i$,

4. for every $q \in Q_i$ and every $a \in \hat{\Gamma}_i$, add $r_{aq} = aq \to aq$ to $\hat{\delta}_i$,

5. for every $q \in Q_i$, add $\#q \to q$ to $\hat{\delta}_i$.

• $\hat{\Psi}$ is constructed as follows:
$\hat{\Psi} = \{(p_1, \ldots, p_n) \mid ((z_1s_1\omega_1)^e, \ldots, (z_ns_n\omega_n)^e) \vdash^* ((\gamma_1q_1\omega_1')^{h_1}, \ldots, (\gamma_nq_n\omega_n')^{h_n})$ in $\vartheta$ and, for all $i = 1, \ldots, n$, $\#y_iq_ia_i \vdash \#\alpha_iq_i'$ $[p_i]$ in $\hat{M}_i$, where $p_i = y_iq_i \to y_iq_i$ and $a_i = \varepsilon$ if $h_i = d$, and otherwise $p_i = y_iq_ia_i \to \alpha_iq_i'$, with $\omega_i, \omega_i' \in \Sigma^*$, $q_i, q_i' \in Q_i$, $y_i \in \hat{\Gamma}_i$, $\alpha_i \in \hat{\Gamma}_i^*$, $a_i \in \Sigma \cup \{\varepsilon\}\}$
$\cup \{(\#q_1 \to q_1, \ldots, \#q_n \to q_n) \mid q_i \in Q_i$ for all $i = 1, \ldots, n\}$
$\cup \{(\#\hat{s}_1 \to \#z_1s_1, \ldots, \#\hat{s}_n \to \#z_ns_n)\}$.

Lemma 5.17 Algorithm 5.16 is correct.

Proof of Lemma 5.17. For the proof, we establish the following claims.

Claim 5.18 Let $((z_1s_1\omega_1)^{h_1}, \ldots, (z_ns_n\omega_n)^{h_n}) \vdash^* ((\gamma_1'q_1'\omega_1')^{h_1'}, \ldots, (\gamma_n'q_n'\omega_n')^{h_n'})$ in $\vartheta$. Then $(\#z_1s_1\omega_1, \ldots, \#z_ns_n\omega_n) \vdash^* (\#\gamma_1'q_1'\omega_1', \ldots, \#\gamma_n'q_n'\omega_n')$ in $\hat{\vartheta}$.
Proof of Claim 5.18. By induction on the number of computation steps.
Basis: Let $((z_1s_1\omega_1)^{h_1}, \ldots, (z_ns_n\omega_n)^{h_n}) \vdash^0 ((\gamma_1'q_1'\omega_1')^{h_1'}, \ldots, (\gamma_n'q_n'\omega_n')^{h_n'})$ in $\vartheta$. Visibly, the two n-configurations coincide, and $(\#\gamma_1'q_1'\omega_1', \ldots, \#\gamma_n'q_n'\omega_n') \vdash^0 (\#\gamma_1'q_1'\omega_1', \ldots, \#\gamma_n'q_n'\omega_n')$ in $\hat{\vartheta}$.
Induction hypothesis: Suppose that Claim 5.18 holds for $j$ or fewer computation steps.
Induction step: Let
$((z_1s_1\omega_1)^{h_1}, \ldots, (z_ns_n\omega_n)^{h_n}) \vdash^j ((\gamma_1q_1\omega_1')^{h_1'}, \ldots, (\gamma_nq_n\omega_n')^{h_n'}) \vdash ((\gamma_1'q_1'\omega_1'')^{h_1''}, \ldots, (\gamma_n'q_n'\omega_n'')^{h_n''})$
in $\vartheta$. By the induction hypothesis,
$(\#z_1s_1\omega_1, \ldots, \#z_ns_n\omega_n) \vdash^* (\#\gamma_1q_1\omega_1', \ldots, \#\gamma_nq_n\omega_n')$ in $\hat{\vartheta}$. For every $i = 1, \ldots, n$ such that:

• $h_i' = d$: then $\gamma_i'q_i'\omega_i'' = \gamma_iq_i\omega_i'$. By 4. of $\hat{\delta}_i$, there is a rule $r_i \in \hat{\delta}_i$ such that $\#\gamma_iq_i\omega_i' \vdash \#\gamma_iq_i\omega_i'$ $[r_i]$ in $\hat{M}_i$.

• $h_i' = e$: then $\gamma_iq_i\omega_i' \vdash \gamma_i'q_i'\omega_i''$ $[r_i]$ in $M_i$. Since $r_i \in \delta_i$, also $r_i \in \hat{\delta}_i$, and $\#\gamma_iq_i\omega_i' \vdash \#\gamma_i'q_i'\omega_i''$ $[r_i]$ in $\hat{M}_i$.
Furthermore, $(r_1, \ldots, r_n) \in \hat{\Psi}$. Therefore, from $\hat{\delta}_i$ and $\hat{\Psi}$, $(\#\gamma_1q_1\omega_1', \ldots, \#\gamma_nq_n\omega_n') \vdash (\#\gamma_1'q_1'\omega_1'', \ldots, \#\gamma_n'q_n'\omega_n'')$ in $\hat{\vartheta}$. Claim 5.18 holds.

Claim 5.19 Let $(\#z_1s_1\omega_1, \ldots, \#z_ns_n\omega_n) \vdash^* (\#\gamma_1'q_1'\omega_1', \ldots, \#\gamma_n'q_n'\omega_n')$ in the n-MAS $\hat{\vartheta}$. Then $((z_1s_1\omega_1)^{h_1}, \ldots, (z_ns_n\omega_n)^{h_n}) \vdash^* ((\gamma_1'q_1'\omega_1')^{h_1'}, \ldots, (\gamma_n'q_n'\omega_n')^{h_n'})$ in $\vartheta$.
Proof of Claim 5.19. By induction on the number of computation steps.
Basis: For zero steps the two n-configurations coincide, and the corresponding n-configuration of $\vartheta$ satisfies the claim with $h_i' = h_i$ for every $i = 1, \ldots, n$.
Induction hypothesis: Suppose that Claim 5.19 holds for $j$ or fewer computation steps.
Induction step: Let $(\#z_1s_1\omega_1, \ldots, \#z_ns_n\omega_n) \vdash^j (\#\gamma_1q_1\omega_1', \ldots, \#\gamma_nq_n\omega_n') \vdash (\#\gamma_1'q_1'\omega_1'', \ldots, \#\gamma_n'q_n'\omega_n'')$ in $\hat{\vartheta}$. By the induction hypothesis, $((z_1s_1\omega_1)^{h_1}, \ldots, (z_ns_n\omega_n)^{h_n}) \vdash^* ((\gamma_1q_1\omega_1')^{h_1'}, \ldots, (\gamma_nq_n\omega_n')^{h_n'})$ in $\vartheta$. For every $i = 1, \ldots, n$ such that:

0 0 0 00 0 • #γiqiωi 6= #γiqiωi , there is a rule ri ∈ δbi constructed by 1 of δbi such that γiqiωi ` 0 0 00 0 0 0 0 00 γiqiωi [ri] in Mci. Therefore, ri ∈ δi and from Ψ,b hi = e. That is, γiqiωi ` γiqiωi [ri] in Mi.

0 0 0 00 0 0 0 00 • #γiqiωi = #γiqiωi , there must be a rule ri ∈ δbi such that #γiqiωi ` #γiqiωi [ri] in 0 0 0 Mci. From the construction of δbi,ri ∈ δi ⇔ hi = e. Therefore, either γiqiωi ` γiqiωi[ri] in Mi, or hi = d and automaton Mi is blocked.

0 0 00 00 0 h 0 hn 0 0 00 h 0 0 00 hn In this way, ((γ1q1ω1) 1 ,..., (γnqnωn) ) ` ((γ1q1ω1 ) 1 ,..., (γnqnωn) ) in ϑb. That is, Claim 5.19 holds.

By 2. of δ̂_i and because (#ŝ_1 → #z_1s_1, ..., #ŝ_n → #z_ns_n) ∈ Ψ̂, the first step of ϑ̂ is (#ŝ_1ω_1, ..., #ŝ_nω_n) ⊢ (#z_1s_1ω_1, ..., #z_ns_nω_n). Similarly, the last step, for acceptance, is a step of the form (#q_1, ..., #q_n) ⊢ (q_1, ..., q_n), because of 1. in the construction of δ̂_i and (#q_1 → q_1, ..., #q_n → q_n) ∈ Ψ̂. Hence, from Claims 5.18 and 5.19, Lemma 5.17 holds.

It is well-known that for every PDA M, there is a context-free grammar G such that L(M) = L(G). The conversion of a PDA into a CFG is usually performed by Algorithm 5.20.

Algorithm 5.20 Construction of a CFG from a PDA

Input: A PDA M = (Q_M, Σ_M, Γ_M, δ_M, s_0, Z_0, ∅)

Output: A CFG G = (N_G, T_G, S_G, P_G) such that L(M) = L(G)
Method:

• Set N_G = {⟨qAp⟩ | p, q ∈ Q_M, A ∈ Γ_M} ∪ {S}, P_G = {S → ⟨s_0Z_0p⟩ | p ∈ Q_M}, and T_G = Σ_M.

• For each r = Aq_0a → B_n...B_1q_1 ∈ δ_M, where a ∈ Σ_M ∪ {ε}, q_0, q_1 ∈ Q_M, A, B_1, ..., B_n ∈ Γ_M, for some n ≥ 0, do:

– ρ(r) = {⟨q_0Aq_{n+1}⟩ → a⟨q_1B_1q_2⟩...⟨q_nB_nq_{n+1}⟩ | q_2, ..., q_{n+1} ∈ Q_M};

– set P_G = P_G ∪ ρ(r).
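The expansion step ρ(r) of Algorithm 5.20 can be sketched in a few lines of Python. The encoding below (transitions as tuples, triples ⟨qAp⟩ as nested tuples, and the function name rho) is our own illustration, not the thesis's notation.

```python
from itertools import product

def rho(rule, states):
    """Expand one PDA transition A q0 a -> Bn...B1 q1 into the CFG rules
    <q0 A q_{n+1}> -> a <q1 B1 q2> ... <qn Bn q_{n+1}> of Algorithm 5.20."""
    q0, a, A, Bs, q1 = rule               # Bs = [B1, ..., Bn], n >= 0
    n = len(Bs)
    cfg_rules = set()
    for qs in product(states, repeat=n):  # guess q2, ..., q_{n+1} freely
        chain = [q1] + list(qs)
        lhs = (q0, A, chain[-1])          # the triple <q0 A q_{n+1}>
        rhs = (a,) + tuple((chain[k], Bs[k], chain[k + 1]) for k in range(n))
        cfg_rules.add((lhs, rhs))
    return cfg_rules

# a popping transition (n = 0): Z p a -> q yields only the rule <p Z q> -> a
print(rho(("p", "a", "Z", [], "q"), {"p", "q"}))
```

For a pushing transition with n pushdown symbols, the set contains |Q_M|^n rules, one for each guess of the intermediate states.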

Lemma 5.21 For any n-MAS ϑ, there is an n-CGR Γ such that L(ϑ) = L(Γ).

Proof of Lemma 5.21. Consider an n-MAS ϑ = (M_1, ..., M_n, Ψ). It is indisputable that Algorithm 5.20 can be applied to each component of the automata system, and in this way, the components of an n-CGR Γ = (G_1, ..., G_n, Q) can be constructed. Observe that a set of context-free rules ρ_i(r) is constructed for every transition rule r of each PDA M_i of ϑ, and for every i = 1, ..., n, L(M_i) = L(G_i). Furthermore, for every i = 1, ..., n, G_i simulates M_i step by step (see p. 486–491 in [80]). It is easy to see that L(Γ) = L(ϑ) for Γ = (G_1, ..., G_n, Q), where Q = {(r_1, ..., r_n) | (p_1, ..., p_n) ∈ Ψ and for all i = 1, ..., n, r_i ∈ ρ_i(p_i)}.

This chapter is closed by the following theorem, which says that the investigated systems define the same class of n-languages; in other words, the systems are equivalent.

Theorem 5.22 The classes of n-languages given by n-CGN, n-CGR, n-SAS, and n-MAS coincide.

Proof of Theorem 5.22. From [82], we know that the classes of n-languages generated by n-CGN and n-CGR coincide. Hence, from Lemma 5.7, Lemma 5.17, and from Lemma 5.21, Theorem 5.22 holds. 

Chapter 6

Classes of n-Languages: Hierarchy and Properties

In the previous chapter, we have shown that multi-generative grammar systems have automata counterparts, which accept the same n-languages that the grammar systems generate. However, only context-free grammars and pushdown automata have been considered as the components of these systems. In the following sections, we generalize the theory of n-languages and discuss hybrid canonical rule-synchronized n-generative grammar systems and hybrid n-accepting move-restricted automata systems, where components with different generative and accepting power can be used in one grammar and automata system, respectively. More specifically, in Section 6.1, we introduce grammar systems, which combine right-linear grammars, linear grammars, and context-free grammars, and automata systems, which combine finite automata, 1-turn pushdown automata, and pushdown automata in one instance. After that, in Section 6.2, the class hierarchy based on the mentioned models is established. In addition, several closure properties are discussed.

6.1 Definitions

Similarly to an n-CGR, a hybrid canonical rule-synchronized n-generative grammar system consists of n generative components (grammars) and a control set of n-tuples of rules. Each grammar generates its own string in the leftmost way, while the control set determines which grammar rules can be used at the same derivation step in all components. In this work, we study combinations of right-linear, linear, and context-free grammars (see Definition 6.1).

Definition 6.1 (Hybrid canonical rule-synchronized n-generative grammar system) A hybrid canonical rule-synchronized n-generative grammar system, HCGR^{(t_1,...,t_n)} for short, is an (n+1)-tuple Γ = (G_1, ..., G_n, Q), where

• Gi = (Ni,Ti,Pi,Si) is a right-linear, linear, or context-free grammar for every i = 1, . . . , n,

• Q is a finite set of n-tuples of the form (r1, . . . , rn), where ri ∈ Pi for every i = 1, . . . , n, and

• for all i = 1, ..., n, t_i ∈ {RLNG, LNG, CFG} denotes the type of the ith component.

A sentential n-form of an HCGR^{(t_1,...,t_n)} is an n-tuple χ = (x_1, ..., x_n), where x_i ∈ (N_i ∪ T_i)^* for all i = 1, ..., n.
Consider two sentential n-forms, χ = (u_1A_1v_1, ..., u_nA_nv_n) and χ' = (u_1x_1v_1, ..., u_nx_nv_n), with

• Ai ∈ Ni,

• u_i ∈ T_i^*,

• v_i, x_i ∈ (N_i ∪ T_i)^*,

• ri = Ai → xi ∈ Pi, for all i = 1, . . . , n, and

• (r_1, ..., r_n) ∈ Q.

Then, χ ⇒ χ', and ⇒^* and ⇒^+ are the reflexive-transitive and transitive closures of ⇒, respectively.
The n-language of Γ is defined as n-L(Γ) = {(w_1, ..., w_n) | (S_1, ..., S_n) ⇒^* (w_1, ..., w_n), w_i ∈ T_i^* for all i = 1, ..., n}.
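To make the synchronized leftmost derivation step concrete, the following minimal Python sketch applies one rule pair from Q to a 2-tuple of sentential forms; the list encoding and the names step, r_loop, and r_stop are our own assumptions, not the thesis's notation.

```python
def step(forms, pair, nonterminals):
    """Rewrite the leftmost nonterminal of every sentential form with the
    corresponding rule of the pair from Q; None if some rule does not fit."""
    out = []
    for form, (lhs, rhs) in zip(forms, pair):
        pos = next((k for k, s in enumerate(form) if s in nonterminals), None)
        if pos is None or form[pos] != lhs:
            return None                    # the tuple from Q is not applicable
        out.append(form[:pos] + rhs + form[pos + 1:])
    return out

N = {"S1", "S2"}
r_loop = (("S1", ["a", "S1"]), ("S2", ["b", "S2"]))  # (S1 -> aS1, S2 -> bS2)
r_stop = (("S1", ["a"]), ("S2", ["b"]))              # (S1 -> a,   S2 -> b)

forms = step([["S1"], ["S2"]], r_loop, N)
forms = step(forms, r_stop, N)
print(forms)                              # [['a', 'a'], ['b', 'b']]
```

Synchronizing exactly these two rule pairs yields the 2-language {(a^n, b^n) | n ≥ 1}.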

The generalized version of the n-MAS is the hybrid n-accepting move-restricted automata system. In accordance with the n-MAS (see Section 5.2), a hybrid n-accepting move-restricted automata system is a vector of automata, each working on its own input, together with a set of n-tuples of transition rules. This set restricts the moves that the components can make in the same computation step. The automata system accepts the n-tuple of input strings if and only if all components accept their inputs. The generalized version allows a combination of different types of automata in one automata system.

Definition 6.2 (Hybrid n-accepting move-restricted automata system) A hybrid n-accepting move-restricted automata system, denoted by HMAS^{(t_1,...,t_n)}, is defined as an (n+1)-tuple ϑ = (M_1, ..., M_n, Ψ) with M_i as a finite or (1-turn) pushdown automaton, for all i = 1, ..., n, and with Ψ as a finite set of n-tuples of the form (r_1, ..., r_n), where for every j = 1, ..., n, r_j ∈ δ_j in M_j. Furthermore, for all i = 1, ..., n, t_i ∈ {FA, 1-turn PDA, PDA} indicates the type of the ith automaton.
An n-configuration is an n-tuple χ = (x_1, ..., x_n), where for all i = 1, ..., n, x_i is a configuration of M_i. Let χ = (x_1, ..., x_n) and χ' = (x'_1, ..., x'_n) be two n-configurations, where for all i = 1, ..., n, x_i ⊢ x'_i [r_i] in M_i, and (r_1, ..., r_n) ∈ Ψ. Then, ϑ makes a computation step from χ to χ', denoted by χ ⊢ χ', and in the standard way, ⊢^* and ⊢^+ denote the reflexive-transitive and the transitive closure of ⊢, respectively.
Let χ_0 = (x_1ω_1, ..., x_nω_n) be the start and χ_f = (q_1, ..., q_n) be a final n-configuration of the HMAS^{(t_1,...,t_n)}, where for all i = 1, ..., n, ω_i is the input string of M_i and q_i is a state of M_i. The n-language of the HMAS^{(t_1,...,t_n)} is defined as n-L(ϑ) = {(ω_1, ..., ω_n) | χ_0 ⊢^* χ_f and for every i = 1, ..., n, M_i accepts}.
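The synchronized move relation of Definition 6.2 can be sketched for FA components as follows. The triple encoding of transitions and the rule names are our own assumptions; an epsilon move is modelled by reading the empty string.

```python
def hmas_step(confs, rules, psi):
    """All n-configurations reachable in one synchronized step: a tuple
    (r_1, ..., r_n) from Psi applies only if every r_i applies in M_i."""
    succ = []
    for tup in psi:
        nxt = []
        for (state, inp), name in zip(confs, tup):
            p, read, q = rules[name]       # read == "" is an epsilon move
            if p != state or not inp.startswith(read):
                break
            nxt.append((q, inp[len(read):]))
        else:
            succ.append(tuple(nxt))
    return succ

rules = {"r1": ("s", "a", "s"), "r2": ("t", "b", "t")}
psi = [("r1", "r2")]                       # M_1 may read a only if M_2 reads b
confs = hmas_step((("s", "aa"), ("t", "bb")), rules, psi)[0]
print(confs)                               # (('s', 'a'), ('t', 'b'))
```

A component whose rule does not apply blocks the whole tuple, which is exactly the move restriction imposed by Ψ.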

Convention 6.3 In the special case where all components are of type X, we write nX instead of (X, ..., X). For example, a hybrid n-accepting move-restricted automata system in which all components are PDAs is denoted by HMAS^{nPDA}. When no attention is paid to the number and types of components, we write HMAS and HCGR rather than HMAS^{(t_1,...,t_n)} and HCGR^{(t_1,...,t_n)}, respectively.

Definition 6.4 (Classes of n-languages)
– L(HMAS^{(t_1,...,t_n)}) is the class of n-languages accepted by HMAS^{(t_1,...,t_n)};
– L(HCGR^{(t_1,...,t_n)}) is the class of n-languages generated by HCGR^{(t_1,...,t_n)}.

6.2 Class Hierarchy and Closure Properties

We start this section with the following lemma, which says that for every HCGR^{(t_1,...,t_n)} containing RLNGs and CFGs, we can construct an HMAS^{(t'_1,...,t'_n)}, containing FAs and PDAs, that defines the same n-language as the HCGR^{(t_1,...,t_n)} does.

Lemma 6.5 For every HCGR^{(t_1,...,t_n)} Γ̂ = (Ĝ_1, ..., Ĝ_n, Q̂), where Ĝ_i is either an RLNG or a CFG for all i = 1, ..., n, there is an HMAS^{(t'_1,...,t'_n)} ϑ = (M_1, ..., M_n, Ψ) such that n-L(ϑ) = n-L(Γ̂) and, for all i = 1, ..., n, M_i is either an FA or a PDA.

Proof of Lemma 6.5. Consider an HCGR^{(t_1,...,t_n)} Γ̂ = (Ĝ_1, ..., Ĝ_n, Q̂) and, by the following algorithm, construct an HMAS^{(t'_1,...,t'_n)} ϑ = (M_1, ..., M_n, Ψ). For every i = 1, ..., n,

• if Ĝ_i = (N̂_i, T̂_i, P̂_i, Ŝ_i) is an RLNG, then construct an FA M_i = (Q_i, Σ_i, δ_i, ⟨Ŝ_i⟩, F_i) in the following way:

– set Q_i = {⟨A⟩ | A ∈ N̂_i} ∪ {f_i};

– set Σ_i = T̂_i;

– set δ_i = {⟨A⟩a → ⟨B⟩ | A → aB ∈ P̂_i} ∪ {⟨A⟩ → ⟨A⟩ | ⟨A⟩ ∈ Q_i} ∪ {⟨A⟩a → f_i | A → a ∈ P̂_i};

– set Fi = {fi};

• if Ĝ_i = (N̂_i, T̂_i, P̂_i, Ŝ_i) is a CFG, construct a PDA M_i = (Q_i, Σ_i, Γ_i, δ_i, q'_i, ▷, ∅) as follows:

– set Q_i = {q_i, q'_i};

– set Σ_i = T̂_i;

– set Γ_i = T̂_i ∪ N̂_i ∪ {▷};
– set δ_i = {▷q'_i → ▷Ŝ_iq_i} ∪ {Aq_i → (x)^R q_i | A → x ∈ P̂_i} ∪ {Aq_i → Aq_i | A ∈ N̂_i ∪ {▷}} ∪ {aq_ia → q_i | a ∈ T̂_i} ∪ {▷q_i → q_i};

The set Ψ is constructed in this way: set Ψ = {(♦_1p'_1 → ♦_1t_1p_1, ..., ♦_np'_n → ♦_nt_np_n)} ∪ {(♦_1p̂_1 → p̂'_1, ..., ♦_np̂_n → p̂'_n)}, where

• if M_i is an FA, then ♦_it_i = ε, p_i = p'_i = ⟨Ŝ_i⟩, p̂_i ∈ Q_i − {f_i}, and p̂'_i = f_i; otherwise, p_i = p̂_i = p̂'_i = q_i, p'_i = q'_i, ♦_i = ▷, and ♦_it_i = ▷Ŝ_i;

• otherwise, for all r_i ∈ δ_i such that r_i is constructed from some r'_i ∈ P̂_i with (r'_1, ..., r'_n) ∈ Q̂, include (r_1, ..., r_n) in Ψ;

• for all r_i ∈ δ_i such that r_i is not constructed from any r'_i ∈ P̂_i and, for at least one i = 1, ..., n, r_i is of the form aq_ia → q_i with a ∈ T̂_i, include (r_1, ..., r_n) in Ψ.

Now, we can prove the following claims.

Claim 6.6 Let ϑ be the automata system constructed from the grammar system Γ̂ by the previous algorithm. If (Ŝ_1, ..., Ŝ_n) ⇒^* (u_1A_1v_1, ..., u_nA_nv_n) in Γ̂ with A_i ∈ N̂_i for every i = 1, ..., n, and (u_1A_1v_1, ..., u_nA_nv_n) ⇒^* (ω_1, ..., ω_n), then (x_1ω_1, ..., x_nω_n) ⊢^* (♦_1(v_1)^R t_1p_1ω'_1, ..., ♦_n(v_n)^R t_np_nω'_n) in ϑ, where for all 1 ≤ i ≤ n, ω_i = u_iω'_i, and if M_i is a PDA, then x_i = ▷q'_i, t_i = A_i, p_i = q_i, and ♦_i = ▷; otherwise, x_i = ⟨Ŝ_i⟩, ♦_it_i = ε, and p_i = ⟨A_i⟩.

Proof of Claim 6.6. By induction on the number of derivation steps.
Basis. Let (Ŝ_1, ..., Ŝ_n) ⇒^0 (Ŝ_1, ..., Ŝ_n) in Γ̂. Then, (x_1ω_1, ..., x_nω_n) = (♦_1p'_1ω_1, ..., ♦_np'_nω_n) ⊢ (♦_1t_1p_1ω_1, ..., ♦_nt_np_nω_n), where ♦_it_i = ε if M_i is an FA; otherwise, ♦_it_ip_i = ▷Ŝ_iq_i and x_i = ▷q'_i. The claim holds.
Induction Hypothesis. Suppose that Claim 6.6 holds for j or fewer derivation steps, where j is a non-negative integer.
Induction Step. Consider a derivation of the form (Ŝ_1, ..., Ŝ_n) ⇒^{j+1} (u'_1A'_1v'_1, ..., u'_nA'_nv'_n) in Γ̂, where (u'_1A'_1v'_1, ..., u'_nA'_nv'_n) ⇒^* (ω_1, ..., ω_n). We can express this derivation as (Ŝ_1, ..., Ŝ_n) ⇒^j (u_1B_1v_1, ..., u_nB_nv_n) ⇒ (u'_1A'_1v'_1, ..., u'_nA'_nv'_n). By the induction hypothesis, (x_1ω_1, ..., x_nω_n) ⊢^* (♦_1(v_1)^R t_1p_1ω'_1, ..., ♦_n(v_n)^R t_np_nω'_n), where for all i = 1, ..., n, ω_i = u_iω'_i, and if M_i is a PDA, then x_i = ▷q'_i, t_i = B_i, and p_i = q_i; otherwise, x_i = ⟨Ŝ_i⟩, t_i = ε, and p_i = ⟨B_i⟩. Because (u_1B_1v_1, ..., u_nB_nv_n) ⇒ (u'_1A'_1v'_1, ..., u'_nA'_nv'_n), there have to be n rules, r'_1, ..., r'_n, such that u_iB_iv_i ⇒ u'_iA'_iv'_i [r'_i] in Ĝ_i. From the construction of ϑ, there are n rules, r_1, ..., r_n, created from r'_1, ..., r'_n in δ_i. These rules are of the form r_i = ⟨B_i⟩a → ⟨A'_i⟩ if M_i is an FA, and r_i = B_iq_i → (x_i)^R q_i otherwise. As (u_1B_1v_1, ..., u_nB_nv_n) ⇒ (u'_1A'_1v'_1, ..., u'_nA'_nv'_n) in Γ̂, (r'_1, ..., r'_n) ∈ Q̂ and (r_1, ..., r_n) ∈ Ψ. Hence, for all i = 1, ..., n with ω'_i = a_iω''_i and a_i ∈ Σ_i ∪ {ε}: if M_i is a PDA, then ▷(v_i)^R B_iq_iω'_i ⊢ ▷(x_iv_i)^R q_iω'_i; otherwise, ⟨B_i⟩a_iω''_i ⊢ ⟨A'_i⟩ω''_i. After this step, some PDAs can have a terminal symbol as the topmost pushdown symbol. In this case, from the construction of Ψ, these PDAs have to compare the topmost pushdown symbols with their inputs until no terminal symbol is on the top of their pushdowns. The other automata loop and wait until no PDA is reading its input. Therefore, (♦_1(v_1)^R t_1p_1ω'_1, ..., ♦_n(v_n)^R t_np_nω'_n) ⊢^* (♦_1(v'_1)^R t'_1p'_1ω''_1, ..., ♦_n(v'_n)^R t'_np'_nω''_n) in ϑ, where t'_i = A'_i and p'_i = q_i in the case of a PDA; otherwise, t'_i = ε and p'_i = ⟨A'_i⟩. Claim 6.6 holds.

Claim 6.7 Let ϑ be the automata system constructed from the grammar system Γ̂ by the previous algorithm. If (Ŝ_1, ..., Ŝ_n) ⇒^* (ω_1, ..., ω_n) in Γ̂ with ω_i ∈ T̂_i^* for every i = 1, ..., n, then (x_1ω_1, ..., x_nω_n) ⊢^* (p_1, ..., p_n) in ϑ, where for all i = 1, ..., n, if M_i is a PDA, then p_i = q_i; otherwise, p_i = f_i.

Proof of Claim 6.7. Consider (Ŝ_1, ..., Ŝ_n) ⇒^* (u_1A_1v_1, ..., u_nA_nv_n) ⇒ (u_1x_1v_1, ..., u_nx_nv_n) in Γ̂ with u_ix_iv_i ∈ T̂_i^* for all i = 1, ..., n. By Claim 6.6, (x_1ω_1, ..., x_nω_n) ⊢^* (♦_1(v_1)^R t_1p_1ω'_1, ..., ♦_n(v_n)^R t_np_nω'_n) in ϑ, where for all i = 1, ..., n, ω_i = u_iω'_i, and if M_i is a PDA, then x_i = ▷q'_i, t_i = A_i, p_i = q_i, and ♦_i = ▷; otherwise, x_i = ⟨Ŝ_i⟩, ♦_it_i = ε, and p_i = ⟨A_i⟩. Because r_i = A_i → x_i ∈ P̂_i for all i = 1, ..., n and (r_1, ..., r_n) ∈ Q̂, there is r'_i = A_iq_i → (x_i)^R q_i ∈ δ_i for every pushdown automaton M_i, r'_i = ⟨A_i⟩x_i → f_i for every FA M_i, and (r'_1, ..., r'_n) ∈ Ψ. Therefore, the next configuration of each PDA and FA M_i is of the form ▷(x_iv_i)^R q_ix_iv_i and f_i, respectively. By the definition of Ψ, each PDA with a terminal symbol on the pushdown top makes a move by aq_ia → q_i until ▷ is the first pushdown symbol. The others loop and read no symbol. When all PDAs have ▷ on the pushdown top, the ▷s are removed and ϑ accepts. Hence, Claim 6.7 holds.

Claim 6.8 Let ϑ be the automata system constructed from the grammar system Γ̂ by the previous algorithm. If (x_1ω_1, ..., x_nω_n) ⊢^* (♦_1(v_1)^R t_1p_1, ..., ♦_n(v_n)^R t_np_n) in ϑ, where for all i = 1, ..., n, ω_i = u_iω'_i, and if M_i is a PDA, then x_i = ▷q'_i, t_i = A_i, p_i = q_i, and ♦_i = ▷; otherwise, x_i = ⟨Ŝ_i⟩, ♦_it_i = ε, and p_i = ⟨A_i⟩, then (Ŝ_1, ..., Ŝ_n) ⇒^* (ω_1, ..., ω_n) in Γ̂ with u_i ∈ T̂_i^*, A_i ∈ N̂_i, and v_i ∈ (T̂_i ∪ N̂_i)^* for every i = 1, ..., n.

Proof of Claim 6.8. Consider any successful acceptance:

(x_1ω_1, ..., x_nω_n) ⊢^* (♦_1p_1, ..., ♦_np_n), (II)

where for all i = 1, ..., n, if M_i is a PDA, then x_i = ▷q'_i, p_i = q_i, and ♦_i = ▷; otherwise, x_i = ⟨Ŝ_i⟩, ♦_i = ε, and p_i ∈ F_i. The automata system starts with the initialization of the PDAs, i.e., after the first computation step, each PDA M_i must have the configuration ▷Ŝ_iq_iω_i, while the configurations of all FAs are unchanged. By the construction of Ψ, the PDAs have to rewrite the nonterminals on the tops of their pushdowns at the same time, and they can do so only if every PDA has a nonterminal symbol on the pushdown top. Otherwise, all PDAs having terminal symbols on the tops of their pushdowns compare their inputs with the terminal symbols, while the others loop without reading any symbol. The same holds for ▷, which ϑ can remove only at the same time and only if the FAs are in final states. Therefore, (II) can be expressed as

(x_1ω_1^{(0)}, ..., x_nω_n^{(0)}) = (♦_1γ_1^{(1)}t_1^{(1)}p_1^{(1)}ω_1^{(1)}, ..., ♦_nγ_n^{(1)}t_n^{(1)}p_n^{(1)}ω_n^{(1)})
⊢^* (♦_1γ_1^{(2)}t_1^{(2)}p_1^{(2)}ω_1^{(2)}, ..., ♦_nγ_n^{(2)}t_n^{(2)}p_n^{(2)}ω_n^{(2)})
⊢^* ...
⊢^* (♦_1γ_1^{(k−1)}t_1^{(k−1)}p_1^{(k−1)}ω_1^{(k−1)}, ..., ♦_nγ_n^{(k−1)}t_n^{(k−1)}p_n^{(k−1)}ω_n^{(k−1)})
⊢^* (♦_1p_1^{(k)}, ..., ♦_np_n^{(k)}) ⊢ (p_1^{(k)}, ..., p_n^{(k)}),

where for all i = 1, ..., n and for all j = 1, ..., k, ω_i^{(j−1)} = u_i^{(j)}ω_i^{(j)}, and: if M_i is a PDA, then x_i = ▷q'_i, ♦_iγ_i^{(j)}t_i^{(j)} ∈ {▷}(N̂_i ∪ T̂_i)^* N̂_i, and p_i^{(j)} = q_i; otherwise, ♦_iγ_i^{(j)} = ε and p_i^{(j)} = ⟨A_i^{(j)}⟩. The computation of ϑ can be simulated by Γ̂ as

(Ŝ_1, ..., Ŝ_n)
⇒ (u_1^{(1)}A_1^{(1)}v_1^{(1)}, ..., u_n^{(1)}A_n^{(1)}v_n^{(1)})
⇒ (u_1^{(2)}A_1^{(2)}v_1^{(2)}, ..., u_n^{(2)}A_n^{(2)}v_n^{(2)})
...
⇒ (u_1^{(k−1)}A_1^{(k−1)}v_1^{(k−1)}, ..., u_n^{(k−1)}A_n^{(k−1)}v_n^{(k−1)})
⇒ (u_1^{(k)}, ..., u_n^{(k)}) = (ω_1, ..., ω_n),

where for every i = 1, ..., n and every j = 1, ..., k, u_i^{(j)} ∈ T̂_i^*, A_i^{(j)} ∈ N̂_i, v_i^{(j)} ∈ (T̂_i ∪ N̂_i)^*, and u_i^{(j)}ω_i^{(j)} = ω_i. Hence, Claim 6.8 holds.

From Claim 6.6, Claim 6.7, and Claim 6.8, Lemma 6.5 holds. 

To prove that HCGR^{(t_1,...,t_n)}, based on RLNGs and CFGs, and HMAS^{(t'_1,...,t'_n)}, based on FAs and PDAs, are equivalent, Lemma 6.9 is needed.

Lemma 6.9 For every HMAS^{(t'_1,...,t'_n)} ϑ = (M_1, ..., M_n, Ψ), where the components are FAs and PDAs, there is an HCGR^{(t_1,...,t_n)} Γ̂ = (Ĝ_1, ..., Ĝ_n, Q̂) such that n-L(ϑ) = n-L(Γ̂) and, for all i = 1, ..., n, Ĝ_i is either an RLNG or a CFG.

Proof of Lemma 6.9. It is well-known that for every PDA and FA M, there is a CFG and an RLNG G, respectively, such that L(M) = L(G). The algorithm that we use for the construction of a CFG G = (N_G, T_G, P_G, S_G) from a PDA M = (Q_M, Σ_M, Γ_M, δ_M, s_0, Z_0, ∅) is defined in the following way:

• set N_G = {⟨qAp⟩ | p, q ∈ Q_M, A ∈ Γ_M} ∪ {S}, P_G = {S → ⟨s_0Z_0p⟩ | p ∈ Q_M}, and T_G = Σ_M;

• for each r = Aq_0a → B_n...B_1q_1 ∈ δ_M, where a ∈ Σ_M ∪ {ε}, q_0, q_1 ∈ Q_M, A, B_1, ..., B_n ∈ Γ_M, for some n ≥ 0, do:

– set ρ(r) = {⟨q_0Aq_{n+1}⟩ → a⟨q_1B_1q_2⟩...⟨q_nB_nq_{n+1}⟩ | q_2, ..., q_{n+1} ∈ Q_M};

– set PG = PG ∪ ρ(r);

Similarly, an RLNG G = (N_G, T_G, P_G, S_G) constructed from an FA M = (Q_M, Σ_M, δ_M, s_0, F_M) has the following form:

• set N_G = {A | A ∈ Q_M}, S_G = s_0, P_G = ∅, and T_G = Σ_M;

• for each r = Aa → B ∈ δ_M, where a ∈ Σ_M ∪ {ε} and A, B ∈ Q_M, do:

– set ρ(r) = {A → aB};

– if B ∈ FM , then ρ(r) = ρ(r) ∪ {A → a};

– set PG = PG ∪ ρ(r);

Consider an HMAS^{(t'_1,...,t'_n)} ϑ = (M_1, ..., M_n, Ψ). It is clear that the algorithms can be applied to each component of the automata system, and in this way, the components of an n-generative grammar system Γ = (G_1, ..., G_n, Q) can be constructed. Note that a set of grammar rules ρ_i(r) is constructed for every transition rule r of each automaton M_i of the automata system, and for every i = 1, ..., n, L(M_i) = L(G_i). Furthermore, for every i = 1, ..., n, G_i simulates M_i step by step (see p. 486–491 in [80]). It is easy to see that L(Γ) = L(ϑ) for Γ = (G_1, ..., G_n, Q), where Q = {(r_1, ..., r_n) | (p_1, ..., p_n) ∈ Ψ and for all i = 1, ..., n, r_i ∈ ρ_i(p_i)}. Lemma 6.9 holds.

Finally, Theorem 6.10 can be established.

Theorem 6.10 L(HMAS^{(t_1,...,t_n)}) = L(HCGR^{(t'_1,...,t'_n)}), where t_i ∈ {FA, PDA} and t'_i ∈ {CFG, RLNG} for all i = 1, ..., n.

Proof of Theorem 6.10. It directly follows from Lemma 6.5 and Lemma 6.9. 

In contrast to ordinary FAs, whose languages are closed under union, intersection, complementation, and concatenation, the class of n-languages accepted by HMASs consisting of finite automata is closed only under union and concatenation.

Theorem 6.11 If L_1, L_2 ∈ L(HMAS^{nFA}), then L_1 ∪ L_2 ∈ L(HMAS^{nFA}).

Proof of Theorem 6.11. Consider two n-languages, L_1, L_2 ∈ L(HMAS^{nFA}). Then there are HMAS^{nFA}s, ϑ_1 = (M_{1,1}, ..., M_{1,n}, Ψ_1) and ϑ_2 = (M_{2,1}, ..., M_{2,n}, Ψ_2), such that L_1 = L(ϑ_1), L_2 = L(ϑ_2), and for all i = 1, 2 and j = 1, ..., n, M_{i,j} = (Q_{i,j}, Σ_{i,j}, δ_{i,j}, s_{i,j}, F_{i,j})

50 is a component of ϑi. For these two automata systems, we can construct ϑ12 = (M12,1,..., M12,n, Ψ12), with M12,j = (Q12,j, Σ12,j, δ12,j, s12,j,F12,j), in the following way: for all i = 1, 2 and j = 1, . . . , n,

• Q12,j = Q1,j ∪ Q2,j ∪ {s12,j}, where s12,j is the new start state of jth automaton,

• δ12,j = δ1,j ∪ δ2,j ∪ {p1,j = s12,j → s1,j, p2,j = s12,j → s2,j},

• Σ12,j = Σ1,j ∪ Σ2,j,

• F12,j = F1,j ∪ F2,j, and

• Ψ_{12} = Ψ_1 ∪ Ψ_2 ∪ {(p_{1,1}, ..., p_{1,n}), (p_{2,1}, ..., p_{2,n})}.

From the set Ψ_{12}, it follows that the first computation step has to be (s_{12,1}ω_1, ..., s_{12,n}ω_n) ⊢ (s_{i,1}ω_1, ..., s_{i,n}ω_n) for i = 1 or i = 2, and for ω_j ∈ Σ_{12,j}^* with j = 1, ..., n. Therefore, (ω_1, ..., ω_n) ∈ L(ϑ_{12}) iff (ω_1, ..., ω_n) ∈ L(ϑ_1) or (ω_1, ..., ω_n) ∈ L(ϑ_2); that is, (ω_1, ..., ω_n) ∈ L(ϑ_{12}) iff (ω_1, ..., ω_n) ∈ L_1 ∪ L_2.
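The union construction can be prototyped directly. The dictionary encoding below (start states, final-state sets, and Ψ given as tuples of transitions) and the brute-force accepts search are our own sketch, not part of the thesis.

```python
from collections import deque

def hmas_union(sys1, sys2):
    """Fresh starts s12_j plus two synchronized epsilon-tuples that commit
    the whole system to sys1 or to sys2 in the first computation step."""
    n = len(sys1["starts"])
    starts = [("s12", j) for j in range(n)]
    to1 = tuple((starts[j], "", sys1["starts"][j]) for j in range(n))
    to2 = tuple((starts[j], "", sys2["starts"][j]) for j in range(n))
    return {"starts": starts,
            "finals": [sys1["finals"][j] | sys2["finals"][j] for j in range(n)],
            "psi": sys1["psi"] + sys2["psi"] + [to1, to2]}

def accepts(sys_, words):
    """Breadth-first search over synchronized configurations."""
    start = tuple((sys_["starts"][j], words[j]) for j in range(len(words)))
    seen, todo = {start}, deque([start])
    while todo:
        confs = todo.popleft()
        if all(w == "" and q in sys_["finals"][j]
               for j, (q, w) in enumerate(confs)):
            return True
        for tup in sys_["psi"]:
            nxt = []
            for (state, inp), (p, read, q) in zip(confs, tup):
                if p != state or not inp.startswith(read):
                    break
                nxt.append((q, inp[len(read):]))
            else:
                t = tuple(nxt)
                if t not in seen:
                    seen.add(t)
                    todo.append(t)
    return False

sys1 = {"starts": ["p1", "p2"], "finals": [{"f1"}, {"f2"}],
        "psi": [(("p1", "a", "f1"), ("p2", "b", "f2"))]}   # accepts (a, b)
sys2 = {"starts": ["u1", "u2"], "finals": [{"g1"}, {"g2"}],
        "psi": [(("u1", "b", "g1"), ("u2", "a", "g2"))]}   # accepts (b, a)
u = hmas_union(sys1, sys2)
print(accepts(u, ("a", "b")), accepts(u, ("b", "a")), accepts(u, ("a", "a")))
# True True False
```

The first move nondeterministically chooses one of the two bridging tuples, after which the computation proceeds entirely inside ϑ_1 or ϑ_2.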

Lemma 6.12 For n ≥ 2, the n-language n-L = {(a^i b^j, a^j b^i, ε, ..., ε) | i, j ∈ N_0} is not in L(HMAS^{nFA}).

Proof of Lemma 6.12. From the definition of an n-accepting move-restricted finite automata system, it can be seen that the only possibility to compare symbols across components is to read them step by step at the same time (or in a quasi-parallel way). Hence, to compare the a's in the first component with the b's in the second one, the second component has to skip all a's, and then the system can compare the a's and b's. After this, there is no way to compare the a's in the second component with the b's in the first component, because finite automata cannot return to the start position. A similar problem arises when the system starts by comparing the b's in the first component with the a's in the second one. The other components cannot help, because they have a finite number of states. Hence, n-L does not belong to L(HMAS^{nFA}).

Theorem 6.13 L(HMAS^{nFA}), for all n ≥ 2, is not closed under intersection.

Proof of Theorem 6.13. Let L_1 = {(a^i b^j, a^j b^k) | i, j, k ≥ 0} and L_2 = {(a^i b^j, a^k b^i) | i, j, k ≥ 0}. Both of them belong to L(HMAS^{2FA}), because we can construct HMAS^{2FA}s ϑ_1 = (M_1, M_2, Ψ_1) and ϑ_2 = (M_1, M_2, Ψ_2) such that L(ϑ_1) = L_1 and L(ϑ_2) = L_2. All four automata are given by the definition M = ({q_1, q_2}, {a, b}, {r_1 = q_1a → q_1, r_2 = q_1 → q_1, r_3 = q_1b → q_2, r_4 = q_2b → q_2, r_5 = q_2 → q_2}, q_1, {q_1, q_2}), with Ψ_1 = {(r_1, r_2), (r_3, r_1), (r_4, r_1), (r_5, r_3), (r_5, r_4)} and Ψ_2 = {(p, q) | (q, p) ∈ Ψ_1}. The intersection of L(ϑ_1) and L(ϑ_2) is the 2-language L_3 = {(a^i b^j, a^j b^i) | i, j ∈ N_0}. Lemma 6.12 says that L_3 ∉ L(HMAS^{2FA}), and therefore, L(HMAS^{2FA}) is not closed under intersection.
In general, for n ≥ 2, consider the n-languages K_1 = {(a^i b^j, a^j b^k, ε, ..., ε) | i, j, k ≥ 0} and K_2 = {(a^i b^j, a^k b^i, ε, ..., ε) | i, j, k ≥ 0}. From Lemma 6.12, K_1 ∩ K_2 ∉ L(HMAS^{nFA}).

Consider an n-language Y ⊆ Σ_1^* × ... × Σ_n^* and suppose that the complement of Y is defined as Y̅ = (Σ_1^* × ... × Σ_n^*) − Y. Then, we can establish the following theorem.

Theorem 6.14 L(HMAS^{nFA}), for all n ≥ 2, is not closed under complementation.

Proof of Theorem 6.14. By contradiction. Suppose that L(HMAS^{nFA}), for all n ≥ 2, is closed under complementation. Let L_1, L_2 ∈ L(HMAS^{nFA}). Then, by the supposition, L̅_1, L̅_2 ∈ L(HMAS^{nFA}). From Theorem 6.11, it follows that L̅_1 ∪ L̅_2 ∈ L(HMAS^{nFA}), and by the supposition, the complement of L̅_1 ∪ L̅_2 is in L(HMAS^{nFA}) as well. From De Morgan's law, this complement equals L_1 ∩ L_2, which is a contradiction, because L(HMAS^{nFA}), for all n ≥ 2, is not closed under intersection.

Theorem 6.15 If L_1, L_2 ∈ L(HMAS^{nFA}), then L_1 ⊙ L_2 ∈ L(HMAS^{nFA}), where L_1 ⊙ L_2 = {(w_1w'_1, ..., w_nw'_n) | (w_1, ..., w_n) ∈ L_1 and (w'_1, ..., w'_n) ∈ L_2}.

Proof of Theorem 6.15. Consider two n-languages, L_1 and L_2. If L_1, L_2 ∈ L(HMAS^{nFA}), then there are HMAS^{nFA}s, ϑ_1 = (M_{1,1}, ..., M_{1,n}, Ψ_1) and ϑ_2 = (M_{2,1}, ..., M_{2,n}, Ψ_2), such that L_1 = L(ϑ_1), L_2 = L(ϑ_2), and for all i = 1, 2 and j = 1, ..., n, M_{i,j} = (Q_{i,j}, Σ_{i,j}, δ_{i,j}, s_{i,j}, F_{i,j}) is a component of ϑ_i. For these two automata systems, we can construct ϑ_{12} = (M_{12,1}, ..., M_{12,n}, Ψ_{12}) with M_{12,j} = (Q_{12,j}, Σ_{12,j}, δ_{12,j}, s_{1,j}, F_{2,j}) in the following way: for every i = 1, 2 and j = 1, ..., n, Q_{12,j} = Q_{1,j} ∪ Q_{2,j}, δ_{12,j} = δ_{1,j} ∪ δ_{2,j} ∪ {p_j = f_j → s_{2,j} | f_j ∈ F_{1,j}}, Ψ_{12} = Ψ_1 ∪ Ψ_2 ∪ {(p_1, ..., p_n)}, and Σ_{12,j} = Σ_{1,j} ∪ Σ_{2,j}. From Definition 6.2, (w_1, ..., w_n) ∈ L_1 iff (s_{1,1}w_1, ..., s_{1,n}w_n) ⊢^* (f_1, ..., f_n) in ϑ_1, where for all i = 1, ..., n, f_i ∈ F_{1,i}. Clearly, (s_{1,1}w_1, ..., s_{1,n}w_n) ⊢^* (f_1, ..., f_n) in ϑ_{12} as well, and (s_{1,1}w_1w'_1, ..., s_{1,n}w_nw'_n) ⊢^* (f_1w'_1, ..., f_nw'_n). As (w'_1, ..., w'_n) ∈ L_2 and because (f_1 → s_{2,1}, ..., f_n → s_{2,n}) ∈ Ψ_{12}, (f_1w'_1, ..., f_nw'_n) ⊢ (s_{2,1}w'_1, ..., s_{2,n}w'_n). Naturally, (s_{2,1}w'_1, ..., s_{2,n}w'_n) ⊢^* (f'_1, ..., f'_n) with f'_i ∈ F_{2,i} in ϑ_2, and hence in ϑ_{12} as well. Therefore, (s_{1,1}w_1w'_1, ..., s_{1,n}w_nw'_n) ⊢^* (f'_1, ..., f'_n) in ϑ_{12}. The theorem holds.
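The construction in this proof amounts to adding one bridging ε-tuple per choice of final states of ϑ_1. A minimal sketch, in a dictionary encoding of our own (start states, final-state sets, and Ψ as tuples of transitions):

```python
from itertools import product

def hmas_concat(sys1, sys2):
    """Keep sys1's start states and sys2's final states, and add one
    synchronized epsilon-tuple (f_1 -> s_{2,1}, ..., f_n -> s_{2,n})
    for every choice of final states f_j from F_{1,j}."""
    n = len(sys1["starts"])
    bridges = [tuple((fs[j], "", sys2["starts"][j]) for j in range(n))
               for fs in product(*sys1["finals"])]
    return {"starts": sys1["starts"],
            "finals": sys2["finals"],
            "psi": sys1["psi"] + sys2["psi"] + bridges}

sys1 = {"starts": ["p1", "p2"], "finals": [{"f1"}, {"f2"}], "psi": []}
sys2 = {"starts": ["u1", "u2"], "finals": [{"g1"}, {"g2"}], "psi": []}
print(hmas_concat(sys1, sys2)["psi"])   # [(('f1', '', 'u1'), ('f2', '', 'u2'))]
```

All components must cross the bridge in the same computation step, which is exactly what forces the split point of each w_jw'_j to be synchronized.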

HMASs with at least one PDA are stronger than HMASs containing only FAs.

Theorem 6.16 L(HMAS^{nFA}) ⊂ L(HMAS^{(t_1,...,t_n)}), where t_i ∈ {FA, PDA} for all i = 1, ..., n, and at least one component is a PDA.

Proof of Theorem 6.16. L(HMAS^{nFA}) ⊆ L(HMAS^{(t_1,...,t_n)}), with at least one i = 1, ..., n such that t_i = PDA, directly follows from the definitions. It remains to prove that L(HMAS^{nFA}) ≠ L(HMAS^{(t_1,...,t_n)}).
Consider the HMAS^{(PDA,FA)} ϑ = (M_1, M_2, Ψ) with Ψ = {(r_1, r_1), (r_2, r_1), (r_3, r_1), (r_4, r_2), (r_5, r_2), (r_6, r_2), (r_7, r_2), (r_8, r_3), (r_9, r_1)} and the automata defined in the following way: M_1 = ({s, q}, {a, b}, {#, a}, {r_1 = #s → s, r_2 = #sa → #as, r_3 = asa → aas, r_4 = #sb → #q, r_5 = asb → #q, r_6 = #qb → #q, r_7 = aqb → #q, r_8 = aq → q, r_9 = #q → q}, s, #, ∅) and M_2 = ({s}, {a, b}, {r_1 = s → s, r_2 = sa → s, r_3 = sb → s}, s, {s}). It is not hard to see that ϑ defines the 2-language 2-L = {(a^i b^j, a^j b^i) | i, j ∈ N_0} and works in this way: first, automaton M_2 loops in state s and reads no symbol, while M_1 shifts all a's onto its pushdown. After pushing all a's from M_1's input onto the pushdown, M_1 and M_2 read b's and a's, respectively, and by reading them at the same time, the automata compare their numbers. If there are more a's in M_2's input than b's in M_1's input, then the automata system is stopped and the input is not accepted. Otherwise, M_1 moves to the other state, and ϑ continues by comparing the a's on M_1's pushdown with the b's in M_2's input, removing a's from the pushdown in M_1 while reading b's from M_2's input. Only if the input was of the form (a^i b^j, a^j b^i) with i, j ∈ N_0 does the automata system remove the symbol # from M_1's pushdown, and ϑ accepts. Because Lemma 6.12 says that 2-L = {(a^i b^j, a^j b^i) | i, j ∈ N_0} is not in L(HMAS^{2FA}), L(HMAS^{2FA}) ≠ L(HMAS^{(PDA,FA)}).
In general, for all n ≥ 2, there is the n-language n-L = {(a^i b^j, a^j b^i, ε, ..., ε) | i, j ∈ N_0}. These n-languages can be given by an HMAS^{(PDA,FA,...,FA)}, where the first two components are defined in the same way as M_1 and M_2 above, and the other components loop without reading any symbols. Hence, n-L = {(a^i b^j, a^j b^i, ε, ..., ε) | i, j ∈ N_0} ∈ L(HMAS^{(PDA,FA,...,FA)}). Therefore, L(HMAS^{nFA}) ≠ L(HMAS^{(t_1,...,t_n)}) with at least one PDA.
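To see the division of labour between the PDA and the FA, the following sketch recognizes {(a^i b^j, a^j b^i) | i, j ∈ N_0} by a search over synchronized configurations. The phase names and transitions are our own reformulation of the idea, not a transcription of the rules r_1–r_9.

```python
from collections import deque

def accepts_pair(w1, w2):
    """True iff (w1, w2) belongs to {(a^i b^j, a^j b^i) | i, j >= 0}."""
    start = ("push", "#", w1, w2)     # (phase, pushdown, rest of w1, rest of w2)
    seen, todo = {start}, deque([start])
    while todo:
        phase, st, r1, r2 = todo.popleft()
        if phase == "done":
            return True
        succ = []
        if phase == "push" and r1[:1] == "a":           # shift a onto the pushdown
            succ.append(("push", st + "a", r1[1:], r2))
        if phase in ("push", "cmp1") and r1[:1] == "b" and r2[:1] == "a":
            succ.append(("cmp1", st, r1[1:], r2[1:]))   # b's of w1 vs a's of w2
        if phase in ("push", "cmp1", "cmp2") and st[-1:] == "a" and r2[:1] == "b":
            succ.append(("cmp2", st[:-1], r1, r2[1:]))  # pushdown a's vs b's of w2
        if st == "#" and r1 == "" and r2 == "":
            succ.append(("done", "", "", ""))           # remove the bottom marker
        for s in succ:
            if s not in seen:
                seen.add(s)
                todo.append(s)
    return False

print(accepts_pair("aab", "abb"), accepts_pair("ab", "ba"))   # True False
```

The pushdown makes exactly one turn: it only grows in the push phase and only shrinks afterwards, mirroring the behaviour of M_1.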

Lemma 6.17 Let Γ̂ ∈ HCGR^{(t_1,...,t_n)}, where t_i ∈ {RLNG, CFG} for all i = 1, ..., n, and no more than one component of Γ̂ is a CFG. Then, L_j = {w_j | (w_1, ..., w_n) ∈ n-L(Γ̂)} ∈ CF for every j = 1, ..., n.

Proof of Lemma 6.17. Consider Γ̂ ∈ HCGR^{(t_1,...,t_n)} with exactly one CFG. Because we only want to prove that L_j(Γ̂) = {w_j | (w_1, w_2, ..., w_n) ∈ n-L(Γ̂)} is context-free, by replacing all terminal symbols by ε in each rule in P_i for all i = 1, ..., n with i ≠ j, we construct Γ̂' such that n-L(Γ̂') = {(w'_1, ..., w'_n) | (w_1, ..., w_n) ∈ n-L(Γ̂) and for all i = 1, ..., n, if i = j, then w'_i = w_i; otherwise, w'_i = ε}. By Lemma 6.5, there is an HMAS^{(t'_1,...,t'_n)} ϑ with one PDA such that n-L(ϑ) = n-L(Γ̂'). As ϑ = (M_1, ..., M_n, Ψ) can be created from Γ̂' by the algorithm used in the proof of Lemma 6.5, we can construct a PDA M = (Q, Σ, Γ, δ, s, S, ∅), where L(M) = L_j(Γ̂'), in the following way: suppose that the mth component of ϑ is a PDA, and set Σ = Σ_j, Γ = Γ_m ∪ Σ_j, Q = {(q_1, ..., q_n) | q_i ∈ Q_i for every i = 1, ..., n}, s = (s_1, ..., s_n), S = S_m, and δ = {A(q_1, ..., q_n)a → γ(q'_1, ..., q'_n) | r_m = Aq_ma_m → γq'_m ∈ δ_m, r_i = q_ia_i → q'_i ∈ δ_i for all i ≠ m, a = a_j, and (r_1, ..., r_n) ∈ Ψ}. It remains to prove Claim 6.18 and Claim 6.19.

Claim 6.18 Without any loss of generality, suppose that the mth component of ϑ is a PDA. If (α_1s_1w_1, ..., α_ns_nw_n) ⊢^* (β_1q_1w'_1, ..., β_nq_nw'_n) in ϑ, where for all i = 1, ..., n, i = m implies α_i = S_m and β_i ∈ Γ_i^*, and i ≠ m implies α_i = β_i = ε, then S_m(s_1, ..., s_n)w_j ⊢^* β_m(q_1, ..., q_n)w'_j in M.

Proof of Claim 6.18. By induction on the number of computation steps.
Basis. Let (α_1s_1w_1, ..., α_ns_nw_n) ⊢^0 (α_1s_1w_1, ..., α_ns_nw_n) in ϑ. Then, S_m(s_1, ..., s_n)w_j ⊢^0 S_m(s_1, ..., s_n)w_j in M.
Induction Hypothesis. Suppose that Claim 6.18 holds for j or fewer computation steps, where j ∈ N_0.

Induction Step. Consider a sequence of computation steps of the form (α_1s_1w_1, ..., α_ns_nw_n) ⊢^{j+1} (β_1q_1w'_1, ..., β_nq_nw'_n) in ϑ. This sequence of computation steps can be expressed as (α_1s_1w_1, ..., α_ns_nw_n) ⊢^j (α'_1p_1w''_1, ..., α'_np_nw''_n) ⊢ (β_1q_1w'_1, ..., β_nq_nw'_n). By the induction hypothesis, S_m(s_1, ..., s_n)w_j ⊢^j α'_m(p_1, ..., p_n)w''_j in M. As ϑ moves from (α'_1p_1w''_1, ..., α'_np_nw''_n) to (β_1q_1w'_1, ..., β_nq_nw'_n), there have to be n rules, r_1, ..., r_n, such that α'_ip_iw''_i ⊢ β_iq_iw'_i [r_i] for all i = 1, ..., n, and (r_1, ..., r_n) ∈ Ψ. Hence, and from the construction of δ, α'_m(p_1, ..., p_n)w''_j ⊢ β_m(q_1, ..., q_n)w'_j in M. Claim 6.18 holds.

Claim 6.19 Consider that the mth component of ϑ is a PDA. If S_m(s_1, ..., s_n)w_j ⊢^* β_m(q_1, ..., q_n) in M, then (α_1s_1w_1, ..., α_ns_nw_n) ⊢^* (β_1q_1w'_1, ..., β_nq_nw'_n) in ϑ, where for all i = 1, ..., n, i = m implies α_i = S_m and β_i ∈ Γ_i^*, and i ≠ m implies α_i = β_i = ε.

Proof of Claim 6.19. By induction on the number of computation steps.

Basis. Let S_m(s_1, ..., s_n)w_j ⊢^0 S_m(s_1, ..., s_n)w_j in M. Evidently, (α_1s_1w_1, ..., α_ns_nw_n) ⊢^0 (α_1s_1w_1, ..., α_ns_nw_n) in ϑ.
Induction Hypothesis. Suppose that Claim 6.19 holds for j or fewer computation steps, where j ∈ N_0.
Induction Step. Consider a sequence of moves of the form S_m(s_1, ..., s_n)w_j ⊢^{j+1} γ_m(q_1, ..., q_n)w'_j in M. We can express this sequence as S_m(s_1, ..., s_n)w_j ⊢^j β_m(p_1, ..., p_n)w''_j ⊢ γ_m(q_1, ..., q_n)w'_j. By the induction hypothesis, (α_1s_1w_1, ..., α_ns_nw_n) ⊢^j (β_1p_1w''_1, ..., β_np_nw''_n), where for all i = 1, ..., n, if i = m, then α_i = S_m; otherwise, α_i = β_i = ε. Since β_m(p_1, ..., p_n)w''_j ⊢ γ_m(q_1, ..., q_n)w'_j in M, there are n rules, r_1, ..., r_n, such that for every i = 1, ..., n, β_ip_iw''_i ⊢ γ_iq_iw'_i [r_i] in M_i and (r_1, ..., r_n) ∈ Ψ. Hence, (β_1p_1w''_1, ..., β_np_nw''_n) ⊢ (γ_1q_1w'_1, ..., γ_nq_nw'_n) in ϑ. The claim holds.

Because the generation of the language of each component can be simulated by a PDA, these languages have to be context-free. Lemma 6.17 holds.
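The construction of δ in this proof can be generated mechanically. The sketch below uses our own encoding (rules[i] maps rule names of component i to tuples) and assumes that the mth component is the PDA while only the letters of component j are actually read.

```python
def product_rules(rules, psi, m, j):
    """rules[i] maps a rule name of component i to (A, q, a, gamma, q')
    if i == m (the PDA), and to (q, a, q') otherwise (an FA). Returns the
    combined PDA rules (A, (q_1..q_n), a_j, gamma, (q'_1..q'_n))."""
    out = set()
    for tup in psi:                        # one combined rule per tuple of Psi
        A, _, a_m, gamma, _ = rules[m][tup[m]]
        src, dst = [], []
        for i, name in enumerate(tup):
            r = rules[i][name]
            if i == m:
                src.append(r[1]); dst.append(r[4])
            else:
                src.append(r[0]); dst.append(r[2])
        a_j = a_m if j == m else rules[j][tup[j]][1]   # only component j reads
        out.add((A, tuple(src), a_j, gamma, tuple(dst)))
    return out

rules = [{"rm": ("S", "p", "a", "xy", "p")},           # component 0: the PDA
         {"rf": ("s", "b", "t")}]                      # component 1: an FA
print(product_rules(rules, [("rm", "rf")], m=0, j=1))
# {('S', ('p', 's'), 'b', 'xy', ('p', 't'))}
```

The FA states are folded into the product state, so the single pushdown of component m suffices to simulate the whole system.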

Corollary 6.20 Let ϑ ∈ HMAS^{(t_1,...,t_n)} with t_1, ..., t_n ∈ {FA, PDA} and at most one t_i = PDA. Let L_j(ϑ) = {w_j | (w_1, w_2, ..., w_n) ∈ n-L(ϑ)} for all j = 1, ..., n. Then, L_j(ϑ) ∈ CF.

Theorem 6.21 Let L_j(HMAS^{(t_1,...,t_n)}) = {L_j | L_j = {w_j | (w_1, ..., w_n) ∈ K} and K ∈ L(HMAS^{(t_1,...,t_n)})}, where HMAS^{(t_1,...,t_n)} is an HMAS with exactly one PDA and n − 1 FAs. Then, L_j(HMAS^{(t_1,...,t_n)}) = CF.

Proof of Theorem 6.21. Suppose that the first component of an HMAS based upon one PDA and n − 1 FAs is the PDA. The HMAS can work in the following way: at the beginning, the PDA works on its own input string while the FAs loop without reading any symbol. If the PDA accepts its input, it removes all symbols from its pushdown, inserts the initial pushdown symbol, and continues with the input of the first FA. By means of the restriction set, the FA reads its input while the PDA accordingly works with its pushdown. If the PDA accepts this input string, the HMAS continues with the next string. In this way, the FAs can use the pushdown of the PDA, and therefore, they can accept context-free languages. Hence, and from Corollary 6.20, Theorem 6.21 holds.

Two 1-turn PDAs in an HMAS whose other components are FAs are enough to recognize any language from RE by any of the components of the HMAS.

(t1,...,tn) Theorem 6.22 Let Lj(HMAS ) = {Lj| Lj = {wj| (w1, . . . , wn) ∈ K} and K ∈ L (HMAS(t1,...,tn))}, where HMAS(t1,...,tn) contains two 1-turn PDAs and n − 2 FAs. Then, (t1,...,tn) Lj(HMAS ) = RE. Proof of Theorem 6.22. It is well-known that every recursively enumerable language can be generated by a grammar G in Geffert normal form (see [47]), i.e. by a grammar G = ({A, B, C, D, S}, T, S, P ) where P contains rules only of the form S → uSa, S → uSv, S → uv, AB → ε, and CD → ε, where u ∈ {A, C}∗, v ∈ {B,D}∗, a ∈ T . In addition, every ∗ sentential form of any successful derivation have to be of the form S ⇒ w1w2w where ∗ ∗ ∗ ∗ w1w2 ∈ {A, C} {B,D} , w ∈ T , and w1w2w ⇒ w. Let G1,...,Gn be n grammars in the Geffert normal form and let ϑ = (M1,...,Mn, Ψ) (t1,...,tn) be an HMAS , M1,M2 be two 1-turn PDAs and M3,...,Mn be FAs. M1 and Sn M2 can generate strings over {A, C, |} and {B,D, |} ∪ {| a ∈ i=1 Σi} on the M1’s and

M2's pushdown, respectively, in the following way: the PDAs start with the symbol | on their pushdowns. First, ϑ simulates derivations in G1, …, Gn. The system starts with G1. If Gi applies a rule of the form Si → uSiw, ϑ adds (u)^R and w to M1's and M2's pushdown, respectively. If Gi makes a derivation step by a rule of the form Si → uw, then ϑ adds (u)^R| and w| to M1's and M2's pushdown, respectively, and the system starts the simulation of Gi+1. After the generation, the PDAs start to compare the topmost symbols on their pushdowns. In the same computation step, the automata can remove A together with B, and C together with D. Whenever | appears on the top of M1's pushdown and ϑ works on the ith input string (the system starts with the nth input string), M2 compares the symbols on its pushdown with the input symbols read by the ith automaton. If the symbols coincide, M2 removes the symbol from the pushdown. When | is the topmost pushdown symbol of both PDAs, the PDAs remove this symbol and start with the (i − 1)th input string. The automata in ϑ read their input only if they compare the input with the content of M2's pushdown. If and only if all automata have read all their input and the PDAs have empty pushdowns, all strings are accepted; that is, ϑ accepts the input n-string. Evidently, ϑ can simulate derivations of n grammars in Geffert normal form, which can generate any language from RE. Therefore, Theorem 6.22 holds. □

Corollary 6.23 Let Lj(HMAS^(t1,…,tn)) = {Lj | Lj = {wj | (w1, …, wn) ∈ K} and K ∈ L(HMAS^(t1,…,tn))}, where HMAS^(t1,…,tn) contains three 1-turn SPDAs and n − 3 FAs, for n ≥ 3. Then, Lj(HMAS^(t1,…,tn)) = RE.

Corollary 6.24 Let Lj(HCGR^(t1,…,tn)) = {Lj | Lj = {wj | (w1, …, wn) ∈ K} and K ∈ L(HCGR^(t1,…,tn))}, where HCGR^(t1,…,tn) is an HCGR with exactly two CFGs and n − 2 RLNGs. Then, Lj(HCGR^(t1,…,tn)) = RE.

Corollary 6.25 L(HMAS^(t1,…,tn)) with two 1-turn PDAs and n − 2 FAs is equal to L(HMASn^PDA).

Corollary 6.26 L(HCGRn^CFG) is equal to L(HCGR^(t1,…,tn)) with two CFGs and n − 2 RLNGs.

It is well-known that linear grammars and 1-turn PDAs are equivalent. Nevertheless, HCGRs based upon linear grammars are weaker than HMASs with 1-turn PDAs as components.

Theorem 6.27 L(HCGRn^LNG) ⊂ L(HMAS^(t1,…,tn)), where HMAS^(t1,…,tn) has exactly two 1-turn PDAs and n − 2 FAs.

Proof of Theorem 6.27. Consider an HCGR^nLNG Γ̂ = (Ĝ1, …, Ĝn, Q̂), which generates the n-language n-L(Γ̂) = {(w, ε, …, ε) | w ∈ T̂1* and (S1, …, Sn) ⇒* (w, ε, …, ε)}. Since Ĝ1, …, Ĝn are linear, any successful derivation can be expressed as (Ŝ1, …, Ŝn) ⇒* (uA1v, A2, …, An) ⇒* (axv, ε, …, ε), where axv ∈ T̂1*. The maximal length of every sentential n-form is m + n, where m = |axv|. Hence, we can construct a linear bounded automaton M with states of the form (A1, …, An), where Ai ∈ N̂i ∪ {ε}, and input alphabet Γ = T1 ∪ {a̲ | a ∈ T1}, where a̲ denotes an underlined version of a. M starts in the state (Ŝ1, …, Ŝn) with w on its input tape. During the computation, it underlines symbols on the input tape by the following algorithm.

Without any loss of generality, suppose that M is in the state (A1, …, An). If there are n rules, r1, …, rn, where r1 is a rule of the form A1 → uB1v, ri is of the form Ai → Bi for i = 2, …, n, and (r1, …, rn) ∈ Q̂, such that u is equal to the first |u| non-underlined symbols on the tape and v is equal to the last |v| non-underlined symbols on the tape, M underlines these symbols and moves to the state (B1, …, Bn). As soon as all symbols on the tape are underlined and the automaton is in the state (ε, …, ε), the input is accepted. Hence, Lj = {w | (w, ε, …, ε) ∈ n-L(Γ̂)} belongs to CS. Therefore, Theorem 6.27 holds. □

A better approximation of the power of HCGRs with linear grammars as components is given by the following lemma.

Lemma 6.28 Let Γ̂ ∈ HCGR^nLNG. Then, Lj = {wj | (w1, …, wn) ∈ n-L(Γ̂)} ∈ LIN for every j = 1, …, n.

Proof of Lemma 6.28. Because all components are LNGs, without any loss of generality, we only have to prove that for any HCGR^nLNG Γ̂ = (Ĝ1, …, Ĝn, Q̂), there is an LNG G = (N, T, P, S) that generates the same language as the first component of the system does. Since n, the number of nonterminals, and the number of derivation rules in each component of the system are finite, the linear grammar can remember the combination of rules used in the previous computation step in one nonterminal and allow only those derivation steps which correspond to derivation steps in Γ̂. In more detail, the linear grammar contains the set of nonterminals N = {S} ∪ {⟨r1, …, rn⟩ | (r1, …, rn) ∈ Q̂}, the set T equal to the set of terminal symbols of Ĝ1, and the set

P = {S → u1⟨r1, …, rn⟩v1 | (r1, …, rn) ∈ Q̂, ri = Si → uiXivi, and Si is the start symbol of Ĝi, ui, vi are strings over the alphabet of terminal symbols of Ĝi, Xi is a nonterminal symbol of Ĝi or ε, for all i = 1, …, n}

∪ {⟨r1, …, rn⟩ → u′1⟨r′1, …, r′n⟩v′1 | (r1, …, rn), (r′1, …, r′n) ∈ Q̂, ri = Ai → uiBivi, r′i = Bi → u′iCiv′i, and ui, u′i, vi, v′i are strings over the alphabet of terminal symbols of Ĝi, Ai, Bi, Ci are nonterminal symbols of Ĝi, for all i = 1, …, n}

∪ {⟨r1, …, rn⟩ → u′1v′1 | (r1, …, rn), (r′1, …, r′n) ∈ Q̂, ri = Ai → uiBivi, r′i = Bi → u′iv′i, and ui, u′i, vi, v′i are strings over the alphabet of terminal symbols of Ĝi, Ai, Bi are nonterminal symbols of Ĝi, for all i = 1, …, n}.

For the sake of brevity, the rigorous proof of the equality L(G) = L1 is left to the reader. □

Of course, HCGRs can allow the use of all combinations of rules included in their components. In this way, the linear grammars contained in such HCGRs generate at least the class of linear languages. Thus, the following theorem can be established.

Theorem 6.29 Let Γ̂ ∈ HCGR^nLNG. Then, Lj = {wj | (w1, …, wn) ∈ n-L(Γ̂)} = LIN, for every j = 1, …, n.

Chapter 7

Rule-Restricted Automaton-Grammar Transducers

In formal language theory, there exist two basic translation-method categories. The first category contains interpreters and compilers, which first analyse an input string in the source language and, after that, generate a corresponding output string in the target language (see [2], [89], [92], [61], or [109]). The second category is composed of language-translation systems or, more briefly, transducers. Frequently, these transducers consist of several components, including various automata and grammars, some of which read their input strings while others produce their output strings (see [5], [53], [91], and [111]). Although transducers represent language-translation devices, language theory often views them as language-defining devices and investigates the language families resulting from them. In essence, it studies their accepting power, which consists in determining the language families accepted by the transducer components that read their input strings. Alternatively, it establishes their generative power, which determines the language families generated by the components that produce their strings. The present chapter contributes to this vivid investigation trend in formal language theory. In this chapter, we introduce a new type of transducer, referred to as a rule-restricted transducer, based upon a finite automaton and a context-free grammar. In addition, a restriction set controls the rules which can be simultaneously used by the automaton and by the grammar. The present chapter discusses the power of this system working in an ordinary way as well as in a leftmost way and investigates the effect of an appearance checking placed into the system. First, we show that the generative power is equal to the generative power of matrix grammars (see [1] or [37]). Second, the accepting power coincides with the power of partially blind multi-counter automata (see [49] and [52]).
Third, under the context-free-grammar leftmost restriction, the accepting and generative power of these systems coincides with the power of context-free grammars. On the other hand, when an appearance checking is introduced into these systems, the accepting and generative power coincides with the power of Turing machines. In the last part of the chapter, we discuss application-related perspectives of the studied systems in linguistics. Particularly, we concentrate our attention on natural language translation. First, we demonstrate the basic idea in terms of a simple English sentence, performing its analysis and passive transformation. Furthermore, we describe the translation of selected sentence structures between the Czech, English, and Japanese languages. We

demonstrate that while English and Czech are structurally similar languages, some aspects of Japanese differ significantly. We also show that some linguistic features of Czech complicate this translation very much. For example, compared to English, there is very rich inflection in Czech. Other difficult-to-handle features include non-projectivity (crossing dependencies between the words in a sentence), which mainly results from the fact that Czech is a free-word-order language.

7.1 Rule-restricted transducer

In this section, we define and investigate a rule-restricted transducer consisting of a finite automaton, a context-free grammar, and a control set of pairs of rules. The finite automaton reads its own input string and, simultaneously, the context-free grammar generates an output string. During the computation, the control set determines which rules can be used in the same computation step performed by both components. The computation of the system is successful if and only if the finite automaton accepts the input string and the context-free grammar successfully generates a string of terminal symbols.

Definition 7.1 (Rule-restricted transducer) The rule-restricted transducer, RT for short, is a triplet Γ = (M, G, Ψ), where M = (Q, Σ, δ, q0, F) is a finite automaton, G = (N, T, P, S) is a context-free grammar, and Ψ is a finite set of pairs of the form (r1, r2), where r1 and r2 are rules from δ and P, respectively. A 2-configuration of an RT is a pair χ = (x, y), where x ∈ QΣ* and y ∈ (N ∪ T)*. Consider two 2-configurations, χ = (pav1, uAv2) and χ′ = (qv1, uxv2), with A ∈ N, u, v2, x ∈ (N ∪ T)*, v1 ∈ Σ*, a ∈ Σ ∪ {ε}, and p, q ∈ Q. If pav1 ⇒ qv1 [r1] in M, uAv2 ⇒ uxv2 [r2] in G, and (r1, r2) ∈ Ψ, then Γ makes a computation step from χ to χ′, written as χ ⇒ χ′ [(r1, r2)], or simply χ ⇒ χ′. In the standard way, ⇒* and ⇒+ are the reflexive-transitive and transitive closures of ⇒, respectively. The 2-language of Γ, 2-L(Γ), is 2-L(Γ) = {(w1, w2) | (q0w1, S) ⇒* (f, w2), w1 ∈ Σ*, w2 ∈ T*, and f ∈ F}. From the 2-language we can define two languages:

• L(Γ)1 = {w1| (w1, w2) ∈ 2-L(Γ)}, and

• L(Γ)2 = {w2| (w1, w2) ∈ 2-L(Γ)}.

By L (RT), L (RT)1, and L (RT)2, the classes of 2-languages of RTs, languages accepted by M in RTs, and languages generated by G in RTs, respectively, are understood.

It is well-known that finite automata and context-free grammars describe different classes of languages. Specifically, finite automata accept regular languages, whereas context-free grammars define the class of context-free languages. However, Example 7.2 shows that by combining these two models, the system is able to accept and generate non-context-free languages.

Example 7.2 Consider RT K = (M, G, Ψ) with

M = ({1, 2, 3′, 3, 4, 5′, 5, 6}, {a, b}, δ, 1, {6}), where


Figure 7.1: Definition of finite automaton M from Example 7.2

– δ = {p1 = 1a → 2, p2 = 2 → 1, p3 = 1b → 3′, p4 = 3′ → 3, p5 = 3b → 4, p6 = 4 → 3, p7 = 3a → 5′, p8 = 5′ → 5, p9 = 5a → 5, p10 = 5b → 6, p11 = 6b → 6} (see the graphical representation of M in Figure 7.1),

G = ({S, A, B, C, D, D′}, {a, b}, P, S), where

– P = {r1 = S → BbD′, r2 = B → Bb, r3 = D′ → D′D, r4 = B → aA, r5 = D′ → C, r6 = A → aA, r7 = C → CC, r8 = D → b, r9 = A → ε, r10 = C → a},

Ψ = {(p1, r1), (p1, r2), (p2, r3), (p3, r4), (p4, r5), (p5, r6), (p6, r7), (p7, r8), (p8, r9), (p9, r8), (p10, r10), (p11, r10)}.

The languages of M and G are L(M) = {a^i b^j a^k b^l | j, k, l ∈ N, i ∈ N0} and L(G) = {a^i b^j a^k b^l | i, j, k ∈ N, l ∈ N0}, respectively. However, the 2-language of K is 2-L(K) = {(a^i b^j a^i b^j, a^j b^i a^j b^i) | i, j ∈ N}.
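The 2-language of K can also be checked mechanically. The following sketch (an illustrative Python simulation of Example 7.2, not part of the thesis; all function names are ours) performs a breadth-first search over the 2-configurations of K, pairing FA moves with grammar rewrites exactly as the control set Ψ prescribes:

```python
from collections import deque

# FA rules of M (Example 7.2): name -> (source state, read symbol, target state)
FA = {"p1": (1, "a", 2),    "p2": (2, "", 1),    "p3": (1, "b", "3'"),
      "p4": ("3'", "", 3),  "p5": (3, "b", 4),   "p6": (4, "", 3),
      "p7": (3, "a", "5'"), "p8": ("5'", "", 5), "p9": (5, "a", 5),
      "p10": (5, "b", 6),   "p11": (6, "b", 6)}

# CFG rules of G: name -> (left-hand side, right-hand side as a tuple)
G = {"r1": ("S", ("B", "b", "D'")), "r2": ("B", ("B", "b")),
     "r3": ("D'", ("D'", "D")),     "r4": ("B", ("a", "A")),
     "r5": ("D'", ("C",)),          "r6": ("A", ("a", "A")),
     "r7": ("C", ("C", "C")),       "r8": ("D", ("b",)),
     "r9": ("A", ()),               "r10": ("C", ("a",))}

PSI = [("p1", "r1"), ("p1", "r2"), ("p2", "r3"), ("p3", "r4"),
       ("p4", "r5"), ("p5", "r6"), ("p6", "r7"), ("p7", "r8"),
       ("p8", "r9"), ("p9", "r8"), ("p10", "r10"), ("p11", "r10")]

def rt_accepts(inp, max_len=16):
    """Return all outputs w2 with (inp, w2) in 2-L(K), found by BFS over
    2-configurations (state, input position, sentential form); the
    sentential-form length is bounded by max_len to keep the search finite."""
    start = (1, 0, ("S",))
    seen, queue, outputs = {start}, deque([start]), set()
    while queue:
        state, pos, form = queue.popleft()
        # accept: final state 6, whole input read, only terminals left
        if state == 6 and pos == len(inp) and all(s in "ab" for s in form):
            outputs.add("".join(form))
        for fname, gname in PSI:
            src, read, dst = FA[fname]
            if src != state or (read and inp[pos:pos + 1] != read):
                continue
            lhs, rhs = G[gname]
            for i, sym in enumerate(form):      # rewrite any occurrence of lhs
                if sym != lhs:
                    continue
                nxt = (dst, pos + len(read), form[:i] + rhs + form[i + 1:])
                if len(nxt[2]) <= max_len and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return outputs
```

For example, for i = 2 and j = 1 the input a^2 b a^2 b is translated to a b^2 a b^2, matching the stated 2-language.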

From the example, observe that the power of the grammar increases due to the possibility of synchronization with the automaton, which can dictate the sequences of usable rules in the grammar. The synchronization with the automaton enhances the generative power of the grammar up to the class of languages generated by matrix grammars.

Theorem 7.3 L (RT)2 = L (MAT).

Proof of Theorem 7.3. First, we prove that L (MAT) ⊆ L (RT)2.

Consider a MAT I = (IG, IC) and construct an RT Γ = (ΓM, ΓG, Ψ) such that L(I) = L(Γ)2 as follows: Set ΓG = IG; construct the finite automaton ΓM = (Q, Σ, δ, s, F) in the following way: Set F, Q = {s}; for every m = p1 … pk ∈ IC, add k − 1 new states, q1, q2, …, qk−1, into Q, k new rules, r1 = s → q1, r2 = q1 → q2, …, rk−1 = qk−2 → qk−1, rk = qk−1 → s, into δ, and k new pairs, (r1, p1), (r2, p2), …, (rk−1, pk−1), (rk, pk), into Ψ. The finite automaton simulates matrices of I by its moves. That is, if x1 ⇒ x2 [p] in I, where p = p1 … pi for some i ∈ N, then there are q1, …, qi−1 ∈ Q such that r1 = s → q1, r2 = q1 → q2, …, ri−1 = qi−2 → qi−1, ri = qi−1 → s ∈ δ and (r1, p1), …, (ri, pi) ∈ Ψ. Therefore, (s, x1) ⇒^i (s, x2) in Γ. Similarly, if (s, x1) ⇒^i (s, x2) in Γ, for i ∈ N, and there is no j ∈ N such that 0 < j < i and (s, x1) ⇒^j (s, y) ⇒* (s, x2), there has to be p ∈ IC with x1 ⇒ x2 [p] in I. Hence, if (s, S) ⇒* (s, w) in Γ, where w is a string over the set of terminal symbols of ΓG, then S ⇒* w in I; and, on the other hand, if S ⇒* w in I for a string w over the set of

terminals of IG, then (s, S) ⇒* (s, w) in Γ. The inclusion L(MAT) ⊆ L(RT)2 has been proven. For any RT Γ = (ΓM = (Q, Σ, δ, s, F), ΓG = (ΓN, ΓT, ΓP, ΓS), Ψ), we can construct a MAT O = (OG, OC) such that L(Γ)2 = L(O) as follows: Set OG = (ΓN ∪ {S′}, ΓT, OP, S′), OP = ΓP ∪ {p0 = S′ → ⟨s⟩ΓS}, and OC = {p0}. For each pair (p1, p2) ∈ Ψ with p1 = qa → r, q, r ∈ Q, a ∈ Σ ∪ {ε}, p2 = A → x, A ∈ ΓN, and x ∈ (ΓN ∪ ΓT)*, add p′1 = ⟨q⟩ → ⟨r⟩ into OP and the matrix p′1p2 into OC. Furthermore, for all q ∈ F, add p = ⟨q⟩ → ε into OP and p into OC. By the following claims we prove that L(Γ)2 = L(O).

Claim 7.4 If (sw, ΓS) ⇒* (qw′, ω) in Γ, then S′ ⇒* ⟨q⟩ω in O.
Proof of Claim 7.4. By induction on the number of computation steps.
Basis. Let (sw, ΓS) ⇒^0 (sw, ΓS) in Γ. Then, S′ ⇒ ⟨s⟩ΓS [p0] in OG and p0 ∈ OC. Hence, S′ ⇒ ⟨s⟩ΓS [p0] in O. Claim 7.4 holds for no steps in Γ.
Induction hypothesis. Suppose that Claim 7.4 holds for j or fewer computation steps.
Induction step. Let (sw, ΓS) ⇒^j (qw′, ω) ⇒ (q′w′′, ω′) in Γ. Then, by the induction hypothesis, S′ ⇒* ⟨q⟩ω in O. Without any loss of generality, suppose that ω = uAv for u, v ∈ (ΓN ∪ ΓT)*, A ∈ ΓN, and (qw′, uAv) ⇒ (q′w′′, uxv) with x ∈ (ΓT ∪ ΓN)* and ω′ = uxv. From the construction of O we know that p1 = A → x is in ΓP, p2 = ⟨q⟩ → ⟨q′⟩ is in OP, and p2p1 ∈ OC. Therefore, S′ ⇒* ⟨q⟩ω ⇒ ⟨q′⟩uxv = ⟨q′⟩ω′ in O. Claim 7.4 holds.
Furthermore, for all f ∈ F there is a rule p = ⟨f⟩ → ε ∈ OP and p ∈ OC. Hence, if (sw, ΓS) ⇒* (f, ω) in Γ, where f ∈ F and ω ∈ ΓT*, then S′ ⇒* ⟨f⟩ω ⇒ ω in O. That is, L(Γ)2 ⊆ L(O).

It remains to prove that L(O) ⊆ L(Γ)2.

Claim 7.5 If S′ ⇒* ⟨q⟩ω in O with ω ∈ ΓT*, then (sw, ΓS) ⇒* (f, ω) in Γ for some w ∈ Σ* and f ∈ F.

Proof of Claim 7.5. Consider any successful derivation of the form

S′ ⇒ ⟨q0⟩ω0 [p0] ⇒ ⟨q1⟩ω1 [p1] ⇒ ⟨q2⟩ω2 [p2] ⇒ … ⇒ ⟨qk⟩ωk [pk] in O, where q0 = s, qk = q, ω0 = ΓS, and ωk = ω. As it follows from the construction of O, for every i = 1, …, k, pi = p′ip″i, where p′i = ⟨qi−1⟩ → ⟨qi⟩, ωi−1 ⇒ ωi [p″i] in ΓG, and, for some a ∈ Σ ∪ {ε}, (qi−1a → qi, p″i) ∈ Ψ. That is, (qi−1wi−1, ωi−1) ⇒ (qiwi, ωi) for all i = 1, …, k, and hence, (sw0, ΓS) ⇒* (qkwk, ωk) with w = w0. Having ωk ∈ ΓT* and using p = p1p2, where p1 = ⟨q⟩ → ε ∈ OP, ωk−1 ⇒ ωk [p2], and p ∈ OC, implies q ∈ F, wk = ε, and w ∈ L(ΓM), and therefore, ωk ∈ L(Γ)2. L(O) ⊆ L(Γ)2. □

Theorem 7.3 holds by Claims 7.4 and 7.5. □
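The forward construction in the proof of Theorem 7.3, building the cycling finite automaton ΓM and the control set Ψ from the matrices of IC, can be sketched as follows (an illustrative Python helper; the function name, state names, and rule names are ours, not from the thesis):

```python
def mat_to_rt_control(matrices):
    """For each matrix p1...pk, add k-1 fresh intermediate states and a
    cycle of FA rules s -> q1 -> ... -> q_{k-1} -> s, pairing the i-th FA
    rule with the i-th grammar rule of the matrix in Psi.

    matrices: list of matrices, each a list of grammar-rule names.
    Returns (fa_rules, psi): fa_rules maps an FA-rule name to a
    (source state, target state) pair (all moves read epsilon); psi is
    the control set as a list of (FA rule, grammar rule) pairs."""
    fa_rules, psi = {}, []
    fresh = 0
    for m in matrices:
        k = len(m)
        # the cycle starts and ends in the single "real" state s
        states = ["s"] + [f"q{fresh + i}" for i in range(k - 1)] + ["s"]
        fresh += k - 1
        for i, p in enumerate(m):
            name = f"r{len(fa_rules) + 1}"
            fa_rules[name] = (states[i], states[i + 1])
            psi.append((name, p))
    return fa_rules, psi
```

Because the intermediate states are fresh for every matrix, the FA can only traverse a matrix as a whole, which is exactly how ΓM forces ΓG to apply complete matrices of I.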

On the other hand, the context-free grammar in an RT can be exploited as additional storage space for the finite automaton to remember some non-negative integers. If the automaton uses the context-free grammar in this way, the additional storage space is akin to the counters of a multi-counter machine. The following lemma says that the FAs in RTs are able to accept every language accepted by partially blind k-counter automata.

Lemma 7.6 For every k-PBCA I, there is an RT Γ = (M, G, Ψ) such that L(I) = L(Γ)1.

Proof of Lemma 7.6. Let I = (IQ, Σ, Iδ, q0, F) be a k-PBCA for some k ≥ 1 and construct an RT Γ = (M = (MQ, Σ, Mδ, q0, F), G = (N, T, P, S), Ψ) as follows: Set T = {a}, Ψ = ∅, N = {S, A1, …, Ak}, P = {A → ε | A ∈ N}, Mδ = {f → f | f ∈ F}, and MQ = IQ. For each pa → q(t1, …, tk) in Iδ and for n = Σ_{i=1}^{k} max(0, −ti), add:

• q1, . . . , qn into M Q;

• r = S → xS, where x ∈ (N − {S})* and occur(Ai, x) = max(0, ti), for i = 1, …, k, into P;

• r1 = q0a → q1, r2 = q1 → q2, …, rn = qn−1 → qn, rn+1 = qn → q into Mδ with q0 = p; and (ri+1, αi → ε), where αi = Aj and each Aj is erased max(0, −tj) times during the sequence, into Ψ (n = 0 means that only pa → q, S → xS, and (r1, r) are considered);

• (f → f, S → ε) into Ψ for all f ∈ F .

The finite automaton of the created system uses the context-free grammar as an external storage. Each counter of I is represented by a nonterminal. Every step from p to q that modifies counters is simulated by several steps leading from p to q, and during this sequence of steps the number of occurrences of each nonterminal in the grammar is modified to be equal to the corresponding counter in I. Clearly, L(I) = L(Γ)1. □
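The key idea of Lemma 7.6, that the occurrence counts of nonterminals behave exactly like partially blind counters, can be illustrated on a single counter. The following sketch is ours (a hypothetical one-counter instance for the language {a^n b^n | n ≥ 1}; it is not a construction taken from the thesis):

```python
def blind_counter_accepts(w):
    """A partially blind 1-counter automaton for {a^n b^n | n >= 1}.
    The counter mirrors occur(A1, sentential form) in the simulating RT:
    reading 'a' mimics S -> A1 S (one more A1), reading 'b' mimics
    A1 -> eps (one A1 erased). Erasing an absent A1 is simply not
    applicable, so the run dies without ever testing the counter value;
    acceptance additionally requires that no A1 remains at the end."""
    counter, phase = 0, "a"
    for c in w:
        if c == "a":
            if phase != "a":
                return False      # FA structure: all a's must precede the b's
            counter += 1          # S -> A1 S: one more A1 in the form
        elif c == "b":
            phase = "b"
            if counter == 0:
                return False      # A1 -> eps not applicable: the run dies
            counter -= 1          # A1 -> eps: one A1 erased
        else:
            return False
    # success iff the sentential form contains no leftover A1
    return phase == "b" and counter == 0
```

Note that the zero test happens only implicitly, through the inapplicability of the erasing rule and the final emptiness requirement; at no point during the run is the counter value inspected, which is precisely the "partially blind" discipline.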

Lemma 7.7 states that the context-free grammar can help the finite automaton in an RT at most by preserving non-negative integers, without any possibility of checking their values.

Lemma 7.7 For every RT Γ = (M, G, Ψ), there is a k-PBCA O such that L(O) = L(Γ)1 and k is the number of nonterminals in G.

Proof of Lemma 7.7. Let Γ = (M = (Q, Σ, Mδ, q0, F), G = (N, T, P, S), Ψ) be an RT. Without any loss of generality, suppose that N = {A1, …, An}, where S = A1. The partially blind |N|-counter automaton O = (Q ∪ {q′0}, Σ, Oδ, q′0, F) is created in the following way. For each r1 = pa → q ∈ Mδ and r2 = α → β ∈ P such that (r1, r2) ∈ Ψ, add pa → q(v1, …, v|N|), where vi = occur(Ai, β) − occur(Ai, α), for all i = 1, …, |N|, into Oδ. Furthermore, add q′0 → q0(1, 0, …, 0) into Oδ. The constructed partially blind |N|-counter automaton has a counter for each nonterminal of the grammar of Γ. Whenever the automaton in Γ makes a step and the sentential form of the grammar G is changed, O makes the same step and accordingly changes the numbers of occurrences of the nonterminals in its counters. □

From Lemma 7.6 and Lemma 7.7, we can establish the following theorem.

Theorem 7.8 L(RT)1 = ⋃_{k=1}^{∞} L(k-PBCA).

Proof of Theorem 7.8. It directly follows from Lemma 7.7 and Lemma 7.6. 

For a better illustration of the accepting and generative power of RT, let us recall that the class of languages generated by MATs is properly included in RE (see [1] or [37]), and the class of languages defined by partially blind k-counter automata, with respect to the number of counters, is a superset of CF and is properly included in CS (see [49] and [52]). Although the investigated system is relatively powerful in spite of the weakness of the models it is composed of, the non-deterministic selection of the nonterminal to be rewritten can be relatively problematic from the practical point of view. Therefore, we examine the effect of a restriction in the form of leftmost derivations placed on the grammar in RT.

Definition 7.9 (Leftmost restriction on derivation in RT) Let Γ = (M, G, Ψ) be an RT with M = (Q, Σ, δ, q0, F) and G = (N, T, P, S). Furthermore, let χ = (pav1, uAv2) and χ′ = (qv1, uxv2) be two 2-configurations, where A ∈ N, v2, x ∈ (N ∪ T)*, u ∈ T*, v1 ∈ Σ*, a ∈ Σ ∪ {ε}, and p, q ∈ Q. If and only if pav1 ⇒ qv1 [r1] in M, uAv2 ⇒ uxv2 [r2] in G, and (r1, r2) ∈ Ψ, Γ makes a computation step from χ to χ′, written as χ ⇒lm χ′ [(r1, r2)], or simply χ ⇒lm χ′. In the standard way, ⇒lm* and ⇒lm+ are the reflexive-transitive and transitive closures of ⇒lm, respectively. The 2-language of Γ with G generating in the leftmost way, denoted by 2-Llm(Γ), is defined as 2-Llm(Γ) = {(w1, w2) | (q0w1, S) ⇒lm* (f, w2), w1 ∈ Σ*, w2 ∈ T*, and f ∈ F}; we call such Γ a leftmost restricted RT; and we define the languages obtained from 2-Llm(Γ) as Llm(Γ)1 = {w1 | (w1, w2) ∈ 2-Llm(Γ)} and Llm(Γ)2 = {w2 | (w1, w2) ∈ 2-Llm(Γ)}. By L(RTlm), L(RTlm)1, and L(RTlm)2, we understand the classes of 2-languages of leftmost restricted RTs, languages accepted by M in leftmost restricted RTs, and languages generated by G in leftmost restricted RTs, respectively.

Theorem 7.10 L (RTlm)2 = CF.

Proof of Theorem 7.10. The inclusion CF ⊆ L(RTlm)2 is clear from the definition, because we can always construct a leftmost restricted RT where the automaton M cycles, reading all possible symbols from the input or ε, while the grammar G generates some output string. Therefore, we only need to prove the opposite inclusion. We know that the class of context-free languages is defined, inter alia, by pushdown automata. Hence, it is sufficient to prove that every language Llm(Γ)2 of an RT can be accepted by a pushdown automaton. Consider an RT Γ = (M = (Q, ΓΣ, Γδ, q0, F), G = (N, T, P, S), Ψ) and define the PDA O = (Q, T, OΓ, Oδ, q0, S, F), where OΓ = N ∪ T and Oδ is created as follows:

• set Oδ = ∅;

• for each r1 = A → x ∈ P and r2 = pa → q ∈ Γδ such that (r1, r2) ∈ Ψ, add Ap → (x)^R q into Oδ;

• for each p ∈ Q and a ∈ T, add apa → p into Oδ.

Now, we have to show that L(O) = Llm(Γ)2.

Claim 7.11 Let (q0w, S) ⇒* (pw′, uαv) ⇒* (f, ŵ) in the RT Γ, where u ∈ T*, α ∈ N, and v ∈ (N ∪ T)*. Then, Sq0ŵ ⇒* (v)^R αpŵ′ in the PDA O, where ŵ = uŵ′.
Proof of Claim 7.11. By induction on the number of computation steps.
Basis. Let (q0w, S) ⇒^0 (q0w, S) ⇒* (f, ŵ) in Γ. Trivially, Sq0ŵ ⇒^0 Sq0ŵ and Claim 7.11 holds.

Induction hypothesis. Suppose that Claim 7.11 holds for j or fewer computation steps.
Induction step. Let (q0w, S) ⇒^j (paw′, uαv) ⇒ (qw′, uxv) ⇒* (f, ŵ) in Γ, where a ∈ ΓΣ ∪ {ε}, uxv = uu′βv′, and β is the new leftmost nonterminal. Then, by the induction hypothesis, Sq0ŵ ⇒* (v)^R αpaŵ′ in O. Since (paw′, uαv) ⇒ (qw′, uxv) in Γ, we have paw′ ⇒ qw′ [r1] in M, uαv ⇒ uxv [r2] in G, and (r1, r2) ∈ Ψ. From the construction of Oδ, O has the rules αp → (x)^R q and bqb → q for all b ∈ T. Hence, (v)^R αpŵ′ ⇒ (xv)^R qŵ′. Because uxv = uu′βv′, β is the leftmost nonterminal, and (qw′, uxv) ⇒* (f, ŵ), we get (xv)^R qŵ′ = (u′βv′)^R qu′ŵ′′, and obviously, (u′βv′)^R qu′ŵ′′ ⇒* (βv′)^R qŵ′′ in O. Claim 7.11 holds.

The last step of every successful computation of Γ has to be of the form (qa, uαv) ⇒ (f, uxv), with a ∈ T ∪ {ε}, f ∈ F, and uxv ∈ T*. By Claim 7.11, suppose that O is in the configuration (αv)^R qw′, where uw′ = uxv. From the construction of Oδ, (αv)^R qw′ ⇒ (xv)^R fw′ ⇒* f in O. Hence, Llm(Γ)2 ⊆ L(O). It remains to prove the opposite inclusion, that is, L(O) ⊆ Llm(Γ)2.

Claim 7.12 Let Sq0w ⇒* f in the PDA O, where f ∈ F. Then, (q0ŵ, S) ⇒* (f, w) in the RT Γ. Proof of Claim 7.12. Consider any successful acceptance:

Sq0w ⇒* f (III)

in the PDA O. We can express (III) as α0q0w0 ⇒ v1α1u1q1w0 ⇒* v1α1q1w1 ⇒ v2α2u2q2w1 ⇒* v2α2q2w2 ⇒ … ⇒ vkαkukqkwk−1 ⇒* vkαkqkwk ⇒ vkuk+1fwk ⇒* f, where α0 = S and, for all i = 1, …, k with k ≥ 0, αi ∈ N, ui, uk+1, vk ∈ T*, vi ∈ (N ∪ T)*, wi−1 = (ui)^R wi, and wk = (vkuk+1)^R. Clearly, (ui)^R αi(vi)^R ⇒ (ui+1ui)^R αi+1(vi+1)^R [r′i] in G, qi−1ŵi−1 ⇒ qiŵi [ri] in M, and, furthermore, (ri, r′i) ∈ Ψ for all i = 0, …, k. Hence, (III) can be simulated by (q0ŵ0, α0) ⇒ (q1ŵ1, (u1)^R α1(v1)^R) ⇒ (q2ŵ2, (u2u1)^R α2(v2)^R) ⇒ … ⇒ (qkŵk, (ukuk−1 … u1)^R αk(vk)^R) ⇒ (f, (uk+1ukuk−1 … u1)^R (vk)^R) = (f, w) in Γ. So, Claim 7.12 holds.

As L(O) ⊆ Llm(Γ)2 and Llm(Γ)2 ⊆ L(O), Theorem 7.10 holds. 

Lemma 7.13 For every language L ∈ CF, there is an RT Γ = (M, G, Ψ) such that Llm(Γ)1 = L.

Proof of Lemma 7.13. Let I = (IN, T, IP, S) be a context-free grammar such that L(I) = L. For I, we can construct a context-free grammar H = (HN, T, HP, S), where HN = IN ∪ {⟨a⟩ | a ∈ T} and HP = {⟨a⟩ → a | a ∈ T} ∪ {A → x′ | A → x ∈ IP and x′ is created from x by replacing every a ∈ T in x with ⟨a⟩}. Surely, L(I) = L(H) even if H replaces only the leftmost nonterminal in each derivation step. In addition, we construct the finite automaton M = ({q0}, T, δ, q0, {q0}) with δ = {q0 → q0} ∪ {q0a → q0 | a ∈ T}, and Ψ = {(q0 → q0, A → x) | A → x ∈ HP, A ∈ IN} ∪ {(q0a → q0, ⟨a⟩ → a) | a ∈ T}. It is easy to see that whenever H replaces a nonterminal from IN in its sentential form, M reads no input symbol, and if and only if H replaces ⟨a⟩ with a, where a ∈ T, then M reads a from the input. Since H works in a leftmost way, 2-Llm(Γ) = {(w, w) | w ∈ L(I)}. Hence, Llm(Γ)1 = L(I). □
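The transformation of I into H used in this proof can be sketched as follows (an illustrative Python helper under our own naming; rules are (lhs, rhs) pairs with rhs a tuple of symbols, and "<a>" stands for the nonterminal ⟨a⟩):

```python
def leftmost_rt_grammar(rules, terminals):
    """Build the rule set of H from the rule set of a CFG I:
    every terminal a occurring in a right-hand side is replaced by a
    fresh nonterminal <a>, and the rules <a> -> a are added, so that in
    a leftmost derivation each terminal is emitted by its own step,
    which the FA of the RT can synchronize with (reading a).

    rules: list of (lhs, rhs) pairs of I, rhs a tuple of symbols.
    terminals: the set T of terminal symbols.
    Returns the list of (lhs, rhs) rules of H."""
    h_rules = [(f"<{a}>", (a,)) for a in sorted(terminals)]
    for lhs, rhs in rules:
        h_rules.append(
            (lhs, tuple(f"<{s}>" if s in terminals else s for s in rhs)))
    return h_rules
```

For instance, the rule S → aSb of a grammar for {a^n b^n} becomes S → ⟨a⟩S⟨b⟩, plus the emitting rules ⟨a⟩ → a and ⟨b⟩ → b.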

Similarly, we show that any RT generating its output in the leftmost way can recognize no language outside CF.

Lemma 7.14 Let Γ be an RT. Then, for every language Llm(Γ)1, there is a PDA O such that Llm(Γ)1 = L(O). Proof of Lemma 7.14. In the same way as in the proof of Theorem 7.3, we construct a PDA O such that L(O) = Llm(Γ)1 for the RT Γ = (M = (Q, ΓΣ, Γδ, q0, F), G = (N, T, P, S), Ψ). We define O as O = (Q, ΓΣ, N, Oδ, q0, S, F), where Oδ is created in the following way:

• set Oδ = ∅;

• for each r1 = pa → q ∈ Γδ and r2 = A → x ∈ P such that (r1, r2) ∈ Ψ, add Apa → (θ(x))^R q into Oδ, where θ is a function from (N ∪ T)* to N* that replaces all terminal symbols in its argument with ε; that is, θ(x) is x without terminal symbols.

In the following, we demonstrate that L(O) = Llm(Γ)1.

Claim 7.15 Let (q0w, S) ⇒* (pw′, uαv) in the RT Γ, where u ∈ T*, α ∈ N, and v ∈ (N ∪ T)*. Then, Sq0w ⇒* (θ(v))^R αpw′ in the PDA O.
Proof of Claim 7.15. By induction on the number of computation steps.
Basis. Let (q0w, S) ⇒^0 (q0w, S) in Γ. Then, surely, Sq0w ⇒^0 (θ(S))^R q0w. Claim 7.15 holds.
Induction hypothesis. Suppose that Claim 7.15 holds for j or fewer computation steps.
Induction step. Let (q0w, S) ⇒^j (paw′, uαv) ⇒ (qw′, uxv) in Γ, where a ∈ ΓΣ ∪ {ε}, uxv = uu′βv′, and β is the leftmost nonterminal. By the induction hypothesis, Sq0w ⇒* (θ(v))^R αpaw′ in O. Because (paw′, uαv) ⇒ (qw′, uxv) in Γ, we have paw′ ⇒ qw′ [r1] in M, uαv ⇒ uxv [r2] in G, and (r1, r2) ∈ Ψ. From the construction of Oδ, O has a rule αpa → (θ(x))^R q, and (θ(v))^R αpaw′ ⇒ (θ(v′))^R βqw′ in O. Claim 7.15 holds.

The last step of any successful computation in Γ is of the form (qa, uαv) ⇒ (f, uxv), where f ∈ F, a ∈ ΓΣ ∪ {ε}, α ∈ N, and uxv ∈ T*. Hence, αqa → f ∈ Oδ and αqa ⇒ f in O. So, Llm(Γ)1 ⊆ L(O).

Claim 7.16 Let Sq0w ⇒* (θ(v))^R αpw′ in the PDA O. Then, (q0w, S) ⇒* (pw′, uαv) in the RT Γ, where u ∈ T*, α ∈ N, and v ∈ (N ∪ T)*.

Proof of Claim 7.16. By induction on the number of moves.
Basis. Let Sq0w ⇒^0 Sq0w. Then, (q0w, S) ⇒^0 (q0w, S) in Γ and Claim 7.16 holds.
Induction hypothesis. Suppose that Claim 7.16 holds for j or fewer moves.
Induction step. Let Sq0w ⇒^j (θ(v))^R αpaw′ ⇒ (θ(xv))^R qw′ in O, where a ∈ ΓΣ ∪ {ε}. Then, by the induction hypothesis, (q0w, S) ⇒* (paw′, uαv) in Γ, where u ∈ T*, α ∈ N, and v ∈ (N ∪ T)*. Because there is a rule αpa → (θ(x))^R q in Oδ, from the construction of Oδ, there are rules r1 = pa → q ∈ Γδ and r2 = α → x ∈ P with (r1, r2) ∈ Ψ. Therefore, (paw′, uαv) ⇒ (qw′, uxv) in Γ. So, Claim 7.16 holds. Furthermore, if θ(xv)w′ = ε and q ∈ F, then (paw′, uαv) ⇒ (q, uxv) and L(O) ⊆ Llm(Γ)1.

Since L(O) ⊆ Llm(Γ)1 and Llm(Γ)1 ⊆ L(O), L(O) = Llm(Γ)1. □

Theorem 7.17 L (RTlm)1 = CF.

Proof of Theorem 7.17. It directly follows from Lemma 7.13 and Lemma 7.14. 

Unfortunately, the price for the leftmost restriction placed on derivations in the context-free grammar is relatively high, and both the accepting and the generative ability of RT with this restriction decrease to CF. In the following, we extend RT with the possibility to prefer one rule over another; that is, the restriction sets contain triplets of rules (instead of pairs of rules), where the first rule is a rule of the FA, the second rule is a main rule of the CFG, and the third rule is an alternative rule of the CFG, which is used only if the main rule is not applicable.

Definition 7.18 (RT with appearance checking) The RT with appearance checking, RTac for short, is a triplet Γ = (M, G, Ψ), where M = (Q, Σ, δ, q0, F) is a finite automaton, G = (N, T, P, S) is a context-free grammar, and Ψ is a finite set of triplets of the form (r1, r2, r3) such that r1 ∈ δ and r2, r3 ∈ P. Let χ = (pav1, uAv2) and χ′ = (qv1, uxv2), where A ∈ N, v2, x, u ∈ (N ∪ T)*, v1 ∈ Σ*, a ∈ Σ ∪ {ε}, and p, q ∈ Q, be two 2-configurations. Γ makes a computation step from χ to χ′, written as χ ⇒ χ′, if and only if for some (r1, r2, r3) ∈ Ψ, pav1 ⇒ qv1 [r1] in M, and either

• uAv2 ⇒ uxv2[r2] in G, or

• uAv2 ⇒ uxv2[r3] in G and r2 is not applicable on uAv2 in G.

The 2-language 2-L(Γ) and the languages L(Γ)1, L(Γ)2 are defined in the same way as in Definition 7.1. The classes of languages defined by the first and the second component of the system are denoted by L(RTac)1 and L(RTac)2, respectively.
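The either/or step of Definition 7.18, a main rule with an alternative tried only when the main rule is not applicable, can be sketched as follows (an illustrative Python helper of our own; rules are (lhs, rhs) pairs over tuples of symbols, and for simplicity the sketch rewrites the first occurrence of the left-hand side, whereas the definition permits any occurrence):

```python
def ac_step(form, main, alt):
    """One grammar step of an RT with appearance checking: apply the
    main rule if its left-hand side occurs in the sentential form;
    otherwise fall back to the alternative rule. Returns the rewritten
    form, or None if neither rule is applicable."""
    for lhs, rhs in (main, alt):
        for i, sym in enumerate(form):
            if sym == lhs:
                return form[:i] + rhs + form[i + 1:]
    return None
```

This is exactly the mechanism used later in the proof of Theorem 7.20: trying Aj → ♦ as the main rule checks whether Aj occurs, and the fallback S → S lets the computation continue when it does not.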

With appearance checking, both the generative and the accepting power of RT grow to the power of Turing machines.

Theorem 7.19 L (RTac)2 = RE.

Proof of Theorem 7.19. Because L (MATac) = RE (see [37]), we only need to prove that L (MATac) ⊆ L (RTac)2.

Consider a MATac I = (IG, IC) and construct an RTac Γ = (ΓM, ΓG, Ψ) such that L(I) = L(Γ)2 as follows: Set ΓG = IG; add a new initial nonterminal S′, a nonterminal ∆, and the rules ∆ → ∆, ∆ → ε, S′ → S∆ into the grammar ΓG; and construct the finite automaton ΓM = (Q, Σ, δ, s, F) and Ψ in the following way: Set F, Q = {s}, δ = {s → s}, and Ψ = {(s → s, ∆ → ε, ∆ → ε), (s → s, S′ → S∆, S′ → S∆)}; for every m = (p1, t1) … (pk, tk) ∈ IC, add q1, q2, …, qk−1 into Q, s → q1, q1 → q2, …, qk−2 → qk−1, qk−1 → s into δ, and (s → q1, p1, c1), (q1 → q2, p2, c2), …, (qk−2 → qk−1, pk−1, ck−1), (qk−1 → s, pk, ck) into Ψ, where, for 1 ≤ i ≤ k, if ti = −, then ci = pi; otherwise, ci = ∆ → ∆. Since S′ is the initial symbol, the first computation step in Γ is (s, S′) ⇒ (s, S∆). After this step, the finite automaton simulates matrices of I by its moves. That is, if x1 ⇒ x2 [p] in I, where p = p1 … pi for some i ∈ N, then there are q1, …, qi−1 ∈ Q such that r1 = s → q1, r2 =

q1 → q2, …, ri−1 = qi−2 → qi−1, ri = qi−1 → s ∈ δ and (r1, p1, c1), …, (ri, pi, ci) ∈ Ψ. Therefore, (s, x1) ⇒^i (s, x2) in Γ. Notice that if I can skip some grammar rule in m ∈ IC, Γ represents this fact by using ∆ → ∆ within the corresponding move of ΓM. Similarly, if, for some i ∈ N, (s, x1) ⇒^i (s, x2) in Γ and there is no j < i such that (s, x1) ⇒^j (s, y) ⇒* (s, x2), there exists p ∈ IC such that x1 ⇒ x2 [p] in I. Hence, if (s, S′) ⇒* (s, w) in Γ, where w is a string over the set of terminals of ΓG, then S ⇒* w in I; and, on the other hand, if S ⇒* w in I for a string w over the set of terminals of IG, then (s, S′) ⇒ (s, S∆) ⇒* (s, w∆) ⇒ (s, w) in Γ. □

Theorem 7.20 L (RTac)1 = RE.

Proof of Theorem 7.20. Let I = (IQ, Σ, Iδ, q0, F) be a k-CA for some k ≥ 1 and construct an RTac Γ = (M = (MQ, Σ, Mδ, q0, F), G = (N, T, P, S), Ψ) as follows: Set T = {a}, Ψ = ∅, N = {S, ♦, A1, …, Ak}, P = {A → ε, A → ♦ | A ∈ N − {♦}} ∪ {S → S}, MQ = IQ, and Mδ = {f → f | f ∈ F}. For each pa → q(t1, …, tk) in Iδ, n = Σ_{i=1}^{k} θ(ti), and m = Σ_{i=1}^{k} θ̂(ti), where, if ti ∈ Z, θ(ti) = max(0, −ti) and θ̂(ti) = max(0, ti); otherwise, θ(ti) = 1 and θ̂(ti) = 0, add:

• q1, . . . , qn into M Q;

• r = S → xS, where x ∈ (N − {S, ♦})^* and occur(Ai, x) = θ̂(ti) for each i = 1, . . . , k, into P;

• r1 = q0a → q1, r2 = q1 → q2, . . . , rn = qn−1 → qn, rn+1 = qn → q into M δ with q0 = p; and, for each i = 1, . . . , n, add (ri+1, τi, τi′) into Ψ, where, for each j = 1, . . . , k, if tj ∈ Z, then, for θ(tj) of the indices i, τi = τi′ = Aj → ε; otherwise, if tj = −, then τi = Aj → ♦ and τi′ = S → S. Notice that n = 0 means that only q0a → q and S → xS are considered. Furthermore, add (r1, r, r) into Ψ;

• (f → f, S → ε, S → ε) into Ψ for all f ∈ F.

Similarly as in the proof of Theorem 7.6, the finite automaton of the constructed system uses the context-free grammar as an external storage, and each counter of I is represented by a nonterminal. If I modifies some counters during a move from state p to state q, M moves from p to q in several steps, during which it changes the numbers of occurrences of the corresponding nonterminals. Rules applicable only if some counters are equal to zero are simulated by appearance checking, where Γ tries to replace all nonterminals representing counters that have to be 0 by ♦. If this is not possible, Γ applies the rule S → S and continues with the computation. Otherwise, since ♦ cannot be rewritten during the rest of the computation, the use of such a rule leads to an unsuccessful computation. The formal proof of the equivalence of the languages is left to the reader. Since L(k-CA) = RE for every k ≥ 2 (see [58]), Theorem 7.20 holds. 
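Under these conventions, the counter encoding used in the proof can be illustrated as follows. The helper below is our own hypothetical sketch, not part of the thesis: the value of counter i is kept as the number of occurrences of the nonterminal Ai, and a zero test is treated as an appearance check.

```python
# Sketch of the counter encoding in the proof above: the value of counter
# i is the number of occurrences of A_i, and a zero test on counter i is
# an appearance check -- Gamma tries A_i -> ♦; if no A_i occurs, the check
# succeeds and the dummy rule S -> S is used instead.
from collections import Counter

def apply_move(counts, deltas, zero_tests):
    """counts: Counter mapping i to the value of counter i.
    deltas: dict i -> integer change; zero_tests: counters that must be 0.
    Returns the new Counter, or None if the move is not applicable."""
    for i in zero_tests:
        if counts[i] > 0:           # some A_i is present: ♦ would appear
            return None
    new = Counter(counts)
    for i, d in deltas.items():
        if d < 0 and new[i] < -d:   # cannot erase a missing A_i
            return None
        new[i] += d
    return +new                     # drop counters that reached zero

c = apply_move(Counter({1: 2}), {1: -1, 2: 3}, set())
```

Decrementing counter 1 erases one occurrence of A1, incrementing counter 2 adds three occurrences of A2, and a later zero test on counter 1 fails as long as an A1 remains.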

7.2 Applications in Natural Language Translation

In this section, we present several examples illustrating potential applications of the for- mal models discussed in this paper in natural language processing (NLP), particularly in translation.

Currently, mainly because of the availability of large corpora (both unannotated and annotated) for many languages, statistical approaches with the application of machine learning are dominant in machine translation and NLP in general. Many machine translation systems are based on n-gram models and use little or no syntactic information (see [9]). However, a recent trend consists in incorporating global information of this kind (global within the scope of the sentence) into translation systems in an effort to improve the results. This approach is usually called syntax-based or syntax-augmented translation (see [64], [112]). An in-depth description of the formal background of one such system can be found in [7]. In this paper, we present a new formalism that can be applied within the context of such systems to describe syntactic structures and their transformations, and we study its properties. RT provides an alternative to formal models currently used in translation systems, such as synchronous grammars (see [21]). One of the main advantages of the RT formalism lies in its straightforward and intuitive basic principle: we simply read the input with an FA, while generating the corresponding output using a CFG. Another important advantage is the power of RT (see Section 7.1). RT has the ability to describe features of natural languages that are difficult or even impossible to capture within a purely context-free framework, such as non-projectivity, demonstrated below. RT can also be easily extended and adapted for use in statistical NLP. For example, similarly to probabilistic CFGs (see [62]), we can assign probabilities to rules, or to pairs of rules in the control set. First, to demonstrate the basic principles, we perform the passive transformation of a simple English sentence

The cat caught the mouse.

The passive transformation means transforming a sentence in the active voice into the passive voice, and it is a well-known principle common to many languages. For the above sentence, the passive form is

The mouse was caught by the cat.

Figure 7.2 shows derivation trees for the above sentences. Throughout this section, we use the following notation to represent common linguistic constituents:

AUX auxiliary verb
DET determiner
N noun
NP noun phrase
NP-SBJ noun phrase in the role of subject
P preposition
PN pronoun
PP prepositional phrase
V verb
VP verb phrase

Essentially, what we need to do is the following: swap the subject and the object, add the preposition by in the correct position, and change the verb into the passive, using the auxiliary verb to be in the appropriate form. The verb to be is irregular and has many different forms (paradigms) depending not only on tense, but also on person and number. In most cases, we can see the tense directly from the main verb in the active form, but for the other two categories (person and number), we need to look at the subject (the object in the original sentence).


Figure 7.2: Example of the passive transformation

Example 7.21 Consider an RT Γ = (M, G, Ψ), where M = (Q, Σ, δ, 0, F) and G = (N, T, P, S). Let Q = {0, 1, 2, 3, 4, 5, 6, 7, 8a, 8b, 8c, 8d, 8e, 8f, 9}, Σ = {N1s, N2s, N3s, N1p, N2p, N3p, Vpas, Vps, Vpp, DET, P, AUXpas1s, AUXpas2s, AUXpas3s, AUXpas1p, AUXpas2p, AUXpas3p, AUXps1s, AUXps2s, AUXps3s, AUXps1p, AUXps2p, AUXps3p}, F = {9}, N = {S, NP-SBJ, NP, VP, PP, N?, V?, AUXpas?, AUXps?}, T = Σ. Furthermore, let δ, P, and Ψ be given as follows:

δ = { r1 = 0 → 1, r2 = 1 → 2, r3 = 2 → 3, r4 = 3DET → 4,
      r5a = 4N1s → 3, r5b = 4N2s → 3, . . . , r5f = 4N3p → 3,
      r6a = 3Vpas → 5, r6b = 3Vps → 5, r7 = 5 → 6, r8 = 6DET → 7,
      r9a = 7N1s → 8a, r9b = 7N2s → 8b, . . . , r9f = 7N3p → 8f,
      r10a = 8a → 9, r10b = 8b → 9, . . . , r10f = 8f → 9 }

P = { p1 = S → NP-SBJ VP, p2 = NP-SBJ → NP, p3 = NP → DET N?,
      p4a = N? → N1s, p4b = N? → N2s, . . . , p4f = N? → N3p,
      p5 = VP → V? PP, p6 = PP → P NP,
      p7a = V? → AUXpas? Vpp, p7b = V? → AUXps? Vpp,
      p8a = AUXpas? → AUXpas1s, p8b = AUXpas? → AUXpas2s, . . . , p8f = AUXpas? → AUXpas3p,
      p9a = AUXps? → AUXps1s, p9b = AUXps? → AUXps2s, . . . , p9f = AUXps? → AUXps3p }

Ψ = {(r1, p1), (r2, p5), (r3, p6), (r4, p3), (r5a, p4a), (r5b, p4b), . . . , (r5f, p4f), (r6a, p7a), (r6b, p7b), (r7, p2), (r8, p3), (r9a, p4a), (r9b, p4b), . . . , (r9f, p4f), (r10a, p8a), (r10b, p8b), . . . , (r10f, p8f), (r10a, p9a), (r10b, p9b), . . . , (r10f, p9f)}

One may notice that the input alphabet of the automaton, as well as the terminal alphabet of the grammar, does not contain the actual words themselves, but rather symbols representing word categories and their properties (for example, N3s represents a noun in the third person singular). In the examples throughout this section, we consider syntax analysis and translation on the abstract level, transforming syntactic structures. That is, we assume that we already have the input sentence split into words (or possibly some other units as appropriate), and these words are tagged as, for example, a noun, pronoun, or verb. In the computation examples, the text in square brackets shows the words associated with the symbols for the given example sentence, but note that this is not a part of the formalism itself. This specifier is assigned to all terminals. Nonterminals are only specified by words when the relation can be established from the computation performed so far (for example, we cannot assign a word before we read the corresponding input token). For the sentence the cat caught the mouse (for the purposes of this text, we disregard capitalization and punctuation) from the above example, the computation can proceed as follows:

(0 DET[the]N3s[cat]Vpas[caught] DET[the]N3s[mouse], S) ⇒ (1 DET[the]N3s[cat]Vpas[caught] DET[the]N3s[mouse], NP-SBJ VP) [(r1, p1)] ⇒ (2 DET[the]N3s[cat]Vpas[caught] DET[the]N3s[mouse], NP-SBJ V? PP) [(r2, p5)] ⇒ (3 DET[the] N3s[cat]Vpas[caught] DET[the]N3s[mouse], NP-SBJ V? P[by] NP) [(r3, p6)] ⇒ (4 N3s[cat] Vpas[caught] DET[the]N3s[mouse], NP-SBJ V? P[by] DET[the] N?) [(r4, p3)] ⇒ (3 Vpas[caught] DET[the]N3s[mouse], NP-SBJ V? P[by] DET[the]N3s[cat]) [(r5c, p4c)] ⇒ (5 DET[the]N3s[mouse], NP-SBJ AUXpas?[be]Vpp[caught] P[by] DET[the] N3s[cat]) [(r6a, p7a)] ⇒ (6 DET[the] N3s[mouse], NP AUXpas?[be]Vpp[caught] P[by] DET[the] N3s[cat]) [(r7, p2)] ⇒ (7 N3s[mouse], DET[the]N? AUXpas?[be]Vpp[caught] P[by] DET[the]N3s[cat]) [(r8, p3)] ⇒ (8c, DET[the]N3s[mouse] AUXpas?[be] Vpp[caught] P[by] DET[the]N3s[cat]) [(r9c, p4c)] ⇒ (9, DET[the]N3s[mouse] AUXpas3s[was]Vpp[caught] P[by] DET[the]N3s[cat]) [(r10c, p8c)] For clarity, in each computation step, the input symbol to be read (if any) and the nonter- minal to be rewritten are underlined. First (in states 0, 1, and 2), we generate the expected basic structure of the output sentence. Note that this is done before reading any input. In states 3 and 4, we read the subject of the original sentence, states 5 and 6 read the verb, and the rest of the states is used to process the object. When we read the verb, we generate its passive form, consisting of to be and the verb in past participle. However, at this point, we know the tense (in this case, past simple), but do not know the person or number yet. The missing information is represented by the question mark (?) symbol in the nonterminal AUXpas?. Later, when we read the object of the original sentence, we rewrite AUXpas? to a terminal. In this case, the object is in third person singular, which gives us the terminal AUXpas3s (meaning that the correct form to use here is was).
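To make the interplay between the two components concrete, here is a small interpreter for the basic RT principle: the FA reads the input while the CFG generates the output, with moves synchronized by the control set Ψ. It is a deliberate simplification, not the thesis' exact formalism: nonterminals are encoded as uppercase letters, a rule rewrites the leftmost occurrence of its left-hand side, and it is demonstrated on a toy RT translating a^n b into c^n d rather than on Example 7.21 itself.

```python
# A small interpreter for the basic RT principle: the FA reads the input
# while the CFG generates the output, synchronized by Psi.  Simplified
# sketch: nonterminals are uppercase letters and a grammar rule rewrites
# the leftmost occurrence of its left-hand side.

def rt_run(delta, P, Psi, final, state, inp, sent="S"):
    """Search for an accepting computation; return the output string or None.
    delta: move name -> (state, symbol read or '', next state);
    P: rule name -> (lhs nonterminal, rhs string)."""
    if not inp and state in final and sent.islower():
        return sent                    # input consumed, no nonterminals left
    for r, p in Psi:
        q0, a, q1 = delta[r]
        lhs, rhs = P[p]
        if q0 != state or lhs not in sent:
            continue
        if a and not inp.startswith(a):
            continue
        out = rt_run(delta, P, Psi, final, q1,
                     inp[len(a):], sent.replace(lhs, rhs, 1))
        if out is not None:
            return out
    return None

# Toy RT: reading a^n b, the grammar generates c^n d.
delta = {"r1": ("0", "a", "0"), "r2": ("0", "b", "1")}
P = {"p1": ("S", "cS"), "p2": ("S", "d")}
Psi = [("r1", "p1"), ("r2", "p2")]
print(rt_run(delta, P, Psi, {"1"}, "0", "aab"))   # prints: ccd
```

Each pair in Ψ couples one automaton move with one grammar rule, exactly as the pairs [(r1, p1)], [(r2, p5)], . . . annotate the computation steps above.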

Next, we present examples of translation between different languages. We focus on Japanese, Czech, and English. One problem when translating into Czech is its very rich inflection. The form of the words reflects many grammatical categories, such as case, gender, and number (see [54], where the author discusses this issue with regard to computational linguistics). To illustrate, compare the following sentences in Japanese, English, and Czech.

Zasshi o yondeitta onna no hito wa watashi no shiriai deshita. Zasshi o yondeitta otoko no hito wa watashi no shiriai deshita.

The woman who was reading a magazine was an acquaintance of mine. The man who was reading a magazine was an acquaintance of mine.

Žena, která četla časopis, byla moje známá. Muž, který četl časopis, byl můj známý.

As we can see, in Czech, nearly every word is different, depending on the gender of the subject. In contrast, in both Japanese and English, the two sentences differ only in one word – onna no hito (woman) and otoko no hito (man).1 The above sentences also give us an example of some structural differences between Japanese and Czech. In Czech and English, the structure of the sentence is very similar, but in Japanese, there is no word that corresponds directly to který (which, who, . . . ). Instead, this relation is represented by the form of the verb yondeitta (the dictionary form is yomu, meaning to read). Compare the derivation trees in Figure 7.3.

Example 7.22 Consider an RT Γ = (M, G, Ψ), where M = (Q, Σ, δ, 0, F) and G = (N, T, P, S). Let Q = {0, m, m1, m2, f, f1, f2, n, n1, n2, 1m, 1f, 1n}, Σ = {NPm, NPf, NPn, NP?, V, DET, #}, F = {1m, 1f, 1n}, N = {S, S′, NP-SBJ, NP?, VP, PN?, V?, X}, T = {NPm, NPf, NPn, Vm, Vf, Vn, PNm, PNf, PNn}. Let δ, P, and Ψ be defined as follows:

δ = { r1 = 0V → 1, r2 = 1 → 0, r3 = 0NP? → 0, r4 = 0DET → 0,
      r5m = 0NPm → m, r5f = 0NPf → f, r5n = 0NPn → n,
      rm1 = mV → m1, rm2 = m1 → m2, rm3 = m2 → m, rm4 = mDET → m,
      rm5 = mNP? → m, rm5m = mNPm → m, rm5f = mNPf → m, rm5n = mNPn → m,
      rm6 = m# → 1m, rm7 = 1m → 1m,
      rf1 = fV → f1, rf2 = f1 → f2, . . . , rf7 = 1f → 1f,
      rn1 = nV → n1, rn2 = n1 → n2, . . . , rn7 = 1n → 1n }

1Technically, onna no hito literally translates to woman’s person or female person, with onna itself meaning woman, female. However, referring to a person only by onna may have negative connotations in Japanese. Similarly for otoko no hito.


Figure 7.3: Derivation trees – Japanese (top) and Czech (bottom)

P = { p1 = S → NP-SBJ VP X, p2 = NP-SBJ → NP?,
      p3m = NP? → NPm, p3f = NP? → NPf, p3n = NP? → NPn,
      p4 = NP? → NP?, p5 = NP? → NP? S′, p6 = VP → V? NP?,
      p7m = V? → Vm, p7f = V? → Vf, p7n = V? → Vn,
      p8 = S′ → PN? VP,
      p9m = PN? → PNm, p9f = PN? → PNf, p9n = PN? → PNn,
      p10 = X → ε }

Ψ = {(r1, p1), (r2, p6), (r3, p4), (r4, p2), (r5m, p4), (r5f, p4), (r5n, p4), (rm1, p5), (rm2, p8), (rm3, p6), (rm4, p4), (rm5, p4), (rm5m, p3m), (rm5f, p3f), (rm5n, p3n), (rm6, p10), (rm7, p3m), (rm7, p7m), (rm7, p9m), (rf1, p5), (rf2, p8), (rf3, p6), (rf4, p4), (rf5, p4), (rf5m, p3m), (rf5f, p3f), (rf5n, p3n), (rf6, p10), (rf7, p3f), (rf7, p7f), (rf7, p9f), (rn1, p5), (rn2, p8), (rn3, p6), (rn4, p4), (rn5, p4), (rn5m, p3m), (rn5f, p3f), (rn5n, p3n), (rn6, p10), (rn7, p3n), (rn7, p7n), (rn7, p9n)}

We have added two dummy symbols: the input symbol #, which acts as the endmarker, and the nonterminal X, which we generate at the beginning of the computation and then erase when all the input has been read (including #). In this example, we read the input sentence in reverse order (right to left). Clearly, this makes no difference from a purely theoretical point of view, but it may be more suitable in practice due to the way Japanese sentences are organized. The computation transforming the sentence zasshi o yondeitta onna no hito wa watashi no shiriai deshita into žena, která četla časopis, byla moje známá can proceed as follows:

(0 V[deshita] NP?[watashi no shiriai] DET[wa] NPf [onna no hito] V[yondeitta] DET[o] NPm[zasshi]#, S) ⇒ (1 NP?[watashi no shiriai] DET[wa] NPf [onna no hito] V[yondeitta] DET[o] NPm[zasshi]#, NP-SBJ VP X) [(r1, p1)] ⇒ (0 NP?[watashi no shiriai] DET[wa] NPf [onna no hito] V[yondeitta] DET[o] NPm[zasshi]#, NP-SBJ V?[byl] NP? X) [(r2, p6)] ⇒ (0 DET[wa] NPf [onna no hito] V[yondeitta] DET[o] NPm[zasshi]#, NP-SBJ V?[byl] NP?[m˚ujzn´am´y] X) [(r3, p4)] ⇒ (0 NPf [onna no hito] V[yondeitta] DET[o] NPm[zasshi]#, NP? V?[byl] NP?[m˚ujzn´am´y] X) [(r4, p2)] ⇒ (f V[yondeitta] DET[o] NPm[zasshi]#, NP?[ˇzena] V?[byl] NP?[m˚ujzn´am´y] X) [(r5f , p4)] 0 ⇒ (f1 DET[o] NPm[zasshi]#, NP?[ˇzena]S V?[byl] NP?[m˚ujzn´am´y] X) [(rf1, p5)] ⇒ (f2 DET[o] NPm[zasshi]#, NP?[ˇzena] PN?[kter´y] VP V?[byl] NP?[m˚ujzn´am´y] X) [(rf2, p8)] ⇒ (f DET[o] NPm[zasshi]#, NP?[ˇzena] PN?[kter´y]V?[ˇcetl] NP? V?[byl] NP?[m˚ujzn´am´y] X) [(rf3, p6)] ⇒ (f NPm[zasshi] #, NP?[ˇzena] PN?[kter´y]V?[ˇcetl] NP? V?[byl] NP?[m˚ujzn´am´y] X) [(rf4, p4)] ⇒ (f #, NP?[ˇzena] PN?[kter´y]V?[ˇcetl] NPm[ˇcasopis]V?[byl] NP?[m˚ujzn´am´y] X) [(rf5m, p3m)] ⇒ (1f, NP?[ˇzena] PN?[kter´y]V?[ˇcetl] NPm[ˇcasopis]V?[byl] NP?[m˚ujzn´am´y]) [(rf6, p10)] ⇒ (1f, NPf [ˇzena] PN?[kter´y] V?[ˇcetl] NPm[ˇcasopis]V?[byl] NP?[m˚ujzn´am´y]) [(rf7, p3f )] ⇒ (1f, NPf [ˇzena] PNf [kter´a]V?[ˇcetl] NPm[ˇcasopis]V?[byl] NP?[m˚ujzn´am´y]) [(rf7, p9f )] ⇒ (1f, NPf [ˇzena] PNf [kter´a]Vf [ˇcetla] NPm[ˇcasopis]V?[byl] NP?[m˚ujzn´am´y]) [(rf7, p7f )] ⇒ (1f, NPf [ˇzena] PNf [kter´a]Vf [ˇcetla] NPm[ˇcasopis]Vf [byla] NP?[m˚ujzn´am´y]) [(rf7, p7f )]


Figure 7.4: Non-projective dependency tree (English)

⇒ (1f, NPf [ˇzena] PNf [kter´a]Vf [ˇcetla] NPm[ˇcasopis]Vf [byla] NPf [moje zn´am´a]) [(rf7, p3f )]

When we first read the word that determines the gender, we move to the state that represents this gender (state m, n, or f). Note that these states are functionally identical in the sense that we can read the same input symbols, while performing the same computation steps in the grammar generating the output. After we have reached the end of input, we rewrite the nonterminal symbols representing words with as of yet unknown gender to the corresponding terminal symbols, depending on the state.

Czech is considered a free-word-order language, which means that it allows for a wide range of permutations of words in a sentence without changing its syntactic structure (the meaning of the sentence may be affected). This is perhaps the main source of the relatively high amount of non-projectivity in Czech sentences. Non-projectivity means that there are cross-dependencies. For example, consider the English sentence

I ate a cake yesterday which was delicious.

As shown in the dependency tree in Figure 7.4, there is a crossing of dependencies represented by the arrows (yesterday, ate) and (was, cake). Arrows are drawn from child (modifier) to parent (head). Arguably, the English example is somewhat artificial – even though the sentence is well-formed, in most cases it might be more natural to say simply

I ate a delicious cake yesterday.

In contrast, in Czech, a sentence such as

Nevím, jaký je mezi nimi rozdíl. (I don't know what the difference between them is.)

is not at all unusual. The dependency tree for this sentence (see Figure 7.5) is also non-projective. For further information about projectivity, and the issue of non-projectivity in the Czech language in particular, see [55]. The following example illustrates how the formalism accounts for non-projectivity.

Example 7.23 Consider an RT Γ = (M, G, Ψ), where M = (Q, Σ, δ, 0,F ), G = (N,T,P,S). Let Q =

Figure 7.5: Non-projective dependency tree (Czech). Word by word: Nevím = (I) don't know, jaký = what, je = is, mezi = between, nimi = them, rozdíl = difference.

{0, 1, 2, 3, 4, 5, 6m, 6f, 6n}, Σ = {Nm, Nf, Nn, V, PN, PNi, DET, P}, F = {0}, N = {S, NP-SBJ, NP, VP, PP, V?}, T = {Nm, Nf, Nn, V, PNm, PNf, PNn, P},

δ = { r1m = 0Nm → 6m, r1f = 0Nf → 6f, r1n = 0Nn → 6n,
      r2 = 0V → 0, r3 = 0PN → 1, r4 = 0PNi → 3, r5 = 0DET → 0, r6 = 0P → 0,
      r7 = 1 → 2, r8 = 2 → 0, r9 = 3 → 4, r10 = 4 → 5, r11 = 5 → 0,
      r12m = 6m → 0, r12f = 6f → 0, r12n = 6n → 0, r13 = 0PN → 0 },

P = { p1 = S → NP-SBJ VP, p2 = NP-SBJ → NP, p3 = NP-SBJ → ε,
      p4m = NP → Nm, p4f = NP → Nf, p4n = NP → Nn,
      p5 = NP → S, p6m = NP → PNm, p6f = NP → PNf, p6n = NP → PNn,
      p7 = VP → V? PP NP, p8 = V? → V, p9 = PP → P NP, p10 = PP → ε }, and

Ψ = {(r1m, p4m), (r1f , p4f ), (r1n, p4n), (r2, p8), (r3, p1), (r4, p10), (r5, p7), (r6, p9), (r7, p3), (r8, p7), (r9, p5), (r10, p1), (r11, p2), (r12m, p6m), (r12f , p6f ), (r12n, p6n), (r13, p6m)}. The computation transforming the English sentence I don’t know what the difference between them is into the (non-projective) Czech sentence nev´ım,jak´yje mezi nimi rozd´ıl proceeds as follows:

(0 PN[I ] V[don’t know] PNi[what] DET[the]Nm[difference] P[between] PN[them] V[is], S) ⇒ (1 V[don’t know] PNi[what] DET[the]Nm[difference] P[between] PN[them] V[is], NP-SBJ VP) [(r3, p1)] ⇒ (2 V[don’t know] PNi[what] DET[the]Nm[difference] P[between] PN[them] V[is], VP) [(r7, p3)] ⇒ (0 V[don’t know] PNi[what] DET[the]Nm[difference] P[between] PN[them] V[is], V? PP NP) [(r8, p7)] ⇒ (0 PNi[what] DET[the]Nm[difference] P[between] PN[them] V[is], V[nev´ım] PP NP) [(r2, p8)] ⇒ (3 DET[the]Nm[difference] P[between] PN[them] V[is], V[nev´ım] NP) [(r4, p10)] ⇒ (4 DET[the]Nm[difference] P[between] PN[them] V[is], V[nev´ım]S) [(r9, p5)] ⇒ (5 DET[the]Nm[difference] P[between] PN[them] V[is], V[nev´ım] NP-SBJ VP) [(r10, p1)]


Figure 7.6: The derivation tree of G

⇒ (0 DET[the] Nm[difference] P[between] PN[them] V[is], V[nev´ım] NP[jak´y] VP) [(r11, p2)] ⇒ (0 Nm[difference] P[between] PN[them] V[is], V[nev´ım] NP[jak´y]V? PP NP) [(r5, p7)] ⇒ (6m P[between] PN[them] V[is], V[nev´ım] NP[jak´y] V? PP Nm[rozd´ıl]) [(r1m, p4m)] ⇒ (0 P[between] PN[them] V[is], V[nev´ım] PNm[jak´y]V? PP Nm[rozd´ıl]) [(r12m, p6m)] ⇒ (0 PN[them] V[is], V[nev´ım] PNm[jak´y]V? P[mezi] NP Nm[rozd´ıl]) [(r6, p9)] ⇒ (0 V[is], V[nev´ım] PNm[jak´y]V? P[mezi] PNm[nimi]Nm[rozd´ıl]) [(r13, p6m)] ⇒ (0, V[nev´ım] PNm[jak´y] V[je] P[mezi] PNm[nimi]Nm[rozd´ıl]) [(r2, p8)] The corresponding derivation tree of G is shown in Figure 7.6.

In the examples presented in this paper, we have made two important assumptions. First, we already have the input sentence analysed at a low level – we know where every word starts and ends (which may itself be a non-trivial problem in some languages, such as Japanese) and we have some basic grammatical information about it. Second, we know the translation of the individual words. For practical applications in natural language translation, we need a more complex system, with at least two other components – a part-of-speech tagger, and a dictionary to translate the actual meanings of the words. Then, the component based on the discussed formal model can be used to transform the syntactic structure and also to ensure that the words in the translated sentence are in the correct form.
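Such a pipeline could be organized as in the following sketch. All three components here are stand-in stubs of our own, not components defined in this thesis: the toy lexicon plays the role of the tagger, the passive() function stands in for the RT of Example 7.21, and the (here empty) dictionary performs the word-level translation.

```python
# Hypothetical pipeline sketch: tagger -> RT structural transform ->
# word-level dictionary.  The lexicon, the passive() transform, and the
# (here empty) dictionary are stand-in stubs, not parts of the formalism.

def translate(sentence, lexicon, rt_transform, dictionary):
    tagged = [(lexicon[w], w) for w in sentence.split()]   # tagger stub
    transformed = rt_transform(tagged)                     # RT component
    return " ".join(dictionary.get(w, w) for _, w in transformed)

lexicon = {"the": "DET", "cat": "N3s", "mouse": "N3s", "caught": "Vpas"}

def passive(tagged):
    """Stand-in for the RT of Example 7.21: expects DET N V DET N and
    produces the passive word order with the auxiliary and preposition."""
    d1, n1, v, d2, n2 = tagged
    return [d2, n2, ("AUXpas3s", "was"), v, ("P", "by"), d1, n1]

out = translate("the cat caught the mouse", lexicon, passive, {})
print(out)   # prints: the mouse was caught by the cat
```

With a non-empty dictionary, the same skeleton would map the reordered words into the target language after the structural transformation.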

Chapter 8

Parsing of n-Path-Restricted Tree-Controlled Languages

The previous chapter has introduced and investigated mechanisms for direct translation. In this chapter, we use an automata system for parsing based upon grammars with restricted derivation trees, which represent an important trend in today's formal language theory (see [31, 39, 67, 68, 71, 73, 95]). These grammars generate their languages just like ordinary context-free grammars do, but their derivation trees have to satisfy some simple prescribed conditions. The idea of restrictions placed on the derivation trees of context-free grammars is introduced in [31], and the grammars restricted in this way are referred to as tree-controlled grammars. In essence, the notion of a tree-controlled grammar is defined as follows: take a context-free grammar G and a regular language R. A string w generated by G belongs to the language defined by G and R only if there is a derivation tree t for w in G such that all levels (except the last one) of t are described by R. Based on the original definition of a tree-controlled grammar, modifications in which many well-known types of both controlled grammars and control languages are considered are studied in [95]. A new type of restriction on derivations is studied in [71]. The authors consider a context-free grammar G and a context-free language R. A string w generated by G belongs to the language defined by G and R only if there is a derivation tree t for w in G such that there exists a path of t described by R. Based on this restriction, they introduce path-controlled grammars and study several properties of this model. As they state in their final remarks, there are many open problems. Some of them are studied in [73], but many modifications still remain unsolved. In [73], based on the non-emptiness problem for the intersection of two languages, the authors establish polynomial recognizability. However, they say nothing about parsing methods for the languages generated by path-controlled grammars. 
As a generalization of path-controlled grammars, the authors of [67] introduce several variants of this rewriting model where not just one, but a given number of paths in the derivation trees of context-free grammars have to be described by a control language. That is, [67] deals with a generalization of path-controlled grammars (G, R), where a string w generated by G belongs to the language defined by G and R only if there is a derivation tree t for w in G such that there exist n given paths of t described by a linear language R, where n ≥ 1. Such a rewriting system is referred to as an n-path tree-controlled grammar. As demonstrated in Section 8.2, this generalized rewriting system can generate some languages not captured by the rewriting system introduced in [71]. Then, based on the common part of all restricted paths of the derivation trees of context-free grammars, the authors introduce

several modifications and establish pumping properties of some of them. The possibilities of parsing methods working in polynomial time for n-path tree-controlled languages are briefly studied in [66]. This chapter continues the investigation of these grammars and establishes a parsing method based on them. After recalling the restrictions placed on n ≥ 1 paths in the derivation trees of context-free grammars, we introduce an LL restriction (see [103]) and present ideas on how to parse the generated language class (not only how to decide the membership problem), where the core of the parsing method rests on an abstraction of 3-accepting restricted pushdown automata systems. Essentially, we discuss the following problem: Can we decide whether a string is recognized by an n-path tree-controlled grammar in polynomial time, for n ≥ 1?

8.1 n-Path-Restricted Tree-Controlled Grammars

In this section, as the subject of investigation in this chapter, we recall the derivation-based generalization of tree-controlled grammars based on the n-path restriction as introduced in [67]. Let G = (N, T, P, S) be a context-free grammar and x ∈ T^*. Let t ∈ G△(x) be a derivation tree of x in G. A path of t is any sequence of nodes whose first node is the root of t, whose last node is a leaf of t, and in which there is an edge in t between every two consecutive nodes of the sequence. Let s be any sequence of nodes of t. Then, word(s) denotes the string obtained by concatenating the labels of the nodes of s in order from left to right. A tree-controlled grammar, TC grammar for short, is a pair (G, R), where G = (N, T, P, S) is a context-free grammar and R ⊆ (N ∪ T)^* is a control language. The language that (G, R) generates under the n-path control by R, n ≥ 1, is denoted by n-pathL(G, R) and is defined by the following equivalence: for all x ∈ T^*, x ∈ n-pathL(G, R) if and only if there exists a derivation tree t ∈ G△(x) such that there is a set Qt of n paths of t with word(p) ∈ R for each p ∈ Qt. Set n-path-TC = {n-pathL(G, R) | (G, R) is a TC grammar}. Hereafter, TC grammars that generate their language under the n-path control are referred to as n-path TC grammars. For each context-free grammar G, there is a regular language which describes all paths in a derivation tree of a string w in G (see Prop. 1 in [71]). Since no regular control language R increases the generative power of G when R controls the paths (see Prop. 1 and Prop. 2 in [71]), we investigate TC grammars with non-regular control languages. On the other hand, as demonstrated in Section 8.2, a linear language used to control the paths in the derivation trees of context-free grammars is a strong enough mechanism to increase the generative power of context-free grammars. Thus, we study n-path TC grammars with linear control languages in the rest of this paper. 
Therefore, in what follows, for a TC grammar (G, R), we consider R as a linear language.

8.2 Examples

In this section, we demonstrate two non-context-free languages that belong to n-path-TC. The languages are first introduced for a selected n ≥ 1, and after that, they are presented in the general case. The specific examples for higher values of n tend to be excessively long and are left to the reader.


Figure 8.1: Illustration of the derivations of a^3b^3c^3d^3e^3f^3 with two paths of the form S^iX^iUb and S^iY^iVd, where i ≥ 1, in Example 8.1 (left), and of (a^2cdb^2)^4 with three paths of the form A^rB^sC^tD^uH^uG^tF^sE^rIa, where r, s, t, u ≥ 1, in Example 8.3 (right).

Example 8.1 Consider the TC grammar that generates n-pathL(G, R) with n = 2, where

G = ({S, X, Y, U, V}, {a, b, c, d, e, f}, P, S),
P = {S → aSf, S → aXYf, X → bXc, Y → dYe, X → U, U → bc, Y → V, V → de},
R = {S^iX^iUb | i ≥ 1} ∪ {S^iY^iVd | i ≥ 1},
2-pathL(G, R) = {a^j b^j c^j d^j e^j f^j | j ≥ 1}.

Clearly, 2-pathL(G, R) ∉ CF. The left-hand part of Figure 8.1 illustrates the derivation tree for the derivation S ⇒^* a^3b^3c^3d^3e^3f^3 in (G, R). Clearly, there are two paths described by the strings S^3X^3Ub and S^3Y^3Vd from R.
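The control condition of this example is easy to check mechanically. The helper below is our own illustration, not part of the formalism: it decides membership of a path word in R of Example 8.1 and confirms the two path words of the derivation tree just discussed.

```python
import re

# Helper (our own, not part of the formalism) deciding membership in the
# control language of Example 8.1:
#   R = { S^i X^i U b | i >= 1 }  ∪  { S^i Y^i V d | i >= 1 }.

def in_R(word):
    m = re.fullmatch(r"(S+)(X+)Ub|(S+)(Y+)Vd", word)
    if not m:
        return False
    sx, xs, sy, ys = m.groups()
    if sx is not None:
        return len(sx) == len(xs)     # the S- and X-blocks must match
    return len(sy) == len(ys)         # the S- and Y-blocks must match

# The two path words of the derivation tree of a^3 b^3 c^3 d^3 e^3 f^3:
print(in_R("SSSXXXUb"), in_R("SSSYYYVd"))   # prints: True True
```

Note that R itself is not regular (it needs matching block lengths), which is exactly why a regex match alone does not suffice and a length comparison is added.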

Example 8.2 Let (G, R) be a TC grammar that generates n-pathL(G, R), n ≥ 1, where

G = ({S} ∪ {Ai, Bi | 1 ≤ i ≤ n}, {ai | 1 ≤ i ≤ 2n + 2}, P, S),
P = {S → a1Sa2n+2, S → a1A1A2 . . . Ana2n+2}
  ∪ {Ai+1 → a2i+2Ai+1a2i+3, Ai+1 → Bi+1, Bi+1 → a2i+2a2i+3 | 0 ≤ i ≤ n − 1},
R = ⋃_{i=1}^{n} {S^k Ai^k Bi a2i | k ≥ 1}.

Clearly, R ∈ LIN. Consider a derivation in (G, R):

S ⇒^k a1^k S a2n+2^k
  ⇒ a1^k a1 A1A2 . . . An a2n+2 a2n+2^k
  ⇒^{n·k} a1^{k+1} a2^k B1 a3^k . . . a2n^k Bn a2n+1^k a2n+2^{k+1}
  ⇒^{n} a1^{k+1} a2^{k+1} a3^{k+1} . . . a2n^{k+1} a2n+1^{k+1} a2n+2^{k+1}

Clearly, n ≥ 1 paths are described by R and, in this way, (G, R) generates n-pathL(G, R) = {a1^k a2^k . . . a2n+2^k | k ≥ 1} ∉ CF.

Example 8.3 Consider the TC grammar (G, R) with n-pathL(G, R), for n = 3, where

G = ({A, B, C, D, E, F, G, H, I}, {a, b, c, d}, P, A),
P = {A → aA, A → aB, B → Bb, B → C, C → cC, C → D, D → Dd, D → HHH, E → Ea, E → I, F → bF, F → E, G → Gc, G → F, H → dH, H → G, I → a},
R = {A^rB^sC^tD^uH^uG^tF^sE^rIa | r, s, t, u ≥ 0},
3-pathL(G, R) = {(a^r c^t d^u b^s)^4 | r > 0, s, t, u ≥ 0}.

Clearly, 3-pathL(G, R) ∉ CF. The right-hand part of Figure 8.1 illustrates the derivation tree for the derivation A ⇒^* (a^2cdb^2)^4 in (G, R). Clearly, there are three paths described by A^2B^3C^2D^2H^2G^2F^3E^2Ia from R.

Example 8.4 Let m ≥ 0 with even m. Let (G, R) be a TC grammar that generates n-pathL(G, R), n ≥ 1, where

G = ({Aj, Bj | 1 ≤ j ≤ m} ∪ {C}, {aj | 1 ≤ j ≤ m}, P, A1),
P = {A1 → a1A1, A1 → a1A2, B1 → B1a1, B1 → C, C → a1}
  ∪ {Am → Amam, Am → Bm^n}
  ∪ {Ai → Aiai, Ai → Ai+1 | 2 ≤ i ≤ m − 1 with even i}
  ∪ {Ai → aiAi, Ai → Ai+1 | 3 ≤ i ≤ m − 1 with odd i}
  ∪ {Bi → aiBi, Bi → Bi−1 | 2 ≤ i ≤ m with even i}
  ∪ {Bi → Biai, Bi → Bi−1 | 3 ≤ i ≤ m with odd i},
R = {A1^{k1} A2^{k2} . . . Am^{km} Bm^{km} Bm−1^{km−1} . . . B2^{k2} B1^{k1} Ca1 | ki ≥ 0, 1 ≤ i ≤ m}.

Clearly, R ∈ LIN. Observe that n ≥ 1 paths are described by R and

n-pathL(G, R) = {(a1^{k1+1} a3^{k3} . . . am−1^{km−1} am^{km} am−2^{km−2} am−4^{km−4} . . . a2^{k2})^{n+1} | ki ≥ 0, 1 ≤ i ≤ m} ∉ CF.

The details of a derivation in this general case are left to the reader.

8.3 Syntax Analysis of n-path-TC

As demonstrated above, some typical non-context-free languages belong to n-path-TC. Thus, it is natural and important for practical use to study the possibilities of parsing methods for this language class. The first and most important requirement on parsing is polynomial parsability.

Theorem 8.5 For a TC grammar (G, R), where G is an unambiguous context-free grammar and R is a linear control language, the membership x ∈ n-pathL(G, R), n ≥ 1, is decidable in O(|x|^k) for some k ≥ 0.

Proof of Theorem 8.5. For some n ≥ 1, let n-pathL(G, R) be the language of a TC grammar (G, R) in which G is unambiguous. We assume that R is generated by some unambiguous linear grammar. Since G is unambiguous, it is well known that we can decide whether or not x ∈ L(G) in O(|x|^2); that is, we distinguish two cases: (1) x ∉ L(G) and (2) x ∈ L(G).

• Clearly, if x ∉ L(G), then x ∉ n-pathL(G, R).

• If x ∈ L(G), then, since G is unambiguous, we can construct the unique derivation tree t of x ∈ L(G) in O(|x|^2). Since each path of t ends in a leaf, t contains |x| paths. Clearly, the height of t is polynomially bounded by some l ≥ 2 with respect to |x|. Thus, for any x ∈ L(G), the length of each path p of t, and therefore also |word(p)|, is bounded by l. Because R is unambiguous and |word(p)| ≤ l for each such path p, it is well known that we can decide whether word(p) ∈ R in polynomial time. If word(pi) ∈ R holds for at least n paths p1, p2, . . . , pn of the derivation tree of x in G, i ∈ {1, 2, . . . , n}, then x ∈ n-pathL(G, R).

Hence, the membership problem is decidable in polynomial time.



As it straightforwardly follows from the above, the proposed way of solving the membership problem leads to a parsing method working, in essence, in two phases:

1. construction of derivation tree t of x in G by top-down parsing method,

2. checking that at least n ≥ 1 paths of t belong to R.

From the practical viewpoint, a situation may occur in which, during phase 1 above, we already know that the currently constructed derivation tree cannot contain the required number of paths described by strings from R. Informally, we do not have to wait to start phase 2 until phase 1 is completely done (that is, until t is completely constructed).
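Phase 2 above can be sketched as follows. The nested-tuple tree encoding and the predicate interface for R are our own illustrative choices; in practice, the predicate would be a parser for the linear grammar of R.

```python
# Sketch of phase 2 above: enumerate the root-to-leaf paths of a derivation
# tree and count those whose word(p) lies in R.  The (label, children)
# tuple encoding and the predicate interface for R are illustrative.

def paths(tree, prefix=""):
    """Yield word(p) for every root-to-leaf path p of the tree."""
    label, children = tree
    word = prefix + label
    if not children:                   # a leaf closes one path
        yield word
    for child in children:
        yield from paths(child, word)

def n_path_check(tree, in_R, n):
    """Does the tree have at least n paths described by R?"""
    return sum(1 for w in paths(tree) if in_R(w)) >= n

# Derivation tree of S => aXYf => ... => abcdef in Example 8.1 (i = 1):
t = ("S", [("a", []),
           ("X", [("U", [("b", []), ("c", [])])]),
           ("Y", [("V", [("d", []), ("e", [])])]),
           ("f", [])])
print(n_path_check(t, lambda w: w in {"SXUb", "SYVd"}, 2))   # prints: True
```

Since the tree of an unambiguous grammar has |x| paths, this check stays polynomial whenever the R-test on a single path word is polynomial, matching the argument in the proof of Theorem 8.5.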

8.3.1 Top-Down Parsing of n-path-TC

For some n ≥ 1, let n-pathL(G, R) be the language of a TC grammar (G, R), where G = (N, T, P, S) is an unambiguous context-free grammar, and let V = N ∪ T. We assume that R is generated by an unambiguous context-free grammar GR = (NR, V, PR, SR). We adjust the idea behind Theorem 8.5 as follows. We construct a labelled derivation tree with the set of labels Ψ = {0, 1} and the following semantics. Let p be a path of derivation tree t in G and let e be an edge between any two consecutive nodes of p. Then, label 0 ∈ Ψ on any e of p indicates that p is not described by R—that is, word(p) ∉ R. Label 1 ∈ Ψ on all e of p indicates that p can potentially be described by R. For the decision whether x ∈ L(G), we use the well-known top-down parsing method to construct the derivation tree t of x in G—that is, starting from S, we construct t according to the rules of G so that the frontier of t equals x (for more details, see [80]).

Let us suppose that a rule r = A → A1A2 . . . Aj ∈ P, j ≥ 1, is used in the derivation step X ⇒ Y, X, Y ∈ (N ∪ T)*, in G. In addition, we need to determine the values of the labels of the edges between A and each Ai, for i = 1, 2, . . . , j, related to the application of r. Let t′ be the derivation tree that corresponds to the derivation S ⇒* w1A1A2 . . . Ajw2 in G, for some w1, w2 ∈ (N ∪ T)*. Essentially, t′ is a sub-tree of t. Clearly, each path of t′ is the beginning part of at least one path in t. Next, we distinguish the following cases:

• If all the edges of t′ are labelled, we can proceed to the next derivation step in G.

• If some of the edges in t′ are not labelled, we need to compute the values of the missing labels.

For each unlabelled edge e of a path p′ in t′, we check whether GR can generate a string of the form word(p′)w with w ∈ NRT* ∪ {ε}. Since |word(p′)| is finite, we can check this in polynomial time. If so, we add label 1 ∈ Ψ to edge e; otherwise, we add label 0 ∈ Ψ to edge e. Note that this phase can be optimized in such a way that we perform the test whether GR can generate word(p′)w with w ∈ NRT* ∪ {ε} symbol by symbol during the generation of word(p′) in GR. The details of this optimization represent a straightforward task which is left to the reader. Next, we distinguish the following cases:
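For a regular control language, this viability test is the standard live-prefix check on a finite automaton for R. The sketch below is an illustration under the simplifying assumption that R is given by a DFA over V (for linear R, the analogous test runs on a one-turn pushdown recogniser); the toy language used here is ours, not the thesis's.

```python
def viable_prefix(dfa, word):
    """True (label 1) iff word can be extended to a string of R; else label 0."""
    trans, start, accepting = dfa
    state = start
    for sym in word:                      # run the DFA on word(p')
        state = trans.get((state, sym))
        if state is None:
            return False                  # dead end: the edge gets label 0
    # label 1 iff some accepting state is reachable from the current state
    reach, stack = {state}, [state]
    while stack:
        s = stack.pop()
        for (p, _), q in trans.items():
            if p == s and q not in reach:
                reach.add(q)
                stack.append(q)
    return bool(reach & accepting)

# Toy control language {S A^m k | m >= 1}, a regular simplification
# of R from Example 8.6, used purely for illustration.
dfa = ({(0, "S"): 1, (1, "A"): 2, (2, "A"): 2, (2, "k"): 3}, 0, {3})

print(viable_prefix(dfa, "SA"))   # True: can be extended to SAk
print(viable_prefix(dfa, "Sb"))   # False: no string of R starts with Sb
```

Precomputing, for each DFA state, whether an accepting state is reachable makes the per-symbol cost constant, which is exactly the symbol-by-symbol optimization mentioned above.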

• If t′ contains no leaf with its input edge labelled by 1, then x ∉ n-pathL(G, R).

• If t′ contains at least one leaf labelled by a symbol of N, we proceed to the next derivation step in G.

• If all the leaves of t′ are labelled by symbols of T and, for at least n of the leaves of t, the input edge is labelled by 1, then x ∈ n-pathL(G, R).

Thus, it is possible to check whether the paths of the derivation tree t of an LL context-free grammar can potentially be described by a given unambiguous context-free language already during the building of t by an LL parser. Example 8.6 explains the syntax analysis in detail.

Example 8.6 Consider n-pathL(G, R), where G = ({S, A, B}, {a, b, c, d, e, k}, P, S) with P = {1 = S → AA, 2 = A → aAd, 3 = A → bBc, 4 = A → e, 5 = B → bBc, 6 = B → k} and R = {SA^mB^(m−1)k | m ≥ 1}. Obviously, L(G) = {a^i(b^jkc^j + e)d^i a^s(b^tkc^t + e)d^s | i, j, t, s ≥ 0}. Clearly, 1-pathL(G, R) = L(G) with i = j or s = t, and 2-pathL(G, R) = L(G) with i = j and s = t. The LL table for G, constructed by the well-known algorithm (see [103]), is presented in Table 8.1. The syntax analyser uses two pushdown automata and a list of 3-tuples containing a state of the second automaton, the contents of its pushdown, and a pointer to the first automaton's pushdown. The first pushdown automaton simulates the construction of the derivation tree by the LL table in the well-known way.

• If the topmost symbol on the pushdown is a nonterminal A, the first input symbol is a, and there is a rule A → x at position [A, a] in the LL table, then the automaton rewrites A on the pushdown with (x)^R (expansion step).

• If the symbol a on the pushdown's top is a terminal and a is the first input symbol, the automaton reads a from the input and removes a from the pushdown (comparison step).

• Other cases represent a syntax error.

        a    b    c    d    e    k    $
   S    1    1              1
   A    2    3              4
   B         5                   6

Table 8.1: LL table for grammar G from Example 8.6.
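The expansion and comparison steps of the first automaton can be sketched as a standard table-driven LL(1) parser. The code below is an illustrative reconstruction for G and Table 8.1; the representation (rule numbers as dictionary values, the stack as a Python list) is ours, not the thesis's.

```python
# Rules of G from Example 8.6, numbered as in P.
RULES = {1: ("S", "AA"), 2: ("A", "aAd"), 3: ("A", "bBc"),
         4: ("A", "e"), 5: ("B", "bBc"), 6: ("B", "k")}

# Table 8.1 as a dictionary: (nonterminal, lookahead) -> rule number.
LL = {("S", "a"): 1, ("S", "b"): 1, ("S", "e"): 1,
      ("A", "a"): 2, ("A", "b"): 3, ("A", "e"): 4,
      ("B", "b"): 5, ("B", "k"): 6}

def ll_parse(x):
    """Return the left parse (sequence of rule numbers) of x, or None."""
    stack, pos, parse = ["S"], 0, []
    while stack:
        top = stack.pop()
        look = x[pos] if pos < len(x) else "$"
        if top in "SAB":                       # expansion step
            r = LL.get((top, look))
            if r is None:
                return None                    # syntax error
            parse.append(r)
            stack.extend(reversed(RULES[r][1]))  # push (x)^R
        elif top == look:                      # comparison step
            pos += 1
        else:
            return None                        # syntax error
    return parse if pos == len(x) else None

print(ll_parse("abkcdaed"))  # [1, 2, 3, 6, 2, 4]
print(ll_parse("abk"))       # None
```

The second automaton and the tuple list of Table 8.2 then run alongside this loop, examining each expansion and comparison as described below.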

1. PDA            2. PDA   Pointer       Tuples
Sqabkcdaed        q0       (q0, ε, 1)    (q0, ε, 1)
AAqabkcdaed       q0       (q0, ε, 1)    (q1, ε, 1), (q1, ε, 2)
AdAaqabkcdaed     q1       (q1, ε, 2)    (q1, ε, 1), (q1, A, 2), (q1, A, 3), (q1, A, 4)
AdAqbkcdaed       Aq1      (q1, A, 4)    (q1, ε, 1), (q1, A, 2), (q1, A, 3)
AdcBbqbkcdaed     Aq1      (q1, A, 3)    (q1, ε, 1), (q1, A, 2), (q1, AA, 3), (q1, AA, 4), (q1, AA, 5)
AdcBqkcdaed       AAq1     (q1, AA, 5)   (q1, ε, 1), (q1, A, 2), (q1, AA, 3), (q1, AA, 4)
Adckqkcdaed       AAq1     (q1, AA, 4)   (q1, ε, 1), (q1, A, 2), (q1, AA, 3), (q1, A, 4)
Adcqcdaed         Aq1      (q1, A, 4)    (q1, ε, 1), (q1, A, 2), (q1, AA, 3), (q1, A, 0)
Adqdaed           AAq1     (q1, AA, 3)   (q1, ε, 1), (q1, A, 2), (q1, A, 0)
Aqaed             Aq1      (q1, A, 2)    (q1, ε, 1), (q1, A, 0)
dAaqaed           q1       (q1, ε, 1)    (q1, A, 0), (q1, A, 1), (q1, A, 2), (q1, A, 3)
dAqed             Aq1      (q1, A, 3)    (q1, A, 0), (q1, A, 1), (q1, A, 2)
deqed             Aq1      (q1, A, 2)    (q1, A, 0), (q1, A, 1), (q1, AA, 2)
dqd               AAq1     (q1, AA, 2)   (q1, A, 0), (q1, A, 1)
q                 Aq1      (q1, A, 1)    (q1, A, 0)

Table 8.2: Parsing of abkcdaed corresponding to TC grammar (G, R) from Example 8.6.

Let us return to the example and consider the input string abkcdaed. In the beginning, the topmost symbol on the pushdown of the first automaton is S. Since a is the first input symbol and there is rule 1 at position [S, a] in the table, the automaton rewrites S by AA on its pushdown. During this computation, the second automaton is parsing the potentially valid paths in the derivation tree. At the beginning, the list contains one item (q0, Ŝ, 1), where 1 represents the first position on the first automaton's pushdown from the bottom and Ŝ is the start pushdown symbol of the second automaton. If the first automaton makes a computation step with symbol a on the pushdown's top and there is a tuple (q, α, p) in the list, where p is the pointer to that symbol, the second automaton places α onto its pushdown, moves to state q, and makes steps for a as the first input symbol until it needs the next input symbol. For example, after the expansion from S to AA, the second automaton finds the list item (q0, Ŝ, 1), moves to state q0, and places Ŝ onto its pushdown. Then, it simulates the step Ŝq0S ⊢ βq, where Ŝq0S → βq is a transition rule of the second automaton. If a = A was a nonterminal and the first automaton made an expansion by a rule A → x, then the syntax analyser removes the used list item and, if the second automaton did not

reject the input, then l tuples (q, β, i) are inserted into the list, for all i = t + 1, . . . , t + l, where l = |x|, t is the topmost-symbol position before the expansion, and β is the current content of the second automaton's pushdown. Otherwise, for a as a terminal symbol, the comparison is done. If the second automaton accepts, the pointer of the used tuple is rewritten to 0, where 0 denotes an accepted path. If it does not accept with a terminal a, the tuple is removed from the list. Consider the input string abkcdaed again. For some definition of the second automaton, the syntax analysis proceeds as given in Table 8.2. As one can see, only one item with 0 in the last component remains in the list—that is, one path belongs to the control language. If we require that n ≥ 2 paths be described by the control language, the input string is not accepted.

Notice that the set of tuples can be represented by a pushdown automaton storing the tuples on its pushdown, where the required tuple is always on the top. On the other hand, the needless tuples (tuples with 0 in the third component) can be remembered in states and need not occur on the pushdown. In this way, the core of the investigated parsing method can stand on a 3-accepting move-restricted pushdown automata system.

8.3.2 Bottom-Up Parsing of n-path-TC

Section 8.3.1 deals in principle with a top-down parsing method (LL parser), and its weakness is the assumption that the context-free grammar is unambiguous. Moreover, the method demonstrated in Example 8.6 assumes that the grammar is LL. Essentially, the same idea is applicable also to bottom-up parsing methods, which can handle a larger range of languages. Therefore, we briefly discuss the ideas of parsing methods for n-path-TC in terms of LR parsing. One of the advantages of an LR parser is that we do not need to require that, in a TC grammar (G, R), G is an LL grammar. On the other hand, we have to deal with ambiguity. However, it is well known that the question of whether a context-free grammar is or is not ambiguous is undecidable, since this problem can be reduced to the Post Correspondence Problem, which is undecidable (see [96]).

It is also well known that for some ambiguous context-free grammars, there exists an equivalent context-free grammar which is unambiguous. The ambiguity of a context-free grammar can be restricted basically by removing the chain rules (i.e., rules of the form A → B, A, B ∈ N). We assume that a context-free grammar contains only usable rules—that is, only those rules which can be used during a derivation. Clearly, if G = (N, T, P, S) is a context-free grammar with r = A → A ∈ P, for some A ∈ N, then G is ambiguous, since r can be used during the derivation of x ∈ L(G) arbitrarily many times and thus generate arbitrarily many different derivation trees for x in G. Obviously, since the chain rules generate nothing, they can be removed from a context-free grammar G without affecting L(G). However, removing the chain rules from G in a TC grammar (G, R) affects the paths in the derivation trees of x ∈ L(G). Thus, the identity n-pathL(G, R) = n-pathL(G′, R), where G′ is obtained by removing the chain rules from G, does not hold; however, the equivalence L(G) = L(G′) holds.

Theorem 8.7 For a TC grammar (G, R), where G is a context-free grammar and R ∈ LIN, there is a TC grammar (G′, R′) such that G′ does not contain chain rules and n-pathL(G, R) = n-pathL(G′, R′), n ≥ 1.

Proof of Theorem 8.7. For some n ≥ 1, let n-pathL(G, R) be the language of a TC grammar (G, R), where G = (N, T, P, S). Let G′ be the context-free grammar obtained from G by removing the chain rules; G′ can be constructed by the well-known algorithm in polynomial time (see 5.1.3.3 in [80]). We get G′ = (N, T, P′, S) such that, for all x ∈ L(G′), there is no derivation in G′ of the form B ⇒* A, for some A, B ∈ N. The paths in the derivation trees of G′ are described by strings of the form N*T. Basically, we need to read such strings and remove those symbols A ∈ N which correspond to the application of B → A ∈ P in G. This is done by a gsm mapping M (see [37] for the definition) such that M reads strings s of the form N*T and nondeterministically removes, or leaves unchanged, each symbol A ∈ N with B → A ∈ P and BA a substring of s. Since LIN is closed under gsm mappings (see [34]), also M(R) ∈ LIN. In this way, we get M(R) such that M(R) ≠ R. However, n-pathL(G, R) = n-pathL(G′, M(R)).
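The chain-rule elimination used in this proof can be sketched as follows. This is an illustrative implementation under our own representation of P as a set of (lhs, rhs) pairs; the thesis itself relies on the standard algorithm from [80].

```python
def remove_chain_rules(N, P):
    """N: set of nonterminals; P: set of (lhs, rhs) rules, rhs a string.

    Returns an equivalent rule set without chain rules (rules A -> B, B in N).
    """
    # chain[X] = all B with X =>* B using chain rules only (reflexive closure)
    chain = {A: {A} for A in N}
    changed = True
    while changed:
        changed = False
        for (A, rhs) in P:
            if len(rhs) == 1 and rhs in N:          # a chain rule A -> rhs
                for X in N:
                    if A in chain[X] and rhs not in chain[X]:
                        chain[X].add(rhs)
                        changed = True
    # replace every chain derivation X =>* B by B's non-chain rules under X
    return {(X, rhs)
            for X in N for B in chain[X]
            for (lhs, rhs) in P
            if lhs == B and not (len(rhs) == 1 and rhs in N)}

# Toy grammar: A -> B is a chain rule; it is replaced by A -> k.
P = {("A", "B"), ("A", "aAd"), ("B", "k")}
print(sorted(remove_chain_rules({"A", "B"}, P)))
# [('A', 'aAd'), ('A', 'k'), ('B', 'k')]
```

Note that, exactly as the proof observes, the resulting grammar generates the same terminal language while its derivation trees lose the nodes contributed by the chain rules, which is why the control language must be adjusted by the gsm mapping M.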

Consider a TC grammar (G, R) and let (G′, R′) be constructed as described above. As one can see, G′ need not be unambiguous, since the chain rules are not the only cause of ambiguity. Consider, however, any x ∈ n-pathL(G′, R′). Obviously, there is a derivation tree t of x in G′. Since there are no chain rules in G′ and, for each x ∈ L(G′), |x| is finite, the height of t with respect to |x| is bounded by log |x|/log 2. Thus, there are at most m derivation trees of x in G′, for some m ≥ 1, and G′ is m-ambiguous.

Theorem 8.8 For a TC grammar (G, R), where G is an m-ambiguous LR grammar, m ≥ 1, and R ∈ LIN is an unambiguous language, the membership x ∈ n-pathL(G, R), n ≥ 1, is decidable in O(|x|^k), for some k ≥ 0.

Proof of Theorem 8.8. For some n ≥ 1, let n-pathL(G, R) be the language of a TC grammar (G, R) in which G is m-ambiguous, for some m ≥ 1. Thus, if x ∈ L(G), then we can construct at most m derivation trees of x in O(m·|x|^2) by an LR parser. Then, if word(pi) ∈ R holds for at least n paths, p1, p2, . . . , pn, of at least one derivation tree of x in G, i ∈ {1, 2, . . . , n}, then x ∈ n-pathL(G, R).

Chapter 9

Conclusion

This thesis discusses and studies formal languages and systems of formal models. Its main results have been published or submitted in [19, 84, 11, 12, 13, 14, 20, 15, 17, 16, 18] and presented at the Hungarian Academy of Sciences and at foreign universities in Paris, Debrecen, and Nyíregyháza. The following section summarizes these results.

9.1 Summary and Further Investigation

This thesis was focused on the study of systems of formal models, which play an important role in modern information technology and computer science. Since the introduction of CD grammar systems, many other systems have been studied, and systems of formal models have become a vivid research area. The aim of the present thesis was to further investigate the properties of systems of formal models in order to understand them better. This research can be divided into several main parts.

In the first part, we continued the study of regularly controlled CD grammar systems, where we used phrase-structure grammars as components, and introduced three new restrictions on derivations in these systems. The first restriction requires that derivation rules be applied within the first l nonterminals, for a given l ≥ 1. Although phrase-structure grammars define all languages from RE, regularly controlled CD grammar systems with phrase-structure grammars as components under this restriction generate only context-free languages. One may ask how strong the control language must be to leave the generative power unchanged. Our assumption is that linear languages are sufficient, but a rigorous proof has not yet been given. The second restriction allows only a limited number of undivided blocks of nonterminals in each sentential form during any successful derivation. It has been proven that this restriction has no effect on the generative power of these CD grammar systems, even in the case when the restriction allows only one such block. On the other hand, the restriction limiting the maximum length and number of the blocks decreases the generative power of these systems to the classes Lm(P), representing an infinite hierarchy, with respect to m, lying between the classes of linear and context-sensitive languages. Notice that m is the maximal number of blocks and CF − Lm(P) ≠ ∅.
The question of whether a stronger control language affects the generative power of CD grammar systems with phrase-structure grammars subject to the third restriction is still open.

The second part deals with parallel grammar and automata systems based upon CFGs and PDAs, respectively. More specifically, we introduced two variants of n-accepting restricted pushdown automata systems, accepting n-tuples of interdependent strings, as counterparts of canonical n-generating nonterminal/rule synchronized grammar systems based upon context-free grammars. Both types of automata systems consist of n PDAs, for n ≥ 2, and one restriction set. In the case of n-accepting state-restricted automata systems, the restriction set allows some automata to be suspended and resumed during the computation in relation to the combination of the current states of the PDAs. In the case of n-accepting move-restricted automata systems, the restriction set determines which combinations of transition rules used in a common computation step are permitted. We have proven that these n-accepting restricted automata systems accept exactly the n-languages that the canonical n-generating grammar systems generate. Furthermore, we have established a fundamental hierarchy of n-languages generated/accepted by these canonical multi-generating rule-synchronized grammar systems and n-accepting rule-restricted automata systems with different types of components. First of all, we have shown that both systems are equivalent even if we combine RLNGs with CFGs in the grammar systems and FAs with PDAs in the automata systems. After that, we have established the hierarchy given by Figure 9.1 (→ and ↔ mean ⊂ and =, respectively), where it can be seen, inter alia, that canonical n-generating rule-synchronized grammar systems based upon linear grammars are significantly weaker than n-accepting move-restricted automata systems with two 1-turn PDAs and n − 2 FAs as components.

[Figure 9.1: Hierarchy of n-languages for n ≥ 2 — a diagram relating, from the most powerful class downwards: L(HMAS_n^PDA); L(HMAS^(FA,...,FA,1-turn PDA,1-turn PDA)); L(HMAS^(FA,...,FA,PDA,PDA)) = L(HCGR^(RLNG,...,RLNG,CFG,CFG)); L(HMAS^(FA,...,FA,PDA)) = L(HCGR^(RLNG,...,RLNG,CFG)); L(HCGR_n^LNG); and L(HMAS_n^FA) = L(HCGR_n^RLNG).]

This part of the research can be continued by a better approximation of the power of the state/move-restricted automata systems based upon FAs (especially in relation to string interdependences), or by an investigation of restarting and/or stateless finite and pushdown automata as components of the discussed automata systems.

In the last part, we have suggested rule-restricted systems for the processing of linguistically motivated languages.
In this part, we introduced three variants of rule-restricted translating systems based upon a finite automaton and a context-free grammar. First, we have proven that the leftmost restriction placed on derivations in the context-free grammar affects both the

generative and accepting power of such systems. In addition, we introduced a rule-restricted transducer system with appearance checking, where the restriction set Ψ is a set of 3-tuples containing one rule of the FA and two rules of the CFG. For a common computation step, the system has to use the first and second rules of a 3-tuple, if possible; otherwise, it can use the first and third rules of the 3-tuple. This system is able to recognize and generate any language from RE. Thereafter, some examples of natural language translation are given.

The investigation of the processing of linguistically motivated languages continued with a generalization of the TC grammars that generate their language under path-based control, introduced in [71]. We have considered TC grammars that generate their languages under n-path control by a linear language, which were introduced in [67]. We have demonstrated that for L ∈ n-path-TC, under the assumption that L is generated by a TC grammar (G, R) in which G and R are unambiguous and, furthermore, G is restricted to be an LL grammar, there is a parsing method working in polynomial time. This method can check whether or not the paths of the derivation tree t of x ∈ L(G) belong to the control language R during the building of t. Moreover, when we consider an LR parser for L ∈ n-path-TC, under the assumption that L is generated by a TC grammar (G, R) in which G has bounded ambiguity (i.e., G is unambiguous or m-ambiguous) and R ∈ LIN is an unambiguous language, there is also a parsing method working in polynomial time. However, the open question is whether there is a polynomial-time parsing method

• if G is not LL,

• if G is ambiguous.

It is also of interest to quantify the worst case of the parsing complexity more precisely. An open area of investigation is the transformation of n-path TC grammars into normal forms based on the Chomsky normal form of the underlying context-free grammar, which would make it possible to use parsing methods based on the transformation to Chomsky normal form.

Bibliography

[1] S. Abraham. Some questions of language theory. In Proceedings of the 1965 Conference on Computational Linguistics, COLING '65, pages 1–11, Stroudsburg, PA, USA, 1965. Association for Computational Linguistics.

[2] A. V. Aho. Compilers: Principles, Techniques, & Tools. Pearson/Addison Wesley, 2007.

[3] M. Amin and R. Abd el Mouaty. Stratified grammar systems with simple and dynamically organized strata. Computers and Artificial Intelligence, 23(4):355–362, 2004.

[4] B. S. Baker. Context-sensitive grammars generating context-free languages. In M. Nivat, editor, Automata, Languages and Programming, pages 501–506. North-Holland, Amsterdam, 1972.

[5] M. Béal, O. Carton, C. Prieur, and J. Sakarovitch. Squaring transducers: an efficient procedure for deciding functionality and sequentiality. Theoretical Computer Science, 292(1):45–63, 2003.

[6] M. H. Beek and H. C. M. Kleijn. Petri net control for grammar systems. In Formal and Natural Computing - Essays Dedicated to Grzegorz Rozenberg [on occasion of his 60th birthday, March 14, 2002], pages 220–243, London, UK, 2002. Springer-Verlag.

[7] O. Bojar and M. Čmejrek. Mathematical model of tree transformations. In Project Euromatrix Deliverable 3.2. Charles University, Prague, 2007.

[8] R. V. Book. Terminal context in context-sensitive grammars. SIAM Journal of Computing, 1:20–30, 1972.

[9] P. F. Brown, J. Cocke, S. A. D. Pietra, V. J. D. Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A statistical approach to machine translation. Comput. Linguist., 16(2):79–85, June 1990.

[10] A. O. Buda. Multiprocessor automata. Inf. Process. Lett., 25(4):257–261, June 1987.

[11] M. Čermák. Systems of formal models and their application. In Proceedings of the 14th Conference Student EEICT 2008, Volume 2, pages 164–166. Faculty of Electrical Engineering and Communication BUT, 2008.

[12] M. Čermák. Power decreasing derivation restriction in grammar systems.
In Proceedings of the 15th Conference and Competition STUDENT EEICT 2009 Volume 4, pages 385–389. Faculty of Information Technology BUT, 2009.

[13] M. Čermák. Multilanguages and multiaccepting automata system. In Proceedings of the 16th Conference and Competition STUDENT EEICT 2010, Volume 5, pages 146–150. Faculty of Information Technology BUT, 2010.

[14] M. Čermák. Basic properties of n-languages. In Proceedings of the 17th Conference and Competition STUDENT EEICT 2011, Volume 3, pages 460–464. Faculty of Information Technology BUT, 2011.

[15] M. Čermák. Restrictions on derivations in n-generating grammar systems. In Proceedings of the 18th Conference and Competition STUDENT EEICT 2012, Volume 5, pages 371–375. Faculty of Information Technology BUT, 2012.

[16] M. Čermák, P. Horáček, and A. Meduna. Rule-restricted automaton-grammar transducers: Power and linguistic applications. Mathematics for Applications, 2012, in press.

[17] M. Čermák, J. Koutný, and A. Meduna. Parsing based on n-path tree-controlled grammars. Theoretical and Applied Informatics, 2011(23):213–228, 2012.

[18] M. Čermák, J. Koutný, and A. Meduna. On n-language classes hierarchy. Acta Cybernetica, 2012, submitted.

[19] M. Čermák and A. Meduna. n-Accepting restricted pushdown automata systems. In 13th International Conference on Automata and Formal Languages, pages 168–183. Computer and Automation Research Institute, Hungarian Academy of Sciences, 2011.

[20] M. Čermák and A. Meduna. n-Accepting restricted pushdown automata systems. In 7th Doctoral Workshop on Mathematical and Engineering Methods in Computer Science, pages 110–110. Brno University of Technology, 2011.

[21] D. Chiang. An introduction to synchronous grammars. In 44th Annual Meeting of the Association for Computational Linguistics, 2006.

[22] N. Chomsky. Three models for the description of language. Information Theory, IRE Transactions on, 2:113–124, 1956.

[23] E. Csuhaj-Varjú and J. Dassow. On cooperating/distributed grammar systems. Elektronische Informationsverarbeitung und Kybernetik, 26(1/2):49–63, 1990.

[24] E. Csuhaj-Varjú, J. Dassow, and G. Păun. Dynamically controlled cooperating/distributed grammar systems. Inf. Sci., 69(1-2):1–25, April 1993.

[25] E. Csuhaj-Varjú, J. Kelemen, and G. Păun. Grammar systems with wave-like communication. Computers and Artificial Intelligence, 15(5), 1996.

[26] E. Csuhaj-Varjú, J. Kelemen, G. Păun, and J. Dassow, editors. Grammar Systems: A Grammatical Approach to Distribution and Cooperation. Gordon and Breach Science Publishers, Inc., Newark, NJ, USA, 1st edition, 1994.

[27] E. Csuhaj-Varjú, C. Martín-Vide, V. Mitrana, and G. Vaszil. Parallel communicating pushdown automata systems. Int. J. Found. Comput. Sci., 11(4):633–650, 2000.

[28] E. Csuhaj-Varjú, T. Masopust, and G. Vaszil. Cooperating distributed grammar systems with permitting grammars as components. Romanian Journal of Information Science and Technology (ROMJIST), 12(2):175–189, 2009.

[29] E. Csuhaj-Varjú, V. Mitrana, and G. Vaszil. Distributed pushdown automata systems: computational power. In Proceedings of the 7th International Conference on Developments in Language Theory, DLT'03, pages 218–229, Berlin, Heidelberg, 2003. Springer-Verlag.

[30] E. Csuhaj-Varjú and G. Vaszil. On context-free parallel communicating grammar systems: synchronization, communication, and normal forms. Theor. Comput. Sci., 255(1-2):511–538, 2001.

[31] K. Čulík and H. A. Maurer. Tree controlled grammars. Computing, 19:129–139, 1977.

[32] E. Czeizler and E. Czeizler. On the power of parallel communicating Watson-Crick automata systems. Theoretical Computer Science, 358(1):142–147, 2006.

[33] J. Dassow. Cooperating/distributed grammar systems with hypothesis languages. J. Exp. Theor. Artif. Intell., 3(1):11–16, 1991.

[34] J. Dassow, Gh. Păun, and A. Salomaa. Grammars with controlled derivations. In G. Rozenberg and A. Salomaa, editors, Handbook of Formal Languages, Volume II, pages 101–154. Berlin: Springer, 1997.

[35] J. Dassow, H. Fernau, and Gh. Păun. On the leftmost derivation in matrix grammars. International Journal of Foundations of Computer Science, 10(1):61–80, 1999.

[36] J. Dassow and V. Mitrana. Stack cooperation in multistack pushdown automata. Journal of Computer and System Sciences, 58(3):611–621, 1999.

[37] J. Dassow and Gh. Păun. Regulated Rewriting in Formal Language Theory. Springer, Berlin, 1989.

[38] J. Dassow and G. Păun. Cooperating/distributed grammar systems with registers: the regular case. Found. Control Eng., 15:19–38, 1990.

[39] J. Dassow and B. Truthe. Subregularly tree controlled grammars and languages. In Automata and Formal Languages - 12th International Conference AFL 2008, Balatonfüred, pages 158–169. Computer and Automation Research Institute of the Hungarian Academy of Sciences, 2008.

[40] S. Dumitrescu. Characterization of RE using CD grammar systems with two registers and RL rules. In New Trends in Formal Languages, volume 1218 of Lecture Notes in Computer Science, pages 167–177. Springer Berlin / Heidelberg, 1997.

[41] H. Fernau. Regulated grammars under leftmost derivation. Grammars, 3(1):37–62, 2000.

[42] H. Fernau, M. Holzer, and R. Freund. External versus internal hybridization for cooperating distributed grammar systems, 1996.

[43] H. Fernau and R. Stiebe. On the expressive power of valences in cooperating distributed grammar systems. In Computation, Cooperation, and Life, volume 6610 of Lecture Notes in Computer Science, pages 90–106. Springer Berlin / Heidelberg, 2011.

[44] M. Frazier and C. D. Page. Prefix grammars: An alternative characterization of the regular languages. Information Processing Letters, 51(2):67–71, 1994.

[45] M. Frazier and C.D. Page. Prefix grammars: an alternative characterization of the regular languages. Inf. Process. Lett., 51(2):67–71, 1994.

[46] R. Freund and G. Păun. A variant of team cooperation in grammar systems. J.UCS, 1(2):105–130, 1995.

[47] V. Geffert. Context-free-like forms for the phrase-structure grammars. In Proceedings of the Mathematical Foundations of Computer Science 1988, pages 309–317, New York, 1988. Springer.

[48] V. Geffert. Normal forms for phrase-structure grammars. Informatique théorique et applications, 25(5):473–496, 1991.

[49] S. Ginsburg. Algebraic and Automata-Theoretic Properties of Formal Languages. Elsevier Science Inc., New York, NY, USA, 1975.

[50] S. Ginsburg and S. Greibach. Mappings which preserve context-sensitive languages. Information and Control, 9:563–582, 1966.

[51] F. Goldefus, T. Masopust, and A. Meduna. Left-forbidding cooperating distributed grammar systems. Theor. Comput. Sci., 411(40-42):3661–3667, September 2010.

[52] S. A. Greibach. Remarks on blind and partially blind one-way multicounter machines. Theor. Comput. Sci., 7:311–324, 1978.

[53] E. M. Gurari and O. H. Ibarra. A note on finite-valued and finitely ambiguous transducers. Theory of Computing Systems, 16:61–66, 1983. 10.1007/BF01744569.

[54] J. Hajič. Disambiguation of Rich Inflection: Computational Morphology of Czech. Wisconsin Center for Pushkin Studies. Karolinum, 2004.

[55] E. Hajičová, P. Sgall, and D. Zeman. Issues of projectivity in the Prague Dependency Treebank. In Prague Bulletin of Mathematical Linguistics, 2004.

[56] M. A. Harrison. Introduction to Formal Language Theory. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1978.

[57] T. N. Hibbard. Context-limited grammars. Journal of the ACM, 21:446–453, 1974.

[58] J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison Wesley, 2000.

[59] J. Hromkovic. Theoretical Computer Science : Introduction to Automata, Computability, Complexity, Algorithmics, Randomization, Communication, and Cryptography. Springer, 2003.

[60] L. Ilie. Collapsing hierarchies in parallel communicating grammar systems with communication by command, 1996.

[61] O. Jirák and Z. Křivka. Design and implementation of back-end for PicoBlaze C compiler. In Proceedings of the IADIS International Conference Applied Computing 2009, pages 135–138. International Association for Development of the Information Society, 2009.

[62] M. Johnson. PCFG models of linguistic tree representations. Comput. Linguist., 24(4):613–632, December 1998.

[63] L. Kari, A. Mateescu, G. Păun, and A. Salomaa. Teams in cooperating grammar systems. Journal of Experimental & Theoretical Artificial Intelligence, 7(4):347–359, 1995.

[64] M. Khalilov and J. A. R. Fonollosa. N-gram-based statistical machine translation versus syntax augmented machine translation: comparison and system combination. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’09, pages 424–432, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[65] D. Kolář and A. Meduna. Regulated pushdown automata. Acta Cybernetica, 2000(4):653–664, 2000.

[66] J. Koutný. Syntax analysis of tree-controlled languages. In Proceedings of the 17th Conference and Competition STUDENT EEICT 2011, Volume 3, page 5. Faculty of Information Technology BUT, 2011.

[67] J. Koutný, Z. Křivka, and A. Meduna. Pumping properties of path-restricted tree-controlled languages. In 7th Doctoral Workshop on Mathematical and Engineering Methods in Computer Science, pages 61–69. Brno University of Technology, 2011.

[68] J. Koutný and A. Meduna. Tree-controlled grammars with restrictions placed upon cuts and paths. Kybernetika, 48(1):165–175, 2012.

[69] P. Linz. An Introduction to Formal Language and Automata. Jones and Bartlett Publishers, Inc., USA, 2006.

[70] R. Lukáš and A. Meduna. Multigenerative grammar systems and matrix grammars. Kybernetika, 46(1):68–82, 2010.

[71] S. Marcus, C. Martín-Vide, V. Mitrana, and Gh. Păun. A new-old class of linguistically motivated regulated grammars. In Walter Daelemans, Khalil Sima'an, Jorn Veenstra, Jakub Zavrel (Eds.): Computational Linguistics in the Netherlands 2000, Selected Papers from the Eleventh CLIN Meeting, Tilburg, pages 111–125. Language and Computers - Studies in Practical Linguistics 37, Rodopi, 2000.

[72] C. Martín-Vide, A. Mateescu, and V. Mitrana. Parallel finite automata systems communicating by states. Int. J. Found. Comput. Sci., 13(5):733–749, 2002.

[73] C. Martín-Vide and V. Mitrana. Further properties of path-controlled grammars. In Proceedings of FG-MoL 2005: The 10th Conference on Formal Grammar and The 9th Meeting on Mathematics of Language, pages 221–232. University of Edinburgh, Edinburgh, 2005.

[74] T. Masopust. On the terminating derivation mode in cooperating distributed grammar systems with forbidding components. International Journal of Foundations of Computer Science, 20(2):331–340, 2009.

[75] G. Matthews. A note on symmetry in phrase structure grammars. Information and Control, 7:360–365, 1964.

[76] G. Matthews. Two-way languages. Information and Control, 10:111–119, 1967.

[77] A. Meduna. Global context conditional grammars. Journal of Automata, Languages and Combinatorics, 1991(27):159–165, 1991.

[78] A. Meduna. Matrix grammars under leftmost and rightmost restrictions. In Gh. Păun, editor, Mathematical Linguistics and Related Topics, pages 243–257. Romanian Academy of Sciences, Bucharest, 1994.

[79] A. Meduna. On the number of nonterminals in matrix grammars with leftmost derivations. In New Trends in Formal Languages: Control, Cooperation, and Combinatorics, pages 27–39. Springer-Verlag, 1997.

[80] A. Meduna. Automata and Languages: Theory and Applications. Springer, 2000.

[81] A. Meduna and D. Kolář. One-turn regulated pushdown automata and their reduction. Fundamenta Informaticae, 2001(21):1001–1007, 2001.

[82] A. Meduna and R. Lukáš. Multigenerative grammar systems. Schedae Informaticae, 2006(15):175–188, 2006.

[83] A. Meduna and T. Masopust. Self-regulating finite automata. Acta Cybernetica, 18(1):135–153, 2007.

[84] A. Meduna, M. Čermák, and T. Masopust. Some power-decreasing derivation restrictions in grammar systems. Schedae Informaticae, 2010(19):23–34, 2011.

[85] A. Meduna and M. Švec. Grammars with Context Conditions and Their Applications. John Wiley & Sons, 2005.

[86] R. Meersman and G. Rozenberg. Cooperating Grammar Systems. 1978.

[87] H. Messerschmidt and F. Otto. Strictly deterministic CD-systems of restarting automata. In Erzsébet Csuhaj-Varjú and Zoltán Ésik, editors, Fundamentals of Computation Theory, volume 4639 of Lecture Notes in Computer Science, pages 424–434. Springer Berlin / Heidelberg, 2007.

[88] V. Mihalache and V. Mitrana. Deterministic cooperating distributed grammar systems. Technical report, Turku Centre for Computer Science, 1996.

[89] T. Mine, R. Taniguchi, and M. Amamiya. Coordinated morphological and syntactic analysis of Japanese language. In Proceedings of the 12th International Joint Conference on Artificial Intelligence – Volume 2, pages 1012–1017. Morgan Kaufmann Publishers Inc., 1991.

[90] V. Mitrana. Hybrid cooperating/distributed grammar systems. Computers and Artificial Intelligence, 12(1):83–88, 1993.

[91] M. Mohri. Finite-state transducers in language and speech processing. Comput. Linguist., 23(2):269–311, June 1997.

[92] S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.

[93] B. Nagy. On 5′ → 3′ sensing Watson–Crick finite automata. In Max Garzon and Hao Yan, editors, DNA Computing, volume 4848 of Lecture Notes in Computer Science, pages 256–262. Springer Berlin / Heidelberg, 2008.

[94] F. Otto. CD-systems of restarting automata governed by explicit enable and disable conditions. In SOFSEM 2010: Theory and Practice of Computer Science, volume 5901 of Lecture Notes in Computer Science, pages 627–638. Springer Berlin / Heidelberg, 2010.

[95] Gh. P˘aun.On the generative capacity of tree controlled grammars. Computing, 21(3):213–220, 1979.

[96] E. L. Post. A variant of a recursively unsolvable problem. Bulletin of the American Mathematical Society, 52:264–268, 1946.

[97] Gh. Păun. On the synchronization in parallel communicating grammar systems. Acta Inf., 30(4):351–367, 1993.

[98] Gh. Păun. On the generative capacity of hybrid CD grammar systems. Elektronische Informationsverarbeitung und Kybernetik, pages 231–244, 1994.

[99] Gh. Păun and G. Rozenberg. Prescribed teams of grammars. Acta Informatica, 31:525–537, 1994.

[100] Gh. Păun and L. Sântean. Parallel communicating grammar systems: the regular case. Ann. Univ. Buc. Ser. Mat.-Inform., 2:55–63, 1989.

[101] E. Rich. Automata, Computability and Complexity: Theory and Applications. Pearson Prentice Hall, 2008.

[102] A. L. Rosenberg. On multi-head finite automata. IBM J. Res. Dev., 10(5):388–394, September 1966.

[103] D. J. Rosenkrantz and R. E. Stearns. Properties of deterministic top-down grammars. In Proceedings of the ACM Symposium on Theory of Computing (STOC '69), 1969.

[104] G. Rozenberg and A. Salomaa, editors. Handbook of Formal Languages, volume 2. Springer-Verlag, Berlin, 1997.

[105] G. Rozenberg and A. Salomaa, editors. Handbook of Formal Languages, volume 1. Springer-Verlag, Berlin, 1997.

[106] M. H. ter Beek, E. Csuhaj-Varjú, and V. Mitrana. Teams of pushdown automata. Int. J. Comput. Math., 81(2):141–156, 2004.

[107] M. H. ter Beek and J. Kleijn. Team automata satisfying compositionality. In Keijiro Araki, Stefania Gnesi, and Dino Mandrioli, editors, FME 2003: Formal Methods, volume 2805 of Lecture Notes in Computer Science, pages 381–400. Springer Berlin / Heidelberg, 2003.

[108] S. Vicolov. Cooperating/distributed grammar systems with registers: the regular case. Computers and Artificial Intelligence, 12:89–98, 1993.

[109] P. Šaloun. Parallel LR parsing. In Proceedings of the Fifth International Scientific Conference Electronic Computers and Informatics 2002. The University of Technology Košice, 2002.

[110] D. Wätjen. On cooperating-distributed extended limited 0L systems. International Journal of Computer Mathematics, 63:227–244, 1997.

[111] A. Weber. On the valuedness of finite transducers. Acta Informatica, 27:749–780, 1990. doi:10.1007/BF00264285.

[112] A. Zollmann and A. Venugopal. Syntax augmented machine translation via chart parsing. In Proceedings of the Workshop on Statistical Machine Translation, StatMT ’06, pages 138–141, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.
