Recovery, Convergence and Documentation of Languages
by Vadim Zaytsev
September 14, 2010

VRIJE UNIVERSITEIT

ACADEMIC DISSERTATION

for the degree of Doctor at the Vrije Universiteit Amsterdam, by authority of the rector magnificus, Prof. Dr. L.M. Bouter, to be defended in public before the doctoral committee of the Faculty of Exact Sciences on Wednesday 27 October 2010 at 15:45 in the auditorium of the university, De Boelelaan 1105, by Vadim Valerievich Zaytsev, born in Rostov-on-Don, Russia.

Supervisors: Prof. Dr. R. Lämmel, Prof. Dr. C. Verhoef

This research has been sponsored by the Dutch Organisation of Scientific Research (Nederlandse Organisatie voor Wetenschappelijk Onderzoek) via: NWO 612.063.304 LPPR: Language-Parametric Program Restructuring

Acknowledgements

Working on a PhD is supposed to be an endeavour completed in seclusion, but in practice one cannot survive without the help and support of others, fruitful scientific discussions, collaborative development of tools and papers, and valuable pieces of advice.

My work was supervised by Prof. Dr. Ralf Lämmel and Prof. Dr. Chris Verhoef, who often believed in me more than I did and were always open to questions and ready to give expert advice. They have made my development possible. My LPPR colleagues (Jan Heering, Prof. Dr. Paul Klint and Prof. Dr. Mark van den Brand) have been a rare yet useful source of new research ideas.

All members of the thesis reading committee dedicated a lot of attention to my work and delivered exceptionally useful feedback at a late stage of the research: Prof. Dr. Jean Bézivin, Dr. Jean-Marie Favre, Prof. Dr. Willem Jan Fokkink, Prof. Dr. Paul Klint and Dr. Steven Klusener. I am also grateful to Cor-Paul Bezemer and Toon Verwaest, who proofread and corrected the Dutch part of this thesis.
There have been a lot of insightful discussions in the rooms and hallways of the Vrije Universiteit with Dr. Niels Veerman, Ernst-Jan Verhoeven, Łukasz Kwiatkowski and Johan Vincent de Vries.

I would like to thank my family, who backed me with complete support and encouragement through the years of research, especially my mother, Dr.ir. Liudmila Zaytseva; my grandmother, Dr. Svetlana Bocheva; my grandfather, Prof. Dr.ir. Alexander Bochev; my uncle, Dr. Michael Bochev; and my godfather, Prof. Dr. Yuri Bashmakov, MD.

My close friends' understanding, respect and interest in my work were also among the most important things that kept me going: Dr. Alexander Gufan, Dr. Stanislav Tsykavy and Stanislav Rezhabek.

I have also been saved many times from depression and writer's block by good music. I cannot name all the artists responsible for that, but the most credit goes to Huddie Ledbetter, William Broonzy, Fulton Allen, Thomas McClennan and Bruce Springsteen.

Contents

Acknowledgements
Contents
List of Tables
List of Figures
List of Listings

1 Introduction
  1.1 Research context
  1.2 Motivation and objectives
  1.3 Example scenario
  1.4 Thesis outline and contributions
    1.4.1 Chapter 2 overview: additional background
    1.4.2 Chapter 3 overview: case study on language recovery
    1.4.3 Chapter 4 overview: language convergence
    1.4.4 Chapter 5 overview: case study on recovery and convergence
    1.4.5 Chapter 6 overview: language documentation
    1.4.6 Chapter 7 overview: XBGF language manual

2 Additional background
  2.1 Terminology
  2.2 Grammarware
  2.3 Techniques for grammars
  2.4 Language evolution: versions and dialects
  2.5 Grammar levels
  2.6 Grammar recovery methodology
  2.7 Grammar definition formalism
  2.8 Grammar idiosyncrasies and parsing technology
  2.9 Grammarware and tool generation
  2.10 Language documentation qualities
  2.11 Standardisation bodies
  2.12 Languages used in the thesis
  2.13 Transformations used in the thesis

3 Case study on recovery
  3.1 Contributions
  3.2 Semi-automated recovery of C# grammar
    3.2.1 Step 1: Obtaining the standard
    3.2.2 Step 2: Extracting the grammar
    3.2.3 Step 3: Fixing misprints
    3.2.4 Step 4: Completing a formal part
    3.2.5 Step 5: Relaxation
    3.2.6 Step 6: Removing idiosyncrasies from the grammar
    3.2.7 Step 7: Resolving conflicts
    3.2.8 Step 8: Improving the grammar
    3.2.9 Step 9: Generating the parser
  3.3 Proposed solution generalisation and evaluation
  3.4 Conclusion
    3.4.1 Discussion on the method automation
    3.4.2 Research objectives revisited

4 Language convergence
  4.1 Motivation
  4.2 Contributions
  4.3 The domain
    4.3.1 Sources of convergence
    4.3.2 Targets of convergence
    4.3.3 BGF - BNF-like Grammar Format
  4.4 Grammar extraction
    4.4.1 Abstraction by extraction
    4.4.2 Grammar extractors
  4.5 Grammar comparison
  4.6 Grammar transformation
  4.7 Convergence process
  4.8 Programmable grammar transformations
    4.8.1 Transformation properties
    4.8.2 Grammar refactoring
    4.8.3 Grammar editing
  4.9 Transformation generators
  4.10 Language Convergence Infrastructure
    4.10.1 Main configuration elements
    4.10.2 Shortcuts
    4.10.3 Generators
    4.10.4 Sources
    4.10.5 Targets
    4.10.6 Phases
    4.10.7 Test sets
    4.10.8 Tools
    4.10.9 xstring
  4.11 Related work
    4.11.1 Interoperability
    4.11.2 Testing grammarware
    4.11.3 Generators and synchronisers
    4.11.4 Grammar recovery
    4.11.5 Grammar transformation
    4.11.6 Grammar convergence
  4.12 Concluding remarks

5 Case study on recovery and convergence
  5.1 Java is not syntax-safe - apparently
  5.2 Contributions
  5.3 The JLS corpus
    5.3.1 JLS1
    5.3.2 JLS2
    5.3.3 JLS3
    5.3.4 Grammar data
  5.4 Automated grammar extraction
    5.4.1 Assumed grammar format
    5.4.2 Phase 1 - Preprocessing
    5.4.3 Phase 2 - Error recovery
    5.4.4 Phase 3 - Removal of doubles
    5.4.5 Phase 4 - Precise parsing
    5.4.6 Extraction data
  5.5 The convergence graph
  5.6 Grammar transformation
    5.6.1 Semantics-preserving operators
    5.6.2 Semantics-in/decreasing operators
    5.6.3 Semantics-revising operators
    5.6.4 Grammar refactoring
  5.7 Grammar convergence phases
    5.7.1 Preparation phase: semantic error recovery
    5.7.2 Preparation phase: fixing known bugs
    5.7.3 Preparation phase: initial correction
    5.7.4 Nominal matching phase
    5.7.5 Structural matching phase
    5.7.6 Resolution phase: extension
    5.7.7 Resolution phase: relaxation
    5.7.8 Resolution phase: correction
  5.8 Measuring grammars and transformations
  5.9 Related work
    5.9.1 Grammar recovery
    5.9.2 Programmable grammar transformations
    5.9.3 Grammar engineering
    5.9.4 Schema/metamodel comparison
    5.9.5 Coupled transformations
  5.10 Concluding remarks

6 Language documentation
  6.1 Motivation
  6.2 Contributions
  6.3 Grammar