
DSL Engineering Designing, Implementing and Using Domain-Specific Languages

Markus Voelter

with Sebastian Benz, Christian Dietrich, Birgit Engelmann, Mats Helander, Lennart Kats, Eelco Visser and Guido Wachsmuth

(c) 2010 - 2013 Markus Voelter

Feedback, Slides, and other updates at http://dslbook.org

This PDF book is donationware: while you can get the PDF for free from dslbook.org, I would highly appreciate if you used the Donate button on that website to send over some amount of money you deem appropriate for what you get from the book. Thank you!

A printed version of this book is available as well. Please go to http://dslbook.org to find links to various Amazon stores where you can buy the print version.

Contents

I Introduction

1 About this Book
    1.1 Thank You!
    1.2 This book is Donationware
    1.3 Why this Book
    1.4 What you will Learn
    1.5 Who should Read this Book
    1.6 About the Cover
    1.7 Feedback, Bugs and Updates
    1.8 The Structure of the Book
    1.9 How to Read the Book
    1.10 Example Tools
    1.11 Case Studies and Examples

2 Introduction to DSLs
    2.1 Very Brief Introduction to the Terminology
    2.2 From General Purpose Languages to DSLs
    2.3 Modeling and Model-Driven Development
    2.4 Modular Languages
    2.5 Benefits of using DSLs
    2.6 Challenges
    2.7 Applications of DSLs
    2.8 Differentiation from other Works and Approaches

II DSL Design

3 Conceptual Foundations
    3.1 Programs, Languages and Domains
    3.2 Model Purpose
    3.3 The Structure of Programs and Languages
    3.4 Parsing versus Projection

4 Design Dimensions
    4.1 Expressivity
    4.2 Coverage
    4.3 Semantics and Execution
    4.4 Separation of Concerns
    4.5 Completeness
    4.6 Language Modularity
    4.7 Concrete Syntax

5 Fundamental Paradigms
    5.1 Structure
    5.2 Behavior
    5.3 Combinations

6 Process Issues
    6.1 DSL Development
    6.2 Using DSLs

III DSL Implementation

7 Concrete and Abstract Syntax
    7.1 Fundamentals of Free Text Editing and Parsing
    7.2 Fundamentals of Projectional Editing
    7.3 Comparing Parsing and Projection
    7.4 Characteristics of AST Formalisms
    7.5 Xtext Example
    7.6 Spoofax Example
    7.7 MPS Example

8 Scoping and Linking
    8.1 Scoping in Spoofax
    8.2 Scoping in Xtext
    8.3 Scoping in MPS

9 Constraints
    9.1 Constraints in Xtext
    9.2 Constraints in MPS
    9.3 Constraints in Spoofax

10 Type Systems
    10.1 Type Systems Basics
    10.2 Type Calculation Strategies
    10.3 Xtext Example
    10.4 MPS Example
    10.5 Spoofax Example

11 Transformation and Generation
    11.1 Overview of the approaches
    11.2 Xtext Example
    11.3 MPS Example
    11.4 Spoofax Example

12 Building Interpreters
    12.1 Building an Interpreter with Xtext
    12.2 An Interpreter in MPS
    12.3 An Interpreter in Spoofax

13 IDE Services
    13.1 Code Completion
    13.2 Syntax Coloring
    13.3 Go-to-Definition and Find References
    13.4 Pretty-Printing
    13.5 Quick Fixes
    13.6 Refactoring
    13.7 Labels and Icons
    13.8 Outline
    13.9 Code Folding
    13.10 Tooltips/Hover
    13.11 Visualizations
    13.12 Diff and Merge

14 Testing DSLs
    14.1 Syntax Testing
    14.2 Constraints Testing
    14.3 Semantics Testing
    14.4 Formal Verification
    14.5 Testing Editor Services
    14.6 Testing for Language Appropriateness

15 Debugging DSLs
    15.1 Debugging the DSL Definition
    15.2 Debugging DSL Programs

16 Modularization, Reuse and Composition
    16.1 Introduction
    16.2 MPS Example
    16.3 Xtext Example
    16.4 Spoofax Example

IV DSLs in Software Engineering

17 DSLs and Requirements
    17.1 What are Requirements?
    17.2 Requirements versus Design versus Implementation
    17.3 Using DSLs for Requirements Engineering
    17.4 Integration with Plain Text Requirements

18 DSLs and Software Architecture
    18.1 What is Software Architecture?
    18.2 Architecture DSLs
    18.3 Component Models

19 DSLs as Programmer Utility
    19.1 The Context
    19.2 Jnario Described
    19.3 Implementation
    19.4 Summary

20 DSLs in the Implementation
    20.1 Introduction
    20.2 Challenges in Embedded Software
    20.3 The mbeddr Approach
    20.4 Design and Implementation
    20.5 Experiences
    20.6 Discussion

21 DSLs and Product Lines
    21.1 Introduction
    21.2 Feature Models
    21.3 Connecting Feature Models to Artifacts
    21.4 From Feature Models to DSLs
    21.5 Conceptual Mapping from PLE to DSLs

22 DSLs for Business Users
    22.1 Intentional Software
    22.2 The Project Challenge
    22.3 The DSL-Based Solution
    22.4 Wrapping Up

Part I

Introduction

1 About this Book

This book is about creating domain-specific languages. It covers three main aspects: DSL design, DSL implementation and software engineering with DSLs. The book only looks at external DSLs and focuses mainly on textual syntax. The book emphasizes the use of modern language workbenches. It is not a tutorial for any specific tool, but it provides examples and some level of detail for some of them: Xtext, MPS and Spoofax. The goal of the book is to provide a thorough overview of modern DSL engineering. The book is based on my own experience and opinions, but strives to be objective.

1.1 Thank You!

Before I do anything else, I want to thank my reviewers. This book has profited tremendously from the feedback you sent me. It is a lot of work to read a book like this with sufficient concentration to give meaningful feedback. All of you did that, so thank you very much! Here is the list, in alphabetical order: Alexander Shatalin, Bernd Kolb, Bran Selic, Christa Schwanninger, Dan Ratiu, Domenik Pavletic, Iris Groher, Jean Bezivin, Jos Warmer, Laurence Tratt, Mats Helander, Nora Ludewig, Sebastian Zarnekow and Vaclav Pech.

I also want to thank the contributors to the book. They have added valuable perspectives and insights that I couldn’t have delivered myself. In particular:

• Eelco Visser contributed significantly to the DSL design section. In fact, this was the part we started out with when we still had the plan to write the book together. It was initially his idea to have a section on DSL design that is independent of implementation concerns and particular tools.

• Guido Wachsmuth and Lennart Kats contributed all the examples for the Spoofax language workbench and helped a lot with the fundamental discussions on grammars and parsing.

• Mats Helander contributed the Business DSL case study with the Intentional Domain Workbench in Part IV of the book.

• Birgit Engelmann and Sebastian Benz wrote the chapter on utility DSLs that features the JNario language based on Xtext and Xbase.

• Christian Dietrich helped me with the language modularization examples for Xtext and Xbase.

Also, Moritz Eysholdt has contributed a section on his Xpect testing framework.

Finally, some parts of this book are based on papers I wrote with other people. I want to thank these people for letting me use the papers in the book: Bernd Kolb, Domenik Pavletic, Daniel Ratiu and Bernhard Schaetz.

A special thank you goes to my girlfriend Nora Ludewig. She didn’t just volunteer to provide feedback on the book, she also had to endure all kinds of other discussions around the topic all the time. Thanks Nora!

Alex Chatziparaskewas, Henk Kolk, Magnus Christerson and Klaus Doerfler allowed me to use "their" applications as examples in this book. Thank you very much!

I also want to thank itemis, for whom I have worked as an independent consultant for the last couple of years. The experience I gained there, in particular while working with MPS in the LWES research project, benefitted the book greatly!

Finally, I want to thank my copyeditor Steve Rickaby. I had worked with Steve on my pattern books and I really wanted to work with him again on this one – even though no publisher is involved this time. Luckily he was willing to work with me directly. Thank you, Steve!

1.2 This book is Donationware

This book is available as a print version and as a PDF version. You are currently reading the PDF version. The print version can be acquired from Amazon. The specific links are available at http://dslbook.org.

This book is not free even though you can download it from http://dslbook.org without paying for it. I ask you to please go to the website and donate some amount of money you deem appropriate for the value you take from the book. You should also register the book at the website, so I can keep you up-to-date with new versions.

Margin note: Donations are much simpler than paying for the book at the time when you get it: you can check out the contents before you pay, you can pay whatever you think makes sense for you, and we all don’t have to deal with DRM and chasing down "illegal copies". But to make this work (and keep the book project alive), it relies on you sticking to the rules. Thank you!

There is no Kindle version of the book because the layout/figures/code do not translate very well into the Kindle format. However, you can of course read PDFs on a Kindle. I tried using my Nexus 7 tablet to read the book: if you use landscape format, it works reasonably well.

Here is some background on why I didn’t go with a real publisher. Unless you are famous or write a book on a mainstream topic, you will make maybe one or two euros for each copy sold if you go through a publisher. So if you don’t sell tens or hundreds of thousands of copies, the money you can make out of a book directly is really not relevant, considering the amount of work you put into it. Going through a publisher will also make the book more expensive for the reader, so fewer people will read it. I decided that it is more important to reach as many readers as possible [1].

[1] Publishers may help to get a book advertised, but in a niche community like DSLs I think that word of mouth, blog or Twitter is more useful. So I hope that you the reader will help me spread the word about the book.

1.3 Why this Book

First of all, there is currently no book available that explicitly covers DSLs in the context of modern language workbenches, with an emphasis on textual languages. Based on my experience, I think that this way of developing DSLs is very productive, so I think there is a need for a book that fills this gap. I wanted to make sure the book contains a lot of detail on how to design and build good DSLs, so it can act as a primer for DSL language engineering, for students as well as practitioners. However, I also want the book to clearly show the benefits of DSLs – not by pointing out general truths about the approach, but instead by providing a couple of good examples of where and how DSLs are used successfully. This is why the book is divided into three parts: DSL Design, DSL Implementation and Software Engineering with DSLs.

Margin note: Since writing the original MDSD book, I have learned a lot, my viewpoints have evolved and the tools that are available today have evolved significantly as well. The latter is a reflection of the fact that the whole MDSD community has evolved: ten years ago, UML was the mainstay for MDSD, and the relationship to DSLs was not clear. Today, DSLs are the basis for most interesting and innovative developments in MDSD.

Even though I had written a book on Model-Driven Software Development (MDSD) before [2], I feel that it is time for a complete rewrite. So if you are among the people who read the previous MDSD book, you really should continue reading. This book is very different, but in many ways a natural evolution of the old one. It may gloss over certain details present in the older book, but it will expand greatly on others.

[2] M. Voelter and T. Stahl. Model-Driven Software Development: Technology, Engineering, Management. Wiley, 2006

1.4 What you will Learn

The purpose of this book is to give you a solid overview of the state of the art of today’s DSLs. This includes DSL design, DSL implementation and the use of DSLs in software engineering. After reading this book you should have a solid understanding of how to design, build and use DSLs. A few myths (good and bad) about DSLs should also be dispelled in the process.

Part III of the book, on DSL implementation, contains a lot of example code. This part is not intended as a full tutorial for any of the three tools it uses, but you should get a solid understanding of what these tools – and the classes of tools they stand for – can do for you.

1.5 Who should Read this Book

Everybody who has read my original book on Model-Driven Software Development should read this book. This book can be seen as an update to the old one, even though it is a complete rewrite.

On a more serious note, the book is intended for developers and architects who want to implement their own DSLs. I expect solid experience in object oriented programming as well as basic knowledge about and (classical) modeling. It also helps if readers have come across the terms grammar and parser before, although I don’t expect any significant experience with these techniques.

The MDSD book had a chapter on process and organizational aspects. Except for perhaps ten pages on process-related topics, this book does not address process and organization aspects. There are two reasons for this. First, these topics haven’t changed much since the old book, and you can read them there. Second, I feel these aspects were the weakest part of the old book, because it is very hard to discuss process and organizational aspects in a general way, independent of a particular context. Any working software development process will work with DSLs. Any strategy to introduce promising new techniques into an organization applies to introducing DSLs. The few specific aspects are covered in the ten pages at the end of the design chapter.

1.6 About the Cover

The cover layout resembles Addison-Wesley’s classic cover design. I always found this design one of the most elegant book covers I have seen. The picture of a glider has been chosen to represent the connection to the cover of the original MDSD book, whose English edition also featured a glider [3].

[3] The MDSD book featured a Schleicher ASW-27, a 15m class racing glider. This book features a Schleicher ASH-26E, an 18m class self-launching glider.

1.7 Feedback, Bugs and Updates

Writing a book such as this is a lot of work. At some point I ran out of energy and just went ahead and published it. I am pretty confident that there are no major problems left, but I am sure there are many small bugs and problems in the book, for which I am sorry. If you find any, please let me know at [email protected]. There is also a Google+ community for the book; you can find it via the website dslbook.org.

One of the advantages of an electronic book is that it is easy to publish new editions frequently. While I will certainly do other things in the near future (remember: I ran out of energy!), I will try to publish an updated and bug-fixed version relatively soon. In general, updates for the book will be available via my Twitter account @markusvoelter and via the book website http://dslbook.org.

1.8 The Structure of the Book

The rest of this first part is a brief introduction to DSLs. It defines terminology, looks at the benefits and challenges of developing and using DSLs, and introduces the notion of modular languages, which play an important role throughout the book. This first part is written in a personal style: it presents DSLs based on my experience, and is not intended to be a scientific treatment.

Part II is about DSL design. It is a systematic exploration of seven design dimensions relevant to DSL design: expressivity, coverage, semantics, separation of concerns, completeness, language modularization and syntax. It also discusses fundamental language paradigms that might be useful in DSLs, and looks at a number of process-related topics. It uses five case studies to illustrate the concepts. It does not deal at all with implementation issues – we address these in Part III.

Part III covers DSL implementation issues. It looks at syntax definition, constraints and type systems, scoping, transformation and interpretation, debugging and IDE support. It uses examples implemented with three different tools (Xtext, MPS, Spoofax). Part III is not intended as a tutorial for any one of these, but should provide a solid foundation for understanding the technical challenges when implementing DSLs.

Part IV looks at using DSLs for various tasks in software engineering, among them requirements engineering, architecture, implementation and, a specifically relevant topic, product line engineering. Part IV consists of a set of fairly independent chapters, each illustrating one of the software engineering challenges.

1.9 How to Read the Book

I had a lot of trouble deciding whether DSL design or DSL implementation should come first. The two parts are relatively independent. As a consequence of the fact that the design part comes first, there are some references back to design issues from within the implementation part. But the two parts can be read in any order, depending on your interests. If you are new to DSLs, I suggest you start with Part III on DSL implementation. You may find Part II, DSL Design, too abstract or dense if you don’t have hands-on experience with DSLs.

Some of the examples in Part III are quite detailed, because we wanted to make sure we didn’t skim relevant details. However, if some parts become too detailed for you, just skip ahead – usually the details are not important for understanding subsequent subsections.

The chapters in Part IV are independent from each other and can be read in any sequence.

Finally, I think you should at least skim the rest of Part I. If you are already versed in DSLs, you may want to skip some sections or skim over them, but it is important to understand where I am coming from to be able to make sense of some of the later chapters.

1.10 Example Tools

You could argue that this whole business about DSLs is nothing new. It has long been possible to build custom languages using parser generators such as lex/yacc, ANTLR or JavaCC. And of course you would be right. Martin Fowler’s DSL book [4] emphasizes this aspect.

[4] M. Fowler. Domain-Specific Languages. Addison Wesley, 2010

However, I feel that language workbenches, which are tools to efficiently create, integrate and use sets of DSLs in powerful IDEs, make a qualitative difference. DSL developers, as well as the people who use the DSLs, are used to powerful, feature-rich IDEs and tools in general. If you want to establish the use of DSLs and you suggest that your users use vi or notepad.exe, you won’t get very far with most people. Also, the effort of developing (sets of) DSLs and their IDEs has been reduced significantly by the maturation of language workbenches. This is why I focus on DSL engineering with language workbenches, and emphasize IDE development just as much as language development.

This is not a tutorial book on tools. I will show you how to work with different tools, but these should be understood as representative examples of different tooling approaches [5]. I tried to use diverse tools for the examples, but for the most part I stuck to those I happen to know well and that have serious traction in the real world, or the potential to do so: Eclipse Modeling + Xtext, JetBrains MPS, SDF/Stratego/Spoofax, and, to some extent, the Intentional Domain Workbench. All except the last are open source.

[5] I suggest you read the examples for all tools, so that you appreciate the different approaches to solving a common challenge in language design. If you want to learn about one tool specifically, there are probably better tutorials for each of them.

Here is a brief overview of the tools.

1.10.1 Eclipse Modeling + Xtext

The Eclipse Modeling project is an ecosystem – frameworks and tools – for modeling, DSLs and all that’s needed or useful around it. It would easily merit its own book (or set of books), so I won’t cover it extensively. I have restricted myself to Xtext, the framework for building textual DSLs, Xtend, a Java-like language optimized for code generation, as well as EMF/Ecore, the underlying meta meta model used to represent model data. Xtext may not be as advanced as SDF/Stratego or MPS, but the tooling is very mature and has a huge user community. Also, the surrounding ecosystem provides a huge number of add-ons that support the construction of sophisticated DSL environments. I will briefly look at some of these tools, among them graphical editing frameworks.

Margin note: eclipse.org/Xtext

1.10.2 JetBrains MPS

The Meta Programming System (MPS) is a projectional language workbench, which means that no grammar and parser is involved. Instead, editor gestures change the underlying AST directly, which is projected in a way that looks like text. As a consequence, MPS supports mixed notations (textual, symbolic, tabular, graphical) and a wide range of language composition features. MPS is open source under the Apache 2.0 license, and is developed by JetBrains. It is not as widely used as Xtext, but supports many advanced features.

Margin note: jetbrains.com/mps
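To make the projectional idea more concrete, here is a minimal sketch in Python. It is not how MPS is implemented, and all names in it (Number, BinaryOp, project, wrap_in_operation) are made up for this illustration. It only shows the principle described above: an editing gesture modifies the tree directly, and the text the user sees is computed from the tree and never parsed back.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Number:
    value: int


@dataclass
class BinaryOp:
    operator: str
    left: "Node"
    right: "Node"


Node = Union[Number, BinaryOp]


def project(node):
    """Render the tree in a text-like notation; the result is for display only."""
    if isinstance(node, Number):
        return str(node.value)
    return f"({project(node.left)} {node.operator} {project(node.right)})"


def wrap_in_operation(node, operator, operand):
    """Hypothetical editor gesture: typing an operator creates a new tree node."""
    return BinaryOp(operator, node, Number(operand))


tree = Number(1)
tree = wrap_in_operation(tree, "+", 2)   # user types "+ 2"
tree = wrap_in_operation(tree, "*", 3)   # user types "* 3"
print(project(tree))                     # prints ((1 + 2) * 3)
```

Because the tree is the primary artifact, a second projection function could render the same tree as a table or a diagram without touching the tree itself, which is roughly why mixed notations come naturally in this style of tool.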

1.10.3 SDF/Stratego/Spoofax

These tools are developed at the University of Delft in Eelco Visser’s group. SDF is a formalism for defining parsers for context-free grammars. Stratego is a term rewriting system used for AST transformations and code generation. Spoofax is an Eclipse-based IDE that provides a nice environment for working with SDF and Stratego. It is also not as widely used as Xtext, but it has a number of advanced features for language modularization and composition.

Margin note: strategoxt.org/Spoofax
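For readers who have not come across term rewriting, the following Python sketch illustrates the general idea of transforming an AST by applying rewrite rules to terms. It is not Stratego code and does not use any Spoofax API; the term encoding (nested tuples) and the function names simplify and bottomup are assumptions made for this example.

```python
def simplify(term):
    """A small rule set: return the rewritten term, or None if no rule matches."""
    if term[0] == "Add" and term[1] == ("Int", 0):
        return term[2]            # 0 + e  ->  e
    if term[0] == "Mul" and term[1] == ("Int", 1):
        return term[2]            # 1 * e  ->  e
    return None


def bottomup(rule, term):
    """Rewrite all subterms first, then try the rule on the resulting term."""
    if not isinstance(term, tuple):
        return term               # leaves such as 0 or "x" stay unchanged
    term = (term[0],) + tuple(bottomup(rule, child) for child in term[1:])
    rewritten = rule(term)
    return term if rewritten is None else rewritten


# 1 * (0 + x), written as a term
expression = ("Mul", ("Int", 1), ("Add", ("Int", 0), ("Var", "x")))
print(bottomup(simplify, expression))   # prints ('Var', 'x')
```

The separation between the rules (what to rewrite) and the traversal (where to apply them) is the point of the sketch; a code generator can be seen as a set of such rules whose right-hand sides are target-language terms.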

1.10.4 Intentional Domain Workbench

A few examples will be based on the Intentional Domain Workbench (IDW). Like MPS, it uses the projectional approach to editing. The IDW has been used to build a couple of very interesting systems that can serve well to illustrate the power of DSLs. The tool is a commercial offering of Intentional Software.

Margin note: intentsoft.com

Many more tools exist. If you are interested, I suggest you look at the Language Workbench Competition [6], where a number of language workbenches (13 at the time of writing of this book) are illustrated by implementing the same example DSLs. This provides a good way of comparing the various tools.

[6] languageworkbenches.net

1.11 Case Studies and Examples

I strove to make this book as accessible and practically relevant as possible, so I provide lots of examples. I decided against a single big, running example because (a) it becomes increasingly complex to follow, and (b) it fails to illustrate different approaches to solving the same problem. However, we use a set of case studies to illustrate many issues, especially in Part II, DSL design. These examples are introduced below. They are taken from real-world projects.

1.11.1 Component Architecture

This language is an architecture DSL used to define the software architecture of a complex, distributed, component-based system in the transportation domain [7]. Among other architectural abstractions, the DSL supports the definition of components and interfaces, as well as the definition of systems, which are connected instances of components. The code below shows interfaces and components. An interface is a collection of methods (not shown) or collections of messages. Components then provide and require ports, where each port has a name, an interface and, optionally, a cardinality.

[7] This language is also used in the Part IV chapter on DSLs and software architecture: Chapter 18.

    namespace com.mycomany {
        namespace datacenter {
            component DelayCalculator {
                provides aircraft: IAircraftStatus
                provides console: IManagementConsole
                requires screens[0..n]: IInfoScreen
            }
            component Manager {
                requires backend[1]: IManagementConsole
            }
            interface IInfoScreen {
                message expectedAircraftArrivalUpdate( id: ID, time: Time )
                message flightCancelled( id: ID )
            }
            interface IAircraftStatus ...
            interface IManagementConsole ...
        }
    }

The next piece of code shows how these components can be instantiated and connected.

    namespace com.mycomany.test {
        system testSystem {
            instance dc: DelayCalculator
            instance screen1: InfoScreen
            instance screen2: InfoScreen
            connect dc.screens to (screen1.default, screen2.default)
        }
    }

Code generators generate code that acts as the basis for the implementation of the system, as well as all the code necessary to work with the distributed communication middleware. The DSL is used by software developers and architects and is implemented with Eclipse Xtext.
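The concepts in this DSL (components, ports with interfaces and cardinalities, systems of connected instances) can be sketched as a small object model. The Python below is an illustration only, not the metamodel of the actual Xtext-based language. The classes Port, Component and System and the check_cardinalities function are assumptions of this sketch, and so is the definition of InfoScreen, which the example above uses but never defines.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Port:
    name: str
    interface: str
    lower: int = 1               # minimum number of connections
    upper: Optional[int] = 1     # maximum number; None stands for "n"


@dataclass
class Component:
    name: str
    provides: list = field(default_factory=list)
    requires: list = field(default_factory=list)

    @staticmethod
    def port(ports, name):
        return next(p for p in ports if p.name == name)


class System:
    def __init__(self):
        self.instances = {}      # instance name -> Component
        self.wires = {}          # (instance, required port) -> [(instance, provided port)]

    def instance(self, name, component):
        self.instances[name] = component

    def connect(self, source, port_name, targets):
        """Wire a required port to provided ports that offer the same interface."""
        comp = self.instances[source]
        required = comp.port(comp.requires, port_name)
        for target, target_port in targets:
            target_comp = self.instances[target]
            provided = target_comp.port(target_comp.provides, target_port)
            if provided.interface != required.interface:
                raise ValueError(
                    f"{source}.{port_name} expects {required.interface}, "
                    f"but {target}.{target_port} provides {provided.interface}")
        self.wires.setdefault((source, port_name), []).extend(targets)

    def check_cardinalities(self):
        """Return messages for required ports wired too few or too many times."""
        errors = []
        for inst_name, comp in self.instances.items():
            for port in comp.requires:
                count = len(self.wires.get((inst_name, port.name), []))
                too_many = port.upper is not None and count > port.upper
                if count < port.lower or too_many:
                    errors.append(f"{inst_name}.{port.name}: {count} connection(s)")
        return errors


delay_calculator = Component(
    "DelayCalculator",
    provides=[Port("aircraft", "IAircraftStatus"), Port("console", "IManagementConsole")],
    requires=[Port("screens", "IInfoScreen", lower=0, upper=None)],
)
# Assumption: InfoScreen provides one port named "default" of type IInfoScreen.
info_screen = Component("InfoScreen", provides=[Port("default", "IInfoScreen")])

test_system = System()
test_system.instance("dc", delay_calculator)
test_system.instance("screen1", info_screen)
test_system.instance("screen2", info_screen)
test_system.connect("dc", "screens", [("screen1", "default"), ("screen2", "default")])
```

In a language workbench, checks such as the interface match in connect and the cardinality check would typically be expressed as constraints on the language, so that the IDE reports violations while the architect is editing.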

1.11.2 Refrigerator Configuration

This case study describes a set of DSLs for developing cooling algorithms in refrigerators. The customer with whom we have built this language builds hundreds of different refrigerators, and coming up with energy-efficient cooling strategies is a big challenge. By using a DSL-based approach, the development and implementation process for the cooling behavior can be streamlined a lot.

Three languages are used. The first describes the logical hardware structure of refrigerators. The second describes cooling algorithms in the refrigerators using a state-based, asynchronous language. Cooling programs refer to hardware features and can access the properties of hardware elements from expressions and commands. The third language is used to test cooling programs. These DSLs are used by thermodynamicists and are implemented with Eclipse Xtext.

The code below shows the hardware structure definition in the refrigerator case study. An appliance represents the refrigerator. It consists mainly of cooling compartments and compressor compartments. A cooling compartment contains various building blocks that are important to the cooling process. A compressor compartment contains the cooling infrastructure itself, e.g. a compressor and a fan.

    appliance KIR {

        compressor compartment cc {
            static compressor c1
            fan ccfan
        }

        ambient tempsensor at

        cooling compartment RC {
            light rclight
            superCoolingMode
            door rcdoor
            fan rcfan
            evaporator tempsensor rceva
        }
    }

The code below shows a simple cooling algorithm. Cooling algorithms are state-based programs. States can have entry actions and exit actions. Inside a state we check whether specific conditions are true, then change the status of various hardware building blocks, or change the state. It is also possible to express deferred behavior with the perform ... after keyword.

    program Standardcooling for KIR {

        start:
            entry { state noCooling }

        state noCooling:
            check ( RC->needsCooling && cc.c1->standstillPeriod > 333 ) {
                state rcCooling
            }
            on isDown ( RC.rcdoor->open ) {
                set RC.rcfan->active = true
                set RC.rclight->active = false
                perform rcFanStopTask after 10 {
                    set RC.rcfan->active = false
                }
            }

        state rcCooling:
            ...
    }

Finally, the following code is a test script to test cooling programs. It essentially stimulates a cooling algorithm by changing hardware properties and then asserting that the algorithm reacts in a certain way.

    cooling test for Standardcooling {

        prolog {
            set cc.c1->standstillPeriod = 0
        }

        // initially we are not cooling
        assert-currentstate-is noCooling

        // then we say that RC needs cooling, but
        // the standstillPeriod is still too low.
        mock: set RC->needsCooling = true
        step
        assert-currentstate-is noCooling

        // now we increase standstillPeriod and check
        // if it now goes to rcCooling
        mock: set cc.c1->standstillPeriod = 400
        step
        assert-currentstate-is rcCooling
    }
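To give an impression of what executing such a cooling program and its test could involve, here is a rough Python sketch. It is hand-written for this one program and is not the generator or interpreter used in the project. Several points are assumptions: isDown is read as "the door property changed from true to false", the delay in perform ... after is counted in steps, and hardware properties are kept in a plain dictionary keyed by hypothetical path strings.

```python
class StandardCooling:
    """Hand-coded stand-in for the Standardcooling program (noCooling state only)."""

    def __init__(self, hardware):
        self.hw = hardware                # property path -> current value
        self.state = "noCooling"          # the start entry action selects noCooling
        self.door_was_open = hardware["RC.rcdoor.open"]
        self.deferred = []                # [remaining steps, action] pairs

    def _stop_fan(self):
        self.hw["RC.rcfan.active"] = False

    def step(self):
        # Count down deferred tasks and run those whose delay has elapsed.
        for task in self.deferred:
            task[0] -= 1
        due = [task for task in self.deferred if task[0] <= 0]
        self.deferred = [task for task in self.deferred if task[0] > 0]
        for _, action in due:
            action()

        door_open = self.hw["RC.rcdoor.open"]
        if self.state == "noCooling":
            if self.hw["RC.needsCooling"] and self.hw["cc.c1.standstillPeriod"] > 333:
                self.state = "rcCooling"
            elif self.door_was_open and not door_open:   # assumed meaning of isDown
                self.hw["RC.rcfan.active"] = True
                self.hw["RC.rclight.active"] = False
                self.deferred.append([10, self._stop_fan])
        self.door_was_open = door_open


def new_hardware():
    return {
        "RC.needsCooling": False,
        "RC.rcdoor.open": False,
        "RC.rcfan.active": False,
        "RC.rclight.active": False,
        "cc.c1.standstillPeriod": 0,      # the prolog of the test script
    }


def run_cooling_test():
    """Mirror the test script: mock hardware values, step, record the current state."""
    hw = new_hardware()
    program = StandardCooling(hw)
    observed = [program.state]            # initially we are not cooling

    hw["RC.needsCooling"] = True          # standstillPeriod is still too low
    program.step()
    observed.append(program.state)

    hw["cc.c1.standstillPeriod"] = 400    # now the check condition holds
    program.step()
    observed.append(program.state)
    return observed


print(run_cooling_test())   # prints ['noCooling', 'noCooling', 'rcCooling']
```

The sketch also suggests why a dedicated test language is attractive here: the three steps of the test script map directly onto mocked property values, a step and an expected state, without any of the surrounding plumbing.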

1.11.3 mbeddr C

This case study covers a set of extensions to the C programming language tailored to embedded programming [8], developed as part of mbeddr.com [9]. Extensions include state machines, physical quantities, tasks, as well as interfaces and components. Higher-level DSLs are added for specific purposes. An example used in a showcase application is the control of a Lego Mindstorms robot. Plain C code is generated and subsequently compiled with GCC or other target device specific compilers. The DSL is intended to be used by embedded software developers and is implemented with MPS.

[8] This system is also used as the example for implementation-level DSLs in Part IV of the book. It is covered in Chapter 20.

[9] mbeddr.com

Figure 1.1: A simple C module with an embedded decision table. This is a nice example of MPS’ ability to use non-textual notations thanks to its projectional editor (which we describe in detail in Part III).

Figure 1.2: This extension to C supports working with physical units (such as kg and lb). The type system has been extended to include type checks for units. The example also shows the unit testing extension.

Figure 1.3: This extension shows a state machine. Notice how regular C expressions are used in the guard conditions of the transitions. The inset code shows how the state machine can be triggered from regular C code.

1.11.4 Pension Plans

This DSL is used to describe families of pension plans for a large insurance company efficiently. The DSL supports mathematical abstractions and notations to allow insurance mathematicians to express their domain knowledge directly (Fig. 1.5), as well as higher-level pension rules and unit tests using a table notation (Fig. 1.4). A complete Java implementation of the calculation engine is generated. It is intended to be used by insurance mathematicians and pension experts. It has been built by Capgemini with the Intentional Domain Workbench.

Figure 1.4: This example shows high-level business rules, together with a tabular notation for unit tests. The prose text is in Dutch, but it is not important to be able to understand it in the context of this book.

Figure 1.5: Example Code written using the Pension Plans language. Notice the mathematical symbols used to express insurance mathematics.

1.11.5 WebDSL

WebDSL is a language for web programming [10] that integrates languages to address the different concerns of web programming, including persistent data modeling (entity), user interface templates (define), access control [11], data validation [12], search and more. The language enforces inter-concern consistency checking, providing early detection of failures [13]. The fragments in Fig. 1.6 and Fig. 1.7 show a data model, user interface templates and access control rules for posts in a blogging application. WebDSL is implemented with Spoofax and is used in the researchr digital library [14].

[10] E. Visser. WebDSL: A case study in domain-specific language engineering. In GTTSE, pages 291–373, 2007

[11] D. M. Groenewegen and E. Visser. Declarative access control for WebDSL: Combining language integration and separation of concerns. In ICWE, pages 175–188, 2008

[12] D. Groenewegen and E. Visser. Integration of data validation and user interface concerns in a DSL for web applications. SoSyM, 2011

[13] Z. Hemel, D. M. Groenewegen, L. C. L. Kats, and E. Visser. Static consistency checking of web applications with WebDSL. JSC, 46(2):150–182, 2011

Figure 1.6: Example Code written in WebDSL. The code shows data structures and utility functions.

Figure 1.7: More WebDSL example code. This example shows access control rules as well as a page definition.

2 Introduction to DSLs

Domain-Specific Languages (DSLs) are becoming more and more important in software engineering. Tools are becoming better as well, so DSLs can be developed with relatively little effort. This chapter starts with a definition of important terminology. It then explains the difference between DSLs and general-purpose languages, as well as the relationship between them. I then look at the relationship to model-driven development and develop a vision for modular programming languages, which I consider the pinnacle of DSLs. I discuss the benefits of DSLs and some of the challenges of adopting them, and describe a few application areas. Finally, I differentiate the approach discussed in this book from alternative approaches.

2.1 Very Brief Introduction to the Terminology

While we explain many of the important terms in the book as we go along, here are a few essential ones. You should at least roughly understand those right from the beginning.

I use the term programming language to refer to general-purpose languages (GPLs) such as Java, C++, Lisp or Haskell. While DSLs could be called programming languages as well (although they are not general-purpose programming languages), I don’t do this in this book: I just call them DSLs.

I use the terms model, program and code interchangeably because I think that any distinction is artificial: code can be written in a GPL or in a DSL. Sometimes DSL code and program code are mixed, so separating the two makes no sense. If the distinction is important, I say "DSL program" or "GPL code". If I use model and program or code in the same sentence, the model usually refers to the more abstract representation. An example would be: "The program generated from the model is ...".

Note that considering programs and models the same thing is only valid when looking at executable models, i.e. models whose final purpose is the creation of executable software. Of course, there are models used in systems engineering, for communication among stakeholders in business, or as approximations of physical, real-world systems that cannot be considered programs. However, these are outside the scope of this book.

If you know about DSLs, you will know that there are two main schools: internal and external DSLs. In this book I only address external DSLs. See Section 2.8 for details.

I distinguish between the execution engine and the target platform. The target platform is what your DSL program has to run on in the end and is assumed to be something we cannot change (significantly) during the DSL development process. The execution engine can be changed, and bridges the gap between the DSL and the platform. It may be an interpreter or a generator. An interpreter is a program running on the target platform that loads a DSL program and then acts on it. A generator (aka compiler) takes the DSL program and transforms it into an artifact (often GPL source code) that can run directly on the target platform1.

1 In an example from enterprise systems, the platform could be JEE and the execution engine could be an enterprise bean that runs an interpreter for a DSL. In embedded software, the platform could be a real-time operating system, and the execution engine could be a code generator that maps a DSL to the APIs provided by the RTOS.

A language, domain-specific or not, consists of the following main ingredients. The concrete syntax defines the notation with which users can express programs. It may be textual, graphical, tabular or a mix of these. The abstract syntax is a data structure that can hold the semantically relevant information expressed by a program. It is typically a tree or a graph. It does not contain any details about the notation – for example, in textual languages, it does not contain keywords, symbols or whitespace. The static semantics of a language are the set of constraints and/or type system rules to which programs have to conform, in addition to being structurally correct (with regards to the concrete and abstract syntax). Execution semantics refers to the meaning of a program once it is executed. It is realized using the execution engine. If I use the term semantics without any qualification, I refer to the execution semantics, not the static semantics.

Sometimes it is useful to distinguish between what I call technical DSLs and application domain DSLs2. The distinction is not always clear and not always necessary, but generally I consider technical DSLs to be used by programmers and application domain DSLs to be used by non-programmers. This can have significant consequences for the design of the DSL.

2 Sometimes also called business DSLs, vertical DSLs or "fachliche DSLs" in German.

There is often confusion around meta-ness (as in meta model) and abstraction. I think these terms are clearly different and I try to explain my understanding here. The meta model of a model (or program) is a model that defines (the abstract syntax of) a language used to describe a model. For example, the meta model of UML is a model that defines all those language concepts that make up the UML, such as classifier, association or property. So the prefix meta can be understood as the definition of. The reverse direction of the relationship is typically called instance of or conforms to. It also becomes clear that every meta model is a model3. A model m can play the role of a meta model with regards to a set of other models O, where m defines the language used to express the models in O.

3 The reverse statement is of course not true.

The notion of abstraction is different, even though it also characterizes the relationship between two artifacts (programs or models). An artifact a1 is more abstract than an artifact a2 if it leaves out some of the details of a2, while preserving those characteristics of a2 that are important for whatever a1 is used for – the purpose of a1 informs the abstractions we use to approximate a2 with a1. Note that according to this definition, abstraction and model are synonyms: a simplification of reality for a given purpose. In this sense, the term model can also be understood as characterizing the relationship between two artifacts: a1 is a model of a2.

Based on this discussion it should be clear that it does not make sense to say that the meta model is the model of a model, a sentence often heard around the modeling community. model of and meta model of are two quite distinct concepts.

2.2 From General Purpose Languages to DSLs

General Purpose Programming Languages (GPLs) are a means for programmers to instruct computers. All of them are Turing complete, which means that they can be used to implement anything that is computable with a Turing machine.
It also means that anything expressible with one Turing complete programming language can also be expressed with any other Turing complete programming language. In that sense, all programming languages are interchangeable. So why is there more than one? Why don’t we program everything in Java or Pascal or Ruby or Python? Why doesn’t an embedded systems developer use Ruby, and why doesn’t a Web developer use C?

Of course there is the execution strategy. C code is compiled to efficient native code, whereas Ruby is run by a virtual machine (a mix between an interpreter and a compiler). But in principle, you could compile (a subset of) Ruby to native code, and you could interpret C. The real reason why these languages are used for what they are used for is that the features they offer are optimized for the tasks that are relevant in the respective domains. In C you can directly influence memory layout (which is important when communicating with low-level, memory-mapped devices), you can use pointers (resulting in potentially very efficient data structures) and the preprocessor can be used as a (very limited) way of expressing abstractions with zero runtime overhead. In Ruby, closures can be used to implement "postponed" behavior (very useful for asynchronous web applications); Ruby also provides powerful string manipulation features (to handle input received from a website), and the meta programming facility supports the definition of internal DSLs that are quite suitable for Web applications (the Rails framework is the example for that).

So, even within the field of general-purpose programming, there are different languages, each providing different features tailored to the specific tasks at hand. The more specific the tasks get, the more reason there is for specialized languages4. Consider relational algebra: relational databases use tables, rows, columns and joins as their core abstractions. A specialized language, SQL, which takes these features into account, has been created. Or consider reactive, distributed, concurrent systems: Erlang is specifically made for this environment.

4 We do this in real life as well. I am sure you have heard about Eskimos having many different words for snow, because this is relevant in their "domain". Not sure this is actually true, but it is surely a nice metaphor for tailoring a language to its domain.

So, if we want to "program" for even more specialized environments, it is obvious that even more specialized languages are useful.
A Domain-Specific Language is simply a language that is optimized for a given class of problems, called a domain. It is based on abstractions that are closely aligned with the domain for which the language is built5. Specialized languages also come with a syntax suitable for expressing these abstractions concisely. In many cases these are textual notations, but tables, symbols (as in mathematics) or graphics can also be useful. Assuming the semantics of these abstractions is well defined, this makes a good starting point for expressing programs for a specialized domain effectively.

5 SQL has tables, rows and columns, Erlang has lightweight tasks, message passing and pattern matching.
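To make the relational example concrete, here is a minimal sketch that computes the same result twice: once with generic GPL constructs (loops and dictionaries) and once with SQL, whose abstractions (tables, joins, grouping) match the domain. It uses Python's built-in sqlite3 module. The table names, column names and sample data are made up for illustration.

```python
import sqlite3

# Hypothetical sample data: customers and their orders.
customers = [(1, "Ada"), (2, "Grace")]
orders = [(10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0)]


def totals_in_gpl(customers, orders):
    """Total order value per customer, written with generic loops and dicts."""
    names = {cid: name for cid, name in customers}
    totals = {}
    for _order_id, cid, amount in orders:
        totals[names[cid]] = totals.get(names[cid], 0.0) + amount
    return sorted(totals.items())


def totals_in_sql(customers, orders):
    """The same computation, stated with the abstractions of the domain."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    con.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    con.executemany("INSERT INTO customer VALUES (?, ?)", customers)
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    rows = con.execute(
        "SELECT c.name, SUM(o.amount) "
        "FROM customer c JOIN orders o ON o.customer_id = c.id "
        "GROUP BY c.name ORDER BY c.name"
    ).fetchall()
    con.close()
    return rows
```

Both functions return the same list of (name, total) pairs. The SQL variant states what is wanted in terms of tables and joins, while the loop variant spells out how to assemble it.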

Executing the Language

Engineering a DSL (or any language) is not just about syntax, it also has to be "brought to life" – DSL programs have to be executed somehow. It is important to understand the separation of domain contents into DSL, execution engine and platform (see Fig. 2.1):

Figure 2.1: Fixed domain concerns (black) end up in the platform, variable concerns end up in the DSL (white). Those concerns that can be derived by rules from the DSL program end up in the execution engine (gray).

• Some concerns are different for each program in the domain (white circles). The DSL provides tailored abstractions to express this variability concisely.

• Some concerns are the same for each program in the domain (black circles). These typically end up in the platform.

• Some concerns can be derived by fixed rules from the program written in the DSL (gray circles). While these concerns are not identical in each program in the domain, they are always the same for a given DSL program structure. These concerns are handled by the execution engine (or, in some cases, in frameworks or libraries that are part of the platform).

There are two main approaches to building execution engines: translation (aka generation or compilation) and interpretation. The former translates a DSL program into a language for which an execution engine on a given target platform already exists. Often, this is GPL source code. In the latter case, you build a new execution engine (on top of your desired target platforms) which loads the program and executes it directly.

If there is a big semantic gap between the language abstractions and the relevant concepts of the target platform (i.e. the platform the interpreter or generated code runs on), execution may become inefficient. For example, if you try to store and query graph data in a relational database, this will be very inefficient, because many joins will be necessary to reassemble the graph from the normalized tabular structure. As another example, consider running Erlang on a system which only provides heavyweight processes: having thousands of processes (as typical Erlang programs require) is not going to be efficient. So, when defining a language for a given domain, you should be aware of the intricacies of the target platform and the interplay between execution and language design6.

6 This may sound counterintuitive. Isn’t a DSL supposed to abstract away from just these details of execution? Yes, but: it has to be possible to implement a reasonably efficient execution engine. DSL design is a compromise between appropriate domain abstractions and the ability to get to an efficient execution. A good DSL allows the DSL user to ignore execution concerns, but allows the DSL implementor to implement a reasonable execution engine.
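The two approaches can be sketched side by side for a toy state machine language. This is only an illustration under stated assumptions: the Transition and StateMachine classes, the traffic-light program and the function names are hypothetical, and Python stands in for both the GPL and the target platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    source: str
    event: str
    target: str


@dataclass(frozen=True)
class StateMachine:
    initial: str
    transitions: tuple


# Abstract syntax of a hypothetical traffic-light program.
LIGHT = StateMachine(
    initial="red",
    transitions=(
        Transition("red", "go", "green"),
        Transition("green", "slow", "yellow"),
        Transition("yellow", "stop", "red"),
    ),
)


def interpret(machine, events):
    """Interpretation: load the program and act on it directly."""
    state = machine.initial
    for event in events:
        for t in machine.transitions:
            if t.source == state and t.event == event:
                state = t.target
                break  # first matching transition wins; unknown events are ignored
    return state


def generate(machine):
    """Translation: emit GPL (here Python) source with the same behavior."""
    lines = [
        "def run(events):",
        f"    state = {machine.initial!r}",
        "    for event in events:",
        "        pass  # keeps the loop body non-empty if there are no transitions",
    ]
    keyword = "if"
    for t in machine.transitions:
        lines.append(f"        {keyword} state == {t.source!r} and event == {t.event!r}:")
        lines.append(f"            state = {t.target!r}")
        keyword = "elif"
    lines.append("    return state")
    return "\n".join(lines) + "\n"


def run_generated(machine, events):
    """Execute the generated source; exec stands in for compiling and deploying it."""
    namespace = {}
    exec(generate(machine), namespace)
    return namespace["run"](events)
```

The interpreter walks the abstract syntax at run time, whereas the generator turns each transition into an if/elif branch once, up front. For the same event sequence both paths end in the same state.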

Languages versus Libraries and Frameworks

At this point you should to some extent believe that specific problems can be more efficiently solved by using the right abstractions. But why do we need full-blown languages? Aren’t objects, functions, APIs and frameworks good enough? What does creating a language add to the picture?

• Languages (and the programs you write with them), are the cleanest form of abstraction – essentially, you add a notation to a conceptual model of the domain. You get rid of all the unnecessary clutter that an API – or anything else embedded in or expressed with a general-purpose language – requires. You can define a notation that expresses the abstractions concisely and makes interacting with programs easy and efficient.

• DSLs sacrifice some of the flexibility to express any program (as in GPLs) for productivity and conciseness of relevant programs in a particular domain. In that sense, DSLs are limited, or restricted. DSLs may be so restricted that they only allow the creation of correct programs (correct-by-construction).

• You can provide non-trivial static analyses and checks, and an IDE that offers services such as code completion, syntax highlighting, error markers, refactoring and debugging. This goes far beyond what can be done with the facilities provided by general-purpose languages.

In the end, this is what allows DSLs to be used by non-programmers, one of the value propositions of DSLs: they get a clean, custom, productive environment that allows them to work with languages that are closely aligned with the domain in which they work.

Differences between GPLs and DSLs

I said above that DSLs sacrifice some of the flexibility to express any program in favor of productivity and conciseness of relevant programs in a particular domain. But beyond that, how are DSLs different from GPLs, and what do they have in common?

The boundary isn’t as clear as it could be. Domain-specificity is not black-and-white, but instead gradual: a language is more or less domain specific. The following table lists a set of language characteristics. While DSLs and GPLs can have characteristics from both the second and the third columns, DSLs are more likely to have characteristics from the third column.

                                   GPLs                              DSLs
Domain                             large and complex                 smaller and well-defined
Language size                      large                             small
Turing completeness                always                            often not
User-defined abstractions          sophisticated                     limited
Execution                          native                            via intermediate GPL
Lifespan                           years to decades                  months to years (driven by context)
Designed by                        guru or committee                 a few engineers and domain experts
User community                     large, anonymous and widespread   small, accessible and local
Evolution                          slow, often standardized          fast-paced
Deprecation/incompatible changes   almost impossible                 feasible

Figure 2.2: Domain-specific languages versus programming languages. DSLs tend to pick more characteristics from the third column, GPLs tend to pick more from the second.

Considering that DSLs pick more characteristics from the third rather than the second column, this makes designing DSLs a more manageable problem than designing general-purpose languages. DSLs are typically just much smaller and simpler7 than GPLs (although there are some pretty sophisticated DSLs).

7 Small and simple can mean that the language has fewer concepts, that the type system is less sophisticated or that the expressive power is limited.

There are some who maintain that DSLs are always declarative (it is not completely clear what "declarative" means anyway), or that they may never be Turing complete. I disagree. They may well be. However, if your DSL becomes as big and general as, say, Java, you might want to consider just using Java8. DSLs often start simple, based on an initially limited understanding of the domain, but then grow more and more sophisticated over time, a phenomenon Hudak notes in his ’96 paper9.

8 Alternatively, if your tooling allows it, extending Java with domain-specific concepts.

9 P. Hudak. Building domain-specific embedded languages. ACM Comput. Surv., 28(4es):196, 1996

So, then, are Mathematica, SQL, State Charts or HTML actually DSLs? In a technical sense they are. They are clearly optimized for (and limited to) a special domain or problem. However, these are examples of DSLs that pick more characteristics from the GPL column, and therefore aren’t necessarily good examples for the kinds of languages we cover in this book.

Ira Baxter suggests only half-jokingly that as soon as a DSL is really successful, we don’t call it a DSL anymore.

2.3 Modeling and Model-Driven Development

There are two ways in which the term modeling can be understood: descriptive and prescriptive. A descriptive model represents an existing system. It abstracts away some aspects and emphasizes others. It is usually used for discussion, communication and analysis. A prescriptive model is one that can be used to (automatically) construct the target system. It must be much more rigorous, formal, complete and consistent. In the context of this chapter, and of the book in general, we always mean prescriptive models when we use the term model10. Using models in a prescriptive way is the essence of model-driven (software) development (MDSD).

10 Some people say that models are always descriptive, and once you become prescriptive, you enter the realm of programming. That’s fine with me. As I have said above, I don’t distinguish between programming and modeling, just between more or less abstract languages and models.

Defining and using DSLs is a flavor of MDSD: we create formal, tool-processable representations of specific aspects of software systems11. We then use interpretation or code generation to transform those representations into executable code expressed in general-purpose programming languages and the associated XML/HTML/whatever files. With today’s tools it is technically relatively simple to define arbitrary abstractions that represent some aspect of a software system in a meaningful way12. It is also relatively simple to build code generators that generate the executable artifacts (as long as you don’t need sophisticated optimizations, which can be challenging). Depending on the particular DSL tool used, it is also possible to define suitable notations that make the abstractions easily understandable by non-programmers (for example opticians or thermodynamics engineers).

11 One can also do MDSD without DSLs by, for example, generating code from general-purpose modeling languages such as UML.

12 Designing a good language is another matter – Part II, DSL Design, provides some help with this.

However, there are also limitations to the classical MDSD approach. The biggest one is that modeling and programming often do not go together very well: modeling languages, environments and tools are distinct from programming languages, environments and tools. The level of distinctness varies, but in many cases it is big enough to cause integration issues that can make adoption of MDSD challenging.

Let me provide some specific examples. Industry has settled on a limited number of meta meta models, EMF/EMOF being the most widespread. Consequently, it is possible to navigate, query and constrain arbitrary models with a common API. However, programming language IDEs are typically not built on top of EMF, but come with their own API for representing and accessing the syntax tree.
Thus, interoperability between models and source code is challenging – you cannot treat source code in the same way as models in terms of how you access the AST programmatically.

A similar problem exists regarding IDE support for model-code integrated systems: you cannot mix (DSL) models and (GPL) programs while retaining reasonable IDE support. Again, this is because the technology stacks used by the two are different13. These problems often result in an artificial separation of models and code: either code generators create skeletons into which source code is inserted (directly or via the generation gap pattern), or developers resort to the arcane practice of pasting C snippets into 300 by 300 pixel sized text boxes in graphical state machine tools (and getting errors reported only when the resulting integrated C code is compiled).

13 Of course, an integration can be created, as Xtext/Xtend/Java shows. However, this is a special integration with Java. Interoperability with, say, C code, would require a new and different integration infrastructure.

So what really is the difference between programming and (prescriptive) modeling today? The table in Fig. 2.3 contains some (general and broad) statements:

                                        Modeling                      Programming
Define your own notation/language       Easy                          Sometimes possible to some extent
Syntactically integrate several langs   Possible, depends on tool     Hard
Graphical notations                     Possible, depends on tool     Usually only visualizations
Customize generator/compiler            Easy                          Sometimes possible based on open compilers
Navigate/query                          Easy                          Sometimes possible, depends on IDE and APIs
View Support                            Typical                       Almost Never
Constraints                             Easy                          Sometimes possible with Findbugs etc.
Sophisticated mature IDE                Sometimes, effort-dependent   Standard
Debugger                                Rarely                        Almost always
Versioning, diff/merge                  Depends on syntax and tools   Standard

Figure 2.3: Comparing modeling and programming

Why the Difference?

So one can and should ask: why is there a difference in the first place? I suspect that the primary reason is history: the two worlds have different origins and have evolved in different directions.

Programming languages have traditionally used textual concrete syntax, i.e. the program is represented as a stream of characters. Modeling languages traditionally have used graphical notations. Of course there are textual domain-specific languages (and mostly failed graphical general-purpose languages), but the use of textual syntax for domain-specific modeling has only recently become more prominent. Programming languages have traditionally stored programs in their textual, concrete syntax form, and used scanners and parsers to transform this character stream into an abstract syntax tree for further processing. Modeling languages have traditionally used editors that directly manipulate the abstract syntax, and used projection to render the concrete syntax in the form of diagrams14. This approach makes it easy for modeling tools to define views, the ability to show the same model elements in different contexts, often using different notations. This has never really been a priority for programming languages beyond outline views, inheritance trees or call graphs.

14 This is not something we think about much. To most of us this is obvious. If it were different, we’d have to define grammars that could parse two-dimensional graphical structures. While this is possible, it has never caught on in practice.

Here is one of the underlying premises of this book: there should be no difference15! Programming and (prescriptive) modeling should be based on the same conceptual approach and tool suite, enabling meaningful integration16.

15 This is my personal opinion. While I know enough people who share it, I also know people who disagree.

16 As we will see in this book, Xtext/Xbase/Xtend and MPS’ BaseLanguage and mbeddr C are convincing examples of this idea.

In my experience, most software developers don’t want to model. They want to program, but:

• at different levels of abstraction: some things may have to be described in detail, low level, algorithmically (a sorting algorithm); other aspects may be described in more high-level terms (declarative UIs)

• from different viewpoints: separate aspects of the system should be described with languages suitable to these aspects (data structures, persistence mapping, process, UI)

• with different degrees of domain-specificity: some aspects of systems are generic enough to be described with reusable, generic languages (components, database mapping). Other aspects require their own dedicated, maybe even project-specific DSLs (pension calculation rules).

• with suitable notations, so all stakeholders can contribute directly to "their" aspects of the overall system (a tabular notation for testing pension rules)

• with suitable degrees of expressiveness: aspects may be described imperatively, with functions, or other Turing complete formalisms (a routing algorithm), and other aspects may be described in a declarative way (UI structures)

• always integrated and tool processable, so all aspects directly lead to executable code through a number of transformations or other means of execution.

This vision, or goal, leads to the idea of modular languages, as explained in the next section.

2.4 Modular Languages

I distinguish between the size of a language and its scope. Language size simply refers to the number of language concepts in that language. Language scope describes the area of applicability for the language, i.e. the size of the domain. The same domain can be covered with big and small languages. A big language makes use of linguistic abstraction, whereas a small language allows the user to define their own in-language abstractions. We discuss the tradeoffs between big and small languages in detail as part of the chapter on Expressivity (Section 4.1), but here is a short overview, based on examples from GPLs.

Examples of big languages include Cobol (a relatively old language intended for use by business users) or ABAP (SAP’s language for programming the R/3 system). Big languages (Fig. 2.4) have a relatively large set of very specific language concepts. Proponents of these languages say that they are easy to learn, since "There’s a keyword for everything". Constraint checks, meaningful error messages and IDE support are relatively simple to implement because of the large set of language concepts. However, expressing more sophisticated algorithms can be clumsy, because it is hard to write compact, dense code.

Figure 2.4: A Big Language has many very specific language concepts.

Let us now take a look at small languages (Fig. 2.5). Lisp or Smalltalk are examples of small GPLs. They have few, but very powerful language concepts that are highly orthogonal, and hence, composable. Users can define their own abstractions. Proponents of this kind of language also say that those are easy to learn, because "You only have to learn three concepts". But it requires experience to build more complex systems from these basic building blocks, and code can be challenging to read because of its high density. Tool support is harder to build because much more sophisticated analysis of the code is necessary to reverse engineer its domain semantics.

Figure 2.5: A Small Language has few, but powerful, language concepts.

There is a third option: modular languages (Fig. 2.6). They are, in some sense, the synthesis of the previous two. A modular language is made up of a minimal language core, plus a library of language modules that can be imported for use in a given program. The core is typically a small language (in the way defined above) and can be used to solve any problem at a low level, just as in Smalltalk or Lisp. The extensions then add first class support for concepts that are interesting in the target domain.
Because the extensions are linguistic in nature, interesting analyses can be performed and writing generators (transforming to the minimal core) is relatively straightforward. New, customized language modules can be built and used at any time. A language module is like a framework or library, but it comes with its own syntax, editor, type system and IDE tooling. Once a language module is imported, it behaves as an integral part of the composed language, i.e. it is integrated with other modules by referencing symbols or by being syntactically embedded in code expressed with another module. Integration on the level of the type system, the semantics and the IDE is also provided. An extension module may even be embeddable in different core languages17.

Figure 2.6: A Modular Language: a small core, and a library of reusable language modules.

17 This may sound like internal DSLs. However, static error checking, static optimizations and IDE support are what differentiate this approach from internal DSLs.
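The relationship between a minimal core and a language module can be sketched roughly as follows. Everything here is a hypothetical illustration, not the design of any particular workbench: the core consists of Num, Add and Mul, and the module contributes a single domain concept, Percent, together with its own constraint check and a reduction to the core.

```python
from dataclasses import dataclass


# --- minimal core language ---
@dataclass(frozen=True)
class Num:
    value: float


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Mul:
    left: object
    right: object


def evaluate(node):
    """Execution engine that only knows the core concepts."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Add):
        return evaluate(node.left) + evaluate(node.right)
    if isinstance(node, Mul):
        return evaluate(node.left) * evaluate(node.right)
    raise TypeError(f"not a core concept: {type(node).__name__}")


# --- hypothetical language module: first-class percentages ---
@dataclass(frozen=True)
class Percent:
    rate: float
    of: object


def check(node):
    """Static check made possible because Percent is a language concept."""
    errors = []
    if isinstance(node, Percent):
        if not 0 <= node.rate <= 100:
            errors.append(f"rate {node.rate} is outside 0..100")
        errors += check(node.of)
    elif isinstance(node, (Add, Mul)):
        errors += check(node.left) + check(node.right)
    return errors


def reduce_to_core(node):
    """Generator of the module: rewrite Percent into core concepts."""
    if isinstance(node, Percent):
        return Mul(Num(node.rate / 100), reduce_to_core(node.of))
    if isinstance(node, (Add, Mul)):
        return type(node)(reduce_to_core(node.left), reduce_to_core(node.right))
    return node
```

A program such as Add(Num(200), Percent(25, Num(200))) is first checked, then reduced to core concepts, and only then handed to evaluate. The core engine never needs to know that the module exists, which is what keeps the generator for the extension small.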

This idea isn’t new. Charles Simonyi18 and Sergey Dmitriev19 have written about it, so has Guy Steele in the context of Lisp20. The idea of modular and extensible languages also relates very much to the notion of language workbenches as defined by Martin Fowler21.

18 C. Simonyi, M. Christerson, and S. Clifford. Intentional software. In OOPSLA, pages 451–464, 2006

19 S. Dmitriev. Language oriented programming: The next programming paradigm, 2004

20 G. L. S. Jr. Growing a language. lisp, 12(3):221–236, 1999

21 M. Fowler. Language workbenches: The killer-app for domain specific languages?, 2005

He defines language workbenches as tools where:

• Users can freely define languages that are fully integrated with each other. This is the central idea for language workbenches, but also for modular languages, since you can easily argue that each language module is what Martin Fowler calls a language. "Full integration" can refer to referencing as well as embedding, and includes type systems and semantics.

• The primary source of information is a persistent abstract representation and language users manipulate a DSL through a projectional editor. This implies that projectional editing must be used22. I don’t agree. Storing programs in their abstract representation and then using projection to arrive at an editable representation is very useful, and maybe even the best approach to achieve modular languages23. However, in the end I don’t think this is important, as long as languages are modular. If this is possible with a different approach, such as scannerless parsers24, that is fine with me.

• Language designers define a DSL in three main parts: schema, editor(s) and generator(s). I agree that ideally a language should be defined "meta model first", i.e. you first define a schema (aka the meta model or AST), and then the concrete syntax (editor or grammar), based on the schema: MPS does it this way. However, I think it is also ok to start with the grammar, and have the meta model derived. This is the typical workflow with Xtext, although it can do both. From the language user’s point of view, it does not make a big difference in most cases.

• A language workbench can persist incomplete or contradictory information. I agree. This is trivial if the models are stored in a concrete textual syntax, but it is not so trivial if a persistent representation based on the abstract syntax is used.

22 Projectional editing means that users don’t write text that is subsequently parsed. Instead, user interactions with the concrete syntax directly change the underlying abstract syntax. We’ll discuss this technology much more extensively in Part III of the book.

23 In fact, this is what I think personally. But I don’t think that this characteristic is essential for language workbenches.

24 Scannerless parsers do not distinguish between recognizing tokens and parsing the structure of the tokens, thereby avoiding some problems with grammar composability. We’ll discuss this further in the book as well.

Let me add two additional requirements. For all the languages built with the workbench, tool support must be available: syntax highlighting, code completion, any number of static analyses (including type checking in the case of a statically typed language) and ideally also a debugger25. A final requirement is that I want to be able to program complete systems within the language workbench. Since in most interesting systems you will still write parts in a GPL, GPLs must also be available in the environment based on the same language definition/editing/processing infrastructure. Depending on the target domains, this language could be Java, Scala or C#, but it could also be C/C++ for the embedded community. Starting with an existing general-purpose language also makes the adoption of the approach simpler: incremental language extensions can be developed as the need arises.

25 A central idea of language workbenches is that language definition always includes IDE definition. The two should be integrated.

Concrete Syntax

By default, I expect the concrete syntax of DSLs to be textual. Decades of experience show that textual syntax, together with good tool support, is adequate for large and complex software systems26. This becomes even more true if you consider that programmers will have to write less code in a DSL – compared to expressing the same functionality in a GPL – because the abstractions available in the languages will be much more closely aligned with the domain. And programmers can always define an additional language module that fits a domain.

26 This is not true for all formalisms. Expressing hierarchical state charts textually can be a challenge. However, textual syntax is a good default that can be used unless otherwise indicated.

Using text as the default does not mean that it should stop there. There are worthwhile additions. For example, symbolic (as in mathematical) notations and tables should be supported. Finally, graphical editing is useful for certain cases. Examples include data structure relationships, state machine diagrams or data flow systems. The textual and graphical notations must be integrated, though: for example, you will want to embed the expression language module into the state machine diagram to be able to express guard conditions.

The need to see graphical notations to gain an overview over complex structures does not necessarily mean that the program has to be edited in a graphical form: custom visualizations are important as well. Visualizations are graphical representations of some interesting aspect of the program that are read-only, automatically laid out and support drill-down back to the program (you can double-click on, say, a state in the diagram, and the text editor selects that particular state in the program text).

Language Libraries

The importance of being able to build your own languages varies depending on the concern at hand. Assume that you work for an insurance company and you want to build a DSL that supports your company’s specific way of defining insurance contracts. In this case it is essential that the language is aligned exactly with your business, so you have to define the language yourself27. There are other similar examples: building DSLs to describe radio astronomy observations for a given telescope, our case study language to describe cooling algorithms for refrigerators, or a language for describing telecom billing rules (all of these are actual projects I have worked on).

27 Example concepts in the insurance domain include native types for dates, times and time periods, currencies, support for temporal data and the supporting operators, as well as business rules that are "polymorphic" regarding their period of applicability.

However, for a large range of concerns relating to software engineering or software architecture, the relevant abstractions are well known. They could be made available for reuse (and adaptation) in a library of language modules. Examples include:

• Hierarchical components, ports, component instances and connectors.

• Tabular data structure definition (as in the relational model) or hierarchical data structure definition (as in XML and XML schema), including specifications for persisting that data.

• Definition of rich contracts, including interfaces, pre- and post conditions, protocol state machines and the like.

• Various communication paradigms, such as message passing, synchronous and asynchronous remote procedure calls and service invocations.

• Abstractions for concurrency based on transactional memory or actors.

This sounds like a lot to put into a programming language. But remember: it will not all be in one language. Each of those concerns will be a separate language module that will be used in a program only if needed.

It is certainly not possible to define all these language modules in isolation. Modules have to be designed to work with each other, and a clear dependency structure has to be established. Interfaces on language level support "plugging in" new language constructs. A minimal core language, supporting primitive types, expressions, functions and maybe OO, will act as the focal point around which additional language modules are organized.
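To make the idea of a clear dependency structure around a minimal core concrete, here is a small Python sketch. It is not how MPS or any other workbench manages language modules; the module names and the `modules_needed` helper are assumptions chosen purely for illustration. It computes which modules a program pulls in, and in which order they have to be made available.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical language modules, each mapped to the modules it extends.
# "core" stands in for the minimal core language described above.
MODULE_DEPENDENCIES = {
    "core": set(),
    "components": {"core"},
    "statemachines": {"core"},
    "contracts": {"components", "statemachines"},
    "persistence": {"core"},
}


def modules_needed(requested, dependencies=MODULE_DEPENDENCIES):
    """Return the requested modules plus everything they depend on,
    ordered so that each module comes after its dependencies."""
    needed = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(dependencies[name])
    graph = {name: dependencies[name] for name in needed}
    # static_order() raises graphlib.CycleError for cyclic dependencies.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(modules_needed(["contracts"]))
```

A program that only uses the hypothetical contracts module gets the core, components and state machine modules as well, but not persistence: modules are used in a program only if needed.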

Many of these architectural concerns interact with frameworks, platforms and middleware. It is crucial that the abstractions in the language remain independent of specific technology solutions. In addition, when interfacing with a specific technology, additional (hopefully declarative) specifications might be necessary: such a technology mapping should be a separate model that references the core program that expresses the application logic. The language modules define a language for specifying persistence, distribution or contract definition. Technology suppliers can support customized generators that map programs to the APIs defined by their technology, taking into account possible additional specifications that configure the mapping28.

28 This is a little like service provider interfaces (SPIs) in Java enterprise technology.
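The separation between the core program and a technology mapping can be sketched in a few lines of Python. All names here (`Entity`, `PersistenceMapping`, `generate_ddl`) are hypothetical and only illustrate the shape of the idea: the mapping is a separate model that references the entity by name, and the generator combines both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Core program element: a technology-independent data structure."""
    name: str
    fields: tuple  # pairs of (field name, abstract type)


@dataclass(frozen=True)
class PersistenceMapping:
    """Separate model that references an entity by name and adds
    technology-specific decisions; the entity itself stays unchanged."""
    entity_name: str
    table_name: str
    column_types: dict  # abstract type -> SQL column type


def generate_ddl(entity, mapping):
    """Hypothetical supplier-provided generator for a relational target."""
    if mapping.entity_name != entity.name:
        raise ValueError("mapping does not reference this entity")
    columns = ", ".join(
        f"{field_name} {mapping.column_types[abstract_type]}"
        for field_name, abstract_type in entity.fields
    )
    return f"CREATE TABLE {mapping.table_name} ({columns});"


customer = Entity("Customer", (("name", "string"), ("age", "int")))
mapping = PersistenceMapping(
    "Customer", "T_CUSTOMER", {"string": "VARCHAR(255)", "int": "INTEGER"}
)
print(generate_ddl(customer, mapping))
```

Swapping the persistence technology then means supplying a different mapping and generator, while the `Customer` definition is left untouched.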

Figure 2.7: A program written in a modularly extendible C (from the mbeddr.com project).

A Vision of Programming

For me, this is the vision of programming I am working towards. The distinction between modeling and programming vanishes. People can develop code using a language directly suitable to the task at hand and aligned with their role in the overall development project. They can also build their own languages or language extensions, if that makes their work easier. Most of these languages will be relatively small, since they only address one aspect of a system, and typically extend existing languages (Fig. 2.7 shows an example of extensions to C). They are not general-purpose: they are DSLs.

Tools for implementing this approach exist. Of course they can become even better, for example in the development of debuggers or integration of graphical and textual languages, but we are clearly getting there.

MPS is one of them, which is why I focus a lot on MPS in this book. Intentional’s Domain Workbench is another one. Various Eclipse-based solutions (with Xtext/Xbase/Xtend at the core) are getting there as well.

2.5 Benefits of using DSLs

Using DSLs can reap a multitude of benefits. There are also some challenges you have to master: I outline these in the next section. Let’s look at the upside first.

2.5.1 Productivity

Once you’ve got a language and its execution engine for a particular aspect of your development task, work becomes much more efficient, simply because you don’t have to do the grunt work manually29. This is most obviously useful if you can replace a lot of GPL code with a few lines of DSL code. There are many studies that show that the mere amount of code one has to write (and read!) introduces complexity, independent of what the code expresses, and how. The ability to reduce that amount while retaining the same semantic content is a huge advantage.

29 Presumably the amount of DSL code you have to write is much less than what you’d have to write if you used the target platform directly.

You could argue that a good library or framework will do the job as well. True, libraries, frameworks and DSLs all encapsulate knowledge and functionality, making it easily reusable. However, DSLs provide a number of additional benefits, such as a suitable syntax, static error checking or static optimizations and meaningful IDE support.

2.5.2 Quality

Using DSLs can increase the quality of the created product: fewer bugs, better architectural conformance, increased maintainability. This is the result of the removal of (unnecessary) degrees of freedom for programmers, the avoidance of duplication of code (if the DSL is engineered in the right way) and the consistent automation of repetitive work by the execution engine30. As the next item shows, more meaningful validation and verification can be performed on the level of DSL programs, increasing the quality further.

30 This is also known as correct-by-construction: the language only allows the construction of correct programs.

The approach can also yield better performance if the execution engine contains the necessary optimizations. However, implementing these is a lot of work, so most DSLs do not lead to significant gains in performance.

2.5.3 Validation and Verification

Since DSLs capture their respective concern in a way that is not cluttered with implementation details, DSL programs are more semantically rich than GPL programs. Analyses are much easier to implement, and error messages can use more meaningful wording, since they can use domain concepts. As mentioned above, some DSLs are built specifically to enable non-trivial, formal (mathematical) analyses. Manual review and validation also become more efficient, because the domain-specific aspects are uncluttered, and domain experts can be involved more directly.
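As a rough illustration of why analyses get easier on domain-level models, consider a state machine represented directly as states and transitions. The following Python sketch is an assumption made for illustration (the `StateMachine` class and `validate` function are hypothetical, not taken from any tool discussed here): it checks for undeclared and unreachable states and reports problems in terms of states and events rather than implementation details.

```python
from dataclasses import dataclass, field


@dataclass
class StateMachine:
    """A hypothetical, minimal model of a state machine DSL program."""
    name: str
    initial: str
    states: set = field(default_factory=set)
    # Each transition is a triple (source state, event, target state).
    transitions: list = field(default_factory=list)


def validate(machine):
    """Return error messages phrased in terms of states and events."""
    errors = []
    if machine.initial not in machine.states:
        errors.append(f"initial state '{machine.initial}' is not declared")
    for source, event, target in machine.transitions:
        for state in (source, target):
            if state not in machine.states:
                errors.append(
                    f"transition on event '{event}' refers to "
                    f"undeclared state '{state}'"
                )
    # Reachability: follow transitions starting from the initial state.
    reachable, frontier = set(), [machine.initial]
    while frontier:
        current = frontier.pop()
        if current in reachable:
            continue
        reachable.add(current)
        frontier.extend(t for s, _, t in machine.transitions if s == current)
    for state in sorted(machine.states - reachable):
        errors.append(f"state '{state}' can never be reached")
    return errors


door = StateMachine(
    name="door",
    initial="closed",
    states={"closed", "open", "locked"},
    transitions=[("closed", "open_door", "open"),
                 ("open", "close_door", "closed")],
)
print(validate(door))
```

Recovering the same information from a hand-written switch statement in a GPL would require reconstructing the state machine from the code first; here the model already is the state machine.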

2.5.4 Data Longevity

If done right, models are independent of specific implementation techniques. They are expressed at a level of abstraction that is meaningful to the domain – this is why we can analyze and generate based on these models. This also means that models can be transformed into other representations if the need arises, for example, because you are migrating to a new DSL technology. While the investments in a DSL implementation are specific to a particular tool (and lost if you change it), the models should largely be migratable31.

31 This leads to an interesting definition of legacy code: it is legacy, if you cannot access the domain semantics of a data structure, and hence you cannot automatically migrate it to a different formalism.

2.5.5 A Thinking and Communication Tool

If you have a way of expressing domain concerns in a language that is closely aligned with the domain, your thinking becomes clearer, because the code you write is not cluttered with implementation details. In other words, using DSLs allows you to separate essential from accidental complexity, moving the latter to the execution engine. This also makes team communication simpler.

But not only is using the DSL useful; also, the act of building the language can help you improve your understanding of the domain for which you build the DSL. It also helps straighten out differences in the understanding of the domain that arise from different people solving the same problem in different ways. In some senses, a language definition is an "executable analysis model"32. I have had several occasions on which customers said, after a three-day DSL prototyping workshop, that they had learned a lot about their own domain, and that even if they never used the DSL, this alone would be worth the effort spent on building it. In effect, a DSL is a formalization of the Ubiquitous Language in the sense of Eric Evans’ Domain Driven Design33.

Building a language requires formalization and decision making: you can’t create a DSL if you don’t really know what you’re talking about.

32 Remember the days when "analysts" created "analysis models"?
33 E. Evans. Domain-driven design: tackling complexity in the heart of software. Addison-Wesley, 2004

2.5.6 Domain Expert Involvement

DSLs whose domain, abstractions and notations are closely aligned with how domain experts (i.e. non-programmers) express themselves allow for very good integration between developers and domain experts: domain experts can easily read, and often write, program code, since it is not cluttered with implementation details irrelevant to them. And even when domain experts aren’t willing to write DSL code, developers can at least pair with them when writing code, or use the DSL code to get domain experts involved in meaningful validation and reviews (Fowler uses the term "business-readable DSLs" in this case). At the very least you can generate visualizations, reports or even interactive simulators that are suitable for use by domain experts.

Of course, the domain (and the people working in it) must be suitable for formalization, but once you start looking, it is amazing how many domains fall into this category. Insurance contracts, hearing aids and refrigerators are just some examples you maybe didn’t expect. On the other hand, I once helped build a DSL to express enterprise governance and business policies. This effort failed, because the domain was much too vague and too "stomach driven" for it to be formalizable.

2.5.7 Productive Tooling

In contrast to libraries, frameworks, and internal DSLs (those embedded into a host language and implemented with host language abstractions), external DSLs can come with tools, i.e. IDEs that are aware of the language. This can result in a much improved user experience. Static analyses, code completion, visualizations, debuggers, simulators and all kinds of other niceties can be provided. These features improve the productivity of the users and also make it easier for new team members to become productive34.

34 JetBrains once reported the following about their webr and dnq Java extensions for web applications and database persistence: "Experience shows that the language extensions are easier to learn than J2EE APIs. As an experiment, a student who had no experience in web development was tasked to create a simple accounting application. He was able to produce a web application with sophisticated Javascript UI in about 2 weeks using the webr and dnq languages."

2.5.8 No Overhead

If you are generating source code from your DSL program (as opposed to interpreting it) you can use domain-specific abstractions without paying any runtime overhead, because the generator, just like a compiler, can remove the abstractions and generate efficient code. And it generates the same low-overhead code, every time, automatically. This is very useful in cases where performance, throughput or resource efficiency is a concern (e.g. in embedded systems, but also in the cloud, where you run many, many processes in server farms; energy consumption is an issue these days).
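The difference between interpreting and generating can be sketched with a toy expression language. The tuple-based AST format and all function names below are assumptions made for this illustration, and the generated target is Python source text only to keep the example self-contained; a real generator would typically target C, Java or similar. The interpreter walks the tree on every evaluation, whereas the generator translates the tree once into plain code in which the DSL abstractions no longer exist.

```python
# A tiny "DSL program" as a nested tuple AST (hypothetical format):
# ("num", value) | ("var", name) | ("add", left, right) | ("mul", left, right)
PROGRAM = ("add", ("mul", ("var", "x"), ("num", 3)), ("num", 4))


def interpret(node, env):
    """Walk the AST on every evaluation; the abstraction stays at runtime."""
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "var":
        return env[node[1]]
    left, right = interpret(node[1], env), interpret(node[2], env)
    # For brevity, anything that is not "add" is treated as "mul".
    return left + right if kind == "add" else left * right


def generate(node):
    """Translate the AST once into plain Python source text."""
    kind = node[0]
    if kind == "num":
        return repr(node[1])
    if kind == "var":
        return node[1]
    operator = "+" if kind == "add" else "*"
    return f"({generate(node[1])} {operator} {generate(node[2])})"


def compile_program(node, parameter):
    """Build a function from generated source; no AST remains at runtime."""
    source = f"def run({parameter}):\n    return {generate(node)}\n"
    namespace = {}
    exec(source, namespace)
    return namespace["run"], source


run, source = compile_program(PROGRAM, "x")
print(source)
print(run(5), interpret(PROGRAM, {"x": 5}))
```

Both paths compute the same value, but the generated `run` function contains only ordinary arithmetic, which is the sense in which the generator removes the abstractions.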

2.5.9 Platform Independent/Isolation

In some cases, using DSLs can abstract from the underlying technology platform35. Using DSLs and an execution engine makes the application logic expressed in the DSL code independent of the target platform36. It is absolutely feasible to change the execution engine and the target platform "underneath" a DSL to execute the code on a new platform. Portability is enhanced, as is maintainability, because DSLs support separation of concerns – the concerns expressed in the DSL (e.g. the application logic) are separated from implementation details and target platform specifics.

35 Remember OMG’s MDA? They introduced the whole model-driven approach primarily as a means to abstract from platforms (probably a consequence of their historical focus on interoperability). There are cases where interoperability is the primary focus: cross-platform mobile development is an example. However, in my experience, platform independence is often just one driver among many, and it is typically not the most important one.
36 This is not necessarily true for architecture DSLs and utility DSLs, whose abstractions may be tied relatively closely to the concepts provided by the target platform.

Often no single one of the advantages would drive you to using a DSL. But in many cases you can benefit in multiple ways, so the sum of the benefits is often worth the (undoubtedly necessary) investment in the approach.

2.6 Challenges

There is no such thing as a free lunch. This is also true for DSLs. Let’s look at the price you have to pay to get all the benefits described above.

2.6.1 Effort of Building the DSLs

Before a DSL can be used, it has to be built37. If the DSL has to be developed as part of a project, the effort of building it has to be factored into the overall cost-benefit analysis. For technical DSLs38, there is a huge potential for reuse (e.g. a large class of web or mobile applications can be described with the same DSL), so here the investment is easily justified. On the other hand, application domain-specific DSLs (e.g. pension plan specifications) are often very narrow in focus, so the investment in building them is harder to justify at first glance. But these DSLs are often tied to the core know-how of a business and provide a way to describe this knowledge in a formal, uncluttered, portable and maintainable way. That should be a priority for any business that wants to remain relevant! In both cases, modern tools reduce the effort of building DSLs considerably, making it a feasible approach in more and more projects.

37 If it has been built already, before a project, then using the DSL is obviously useful.
38 Technical DSLs are those that address aspects of software engineering such as components, state machines or persistence mappings, not application domain DSLs for a technical domain (such as automotive software or machine control).

There are three factors that make DSL creation cheaper: deep knowledge about a domain, experience of the DSL developer and productivity of the tools. This is why focussing on tools in the context of DSLs is important.

2.6.2 Language Engineering Skills

Building DSLs is not rocket science. But to do it well requires experience and skill: it is likely that your first DSL will not be great. Also, the whole language/compiler thing has a bad reputation that mainly stems from "ancient times" when tools like lex/yacc, ANTLR, C and Java were the only ingredients you could use for language engineering. Modern language workbenches have changed this situation radically, but of course there is still a learning curve. In addition, the definition of good languages – independent of tooling and technicalities – is not made simpler by better tools: how do you find out which abstractions need to go into the languages? How do you create "elegant" languages? The book provides some guidance in Part II, DSL Design, but it nevertheless requires a significant element of experience and practice that can only be built up over time.

2.6.3 Process Issues

Using DSLs usually leads to a split of work: some people build the languages, others use them. Sometimes the languages have been built already when you start a development project; sometimes they are built as part of the project. In the latter case especially, it is important that you establish some kind of process for how language users interact with language developers and with domain experts39. Just like any other situation in which one group of people creates something that another group of people relies on, this can be a challenge40.

39 For some DSLs, the users of the DSL are the same people who build the DSL (often true for utility DSLs). This is great because there is no communication overhead or knowledge gap between the domain expert (you) and the DSL developer (you). It is a good idea to choose such a DSL as your first DSL.
40 This is not much different for languages than for any other shared artifact (frameworks, libraries, tools in general), but it also isn’t any simpler and needs to be addressed.

2.6.4 Evolution and Maintenance

A related issue is language evolution and maintenance. Again, just like any other asset you develop for use in multiple contexts, you have to plan ahead (people, cost, time, skills) for the maintenance phase. A language that is not actively maintained and evolved will become outdated over time and will become a liability. During the phase where you introduce DSLs into an organization especially, rapid evolution based on the requirements of users is critical to build trust in the approach41.

41 While this is an important aspect, once again it is no worse for DSLs than it is for any other shared, reused asset.

2.6.5 DSL Hell

Once development of DSLs becomes technically easy, there is a danger that developers create new DSLs instead of searching for and learning existing DSLs. This may end up as a large set of half-baked DSLs, each covering related domains, possibly with overlap, but still incompatible. The same problem can arise with libraries, frameworks or tools. They can all be addressed by governance and effective communication in the team42.

42 It also helps if DSLs are incrementally extensible, so an existing language can be extended instead of creating a completely new language.

2.6.6 Investment Prison

The more you invest in reusable artifacts, the more productive you become. However, you may also get locked into a particular way of doing things. Radically changing your business may seem unattractive once you’ve become very efficient at the current one. It becomes expensive to "move outside the box". To avoid this, keep an open mind and be willing to throw things away and come up with more appropriate solutions.

With the advent of the digital age, we all know of many businesses that went bankrupt because they had stuck to a dying business model. Maybe they just couldn’t see that things would change, but maybe it was because they were so efficient at what they were doing, they couldn’t invest into new ideas or approaches for fear of cannibalizing their mainstream business.

2.6.7 Tool Lock-in

Many of the DSL tools are open source, so you don’t get locked into a vendor. But you will still get locked into a tool. While it is feasible to exchange model data between tools, there is essentially no interoperability between DSL tools themselves, so the investments in DSL implementation are specific to a single tool.

2.6.8 Cultural Challenges

Statements like "Language Engineering is complicated", "Developers want to program, not model", "Domain experts aren’t programmers" and "If we model, we use the UML standard" are often-overheard prejudices that hinder the adoption of DSLs. I hope to provide the factual and technical arguments for fighting these in this book. But an element of cultural bias may still remain. You may have to do some selling and convincing that is relatively independent of the actual technical arguments. Problems like this always arise if you want to introduce something new into an organization, especially if it changes significantly what people do, how they do it or how they interact. A lot has been written about introducing new ideas into organizations, and I recommend reading Fearless Change by Rising and Manns43 if you’re the person who is driving the introduction of DSLs into your organization.

43 L. Rising and M. L. Manns. Fearless Change: Patterns for Introducing New Ideas: Introducing Patterns into Organizations. Addison-Wesley, 2004

Of course there are other things that can go wrong: your DSL or generator might be buggy, resulting in buggy systems. You might have the DSL developed by external parties, giving away core domain knowhow. The person who built the DSL may leave the company. However, these things are not specific to DSLs: they can happen with anything, so we don’t address them as challenges in the context of DSLs specifically.

Is it worth it?

Should you use DSLs? The only realistic answer is: it depends. With this book I aim to give you as much help as possible. The better you understand the topic, the easier it is to make an informed decision. In the end, you have to decide for yourself, or maybe ask for the help of people who have done it before.

Let us look at when you should not use DSLs. If you don’t understand the domain you want to write a DSL for, or if you don’t have the means to learn about it (e.g. access to somebody who knows the domain), you’re in trouble. You will identify the wrong abstractions, miss the expectations of your future users and generally have to iterate a lot to get it right, making the development expensive44. Another sign of problems is this: if you build your DSL iteratively and, over time, the changes requested by the domain experts don’t become fewer, smaller and more focused on detailed points, then you know you are in trouble, because it seems there is no common understanding about the domain. It is hard to write a DSL for a set of stakeholders who can’t agree on what the domain is all about.

44 If everyone is aware of this, then you might still want to try to build a language as a means of building the understanding about the domain. But this is risky, and should be handled with care.

Another problem is an unknown target platform. If you don’t know how to implement a program in the domain on the target platform manually, you’ll have a hard time implementing an execution engine (generator or interpreter). You might want to consider writing (or inspecting) a couple of representative example applications to understand the patterns that should go into the execution engine.

DSLs and their tooling are sophisticated software programs themselves. They need to be designed, tested, deployed and documented. So a certain level of general software development proficiency is a prerequisite. If you are struggling with unit testing, software design or continuous builds, then you should probably master these challenges before you address DSLs. A related topic is the maturity of the development process. The fact that you introduce additional dependencies (in the form of a supplier-consumer relationship between DSL developers and DSL users) into your development team requires that you know how to track issues, handle version management, do testing and quality assurance and document things in a way accessible to the target audience. If your development team lacks this maturity, you might want to consider first introducing those aspects into the team before you start using DSLs in a strategic way – although the occasional utility DSL is the obvious exception.

2.7 Applications of DSLs

So far we have covered some of the basics of DSLs, as well as the benefits and challenges. This section addresses those aspects of software engineering in which DSLs have been used successfully. Part IV of the book provides extensive treatment of most of these.

2.7.1 Utility DSLs

One use of DSLs is simply as utilities for developers. A developer, or a small team of developers, creates a small DSL that automates a specific, usually well-bounded aspect of software development. The overall development process is not based on DSLs, it’s a few developers being creative and simplifying their own lives45.

45 Often, these DSLs serve as a "nice front end" to an existing library or framework, or automate a particularly annoying or intricate aspect of software development in a given domain.

Examples include the generation of array-based implementations for state machines, any number of interface/contract definitions from which various derived artifacts (classes, WSDL, factories) are generated, or tools that set up project and code skeletons for given frameworks (as exemplified in Rails’ and Roo’s scaffolding). The Jnario language for behavior-driven development is discussed as an example of a utility DSL in Chapter 19.
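To give a feel for the first of these examples, here is a minimal Python sketch of a utility generator that turns a declarative list of transitions into an array-based state machine implementation and renders it as C source text. The function names, the door example and the convention of using -1 for "no transition" are all assumptions made for illustration.

```python
def build_transition_table(states, events, transitions):
    """Turn declarative transitions into a table indexed by
    [state index][event index]; -1 marks 'no transition'."""
    state_index = {name: i for i, name in enumerate(states)}
    event_index = {name: i for i, name in enumerate(events)}
    table = [[-1] * len(events) for _ in states]
    for source, event, target in transitions:
        table[state_index[source]][event_index[event]] = state_index[target]
    return table


def render_c_table(name, table):
    """Render the table as C source text for a two-dimensional int array."""
    rows = ",\n".join(
        "    { " + ", ".join(str(cell) for cell in row) + " }" for row in table
    )
    columns = len(table[0]) if table else 0
    return f"static const int {name}[{len(table)}][{columns}] = {{\n{rows}\n}};\n"


# Hypothetical example machine: a door that can be opened, closed and locked.
STATES = ["closed", "open", "locked"]
EVENTS = ["open_door", "close_door", "lock", "unlock"]
TRANSITIONS = [
    ("closed", "open_door", "open"),
    ("open", "close_door", "closed"),
    ("closed", "lock", "locked"),
    ("locked", "unlock", "closed"),
]

table = build_transition_table(STATES, EVENTS, TRANSITIONS)
print(render_c_table("door_table", table))
```

The four readable transition triples are the "DSL program"; the index bookkeeping that is tedious and error-prone to maintain by hand is left to the generator.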

2.7.2 Architecture DSLs

A larger-scale use of DSLs is to use them to describe the architecture (components, interfaces, messages, dependencies, processes, shared resources) of a (larger) software system or platform. In contrast to using existing architecture modeling languages (such as UML or the various existing architecture description languages (ADLs)), the abstractions in an architecture DSL can be tailored specifically to the abstractions relevant to the particular platform or system architecture. Much more meaningful analyses and generators are possible in this way. From the architecture models expressed in the DSL, code skeletons are generated into which manually written application code is inserted. The generated code usually handles the integration with the runtime infrastructure. Often, these DSLs also capture non-functional constraints such as timing or resource consumption. Architecture DSLs are usually developed during the architecture exploration phase of a project. They can help to ensure that the system architecture is consistently implemented by a potentially large development team.

For example in AUTOSAR46, the architecture is specified in models, then the complete communication middleware for a distributed component infrastructure is generated. Examples in embedded systems in general abound: I have used this approach for a component architecture in software-defined radio, as well as for factory automation systems, in which the distributed components had to "speak" an intricate protocol whose handlers could be generated from a concise specification. Finally, the approach can also be used well in enterprise systems that are based on a multi-tier, database-based distributed server architecture. Middleware integration, server configuration and build scripts can often be generated from relatively concise models.

46 AUTOSAR is an architectural standard for automotive software development.

We discuss an example architecture DSL for distributed, component-based systems as one of the case studies in Part II of the book, and also in Chapter 18.

2.7.3 Full Technical DSLs

For some domains, DSLs can be created that don’t just embody the architectural structure of the systems, but their complete application logic as well, so that 100% of the code can be generated. DSLs like these often consist of several language modules that play together to describe all aspects of the underlying system. I emphasize the word "technical", since these DSLs are used by developers, in contrast to application domain DSLs.

Examples include DSLs for some types of Web application, DSLs for mobile phone apps, as well as DSLs for developing state-based or dataflow-based embedded systems. As an example of this class of DSLs we discuss mbeddr, a set of extensions to C for embedded software development, in Part II and in Chapter 20.

2.7.4 Application Domain DSLs

In this case the DSLs describe the core business logic of an application system independent of its technical implementation. These DSLs are intended to be used by domain experts, usually non-programmers. This leads to more stringent requirements regarding notation, ease of use and tool support. These also typically require more effort in building the language, since a "messy" application domain first has to be understood, structured and possibly "re-taught" to the domain experts47.

47 In contrast, technical DSLs are often much easier to define, since they are guided very much by existing formal artifacts (architectures, frameworks, middleware infrastructures).

Examples include DSLs for describing pension plans, a DSL for describing the cooling algorithms in refrigerators, a DSL for configuring hearing aids or DSLs for insurance mathematics. We discuss the pension plan example in Part II, and discuss a DSL for defining health monitoring applications in Chapter 22.

2.7.5 DSLs in Requirements Engineering

A related topic to application domain DSLs is the use of DSLs in the context of requirements engineering. Here, the focus of the languages is not so much on automatic code generation, but rather on a precise and checkable complete description of requirements. Traceability to other artifacts is important. Often, the DSLs need to be embedded or otherwise connected to prose text, to integrate them with "classical" requirements approaches.

Examples include a DSL for helping with the trade analysis for satellite construction, or pseudo-structured natural language DSLs that assume some formal meaning for domain entities and terms such as should or must[48]. We discuss the connection of DSLs and requirements engineering in Chapter 17.

[48] The latter kind of DSLs, also called Controlled Natural Language, is quite different from the kinds of DSLs we cover in this book. I will not cover it any further.

2.7.6 DSLs used for Analysis

Another category of DSL use is as the basis for analysis, checking and proofs. Of course, checking plays a role in all use cases for DSLs – you want to make sure that the models you release for downstream use are "correct" in a sense that goes beyond what the language syntax already enforces. But in some cases, DSLs are used to express concerns in a formalism that lends itself to formal verification (safety, scheduling, concurrency, resource allocation). While code generation is often a part of it, code generation is not the driver for the use of this type of DSL. This is especially relevant in complex technical systems, or in systems engineering, where we look beyond only software and consider a system as a whole (including mechanical, electric/electronic or fluid-dynamic aspects). Sophisticated mathematical formalisms are used here – I will cover this aspect only briefly in this book, as part of the discussion of semantics (Section 4.3).

2.7.7 DSLs used in Product Line Engineering

At its core, product line engineering (PLE) is mainly about expressing, managing and then later binding variability between a set of related products. Depending on the kind of variability, DSLs are a very good way of capturing the variability, and later, in the DSL code, of describing a particular variant. Often, but not always, these DSLs are used more for configuration than for "creatively constructing" a solution to a problem.

Examples include the specification of refrigerator models as the composition of the functional layout of a refrigerator and a cooling algorithm, injected with specific parameter values. We look at the relationship of DSLs and PLE in Chapter 21.

2.8 Differentiation from other Works and Approaches

2.8.1 Internal versus External DSLs

Internal DSLs are DSLs that are embedded into general-purpose languages. Usually, the host languages are dynamically typed and the implementation of the DSL is based on meta programming (Scala is an exception here, since it is a statically typed language with type inference). The difference between an API and an internal DSL is not always clear, and there is a middle ground called a Fluent API. Let’s look at the three:

• We all know what a regular object-oriented API looks like. We instantiate an object and then call a sequence of methods on the object. Each method call is packaged as a separate statement.

• A fluent API essentially chains method calls. Each method call returns an object on which subsequent calls are possible. This results in more concise code, and, more importantly, by returning suitable intermediate objects from method calls, a sequence of valid subsequent method calls can be enforced (almost like a grammar – this is why it could be considered a DSL). Here is a Java/EasyMock example, taken from Wikipedia:

  Collection coll = EasyMock.createMock(Collection.class);
  EasyMock.expect(coll.remove(null))
          .andThrow(new NullPointerException())
          .atLeastOnce();

• Fluent APIs are chained sequences of method calls. The syntax makes this obvious, and there is typically no way to change this syntax, as a consequence of the inflexible syntax rules of the host language. Host languages with more flexible syntax can support internal DSLs that look much more like actual, custom languages. Here is a Ruby on Rails example (taken from rubyonrails.org), which defines a data structure (and, implicitly, a database table) for a blog post:

  class Post < ActiveRecord::Base
    validates :name, :presence => true
    validates :title, :presence => true,
                      :length => { :minimum => 5 }
  end
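The step from a regular API to a fluent API can be sketched in a few lines. The following Python sketch is not EasyMock; the names expect, PendingExpectation and Expectation are hypothetical and chosen only for illustration. It shows how returning a different intermediate object from each call restricts which call may come next, which is the grammar-like property mentioned above.

```python
class Expectation:
    """Final stage: a behaviour has been recorded for the call."""

    def __init__(self, call, error):
        self.call = call
        self.error = error
        self.min_times = 0

    def at_least_once(self):
        # Returning self allows further multiplicity calls to be chained.
        self.min_times = 1
        return self


class PendingExpectation:
    """Intermediate stage: a call was named, a behaviour must follow."""

    def __init__(self, call):
        self.call = call

    def and_throw(self, error):
        # Returning a different type restricts which calls are valid next.
        return Expectation(self.call, error)


def expect(call):
    """Entry point of the (hypothetical) fluent API."""
    return PendingExpectation(call)


# A valid chain: name the call, then its behaviour, then its multiplicity.
expectation = expect("coll.remove(None)").and_throw(KeyError("missing")).at_least_once()
```

Because PendingExpectation has no at_least_once method, writing expect(...).at_least_once() is rejected, whereas the chain on the last line is accepted. In a dynamically typed host language such as Python, that rejection only shows up at runtime.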

While I recognize the benefits of fluent APIs and internal DSLs, I think they are fundamentally limited by the fact that an important ingredient is missing: IDE support[50]. In classical internal DSLs, the IDE is not aware of the grammar, constraints or other properties of the embedded DSL beyond what the type system can offer, which isn’t much in the case of dynamically typed languages. Since I consider IDE integration an important ingredient to DSL adoption, I decided not to cover internal DSLs in this book[51].

[50] Note that with modern language workbenches, you can also achieve language extension or embedding, resulting in the same (or even a somewhat cleaner) syntax. However, these extensions and embeddings are real language extensions (as opposed to meta programs) and do come with support for static constraint checking and IDE support. We cover this extensively in the book.

[51] In addition, I don’t have enough real-world experience with internal DSLs to be able to talk about them in a book.

2.8.2 Compiler Construction

Language definition, program validation and transformation or interpretation are obviously closely related to compiler construction – even though I don’t make this connection explicit in the book all the time. And many of the techniques that are traditionally associated with compiler construction are applicable to DSLs. However, there are also significant differences. The tools for building DSLs are more powerful and convenient and also include IDE definition[52], a concern not typically associated with compiler construction. Compilers also typically generate machine code, whereas DSLs typically transform to source code in a general-purpose language. Finally, a big part of building compilers is the implementation of optimizations (in the code generator or interpreter), a topic that is not as prominent in the context of DSLs. I recommend reading the "Dragon Book"[53] or Appel’s Modern Compiler Implementation in Java[54].

[52] There are universities that teach compiler construction based on language workbenches and DSLs.

[53] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison Wesley, August 2006

[54] A. W. Appel. Modern Compiler Implementation in Java. Cambridge University Press, 1998

2.8.3 UML

So what about the Unified Modeling Language – UML? I decided not to cover UML in this book. I focus on mostly textual DSLs and related topics. UML does show up peripherally in a couple of places, but if you are interested in UML-based MDSD, then this book is not for you. For completeness, let us briefly put UML into the context of DSLs.

UML is a general-purpose modeling language. Like Java or Ruby, it is not specific to any domain (unless you consider software development in general to be a domain, which renders the whole DSL discussion pointless), so UML itself does not count as a DSL[55]. To change this, UML provides profiles, which are a (limited and cumbersome) way to define variants of UML language concepts and to effectively add new ones. It depends on the tool you choose how well this actually works and how far you can adapt the UML syntax and the modeling tool as part of profile definition. In practice, most people use only a very small part of UML, with the majority of concepts defined via profiles. It is my experience that because of that, it is much more productive, and often less work, to build DSLs with "real" language engineering environments, as opposed to using UML profiles.

So is UML used in MDSD? Sure. People build profiles and use UML-based DSLs, especially in large organizations where the (perceived) need for standardization is paramount[56].

When I wrote my "old" book on MDSD, UML played an important role. At the time, I really did use UML a lot for projects involving models and code generation. Over the years, the importance of UML has diminished significantly (in spite of the OMG’s efforts to popularize both UML and MDA), mainly because of the advent of modern language workbenches.

[55] UML can be seen as an integrated collection of DSLs that describe various aspects of software systems: class structure, state based behavior, or deployment. However, these DSLs still address the overall domain of software.

[56] It is interesting to see that even these sectors increasingly embrace DSLs. I know of several projects in the aerospace/defense sector where UML-based modeling approaches were replaced with very specific and much more productive DSLs. It is also interesting to see how sectors define their own standard languages. While I hesitate to call it a DSL, the automotive industry is in the process of standardizing on AUTOSAR and its modeling languages.

2.8.4 Graphical versus Textual

This is something of a religious war, akin to the statically-typed versus dynamically-typed languages debate. Of course, there is a use for both flavors of notation, and in many cases, a mix is the best approach. In a number of cases, the distinction is even hard to make: tables or mathematical and chemical notations are both textual and graphical in nature[57].

However, this book does have a bias towards textual notations, for several reasons. I feel that the textual format is more generally useful, that it scales better and that the necessary tools take (far) less effort to build. In the vast majority of cases, starting with textual languages is a good idea – graphical visualizations or editors can be built on top of the meta model later, if a real need has been established. If you want to learn more about graphical DSLs, I suggest you read Kelly and Tolvanen’s book Domain Specific Modeling[58].

[57] The ideal tool will allow you to use and mix all of them, and we will see in the book how close existing tools come to this ideal.

[58] S. Kelly and J.-P. Tolvanen. Domain-Specific Modeling: Enabling Full Code Generation. Wiley-IEEE Computer Society Press, March 2008

Part II

DSL Design


This part of the book has been written together with Eelco Visser of TU Delft. You can reach him at [email protected].

Throughout this part of the book we refer back to the five case studies introduced in Part I of the book (Section 1.11). We use the following labels:

Component Architecture: This refers to the component architecture case study described in Section 1.11.1.

Refrigerators: This refers to the refrigerator configuration case study described in Section 1.11.2.

mbeddr C: This refers to the mbeddr.com extensible C case study described in Section 1.11.3.

Pension Plans: This refers to the pension plans case study described in Section 1.11.4.

WebDSL: This refers to the WebDSL case study described in Section 1.11.5.

Note that in this part of the book the examples will only be used to illustrate DSL design and the driving design decisions. Part III of the book will then discuss the implementation aspects.

Some aspects of DSL design have been formalized with mathematical formulae. These are intended as an additional means of explaining some of the concepts. Formulae are able to state properties of programs and languages in an unambiguous way. However, I want to emphasize that reading or understanding the formulae is not essential for understanding the language design discussion. So if you’re not into mathematical formulae, just ignore them.

This part consists of three chapters. In Chapter 3 we introduce important terms and concepts including domain, model purpose and the structure of programs and languages. In Chapter 4 we discuss a set of seven dimensions that guide the design of DSLs: expressivity, coverage, semantics, separation of concerns, completeness, language modularization and syntax. Finally, in Chapter 5 we look at well-known structural and behavioral paradigms (such as inheritance or state based behaviour) and discuss their applicability to DSLs.

3 Conceptual Foundations

This chapter provides the conceptual foundations for the discussion of the design dimensions. It consists of three sections. The first one, Programs, Languages and Domains, defines some of the terminology around DSL design we will use in the rest of this chapter. The second section briefly addresses the Purpose of programs as a way of guiding their design. And the third section briefly introduces parser-based and projectional editing, since some design considerations depend on this rather fundamental difference in DSL implementation.

3.1 Programs, Languages and Domains

Domain-specific languages live in the realm of programs, languages and domains. So we should start by explaining what these things are. We will then use these concepts throughout this part of the book.

As part of this book’s treatment of DSLs, we are primarily interested in computation, i.e. we aim at creating executable software[1]. So let’s first consider the relation between programs and languages. Let’s define $P$ to be the set of all conceivable programs. A program $p$ in $P$ is the conceptual representation of some computation that runs on a universal computer (Turing machine). A language $l$ defines a structure and notation for expressing or encoding programs from $P$. Thus, a program $p$ in $P$ may have an expression in $l$, which we will denote as $p_l$.

[1] This is opposed to just communicating among humans or describing complete systems.

There can be several languages $l_1$ and $l_2$ that express the same conceptual program $p$ in different ways $p_{l_1}$ and $p_{l_2}$ (factorial can be expressed in Java and Lisp, for example). There may even be multiple ways to express the same program in a single language $l$ (in Java, factorial can be expressed via recursion or with a loop). A transformation $T$ between languages $l_1$ and $l_2$ maps programs from their $l_1$ encoding to their $l_2$ encoding, i.e. $T(p_{l_1}) = p_{l_2}$.

Notice that this transformation only changes the language used to express the program. The conceptual program does not change. In other words, the transformation preserves the semantics of $p_{l_1}$. We will come back to this notion as we discuss semantics in more detail in Section 4.3.

It may not be possible to encode all programs from $P$ in a given language $l$. We denote as $P_l$ the subset of $P$ that can be expressed in $l$. (Turing-complete languages can by definition express all of $P$.) More importantly, some languages may be better at expressing certain programs from $P$: the program may be shorter, more readable or more analyzable.

Pension Plans: The pension plan language is very good at representing pension calculations, but cannot practically be used to express other software. For example, user-defined data structures and loops are not supported.
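The remark about factorial can be made concrete. The sketch below uses Python rather than Java or Lisp (an assumption made purely for brevity) and gives two encodings of the same conceptual program in a single language, one recursive and one with a loop. The function names are illustrative.

```python
def factorial_recursive(n: int) -> int:
    """Factorial encoded via recursion."""
    return 1 if n <= 1 else n * factorial_recursive(n - 1)


def factorial_loop(n: int) -> int:
    """The same conceptual program, encoded with a loop."""
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result


# Compare both encodings on a few sample inputs.
same_on_samples = all(factorial_recursive(n) == factorial_loop(n) for n in range(10))
```

Agreement on a handful of inputs only illustrates that both encodings represent the same conceptual program; it is not a proof of semantic equivalence.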

■ Domains What are domains? We have seen one way of defining domains in the previous paragraph. When we said that a language $l$ covers a subset of $P$, we can simply call this subset the domain covered with $l$. However, this is not a very useful approach, since it equates the scope of a domain trivially with the scope of a language (the subset of $P$ in that domain, $P_D$, is equal to the subset of $P$ we can express with a language $l$, $P_l$). We cannot ask questions like: "Does the language adequately cover the domain?", since it always does, by definition.

There are two more useful approaches. In the inductive or bottom-up approach we define a domain in terms of existing software used to address a particular class of problems or products. That is, a domain $D$ is identified as a set of programs with common characteristics or similar purpose. Notice how at this point we do not imply a special language to express them. They could be expressed in any Turing-complete language. Often such domains do not exist outside the realm of software.

An especially interesting case of the inductive approach is where we define a domain as a subset of programs written in a specific language, $P_l$, instead of the more general set $P$. In this case we can often clearly identify the commonalities among the programs in the domain, in the form of their consistent use of a set of domain-specific patterns or idioms[2]. This makes building a DSL for $D$ relatively simple, because we know exactly what the DSL has to cover, and we know what code to generate from DSL programs.

[2] Some people have argued for a long time that the need to use idioms or patterns in a language is a smell, and should be understood as hints at missing language features: c2.com/cgi/wiki?AreDesignPatternsMissingLanguageFeatures

mbeddr C: The domain of this DSL has been defined bottom-up. Based on idioms commonly employed when using C for embedded software development, linguistic abstractions have been defined that provide a "shorthand" for those idioms. These linguistic abstractions form the basis of the language extensions.

The above example can be considered relatively general – the domain of embedded software development is relatively broad. In contrast, a domain may also be very specific, as is illustrated by the refrigerator case study.

Refrigerators: The cooling DSL is tailored specifically towards expressing refrigerator cooling programs for a very specific organization. No claim is made for broad applicability of the DSL. However, it perfectly fits into the way cooling algorithms are described and implemented in that particular organization.

The second approach for defining a domain is deductive or top-down. In this approach, a domain is considered a body of knowledge about the real world, i.e. outside the realm of software. From this perspective, a domain $D$ is a body of knowledge for which we want to provide some form of software support. $P_D$ is the subset of programs in $P$ that implement interesting computations in $D$. This case is much harder to address using DSLs, because we first have to understand precisely the nature of the domain and identify the interesting programs in that domain.

Pension Plans: The pensions domain has been defined in this way. The customer had been working in the field of old-age pensions for decades and had a detailed understanding of that domain. That knowledge was mainly contained in the heads of pension experts, in pension plan requirements documents, and, to a limited extent, encoded in the source of existing software.

In the context of DSLs, we can ultimately consider a domain $D$ to be defined by a set of programs $P_D$, whether we take the deductive or inductive route. There can be multiple languages in which we can express $P_D$ programs. Possibly, $P_D$ can only be partially expressed in a language $l$ (Figure 3.1).

Figure 3.1: The programs relevant to a domain $P_D$ and the programs expressible with a language $P_L$ are both subsets of the set of all programs $P$. A good DSL has a large overlap with its target domain ($P_L \approx P_D$).

■ Domain-Specific Languages We can now understand the notion of a domain-specific language. A domain-specific language $l_D$ for a domain $D$ is a language that is specialized for encoding programs from $P_D$. That is, $l_D$ is more efficient[3] in representing $P_D$ programs than other languages, and thus, is particularly well suited for $P_D$. It achieves this by using abstractions suitable to the domain, and avoiding details that are irrelevant to programs in $D$ (typically because they are similar in all programs and can be added automatically by the execution engine).

[3] There are several ways of measuring efficiency. The most obvious one is the amount of code a developer has to write to express a problem in the domain: the more concise, the more efficient. We will discuss this in more detail in Section 4.1.

It is of course possible to express programs in $P_D$ with a general-purpose language. But this is less efficient – we may have to write much more code, because a GPL is not specialized to that particular domain. Depending on the expressivity of a DSL, we may also be able to use it to describe programs outside of the $D$ domain[4]. However, this is often not efficient at all, because, by specializing a DSL for $D$, we also restrict its efficiency for expressing programs outside of $D$. This is not a problem as long as we have scoped $D$ correctly. If the DSL actually just covers a subset of $P_D$, and we have to express programs in $D$ for which the DSL is not efficient, we have a problem.

[4] For example, you can write any program with some dialects of SQL.

This leads us to the crucial challenge in DSL design: finding regularity in a non-regular domain and capturing it in a language. Especially in the deductive approach, membership of programs in the domain is determined by a human and is, in some sense, arbitrary. A DSL for the domain hence typically represents an explanation or interpretation of the domain, and often requires trade-offs by under- or over-approximation (Figure 3.2). This is especially true while we develop the DSL: an iterative approach is necessary that evolves the language as our understanding of the domain becomes more and more refined over time. In a DSL $l$ that is adequate for the domain, the sets $P_l$ and $P_D$ are the same.

Figure 3.2: Languages $L_1$ and $L_2$ under-approximate and over-approximate domain $D$.
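The relationship between $P_D$ and $P_l$ can be illustrated with ordinary sets. The sketch below is a toy model under strong assumptions: real program sets are not finite, enumerable collections, and the program names and the coverage function are hypothetical. It only shows what under- and over-approximation mean in set terms.

```python
def coverage(domain: set, language: set) -> dict:
    """Compare the programs of a domain with those a language can express."""
    return {
        "covered": domain & language,   # in the domain and expressible
        "missing": domain - language,   # under-approximation
        "surplus": language - domain,   # over-approximation
    }


# Hypothetical, finite stand-ins for P_D and for two languages l1 and l2.
P_D = {"pension_a", "pension_b", "pension_c"}
P_l1 = {"pension_a", "pension_b"}
P_l2 = {"pension_a", "pension_b", "pension_c", "web_shop"}

report_l1 = coverage(P_D, P_l1)
report_l2 = coverage(P_D, P_l2)
```

Here l1 under-approximates the domain (pension_c cannot be expressed), while l2 over-approximates it (it can also express web_shop, which lies outside the domain).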

■ Domain Hierarchy In the discussion of DSLs and progressively higher abstraction levels, it is useful to consider domains organized in a hierarchy[5], in which higher domains are a subset (in terms of scope) of the lower domains (Fig. 3.3).

[5] In reality, domains are not always as neatly hierarchical as we make it seem here. Domains may overlap, for example. Nonetheless, the notion of a hierarchy is very useful for discussing many of the advanced topics in this book. In terms of DSLs, overlap may be addressed by factoring the common aspects into a separate language module that can be used in both the overlapping domains.

At the bottom we find the most general domain $D_0$. It is the domain of all possible programs $P$. Domains $D_n$, with $n > 0$, represent progressively more specialized domains, where the set of interesting programs is a subset of those in $D_{n-1}$ (abbreviated as $D_{-1}$). We call $D_{+1}$ a subdomain of $D$. For example, $D_{1.1}$ could be the domain of embedded software, and $D_{1.2}$ could be the domain of enterprise software. The progressive specialization can be continued ad infinitum, in principle. For example, $D_{2.1.1}$ and $D_{2.1.2}$ are further subdomains of $D_{1.1}$: $D_{2.1.1}$ could be automotive embedded software and $D_{2.1.2}$ could be avionics software[6].

[6] At the top of the hierarchy we find singleton domains that consist of a single program (a non-interesting boundary case).

Figure 3.3: The domain hierarchy. Domains with higher index are called subdomains of domains with a lower index ($D_1$ is a subdomain of $D_0$). We use just $D$ to refer to the current domain, and $D_{+1}$ and $D_{-1}$ to refer to the relatively more specific and more general ones.

Languages are typically designed for a particular domain $D$. Languages for $D_0$ are called general-purpose languages[7]. Languages for $D_n$ with $n > 0$ become more domain-specific for growing $n$. Languages for a particular $D_n$ can also be used to express programs in $D_{n+1}$. However, DSLs for $D_{n+1}$ may add additional abstractions or remove some of the abstractions found in languages for $D_n$. To get back to the embedded systems domain, a DSL for $D_{1.1}$ could include components, state machines and data types with physical units. A language for $D_{2.1.1}$, automotive software, will retain these extensions, but in addition provide direct support for the AUTOSAR standard and prohibit the use of void* to conform to the MISRA-C standard.

[7] We could define $D_0$ to be those programs expressible with Turing machines, but using GPLs for $D_0$ is a more useful approach for this book.

mbeddr C: The C base language is defined for $D_0$. Extensions for tasks, state machines or components can be argued to be specific to embedded systems, making those sit in $D_{1.1}$. Progressive specialization is possible; for example, a language for controlling small Lego robots sits on top of state machines and tasks. It could be allocated to $D_{2.1.1}$.
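The hierarchy can be modelled as a simple parent relation. This sketch assumes a strict tree of domains, which the text itself notes is a simplification; the Domain class and the can_express helper are hypothetical names. can_express encodes only the statement that a language for $D_n$ can also be used to express programs in its subdomains, and says nothing about how efficient that is.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Domain:
    name: str
    parent: Optional["Domain"] = None  # None marks the most general domain D0

    def level(self) -> int:
        """The index n of D_n: the distance from D0."""
        return 0 if self.parent is None else self.parent.level() + 1

    def is_subdomain_of(self, other: "Domain") -> bool:
        """True if `other` is a (transitive) parent of this domain."""
        current = self.parent
        while current is not None:
            if current == other:
                return True
            current = current.parent
        return False


def can_express(language_domain: Domain, program_domain: Domain) -> bool:
    """A language for a domain also covers programs of its subdomains."""
    return (program_domain == language_domain
            or program_domain.is_subdomain_of(language_domain))


d0 = Domain("all programs")
d1_1 = Domain("embedded software", d0)
d1_2 = Domain("enterprise software", d0)
d2_1_1 = Domain("automotive embedded software", d1_1)
```

With these definitions, a language for d0 (a GPL) can express programs in d2_1_1, whereas a language for d1_2 cannot be assumed to cover them.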

3.2 Model Purpose

We have said earlier that there can be several languages for the same domain. These languages differ regarding the abstractions they make use of. Deciding which abstractions should go into a particular language for $D$ is not always obvious. The basis for the decision is to consider the model purpose. Models[8], and hence the languages to express them, are intended for a specific purpose. Examples of model purpose include automatic derivation of a $D_{-1}$ program, formal analysis and model checking, platform-independent specification of functionality or generation of documentation[9]. The same domain concepts can often be abstracted in different ways, for different purposes. When defining a DSL, we have to identify the different purposes required, and then decide whether we can create one DSL that fits all purposes, or create a DSL for each purpose[10].

[8] As we discuss below, we use the terms program and model as synonyms.

[9] Generation of documentation is typically not the main or sole model purpose, but may be an important secondary one. In general, we consider models that only serve communication among humans to be outside the scope of this book, because they don’t have to be formally defined to achieve their purpose.

[10] Defining several DSLs for a single domain is especially useful if different stakeholders want to express different aspects of the domain with languages suitable to their particular aspect. We discuss this in the section on Viewpoints (Section 4.4).

mbeddr C: The model purpose is the generation of an efficient low-level C implementation of the system, while at the same time providing software developers with meaningful abstractions. Since efficient C code has to be generated, certain abstractions, such as dynamically growing lists or runtime polymorphic dispatch, are not supported even though they would be convenient for the user. The state machines in the statemachines language have an additional model purpose: model checking, i.e. proving certain properties about the state machines (e.g., proving that a certain state is definitely going to be reached after some event occurs). To make this possible, the action code used in the state machines is limited: it is not possible, for example, to read and write the same variable in the same action.

Refrigerators: The model purpose is the generation of efficient implementation code for various different target platforms (different types of refrigerators use different electronics). A secondary purpose is enabling domain experts to express the algorithms and experiment with them using simulations and tests. The DSL is not expected to be used to visualize the actual refrigerator device for sales or marketing purposes.

Pension Plans: The model purpose of the pension DSL is to enable insurance mathematicians and pension plan developers (who are not programmers) to define complete pension plans, and to allow them to check their own work for correctness using various forms of tests. A secondary purpose is the generation of the complete calculation engine for the computing center and the website.
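The restriction on action code mentioned in the mbeddr example lends itself to a small constraint check. How mbeddr actually implements this check is not described here; the Action class and the conflicting_variables function below are hypothetical and only show the kind of rule that a model purpose such as model checking can impose on a language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """An action of a state machine, reduced to the variables it touches."""
    name: str
    reads: frozenset
    writes: frozenset


def conflicting_variables(action: Action) -> frozenset:
    """Variables that the action both reads and writes."""
    return action.reads & action.writes


# Hypothetical actions: the first is acceptable, the second violates the rule.
ok = Action("reset", reads=frozenset({"timeout"}), writes=frozenset({"counter"}))
bad = Action("increment", reads=frozenset({"counter"}), writes=frozenset({"counter"}))
```

A language workbench would typically report a non-empty result as an error on the offending action while the user edits the model.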

The purpose of a DSL may also change over time. Consequently, this may require changes to the abstractions or notations used in the language. From a technical perspective, this is just like any other case of language evolution (discussed in Chapter 6).

3.3 The Structure of Programs and Languages

The discussion above is relatively theoretical, trying to capture somewhat precisely the inherently imprecise notion of domains. Let us now move into the field of language engineering. Here we can describe the relevant concepts in a much more practical way.

■ Concrete and Abstract Syntax Programs can be represented in their abstract syntax and the concrete syntax forms. The concrete syntax is the notation with which the user interacts as he edits a program. It may be textual, symbolic, tabular, graphical or any combination thereof. The abstract syntax is a data structure that represents the semantically relevant data expressed by a program (Fig. 3.4 shows an example of both). It does not contain notational details such as keywords, symbols, white space or positions, sizes and coloring in graphical notations. The abstract syntax is used for analysis and downstream processing of programs. A language definition includes the concrete and the abstract syntax, as well as rules for mapping one to the other.

Figure 3.4: Concrete and abstract syntax for a textual variable declaration. Notice how the abstract syntax does not contain the keyword var or the symbols : and ;.

The abstract syntax is very similar to a meta model in that it represents only a data structure and ignores notation. The two are also different: the abstract syntax is usually automatically derived from a grammar, whereas a meta model is typically defined first, independent of a notation. This means that, while the abstract syntax may be structurally affected by the grammar, the meta model is "clean" and represents purely the structure of the domain. In practice, the latter isn’t strictly true either, since editing and tool considerations typically influence a meta model as well. In this book, we consider the two to be synonyms.

Parser-based systems map the concrete syntax to the abstract syntax. Users interact with a stream of characters, and a parser derives the abstract syntax by using a grammar and mapping rules. Projectional editors go the other way round: user interactions, although performed through the concrete syntax, directly change the abstract syntax. The concrete syntax is a mere projection (that looks and feels like text when a textual projection is used). No parsing takes place. Spoofax and Xtext are parser-based tools, MPS is projectional.

While concrete syntax modularization and composition can be a challenge and requires a discussion of textual concrete syntax details, we will illustrate most language design concerns based on the abstract syntax.

The abstract syntax of a program is primarily a tree of program elements. Each element is an instance of a language concept, or concept for short. A language is essentially a set of concepts (we’ll come back to this below). Every element (except the root) is contained by exactly one parent element. Syntactic nesting of the concrete syntax corresponds to a parent-child relationship in the abstract syntax. There may also be any number of non-containing cross-references between elements, established either directly during editing (in projectional systems) or by a name resolution (or linking) phase that follows parsing and tree construction.

Figure 3.5: A program is a tree of program elements, with a single root element.
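To illustrate the parser-based direction described above, the following sketch maps the concrete syntax of a variable declaration to an abstract syntax node. Figure 3.4 is not reproduced here, so the input form var x: int; is an assumption derived from the caption, and a regular expression stands in for a real grammar. VariableDeclaration and parse are illustrative names.

```python
import re
from dataclasses import dataclass


@dataclass
class VariableDeclaration:
    """Abstract syntax: only the semantically relevant data is kept."""
    name: str
    type_name: str


# Concrete syntax: keyword `var`, a name, `:`, a type name and `;`.
DECLARATION = re.compile(r"\s*var\s+(\w+)\s*:\s*(\w+)\s*;\s*")


def parse(text: str) -> VariableDeclaration:
    """Derive the abstract syntax from the concrete syntax."""
    match = DECLARATION.fullmatch(text)
    if match is None:
        raise SyntaxError(f"not a variable declaration: {text!r}")
    return VariableDeclaration(name=match.group(1), type_name=match.group(2))


first = parse("var x: int;")
second = parse("var   x :int ;")
```

Both inputs yield the same VariableDeclaration: the keyword, the symbols and the white space are notational detail that does not reach the abstract syntax.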

■ Fragments A program may be composed from several program fragments. A fragment is a standalone tree, a partial program. Conversely, a program is a set of fragments connected by references (discussed below). $E_f$ is the set of program elements in a fragment $f$.

■ Languages A language $l$ consists of a set of language concepts $C_l$ and their relationships[11]. We use the term concept to refer to all aspects of an element of a language, including concrete syntax, abstract syntax, the associated type system rules and constraints as well as some definition of its semantics. In a fragment, each element $e$ is an instance of a concept $c$ defined in some language $l$.

[11] In the world of grammars, a concept is essentially a Nonterminal. We will discuss the details about grammars in the implementation section of this book.

mbeddr C: In C, the statement int x = 3; is an instance of the LocalVariableDeclaration concept. int is an instance of IntType, and the 3 is an instance of NumberLiteral.

Figure 3.6: A fragment is a program tree that stands for itself and potentially references other fragments.

Figure 3.7: A language is a set of concepts.

Figure 3.8: The statement int x = 3; is an instance of the LocalVariableDeclaration. co returns the concept for a given element.

Functions

We define the concept-of function co to return the concept of which a program element is an instance: co ⇒ element → concept (see Fig. 3.8). Similarly we define the language-of function lo to return the language in which a given concept is defined: lo ⇒ concept → language. Finally, we define a fragment-of function fo that returns the fragment that contains a given program element: fo ⇒ element → fragment (Fig. 3.9).

Relationships

We also define the following sets of relationships between program elements. $Cdn_f$ is the set of parent-child relationships in a fragment $f$. Each $c \in Cdn_f$ has the properties parent and child (see Fig. 3.10; $Cdn_f$ are all the parent-child "lines" in the picture).

mbeddr C: In int x = 3; the local variable declaration is the parent of the type and the init expression 3. The concept LocalVariableDeclaration defines the containment relationships type and init, respectively.

Figure 3.9: fo returns the fragment for a given element.

$Refs_f$ is the set of non-containing cross-references between program elements in a fragment $f$. Each reference $r$ in $Refs_f$ has the properties from and to, which refer to the two ends of the reference relationship (see Fig. 3.10).

mbeddr C: For example, in the x = 10; assignment, x is a reference to a variable of that name, for example, the one declared in the previous example paragraph. The concept LocalVariableRef has a non-containing reference relationship var that points to the respective variable.

Figure 3.10: fo returns the fragment for a given element.

Finally, we define an inheritance relationship that applies the Liskov Substitution Principle (LSP) to language concepts. The LSP states that,

In a computer program, if S is a subtype of T, then objects of type T may be replaced with objects of type S (i.e., objects of type S may be substitutes for objects of type T) without altering any of the desirable properties of that program (correctness, task performed, etc.)

The LSP is well known in the context of object-oriented programming. In the context of language design it implies that a concept $c_{sub}$ that extends another concept $c_{super}$ can be used in places where an instance of $c_{super}$ is expected. $Inh_l$ is the set of inheritance relationships for a language $l$. Each $i \in Inh_l$ has the properties super and sub.

mbeddr C: The LocalVariableDeclaration introduced above extends the concept Statement. This way, a local variable declaration can be used wherever a Statement is expected, for example, in the body of a function, which is a StatementList.

Figure 3.11: Concepts can extend other concepts. The base concept may be defined in a different language.
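The definitions so far can be pictured as a small data model. The following is a minimal sketch in Python, not the representation used by any of the tools discussed in this book: the class names Language, Concept and Element, the helper is_a, and the choice to identify a fragment by its root element are all assumptions made for illustration. Only co, lo and fo and the concept names of the mbeddr C example come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Language:
    name: str


@dataclass(frozen=True)
class Concept:
    name: str
    language: Language
    extends: Optional["Concept"] = None  # one entry of Inh: sub -> super


@dataclass(eq=False)
class Element:
    concept: Concept
    parent: Optional["Element"] = None  # containment, one entry of Cdn
    children: List["Element"] = field(default_factory=list)

    def add(self, child: "Element") -> "Element":
        """Attach child below this element and return the child."""
        child.parent = self
        self.children.append(child)
        return child


def co(e: Element) -> Concept:
    """concept-of: element -> concept"""
    return e.concept


def lo(c: Concept) -> Language:
    """language-of: concept -> language"""
    return c.language


def fo(e: Element) -> Element:
    """fragment-of: element -> fragment (assumption: a fragment is
    identified by the root element of its tree)."""
    while e.parent is not None:
        e = e.parent
    return e


def is_a(c: Optional[Concept], expected: Concept) -> bool:
    """True if c is `expected` or transitively extends it (LSP check)."""
    while c is not None:
        if c == expected:
            return True
        c = c.extends
    return False


# The statement `int x = 3;` as a tiny tree.
C = Language("C")
Statement = Concept("Statement", C)
LocalVariableDeclaration = Concept("LocalVariableDeclaration", C, extends=Statement)
IntType = Concept("IntType", C)
NumberLiteral = Concept("NumberLiteral", C)

decl = Element(LocalVariableDeclaration)
int_type = decl.add(Element(IntType))
three = decl.add(Element(NumberLiteral))
```

With this sketch, co(decl) yields LocalVariableDeclaration, lo(co(three)) yields the C language, fo(three) yields the root decl, and is_a(co(decl), Statement) holds, which is what allows the declaration to appear wherever a Statement is expected.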

Independence

An important concept is the notion of independence. An independent language does not depend on other languages. This means that for all parent/child, reference and inheritance relationships, both ends refer to concepts defined in the same language. Based on our definitions above we can define an independent language $l$ as a language for which the following hold [12]:

$$\forall r \in \mathit{Refs}_l \mid \mathit{lo}(r.\mathit{to}) = \mathit{lo}(r.\mathit{from}) = l \tag{3.1}$$

$$\forall s \in \mathit{Inh}_l \mid \mathit{lo}(s.\mathit{super}) = \mathit{lo}(s.\mathit{sub}) = l \tag{3.2}$$

$$\forall c \in \mathit{Cdn}_l \mid \mathit{lo}(c.\mathit{parent}) = \mathit{lo}(c.\mathit{child}) = l \tag{3.3}$$

[12] Formulae like these are not essential for understanding. You may ignore them if you like.

Independence can also be applied to fragments. An independent fragment is one in which all non-containing cross-references $Refs_f$ point to elements within the same fragment:

$$\forall r \in \mathit{Refs}_f \mid \mathit{fo}(r.\mathit{to}) = \mathit{fo}(r.\mathit{from}) = f \tag{3.4}$$

Notice that an independent language $l$ can be used to construct dependent fragments, as long as the two fragments just contain elements from this single language $l$. Vice versa, a dependent language can be used to construct independent fragments. In this case we just have to make sure that the non-containing cross references are "empty" in the elements in fragment $f$.

Refrigerators: The hardware definition language is independent, as are fragments that use this language. In contrast, the cooling algorithm language is dependent. BuildingBlockRef declares a reference to the BuildingBlock concept defined in the hardware language (Fig. 3.12). Consequently, if a cooling program refers to a hardware setup using an instance of BuildingBlockRef, the fragment becomes dependent on the hardware definition fragment that contains the referenced building block.

Figure 3.12: A BuildingBlockRef references a hardware element from within a cooling algorithm fragment.
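Formulas 3.1 to 3.4 translate almost directly into executable checks. The sketch below is one possible reading under stated assumptions: concepts, elements, languages and fragments are plain strings, lo and fo are lookup tables, and every relationship is a pair of names. BuildingBlock and BuildingBlockRef come from the refrigerator example; Compressor extending BuildingBlock, CoolingProgram, and all element and fragment names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# (from, to) for references, (sub, super) for inheritance,
# (parent, child) for containment.
Relationship = Tuple[str, str]


def is_independent_language(language: str,
                            language_of: Dict[str, str],
                            refs: Iterable[Relationship],
                            inh: Iterable[Relationship],
                            cdn: Iterable[Relationship]) -> bool:
    """Formulas 3.1 to 3.3: both ends of every relationship of the
    language are concepts defined in that same language."""
    for source, target in [*refs, *inh, *cdn]:
        if language_of[source] != language or language_of[target] != language:
            return False
    return True


def is_independent_fragment(fragment: str,
                            fragment_of: Dict[str, str],
                            refs: Iterable[Relationship]) -> bool:
    """Formula 3.4: every cross-reference stays within the fragment."""
    return all(fragment_of[source] == fragment and fragment_of[target] == fragment
               for source, target in refs)


# lo as a lookup table: concept name -> defining language.
language_of = {
    "BuildingBlock": "hardware",
    "Compressor": "hardware",        # hypothetical sub-concept
    "CoolingProgram": "cooling",     # hypothetical root concept
    "BuildingBlockRef": "cooling",
}

hardware_is_independent = is_independent_language(
    "hardware", language_of,
    refs=[], inh=[("Compressor", "BuildingBlock")], cdn=[])

cooling_is_independent = is_independent_language(
    "cooling", language_of,
    refs=[("BuildingBlockRef", "BuildingBlock")],
    inh=[], cdn=[("CoolingProgram", "BuildingBlockRef")])

# fo as a lookup table: element name -> containing fragment.
fragment_of = {
    "program1": "cooling.frag",
    "ref1": "cooling.frag",
    "compressor1": "hardware.frag",
}

connected_is_independent = is_independent_fragment(
    "cooling.frag", fragment_of, refs=[("ref1", "compressor1")])
empty_is_independent = is_independent_fragment(
    "cooling.frag", fragment_of, refs=[])
```

In this sketch the hardware language is independent and the cooling language is not. The cooling fragment is dependent once ref1 points into the hardware fragment, but independent as long as its references are left "empty", which mirrors the observation that a dependent language can still be used to construct independent fragments.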

Homogeneity

We distinguish homogeneous and heterogeneous fragments. A homogeneous fragment is one in which all elements are expressed with the same language (see formula 3.5). This means that for all parent/child relationships ($Cdn_f$), the elements at both ends of the relationship have to be instances of concepts defined in one language $l$ (3.6):

$$\forall e \in E_f \mid \mathit{lo}(\mathit{co}(e)) = l \tag{3.5}$$

$$\forall c \in \mathit{Cdn}_f \mid \mathit{lo}(\mathit{co}(c.\mathit{parent})) = \mathit{lo}(\mathit{co}(c.\mathit{child})) = l \tag{3.6}$$

Note: An independent language can only express homogeneous fragments. However, a homogeneous fragment can be expressed with a dependent language if the dependencies all come via the Refs relationship.

mbeddr C: A program written in plain C is homogeneous. All program elements are instances of the C language. Using the state machine language extension allows us to embed state machines in C programs. This makes the respective fragment heterogeneous (see Fig. 3.13).
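The homogeneity condition (formula 3.5) amounts to collecting the defining language of every element in a fragment and checking that only one language occurs. The following sketch assumes a deliberately simple tree encoding, a tuple of concept name, defining language and children; the concept names and the language name c are illustrative, while statemachines and unittest are the language names mentioned in the caption of Fig. 3.13.

```python
from typing import Iterator, List, Tuple

# A node is (concept name, defining language, children).
Node = Tuple[str, str, List["Node"]]


def languages_used(node: Node) -> Iterator[str]:
    """Yield lo(co(e)) for every element e in the tree below node."""
    _, language, children = node
    yield language
    for child in children:
        yield from languages_used(child)


def is_homogeneous(root: Node) -> bool:
    """Formula 3.5: all elements are instances of concepts of one language."""
    return len(set(languages_used(root))) == 1


plain_c_module: Node = ("Module", "c", [
    ("GlobalVariable", "c", []),
    ("Function", "c", [("Statement", "c", [])]),
])

extended_module: Node = ("Module", "c", [
    ("GlobalVariable", "c", []),
    ("StateMachine", "statemachines", [("State", "statemachines", [])]),
    ("TestCase", "unittest", [("TestStatemachine", "statemachines", [])]),
])
```

Here is_homogeneous(plain_c_module) holds, while extended_module is heterogeneous because three languages occur in one containment tree. Since every element except the root has exactly one parent, checking all elements (3.5) and checking both ends of all parent/child relationships (3.6) give the same answer for any fragment with more than one element.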

Figure 3.13: An example of a heterogeneous fragment. This module contains global variables (from the core language), a state machine (from the statemachines language) and a test case (from the unittest language). Note how concepts defined in the statemachine language (trigger, isInState and test statemachine) are used inside a TestCase.

3.4 Parsing versus Projection

This part of the book is not about implementation techniques. However, the decision of whether to build a DSL using a projectional editor instead of the more traditional parser-based approach can have some consequences for the design of the DSL. So we have to provide some level of detail on the two at this point.

In the parser-based approach, a grammar specifies the sequence of tokens and words that make up a structurally valid program. A parser is generated from this grammar. A parser is a program that recognizes valid programs in their textual form and creates an abstract syntax tree or graph. Analysis tools or generators work with this abstract syntax tree. Users enter programs using the concrete syntax (i.e. character sequences) and programs are also stored in this way. Example tools in this category include Spoofax and Xtext.

Figure 3.14: In parser-based systems, the user only interacts with the concrete syntax, and the AST is constructed from the information in the text.

Projectional editors (also known as structured editors) work without grammars and parsers. A language is specified by defining the abstract syntax tree, then defining projection rules that render the concrete syntax of the language concepts defined by the abstract syntax. Editing actions directly modify the abstract syntax tree. Projection rules then render a textual (or other) representation of the program. Users read and write programs through this projected notation. Programs are stored as abstract syntax trees, usually as XML. As in parser-based systems, backend tools operate on the abstract syntax tree.

Figure 3.15: In projectional systems, the user sees the concrete syntax, but all editing gestures directly influence the AST. The AST is not extracted from the concrete syntax, which means the concrete syntax does not have to be parsable.

Projectional editing is well known from graphical editors; virtually all of them use this approach [13]. However, they can also be used for textual syntax [14]. Example tools in this category include the Intentional Domain Workbench [15] and JetBrains MPS.

In this section, we do not discuss the relative advantages and drawbacks of parser-based versus projectional editors in any detail (although we do discuss the trade-offs in the chapter on language implementation, Section 7). However, we will point out if and when there are different DSL design options depending on which of the two approaches is used.

[13] You could argue that they are not purely projectional because the user can move the shapes around and the position information has to be persistent. Nonetheless, graphical editors are fundamentally projectional.

[14] While in the past projectional text editors have acquired a bad reputation mostly because of bad usability, as of 2011, the tools have become good enough, and computers have become fast enough to make this approach feasible, productive and convenient to use.

[15] www.intentsoft.com

4 Design Dimensions

This chapter has been written in association with Eelco Visser of TU Delft. You can contact him via [email protected].

DSLs are powerful tools for software engineering, because they can be tailor-made for a specific class of problems. However, because of the large degree of freedom in designing DSLs, and because they are supposed to cover the intended domain, consistently, and at the right abstraction level, DSL design is also hard. In this chapter we present a framework for describing and characterizing domain specific languages. We identify seven design dimensions that span the space within which DSLs are designed: expressivity, coverage, semantics, separation of concerns, completeness, language modularization and syntax.

We illustrate the design alternatives along each of these dimensions with examples from our case studies. The dimensions provide a vocabulary for describing and comparing the design of existing DSLs, and help guide the design of new ones. We also describe drivers, or forces, that lead to using one design alternative over another. This chapter is not a complete methodology. It does not present a recipe that guarantees a great DSL if followed. I don’t believe in methodologies, because they pretend precision where there isn’t any. Building a DSL is a craft. This means that, while there are certain established approaches and conventions, building a good DSL also requires experience and practice.

4.1 Expressivity

One of the fundamental advantages of DSLs is increased expressivity over more general programming languages. Increased expressivity typically means that programs are shorter, and that the semantics are more readily accessible to processing tools (we will return to this). By making assumptions about the target domain and encapsulating knowledge about the domain in the language and in its execution strategy (and not just in programs), programs expressed using a DSL can be significantly more concise.

Refrigerators: Cooling algorithms expressed with the cooling DSL are approximately five times shorter than the C version that users would have to write instead.

While it is always possible to produce short but incomprehensible programs, in general shorter programs require less effort to read and write than longer programs, and are therefore more efficient in software engineering. We will thus assume that, all other things being equal, shorter programs are preferable over longer programs [1]. We use the notation $|p_L|$ to indicate the size of program $p$ as encoded in language $L$ [2]. The essence is the assumption that, within one language, more complex programs will require larger encodings. We also assume that $p_L$ is the smallest encoding of $p$ in $L$, i.e. does not contain dead or convoluted code. We can then qualify the expressivity of a language relative to another language.

A language $L_1$ is more expressive in domain $D$ than a language $L_2$ ($L_1 \prec_D L_2$), if for each $p \in P_D \cap P_{L_1} \cap P_{L_2}$, $|p_{L_1}| < |p_{L_2}|$.

[1] The size of a program may not be the only relevant metric to assess the usefulness of a DSL. For example, if the DSL requires only a third of the code to write, but it takes four times as long to write the code per line, then there is no benefit for writing programs. However, often when reading programs, less code is clearly a benefit. So it depends on the ratio between writing and reading code as to whether a DSL’s conciseness is important.

[2] We will not concern ourselves with the exact way to measure the size of a program, which can be textual lines of code or nodes in a syntax tree, for example.

A weaker but more realistic version of this statement requires that a language is mostly more expressive, but may not be in some obscure special cases: DSLs may optimize for the common case and may require code written in a more general language to cover the corner cases [3].

Compared to GPLs, DSLs (and the programs expressed with them) are more abstract: they avoid describing details that are irrelevant to the model purpose. The execution engine then fills in the missing details to make the program executable on a given target platform, based on knowledge about the domain encoded in the execution engine. Good DSLs are also declarative: they provide linguistic abstractions for relevant domain concepts that allow processors to "understand" the domain semantics without sophisticated analysis of the code. Linguistic abstraction means that a language contains concepts for the abstractions relevant in the domain. We discuss this in more detail below.

Note that there is a trade-off between expressivity and the scope of the language. We can always invent a language with exactly one symbol Σ that represents exactly one single program. It is extremely expressive! It is trivial to write a code generator for it. However, the language is also useless, because it can only express one single program, and we’d have to create a new language if we wanted to express a different program. So in building DSLs we are striving for a language that has maximum expressivity while retaining enough coverage (see Section 4.2) of the target domain to be useful.

[3] We discuss this aspect in the section on completeness (Section 4.5).
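The expressivity definition from the beginning of this section, together with its weaker "mostly more expressive" variant, can be made concrete with a small sketch. Everything specific in it is an assumption: the program names and sizes are invented purely for illustration and are not measurements from the case studies, each dictionary is taken to hold the smallest encoding size per program of domain D, and treating an empty intersection as "not comparable" is a choice of this sketch rather than part of the definition.

```python
from typing import Dict


def more_expressive(sizes_l1: Dict[str, int], sizes_l2: Dict[str, int]) -> bool:
    """Strict definition: every program expressible in both languages
    has a strictly smaller encoding in L1 than in L2."""
    shared = sizes_l1.keys() & sizes_l2.keys()
    # Assumption of this sketch: with no shared programs the languages
    # are treated as not comparable instead of vacuously ordered.
    return bool(shared) and all(sizes_l1[p] < sizes_l2[p] for p in shared)


def share_smaller(sizes_l1: Dict[str, int], sizes_l2: Dict[str, int]) -> float:
    """Weaker variant: fraction of shared programs that are smaller in L1."""
    shared = sizes_l1.keys() & sizes_l2.keys()
    if not shared:
        return 0.0
    return sum(sizes_l1[p] < sizes_l2[p] for p in shared) / len(shared)


# Invented sizes (e.g. lines of code) for three hypothetical programs.
cooling_dsl = {"defrost": 20, "door_alarm": 12, "logging": 40}
plain_c = {"defrost": 100, "door_alarm": 60, "logging": 35}

strictly_better = more_expressive(cooling_dsl, plain_c)
mostly_better = share_smaller(cooling_dsl, plain_c)
```

With these made-up numbers the strict definition fails because of the single corner case logging, while two thirds of the shared programs are smaller in the DSL. This is the situation the weaker version is meant to capture: the DSL optimizes for the common case and leaves the corner case to a more general language.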

DSLs have the advantage of being more expressive than GPLs in the domain they are built for. But there is also a disadvantage: before being able to write these concise programs, users have to learn the language [4]. This task can be separated into learning the domain itself, and learning the syntax of the language. For people who understand the domain, learning the syntax can be simplified by using good IDEs with code completion and quick fixes, as well as with good, example-based documentation. In many cases, DSL users already understand the domain, or would have to learn the domain even if no DSL were used to express programs in it: learning the domain is independent of the language itself. It is easy to see, however, that, if a domain is supported by a well-defined language, this can be a good reference for the domain itself. Learning a domain can be simplified by working with a good DSL [5]. In conclusion, the learning overhead of DSLs is usually not a huge problem in practice.

[4] While a GPL also has to be learned, we assume that there is a relatively small number of GPLs and developers already know them. There may be a larger number of DSLs in any given project or organization, and new team members cannot be expected to know them.

[5] This can also be read the other way round: a measure for the quality of a DSL is how long it takes domain experts to learn it.

Pension Plans: The users of the pension DSL are pension experts. Most of them have spent years describing pension plans using prose, tables and (informal) formulas. The DSL provides formal languages to express the same thing in a way that can be processed by tools.

The close alignment between a domain and the DSL can also be exploited during the construction of the DSL. While it is not a good idea to start building a DSL for a domain about which we don’t know much, the process of building the DSL can help deepen the understanding about a domain. The domain has to be scoped, fully explored and systematically structured to be able to build a language.

Refrigerators: Building the cooling DSL has helped the thermodynamicists and software developers to understand the details of the domain, its degrees of freedom and the variability in refrigerator hardware and cooling algorithms in a much more structured and thorough way than before. Also, the architecture of the generated C application that will run on the device became much more well-structured as a consequence of the separation between reusable frameworks, device drivers and generated code.

4.1.1 Expressivity and the Domain Hierarchy

In the section on expressivity above we compare arbitrary languages. An important idea behind domain-specific languages is that progressive specialization of the domain enables progressively more specialized and expressive languages. Programs for domain $D_n \subset D_{n-1}$ expressed in a language $L_{D_{n-1}}$ typically use a set of characteristic idioms and patterns. A language for $D_n$ can provide linguistic abstractions for those idioms or patterns, which makes their expression much more concise and their analysis and translation less complex.

mbeddr C: Embedded C extends the C programming language with concepts for embedded software including state machines, tasks and physical quantities. The state machine construct, for example, has concepts representing states, events, transitions and guards. Much less code is required compared to switch/case statements or cross-indexed integer arrays, two typical idioms for state machine implementation in C.

WebDSL: WebDSL entity declarations abstract over the boilerplate code required by the Hibernate [6] framework for annotating Java classes with object-relational mapping annotations. This reduces code size by an order of magnitude [7].

[6] www.hibernate.org/

[7] E. Visser. WebDSL: A case study in domain-specific language engineering. In GTTSE, pages 291–373, 2007.

4.1.2 Linguistic versus In-Language Abstraction

There are two major ways of defining abstractions. Abstractions can be built into the language (in which case they are called linguistic abstractions), or they can be expressed by concepts available in the language (in-language abstractions). DSLs typically rely heavily on linguistic abstraction, whereas GPLs rely more on in-language abstraction.

Linguistic Abstraction

A specific domain concept can be modeled with the help of existing abstractions, or one can introduce a new abstraction for that concept. If we do the latter, we use linguistic abstraction. By making the concepts of $D$ first-class members of a language $L_D$, i.e. by defining linguistic abstractions for these concepts, they can be uniquely identified in a $D$ program and their structure and semantics is well defined. No semantically relevant [8] idioms or patterns are required to express interesting programs in $D$. Consider these two examples of loops in a Java-like language:

    int[] arr = ...                            int[] arr = ...
    for (int i=0; i<arr.size(); i++) {         OrderedList<int> l = ...
        sum += arr[i];                         for (int i=0; i<arr.size(); i++) {
    }                                              l.add( arr[i] );
                                               }

[8] By "semantically relevant" we mean that the tools needed to achieve the model purpose (analysis, translation) have to treat these cases specially.

The loop in the left-hand example can be parallelized, since the order of summing up the array elements is irrelevant. The right-hand one cannot, since the order of the elements in the OrderedList class is relevant. A transformation engine that translates and optimizes these programs must perform (sophisticated, and sometimes impossible) program analysis to determine that the left-hand loop example can indeed be parallelized. The following alternative expression of the same behavior uses better linguistic abstractions, because it is clear without analysis that the first loop can be parallelized and the second cannot:

    for (int i in arr) {          seqfor (int i in arr) {
        sum += i;                     l.add( i );
    }                             }

The property of a language $L_D$ of having first-class concepts for abstractions relevant in $D$ is often called declarativeness: no sophisticated pattern matching or program flow analysis is necessary to capture the semantics of a program (relative to the purpose) and treat it correspondingly. The decision can simply be based on the language concept used (for versus seqfor) [9].

[9] Without linguistic abstraction, the processor has to analyze the program to "reverse engineer" the semantics to be able to act on it. With linguistic abstraction, we rely on the language user to use the correct abstraction. We assume that the user is able to do this! The trade-off makes sense in DSLs because we assume that DSL users are familiar with the domain, and we often don’t have the budget or experience to build the sophisticated program analyses that could do the semantic reverse engineering.

mbeddr C: State machines are represented with first class concepts. This enables code generation, as well as meaningful validation. For example, it is easy to detect states that are not reached by any transition and report this as an error. Detecting this same problem in a low-level C implementation requires sophisticated analysis on the switch-case statements or indexed arrays that constitute the implementation of the state machine [10].

[10] This approach assumes that the generator works correctly – we’ll discuss this problem in Section 4.3 on semantics.

mbeddr C: Another good example is optional ports in components. Components (see Fig. 4.1) define required ports that specify the interfaces they use. For each component instance, each required port is connected to a provided port of another instance (that has a compatible interface). Required ports may be optional [11], so for a given instance, an optional port may be connected or not. Invoking an operation on an unconnected required port would result in an error, so this has to be prevented. This can be done by enclosing the invocation on a required port in an if statement, checking whether the port is connected. However, an if statement can contain any arbitrary Boolean expression as its condition (e.g., if (isConnected(rp) || somethingRandom()) { port.doSomething(); }). So checking statically that the invocation only happens if the port is connected is impossible. A better solution based on linguistic abstraction is to introduce a new language concept that checks for a connected port directly: with port (rp) { rp.doSomething(); }. The with port statement doesn’t use an expression as its argument, but only a reference to an optional required port (Fig. 4.2). In this way the IDE can check that an invocation on an optional required port rp is only done inside of a with port statement referencing that same port.

[11] The terminology may be a little confusing here: required means that the component invokes operations on the port (as opposed to providing them for others to invoke); optional refers to the fact that, for any given instance of that component, the port may be connected or not.

Figure 4.1: Example component diagram. The top half defines components, their ports and the relationship of these ports to interfaces. The bottom half shows instances whose ports are connected by a connector.
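To see why the with port statement makes the check easy, consider the following sketch of such a validation. It is not how mbeddr implements the check: the Node class, its kind strings and the function unguarded_invocations are hypothetical, chosen only to show that the rule reduces to a tree walk that remembers which ports are currently guarded.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional, Set


@dataclass
class Node:
    kind: str                     # e.g. "block", "with_port", "invoke"
    port: Optional[str] = None    # port referenced by with_port / invoke
    children: List["Node"] = field(default_factory=list)


def unguarded_invocations(node: Node,
                          optional_ports: Set[str],
                          guarded: FrozenSet[str] = frozenset()) -> List[Node]:
    """Return all invocations on optional ports that are not enclosed
    by a with_port statement referencing the same port."""
    errors: List[Node] = []
    if (node.kind == "invoke"
            and node.port in optional_ports
            and node.port not in guarded):
        errors.append(node)
    if node.kind == "with_port":
        # Everything below this node may safely use the referenced port.
        guarded = guarded | {node.port}
    for child in node.children:
        errors.extend(unguarded_invocations(child, optional_ports, guarded))
    return errors


body = Node("block", children=[
    Node("with_port", port="rp", children=[Node("invoke", port="rp")]),
    Node("invoke", port="rp"),    # not guarded: should be reported
    Node("invoke", port="log"),   # not an optional port: always fine
])

errors = unguarded_invocations(body, optional_ports={"rp"})
```

The only error reported is the second invocation on rp. No evaluation of Boolean conditions is involved, because with_port carries a port reference and not an expression; with an if statement the walk would have to reason about arbitrary conditions instead.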

Figure 4.2: The with port statement is required to surround an invocation on an optional required port; otherwise, an error is reported in the IDE. If the port is not connected for any given instance, the code inside the with port is not executed. It acts like an if statement, but since it cannot contain an expression as its condition, the correct use of the with port statement can be statically checked.

Linguistic abstraction also means that no details irrelevant to the model purpose are expressed. Once again, this increases conciseness, and avoids the undesired specification of unintended semantics (over-specification). Over-specification is usually bad because it limits the degrees of freedom available to a transformation engine. In the example with the parallelizable loops, the first loop is over-specified: it expresses ordering of the operations, although this is (most probably) not intended by the person who wrote the code.

mbeddr C: State machines can be implemented as switch/case blocks or as arrays indexing into each other. The DSL program does not specify which implementation should be used and the transformation engine is free to choose the more appropriate representation, for example based on desired program size or performance characteristics. Also, log statements and task declarations can be translated in different ways depending on the target platform.
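The freedom that declarative concepts give to a transformation engine can be sketched for the for versus seqfor example. The classes For and SeqFor and the function execution_strategy are hypothetical, and the number of cores merely stands in for whatever target platform knowledge a real generator would consult.

```python
from dataclasses import dataclass


@dataclass
class For:
    """Loop whose iteration order is not part of its semantics."""
    body: str


@dataclass
class SeqFor:
    """Loop whose iterations must run in the order given."""
    body: str


def execution_strategy(loop, cores: int) -> str:
    """Pick a strategy from the concept alone, without analyzing the body."""
    if isinstance(loop, SeqFor):
        return "sequential"
    if isinstance(loop, For):
        # The engine may parallelize, but it does not have to.
        return "parallel" if cores > 1 else "sequential"
    raise TypeError(f"not a loop concept: {type(loop).__name__}")


summing = For(body="sum += i;")
appending = SeqFor(body="l.add( i );")

on_multicore = execution_strategy(summing, cores=4)
on_single_core = execution_strategy(summing, cores=1)
ordered = execution_strategy(appending, cores=4)
```

The body is never inspected. Because the for loop does not specify an order, the engine can pick either strategy depending on the platform, whereas seqfor pins the order down explicitly and leaves no choice.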

In-Language Abstraction

Conciseness can also be achieved by a language that provides facilities to allow users to define new (non-linguistic) abstractions in programs. Well-known GPL concepts for building new abstractions include procedures, classes, or functions and higher-order functions, generics, traits and monads. It is not a sign of a bad DSL if it has in-language abstraction mechanisms as long as the abstractions created don’t require special treatment by analysis or processing tools – at which point they should be refactored into linguistic abstractions. An example of such special treatment would be if the compiler of the above example language knew that the OrderedList library class is actually ordered, and that, consequently, the respective loop cannot be parallelized. Another example of special treatment can be constructed in the context of the optional port example. If we had solved the problem by having a library function isConnected(port), we could enforce a call on an optional port to be surrounded by an if (isConnected(port)) without any other expression in the condition. In this case, the static analyzer would have to treat isConnected specially [12]. In-language abstraction can, as the name suggests, provide abstraction, but it cannot provide declarativeness: a model processor has to "understand" what the user wanted to express by building the in-language abstraction, in order to be able to act on it.

Note: It is worth understanding these to some extent, so that you can make an informed decision which of these – if any – are useful in a DSL.

[12] Treating program elements specially is dangerous because the semantics of isConnected or OrderedList could be changed by a library developer without changing the static analyzer or code generator in a consistent way.

Refrigerators: The language does not support the construction of new abstractions since its user community consists of non-programmers who are not familiar with defining abstractions. As a consequence, the language had to be modified several times during development, as new requirements came from the end users which had to be integrated directly into the language.

mbeddr C: Since C is extended, C’s abstraction mechanisms (functions, structs, enums) are available. Moreover, we added new mechanisms for building abstractions, including interfaces and components.

WebDSL: WebDSL provides template definitions to capture partial web pages, including rendering of data from the database and form request handling. User defined templates can be used to build complex user interfaces.

Standard Library

If a language provides support for in-language abstraction, these facilities can be used by the language developer to provide collections of domain specific abstractions to language users. Instead of adding language features, a standard library is deployed along with the language to all its users. It contains abstractions relevant to the domain, expressed as in-language abstractions. This approach keeps the language itself small, and allows subsequent extensions of the library without changing the language definition and processing tools.

Note: This approach is of course well known from programming languages. All of them come with a standard library, and the language can hardly be used without relying on it. It is effectively a part of the language.

Refrigerators: Hardware building blocks have properties. For example, a fan can be turned on or off, and a compressor has a speed (rpm). The set of properties available for the various building blocks is defined via a standard library and is not part of the language (see code below). The reason why this is not a contradiction to what we discussed earlier is this: as a consequence of the structure of the framework used on the target platform, new properties can be added to hardware elements without the need to change the generator. They are not treated specially!

    lib stdlib {
        command compartment::coolOn
        command compartment::coolOff
        property compartment::totalRuntime: int readonly
        property compartment::needsCooling: bool readonly
        property compartment::couldUseCooling: bool readonly
        property compartment::targetTemp: int readonly
        property compartment::currentTemp: int readonly
        property compartment::isCooling: bool readonly
    }

Some languages treat certain abstractions defined in the standard library specially. For example, Java’s WeakReference has special semantics for garbage collection. While an argument can be made that special treatment is acceptable for a standard library (after all, it can be considered an essential companion to the language itself), it is still risky and dangerous. Considering that, in the case of DSLs, we can change the language relatively easily, I would suggest avoiding special treatment even in a standard library and recommend providing linguistic abstractions for these cases.

Comparing Linguistic and In-Language Abstraction

A language that contains linguistic abstractions for all relevant domain concepts is simple to process; the transformation rules can be tied to the identities of the language concepts. It also makes the language suitable for domain experts, because relevant domain concepts have a direct representation in the language. Code completion can provide specific and meaningful support for "exploring" how a program should be written. However, using linguistic abstractions extensively requires that the relevant abstractions be known in advance, or frequent evolution of the language is necessary. It can also lead to languages that feel large, bloated or inelegant. In-language abstraction is more flexible, because users can build just those abstractions they actually need. However, this requires that users are actually trained to build their own abstractions. This is often true for programmers, but it is typically not true for domain experts.

Using a standard library may be a good compromise, in which one set of users develops the abstractions to be used by another set of developers. This is especially useful if the same language is to be used for several, related, projects or user groups. Each can build their own set of abstractions in the library.

Note: Modular language extension, as discussed later in Section 4.6.2, provides a middle ground between the two approaches. A language can be flexibly extended, while retaining the advantages of linguistic abstraction.

Note that languages that provide good support for in-language abstraction feel different from those that use a lot of linguistic abstraction (compare Scala or Lisp to Cobol or ABAP). Make sure that you don’t mix the two styles unnecessarily: the resulting language may be judged as being ugly, especially by programmers.

4.1.3 Language Evolution Support

If a language uses a lot of linguistic abstraction, it is likely, especially during the development of the language, that these abstractions will change. Changing language constructs may break existing models, so special care has to be taken regarding language evolution. This requires any or all of the following: a strict configuration management discipline, versioning information in the models to trigger compatible editors and model processors, keeping track of the language changes as a sequence of change operations that can be "replayed" on existing models, or model migration tools to transform models based on the old language into the new language.

Whether model migration is a challenge or not depends on the tooling. There are tools that make model evolution very smooth, but many environments don’t. Consider this when deciding on the tooling you want to use!

Margin note: In parser-based languages, you can always at the very least open the file in a text editor and run some kind of global search/replace to migrate the program. In a projectional editor, special care has to be taken to enable the same functionality.

It is always a good idea to minimize those changes to a DSL that break existing models [13]. Backward compatibility and deprecation are techniques well worth keeping in mind when working with DSLs. For example, instead of just changing an existing concept in an incompatible way, you may add a new concept in addition to the old one, along with deprecation of the old one and a migration script or wizard. Note that you might be able to instrument your model processor to collect statistics on whether deprecated language features continue to be used. Once no more instances show up in models, you can safely remove the deprecated language feature.

[13] This is especially true if you don’t have access to all programs to migrate them in one fell swoop: you have to deploy migration scripts with the language, or rely on the users to perform the migration manually.

If the DSL is used by a closed, known user community that is accessible to the DSL designers, it will be much easier to evolve the language over time, because users can be reached, making them migrate to newer versions [14]. Alternatively, the set of all models can be migrated to a newer version using a script provided by the language developers. In cases where the set of users, and the DSL programs, are not easily accessible, much more effort must be put into maintaining backward compatibility, and the need for coordinated evolution should be kept minimal [15].

[14] The instrumentation mentioned above may even report uses of deprecated language features after the official expiration date.

[15] This is the reason why many GPLs can never get rid of deprecated language features.

4.1.4 Precision versus Algorithm

We discussed earlier the fact that some DSLs may be Turing complete (and feel more like a programming language), whereas others are purely declarative and maybe just describe facts, structures and relationships in a domain. The former may not be usable by domain users (i.e. non-programmers). Domain users are often able to formally and precisely specify facts, structures and relationships about their domain, but they are often not able to define algorithmic behavior. In this case, a DSL has to be defined that abstracts far enough to hide these algorithmic details. Alternatively, you can create an incomplete language (Section 4.5) and have developers fill in the algorithmic details in GPL code. One way to do this is to provide a set of predefined behaviors (in some kind of library) which are then just parametrized or configured by the users.

Pension Plans: Pension rules are at the boundary between being declarative and algorithmic. The majority of the models define data structures (customers, pension plans, payment schedules). However, there are also mathematical equations and calculation rules. These are algorithmic, but in the pension domain, the domain users are well able to deal with these. ◂

4.1.5 Configuration Languages

Configuration languages are purely declarative. They consist of a well-defined set of configuration parameters and constraints among them. "Writing programs" boils down to setting values for these parameters. In many cases, the parameters are Booleans, in which case a program is basically a selection of a subset of the configuration switches. Feature models constitute a well-known configuration language. We discuss configuration languages in more detail in the chapter on DSLs and Product Line Engineering (Section 21).
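The idea of a program as a selection of Boolean switches can be illustrated with a minimal sketch. It assumes a hypothetical set of switches and two made-up constraints; none of the names below come from the book or from any feature modeling tool.

```python
# Hypothetical Boolean configuration switches (illustrative names only).
PARAMETERS = {"logging", "remote_logging", "network", "low_power"}

# Constraints among the parameters: (description, predicate over the selection).
CONSTRAINTS = [
    ("remote_logging requires logging and network",
     lambda sel: "remote_logging" not in sel or {"logging", "network"} <= sel),
    ("low_power excludes network",
     lambda sel: not {"low_power", "network"} <= sel),
]


def validate(selection):
    """Return the list of problems for a selection of switches.

    A configuration "program" is just the set of switches that are turned on.
    """
    chosen = set(selection)
    unknown = chosen - PARAMETERS
    errors = [f"unknown parameter: {name}" for name in sorted(unknown)]
    errors += [text for text, holds in CONSTRAINTS if not holds(chosen)]
    return errors
```

An empty result means the selection is a valid configuration; each returned string names one violated constraint in domain terms.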

4.1.6 Platform Influence

In theory, the design of the abstractions used in a language should be independent of the execution engine and the platform. However, this is not always the case [16]. There are two reasons why the platform may influence the language.

[16] It is obviously not the case for architecture DSLs where you build a language that resembles the architectural abstractions in a platform. But that’s not what we’re talking about here.

■ Runtime Efficiency  In most systems, the resulting system has to execute in a reasonably efficient way. Efficiency can mean performance, scalability, as well as resource consumption (memory, disk space, network bandwidth). Depending on the semantic gap between the platform and the language, building efficient code generators can be a lot of work (we discuss this in some detail in the section on semantics (Section 4.3)). Instead of building the necessary optimizers, you can also change the language to use abstractions that make global optimizations simpler to build [17].

[17] While this may be considered "cheating", it may be the only practical way given project constraints.

mbeddr C: The language does not support dynamically growing lists, because it is hard to implement them in an efficient way considering we are targeting embedded software. Dynamic allocation of memory is often not allowed, and even if it were, the necessary copying of existing list data into a new, bigger buffer is too expensive for practical use. The incurred overhead is also not obvious to the language user (he just increases list size or adds another element that triggers list growth), making it all the more dangerous. ◂

mbeddr C: Another example includes floating point arithmetic. If the target platform has no floating point unit (FPU), floating point arithmetic is expensive to emulate. We had to build the language in a way that could prevent the use of float and double types if the target platform had no FPU. ◂

■ Platform Limitations  The platform may have limitations regarding the size of data structures, the memory or disk space, or the bandwidth of the network, that limit or otherwise influence language design.

Refrigerators: In the cooling language we had to introduce time units (seconds, minutes, hours) into the DSL after we’d noticed that the time periods relevant for cooling algorithms were so diverse that no single unit could fit all necessary values into the available integer types. If we had used only seconds, the days or months periods would not fit into the available ints. Using only hours or days obviously would not let us express the short periods without using fractions or floating point data types. So the language now has the ability to express periods, as in 3s or 30d. ◂

4.2 Coverage

A language L always defines a domain D such that $P_D = P_L$. Let’s call this domain $D_L$, i.e. the domain determined by L. This does not work the other way around: given a (deductively defined) domain D, there is not necessarily a language that fully covers that domain unless we revert to a universal language at $D_0$ (cf. the hierarchical structure of domains and languages).

Margin note: Note that we can achieve full coverage by making L too general. Such a language may, however, be less expressive, resulting in bigger (unnecessarily big) programs. Indeed this is the reason for designing DSLs in the first place: general purpose languages are too general.

A language L fully covers domain D if for each program p relevant to the domain ($p \in P_D$) a program $p_L$ can be written in L. In other words, $P_D \subseteq P_L$.

Full coverage is a Boolean predicate: a language either fully covers a domain or it does not. In practice, many languages do not fully cover their respective domain. We would like to indicate the coverage ratio. The domain coverage ratio of a language L is the portion of programs in a domain D that it can express. We define $C_D(L)$, the coverage of domain D by language L, as:

$$C_D(L) = \frac{\text{number of } P_D \text{ programs expressible by } L}{\text{number of programs in domain } D}$$
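Since $P_D$ can rarely be enumerated, in practice the ratio can only be estimated over a finite sample of representative domain programs. The following sketch makes that assumption explicit. The predicate passed as expressible stands in for "can be written in L", and the sample data (programs tagged with the features they need) is invented purely for illustration.

```python
from fractions import Fraction


def coverage_ratio(domain_programs, expressible):
    """Estimate C_D(L) over a finite sample of programs from domain D."""
    programs = list(domain_programs)
    if not programs:
        raise ValueError("the sample of domain programs must not be empty")
    covered = sum(1 for p in programs if expressible(p))
    return Fraction(covered, len(programs))


# Invented sample: each "program" is described by the features it needs.
sample = [
    {"typed_links"},
    {"typed_links", "forms"},
    {"free_form_urls"},
    {"forms"},
]
supported = {"typed_links", "forms"}  # hypothetical feature set of language L

# A program is expressible if L supports every feature it needs.
ratio = coverage_ratio(sample, lambda needs: needs <= supported)
```

With this sample, three of the four programs are expressible, so the estimate is 3/4. The number is only as meaningful as the sample is representative of the domain.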

At first glance, an ideal DSL will cover all of its domain ($C_D(L)$ is 100%). It requires, however, that the domain is well-defined and we can actually know what full coverage is. Also, over time, it is likely that the domain will evolve and grow, and the language has to be continuously evolved to retain full coverage.

Margin note: As the domain evolves, language evolution has to keep pace, requiring responsive DSL developers. This is an important process aspect to keep in mind!

In addition to the evolution-related reason given above, there are two reasons for a DSL not to cover all of its own domain D. First, the language may be deficient and need to be redesigned. This is especially likely for new and immature DSLs. Scoping the domain for which to build a DSL is an important part of DSL design.

Second, the language may have been defined expressly to cover only a subset of D, typically the subset that is most commonly used. Covering all of D may lead to a language that is too big or complicated for the intended user community because of its support for rarely used corner cases of the domain [18]. In this case, the remaining parts of D may have to be expressed with code written in $D_{-1}$ (see also Section 4.5). This requires coordination between DSL users and $D_{-1}$ users, if these are not the same group of people.

[18] Incremental language extension provides a third option: you can put the common parts into the base language and the support for the corner cases into optionally included language modules.

WebDSL: WebDSL defines web pages through "page definitions" which have formal parameters. navigate statements generate links to such pages. Because of this stylized idiom, the WebDSL compiler can check that internal links are to existing page definitions, with arguments of the right type. The price that the developer pays is that the language does not support free-form URL construction. Thus, the language cannot express all types of URL conventions and does not have full coverage of the domain of web applications. ◂

Refrigerators: After trying to write a couple of algorithms, we had to add a perform ...after t statement to run a set of statements after a specified time t has elapsed. In the initial language, this had to be done manually with events and timers. Over time we noticed that this is a very typical case, so we added first-class support. ◂

mbeddr C: Coverage of this set of languages is full, although any particular extension to C may only cover a part of the respective domain. However, even if no suitable linguistic abstraction is available for some domain concept, it can be implemented in the $D_0$ language C, while retaining complete syntactic and semantic integration. Also, additional linguistic abstractions can be easily added because of the extensible nature of the overall approach. ◂

4.3 Semantics and Execution

Semantics can be partitioned into static semantics and execution semantics. Static semantics are implemented by the constraints and type system rules. Execution semantics denote the observable behavior of a program p as it is executed. We look at both aspects in this section; but we refer to execution semantics if we don’t explicitly say otherwise.

Using a function OB that defines this observable behavior, we can define the semantics of a program $p_{L_D}$ by mapping it to a program q in a language for $D_{-1}$ that has the same observable behavior:

$$\mathit{semantics}(p_{L_D}) := q_{L_{D_{-1}}} \quad \text{where} \quad OB(p_{L_D}) == OB(q_{L_{D_{-1}}})$$

Margin note: There are also a number of approaches for formally defining semantics independent of operational mappings to target languages. However, they don’t play an important role in real-world DSL design, so we don’t address them in this book.

Equality of the two observable behaviors can be established with a sufficient number of tests, or with model checking and proof (which takes a lot of effort and is hence rarely done). This definition of semantics reflects the hierarchy of domains and works both for languages that describe only structure, as well as for those that include behavioral aspects.

The technical implementation of the mapping to $D_{-1}$ can be provided in two different ways: a DSL program can literally be transformed into a program in $L_{D_{-1}}$, or an interpreter can be written in $L_{D_{-1}}$ or $L_{D_0}$ to execute the program. Before we spend the rest of this section looking at these two options in detail, we first briefly look at static semantics.

4.3.1 Static Semantics/Validation

Before establishing the execution semantics by transforming or interpreting the program, its static semantics has to be validated. Constraints and type systems are used to this end and we describe their implementation in Part III of the book. Here is a short overview.

■ Constraints  Constraints are simply Boolean expressions that check some property of a model. For example, one might verify that the names of a set of attributes of some entity are unique. For a model to be statically correct, all constraints have to evaluate to true. Constraint checking should only be performed for a model that is structurally/syntactically correct [19].

[19] In projectional systems you cannot build structurally/syntactically incorrect programs in the first place. For parser-based systems, the AST, on which the constraint checks are performed, often is not constructed unless the syntactic structure is correct. This automatically leads to constraint checks being performed only on structurally/syntactically correct models.

Margin note: Sometimes constraints are used instead of grammar rules. For example, instead of using a 1..n multiplicity in the grammar, I often use 0..n together with a constraint that checks that there is at least one element. The reason for using this approach is that if the grammar mechanism is used, a possible error message comes from the parser. That error message may read something like expecting SUCH_AND_SUCH, found SOMETHING_ELSE. This is not very useful to the DSL user. If a more tolerant (0..n) grammar is used, the constraint error message can be made to express a real domain constraint (e.g., at least one SUCH_AND_SUCH is required, because . . . ).

mbeddr C: One driver in selecting the linguistic abstractions that go into a DSL is the ability to easily implement meaningful constraints. For example, in the state machine extension it is trivial to find states that have no outgoing transitions (dead end, Fig. 4.3). In a functional language, such a constraint could be written in the way shown in the code below. ◂

    states.select(s|!s.isInstanceOf(StopState))
          .select(s|s.transitions.size == 0)
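For readers who want to run the dead-end check, here is a minimal Python sketch of the same constraint. The State and StopState classes are hypothetical stand-ins for the state machine concepts, not mbeddr's actual model API. The second function illustrates the tolerant-grammar idea from the margin note: an empty list of states is accepted structurally and then rejected by a constraint with a domain-level message whose wording is invented here.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    name: str
    transitions: list = field(default_factory=list)  # names of target states


@dataclass
class StopState(State):
    """A state that is intentionally final; it may lack outgoing transitions."""


def dead_ends(states):
    """States that are not stop states and have no outgoing transitions."""
    return [s for s in states
            if not isinstance(s, StopState) and len(s.transitions) == 0]


def check_states(states):
    """Run both constraints and return domain-level error messages."""
    errors = []
    if not states:
        # The 0..n multiplicity is allowed structurally; the constraint
        # supplies the meaningful message a parser could not give.
        errors.append("at least one state is required, because a state "
                      "machine without states cannot be executed")
    errors += [f"state '{s.name}' is a dead end: it has no outgoing transitions"
               for s in dead_ends(states)]
    return errors
```

For a machine consisting of State("idle", ["running"]), State("running") and StopState("done"), only running is reported, since done is exempt by its type.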

When defining languages and transformations, developers often have a set of constraints in their mind that they consider obvious. They assume that no one would ever use the language in a particular way. However, DSL users may be creative and actually use the language in that way, leading the transformation to crash or create non-compilable code. Make sure that all constraints are actually implemented. This can sometimes be hard. Only extensive (automated) testing can prevent these problems from occurring.

Figure 4.3: An example state machine with a dead end state, a state that cannot be left once entered (no outgoing transitions).

In many cases, a multi-stage transformation is used in which a model expressed in $L_3$ is transformed into a model expressed in $L_2$, which is then in turn transformed into a program expressed in $L_1$ [20]. In this case it is important that every valid program in $L_3$ leads to a valid program in $L_2$. If the processing of $L_2$ fails with error messages using abstractions from $L_2$ (e.g., compiler errors), users of $L_3$ will not be able to understand them; they may have never seen the programs generated in $L_2$. Again, automated testing is the way to address this issue.

[20] Note how this also applies to the classical case in which $L_2$ is your DSL and $L_1$ is a GPL which is then compiled.

If many or complex constraints (or type system rules) are executed on a large model, performance may become an issue. Even if the DSL tool is clever about this and only revalidates the constraints for those program elements that changed, it can still be a problem if some kind of whole-model validation is tied to a particular element. To solve this problem, many DSL tools allow users to classify the constraints according to their cost (i.e. performance overhead). Cheap constraints are executed for each changing program element, in real-time, as it changes. Progressively more expensive constraints are checked, for example, as a fragment is saved or only upon explicit request by the user.
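The cost classification described above can be sketched as a small registry. This is an illustration under assumptions: the three cost classes, the trigger names and the ConstraintRegistry class are hypothetical and do not correspond to the API of any particular DSL tool.

```python
from enum import IntEnum


class Cost(IntEnum):
    CHEAP = 1       # run on every edit
    MODERATE = 2    # run when a fragment is saved
    EXPENSIVE = 3   # run only on explicit request


# Trigger -> most expensive cost class that the trigger still runs.
TRIGGER_BUDGET = {
    "edit": Cost.CHEAP,
    "save": Cost.MODERATE,
    "request": Cost.EXPENSIVE,
}


class ConstraintRegistry:
    def __init__(self):
        self._constraints = []  # entries are (cost, name, check)

    def register(self, cost, name, check):
        """Register a constraint; `check(model)` returns True if it holds."""
        self._constraints.append((cost, name, check))

    def validate(self, model, trigger):
        """Return the names of violated constraints within the trigger's budget."""
        budget = TRIGGER_BUDGET[trigger]
        return [name for cost, name, check in self._constraints
                if cost <= budget and not check(model)]
```

A more expensive trigger runs everything the cheaper ones run plus its own class, so an explicit validation request never reports fewer problems than a keystroke does.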

■ Type Systems  Type systems are a special kind of constraint. Consider the example of var int x = 2 * someFunction(sqrt(2));. The type system constraint may check that the type of the variable is the same as, or a supertype of, the type of the initialization expression. However, establishing the type of the init expression is non-trivial, since it can be an arbitrarily complex expression. A type system is a formalism or framework for defining the rules to establish the types of arbitrary expressions, as well as type checking constraints. It is a form of constraint checking. We cover the implementation of type systems in Part III of the book (Section 10).
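To make the two parts of a type system concrete (rules that compute the type of an arbitrary expression, plus a checking constraint), here is a minimal sketch for the example declaration. Everything in it is an assumption made for illustration: the tiny expression classes, the two-type widening order, and in particular the return types assigned to sqrt and someFunction, which the text does not specify.

```python
from dataclasses import dataclass

# Assumed numeric types, ordered from narrowest to widest.
WIDENING = ["int", "double"]

# Hypothetical signatures; the text does not define someFunction.
RETURN_TYPES = {"sqrt": "double", "someFunction": "int"}


@dataclass
class Num:
    value: object


@dataclass
class Call:
    function: str
    args: list


@dataclass
class Mul:
    left: object
    right: object


def type_of(expr):
    """Typing rules: compute the type of an arbitrarily nested expression."""
    if isinstance(expr, Num):
        return "int" if isinstance(expr.value, int) else "double"
    if isinstance(expr, Call):
        for arg in expr.args:
            type_of(arg)  # arguments must themselves be typable
        return RETURN_TYPES[expr.function]
    if isinstance(expr, Mul):
        # The product takes the wider of the two operand types.
        return max(type_of(expr.left), type_of(expr.right), key=WIDENING.index)
    raise TypeError(f"cannot type {expr!r}")


def is_assignable(target, source):
    """True if `target` is the same type as `source` or a wider one."""
    return WIDENING.index(target) >= WIDENING.index(source)


def check_var_decl(declared, init):
    """Type checking constraint for a variable declaration with an initializer."""
    actual = type_of(init)
    if is_assignable(declared, actual):
        return []
    return [f"cannot initialize a variable of type {declared} "
            f"with a value of type {actual}"]
```

Under these assumed signatures, 2 * someFunction(sqrt(2)) has type int, so declaring the variable as int or double passes, whereas var int x = 2 * sqrt(2) would be reported.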

When designing constraints and type systems in a language, a decision has to be made between two approaches: (a) declaration of intent and checking for conformance, and (b) deriving characteristics and checking for consistency. Consider the following examples.

mbeddr C: For variables, the type has to be specified explicitly. A type specification expresses the intent that this variable be, for example, of type int. Alternatively, a type system could be built to automatically derive the type of the variable declaration based on the type of the init expression, an approach known as type inference. This would allow the following code: var x = 2 * someFunction(sqrt(2));. Since no type is explicitly specified, the type system will infer the type of x to be the type calculated for the init expression. ◂

mbeddr C: State machines that are supposed to be verified by the model checker have to be marked as verified. In that case, additional constraints kick in that report specific ways of writing actions as invalid, because they cannot be handled by the model checker. An alternative approach could check a state machine for whether these "unverifiable" ways of writing actions are used, and if so, mark the state machine as not verifiable. ◂

Pension Plans: Pension plans can inherit from another plan (called the base plan). If a pension calculation rule overrides a rule in the base plan, then the overriding rule has to be marked as overrides. In this way, if the rule in the base plan is removed or renamed, validation of the sub-plan will report an error. An alternative design would simply infer the fact that a rule overrides another one if they have the same name and signature. ◂

Note how in all three cases the constraint checking is done in two steps. First we declare an intent (variable is intended to be int, this state machine is intended to be verifiable, a rule is intended to override another one). We then check whether the program conforms to this intention. The alternative approach would infer the fact from the program (the variable’s type is whatever the expression’s type evaluates to, state machines are verifiable if the "forbidden" features aren’t used, rules override another rule if they have the same name and signature) without any explicitly specified intent.

When designing constraints and type systems, a decision has to be made regarding when to use which approach. Here are some trade-offs. The specification/conformance approach requires more code to be written, but results in more meaningful and specific error messages.
A message can express the fact that one part of a program does not conform to a specification made by another part of the program [21]. The derivation/consistency approach is less effort to write and can hence be seen to be more convenient, but it requires more effort in constraint checking, and error messages may be harder to understand because of the missing, explicit "hard fact" about the program.

[21] It also reduces the chance that users do something they do not intend by mistake. For example, a user might not want to override a method from the base class, but it might happen just because the user uses the same name and parameters.
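The difference between the two approaches can be seen side by side in a small sketch based on the pension plan example. The Rule class and both functions are hypothetical, and the second error in check_declared (an unmarked rule that happens to match a base rule) is an added assumption in the spirit of the accidental override case, not something the pension DSL is stated to report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    signature: tuple
    overrides: bool = False  # the explicitly declared intent


def check_declared(base_rules, sub_rules):
    """Specification/conformance: compare declared intent with the base plan."""
    base = {(r.name, r.signature) for r in base_rules}
    errors = []
    for r in sub_rules:
        exists = (r.name, r.signature) in base
        if r.overrides and not exists:
            errors.append(f"rule '{r.name}' is marked as overrides, "
                          "but the base plan has no such rule")
        if exists and not r.overrides:
            errors.append(f"rule '{r.name}' hides a base rule; "
                          "mark it as overrides")
    return errors


def derive_overrides(base_rules, sub_rules):
    """Derivation/consistency: infer overriding from name and signature alone."""
    base = {(r.name, r.signature) for r in base_rules}
    return {r.name for r in sub_rules if (r.name, r.signature) in base}
```

If the base rule is renamed, check_declared reports a specific error against the sub-plan, whereas derive_overrides silently stops treating the rule as an override. That silent change is the missing "hard fact" the text refers to.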

The specification/conformance approach can also be used to "anchor" the constraint checker, because a fixed fact about the program is explicitly given instead of having to be derived from a (possibly large) part of the program. This decouples models and can increase scalability. Consider the following example. A program contains a function call, and the type checker needs to check the typing for this call. To do so, it has to determine the type of the called function. Assume this function does not specify the return type explicitly; instead, it is inferred from the returned expressions. These expressions may be calls to other functions, so the process repeats. In the worst case, a whole chain of function calls must be followed in this way to calculate the type of the function initially called by your program. Notice that this requires accessing all the downstream programs, so these all have to be loaded and type checked! In large systems, this can lead to serious performance and scalability issues [22]. If, instead, the type of the called function were given explicitly, no downstream models would need to be accessed or loaded.

[22] We have seen such problems in practice with large-scale models.

■ Multi-Level Constraints  Several sets of constraints can be used to enforce multiple levels of correctness/strictness/compliance for models. The first level typically consists of basic constraints (such as name uniqueness) and typing rules. These are checked for every program. Additional levels are often optional. They are triggered either by a configuration switch or by using the programs for a given purpose. Additional levels always constrain programs further relative to more basic levels.

mbeddr C: A nice example of multi-level constraints can be seen in the state machines extension to C. Structural and type system correctness (for C and for state machines) is always checked for every program. However, if a state machine is marked as verifiable, then the action code is further restricted via additional constraints. For example, it is not allowed to read and write the same variable during a single state change (i.e. in all of the code in the exit actions of the current state, the transition actions and entry actions of the target state). This is necessary to keep the complexity of the generated model checker input code within limits. ◂

mbeddr C: Another example concerns the use of floating point types. Some target devices may not have floating point units (FPUs), which means that floating point types (float, double) cannot be used in programs that should be deployed on such a target device. So, as the user changes the target device in the build configuration, additional constraints are checked that report floating point types as errors if the target has no FPU. ◂
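The floating point example can be sketched as two constraint levels, where the second level depends on the build configuration. The Target and VarDecl classes and the target names are hypothetical; the sketch shows the layering, not how mbeddr implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    has_fpu: bool


@dataclass(frozen=True)
class VarDecl:
    name: str
    type: str


FLOATING_POINT_TYPES = {"float", "double"}


def basic_constraints(decls):
    """Level 1: always checked (here: unique variable names)."""
    seen, errors = set(), []
    for d in decls:
        if d.name in seen:
            errors.append(f"duplicate variable '{d.name}'")
        seen.add(d.name)
    return errors


def target_constraints(decls, target):
    """Level 2: only active for targets without an FPU."""
    if target.has_fpu:
        return []
    return [f"'{d.name}': type {d.type} is not available on {target.name} (no FPU)"
            for d in decls if d.type in FLOATING_POINT_TYPES]


def validate(decls, target):
    """The additional level only ever adds errors to those of the basic level."""
    return basic_constraints(decls) + target_constraints(decls, target)
```

The same program is valid for one target and invalid for another: switching the target in the build configuration is what activates the stricter level.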

4.3.2 Establishing the Correctness of the Execution Engine

Earlier we defined the meaning of the program p at $D_n$ as the equivalent observable behavior of a program q at $D_{n-1}$. This essentially defines the transformation or interpreter to be correct. However, this is useless in practice. As the language developer, we have a specific behavior in mind, and we want to make sure that the executing DSL program exhibits this behavior. We have to make sure that the execution engine executes the DSL program accordingly.

In classical programming, we write the GPL code based on our understanding of the requirements. We then write unit tests, based on the same understanding, which test that code (Fig. 4.4).

Figure 4.4: Test code tests the application code based on a single understanding of the requirements for the system.

In DSL testing, we write the DSL, the DSL program and the execution engine based on our understanding of the requirements for the system. We can still write unit tests (in the GPL) based on this understanding to check for the correctness of the executing DSL program (Fig. 4.5).

Figure 4.5: Using DSLs, a test written on D − 1 tests the D program as well as the transformation from D to D − 1.

Writing one DSL program and one unit test ensures that this one program executes correctly regarding the test case. Our goal here is, however, to ensure that the transformation is correct for all programs we can write with the DSL. This can be achieved by writing many DSL programs and many tests, enough to make sure that every branch of the transformation is covered at least once [23]. As always in testing, we encounter the coverage problem: we have to write enough example programs and tests to cover all aspects of the language and the execution engine. In particular, we have to first think of the corner cases to write tests for them [24].

[23] For the Xpand code generator there is a coverage analysis tool that can be used to make sure that for a given test suite, all branches of the transformation template have been executed at least once.

[24] The coverage problem can be solved in some cases by automatic test case generation and formal verification. We discuss this later in this chapter.

A variant of this approach is to express the test cases in the DSL (after extending the DSL with a way to express tests) and then execute the application code and the test code on the target platform together (Fig. 4.6). This is often more convenient, since the tests can be formulated more concisely on the level of the DSL. As we will see later, this approach is especially useful if we have several execution engines: we write the test once and then execute it on all execution engines.

Figure 4.6: Test cases can also be expressed with the DSL and then executed on the target platform together with the application code.

Note that the GPL program may have additional, unintended behaviors not prescribed by the DSL. These can often be exploited maliciously and are known as safety or security problems. These will not be found by testing the GPL code based on the requirements, but only by "trying to exploit" the program (known as penetration testing).

We will elaborate more on ensuring the correctness of the execution semantics in this chapter, as well as in the Part III chapter on DSL testing (Chapter 14), where we discuss the implementation aspects of DSL and IDE testing.

4.3.3 Transformation

Transformations define the execution semantics of a DSL by mapping it to another language. In the vast majority of cases a transformation for $L_D$ recreates those patterns and idioms in $L_{D_{-1}}$ for which it provides linguistic abstraction. The result may be transformed further, until a level is reached for which a language with an execution infrastructure exists, often $D_0$. Code generation, the case in which we generate GPL code from a DSL, is thus a special case in which $L_{D_0}$ code is generated.

Figure 4.7: The robot control DSL is embedded in C program modules and provides linguistic abstractions for controlling a small Lego car. It can accelerate, decelerate and turn left and right.

mbeddr C: The semantics of state machines are defined by their mapping back to C switch statements. This is repeated for higher D languages. The semantics of the robot control DSL (Fig. 4.7) is defined by its mapping to state machines and tasks (Fig. 4.8). To explain the semantics to the users, prose documentation is available as well. ◂

Figure 4.8: Robot control programs are mapped to state machines and tasks. State machines are mapped to C, and tasks are mapped to C as well as to operating system configuration files (the OIL file). In the end, everything ends up in text for downstream processing by existing tools.

Component Architecture: The component architecture DSL describes only structures: interfaces, components and systems. Many constraints about structural integrity are enforced, and a mapping to a distribution middleware is implemented. The formal definition of the semantics is implied by the mapping to the executable code. ◂
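The mapping of state machines to C switch statements mentioned in the mbeddr example can be illustrated with a deliberately small generator. This is a sketch under assumptions: the function name, the shape of the input (a list of state names and a dictionary of transitions) and the emitted C idiom are all invented here and are not what mbeddr actually generates.

```python
def generate_c(machine_name, states, transitions):
    """Emit a C step function for a state machine as a switch statement.

    `states` is a list of state names; `transitions` maps
    (source state, event) to the target state.
    """
    events = sorted({event for (_source, event) in transitions})
    lines = [
        f"typedef enum {{ {', '.join(states)} }} {machine_name}_state;",
        f"typedef enum {{ {', '.join(events)} }} {machine_name}_event;",
        f"{machine_name}_state {machine_name}_step("
        f"{machine_name}_state s, {machine_name}_event event) {{",
        "  switch (s) {",
    ]
    for state in states:
        lines.append(f"    case {state}:")
        for (source, event), target in sorted(transitions.items()):
            if source == state:
                lines.append(f"      if (event == {event}) return {target};")
        lines.append("      break;")
    # Events without a matching transition leave the state unchanged.
    lines += ["  }", "  return s;", "}"]
    return "\n".join(lines)
```

Each state becomes one case, and each outgoing transition becomes a guarded return of the target state. The generator recreates in C the idiom for which the state machine language provides a linguistic abstraction, which is the sense in which the transformation defines the semantics.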

DSL programs may be mapped to multiple languages at the same time. Typically, there is one primary language that is used for execution of the DSL program (C in Fig. 4.9). The other languages may be used to configure the target platform (generated XML files) or provide input for verification tools (NuSMV in Fig. 4.9). In this case, one has to make sure that the semantics of all generated representations is actually the same. We discuss this problem in Section 4.3.7.

mbeddr C: The state machines can be transformed to a representation in NuSMV, which is a model checker that can be used to establish properties of state machines by exhaustive search. Example properties include freedom from deadlocks, assuring liveness and specific safety properties such as "It will never happen that the out events pedestrian light green and car light green are set at the same time". ◂

Figure 4.9: From the C state machine extensions, we generate low-level C code (for execution) as well as an input file for the NuSMV model checker (for verification).

■ Multi-staged Transformation  There are several reasons why the gap between a language at D and its target platform may not be bridged by a single transformation. Instead, the overall transformation becomes a chain of subsequent transformations, an approach also known as cascading.

Multi-staged transformation is a form of modularization, and so the reason for doing it is the same reason we always use for modularization: breaking down a big problem into a set of smaller problems that can be solved independently. In the case of transformations, this "big problem" is a big semantic gap between the DSL and the target language [25]. Modularization breaks down this big semantic gap into several smaller ones, making each of them easier to understand, test and maintain [26].

[25] One could measure this semantic gap between two languages: how many constructs do two languages share, how many "synonyms" exist, how many constructs are the same but have different meanings? In practice, the size of the gap is intuited by the transformation designer.

[26] This approach can only be used if the tools support multi-staged transformation well. This is not true for all DSL tools.

Another reason for multi-stage transformations is the potential for reuse of each of the stages (Fig. 4.10). Reusing lower D languages and their subsequent transformations also implies reuse of potentially non-trivial analyses or optimizations that can be done at that particular abstraction level [27]. Consider GPL compilers as an example. They can be retargeted relatively easily by exchanging the backends (machine code generation phases) or the frontend (programming language parsers and analyzers). For example, GCC can generate code for many different processor architectures (exchangeable backends), and it can generate backend code for several programming languages, among them C, C++ and Ada (exchangeable frontends). The same is possible for DSLs. The same high D models can be executed differently by exchanging the lower D intermediate languages and transformations. Or the same lower D languages and transformations can be used for different higher D languages, by mapping these different languages to the same intermediate language.

[27] This is one of the reasons why we usually generate GPL source code from DSLs, and not machine code or byte code: we want to reuse existing transformations and optimizations provided by the GPL compiler.

mbeddr C: The embedded C language (and some of its higher D extensions) has various translation options, for several different target platforms (Win32 and Osek), an example of backend reuse. All of them are C code, but we generate different idioms in the code and different make files. ◂

Figure 4.10: Left: Backend reuse. Different languages ($L_3$/$L_2$ and $L_5$) are transformed to the same intermediate language $L_1$, reusing its backend generator to $L_0$. Right: Frontend reuse. $L_3$ is mapped to $L_2$, which has two sets of backend generators, reusing the $L_3$ to $L_2$ transformation.

Multi-stage transformation can also be a natural consequence of incremental language extension along the domain hierarchy, where we repeatedly build additional higher-level languages on top of lower-level languages. When transforming the higher-level languages, it is natural and obvious to transform them onto the next lower level, and not onto a language at $D_0$.

mbeddr C: The extensions to C are all transformed back to C idioms during transformation. Higher-level DSLs, for example, a simple DSL for robot control, are reduced to C plus some extensions such as state machines and tasks (Fig. 4.11), reusing the transformations for those abstractions back to C. ◂
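Cascading and backend reuse can be sketched as function composition over a shared intermediate model. The sketch is purely illustrative: the two frontends (a robot command list and a blink count), the dictionary-based state machine model and the textual backend are invented for this example and only mimic the shape of the mbeddr stack.

```python
def robot_to_state_machine(commands):
    """Stage 1 (higher D to lower D): a command list becomes a linear state machine."""
    states = [f"S{i}_{cmd}" for i, cmd in enumerate(commands)] + ["DONE"]
    transitions = {(states[i], "tick"): states[i + 1]
                   for i in range(len(commands))}
    return {"states": states, "transitions": transitions}


def blink_to_state_machine(times):
    """A second, unrelated frontend that targets the same intermediate model."""
    states = [name for i in range(times)
              for name in (f"ON{i}", f"OFF{i}")] + ["DONE"]
    transitions = {(states[i], "tick"): states[i + 1]
                   for i in range(len(states) - 1)}
    return {"states": states, "transitions": transitions}


def state_machine_to_text(machine):
    """Stage 2 (shared backend): render the intermediate model as text."""
    return "\n".join(f"{source} --{event}--> {target}"
                     for (source, event), target in machine["transitions"].items())


def compose(*stages):
    """Chain transformations: the output of each stage feeds the next one."""
    def run(model):
        for stage in stages:
            model = stage(model)
        return model
    return run


# Two different source languages reuse the same backend stage.
robot_compiler = compose(robot_to_state_machine, state_machine_to_text)
blink_compiler = compose(blink_to_state_machine, state_machine_to_text)
```

Because each stage only knows its direct input and output models, the backend can be tested once against the intermediate model and then reused unchanged by every frontend that targets it.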

Figure 4.11: Multi-stage transforma- tion in mbeddr. MPS supports multi- stage transformations really well, so managing the interplay of the set of transformations is feasible.
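The cascading idea can be illustrated with a small sketch. It is not taken from mbeddr or MPS: the two stages, the model shapes and all function names below are hypothetical, and the "C code" is just text. The point is only that a chain of small transformations can be composed, and that the lower stage can be reused for models that never passed through the upper stage.

```python
def robot_to_state_machine(commands):
    """Stage 1 (hypothetical): lower a robot script to a linear state machine."""
    states = [f"S{i}_{cmd}" for i, cmd in enumerate(commands)]
    return {"states": states, "transitions": list(zip(states, states[1:]))}


def state_machine_to_c(machine):
    """Stage 2 (hypothetical): reusable backend that lowers a state machine to C text."""
    lines = ["switch (state) {"]
    for source, target in machine["transitions"]:
        lines.append(f"  case {source}: state = {target}; break;")
    lines.append("}")
    return "\n".join(lines)


def chain(*stages):
    """Compose several transformations into one cascading transformation."""
    def run(model):
        for stage in stages:
            model = stage(model)
        return model
    return run


# The robot DSL is never mapped to C directly; it goes through the state machine level.
robot_to_c = chain(robot_to_state_machine, state_machine_to_c)
c_code = robot_to_c(["forward", "turn", "stop"])

# Backend reuse: a hand-written state machine uses the same stage 2 unchanged.
manual_machine = {"states": ["Idle", "Busy"], "transitions": [("Idle", "Busy")]}
manual_c = state_machine_to_c(manual_machine)
```

Each stage can be tested on its own, which is the modularization benefit described above.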

A special case of a multi-staged transformation is a preprocessor to a code generator. Here, a transformation reduces the set of used language concepts in a fragment to a minimal core, and only the minimal core is supported in the code generator. Note how, in this case, the source and target languages of the transformation are the same. However, the target model only uses a subset of the concepts defined by the source/target language. A preprocessor simplifies portability of the actual code generator: it becomes simpler, since only the subset of the language has to be mapped to code.

mbeddr C: Consider the case of a state machine to which you want to be able to add an "emergency stop" feature, i.e. a new transition from each existing state to a new STOP state. Instead of handling this case in the code generator, a model transformation script preprocesses the state machine model and adds all the new transitions and the new emergency stop state (Fig. 4.12). Once done, the existing generator is run unchanged. You have effectively modularized the emergency stop concern into a preprocessor transformation.

Figure 4.12: A transformation adds an emergency stop feature to a state machine. A new state is added (Ses), and a transition from every other state to that new state is added as well. The transition is triggered by the emergency stop event (not shown).

Component Architecture: The DSL describes hierarchical component architectures (where components are assembled from interconnected instances of other components). Most component runtime platforms don't support such hierarchical components, so you need to "flatten" the structure for execution. Instead of trying to do this in the code generator, you should consider a model transformation step to do it, and then write a simpler generator that works with a flattened, non-hierarchical model.

Multi-stage transformations can be challenging. It becomes harder to understand what is going on in total. Debugging the overall transformation can become hard, and good tool support is needed28.

28 MPS addresses this problem by (optionally) keeping all intermediate models around for debugging purposes. The language developer can select a program element in any of the intermediate models and have MPS show a trace from and to where the element was transformed in other (intermediate) models. The trace can also show the transformation code involved in each step.

Efficiency and Optimization

Transforming from D to D−1 allows the use of sophisticated optimizations, potentially resulting in very efficient code. A DSL uses domain-specific abstractions and hence includes a lot of domain semantics, so optimizations can take advantage of this and produce very efficient D−1 code. However, building such optimizations can be very expensive. It is especially hard to build global optimizations that require knowledge about the structure or semantics of large or diverse parts of the overall program. Also, an optimization will always rely on some set of rules that determine when and how to optimize. There will always be corner cases where an experienced developer will be able to write more efficient D−1 code manually. However, this requires a competent developer and, usually, a lot of effort for each specific program. A tool (i.e. the transformation in this case) will typically address the 90% case well: it will produce reasonably efficient code in the vast majority of cases with very little effort (once the optimizations have been defined). In most cases, this is good enough; in the remaining corner cases, D−1 has to be written manually29.

29 This argument in favor of tools is used in GPLs for garbage collection and optimizing compilers for higher-level programming languages.
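To make the preprocessor idea from the emergency stop example more concrete, here is a minimal sketch. The `StateMachine` class, the event name and the trivial `generate` function are assumptions made for illustration; they do not reflect the actual mbeddr model or generator. What the sketch shows is that the preprocessor maps a model to another model in the same language, and that the generator afterwards runs unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class StateMachine:
    states: list
    transitions: list = field(default_factory=list)  # (source, event, target)


def add_emergency_stop(machine, stop_state="STOP", event="emergencyStop"):
    """Model-to-model preprocessor: source and target language are the same."""
    stop_transitions = [(state, event, stop_state) for state in machine.states]
    return StateMachine(
        states=machine.states + [stop_state],
        transitions=machine.transitions + stop_transitions,
    )


def generate(machine):
    """Stand-in for the existing generator; it knows nothing about emergency stops."""
    return [f"{src} --{event}--> {dst}" for src, event, dst in machine.transitions]


original = StateMachine(["Idle", "Running"], [("Idle", "start", "Running")])
extended = add_emergency_stop(original)
output = generate(extended)
```

The original model is left untouched, so the same machine can still be generated without the emergency stop concern.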

Care about Generated Code

Ideally, generated code is a throw-away artifact, like object files in a C compiler. However, that's not quite true. At least during development and test of the generator you may have to read, understand and debug the generated code. For incomplete DSLs30, i.e. those in which parts of the resulting program have to be written manually in LD−1, readability and good structure are even more important, because the manually written code has to be integrated with the generated parts of the LD−1 program. Hence, generated code should use meaningful abstractions, should be designed well, use good names for identifiers, be documented well, and be indented correctly. In short, generated code should generally adhere to the same standards as manually written code. This also helps to defuse some of the skepticism against code generation that is still widespread in some organizations.

30 We cover completeness in Section 4.5.

Note that in complete languages (where 100% of the LD−1 code is generated), the generated code is never seen by a DSL user. But even in this case, concerns for code quality apply and the code has to be understood and tested during DSL and generator development.

However, there are several exceptions to this rule:

• Sometimes generating really well-structured code makes the generator much more complicated. You then have to decide whether you want to live with some less nicely structured generated code, or whether you want to increase generator complexity. This is a valid trade-off, since the generator also needs to be maintained! A good example is import statements when generating Java code. It can be a lot of work to find out exactly which imports are needed in a generated class. In this case it may be better to keep the generator simple and use fully qualified class names throughout the code, and/or to import a few too many classes31.

• Using a generator opens up additional options you wouldn't consider when writing code manually (and which are hence considered ugly). An example is generated collection classes. Imagine that your models define entities, and from each entity you generate a Java Bean. In Java version 1.4 and earlier, Java did not have generics, so in order to work with collections of entities you would use the generic List class. In the context of generated code you might want to consider generating a specific collection class for each entity, with an API typed to the respective Java Bean. This makes life much more convenient for those people who write Java code that uses the generated Beans.

• The third exception to the rule is if the code has to be highly optimized for reasons of performance and code size. While you can still indent your code well and use meaningful names, the structure of the code may be convoluted. Note, however, that the code would look the same if it were written by hand in this case.

31 Xtend provides special support for this problem based on an ImportManager. It makes generating the correct imports relatively simple.

mbeddr C: The components extension to C supports components with provided and required ports. A required port declares which interface it is expected to be connected to. The same interface can be provided by different components, implementing the interface differently. Upon translation of the component extension, regular C functions are generated. An outgoing call on a required port has to be routed to the function that has been generated to implement the called interface operation in the target component. Since each component can be instantiated multiple times, and each instance can have its required ports connected to different component instances (implementing the same interface), there is no way for the generated code to know which particular function has to be called for an outgoing call on a required port for a given instance. An indirection through function pointers is used instead. Consequently, functions implementing operations in components take an additional struct as an argument, which provides those function pointers for each operation of each required port. A call on a required port is therefore a relatively ugly affair based on function pointers. However, to achieve the desired goal, no different, cleaner code approach is possible in C. It is optionally possible to restrict a required port to a particular component (Fig. 4.13). In this case, the target function is known statically and no function pointer-based indirection is required. The resulting code is cleaner and more efficient. Programmers trade flexibility for performance.

Figure 4.13: The required port lowlevel is not just bound to the ILowLevel interface, but restricted to the ll port of the LowLevelCode component. This way, it is statically known which C function implements the behavior and the generated code can be optimized.
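The trade-off between the function pointer indirection and a statically bound call can be sketched as a tiny generator function. This is an illustration only: the naming scheme of the emitted C text, the `instance` struct and the `bound_function` parameter are hypothetical and do not show what mbeddr actually generates.

```python
def generate_port_call(port, operation, bound_function=None):
    """Emit C text for an outgoing call on a required port (hypothetical naming)."""
    if bound_function is not None:
        # The port is restricted to one component: the target function is
        # known at generation time, so a direct call can be emitted.
        return f"{bound_function}(instance);"
    # Otherwise the target depends on how this instance is wired, so the call
    # goes through a function pointer stored in the instance struct.
    return f"instance->{port}_{operation}(instance->{port}_target);"


dynamic_call = generate_port_call("lowlevel", "doSomething")
static_call = generate_port_call(
    "lowlevel", "doSomething", bound_function="LowLevelCode_ll_doSomething"
)
```

The generator, not the programmer, decides which shape of code is produced, based on the extra information in the model.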

Platform

The complexity of a code generator can be reduced by splitting the overall transformation into several steps (see above). Another approach is to work with a manually implemented, rich domain-specific platform. This typically consists of middleware, frameworks, drivers, libraries and utilities that are taken advantage of by the generated code.

Figure 4.14: Typical layering structure of an application created using DSLs.

Where the generated code and the platform "meet" depends on the complexity of the generator, requirements regarding code size and performance, the expressiveness of the target language and the potential availability of libraries and frameworks that can be used for the task. In the extreme case, the generator just generates code to populate (or configure) the frameworks (which might already exist, or which you have to grow together with the generator) or provides statically typed facades around otherwise dynamic data structures. Don't go too far towards this end, however. In cases in which you need to consider resource or timing constraints, or when the target platform is predetermined and perhaps limited, code generation is the better approach: trying to make the platform too generic or flexible will increase its complexity.

Figure 4.15: Stalagmites and stalactites in limestone caves as a metaphor for a generator and a platform: the stalagmite represents the platform; it grows up from the lower abstraction levels. Stalactites represent the transformations, which grow down from the high abstraction level represented by the DSL.

mbeddr C: For most aspects, we use only a very shallow platform. This is mostly for performance reasons and for the fact that the subset of C that is often used for embedded systems does not provide good means of abstraction. For example, state machines are translated to switch statements. If we were to generate Java code in an enterprise system, we might populate a state machine framework instead. In contrast, when we translate the component definitions to the AUTOSAR target environment, a relatively powerful platform is used, namely the AUTOSAR APIs, conventions and generators.

4.3.4 Interpretation

An interpreter is basically a program that acts on the DSL program it receives as an input. How it does that depends on the particular paradigm used (see Section 5.2). For imperative programs it steps through the statements and executes their side effects. In functional programs, the interpreter (recursively) evaluates functions. For declarative programs, some other evaluation strategy, for example based on a solver, may be used. We describe some of the details about how to design and implement interpreters in Section 12.

Refrigerators: The DSL also supports the definition of unit tests for the asynchronous, reactive cooling algorithm. These tests are executed with an in-IDE interpreter. A simulation environment allows the interpreter to be used interactively. Users can "play" with a cooling program, stepping through it in single steps, watching values change.

Pension Plans: The pension DSL supports the in-IDE execution of rule unit tests by an interpreter. In addition, the rules can be debugged. The rule language is functional, so the debugger "expands" the calculation tree, and users can inspect all intermediate results.

For interpretation, the domain hierarchy could be exploited as well: the interpreter for LD could be implemented in LD−1. However, in practice we see interpreters written in LD0. They may be extensible, so new interpreter code can be added to deal with the case where higher-level languages add new language concepts.

The abstraction level of an interpreter must be decided. One alternative might ignore, for example, the use of registers when performing an assignment, avoiding problems resulting from parallelism. Alternatively, the interpreter might model everything, taking into account issues related to parallelism. In other words, an interpreter defines a virtual machine, and it is fundamental that this virtual machine has an adequate abstraction level. The users must be aware of exactly what it means for the execution of the program on the target hardware if the program runs on the virtual machine.
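As a minimal sketch of what "stepping through the statements and executing their side effects" can look like, consider the following interpreter for a made-up imperative mini-language. The tuple-based program representation and the statement kinds `assign` and `while` are assumptions for this example; they are not the representation used by any of the case study DSLs. Expressions are evaluated recursively, statements are executed for their effect on an environment.

```python
def evaluate(expr, env):
    """Evaluate a number, a variable name, or an (operator, left, right) tuple."""
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    lhs, rhs = evaluate(left, env), evaluate(right, env)
    if op == "+":
        return lhs + rhs
    if op == "-":
        return lhs - rhs
    if op == "<":
        return lhs < rhs
    raise ValueError(f"unknown operator {op!r}")


def execute(statements, env):
    """Step through the statements and apply their side effects to env."""
    for statement in statements:
        kind = statement[0]
        if kind == "assign":
            _, name, expr = statement
            env[name] = evaluate(expr, env)
        elif kind == "while":
            _, condition, body = statement
            while evaluate(condition, env):
                execute(body, env)
        else:
            raise ValueError(f"unknown statement {kind!r}")
    return env


# Lower the temperature in steps of 2 while it is above 4.
program = [
    ("assign", "temp", 10),
    ("while", ("<", 4, "temp"), [("assign", "temp", ("-", "temp", 2))]),
]
result = execute(program, {})
```

This interpreter sits at a very high abstraction level: it ignores registers, timing and parallelism entirely, which is exactly the kind of decision discussed above.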

4.3.5 Transformation versus Interpretation

When defining the execution semantics for a language, a decision has to be made between transformation (code generation) and interpretation. Here are some criteria to help with this decision.

Code Inspection: When using code generation, the resulting code can be inspected to check whether it resembles code that had previously been written manually in the DSL's domain. Writing the transformation rules can be guided by the established patterns and idioms in LD−1. Interpreters are meta programs and as such harder to relate to existing code patterns.

Debugging: Debugging generated code is straightforward if the code is well structured (which is up to the transformation) and an execution paradigm is used for which a decent debugging approach exists (not the case for many declarative approaches). Debugging interpreters is harder because they are meta programs. For example, setting breakpoints in the DSL program requires conditional breakpoints in the interpreter, which are typically cumbersome to use32.

32 This is especially useful during the development of the execution engine. Once the DSL and the engine are finished, users should be able to debug DSL programs on the level of the DSL. However, since building DSL debuggers is not directly supported by most language workbenches, this is a lot of work, and users are required to debug on LD−1.

Performance and Optimization: The code generator can perform optimizations that result in small and tight generated code. The compiler for the generated code may come with its own optimizations, which are used automatically if source code is generated and subsequently compiled, simplifying the code generator33. Generally, performance is better in generated environments, since interpreters always imply an additional layer of indirection during the execution of the program.

33 For example, it is not necessary to optimize away calls to empty functions, if statements that always evaluate to true, or arithmetic expressions containing only constants.

Platform Conformance: Generated code can be tailored to any target platform. The code can look exactly as manually written code would look; no support libraries are required. This is important for systems in which the source code (and not the DSL code) is the basis for a contractual obligation or for review and/or certification. Also, if artifacts need to be supplied to the platform that are not directly executable (descriptors, meta data), code generation is more suitable.

Modularization: When incrementally building DSLs on top of existing languages, it is natural to use transformations to LD−1 (footnote 34).

34 While it is theoretically possible to also extend interpreters incrementally along a hierarchy of languages, I have not seen this in practice. Interpreters are typically written in a GPL.

Turnaround Time: Turnaround time for interpretation is better than for generation: no generation, compilation and packaging step is required. For target languages with slow compilers especially, large amounts of generated code can be a problem.

Runtime Change: In interpreted environments, the DSL program can be changed as the target system runs; the DSL editor can even be integrated into the target system35.

35 The term data-driven system is often used in this case.

Refrigerators: There were two reasons for implementing the interpreter for the cooling programs. The first was that initially we didn't have a code generator, because the target architecture was not yet defined. To be able to execute cooling programs, we needed an interpreter and simulator. Second, the turnaround time for the domain experts as they experimented with the DSL programs is much reduced compared to generating, compiling and running C code. The (interpreted) simulator also allowed the domain experts to run the programs at a speed they could follow. This proved an important means of understanding and debugging the asynchronous reactive cooling programs.

mbeddr C: This DSL exploits incremental extension to the C programming language (inductive DSL definition). In this case it is natural to use transformation to LD−1 as a means of defining the semantics of extensions. Also, since the target domain is embedded software, performance, code size and reuse of the optimizations provided by the C compiler are essential. Interpretation was never an option.

Component Architecture: The driving factor for using generation over interpretation was platform conformance. The reason for the DSL is to automate the generation of target platform artifacts and thereby make working with the platform more efficient.

Pension Plans: Turnaround time was important for the pension contract specification. Also, the domain experts, as they created the pension plans, did not have access to the final execution platform. An in-IDE interpreter was clearly the best choice.

WebDSL: Platform conformance was key here. Web applications have to use the established web standards, and the necessary artifacts have to be generated. An interpreted approach would not work in this scenario.

Combinations of the two approaches are also possible. For example, transformation can create an intermediate representation which is then interpreted. Or an interpreter can generate code on the fly as a means of optimization. While this approach is common in GPLs (e.g., the JVM), we have not seen this approach used for DSLs.

4.3.6 Sufficiency

A program fragment is sufficient for transformation T if the fragment itself contains all the data necessary to execute the transformation. While dependent fragments are by definition not sufficient without the transitive closure of fragments they depend on, an independent fragment may be sufficient for one transformation, and insufficient for another.

Refrigerators: The hardware structure is sufficient for a transformation that generates an HTML document that describes the hardware. It is insufficient regarding the C code generator, since the behavior fragment is required as well.

Sufficiency is important where large systems are concerned. A sufficient fragment can be used for code generation without checking out and/or loading other fragments. This supports modular, incremental transformations of only the changed fragments, and hence, potentially significant improvements in performance and scalability.
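The refrigerator example can be restated as a small check. This is only a sketch under the assumption that every transformation declares which kinds of fragments it needs as input; the table `REQUIRED_FRAGMENTS` and the transformation names are made up for illustration.

```python
# Hypothetical declaration of the inputs each transformation needs.
REQUIRED_FRAGMENTS = {
    "html_documentation": {"hardware"},
    "c_code": {"hardware", "behavior"},
}


def is_sufficient(available_fragments, transformation):
    """A fragment set is sufficient if it covers everything the transformation reads."""
    return REQUIRED_FRAGMENTS[transformation] <= set(available_fragments)


hardware_only = {"hardware"}
can_document = is_sufficient(hardware_only, "html_documentation")
can_generate_c = is_sufficient(hardware_only, "c_code")
```

A build tool could use such a check to decide which fragments have to be loaded before a given transformation runs.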

4.3.7 Synchronizing Multiple Mappings

Ensuring the semantics of the execution engine becomes more challenging if we transform the program to several different targets using several different transformations. We have to ensure that the semantics of all resulting programs are identical36. In practice, this case often occurs if an interpreter is used in the IDE for "experimenting" with the models, and a code generator creates efficient code for execution in the target environment. To synchronize the semantics in this case, we recommend providing a set of test cases that are expressed on DSL level, and that are executed in all executable representations, expecting them to succeed in all of them. If the coverage of these test cases is high enough to cover all of the observable behavior, then it can be assumed with reasonable certainty that the semantics are indeed the same37.

36 At least to the extent we care: we may not care if one of the resulting programs is faster or more scalable. In fact, these differences may be the very reason for having several mappings.

37 Strictly speaking they are just bug-compatible, i.e. they may all make the same mistakes.

Pension Plans: The unit tests in the pension plans DSL are executed by an interpreter in the IDE. However, as Java code is generated from the pension plan specifications, the same unit tests are also executed by the generated Java code, expecting the same results as in the interpreted version.

Refrigerators: A similar situation occurs with the cooling DSL where an IDE-interpreter is used for testing and experimenting with the models, and a code generator creates the executable version of the cooling algorithm that actually runs on the microcontroller in the refrigerator. A suite of test cases is used to ensure the same semantics.
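The recommendation to run one DSL-level test suite against every executable representation can be sketched as follows. The expression language, the interpreter and the generator (which emits Python source text instead of Java or C, purely to keep the example runnable) are all hypothetical stand-ins; none of this is the pension or cooling DSL implementation.

```python
import operator

OPERATORS = {"+": operator.add, "*": operator.mul}


def interpret(expr, env):
    """Execution engine 1: evaluate the expression tree directly."""
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    return OPERATORS[op](interpret(left, env), interpret(right, env))


def generate(expr):
    """Execution engine 2, part one: generate source text from the same tree."""
    if isinstance(expr, (int, float)):
        return repr(expr)
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    return f"({generate(left)} {op} {generate(right)})"


def run_generated(expr, env):
    """Execution engine 2, part two: execute the generated source."""
    return eval(generate(expr), {"__builtins__": {}}, dict(env))


# Test cases are expressed once, on the level of the DSL.
TEST_CASES = [
    (("+", "salary", ("*", "years", 100)), {"salary": 2000, "years": 3}, 2300),
    (("*", ("+", 1, 2), "factor"), {"factor": 4}, 12),
]


def failing_engines():
    """Run every test case on every engine and collect the mismatches."""
    failures = []
    for expr, env, expected in TEST_CASES:
        for engine in (interpret, run_generated):
            if engine(expr, env) != expected:
                failures.append((engine.__name__, expr))
    return failures
```

An empty result from `failing_engines()` says only that the engines agree on the cases that were written down, which is the bug-compatibility caveat made above.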

4.3.8 Choosing between Several Mappings

Sometimes there are several alternative ways in which a program in LD can be translated to a single LD−1, for example to realize different non-functional requirements (optimizations, target platform, tracing or logging). There are several ways in which one alternative may be selected:

• In analogy to compiler switches, the decision can be controlled by additional external data. Simple parameters passed to the transformation are the simplest case. A more elaborate approach is to have an additional model, called an annotation model, which contains data used by the transformation to decide how to translate the core program. The transformation uses the LD program and the annotation model as its input. There can be several different annotation models for the same core model that define several different transformations, to be used alternatively. An annotation model is a separate viewpoint (Section 4.4) and can hence be provided by a different stakeholder than the one who maintains the core LD program.

• Alternatively, LD can be extended to directly contain addi- tional data to guide the decision. Since the data controlling the transformation is embedded in the core program, this is only useful if the DSL user can actually decide which al- ternative to choose, and if only one alternative should be chosen for each program. Annotation models provide more flexibility.

• Heuristics, based on patterns, idioms and statistics extracted from the LD program, can be used to determine the applicable transformation as well. Codifying these rules and heuristics can be hard though, so this approach is rarely used.
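The annotation model option from the list above can be sketched like this. The core program is reduced to a list of function names, and the annotation model is a separate dictionary that the transformation consults; both shapes, the `trace` flag and the emitted C text are assumptions for illustration only.

```python
def transform(program, annotations=None):
    """Translate the core program; the annotation model selects the mapping."""
    annotations = annotations or {}
    output = []
    for function in program:
        flags = annotations.get(function, set())
        if "trace" in flags:
            # Alternative mapping: wrap the implementation with tracing.
            output.append(
                f'void {function}(void) {{ trace("{function}"); {function}_impl(); }}'
            )
        else:
            output.append(f"void {function}(void) {{ {function}_impl(); }}")
    return output


core_program = ["cool", "defrost"]

# Two annotation models for the same core program, maintained separately.
production_annotations = {}
debug_annotations = {"cool": {"trace"}}

production_code = transform(core_program, production_annotations)
debug_code = transform(core_program, debug_annotations)
```

The core program is identical in both runs; only the external annotation model differs, so it can be owned by a different stakeholder.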

As we have suggested above in the case of multiple transforma- tions of the same LD program, here too extensive testing must be used to make sure that all translations exhibit the same se- mantics (except for the non-functional characteristics that may be expected to be different, since they often are the reason for the different transformations in the first place).

4.3.9 Reduced Expressiveness and Verification

It may be beneficial to limit the expressiveness of a language. Limited expressiveness often results in more sophisticated analyzability. For example, while state machines are not very expressive (compared to fully fledged C), sophisticated formal verification algorithms are available (e.g., model checking using SPIN38 or NuSMV39). The same is true for first-order logic, where satisfiability (SAT) solvers40 can be used to check programs for consistency. If these kinds of analyses are useful for the model purpose, then limiting the expressiveness to the respective formalism may be a good idea, even if it makes expressing some programs in D more cumbersome41. Possibly a DSL should be partitioned into several sub-DSLs, where some of them are verifiable and some are not.

38 spinroot.com

39 nusmv.fbk.eu/

40 D. G. Mitchell. A SAT solver primer. EATCS, 85:112–132, 2005

41 A simple example is to use integers with ranges int[0..10] x; instead of general integers. This makes programs harder to write (ranges must be specified every time) but easier to analyze.

mbeddr C: This is the approach used here: model checking is provided for the state machines. No model checking is available for general-purpose C, so behavior that should be verifiable must be isolated into a state machine explicitly. State machines interact with their surrounding C program in a limited and well-defined way to isolate them and make them checkable. Also, state machines marked as verifiable cannot use arbitrary C code in their actions. Instead, an action can only change the values of variables local to the state machine and set output events (which are then mapped to external functions or component runnables). The key here is that the state machine is completely self-contained regarding verification: adapting the state machine to its surrounding C program is a separate concern and irrelevant to the model checker.

However, the language may have to be reduced to the point where domain experts are not able to use the language because the connection to the domain is too loose. To remedy this problem, a language with limited expressiveness can be used at D−1. For analysis and verification, the LD programs are transformed down to the verifiable LD−1 language. Verification is performed on LD−1, mapping the results back to LD. Transforming to a verifiable formalism also works if the formalism is not at D−1, as long as a mapping exists. The problem with this approach is the interpretation of analysis results in the context of the DSL. Domain users may not be able to interpret the results of model checkers or solvers, so they have to be translated back to the DSL. Depending on the semantic gap between the generated model checker input program and the DSL, this can be very hard.
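To give a feeling for why a restricted formalism is easier to analyze, here is a deliberately simple sketch: because a state machine has finitely many states and explicit transitions, an exhaustive search can find states that can never be entered. This is far less than what model checkers such as SPIN or NuSMV do, and the dictionary-based machine representation and the state names are assumptions for the example.

```python
from collections import deque


def reachable_states(initial, transitions):
    """Breadth-first search over a machine given as {state: {event: target}}."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for target in transitions.get(state, {}).values():
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


machine = {
    "Off": {"power": "Idle"},
    "Idle": {"start": "Cooling", "power": "Off"},
    "Cooling": {"done": "Idle"},
    "Defrost": {"done": "Idle"},  # no transition ever leads here
}

unreachable = set(machine) - reachable_states("Off", machine)
```

The same question asked of arbitrary C code is undecidable in general, which is the reason for isolating verifiable behavior into the restricted sub-language.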

4.3.10 Documentation

Formally, defining semantics happens by mapping the DSL concepts to D−1 concepts for which the semantics is known. For DSLs used by developers, and for domains that are defined inductively (bottom-up), this works well. For application domain DSLs, and for domains defined deductively (top-down), this approach is not necessarily good enough, since the D−1 concepts have no inherent meaning to the users and/or the domain. An additional way of defining the meaning of the DSL is required. Useful approaches include prose documentation42 as well as test cases or simulators. This way, domain users can "play" with the DSL and write down their expectations formally in test cases.

42 We suggest always writing such documentation in tutorial style, or as FAQs. Hardly anyone reads "reference documentation": while it may be complete and correct, it is boring to read and does not guide users through using the DSL.

mbeddr C: The extensible C language comes with a 100-page PDF that shows how to use the MPS-based IDE, illustrates the changes to regular C, provides examples for all C extensions and also discusses how to use the integrated analysis tools.

Refrigerators: This DSL has a separate viewpoint for defining test cases where domain experts can codify their expectations regarding the behavior of cooling programs. An interpreter is available to simulate the programs, observe their progress and stimulate them to see how they react.

Pension Plans: This DSL supports an Excel-like tabular notation for expressing test cases for pension calculation rules (Fig. 4.16). The calculations are functional, and the calculation tree can be expanded as a way of debugging the rules.

Figure 4.16: Test cases in the pension language allow users to specify test data for each input value of a rule. The rules are then evaluated by an interpreter, providing immediate feedback about incorrect rules (table rows are colored red and green, not visible in the printed version).

4.4 Separation of Concerns

A domain D may be composed from different concerns. Each concern covers a different aspect of the overall domain. When developing a system in a domain, all the concerns in that domain have to be addressed. Separation into concerns is often driven by different aspects of the system being specified by different stakeholders or at different times in the development process. Fig. 4.17 shows D1.1 composed from the concerns A, B and C.

For embedded software, these could be component and interface definitions (A), component instantiation and connections (B), as well as scheduling and bus allocation (C).

Figure 4.17: A domain may consist of several concerns. A domain is covered either by a DSL that addresses all of these concerns, or by a set of related, concern-specific DSLs.

Two fundamentally different approaches are possible to deal with the set of concerns in a domain. Either a single, integrated language can be designed that addresses all concerns of D in one integrated model. Alternatively, separate concern-specific DSLs can be defined, each addressing one or more of the domain's concerns43. A complete program then consists of a set of dependent, concern-specific fragments that relate to each other in a well-defined way. Viewpoints support this separation of domain concerns into separate DSLs. Fig. 4.18 illustrates the two different approaches.

43 Strictly speaking, this is not quite true: some concerns are typically also addressed by the execution engine. We discuss this below in the section on Cross-Cutting Concerns.

Figure 4.18: Left: An integrated DSL, where the various concerns (repre- sented by different line styles) are covered by a single integrated language (and consequently, one model). Right: Several viewpoint languages (and model fragments), each covering a sin- gle concern. Arrows in Part B highlight dependencies between the viewpoints.

mbeddr C: The tasks language module includes the task implementation as well as task scheduling in one language construct. Scheduling and implementation are two concerns that could have been separated. We opted against this, because both concerns are specified by the same person. The language used for implementation code is med.core, whereas the task constructs are defined in the med.tasks language. So the languages are modularized, but they are used together in a single heterogeneous fragment.

WebDSL: Web programs consist of multiple concerns including persistent data, user interface and access control. WebDSL provides specific languages for these concerns, but linguistically integrates them into a single language44. Declarations in the languages can be combined in WebDSL modules. A WebDSL developer can choose how to factor declarations into modules; e.g., all access control rules in one module, or all aspects of some feature together in one module.

44 Z. Hemel, D. M. Groenewegen, L. C. L. Kats, and E. Visser. Static consistency checking of web applications with WebDSL. JSC, 46(2):150–182, 2011

Component Architecture: The specification of interfaces and components is done with one DSL in one viewpoint. A separate viewpoint is used to describe component instantiation and connection. This choice has been made because the same set of interfaces and components will be instantiated and connected differently in different usage scenarios, so separate fragments are useful.

4.4.1 Viewpoints for Concern Separation

If viewpoints are used, the concern-specific DSLs, and consequently the viewpoint models, should have well-defined dependencies; cycles should be avoided. If dependencies between viewpoint fragments are kept cycle-free, the independent fragments may be sufficient for certain transformations; this can be a driver for using viewpoints in the first place.

Margin note: The IDE should provide navigational support: if an element in viewpoint B points to an element in viewpoint A then it should be possible to follow this reference ("Ctrl-Click"). It should also be possible to query the dependencies in the opposite direction ("Find the persistence mapping for this entity" or "Find all UI forms that access this entity").

The dependent viewpoint fragment (and the language to express it) has to provide a way of pointing to the referenced element. This usually means that the referenced element has to provide a qualified name that can be used in the reference [45].

[45] In projectional editors one can technically use the UUID of the target element for the reference, but for the user, some kind of qualified name is still necessary.

Separating out a domain concern into a separate viewpoint fragment can be useful for several reasons. If different concerns of a domain are specified by different stakeholders, then separate viewpoints make sure that each stakeholder has to deal only with the information they care about. The various fragments can be modified, stored and checked in/out separately, maintaining only referential integrity with the referenced fragment [46]. The viewpoint separation has to be aligned with the development process: the order of creation of the fragments must be aligned with the dependency structure.

[46] Projectional editors can use a different approach. They can store the information of all concerns in a single model, but then use different projections to address the needs of different stakeholders. This solves the problem of referential integrity. However, this approach does not support separate store and check in/out.

Another reason for separate viewpoints is a 1:n relationship between the independent and the dependent fragments. If a single core concern may be enhanced by several different additional concerns, then it is crucial to keep the core concern independent of the information in the additional concerns. Viewpoints make this possible.

Margin note: A final (very pragmatic) reason for using viewpoints is when the tooling used does not support embedding of a reusable language because syntactic composition is not supported.
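The two requirements above, references by qualified name and cycle-free dependencies between fragments, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Fragment` class, the `fragment.element` naming scheme and the helper functions are hypothetical and are not taken from any of the tools discussed in this book.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """Hypothetical viewpoint fragment: named elements plus outgoing references."""
    name: str
    elements: set[str] = field(default_factory=set)
    # Each reference is a qualified name of the form "fragment.element".
    references: list[str] = field(default_factory=list)


def dependencies(fragment: Fragment) -> set[str]:
    """Names of the other fragments this fragment points into."""
    targets = {ref.split(".", 1)[0] for ref in fragment.references}
    return targets - {fragment.name}


def unresolved(fragment: Fragment, world: dict[str, Fragment]) -> list[str]:
    """References whose target element does not exist in the given world."""
    broken = []
    for ref in fragment.references:
        target_name, element = ref.split(".", 1)
        target = world.get(target_name)
        if target is None or element not in target.elements:
            broken.append(ref)
    return broken


def has_cycle(world: dict[str, Fragment]) -> bool:
    """Depth-first search for a dependency cycle between fragments."""
    visiting, done = set(), set()

    def visit(name: str) -> bool:
        if name in done or name not in world:
            return False
        if name in visiting:
            return True  # we came back to a fragment we are still exploring
        visiting.add(name)
        found = any(visit(dep) for dep in dependencies(world[name]))
        visiting.discard(name)
        done.add(name)
        return found

    return any(visit(name) for name in world)


# The hardware fragment is independent; the algorithm fragment depends on it.
hardware = Fragment("hardware", {"Compressor", "Fan"})
algorithm = Fragment("algorithm", {"Cooling"},
                     ["hardware.Compressor", "hardware.Valve"])
world = {f.name: f for f in (hardware, algorithm)}
```

Here `unresolved(algorithm, world)` reports the reference to the missing `hardware.Valve`, and `has_cycle(world)` only becomes true once the hardware fragment starts pointing back into the algorithm fragment.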

4.4.2 Viewpoints as Annotation Models

A special case of viewpoint separation is annotation models (already mentioned in Section 4.3.8). An annotation provides additional, often technical or transformation-controlling data for elements in a core program [47]. This is especially useful in a multi-stage transformation (Section 4.3.3), where additional data may have to be specified for the result of the first phase to control the execution of the next phase. Since that intermediate model is generated, it is not possible to add these additional specifications to the intermediate model directly. Externalizing them into an annotation model solves that problem.

[47] For those who know Eclipse EMF: genmodels are annotation models for ecore models.

Refrigerators: One concern in this DSL specifies the logical hardware structure of refrigerator installations. Another one describes the refrigerator cooling algorithm. Both are implemented as separate viewpoints, where the algorithm DSL references the hardware structure DSL. Using this dependency structure, different algorithms can be defined for the same hardware structure. Each of these algorithms resides in its own fragment. While the C code generation requires both behavior and hardware structure fragments, the hardware fragment is sufficient for a transformation that creates a visual representation of the hardware structures (see Fig. 4.19).

Figure 4.19: The hardware fragment is independent, and sufficient for generation of hardware diagrams and documentation. The algorithms fragment depends on the hardware fragment. The two of them together are sufficient for generating the controller code. Tests depend on the algorithm. There are many test fragments for a single algorithm fragment.

Example: If you create a relational data model from an object oriented data model, you might automatically derive database table names from the name of the class in the OO model. If you need to "change" some of those names, use an annotation model that specifies an alternate name. The downstream processor knows that the name in the annotation model overrides the name in the original model [48].

[48] This is a typical example of where the easiest thing in a projectional editor would be to just place a field for holding an overriding table name under the program element that represents the table. Users can edit that table name in a special projection.
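The table-name example can be sketched as follows. This is a minimal illustration, assuming that both the class list and the annotation model are plain Python collections; the naming rule in `default_table_name` and all class and table names are made up for the sketch.

```python
def default_table_name(class_name: str) -> str:
    """Derive a table name from a class name, e.g. 'OrderItem' -> 'ORDER_ITEM'."""
    chars = []
    for i, ch in enumerate(class_name):
        # Insert an underscore at every lower-to-upper case boundary.
        if ch.isupper() and i > 0 and not class_name[i - 1].isupper():
            chars.append("_")
        chars.append(ch.upper())
    return "".join(chars)


def table_names(classes: list[str], annotations: dict[str, str]) -> dict[str, str]:
    """The name in the annotation model overrides the derived name."""
    return {c: annotations.get(c, default_table_name(c)) for c in classes}


def dangling_annotations(classes: list[str], annotations: dict[str, str]) -> list[str]:
    """Annotations that point at classes which do not exist in the core model."""
    return sorted(name for name in annotations if name not in classes)


# Core model (generated or written elsewhere) and a separate annotation model.
classes = ["Customer", "OrderItem"]
annotations = {"OrderItem": "T_ORD_ITM", "Invoice": "T_INV"}

names = table_names(classes, annotations)
dangling = dangling_annotations(classes, annotations)
```

The core model is never modified: the override lives entirely in `annotations`, and the dangling entry for `Invoice` shows why the annotation model needs its own consistency check.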

4.4.3 Viewpoint Consistency

If viewpoints are used, constraints have to be defined to check consistency of the viewpoints. A dependent viewpoint fragment contains program elements that reference program elements in another fragment. It is straightforward to check that the target elements of these references actually exist, since a reference will break if its target does not; in most tools these kinds of checks are available by default.

The other direction is more interesting. Assume two viewpoints: business data structure and persistence mapping. There may be a constraint that says that every Entity in the business data viewpoint has to have exactly one EntityPersistenceMapping element that points to the respective Entity. It is an error if such an EntityPersistenceMapping does not exist. Checking this constraint has two problems:

• The first problem may be performance. The whole world has to be searched to check if a referencing program element exists somewhere. If the tool supports it, this problem can be solved by automatically maintained reverse indices [49].

• The second problem is more fundamental: it is not clear what constitutes the whole world. The fragment with the persistence mapping for a given Entity may reside on a different machine or be under the control of a different user. It may not be accessible to the constraint checker when the user edits the business data fragment. To solve this problem, it is necessary to define explicitly what the world is, using some kind of configuration. For example, a C compiler’s include path or Java’s classpath are ways of defining the scope within which the overall system description must be complete. This does not necessarily have to be done by each developer who, for example, works on the business data. But at the point when the final system is generated or built, such a "world definition" is essential.

[49] This data should also be exploited by the IDE. UI actions should be available to navigate from the referenced element (the Entity) to the referencing elements (the EntityPersistenceMapping). This is more than a generic Find Usages functionality, since it specifically searches for certain kinds of usages (the EntityPersistenceMapping in this example). Further tool support may include creation of such referencing elements based on a policy that determines into which fragment the created element should go.
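Both problems can be made concrete with a small sketch. It assumes that the "world" is passed in explicitly as a list of mapping fragments (playing the role of a classpath), and that a reverse index from entity name to mappings is built once instead of searching every fragment per entity. The classes and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str


@dataclass(frozen=True)
class EntityPersistenceMapping:
    entity: str  # qualified name of the Entity this mapping points to
    table: str


def build_reverse_index(mapping_fragments):
    """Map each entity name to the mappings that reference it."""
    index = defaultdict(list)
    for fragment in mapping_fragments:
        for mapping in fragment:
            index[mapping.entity].append(mapping)
    return index


def check_exactly_one_mapping(entities, mapping_fragments):
    """Report every Entity that does not have exactly one mapping in the world."""
    index = build_reverse_index(mapping_fragments)
    errors = []
    for entity in entities:
        count = len(index.get(entity.name, []))
        if count != 1:
            errors.append(
                f"{entity.name}: expected exactly 1 "
                f"EntityPersistenceMapping, found {count}"
            )
    return errors


entities = [Entity("Customer"), Entity("Order")]
# The explicit world definition: the fragments the checker is allowed to see.
world = [
    [EntityPersistenceMapping("Customer", "CUST")],
    [],  # a second fragment that does not map Order yet
]
errors = check_exactly_one_mapping(entities, world)
```

The result depends entirely on what is passed as `world`: the same business data fragment is consistent or inconsistent depending on which mapping fragments are declared to be in scope.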

4.4.4 Cross-Cutting Concerns

In the discussion so far we have considered concerns that can be modularized clearly. Fig. 4.17 emphasizes this: the concern boxes are neatly arranged next to each other. However, there may also be concerns that do not fit into the chosen modularization approach. These are typically called cross-cutting concerns; see Fig. 4.20.

Figure 4.20: Cross-cutting concerns cannot be modularized: they permeate other concerns.

In the context of DSLs we have to separate several classes of cross-cutting concerns:

■ Handled by Execution Engine  If we are lucky, a concern that is cross-cutting in the domain can be handled completely by the execution engine. For example, the collection of performance data, billing information or audit logs typically does not have to be described in the DSL at all. Since every program in the domain has to address this concern in the same way, the implementation can be handled by the execution engine by inserting the respective code at the relevant locations (in the case of a generator).

Component Architecture: The component architecture DSL supports the collection of performance data. Using mock objects, we started running load tests early on. For a load test, we have to collect the times it takes to execute operations on components. Based on a configuration switch, the generator adds the necessary code to collect the performance data automatically.
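A toy generator illustrates the idea of a configuration switch. This is a hedged sketch and not the generator used in the Component Architecture case study: it emits Python source text, and the function name `generate_operation`, the `timings` list and the generated naming scheme are all assumptions made for the example.

```python
def generate_operation(component: str, operation: str,
                       collect_performance: bool) -> str:
    """Emit the source of a wrapper around a manually written implementation.

    The model only names the component and the operation; whether timing
    code is inserted is decided by the generator configuration.
    """
    lines = [f"def {component}_{operation}(impl, timings):"]
    if collect_performance:
        lines += [
            "    import time",
            "    start = time.perf_counter()",
            "    try:",
            "        return impl()",
            "    finally:",
            f"        timings.append(('{component}.{operation}', "
            "time.perf_counter() - start))",
        ]
    else:
        lines.append("    return impl()")
    return "\n".join(lines) + "\n"


# Same model input, two generator configurations.
plain_source = generate_operation("Billing", "charge", collect_performance=False)
timed_source = generate_operation("Billing", "charge", collect_performance=True)
print(timed_source)
```

The input to the generator is identical in both calls; only the switch differs, which is why the concern never has to appear in the DSL program itself.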

■ Modularized in DSL  Another class of cross-cutting concerns are those that cut across the resulting executable system, but can be modularized on the DSL level. A good example is permissions. Specifying users, roles and permissions to access certain resources in the system can be modularized into a concern, and is typically described in a separate viewpoint. It is then the job of the execution engine to consider the specified permissions in all relevant places in the resulting system.

WebDSL: WebDSL has a means of specifying access control for web pages. The generator injects the necessary code to check these permissions into the client side and server side parts of the resulting web application.

■ Cross-Cutting in the DSL  The third class is when the concern cross-cuts the programs written in the DSL and, unlike the previous class, cannot be modularized. In this case we have to deal with cross-cutting concerns in the same way as we do today in programming languages: we either have to manually insert the code in all the relevant places in the DSL program, or we have to resort to aspect weaving on the DSL level [50].

[50] Building a (typically relatively limited) aspect weaver on the DSL’s level is not a big problem, since we already have access to the AST, and transforming it in a way where we inject additional code based on a DSL-specific pointcut specification is relatively straightforward.

Component Architecture: We implemented a simple weaver that is able to introduce additional ports into existing components. It was used, among other things, to modularize the monitoring concern: if monitoring was enabled, this aspect component would add the mon port to all other components, enabling the MonitoringConsole to connect to the other components and query monitoring data (see the code below [51]).

[51] The * specifies that this aspect applies to all existing components. Other selectors could be used instead of the * to select only a subset.

namespace monitoring feature monitoring {

    component MonitoringConsole ...
    instance monitor: ...
    dynamic connect monitor.devices ...

    aspect (*) component {
        provides mon: IMonitoring
    }
}
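What such a weaver does to the model can be sketched on a simplified in-memory representation. This is only an approximation of the idea, not the weaver from the case study: `Component`, `PortAspect` and `weave` are hypothetical names, the selector is interpreted as a glob pattern over component names, and the sketch does not exclude the console component itself.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Component:
    name: str
    # port name -> name of the provided interface
    provided_ports: dict[str, str] = field(default_factory=dict)


@dataclass
class PortAspect:
    selector: str   # glob pattern over component names; "*" selects all
    port: str
    interface: str


def weave(components: list[Component], aspect: PortAspect,
          enabled: bool = True) -> None:
    """Add the aspect's port to every selected component (only if enabled)."""
    if not enabled:
        return
    for component in components:
        selected = fnmatchcase(component.name, aspect.selector)
        if selected and aspect.port not in component.provided_ports:
            component.provided_ports[aspect.port] = aspect.interface


components = [Component("Sensor"), Component("Display", {"ui": "IDisplay"})]
weave(components, PortAspect("*", "mon", "IMonitoring"))
```

After weaving, both components provide `mon: IMonitoring` in addition to their own ports, while the component definitions themselves never mention monitoring.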

4.4.5 Views on Programs

In projectional editors it is also possible to store the data for all viewpoints in the same model tree, while using different projections to show different views onto the model to materialize the various viewpoints. The particular benefit of this approach is that additional concern-specific views can be defined later, after programs have been created. It also avoids the need for defining sophisticated ways of referencing program elements from other viewpoints.

Margin note: MPS also provides annotations, allowing additional model data to be "attached" to any model element, and shown optionally.

Pension Plans: Pension plans can be shown in a graphical notation highlighting the dependency structure (Fig. 4.21). The dependencies can still be edited in this view, but the actual content of the pension plans is not shown.

Figure 4.21: Graphical notation for dependencies among rules in a pension plan.

mbeddr C: Annotations are used for storing requirements traces and documentation in the models (Fig. 4.22). The program can be shown and edited with and without requirements traces and documentation text.

Figure 4.22: The shaded annotations are traces into a requirements database. The program can be edited with and without these annotations. The annotations language has no dependency on the languages it annotates.

4.4.6 Viewpoints for Progressive Refinement

There is an additional use case for viewpoint models not related to the concerns of a domain, but to progressive refinement. Consider the development of complex systems, which typically proceeds in phases: it starts with requirements, proceeds to high-level component design and specification of non-functional properties, and finishes with the implementation of the components. In each of these phases, models can be used to represent the system with abstractions that are appropriate for the phase. An appropriate DSL is needed to represent the models in each phase (Fig. 4.23). The references between model elements are called traces [52]. Since the same conceptual elements may be represented on different refinement levels (e.g., component design and component implementation), synchronization between the viewpoint models is often required (see the next subsection).

[52] W. Jirapanthong and A. Zisman. Supporting product line development through traceability. In APSEC, pages 506–514, 2005

Figure 4.23: Progressive refinement: the boxes represent models expressed with corresponding languages. The dotted arrows express dependencies, whereas the solid arrows represent references between model elements.

4.4.7 Model Synchronization

In the discussion of viewpoints so far we have assumed that there is no overlap between the viewpoints: every piece of information lives in exactly one viewpoint. Relationships between viewpoints are established by references (which means that the Referencing language composition technique can be used; this is discussed in Section 4.6.1). However, sometimes this is not the case, and the same (conceptual) information is represented in two viewpoint models. Obviously there is a constraint that enforces consistency between the viewpoints; the models have to be synchronized [53].

[53] Thanks are due to the participants of the MoDELS 2012 workshop on Multi-Paradigm Modeling who, by discussing this issue, reminded me that it is missing from the book.

In some cases the rules for establishing consistency between viewpoints can be described formally, and hence the synchronization can be automated, if the DSL tool supports such synchronization [54]. An example occurs in mbeddr C:

[54] MPS has a nice way of automatically executing a quick fix for a constraint violation. If the constraint detects an inconsistency between viewpoints, the quick fix can automatically correct the whole problem. This also solves the world problem neatly, since every dependent fragment is "corrected" as soon as it is opened in the editor.

mbeddr C: Components implement interfaces. Each component provides an implementation for each method defined in each of the interfaces it implements. If a new method is added to an interface, all components that implement that particular interface must get a new, empty method implementation. This is an example of model synchronization.

In this example the synchronization is trivial, for two reasons: first, there is a clear (unidirectional) dependency between the method implementation and the operation specification in the interface, so the synchronization is also unidirectional. Second, the information represented in both models/places is identical, so it is easy to detect an inconsistency and fix it. However, there are several more complicated cases:

• The dependency might be bidirectional, and changes may be allowed in either model. This means that two transformations have to be written, one for each direction, or a formalism for expressing the transformation has to be used that can be executed in both directions [55]. In multi-user scenarios it is also possible that the two models are changed at the same time, in an inconsistent way. In this case the changes have to be merged, or a clear priority (who will win) has to be established.

• The languages expressing the viewpoints may have been defined independently of each other, with no dependency. This probably means that it was discovered only after the fact that some parts of the model have to be synchronized. In this case the synchronization must be put into some kind of adapter language. It also means that the synchronization is not as clean as if it had been "designed into" the languages (see the next item).

• In the mbeddr example, the information (the signature of the operation) was simply replicated, so the transformation was trivial. However, there may not be a 1:1 correspondence between the information in the two viewpoints. This makes the transformation more complex to write. In the worst case it may mean that the synchronization cannot be formally described and automated.

[55] The QVT-R transformation language has this capability, for example.
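The simple, unidirectional case from the mbeddr example can be sketched as follows. The sketch assumes a deliberately reduced model in which an operation is identified by its name only; `Interface`, `Component` and `synchronize` are hypothetical names and do not reflect how mbeddr or MPS implement the quick fix.

```python
from dataclasses import dataclass, field


@dataclass
class Interface:
    name: str
    operations: list[str]


@dataclass
class Component:
    name: str
    implements: list[str]
    # "Interface.operation" -> body; "" stands for an empty implementation
    implementations: dict[str, str] = field(default_factory=dict)


def synchronize(component: Component,
                interfaces: dict[str, Interface]) -> list[str]:
    """Add an empty implementation for every operation that is missing.

    The direction is fixed: interfaces are the source, components are
    updated. Existing implementations are never touched.
    """
    added = []
    for interface_name in component.implements:
        for operation in interfaces[interface_name].operations:
            key = f"{interface_name}.{operation}"
            if key not in component.implementations:
                component.implementations[key] = ""
                added.append(key)
    return added


interfaces = {"IDriver": Interface("IDriver", ["start", "stop"])}
motor = Component("Motor", ["IDriver"], {"IDriver.start": "power_on()"})

# A new operation is added to the interface; the component is then synchronized.
interfaces["IDriver"].operations.append("reset")
added = synchronize(motor, interfaces)
```

Running `synchronize` a second time adds nothing, which is what makes it safe to execute automatically whenever a fragment is opened.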

Sometimes the correspondence between models can only be expressed on an instance level (as in "This functional block corresponds to this software component") [56]. Consequently, developers have to express the correspondence (the trace links mentioned earlier) manually. However, consistency checks (and possibly automatic synchronization) may still be possible, based on the manually expressed trace links.

[56] This often happens in the context of progressive refinement, as discussed in the previous subsection.

In my work with DSLs I have only encountered the simplest cases of synchronization, which is why we don’t put much emphasis on this topic in the rest of the book. For more details, see the papers by Diskin [57] and Stevens [58].

[57] Z. Diskin, Y. Xiong, and K. Czarnecki. From state- to delta-based bidirectional model transformations. In ICMT, pages 61–76, 2010

[58] P. Stevens. Bidirectional model transformations in QVT: semantic issues and open questions. SoSyM, 9(1):7–20, 2010

4.5 Completeness

Completeness [59] refers to the degree to which a language L can express programs that contain all necessary information to execute them. A program expressed in an incomplete DSL requires additional specifications (such as configuration files or code written in a lower-level language) to make it executable.

[59] This has nothing to do with Turing completeness.

Let us introduce a function G ("code generator") that transforms a program p in LD to a program q in LD−1. For a complete language, p and q have the same semantics, i.e. OB(p) == OB(G(p)) == OB(q) (see Section 4.3). For incomplete languages where OB(G(p)) ⊂ OB(p) we have to write additional code in LD−1 to obtain a program in D−1 that has the same semantics as intended by the original program in LD. In cases in which we use several viewpoints to represent various concerns of D, the set of fragments written for these concerns must be enough for complete D−1 generation.

Margin note: Another way of stating this is that G produces a program in LD−1 that is not sufficient for a subsequent transformation (e.g., a compiler); only the manually written LD−1 code leads to sufficiency.

mbeddr C: The Embedded C language is complete regarding D−1, or even D−m for higher levels of D, since higher levels are always built as extensions of its D−1. Developers can always fall back to D−1 to express what is not expressible directly with LD. Since the users of this system are developers, falling back to D−1 or even D0 is not a problem.

4.5.1 Compensating for Incompleteness

Integrating LD−1 code in the case of an incomplete language LD can be done in several ways:

• By calling "black box" code written in LD−1. This requires concepts in LD for calling D−1 foreign functions. No syntactic embedding of D−1 code is required, beyond the ability to call functions [60].

• By directly embedding LD−1 code in the LD program. This is useful if LD is an extension of LD−1, or if the tool provides adequate support for embedding the D−1 language into LD programs. Note that LD−1 may not be analyzable, so mixing LD−1 into LD code may compromise analyzability of the LD code.

• By using composition mechanisms of LD−1 to "plug in" the manually written code into the generated code without actually modifying the generated files (also known as the Generation Gap [61] pattern). Example techniques for realizing this approach include generating a base class with abstract methods (requiring the user to implement them in a manually written subclass) or with empty callback methods which the user can override in a subclass to customize behavior [62]. You can delegate, implement interfaces, use #include, use reflection tricks, AOP or take a look at the well-known design patterns for inspiration. Some languages provide partial classes, where a class definition can be split over a generated file and a manually written file.

• By inserting manually written LD−1 code into the LD−1 code generated from the LD program using protected regions. Protected regions are areas of the code, usually delimited by special comments, whose (manually written) contents are not overwritten during regeneration of the file.

[60] In the simplest case, these functions don’t even have arguments, so the syntax to call such a function is essentially just the function name.

Margin note: Just "pasting text into a text field", an approach used by several graphical modeling tools, is not productive, since no syntactic and semantic integration between the languages is provided. In most cases there is no tool support (syntax highlighting, code completion, error checking).

[61] J. Vlissides. Generation gap. C++ Report, 1996

[62] For example, in user interfaces, such a method could return a position object for a widget. The default implementation returns null, indicating to the framework to use the generic layout algorithm. If a position is returned, it is used instead of the one computed by the layout algorithm.

Margin note: We discourage the use of protected regions. You’ll run into all kinds of problems: generated code is not a throw-away product any more, and you have to check it in, leading to funny situations with your version control system. Also, often you will accumulate a "sediment" of code that has been generated from elements that are no longer in the model, leading to compilation errors in the worst case, even though the code is in fact no longer required. If you don’t use protected regions, you can delete the whole generated source directory from time to time, cleaning up the sediment.

For DSLs used by developers, incompleteness is usually not a problem because they are comfortable with writing the D−1 code in a programming language. Specifically, the DSL users are the same people as those who provide the remaining D−1 code, so coordination between the two roles is not a problem.
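The Generation Gap technique from the list above can be sketched like this. The example is hypothetical: in a real setup the two classes would live in separate files, one of them regenerated on every build, and the class and method names are invented for the illustration.

```python
from abc import ABC, abstractmethod


# --- generated file: overwritten on every generator run, never edited ---
class TemperatureControllerBase(ABC):
    def run_cycle(self, measured: float) -> str:
        """Generated infrastructure code that calls the manual parts."""
        action = self.decide(measured)
        self.on_cycle_finished(action)
        return action

    @abstractmethod
    def decide(self, measured: float) -> str:
        """Must be implemented in the manually written subclass."""

    def on_cycle_finished(self, action: str) -> None:
        """Empty callback: overriding it is optional."""


# --- manually written file: survives regeneration untouched ---
class TemperatureController(TemperatureControllerBase):
    def __init__(self) -> None:
        self.log: list[str] = []

    def decide(self, measured: float) -> str:
        return "cool" if measured > 8.0 else "idle"

    def on_cycle_finished(self, action: str) -> None:
        self.log.append(action)


controller = TemperatureController()
first = controller.run_cycle(10.0)
second = controller.run_cycle(4.0)
```

Because the manual code only extends the generated class, the generated file can be deleted and recreated at any time, which is exactly what protected regions do not allow.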

Component Architecture: This DSL is not complete. Only class skeletons and infrastructure integration code are generated from the models. The component implementation has to be implemented manually in Java using the Generation Gap pattern. The DSL is used by developers, so writing code in a subclass of a generated class is not a problem.

For DSLs used by domain experts, the situation is different. Usually, they are not able to write D−1 code, so other people (developers) have to fill in the remaining concerns. Alternatively, developers can develop a predefined set of foreign functions that can be called from within the DSL. In effect, developers provide a standard library (cf. Section 4.1.2) which can be invoked as black boxes from DSL programs.

Margin note: This requires elaborate collaboration schemes, because the domain experts have to communicate the remaining concerns via text or verbal communication.

WebDSL: The core of a web application is concerned with persistent data and its presentation. However, web applications need to perform additional duties outside that core, for which useful libraries often exist. WebDSL provides a native interface that allows a developer to call into a Java library by declaring types and functions from the library in a WebDSL program.

Note that a DSL that does not cover all of D can still be complete: not all of the programs imaginable in a domain may be expressed with a DSL, but those programs that can be expressed can be expressed completely, without any manually written code. Also, the code generated from a DSL program may require a framework written in LD−1 to run in. That framework represents aspects of D outside the scope of LD.

Refrigerators: The cooling DSL only supports reactive, state-based systems that make up the core of the cooling algorithm. The drivers used in the lower layers of the system, or the control algorithms controlling the actual compressors in the fridge, cannot be expressed with the DSL. However, these aspects are developed once and can be reused without adaptations, so using DSLs is not sensible. These parts are implemented manually in C.

■ Controlling D−1 Code  Allowing users to manually write D−1 code, especially if it is written in a GPL in D0, comes with two additional challenges. Consider the following example: the generator generates an abstract class from some model element. The developer is expected to subclass the generated class and implement a couple of abstract methods. The manually written subclass needs to conform to a specific naming convention so that some other generated code can instantiate the manually written subclass. The generator, however, just generates the base class and stops: how can you make sure developers actually do write that subclass, using the correct name [63]?

[63] Of course, if the constructor of the concrete subclass is called from another location of the generated code, and/or if the abstract methods are invoked, you’ll get compiler errors. By their nature, they are on the abstraction level of the implementation code, however. It is not always obvious what the developer has to do in terms of the model or domain to get rid of these errors.

To address this issue, make sure there is a way to make those conventions and idioms interactive. One way to do this is to generate checks/constraints against the code base and have them evaluated by the IDE, for example using Findbugs [64] or similar code checking tools. If one fails, an error message is reported to the developer. That error message can be worded by the developer of the DSL, helping the developer understand what exactly has to be done to solve the problem with the code.

[64] findbugs.sourceforge.net/
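The kind of generated check described above can be sketched without any particular checking tool. The following is a plain Python stand-in, not Findbugs and not its rule format: it assumes a hypothetical naming convention (`<Name>Base` is generated, `<Name>Impl` is written by hand) and produces messages worded in terms of the model.

```python
def check_manual_subclasses(component_names, namespace):
    """Check that each component has a manually written <Name>Impl class
    that extends the generated <Name>Base class."""
    errors = []
    for name in component_names:
        base = namespace[f"{name}Base"]        # generated, so always present
        impl = namespace.get(f"{name}Impl")    # manual, so possibly missing
        if impl is None:
            errors.append(
                f"Component '{name}' has no implementation: "
                f"write a class {name}Impl that extends {name}Base."
            )
        elif not (isinstance(impl, type) and issubclass(impl, base)):
            errors.append(
                f"Class {name}Impl must extend the generated class {name}Base."
            )
    return errors


# Generated base classes for two components in the model.
class PumpBase: ...
class ValveBase: ...


# Manually written code: only the pump has been implemented so far.
class PumpImpl(PumpBase): ...


code_base = {"PumpBase": PumpBase, "ValveBase": ValveBase, "PumpImpl": PumpImpl}
errors = check_manual_subclasses(["Pump", "Valve"], code_base)
```

The list of component names comes from the model, so the check is regenerated together with the base classes and reports the missing `ValveImpl` in terms a modeler can act on.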

■ Semantic Consistency  As part of the definition of a DSL you will implement constraints that validate the DSL program in order to ensure some property of the resulting system (see Section 20.5). For example, you might check dependencies between components in an architecture model to ensure components can be exchanged in the actual system. Of course such a validation is only useful if the manually written code does not introduce dependencies that are not present in the model. If it does, the "green light" from the constraint check does not help much.

To ensure that promises made by the models are kept by the (manually written) code, use one of the following two approaches. First, generate code that does not allow violation of model promises. For example, don’t expose a factory that allows components to look up and use any other component (creating dependencies), but rather use dependency injection to supply objects for the valid dependencies expressed in the model [65].

[65] A better approach is to build a complete DSL. The language used to express the behavior (which might otherwise be plugged in manually in the generated code) is suitably limited and/or checked to enforce that it does not lead to inconsistencies. This is a nice use case for language extension and embedding.

Component Architecture: The Java code generator generates component implementation classes that use dependency injection to supply the targets for required ports. This ensures that the implementation class will have access to exactly those interfaces specified in the model. An alternative approach would be to simply hand to the implementation class some kind of factory or registry where a component implementation can look up instances of components that provide the interfaces specified by the required ports of the current component. However, this way it would be much harder to make sure that only those dependencies are accessed that are expressed in the model. Using dependency injection enforces this constraint in the implementation code.

A second approach uses code checkers (like the Findbugs mentioned above) or architecture analysis tools to validate manually written code. You can easily generate the relevant checking rules for those tools from the models.
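The first approach, injecting only the dependencies declared in the model, can be sketched as follows. This is an illustration in Python rather than the generated Java code from the case study; the model format, the `instantiate` function and all class names are assumptions. Python also enforces less than a statically typed language would, so the sketch only shows the shape of the wiring.

```python
class Logger:
    def __init__(self) -> None:
        self.lines: list[str] = []

    def log(self, message: str) -> None:
        self.lines.append(message)


class Database:
    pass


def instantiate(impl_class, required_ports: dict[str, str], providers: dict):
    """Generated wiring: pass exactly the dependencies declared in the model.

    required_ports maps a port name to the interface it requires;
    providers maps an interface name to the object that provides it.
    """
    dependencies = {port: providers[interface]
                    for port, interface in required_ports.items()}
    return impl_class(**dependencies)


# Manually written implementation: it receives its ports, nothing else.
class BillingImpl:
    def __init__(self, log: Logger) -> None:
        self.log = log

    def charge(self, amount: int) -> None:
        self.log.log(f"charged {amount}")


# The model declares one required port for Billing.
model = {"Billing": {"log": "ILogger"}}
providers = {"ILogger": Logger(), "IDatabase": Database()}

billing = instantiate(BillingImpl, model["Billing"], providers)
billing.charge(5)
```

An implementation whose constructor asks for a port that the model does not declare fails at instantiation time, because the wiring code never hands it the `providers` registry as a whole.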

4.5.2 Roundtrip Transformation

Roundtrip transformation means that an LD program can be recovered from a program in LD−1 (written from scratch, or changed manually after generation from a previous iteration of the LD program). This is challenging, because it requires reconstituting the semantics of the LD program from idioms or patterns used in the LD−1 code. This is the general reverse engineering problem, which cannot be solved in general, although progress has been made over recent years (see for example [66]).

[66] D. Beyer, T. A. Henzinger, and G. Theoduloz. Program analysis with dynamic precision adjustment. In ASE, pages 29–38, 2008; M. Pistoia, S. Chandra, S. J. Fink, and E. Yahav. A survey of static analysis methods for identifying security vulnerabilities in software systems. IBMSJ, 46(2):265–288, 2007; and M. Antkiewicz, T. T. Bartolomei, and K. Czarnecki. Fast extraction of high-quality framework-specific models from application code. ASE, 16(1):101–144, 2009

Margin note: Notice that the problem of "understanding" the semantics of a program written at a too-low abstraction level is the reason for DSLs in the first place: by providing linguistic abstractions for the relevant semantics, no "recovery" is necessary for meaningful analysis and transformation.

Note that for complete languages roundtripping is generally not useful, because the complete program can be written in LD in the first place. Even if recovery of the semantics is possible it may not be practical: if the DSL provides significant abstraction over the LD−1 program, then the generated LD−1 program is so complicated that manually changing the D−1 code in a consistent and correct way is tedious and error-prone.

Roundtripping has traditionally been used with respect to UML models and generated class skeletons. In that case, the abstractions between the model and the code are similar (both are classes); the tool basically just provides a different concrete syntax (diagrams). This similarity of abstractions in the code and the model made roundtripping possible to some extent. However, it also made the models relatively useless, because they did not provide a significant benefit in terms of abstraction over code details. We generally recommend avoiding any attempt to build support for roundtripping.

mbeddr C: This language does not support roundtripping, but since all DSLs are extensions of C, one can always add C code to the programs, alleviating the need for roundtripping in the first place.

Refrigerators: Roundtripping is not required here, since the DSL is complete. The code generators are quite sophisticated, and nobody would want to manually change the generated C code. Since the DSL has proved to provide good coverage, the need to "tweak" the generated code has not come up.

Component Architecture: Roundtripping is not supported. Changes to the interfaces, operation signatures or components have to be performed in the models. This has not been reported as a problem by the users, since both the implementation code and the DSL "look and feel" the same way (they are both Eclipse-based textual editors) and generation of the derived low-level code happens automatically on saving a changed model. The workflow is seamless.

Pension Plans: This is a typical application domain DSL where the users never see the generated Java code. Consequently, the language has to be complete and roundtripping is not useful and would not fit with the development process.

4.6 Language Modularity

Reuse of modularized parts makes software development more efficient, since similar functionality does not have to be developed over and over again. A similar argument can be made for languages. Being able to reuse languages, or parts of languages, in new contexts makes designing DSLs more efficient.

Language modularization and reuse is often not driven by end user or domain requirements, but rather, by the experience of the language designers and implementers striving for consistency and avoidance of duplicate implementation work.

Language composition requires the composition of abstract syntax, concrete syntax, constraints/type systems and the execution semantics67. We discuss all of these aspects in this section. However, in the discussion of semantic integration, we consider only the case in which the composed languages use the same (or closely related) behavioral paradigms68, since otherwise the composition can become very challenging. We mostly focus on imperative programs.

67 It requires the composition of the IDE as well. However, with the language workbenches used in this book, this is mostly automatic.

68 The behavioral paradigm is also known as the Model of Computation. We discuss behavioral paradigms in more detail in Section 5.

■ Composition Techniques We have identified the following four composition strategies: referencing, extension, reuse and embedding. We distinguish them regarding fragment structure and language dependencies, as illustrated in Fig. 4.24. Fig. 4.25 shows the relationships between fragments and languages in these cases69.

69 Note how in both cases the language definitions are modular: invasive modification of a language definition is not something we consider language modularity!

Figure 4.24: We distinguish the four modularization and composition approaches regarding their consequences for fragment structure and language dependencies.

We consider these two criteria to be relevant for the following reasons. Language dependencies capture whether a language has to be designed with knowledge about a particular composition partner in mind in order to be composable with that partner. It is desirable in many scenarios that languages be composable without previous knowledge about all possible composition partners. Fragment structure captures whether the two composed languages can be syntactically mixed, or whether separate viewpoints are used. Since modular concrete syntax can be a challenge, this is not always possible, though often desirable.

Figure 4.25: The relationships between fragments and languages in the four composition approaches. Boxes represent fragments, rounded boxes are languages. Dotted lines are dependencies, solid lines references/associations. The shading of the boxes represents the two different languages.

■ DSL Hell? Reusing DSLs also helps avoid the "DSL Hell" problem we discussed in the introduction. DSL hell refers to the danger that developers create new DSLs all the time, resulting in a large set of half-baked DSLs, each covering related domains, possibly with overlap, but still incompatible. Language modularization and reuse can help to avoid this problem. Language extension allows users to add new language constructs to existing languages. They can reuse all the features of the existing language while still adding their own higher-level abstractions. Language embedding lets language designers embed existing languages into new ones. This is particularly interesting in the case of expression or query languages, which are relevant in many different contexts.

■ More Detailed Examples Part III of the book discusses the implementation of these modularization techniques with various tools (Section 16). As part of this discussion we present much more concrete and detailed examples of the various composition techniques. You may want to take a look at those examples while you read this section.

4.6.1 Language Referencing

Language referencing enables homogeneous fragments with cross-references among them, using dependent languages (Fig. 4.26). A fragment f2 depends on f1. f2 and f1 are expressed with different languages l2 and l1. The referencing language l2 depends on the referenced language l1 because at least one concept in l2 references a concept from l1. We call l2 the referencing language, and l1 the referenced language. While equations (1.2) and (1.3) (see Section 3.3) continue to hold, (1.1) does not. Instead:

$$\forall r \in \mathit{Refs}_{l_2} \mid \mathit{lo}(r.\mathit{from}) = l_2 \wedge (\mathit{lo}(r.\mathit{to}) = l_1 \vee \mathit{lo}(r.\mathit{to}) = l_2) \tag{4.1}$$

Figure 4.26: Referencing. Language l2 depends on l1, because concepts in l2 reference concepts in l1. (We use rectangles for languages, circles for language concepts, and UML syntax for the lines: dotted arrows = dependency, normal arrows = associations, hollow-triangle-arrow for inheritance.)

■ Viewpoints As we have discussed before in Section 4.4, a domain D can be composed from different concerns. One way of dealing with this is to define separate concern-specific DSLs, each addressing one or more of the domain’s concerns. A program then consists of a set of concern-specific fragments, which relate to each other in a well-defined way using language referencing. This approach has the advantage that different stakeholders can modify "their" concern independent of others. It also allows reuse of the independent fragments and languages with different referencing languages. The obvious drawback is that for tightly integrated concerns the separation into separate fragments can be a usability problem.

Referencing implies knowledge about the relationships of the languages as they are designed. Viewpoints are the classical case for this. The dependent languages cannot be reused, because of the dependency on the other language.

Refrigerators: As an example, consider the domain of refrigerator configuration. The domain consists of three concerns. The first concern H describes the hardware structure of refrigerator appliances including compartments, compressors, fans, valves and thermometers. The second concern A describes the cooling algorithm using a state-based, asynchronous language. Cooling programs refer to hardware building blocks and access their properties in expressions and commands. The third concern is testing, T. A cooling test can test and simulate cooling programs. The dependencies are as follows: A → H and T → A. Each of these concerns is implemented as a separate language, with references between them. H and A are separated because H is defined by product management, whereas A is defined by thermodynamicists. Also, several algorithms for the same hardware must be supported, which makes separate fragments for H and A useful. T is separate from A because tests are not strictly part of the product definition and may be enhanced after a product has been released. These languages have been built as part of a single project, so the dependencies between them are not a problem.
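The A → H dependency can be illustrated with a small Python sketch. It is not how the refrigerator DSL is actually implemented; all class names and the resolve function are made up for illustration. The hardware concepts know nothing about the algorithm language, while the algorithm-side reference concept points into the hardware fragment and is bound by a name resolution step.

```python
from dataclasses import dataclass, field


# --- Referenced language H (hardware); it knows nothing about A ---
@dataclass
class Component:
    name: str
    kind: str  # e.g. "valve", "compressor"


@dataclass
class Compartment:
    name: str
    components: dict = field(default_factory=dict)


@dataclass
class HardwareFragment:
    compartments: dict = field(default_factory=dict)


# --- Referencing language A (algorithm); its concepts point into H ---
@dataclass
class HardwareRef:
    dotted_name: str          # what the user typed, e.g. "compartment1.valve"
    target: Component = None  # filled in by name resolution


def resolve(ref: HardwareRef, hw: HardwareFragment) -> Component:
    """Turn a dotted name into a cross-reference to an element of H."""
    compartment_name, component_name = ref.dotted_name.split(".")
    try:
        ref.target = hw.compartments[compartment_name].components[component_name]
    except KeyError:
        raise NameError(f"unresolved reference: {ref.dotted_name}")
    return ref.target


# Two homogeneous fragments: one in H, one (here just a single reference) in A.
hw = HardwareFragment()
hw.compartments["compartment1"] = Compartment(
    "compartment1", {"valve": Component("valve", "valve")}
)
ref = HardwareRef("compartment1.valve")
resolved = resolve(ref, hw)
```

Note that the dependency only runs one way: the hardware classes could be reused with a different algorithm language, whereas HardwareRef cannot be reused without the hardware language.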

■ Progressive Refinement Progressive refinement, introduced earlier (Section 4.4.6), also makes use of language referencing.

4.6.2 Language Extension

Language extension enables heterogeneous fragments with dependent languages (Fig. 4.27). A language l2 extending l1 adds additional language concepts to those of l1. We call l2 the extending language (or language extension), and l1 the base language. To allow the new concepts to be used in the context provided by l1, some of them extend concepts in l1. So, while l1 remains independent, l2 becomes dependent on l1, since:

$$\exists i \in \mathit{Inh}(l_2) \mid i.\mathit{sub} = l_2 \wedge i.\mathit{super} = l_1 \tag{4.2}$$

Consequently, a fragment f contains language concepts from both l1 and l2:

$$\forall e \in E_f \mid \mathit{lo}(e) = l_1 \vee \mathit{lo}(e) = l_2 \tag{4.3}$$

In other words, $C_f \subset (C_{l_1} \cup C_{l_2})$, so f is heterogeneous. For heterogeneous fragments (1.3) does not hold anymore, since:

$$\forall c \in \mathit{Cdn}_f \mid (\mathit{lo}(\mathit{co}(c.\mathit{parent})) = l_1 \vee \mathit{lo}(\mathit{co}(c.\mathit{parent})) = l_2) \wedge (\mathit{lo}(\mathit{co}(c.\mathit{child})) = l_1 \vee \mathit{lo}(\mathit{co}(c.\mathit{child})) = l_2) \tag{4.4}$$

Note that copying a language definition and changing it does not constitute a case of language extension, because the extension is not modular, it is invasive. Also, native interfaces that support calling one language from another (like calling C from Perl or Java) are not language extension; rather they are a form of language referencing. The fragments remain homogeneous.

Figure 4.27: Extension: l2 extends l1. It provides additional concepts B3 and B4. B3 extends A3, so it can be used as a child of A2, just like A3. This plugs l2 into the context provided by l1. Consequently, l2 depends on l1.

Language extension fits well with the hierarchical domains introduced in Section 3.1: a language LB for a domain D may extend a language LA for D−1. LB contains concepts specific to D, making analysis and transformation of those concepts possible without pattern matching and semantics recovery. As explained in the introduction, the new concepts are often reified from the idioms and patterns used when using an LA for D. Language semantics are typically defined by mapping the new abstractions to just these idioms (see Section 4.3) inline. This process, also known as assimilation, transforms a heterogeneous fragment (expressed in LD and LD+1) into a homogeneous fragment expressed only with LD.

Language extension is especially interesting if D0 languages are extended, making a DSL an extension of a general purpose language.

Extension is especially useful for bottom-up domains. The common patterns and idioms identified for a domain can be reified directly into linguistic abstractions, and used directly in the language from which they have been embedded. Incomplete languages are not a problem, since users can easily fall back to D−1 to implement the rest. Since DSL users see the D−1 code all the time anyway, they will be comfortable falling back to D−1 in exceptional cases. This makes extensions suitable only for DSLs used by developers. Domain expert DSLs are typically not implemented as extensions.

mbeddr C: As an example, consider embedded programming. The C programming language is typically used as the GPL for D0 in this case. Extensions for embedded programming include state machines, tasks or data types with physical units. Language extensions for the subdomain of real-time systems may include ways of specifying deterministic scheduling and worst-case execution time. For the avionics subdomain, support for remote communication using some of the bus systems used in avionics could be added.

Extension comes in two flavors. One really feels like extension, the other feels more like embedding.

• Extension Flavor In the first case we provide (a little, local) additional syntax to an otherwise unchanged language. For example, C may be extended with new data types and literals for complex numbers, as in complex c = (3+2i);. The programs still essentially look like C programs, with specific extensions in a few places.

• Embedding Flavor The other case is where we create a completely new language, but reuse some of the syntax provided by the base language. For example, we could create a state machine language that reuses C’s expressions and types in guard conditions. This use case feels like embedding (we embed syntax from the base language in our new language), but in the classification according to syntactic integration and dependencies, it is still extension. Embedding would prevent dependencies between the state machine language and C.

The embedding flavor is often suitable for use with DSLs that are used by non-programmers, since the "embedded" subset of the language is often small and simple to understand. Once again, expression languages are the prime example for this.
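To make the extension flavor more concrete, the following Python sketch models the complex-number example as a tiny abstract syntax tree. It is only an illustration under simplifying assumptions: the class names, the assimilate function and the make_complex and print_complex helper names are invented here and do not come from any of the tools discussed in this book. The extension concept inherits from a base-language concept (the dependency from l2 to l1), and assimilation rewrites the heterogeneous tree into one that uses base-language concepts only.

```python
from dataclasses import dataclass


# --- Base language l1: it does not know about any extension ---
class Expr:
    """Base concept: anything that may appear where l1 expects an expression."""


@dataclass
class Num(Expr):
    value: float


@dataclass
class Call(Expr):
    function: str
    args: list


# --- Extending language l2: depends on l1 through concept inheritance ---
@dataclass
class ComplexLit(Expr):
    re: float
    im: float


def assimilate(e: Expr) -> Expr:
    """Reduce a heterogeneous tree (l1 + l2) to a homogeneous l1 tree."""
    if isinstance(e, ComplexLit):
        # Hypothetical idiom: the call a C programmer would write by hand.
        return Call("make_complex", [Num(e.re), Num(e.im)])
    if isinstance(e, Call):
        return Call(e.function, [assimilate(a) for a in e.args])
    return e


mixed = Call("print_complex", [ComplexLit(3, 2)])  # uses concepts of l1 and l2
pure = assimilate(mixed)                           # uses concepts of l1 only
```

Because ComplexLit is a subconcept of Expr, it can be used wherever the base language expects an expression, without any change to the base language definition.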

Language extension is also a very useful way to address the problem of DSLs often starting as simple, but then becoming more complicated over time, because new corners or intricacies in the domain are discovered as users gain more experience in the domain. These corner cases and intricacies can be factored into a separate language module that extends the core DSL. The use of these extensions can then initially be restricted to a few users in order to find out if they are really needed. Different experiments can even be performed at the same time, with different groups of users using different extensions. Even once these extensions have proved useful, "advanced" language features can be restricted in this way to a small group of "advanced" users who handle the hard cases by using the extension.

Incremental extension can help to avoid the feared "customization cliff". The customization cliff is a term introduced by Steve Cook70: once you step outside of what is covered by your DSL, you plunge down a cliff onto the rocks of the low-level platform. If DSLs are built as incremental extensions of the next lower language, then stepping outside any DSL on level D will only plunge you down to the language for D−1. And presumably you can always create an additional extension that extends your DSL to cover an additional, initially unexpected aspect.

70 blogs.msdn.com/b/stevecook/archive/2005/12/16/504609.aspx

Defining a D language as an extension of a D−1 language can also have drawbacks. The language is tightly bound to the D−1 language it is extended from. While it is possible for a stand-alone DSL in D to generate implementations for different D−1 languages, this is not easily possible for DSLs that are extensions of a D−1 language. Also, interaction with the D−1 language may make meaningful semantic analysis of complete programs (using LD and LD−1 concepts) hard. This problem can be limited if isolated LD sections are used in which interaction with LD−1 concepts is limited and well-defined. These isolated sections remain analyzable.

■ Restriction Sometimes language extension is also used to restrict the set of language constructs available in the subdomain. For example, the real-time extensions for C may restrict the use of dynamic memory allocation, or the extension for safety-critical systems may prevent the use of void pointers and certain casts. Although the extending language is in some sense smaller than the extended one, we still consider this a case of language extension, for two reasons. First, the restrictions are often implemented by adding additional constraints that report errors if the restricted language constructs are used. Second, a marker concept may be added to the base language. The restriction rules are then enforced for children of these marker concepts (e.g., in a module marked as "safe", one cannot use void pointers and the prohibited casts).

Restriction is often useful for the embedding flavor of extension. For example, when embedding C expressions into the state machine language, we may want to restrict users from using the pointer-related expressions.

mbeddr C: Modules can be marked as MISRA-compliant, which prevents the use of those C constructs that are not allowed in MISRA-C71. Prohibited concepts are reported as errors directly in the program.

71 www.misra.org.uk/Publications/tabid/57/Default.aspx#label-c2

4.6.3 Language Reuse

Language reuse enables homogeneous fragments with independent languages (Fig. 4.28). Given are two independent languages l2 and l1 and two fragments f2 and f1. f2 depends on f1, so that:

$$\exists r \in \mathit{Refs}_{f_2} \mid \mathit{fo}(r.\mathit{from}) = f_2 \wedge (\mathit{fo}(r.\mathit{to}) = f_1 \vee \mathit{fo}(r.\mathit{to}) = f_2) \tag{4.5}$$

Since l2 is independent, it cannot directly reference concepts in l1. This makes l2 reusable with different languages (in contrast to language referencing, where concepts in l2 reference concepts in l1). We call l2 the context language and l1 the reused language.

One could argue that in this case reuse is just a clever combination of referencing and extension. While this is true from an implementation perspective, it is worth describing as a separate approach, because it enables the combination of two independent languages by adding an adapter after the fact, so no pre-planning during the design of l1 and l2 is necessary.

One way of realizing dependent fragments while retaining independent languages is using an adapter language lA where lA extends l2, and:

$$\exists r \in \mathit{Refs}_{l_A} \mid \mathit{lo}(r.\mathit{from}) = l_A \wedge \mathit{lo}(r.\mathit{to}) = l_1 \tag{4.6}$$

Figure 4.28: Reuse: l1 and l2 are independent languages. Within an l2 fragment, we still want to be able to reference concepts in another fragment expressed with l1. To do this, an adapter language lA is added that depends on both l1 and l2, using inheritance and referencing to adapt l1 to l2.

While language referencing supports reuse of the referenced language, language reuse supports the reuse of the referencing language as well. This makes sense for DSLs that have the potential to be reused in many domains, with minor adjustments. Examples include role-based access control, relational database mappings and UI specification.

Example: Consider a language for describing user interfaces. It provides language concepts for various widgets, layout definition and disable/enable strategies. It also supports data binding, where data structures are associated with widgets, to enable two-way synchronization between the UI and the data. Using language reuse, the same UI language can be used with different data description languages. Referencing would not achieve this goal, because the UI language would have a direct dependency on a particular data description language. Changing the dependency direction to data → ui doesn’t solve the problem either, because this would go against the generally accepted idiom that UI has dependencies to the data, but not vice versa (cf. the MVC pattern).

Generally, the referencing language is built with the knowledge that it will be reused with other languages, so hooks may be provided for adapter languages to plug in.

Example: The UI language thus may define an abstract concept DataMapping, which is then extended by various adapter languages.
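The adapter structure of the UI example can be sketched in Python as follows. This is a simplified illustration, not an excerpt from an actual tool: apart from DataMapping, which the text names, all classes (TextField, Field, Entity, FieldMapping) are assumptions made for this sketch. The UI language only knows the abstract hook, the data language knows nothing about the UI, and the adapter extends the hook and holds a reference into the data language.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


# --- Context language l2 (UI); no dependency on any data language ---
class DataMapping(ABC):
    """Hook concept: adapter languages extend this."""

    @abstractmethod
    def read(self, record: dict):
        ...


@dataclass
class TextField:
    label: str
    mapping: DataMapping

    def render(self, record: dict) -> str:
        return f"{self.label}: {self.mapping.read(record)}"


# --- Reused language l1 (data description); no knowledge of the UI ---
@dataclass
class Field:
    name: str
    type: str


@dataclass
class Entity:
    name: str
    fields: list


# --- Adapter language lA: extends an l2 concept, references an l1 concept ---
@dataclass
class FieldMapping(DataMapping):
    field: Field  # reference into l1

    def read(self, record: dict):
        return record[self.field.name]


person = Entity("Person", [Field("name", "string")])
widget = TextField("Name", FieldMapping(person.fields[0]))
rendered = widget.render({"name": "Ada"})
```

A second data description language would need only a second adapter with its own DataMapping subclass; neither the UI classes nor the existing data classes would have to change.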

4.6.4 Language Embedding

Language embedding (Fig. 4.29) enables heterogeneous fragments with independent languages. It is similar to reuse, in that there are two independent languages l1 and l2, but instead of establishing references between two homogeneous fragments, we now embed instances of concepts from l2 in a fragment f expressed with l1, so:

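$$\forall c \in \mathit{Cdn}_f \mid \mathit{lo}(\mathit{co}(c.\mathit{parent})) = l_1 \wedge (\mathit{lo}(\mathit{co}(c.\mathit{child})) = l_1 \vee \mathit{lo}(\mathit{co}(c.\mathit{child})) = l_2) \tag{4.7}$$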

Unlike language extension, where l2 depends on l1 because concepts in l2 extend concepts in l1, there is no such dependency in this case. Both languages are independent. We call l2 the embedded language and l1 the host language. Again, an adapter language lA that extends l1 can be used to achieve this, where:

$$\exists c \in \mathit{Cdn}_{l_A} \mid \mathit{lo}(c.\mathit{parent}) = l_A \wedge \mathit{lo}(c.\mathit{child}) = l_1 \tag{4.8}$$

Embedding supports syntactic composition of independently developed languages. As an example, consider a state machine language that can be combined with any number of programming languages such as Java or C. If the state machine language is used together with Java, then the guard conditions used in the transitions should be Java expressions. If it is used with C, then the expressions should be C expressions. The two expression languages, or in fact, any others, must be embeddable in the guard conditions. So the state machine language cannot depend on any particular expression language, and the expression languages of C or Java obviously cannot be designed with knowledge about the state machine language. Both have to remain independent, and have to be embedded using an adapter language.

Figure 4.29: Embedding: l1 and l2 are independent languages. However, we still want to use them in the same fragment. To enable this, an adapter language lA is added. It depends on both l1 and l2, and uses inheritance and composition to adapt l1 to l2 (this is almost the same structure as in the case of reuse; the difference is that B5 now contains A3, instead of just referencing it).

Note that if the state machine language is specifically built to "embed" C expressions, then this is a case of language extension, since the state machine language depends on the C expression language.

Another example is embedding a database query language such as Linq or SQL in different programming languages (Java, C#, C). Again, the query language may not have a dependency on any programming language (otherwise it would not be embeddable in all of them). The problem could be solved by extension (with embedding flavor), but then the programming language would have to be invasively changed: it would now have to have a dependency on the query language. Using embedding, this dependency can be avoided.

When embedding a language, the embedded language must often be extended as well. In the state machine example, new kinds of expressions must be added to support referencing event parameters defined in the host language. In the case of the query language, method arguments and local variables should probably be usable as part of the queries (... WHERE somecolumn = someMethodArg). These additional expressions will typically reside in the adapter language as well.

Just as in the embedding-flavored extension case (cf. Section 4.6.2), sometimes the embedded language must also be restricted. If you embed the C expression language in state machine guard conditions, you may want to restrict the user from using pointer types or all the expressions related to pointers in C.

WebDSL: In order to support queries over persistent data, WebDSL embeds the Hibernate Query Language (HQL) such that HQL queries can be used as expressions. Queries can refer to entity declarations in the program and to variables in the scope of the query.

Pension Plans: The pension workbench DSL embeds a spreadsheet language for expressing unit tests for pension plan calculation rules. The spreadsheet language comes with its own simple expression language to be used inside the cells. A new expression has been added to reference pension rule input parameters so that they can be used inside the cells.
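The difference between embedding and reuse is containment instead of reference. The following Python sketch illustrates this for the state machine example under simplifying assumptions; all class names are hypothetical and the expression language is reduced to three concepts. The state machine language only defines an abstract Guard hook, the expression language is unaware of state machines, and the adapter concept extends the hook and contains an expression as a child.

```python
from dataclasses import dataclass


# --- State machine language: guards are an abstract hook ---
class Guard:
    def holds(self, env: dict) -> bool:
        raise NotImplementedError


@dataclass
class Transition:
    source: str
    target: str
    guard: Guard  # containment: the guard is a child of the transition


# --- Expression language: unaware of state machines ---
@dataclass
class Name:
    id: str

    def eval(self, env: dict):
        return env[self.id]


@dataclass
class Const:
    value: int

    def eval(self, env: dict):
        return self.value


@dataclass
class Greater:
    left: object
    right: object

    def eval(self, env: dict):
        return self.left.eval(env) > self.right.eval(env)


# --- Adapter language: extends the hook and contains an expression ---
@dataclass
class ExprGuard(Guard):
    expr: object

    def holds(self, env: dict) -> bool:
        return bool(self.expr.eval(env))


# One heterogeneous fragment mixing concepts of both languages.
t = Transition("idle", "cooling", ExprGuard(Greater(Name("temp"), Const(8))))
fires = t.guard.holds({"temp": 10})
```

Extensions of the embedded language, such as an expression that refers to event parameters, would be further classes in the adapter language, so that both original languages stay untouched.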

■ Cross-Cutting Embedding, Meta Data A special case of embedding is handling meta data. We define meta data as program elements that are not essential to the semantics of the program, and are typically not handled by the primary model processor. Nonetheless this data must relate to program elements, and, at least from a user’s perspective, they often need to be embedded in programs. Since most of them are rather generic, embedding is the right composition mechanism: no dependency to any specific language should be necessary, and the meta data should be embeddable in any language. Example meta data includes:

Documentation, which should be attachable to any program element, and in the documentation text, other program elements should be referenceable.

Traces, to capture typed relationships between program elements, or between program elements and requirements or other documentation ("this program element implements that requirement").

Presence Conditions in product line engineering, to describe if a program element should be available in the program for a given product configuration ("this procedure is only in the program in the international variant of the product").

In projectional editors, this meta data can be stored in the program tree and shown only optionally, if some global configuration switch is true. In textual editors, meta data is often stored in separate files, using pointers to refer to the respective model elements. The data may be shown in hovers or views adjacent to the editor itself.

mbeddr C: The system supports various kinds of meta data, including traces to requirements and documentation. They are implemented with MPS’ attribute mechanism, which is discussed in the part on MPS in Section 16.2.7. As a consequence of how MPS attributes work, these meta data can be applied to program elements defined in any arbitrary language.
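The idea of language-independent meta data can be sketched in Python with a side table that attaches annotations to arbitrary program elements. This is a generic illustration and not a model of MPS attributes; the Node, Trace, PresenceCondition and Annotations classes, the select_variant function and the requirement identifier are all invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A program element of any language; it has no meta-data-specific fields."""
    concept: str
    children: list = field(default_factory=list)


@dataclass
class Trace:
    requirement_id: str


@dataclass
class PresenceCondition:
    feature: str


class Annotations:
    """Side table keyed by node identity, so no language has to depend on it."""

    def __init__(self):
        self._by_node = {}

    def attach(self, node: Node, annotation) -> None:
        self._by_node.setdefault(id(node), []).append(annotation)

    def of(self, node: Node, kind: type) -> list:
        return [a for a in self._by_node.get(id(node), []) if isinstance(a, kind)]


def select_variant(node: Node, notes: Annotations, features: set):
    """Copy the tree, dropping elements whose presence condition is not met."""
    conditions = notes.of(node, PresenceCondition)
    if any(c.feature not in features for c in conditions):
        return None
    kept = [select_variant(c, notes, features) for c in node.children]
    return Node(node.concept, [k for k in kept if k is not None])


proc_a = Node("Procedure")
proc_b = Node("Procedure")
module = Node("Module", [proc_a, proc_b])

notes = Annotations()
notes.attach(proc_a, Trace("REQ-12"))                     # made-up requirement id
notes.attach(proc_b, PresenceCondition("international"))

domestic = select_variant(module, notes, set())
international = select_variant(module, notes, {"international"})
```

The primary model processor can ignore the side table entirely, while dedicated tools such as a variant selector or a traceability report consult it.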

4.6.5 Implementation Challenges and Solutions

The previous subsections discussed four strategies for language composition. In this section we describe some of the challenges regarding syntax, type systems and transformations for these four strategies.

■ Syntax Referencing and reuse keep fragments homogeneous. Mixing of concrete syntax is not required. A reference between fragments is usually simply an identifier and does not have its own internal structure for which a grammar would be required72. The name resolution phase can then create the actual cross-reference between abstract syntax objects.

72 Sometimes the references use qualified names, in which case the strings use dots and colons. However, this is still a trivial token structure, so it is acceptable to define the structure separately in both languages.

Refrigerators: The algorithm language contains cross-references into the hardware language. Those references are simple, dotted names such as compartment1.valve.

Example: In the UI example, the adapter language simply introduces dotted names to refer to fields of data structures.

Extension and embedding require modular concrete syntax definitions because additional language elements must be mixed with programs written with the base/host language. As we discuss in Part III (mostly in Section 7), combining independently developed languages after the fact can be a problem: depending on the parser technology, the combined grammar may not be parsable with the parser technology at hand. There are parser technologies that do not exhibit this problem, and projectional editors avoid it by definition. However, several widely used language workbenches have problems in this respect.

mbeddr C: State machines are hosted in regular C programs. This works because the C language’s Module construct contains a collection of IModuleContents, and the StateMachine concept implements the IModuleContent concept interface. This state machine language is designed specifically to be embedded into C, so it can access and extend IModuleContent (Fig. 4.30). If the state machine language were embeddable in any host language in addition to C, this dependency on IModuleContent (from the C base language) would not be allowed. An adapter language would have to be created which adapts a StateMachine to IModuleContent.

■ Type Systems For referencing, the type system rules and constraints of the referencing language typically have to take into account the referenced language. Since the referenced language is known when developing the referencing language, the type system can be implemented with the referenced language in mind as well.

Figure 4.30: The core language (above the dotted line) defines an interface IModuleContent. Anything that should be hosted inside a Module has to implement this interface, typically from another language. StateMachines are an example.

Refrigerators: In the refrigerator example, the algorithm language defines typing rules for hardware elements (from the hardware language), because these types are used to determine which properties can be accessed on the hardware elements (e.g., a compressor has a property active that controls whether it is on or off).

In the case of extension, the type systems of the base language must be designed in a way that allows adding new typing rules in language extensions. For example, if the base language defines typing rules for binary operators, and the extension language defines new types, then those typing rules may have to be overridden to allow the use of existing operators with the new types.

mbeddr C: A language extension provides types with physical units (as in 100 kg). Additional typing rules are needed to override the typing rules for C’s basic operators (+, -, *, /, etc.). MPS supports declarative type system specification, so you can just add additional typing rules for the case in which one or both of the arguments have a type with a physical unit.

For reuse and embedding, the typing rules that affect the interplay between the two languages reside in the adapter language. The type systems of both languages must be extensible in the way described in the previous paragraph on extension.

Example: In the UI example the adapter language will have to adapt the data types of the fields in the data description to the types the UI widgets expect. For example, a combo box widget can only be bound to fields that have some kind of text or enum data type. Since the specific types are specific to the data description language (which is unknown at the time of creation of the UI language), a mapping must be provided in the adapter language.
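The extensible typing rules described above for the physical-units extension can be approximated by a rule registry in which rules added later take precedence. The Python sketch below is a loose illustration of that idea and not MPS’ type system language; the registry, the decorator and the rule functions are assumptions made here, and only the case in which both operands carry a unit is covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitType:
    base: str  # underlying C type, e.g. "int"
    unit: str  # e.g. "kg"


# Each rule returns a type, or None if it does not apply. Rules registered
# later are tried first, which is how the extension overrides the base here.
RULES = []


def typing_rule(fn):
    RULES.insert(0, fn)
    return fn


def type_of_binary(op: str, left, right):
    for rule in RULES:
        result = rule(op, left, right)
        if result is not None:
            return result
    raise TypeError(f"no typing rule for {left} {op} {right}")


# --- Rule contributed by the base language ---
@typing_rule
def arithmetic_on_ints(op, left, right):
    if left == "int" and right == "int" and op in ("+", "-", "*", "/"):
        return "int"
    return None


# --- Rule contributed by the units extension ---
@typing_rule
def additive_with_units(op, left, right):
    if isinstance(left, UnitType) and isinstance(right, UnitType) and op in ("+", "-"):
        if left.unit != right.unit:
            raise TypeError(f"cannot apply {op} to {left.unit} and {right.unit}")
        return left
    return None


kg = UnitType("int", "kg")
m = UnitType("int", "m")
plain_sum = type_of_binary("+", "int", "int")
mass_sum = type_of_binary("+", kg, kg)
```

The base rule is not edited; the extension only adds a rule of its own. An adapter language for reuse or embedding could contribute its type mappings through the same mechanism.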

■ Transformation In this section we use the terms transformation and generation interchangeably. In general, transformation is used if one tree of program elements is mapped to another tree, while generation describes the case of creating text from program trees. However, for the discussions in this section, this distinction is generally not relevant.

Three cases have to be considered for referencing. The first one (Fig. 4.31) propagates the referencing structure to the target fragments. We call these two transformations single-sourced, since each of them only uses a single, homogeneous fragment as input and creates a single, homogeneous fragment as output, typically with references between them. Since the referencing language is created with knowledge about the referenced language, the generator for the referencing language can be written with knowledge about the names of the elements that have to be referenced in the fragment generated from the referenced fragment. If a generator for the referenced language already exists, it can be reused unchanged. The two generators basically share knowledge about the names of generated elements.

Figure 4.31: Referencing: Two separate, dependent, single-source transformations

Component Architecture: In the types viewpoint, interfaces and components are defined. The types viewpoint is independent, and it is sufficient for the generation of the code necessary for implementing component behavior: Java base classes are generated that act as the component implementations (expected to be extended by manually written subclasses). A second, dependent viewpoint describes component instances and their connections; it depends on the types viewpoint. A third describes the deployment of the instances to execution nodes (servers, essentially). The generator for the deployment viewpoint generates code that actually instantiates the classes that implement components, so it has to know the names of those generated (and hand-written) classes.

The second case (Fig. 4.32) is a multi-sourced transformation that creates one single homogeneous fragment. This typically occurs if the referencing fragment is used to guide the transformation of the referenced fragment, for example by specifying transformation strategies (annotation models). In this case, a new transformation has to be written that takes the referencing fragment into account. The possibly existing generator for the referenced language cannot be reused as is.

Figure 4.32: A single multi-sourced transformation.

Refrigerators: The refrigerator example uses this case. The code generator that generates the C code that implements the cooling algorithm takes into account the information from the hardware description model. A single fragment is generated from the two input models. The generated code is C-only, so the fragment remains homogeneous.

The third case, an alternative to rewriting the generator, is the use of a preprocessing transformation (Fig. 4.33), that changes the referenced fragment in a way consistent with what the referencing fragment prescribes. The existing transformations for the referenced fragment can then be reused.

Figure 4.33: A preprocessing transformation that changes the referenced fragment in a way specified by the referencing fragment

As we have discussed above, language extensions are usually created by defining linguistic abstractions for common idioms of a domain D. A generator for the new language concepts can simply recreate those idioms when mapping L_D to L_{D−1}, a process also called assimilation. In other words, transformations for language extensions map a heterogeneous fragment (containing L_{D−1} and L_D code) to a homogeneous fragment that contains only L_{D−1} code (Fig. 4.34). In some cases additional files may be generated, often configuration files. In any case, the subsequent transformations for L_{D−1}, if any, can be reused unchanged.

Figure 4.34: Extension: transformation usually happens by assimilation, i.e. generating code in the host language from code expressed in the extension language. Optionally, additional files are generated, often configuration files.

mbeddr C: State machines are generated down to a function that contains a switch statement, as well as enums for states and events. Then the existing C-to-text transformations are reused unchanged. In addition, the state machines are also transformed into a dot file that is used to render the state machine graphically via graphviz.

Sometimes a language extension requires rewriting transformations defined by the base language. In this case, the transformation engine must support the overriding of transformations by transformations defined in another language.

mbeddr C: In the data-types-with-physical-units example, the language also provides range checking and overflow detection. So if two such quantities are added, the addition is transformed into a call to a special add function instead of using the regular plus operator. This function performs overflow checking and addition. MPS supports transformation priorities that can be used to override the existing transformation with a new one.

Language extension introduces the risk of semantic interactions. The transformations associated with several independently developed extensions of the same base language may interact with each other. To avoid the problem, transformations should be built in a way so that they do not "consume scarce resources" such as inheritance links73.

73 It would be nice if DSL tools would detect such conflicts statically, or at least supported a way of marking two languages or extensions as incompatible. However, none of the tools I know support such features.

Example: Consider the (somewhat artificial) example of two extensions to Java that each define a new statement. When assimilated to pure Java, both new statements require the surrounding Java class to extend a specific but different base class. This won’t work, because a Java class can only extend one base class.

Interactions may also be more subtle and affect memory usage or execution performance. Note that this problem is not specific to languages; it can occur whenever several independent extensions of something can be used together, ad hoc. A more thorough discussion of the problem of semantic interactions is beyond the scope of this book.

Figure 4.35: Reuse: Reuse of existing transformations for both fragments plus generation of adapter code.
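Assimilation and the overriding of base-language transformations, as described above for language extension, can be sketched as follows. This is a toy model in Python, not the MPS mechanism: the tuple-based AST encoding, the `Assimilator` class, the `unless` statement and the `checked_add` function are all hypothetical names chosen for illustration.

```python
from typing import Callable, Optional

# A node is a tuple (kind, *children); leaves are plain values.
Rule = Callable[..., tuple]


class Assimilator:
    """Rewrites extension nodes into base-language nodes, bottom-up."""

    def __init__(self) -> None:
        self._rules: dict[str, tuple[int, Rule]] = {}

    def register(self, kind: str, rule: Rule, priority: int = 0) -> None:
        """Register a rule; a strictly higher priority overrides an existing rule."""
        current: Optional[tuple[int, Rule]] = self._rules.get(kind)
        if current is None or priority > current[0]:
            self._rules[kind] = (priority, rule)

    def transform(self, node):
        if not isinstance(node, tuple):
            return node  # leaf value, nothing to rewrite
        kind, *children = node
        children = [self.transform(child) for child in children]
        entry = self._rules.get(kind)
        if entry is not None:
            return entry[1](*children)
        return (kind, *children)  # base-language node, kept as is
```

A hypothetical extension statement `unless (c) body` would be assimilated with `register("unless", lambda c, body: ("if", ("not", c), body))`. An extension that needs overflow checking could register a rule for `"plus"` with a higher priority than the base rule, so that additions become calls to a checking function instead of the regular operator.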

In the reuse scenario, it is likely that both the reused and the context language already come with their own generators. If these generators transform to different, incompatible target languages, no reuse is possible. If they transform to a common target language (such as Java or C), then the potential for reusing previously existing transformations exists.

There are three cases to consider. The first one, illustrated in Fig. 4.35, describes the case in which there is an existing transformation for the reused fragment and an existing transformation for the context fragment – the latter being written with the knowledge that later extension will be necessary. In this case, the generator for the adapter language may "fill in the holes" left by the reusable generator for the context language.

For example, the generator of the context language may generate a class with abstract methods; the adapter may generate a subclass and implement these abstract methods.

In the second case, Fig. 4.36, the existing generator for the reused fragment has to be enhanced with transformation code specific to the context language. A mechanism for composing transformations is needed.

Figure 4.36: Reuse: composing transformations

The third case, Fig. 4.37, leaves composition to the target languages. We generate three different independent, homogeneous fragments, and some kind of weaver composes them into one final, heterogeneous artifact. Often, the weaving specification is the intermediate result generated from the adapter language. An example implementation could use AspectJ.
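The "fill in the holes" idea from the first reuse case can be sketched as two cooperating generators. This is an illustrative Python sketch that emits Java source text; the function names, the `Base` suffix and the `double` return type are assumptions, not part of any tool discussed here.

```python
def generate_context(name: str, hooks: list[str]) -> str:
    """Context-language generator: leaves one abstract method per hole."""
    methods = "\n".join(f"  protected abstract double {h}();" for h in hooks)
    return f"public abstract class {name}Base {{\n{methods}\n}}\n"


def generate_adapter(name: str, bindings: dict[str, str]) -> str:
    """Adapter-language generator: a subclass that implements the holes."""
    methods = "\n".join(
        f"  @Override protected double {hook}() {{ return {expr}; }}"
        for hook, expr in bindings.items()
    )
    return f"public class {name} extends {name}Base {{\n{methods}\n}}\n"


def unfilled_holes(hooks: list[str], bindings: dict[str, str]) -> list[str]:
    """Holes the adapter does not fill; a generator could report these as errors."""
    return sorted(set(hooks) - set(bindings))
```

Neither generator needs to be changed when the other evolves, as long as the set of hook names stays stable.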

An embeddable language may not come with its own generator, since, at the time of implementing the embeddable language, one cannot know what to generate – its purpose is to be embedded! In that case, when embedding the language, a suitable generator has to be developed. It will typically either generate host language code (similar to generators in the case of language extension) or generate directly to the same target language that the host language's generator targets.

Figure 4.37: Reuse: generating separate artifacts plus a weaving specification.

If the embeddable language comes with a generator that transforms to the same target language as the embedding language, then the generator for the adapter language can coordinate the two, and make sure a single, consistent fragment is generated. Fig. 4.38 illustrates this case.

Just as for language extension, language embedding may also lead to semantic interactions if multiple languages are embedded into the same host language.

Figure 4.38: In transforming embedded languages, a new transformation has to be written if the embedded language does not come with a transformation for the target language of the host language transformation. Otherwise the adapter language can coordinate the transformations for the host and for the embedded languages.

4.7 Concrete Syntax

A good choice of concrete syntax is important for DSLs to be accepted by the intended user community. Especially (but not exclusively) in business domains, a DSL will only be successful if and when it uses notations that directly fit the domain – there might even be existing, established notations that should be reused. A good notation makes expression of common concerns simple and concise and provides sensible defaults. It is acceptable for less common concerns to require a little more verbosity in the notation.

4.7.1 Design Concerns for Concrete Syntax

In particular the following concerns may be addressed when designing a concrete syntax74:

Writability A writable syntax is one that can be written efficiently. This usually means that the syntax is concise, because users have to type less. However, a related aspect is tool support: the degree to which the IDE can provide better editing support75 (code completion and quick fixes in particular) makes a difference to writability.

Readability A readable syntax means that it can be read effectively. A more concise syntax is not necessarily more readable, because context may be missing76, in particular for people other than those who have written the code.

Learnability A learnable syntax is useful to novices in particular, because it can be "explored", often exploiting IDE support77. For example, the more the language uses concepts that have a direct meaning in the domain, the easier it is for domain users to learn the language.

Effectiveness Effectiveness relates to the degree that a language enables routine users to effectively express typical domain problems after they have learned the language.

74 These concerns do not just depend on the concrete syntax, but also on the abstract syntax and the expressiveness of the language itself (which is discussed in Section 4.1). However, the concrete syntax has a major influence, which is why we discuss it here.

75 See the example of the select statement in Section 4.7.2.

76 Good examples of this dilemma are the APL or M languages: the syntax is so concise that it is really hard to read.

77 "Just press Ctrl-Space and the tool will tell you what you can type next."

■ Tradeoffs It is obvious that some of these concerns are in conflict. A very writable language may not be very readable. If a group of stakeholders R uses artifacts developed by another group W (e.g. by referencing some of the program elements), it is important that a readable language is used. A learnable language may feel "annoyingly verbose and cumbersome" to routine users after a while78. However, creating an effective syntax and trying to convince users to adopt the language even though it is hard(er) to learn may be a challenge.

For DSLs whose programs have a short lifetime (as in scripting languages) readability is often not very important, because the programs are thrown away once they have performed their particular task.

78 Note that if a specific DSL is only used irregularly, then users probably never become routine users, and have to relearn the language each time they use it.

■ Multiple Notations One way to solve these dilemmas is to provide different concrete syntaxes for the same abstract syntax, and let users choose. For example, beginners can choose a more learnable one, and switch to a more effective one over time. However, depending on the tooling used, this can be a lot of work.
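The idea of several concrete syntaxes over one abstract syntax can be sketched with a tiny state machine model that is rendered both textually and as a table. This is an illustrative Python sketch; the `Transition` class, the two render functions and the textual syntax they emit are invented for this example and do not reproduce any tool's notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """Abstract syntax: one transition of a state machine."""
    source: str
    event: str
    target: str


def render_text(transitions: list[Transition]) -> str:
    """Notation 1: a block-structured textual syntax."""
    by_state: dict[str, list[Transition]] = {}
    for t in transitions:
        by_state.setdefault(t.source, []).append(t)
    lines = []
    for state, outgoing in by_state.items():
        lines.append(f"state {state} {{")
        lines += [f"  on {t.event} -> {t.target}" for t in outgoing]
        lines.append("}")
    return "\n".join(lines)


def render_table(transitions: list[Transition]) -> list[list[str]]:
    """Notation 2: states as rows, events as columns, targets in the cells."""
    states = list(dict.fromkeys(t.source for t in transitions))
    events = list(dict.fromkeys(t.event for t in transitions))
    lookup = {(t.source, t.event): t.target for t in transitions}
    rows = [["", *events]]
    for s in states:
        rows.append([s, *[lookup.get((s, e), "-") for e in events]])
    return rows
```

Both functions read the same list of `Transition` objects, so switching notation never changes the program itself, only how it is shown.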

■ Multiple Notations For projectional editors it is relatively easy to define several notations for the same language concept. By changing the projection rules, existing programs can be shown in a different way. In addition, different notations (possibly showing different concerns of the overall program) can be used for different stakeholders.

We have the equivalent of multiple notations for the same language in the real world. English can be spoken, written, transported via morse code or even expressed via sign language. Each of these is optimized for certain contexts and audiences.

mbeddr C: For state machines, the primary syntax is textual. However, a tabular notation is supported as well. The projection can be changed as the program is edited, rendering the same state machine textually or as a table. A graphical notation will be added in the future, as MPS’ support for graphical notations improves.

Refrigerators: The refrigerator DSL uses graphical visualizations to render diagrams of the hardware structure, as well as graphical state charts representing the underlying state machine.

Another option to resolve the learnability vs. effectiveness dilemma is to create an effective syntax and help new users by good documentation, training and/or IDE support (templates, wizards).

■ Reports and Visualization A visualization is a graphical representation of a model that cannot be edited. It is created from the core model using some kind of transformation, and highlights a particular aspect of the source program. It is often automatically laid out79. The resulting diagram may be static (i.e. an image file is generated) or interactive (where users can show, hide and focus on different parts of the diagram). It may provide drill-down back to the core program (double-clicking on the figure in the image opens the code editor at the respective location)80.

A report has the same goals (highlighting a particular aspect of the source program, while not being editable) but uses a textual notation.

Visualizations and reports are a good way of resolving a potential conflict if the primary DSL users want to use a writable notation and other stakeholders want a more readable representation. Since reports and visualizations are not the primary notation, it is possible to create several different visualizations or reports for the source program, highlighting different aspects of the core program.

79 Automatic layout requires a good layout algorithm. The best one is available commercially in yFiles/yEd. Sometimes it is necessary to manually adjust the layout of the generated visualization. Doing so of course is problematic because the manual adjustments are lost if the visualization is regenerated. A better approach is to create another model that specifies properties of the visualization, such as the subset of model elements that should be in the diagram, semantic coloring or selecting of shapes or layout hints for the algorithm.

80 Graphviz is one of the most well-known tools for this kind of visualization. Another is Jan Koehnlein’s Generic Graph View at github.com/JanKoehnlein/Generic-Graph-View.

mbeddr C: In the mbeddr components extension, we support several notations. The first shows interfaces and the components that provide and require these interfaces. The second shows component instances and the connections between their provided and required ports. Finally, there is a third visualization that applies to all mbeddr models, not just those that use components: it shows the modules, their imports (i.e. module dependencies) as well as the public contents of these modules (functions, structs, components, test cases).

Figure 4.39: mbeddr C also supports graphical visualizations of state machines. For every state machine, a dot representation is automatically generated. The image is then rendered by graphviz directly in the IDE. Double-clicking on a state selects the respective program element in the editor.
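A read-only visualization of the kind described above can be produced with a very small model-to-text transformation. The following Python sketch emits Graphviz DOT text for a state machine; the function name `to_dot` and the triple-based model encoding are assumptions made for illustration, and the sketch says nothing about how mbeddr actually generates its dot files.

```python
def to_dot(name: str, transitions: list[tuple[str, str, str]]) -> str:
    """Render (source, event, target) triples as a Graphviz digraph.

    Layout is left entirely to Graphviz; the generated text contains only
    nodes, edges and edge labels.
    """
    states = sorted({s for src, _, dst in transitions for s in (src, dst)})
    lines = [f"digraph {name} {{"]
    for state in states:
        lines.append(f'  "{state}";')
    for src, event, dst in transitions:
        lines.append(f'  "{src}" -> "{dst}" [label="{event}"];')
    lines.append("}")
    return "\n".join(lines) + "\n"
```

The resulting text can be written to a file and rendered with the `dot` command line tool. Since it is regenerated from the model on every run, any manual change to the output would be lost, which matches the non-editable nature of visualizations.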

4.7.2 Classes of Concrete Syntax

There are a couple of major classes for DSL concrete syntax81: Textual DSLs use linear textual notations, typically based on ASCII or Unicode characters. They basically look and feel like traditional programming languages. Graphical DSLs use graphical shapes. An important subgroup is represented by those that use box-and-line diagrams that look and feel like UML class diagrams or state machines. However, there are more options for graphical notations, such as those illustrated by UML timing diagrams or sequence diagrams. Symbolic DSLs are textual DSLs with an extended set of symbols, such as fraction bars, mathematical symbols or subscript and superscript. Tables and matrices are a powerful way to represent certain kinds of data and can play an important part for DSLs.

81 Sometimes form-based GUIs or tree views are considered DSLs. I disagree, because this would make any GUI application a DSL.

The perfect DSL tool should support free combination and integration of the various classes of concrete syntax, and be able to show (aspects of) the same model in different notations. As a consequence of tool limitations, this is not always possible, however. The requirements for concrete syntax are a major driver in tool selection.

Figure 4.40: Graphical notation for relationships.

■ When to Use Which Form We do not want to make this section a complete discussion of graphical versus textual DSLs – a discussion that is often heavily biased by previous experience, prejudice and tool capabilities. Here are some rules of thumb. Purely textual DSLs integrate well with existing development infrastructures, making their adoption relatively easy. They are well suited for detailed descriptions, anything that is algorithmic or generally resembles (traditional) program source code. A good textual syntax can be very effective (in terms of the design concerns discussed above). Symbolic notations can be considered "better textual", and lend themselves to domains that make heavy use of symbols and special notations; many scientific and mathematical domains come to mind. Tables are very useful for collections of similarly structured data items, or for expressing how two independent dimensions of data relate. Tables emphasize readability over writability. Finally, graphical notations are very good for describing relationships (Fig. 4.40), flow (Fig. 4.41) or timing and causal relationships (Fig. 4.42). They are often considered easier to learn, but may be perceived as less effective by experienced users.

Figure 4.41: Graphical notation for flow

Figure 4.42: Graphical notation for causality and timing

Pension Plans: The pension DSL uses a mathematical notation to express insurance mathematics (Fig. 4.43). A table notation is embedded to express unit tests for the pension plan calculation rules. A graphical projection shows dependencies and specialization relationships between plans.

Figure 4.43: Mathematical notations used to express insurance math in the pension workbench.

mbeddr C: The core DSLs use mostly textual notations with some tabular enhancements, for example for decision tables (Fig. 20.4). However, as MPS’ capability for handling graphical notations improves, we will represent state machines as diagrams.

Figure 4.44: Decision tables use a tabular notation. It is embedded seamlessly into a C program.

Selection of a concrete syntax is simple for domain user DSLs if there is an established notation in the domain. The challenge then is to replicate this notation as closely as possible with the DSL, while cleaning up possible inconsistencies in the notation (since presumably it had not been used formally before). I like to use the term "strongly typed (Microsoft) Word" in this case82.

82 In some cases it is useful to come up with a better notation than the one used historically. This is especially true if the historic notation is Excel.

For DSLs targeted at developers, a textual notation is usually a good starting point, since developers are used to working with text, and they are very productive with it. Tree views and some visualizations are often useful for outlines, hierarchies or overviews, but not necessarily for editing. Textual notations also integrate well with existing development infrastructures.

mbeddr C: C is the baseline for embedded systems, and everybody is familiar with it. A textual notation is useful for many concerns in embedded systems. Note that several languages create visualizations on the fly, for example for module dependencies, component dependencies and component instance wirings. The graphviz tool is used here since it provides decent auto-layout.

There are very few DSLs where a purely graphical notation makes sense, because in most cases some textual languages are embedded in the diagrams or tables: state machines embed expressions in guards and statements in actions (Fig. 20.7); component diagrams use text for specifications of operations in interfaces, maybe using expressions for preconditions; block diagrams use a textual syntax for the implementation/parametrization of the blocks (Fig. 4.45); tables may embed textual notations in the cells (Fig. 4.46). Integrating textual languages into graphical ones is becoming more and more important, and tool support is improving.

A text box where textual code can be entered without language support should only be used as a last resort. Instead, a textual notation with additional graphical visualizations should be used.

Figure 4.45: A block diagram built with the Yakindu modeling tools. A textual DSL is used to implement the behavior in the blocks. While the textual DSL is not technically integrated with the graphical notation (separate viewpoints), semantic integration is provided.

Note that initially domain users prefer a graphical notation, because of the perception that things that are described graphically are simple(r) to comprehend. However, what is most important regarding comprehensibility is the alignment of the domain concepts with the abstractions in the language. A well-designed textual notation can go a long way. Also, textual languages are more productive once the learning curve has been overcome. I have had several cases where domain users started preferring textual notations later in the process.

In my consulting practice, I almost always start with a textual notation and try to stabilize language abstractions. Only then will I engage in a discussion about whether a graphical notation on top of the textual one is necessary. Often it is not, and if it is, we have avoided iterating the implementation of the graphical editor, which, depending on the tooling, can be a lot of work.

Figure 4.46: The Yakindu Requirements tools integrate a textual DSL for formal requirements specification into a table view. The textual specifications are stored as text in the requirements database; consequently, the entities defined textually cannot be referenced (which is not a problem in this domain).

Figure 4.47: The Yakindu State Chart Tools support the use of Xtext DSLs in actions and guard conditions of state machines, mixing textual and graphical notations. The DSL can even be exchanged, to support domain specific action languages, for example for integrating with user interface specifications. In this case, the textual specifications are stored as the AST in terms of EMF, not as text.

■ IDE Supportability For textual languages, it is important to keep in mind if and how a syntax can be supported by the IDE, especially regarding code completion. Consider query languages. An example SQL query looks like this:

SELECT field1, field2 FROM aTable WHERE ...

When entering this query the IDE cannot provide code completion for the fields after the SELECT because at this point the table has not yet been specified. A more suitable syntax, with respect to IDE support, would be:

FROM aTable SELECT field1, field2 WHERE ...

It is better because now the IDE can provide code completion for the fields based on the table name that has already been entered when you specify the fields83.

83 SQL is a relatively old language and IDE concerns were probably not very important at the time. More modern query languages such as HQL or Linq in fact use the more IDE-friendly syntax.

Another nice example is dot-notation for function calls. Consider a functional language. Typical function call syntax is f(a, b, c) or possibly (f a b c). In either case, the function comes first. Now consider a notation where you can (optionally) write the first argument before the dot, i.e. a.f(b, c). This has a significant advantage in terms of IDE support: after the user enters a., code completion can propose all the functions that are available for the type of a. This leads to much better explorability of the language compared to the normal function-first syntax: since at the time of writing the function, the user has not yet written the value on which to apply the function, the IDE can provide no support84.

84 This example was motivated by Daan Leijen’s Koka language which supports the dot-notation for just this reason.
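Both completion arguments above come down to what the editor already knows at the cursor position. The following Python sketch makes that concrete; it is a toy model, not an IDE API. The table `aTable` and its fields come from the query above, while the `orders` table, the function signatures and all function names are hypothetical.

```python
# Hypothetical schema and function signatures used only for this sketch.
SCHEMA = {
    "aTable": ["field1", "field2"],
    "orders": ["id", "total"],
}
SIGNATURES = {
    "length": ["string"],
    "concat": ["string", "string"],
    "abs": ["int"],
    "max": ["int", "int"],
}


def complete_fields(table=None):
    """Field proposals; without a known table every field is a candidate."""
    if table is None:
        return sorted(f for fields in SCHEMA.values() for f in fields)
    return sorted(SCHEMA.get(table, []))


def complete_function_first(prefix=""):
    """f(a, b): the argument is not typed yet, so only the prefix can filter."""
    return sorted(name for name in SIGNATURES if name.startswith(prefix))


def complete_after_dot(receiver_type, prefix=""):
    """a.f(b): the receiver type is known, so proposals are narrowed by it."""
    return sorted(
        name for name, params in SIGNATURES.items()
        if params and params[0] == receiver_type and name.startswith(prefix)
    )
```

With the table-first query syntax the editor can call `complete_fields("aTable")`, whereas after a bare SELECT it can only fall back to `complete_fields()`. The same narrowing applies to `complete_after_dot` versus `complete_function_first`.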

Note that tool supportability in general is not fundamentally different in graphical and textual languages. While IDEs for textual languages can provide code completion, the palette or the context buttons in a graphical DSL play the same role. I often hear that a graphical DSL is more suitable for simulation (because the execution of the program can be animated on the graphical notation). However, this is only true if the graphical notation works well in the first place. A textual program can also be animated; a debugger essentially does just that.

■ Relationship to Hierarchical Domains Domains at low D are most likely best expressed with a textual or symbolic concrete syntax. Obvious examples include programming languages at D_0. Mathematical expressions, which are also very dense and algorithmic, use a symbolic notation. As we progress to higher Ds, the concepts become more and more abstract, and as state machines and block diagrams illustrate, graphical notations become useful. However, these two notations are also a good example of language embedding, since both of them require expressions: state machines in guards and actions (Fig. 20.7), and block diagrams as the implementation of blocks (Fig. 4.45 and Fig. 5.5). Reusable expression languages should be embedded into the graphical notations. If this is not supported by the tool, viewpoints may be an option. One viewpoint could use a graphical notation to define coarse-grained structures, and a second viewpoint use a textual notation to provide "implementation details" for the structures defined by the graphical viewpoint85.

85 Not every tool can support every (combination of) form of concrete syntax, so this aspect is limited by the tool, or drives tool selection.

mbeddr C: As the graphical notation for state machines becomes available, the C expression language that is used in guard conditions for transitions will be usable as labels on the transition arrows. In the table notation for state machines, C expressions can be embedded in the cells as well.

5 Fundamental Paradigms

Every DSL is different. It is driven by the domain for which it is built. However, as it turns out, there are also a number of commonalities between DSLs. These can be handled by modularizing and reusing (parts of) DSLs, as discussed in the last section of the previous chapter. In this section we look at common paradigms for describing DSL structure and behavior.

5.1 Structure

Languages have to provide a means of structuring large programs in order to keep them manageable. Such means include modularization and encapsulation, specification vs. implementation, specialization, types and instances, as well as partitioning.

The language design alternatives described in this section are usually not driven directly by the domain, or the domain experts guiding the design of the language. Rather, they are often brought in by the language designer as a means of managing overall complexity. For this reason they may be hard to "sell" to domain experts.

5.1.1 Modularization and Visibility

DSLs often provide some kind of logical unit structure, such as namespaces or modules. Visibility of symbols may be restricted to the same unit, or to referencing ("importing") units. Symbols may be declared as public or private, the latter making them invisible to other modules, which guarantees that changes to these symbols cannot affect other modules. Some form of namespaces and visibility is necessary in almost any DSL. Often there are domain concepts that can play the role of the module, possibly oriented towards the structure of the organization in which the DSL is used.

Most contemporary programming languages use some form of namespaces and visibility restriction as their top-level structure.

mbeddr C: As a fundamental extension to C, this DSL contains modules with visibility specifications and imports. Functions, state machines, tasks and all other top-level concepts reside in modules. Header files (which are effectively a poor way of managing symbol visibility) are only used in the generated low-level code and are not relevant to the user of mbeddr C.

Component Architecture: Components and interfaces live in namespaces. Components are implementation units, and are always private. Interfaces and data types may be public or private. Namespaces can import each other, making the public elements of the imported namespace visible to the importing namespace. The OSGi generator creates two different bundles: an interface bundle that contains the public artifacts, and an implementation bundle with the components. In the case of a distributed system, only the interface bundle is deployed on the client.

Pension Plans: Pension plans constitute namespaces. They are grouped into more coarse-grained packages that are aligned with the structure of the pension insurance business.

5.1.2 Partitioning

Partitioning refers to the breaking down of programs into several physical units such as files (typically each model fragment is stored in its own partition). These physical units do not have to correspond to the logical modularization of the models within the partitions. For example, in Java a public class has to live in a file of the same name (logical module == physical partition), whereas in C# there is no relationship between namespace, class names and the physical file and directory structure. A similar relationship exists between partitions and viewpoints, although in most cases, different viewpoints are stored in different partitions.

If a repository-based tool is used, the importance of partitioning is greatly reduced. However, even in that case there may be a set of federated and distributed repositories that can be considered partitions.

Note that a reference to an element should not take into account the partition in which the target element lives. Instead, it should only use the logical structure. Consider an element E that lives in a namespace x.y, stored in a partition mainmodel. A reference to that element should be expressed as x.y.E, not as mainmodel.E or mainmodel/x.y.E. This is important, as it allows elements to move freely between partitions without this leading to updates of all references to the element.

Partitioning may have consequences for language design. Consider a textual DSL in which a concept A contains a list of instances of concept B. The B instances then have to be physically nested within an instance of A in the concrete syntax. If there are many instances of B in a given model, they cannot be split into several files, so these files may become big and result in performance problems. If such a split must be possible, this has to be designed into the language.

Component Architecture: A variant of this DSL that was used in another project had to be changed to allow a namespace to be spread over several files for reasons of scalability and version-control granularity. In the initial version, namespaces actually contained the components and interfaces. In the revised version, components and interfaces were owned by no other element, but model files (partitions) had a namespace declaration at the top, logically putting all the contained interfaces and components into this namespace. Since there was no technical containment relationship between namespaces and their elements, several files could now declare the same namespace. Changing this design decision led to a significant reimplementation effort, because all kinds of naming and scoping strategies changed.

Other concerns influence the design of a partitioning strategy as well:

Change Impact Which partition changes as a consequence of a particular change of the model (changing an element name might require changes to all references to that element from other partitions).

Link Storage Where are links stored (are they always stored in the model that logically "points to" another one?), and if not, how/where/when to control reference/link storage.

Model Organization Partitions may be used as a way of organizing the overall model. This is particularly important if the tool does not provide a good means of presenting the overall logical structure of models and finding elements by name and type. Organizing files with meaningful names in directory structures is a workable alternative.

Tool Chain Integration Integration with existing, file-based tool chains. Files may be the unit of check in/check out, versioning, branching or permission checking.
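The earlier rule that references use only the logical structure (x.y.E rather than mainmodel.E), together with the revised design in which several files declare the same namespace, can be sketched as a small symbol index. This is an illustrative Python sketch; the `Partition` class, `build_index` and the partition name `extra` are assumptions, while `mainmodel`, `x.y` and `E` are taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """A physical unit (for example a file) with a namespace declaration at the top."""
    name: str
    namespace: str
    elements: list[str] = field(default_factory=list)


def build_index(partitions: list[Partition]) -> dict[str, str]:
    """Map logical qualified names to the partition that currently stores them.

    References are resolved through the keys only; the partition name is a
    storage detail and never appears in a reference.
    """
    index: dict[str, str] = {}
    for partition in partitions:
        for element in partition.elements:
            qualified = f"{partition.namespace}.{element}"
            if qualified in index:
                raise ValueError(f"duplicate element {qualified}")
            index[qualified] = partition.name
    return index
```

Moving E from one partition to another changes only the value stored under `"x.y.E"`, not the key, so no reference has to be updated.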

It is often useful to ensure that each partition is processable separately to reduce processing times. An alternative approach supports the explicit definition of those partitions that should be processed in a given processor run (or at least a search path, a set of directories, to find the partitions, like an include path in C compilers or the Java classpath). You might even consider a separate build step to combine the results created from the separate processing steps of the various partitions (again like a C compiler, which compiles every file separately into an object file, after which the linker handles overall symbol/reference resolution and binding).

Another driver for using partitions is the scalability of the DSL tool. Beyond a certain file size, the editor may become sluggish.

The partitioning scheme may also influence users’ team collaboration when editing models. There are two major collaboration models: real-time and commit-based. In real-time collaboration, a user sees his model change in real time as another user changes the same model. Change propagation is immediate. A database-backed repository is often a good choice regarding storage, since the granularity tracked by the repository is the model element. In this case, the partitioning may not be visible to the end user, since they just work "on the repository". This approach is often (at least initially) preferred by non-programmer DSL users.

The other collaboration mode is commit-based, in which a user’s changes are only propagated to the repository if he performs a commit, and incoming changes are only visible after a user has performed an update. While this approach can be used with database-backed repositories, it is most often used with file-based storage. In this case, the partitioning scheme is visible to DSL users, because it is those files they commit or update.
This approach tends to be preferred by develop- ers, maybe because well-known versioning tools have used the 1 Interfaces, pure abstract classes, traits approach for a long time. or function signatures are a realiza- tion of this concept in programming 5.1.3 Specification vs. Implementation languages. Separating specification and implementation supports plug- The separation of specification and implementation can also have positive ging in different implementations for the same specification effects on scalability and performance. and hence provides a way to "decouple the outside from the If the specification and implementation inside"1. This supports the exchange of several implementa- are separated into different fragments, then, in order to type check a client’s tions behind a single interface. This is often required as a con- access to some provided service, only sequence of the development process: one stakeholder defines the fragment that contains the specifica- tion has to be loaded/parsed/checked. the specification and a client, whereas another stakeholder pro- This is obviously faster than processing vides one or more implementations. complete implementation. dsl engineering 145

A challenge for this approach is how to ensure that all implementations are consistent with the specification. Traditionally, only the structural/syntactic/signature compatibility is checked. To ensure semantic compatibility, additional means that specify the expected behavior are required. This can be achieved with pre- or post-conditions, invariants or protocol state machines.

mbeddr C: This DSL adds interfaces and components to C. Components provide or use one or more interfaces. Different components can be plugged in behind the same interface. To support semantic specifications, the interfaces support pre- and post-conditions as well as protocol state machines. Fig. 5.1 shows an example. Although these specifications are attached to interfaces, they are actually checked (at runtime) for all components that provide the respective interface.

Figure 5.1: An interface using semantic specifications. Preconditions check the values of arguments for validity. Postconditions express constraints on the values of query operations after the execution of the operation. Notice how the value of the query before executing the operation can be referred to (the old keyword used in the postcondition for accelerateBy). In addition, protocols constrain the valid sequence of operation invocations. For example, the accelerateBy operation can only be used if the protocol state machine is already in the forward state. The system gets into the forward state by invoking the driveContinuoslyForward operation.
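To make the idea of contracts attached to an interface but checked for every provider more tangible, here is a minimal sketch in Python (not mbeddr C). The operation names follow the caption of Fig. 5.1; the concrete conditions, the class names and the wrapper mechanism are assumptions made for illustration only and do not reflect how mbeddr implements the checks.

```python
class ContractViolation(Exception):
    """Raised when a precondition, postcondition or protocol rule is violated."""


class CheckedDriveTrain:
    """Wraps any implementation and enforces the (hypothetical) interface contract."""

    def __init__(self, impl):
        self._impl = impl
        self._protocol_state = "initial"  # state of the protocol state machine

    def current_speed(self):
        # Query operation: no contract and no effect on the protocol.
        return self._impl.current_speed()

    def drive_continuously_forward(self, speed):
        if speed < 0:
            raise ContractViolation("pre: speed >= 0")
        self._impl.drive_continuously_forward(speed)
        if self._impl.current_speed() != speed:
            raise ContractViolation("post: current_speed() == speed")
        self._protocol_state = "forward"

    def accelerate_by(self, delta):
        if self._protocol_state != "forward":
            raise ContractViolation("protocol: accelerate_by requires state 'forward'")
        old_speed = self._impl.current_speed()  # plays the role of the 'old' keyword
        self._impl.accelerate_by(delta)
        if self._impl.current_speed() != old_speed + delta:
            raise ContractViolation("post: current_speed() == old + delta")


class SimpleDriveTrain:
    """One implementation behind the interface."""

    def __init__(self):
        self._speed = 0

    def current_speed(self):
        return self._speed

    def drive_continuously_forward(self, speed):
        self._speed = speed

    def accelerate_by(self, delta):
        self._speed += delta


class BuggyDriveTrain(SimpleDriveTrain):
    """A second implementation that is signature-compatible but semantically wrong."""

    def accelerate_by(self, delta):
        self._speed += 2 * delta
```

Both implementations pass a purely structural check. Only the wrapper detects that BuggyDriveTrain violates the postcondition, and that calling accelerate_by before drive_continuously_forward violates the protocol.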

Refrigerators: Cooling programs can access hardware elements (compressors, fans, valves); those are defined as part of the refrigerator hardware definition. To enable cooling programs to run with different, but similar hardware configurations, the hardware structure can use "trait inheritance", by which a hardware trait defines a set of hardware elements, acting as a kind of interface. Other hardware configurations can inherit these traits. As long as cooling programs are only written against traits, they work with any refrigerator that implements the particular set of traits against which the program is written.

5.1.4 Specialization

Specialization enables one entity to be a more specific variant of another. Typically, the more specific one can be used in all contexts in which the more general one is expected (the Liskov substitution principle2). The more general one may be incomplete, requiring the specialized ones to "fill in the holes". Specialization in the context of DSLs can be used for implementing variants or for evolving a program over time.

2 B. Liskov and J. M. Wing. A behavioral notion of subtyping. TOPLAS, 16(6):1811–1841, 1994

In GPLs, we know this approach from class inheritance. "Leaving holes" is realized by abstract methods.

Defining the semantics of inheritance for domain-specific language concepts is not always easy. The various approaches found in programming languages, as well as the fact that some of them lead to problems (multiple inheritance, diamond inheritance, linearization, or code duplication in Java’s interface inheritance), show that this is not a trivial topic. It is a good idea to just copy a suitable approach completely from a programming language in which inheritance seems to work well. Even small changes can make the whole approach inconsistent.

Pension Plans: The customer using this DSL had the challenge of creating a huge set of pension plans, implementing changes in relevant law over time, or implementing related plans for different customer groups. Copying complete plans and then making adaptations was not feasible, because this resulted in a maintenance nightmare: a large number of similar but not identical pension plans. Hence the DSL provides a way for pension plans to inherit from one another. Calculation rules can be marked abstract (needing to be overridden in sub-plans); final rules cannot be overridden. Visibility modifiers control which rules are considered "implementation details".

Refrigerators: A similar approach is used in the cooling DSL. Cooling programs can specialize other cooling programs. Since the programs are fundamentally state-based, we had to define what it means to specialize a cooling program: a subprogram can add additional event handlers and transitions to states. New states can be added, but states defined in the super-program cannot be removed.
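The inheritance rules described for the pension plans (abstract rules as "holes", final rules that cannot be overridden) can be sketched as follows. This is an illustration in Python under my own assumptions: the Plan class, the rule names and the numbers are hypothetical and are not taken from the pension workbench.

```python
class Plan:
    """A plan holds named rules and may inherit from a parent plan."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.rules = {}  # rule name -> (kind, formula); kind is normal/abstract/final

    def lookup(self, rule):
        """Find the most specific definition of a rule along the inheritance chain."""
        plan = self
        while plan is not None:
            if rule in plan.rules:
                return plan.rules[rule]
            plan = plan.parent
        raise KeyError(rule)

    def define(self, rule, formula=None, kind="normal"):
        """Define or override a rule; overriding an inherited final rule is an error."""
        if self.parent is not None:
            try:
                inherited_kind, _ = self.parent.lookup(rule)
            except KeyError:
                inherited_kind = None
            if inherited_kind == "final":
                raise ValueError(f"rule '{rule}' is final and cannot be overridden")
        self.rules[rule] = (kind, formula)

    def holes(self):
        """Names of the rules that are still abstract in this plan."""
        names, plan = set(), self
        while plan is not None:
            names.update(plan.rules)
            plan = plan.parent
        return sorted(n for n in names if self.lookup(n)[0] == "abstract")

    def evaluate(self, rule, **inputs):
        kind, formula = self.lookup(rule)
        if kind == "abstract":
            raise NotImplementedError(f"rule '{rule}' is abstract in plan '{self.name}'")
        # Formulas receive the plan, so they can call other (possibly overridden) rules.
        return formula(self, **inputs)


# Hypothetical example: the base plan leaves a hole that the sub-plan fills in.
base = Plan("BasePlan")
base.define("accrual_percent", kind="abstract")
base.define(
    "pension",
    lambda plan, salary: salary * plan.evaluate("accrual_percent") // 100,
    kind="final",
)

staff = Plan("StaffPlan", parent=base)
staff.define("accrual_percent", lambda plan: 2)
```

Evaluating "pension" on staff uses the final formula from the base plan together with the accrual_percent supplied by the sub-plan. Evaluating it on base fails because the hole is not filled, which is the behavior one would expect from an abstract plan.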

5.1.5 Types and Instances

The distinction between types and instances supports the definition of structures that can be parametrized upon instantiation. This allows reuse of common parts, and expressing variability via parameters.

In programming languages we know this from classes and objects (where constructor parameters are used for parametrization) or from components (where different instances can be connected differently to other instances).

mbeddr C: Apart from C’s structs (which are instantiatable data structures) and components (which can be instantiated and connected), state machines can be instantiated as well. Each instance can be in a different state at any given time.
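As a small illustration of instantiable state machines, the following Python sketch separates the type (the transition table, defined once) from the instances (which only carry their current state). The class names and the example events are assumptions for illustration; mbeddr's actual notation differs.

```python
class StateMachineType:
    """The type: states and transitions are defined once and shared."""

    def __init__(self, initial, transitions):
        self.initial = initial
        self.transitions = transitions  # (state, event) -> next state

    def instantiate(self):
        return StateMachineInstance(self)


class StateMachineInstance:
    """An instance: refers to its type and holds only its own current state."""

    def __init__(self, machine_type):
        self.type = machine_type
        self.state = machine_type.initial

    def trigger(self, event):
        # Events without a matching transition are ignored in this sketch.
        self.state = self.type.transitions.get((self.state, event), self.state)


# Hypothetical example: two doors share one definition but evolve independently.
door = StateMachineType("closed", {("closed", "open"): "opened",
                                   ("opened", "close"): "closed"})
front, back = door.instantiate(), door.instantiate()
front.trigger("open")
```

After the last line, front is in the state "opened" while back is still "closed".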

5.1.6 Superposition and Aspects

Superposition refers to the ability to merge several model fragments according to some DSL-specific merge operator. Aspects provide a way of "pointing to" several locations in a program based on a pointcut operator (essentially a query over a program or its execution), adapting the model in ways specified by the aspect. Both approaches support the compositional creation of many different model variants from the same set of model fragments.

This is especially important in the context of product line engineering and is discussed in Section 21.

Component Architecture: This DSL provides a way of advising component definitions from an aspect (Fig. 5.2). An aspect may introduce an additional provided port mon: IMonitoring that allows a central monitoring component to query the advised components via the IMonitoring interface.

Figure 5.2: The aspect component contributes an additional required port to each of the other components defined in the system.
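The combination of pointcut (a query that selects components) and advice (the port to be added) can be sketched like this. The sketch is in Python and follows the body text in adding a provided port mon: IMonitoring; the dictionary representation of components, the component names and the use of name patterns as the pointcut language are assumptions for illustration.

```python
import copy
import fnmatch


def weave(components, pointcut, provided_ports):
    """Return a copy of the model in which every component whose name matches
    the pointcut pattern gains the given provided ports."""
    woven = copy.deepcopy(components)
    for name, component in woven.items():
        if fnmatch.fnmatchcase(name, pointcut):
            component["provides"].update(provided_ports)
    return woven


# Hypothetical component model: component name -> ports.
components = {
    "DelayCalculator": {"provides": {"default": "IDelayCalculator"}, "requires": {}},
    "InfoScreen": {"provides": {}, "requires": {"calc": "IDelayCalculator"}},
}

# The pointcut "*" selects all components, as in the aspect of Fig. 5.2.
monitored = weave(components, "*", {"mon": "IMonitoring"})
```

Because weave returns a new model, the unadvised variant remains available. This is one simple way of creating several model variants from the same fragments.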

WebDSL: Entity declarations can be extended in separate modules. This makes it possible to declare in one module all data declarations of a particular feature. For example, in the researchr application, a Publication can be Tagged, which requires an extension of the Publication entity. This extension is defined in the tag module, together with the definition of the Tag entity. This is essentially a use of superposition.
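A merge operator for this kind of superposition can be very simple. The following Python sketch merges entity declarations from several modules by taking the union of their properties and rejecting conflicting redefinitions. The property names and types are hypothetical, and the conflict rule is my assumption; WebDSL's actual merge semantics may differ.

```python
def superimpose(*modules):
    """Merge entity declarations from several modules into one model.

    Each module maps entity names to a dict of property name -> type.
    Properties of the same entity are unioned; conflicting types are an error.
    """
    merged = {}
    for module in modules:
        for entity, properties in module.items():
            target = merged.setdefault(entity, {})
            for prop, prop_type in properties.items():
                if target.get(prop, prop_type) != prop_type:
                    raise ValueError(f"conflicting types for {entity}.{prop}")
                target[prop] = prop_type
    return merged


# Hypothetical modules: the tag module defines Tag and extends Publication.
publication_module = {"Publication": {"title": "String"}}
tag_module = {
    "Tag": {"name": "String"},
    "Publication": {"tags": "Set<Tag>"},
}

model = superimpose(publication_module, tag_module)
```

Leaving out tag_module yields a model without tagging, so the feature is added or removed by choosing which fragments to merge.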

5.1.7 Versioning

Often, variability over time of elements in DSL programs has to be tracked. One alternative is to simply version the model files using existing version control systems, or the version control mechanism built into the language workbench. However, this requires users to interact with often complex version control systems and prevents domain-specific adaptations of the version control strategy.

The other alternative is to make versioning and tracking over time a part of the language. For example, model elements may be tagged with version numbers, or specify a revision chain by pointing to a previous revision, enforcing compatibility constraints between those revisions. Instead of carrying explicitly declared versions, business data is often time-dependent: different revisions of a business rule apply to different periods of time. Support for these approaches can be built directly into the DSL, with various levels of tool support.

mbeddr C: No versioning is built into the DSL. Users work with MPS’ integration with popular version control systems. Since this DSL is intended for use by programmers, working with existing version control systems is not a problem.

Component Architecture: Components can specify a "new version of" reference to another component. In this case, the new version may specify additional provided ports with the same interfaces, or with new versions of these interfaces. The new version may also deprecate required ports. Effectively, this means that the new version of something must be replacement-compatible with the old version (the Liskov substitution principle again).

Pension Plans: In the pension workbench, calculation rules declare applicability periods. This supports the evolution of calculation rules over time, while retaining reproducibility for calculations performed at an earlier point in time. Since the Intentional Domain Workbench is a projectional tool, pension plans can be shown with only the version of a rule that is valid for a given point in time.
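Time-dependent rules with applicability periods can be sketched as follows. This is an illustration in Python; the VersionedRule class, the rule name, the dates and the percentages are all hypothetical and not taken from the pension workbench.

```python
from datetime import date


class VersionedRule:
    """A rule whose formula depends on the date for which it is evaluated."""

    def __init__(self, name):
        self.name = name
        # Each entry is (valid_from, valid_until, formula); valid_until is
        # exclusive, and None means "open-ended".
        self.versions = []

    def add_version(self, valid_from, valid_until, formula):
        for start, end, _ in self.versions:
            overlaps = (end is None or valid_from < end) and (
                valid_until is None or start < valid_until
            )
            if overlaps:
                raise ValueError("applicability periods must not overlap")
        self.versions.append((valid_from, valid_until, formula))

    def evaluate(self, as_of, **inputs):
        """Evaluate the version of the rule that applies on the given date."""
        for start, end, formula in self.versions:
            if start <= as_of and (end is None or as_of < end):
                return formula(**inputs)
        raise LookupError(f"no version of '{self.name}' applies on {as_of}")


# Hypothetical rule: the contribution rate changes at the start of 2012.
contribution = VersionedRule("contribution")
contribution.add_version(date(2005, 1, 1), date(2012, 1, 1),
                         lambda salary: salary * 4 // 100)
contribution.add_version(date(2012, 1, 1), None,
                         lambda salary: salary * 5 // 100)
```

A calculation for a date in 2011 keeps using the old formula even after the new version has been added, which is what makes earlier calculations reproducible.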

5.2 Behavior

The behavior expressed with a DSL must of course be aligned with the needs of the domain. However, in many cases, the behavior required for a domain can be derived from well-known behavioral paradigms3, with slight adaptations or enhancements, or simply by interacting with domain-specific structures or data.

3 The term Model of Computation is also used to refer to behavioral paradigms. I prefer "behavioral paradigm" because the term model is obviously heavily overloaded in the DSL/MDSD space already.

Note that there are two kinds of DSLs that don’t make use of these kinds of behavior descriptions. Some DSLs really just specify structures. Examples include data definition languages or component description languages (although both of them often use expressions for derived data, data validation or pre- and post-conditions). Other DSLs specify a set of expectations regarding some behavior (declaratively), and the generator creates the algorithmic implementation. For example, a DSL may specify, simply with a tag such as async, that the communication between two components shall be asynchronous. The generator then maps this to an implementation that behaves according to this specification.

Component Architecture: The component architecture DSL is an example of a structure-only DSL, since it only describes black box components and their interfaces and relationships. It uses the specification-only approach to specify whether a component port is intended for synchronous or asynchronous communication.

mbeddr C: The component extension provides a similar notion of interfaces, ports and components as in the previous example. However, since here they are directly integrated with C, C expressions can be used for pre- and post-conditions of interface operations (see Fig. 5.1).

Using an established behavioral paradigm for a DSL has several advantages4. First, it is not necessarily simple to define consistent and correct semantics in the first place. By reusing an existing paradigm, one can learn about advantages and drawbacks from existing experience. Second, a paradigm may already come with existing means for performing interesting analyses (as in model checking or SMT solving) that can easily be used to analyse DSL programs. Third, there may be existing generators from a behavioral paradigm to an efficient executable for a given platform (state machines are a prime candidate). By generating a model in a formalism for which such a generator exists, we reduce the effort for building an end-to-end generator. If our DSL uses the same behavioral paradigm as the language for which the generator exists, writing the necessary transformation is straightforward (from a semantic point of view). The last point emphasizes that using an existing paradigm for a DSL (e.g. state-based) does not mean that the concepts have to directly use the abstractions used by that paradigm (just because a program is state-based does not mean that the concept that acts as a state has to be called state, etc.).

4 Which is why we discuss these paradigms in this book.

This section describes some of the most well-known behavioral paradigms that can serve as useful starting points for behavior descriptions in DSLs. In addition to describing the paradigm, we also briefly investigate how easily programs using the paradigm can be analyzed, and how complicated it is to build debuggers.

This is only an overview of a few paradigms; many more exist. I refer to the excellent Wikipedia entry on Programming Paradigms and to the book P. V. Roy and S. Haridi. Concepts, Techniques, and Models of Computer Programming. MIT Press, 2004

5.2.1 Imperative

Imperative programs consist of a sequence of statements, or instructions, that change the state of the program. This state may be local to some kind of module (e.g., a procedure or an object), global (as in global variables) or external (when communicating with peripheral devices). Procedural and object-oriented programming are both imperative, using different means for structuring and (in the case of OO) specialization. Because of aliasing and side effects, imperative programs are expensive to analyse. Debugging imperative programs is straightforward and involves stepping through the instructions and watching the state change.

For many people, often including domain experts, this approach is easy to understand. Hence it is often a good starting point for DSLs.

mbeddr C: Since C is used as a base language, this language is fundamentally imperative. Some of the DSLs on top of it use other paradigms (the state machine extension is state-based, for example).

Refrigerators: The cooling language integrates various paradigms, but contains sequences of statements to implement aspects of the overall cooling behavior.

5.2.2 Functional

Functional programming uses functions as the core abstraction. In purely functional programming, a function’s return value only depends on the values of its arguments. Calling the same function several times with the same argument values returns the same result (that value may even be cached!). Functions cannot access global mutable state, and no side effects are allowed. These characteristics make functional programs very easy to analyze and optimize. These same characteristics, however, also make purely functional programming relatively useless, because it cannot affect its environment (after all, this would be a side effect). So, functional programming is often only used for parts ("calculation core") of an overall program and integrates with, for example, an imperative part that deals with IO.

Since there is no changing state to observe as the program steps through instructions, debugging can be done by simply showing all intermediate results of all function calls as some kind of tree, basically "inspecting" the state of the calculation. This makes building debuggers relatively simple.

Pension Plans: The calculation core of pension rules is functional. Consequently, a debugger has been implemented that, for a given set of input data, shows the rules as a tree that shows all intermediate results of each function call (Fig. 5.3). No "step through" debugger is necessary.

Pure expressions are an important subset of functional programming (as in i > 3*2 + 7). Instead of calling functions, operators are used. However, operators are just infix notations for function calls. Usually the operators are hard wired into the language and it is not possible for users to define their own functional abstractions. The latter is the main differentiator to functional programming in general. It also limits expressivity, since it is not possible to modularize an expression or to reuse expressions by packaging them into a user-defined function. Consequently, only relatively simple tasks can be addressed with a pure expression language5.

5 However, many DSLs do not require anything more sophisticated, especially if powerful domain-specific operators are available. So, while expression languages are limited in some sense, they are still extremely useful and widespread.

mbeddr C: We use expressions in the guard conditions of the state machine extension as well as in pre- and post-conditions for interface operations. In both cases it is not possible to define or call external functions. Of course, (a subset of) C’s expression language is reused here.
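The "debugger as a tree of intermediate results" idea is easy to sketch for a pure expression language. The Python code below evaluates the expression i > 3*2 + 7 from the text and records every intermediate value. The tuple-based representation of expressions and the function names are assumptions for illustration and are unrelated to the pension workbench's implementation.

```python
import operator

# The hard-wired operators of this tiny expression language.
OPERATORS = {"+": operator.add, "*": operator.mul, ">": operator.gt}


def trace(expr, env):
    """Evaluate a pure expression and return (value, tree of intermediate results).

    An expression is a number, a variable name, or a tuple (op, left, right).
    """
    if isinstance(expr, str):  # variable reference
        value = env[expr]
        return value, {"expr": expr, "value": value, "children": []}
    if isinstance(expr, (int, float)):  # literal
        return expr, {"expr": repr(expr), "value": expr, "children": []}
    op, left, right = expr
    left_value, left_tree = trace(left, env)
    right_value, right_tree = trace(right, env)
    value = OPERATORS[op](left_value, right_value)
    return value, {"expr": op, "value": value, "children": [left_tree, right_tree]}


def render(node, indent=0):
    """Turn the result tree into indented text lines, one per sub-expression."""
    lines = ["  " * indent + f"{node['expr']} = {node['value']}"]
    for child in node["children"]:
        lines.extend(render(child, indent + 1))
    return lines


# i > 3*2 + 7, evaluated for i = 20
expression = (">", "i", ("+", ("*", 3, 2), 7))
result, tree = trace(expression, {"i": 20})
report = render(tree)
```

Since the evaluation has no side effects, the whole tree can be computed in one go and inspected afterwards; there is nothing to step through.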

Figure 5.3: Debugging functional programs can be done by showing the state of the calculation, for example as a tree.

5.2.3 Declarative

Declarative programming can be considered the opposite of imperative programming (and, to some extent, of functional programming). A declarative program does not specify any control flow; it does not specify a sequence of steps of a calculation. A declarative program only specifies what the program should accomplish, not how. This is often achieved by specifying a set of properties, equations, relationships or constraints. Some kind of evaluation engine then tries to find solutions. The particular advantage of this approach is that it does not predefine how a solution is found; the evaluation engine has a lot of freedom in doing so, possibly using different approaches in different environments, or evolving the approach over time6. This large degree of freedom often makes finding the solution expensive: trial and error, backtracking or exhaustive search may be used7. Debugging declarative programs can be hard, since the solution algorithm may be very complex and possibly not even be known to the user of the language.

6 For example, the strategies for implementing SAT solvers have evolved quite a bit over time. SAT solvers are much more scalable today. However, the formalism for describing the logic formulas that are processed by SAT solvers has not changed.

7 Users often have to provide hints to the engine to make it run fast enough or scale to programs of relevant size. In practice, declarative programming is often not as "pure" as it is in theory.

Declarative programming has many important subgroups. For concurrent programs, a declarative approach allows the efficient execution of a single program on different parallel hardware structures. The compiler or runtime system allocates the program to available computational resources. In constraint programming, the programmer specifies constraints between a set of variables. The engine tries to find values for these variables that satisfy all constraints. Solving mathematical equation systems is an example, as is solving sets of Boolean logic formulas. Logic programming is another sub-paradigm, in which users specify logic clauses (facts and relations) as well as queries. A theorem prover then tries to solve the queries. The Prolog language works in this way.

Component Architecture: This DSL specifies timing and resource characteristics for component and interface operations. Based on this data, one could run an algorithm which allocates the component instances to computing hardware so that the hardware is used as efficiently as possible, while at the same time reducing the amount of network traffic. This is an example of constraint solving used to synthesize a schedule.

mbeddr C: This DSL supports presence conditions for product line engineering. A presence condition is a Boolean expression over a set of configuration features that determines whether the associated piece of code is present for a given combination of feature selections (Fig. 5.4). To verify the structural integrity of programs in the face of varying feature combinations, constraint programming is used (to ensure that there is no configuration of the program in which a reference to a symbol is included, but the referenced symbol is not). A set of Boolean equations is generated from the program and the attached presence conditions. A solver then makes sure they are consistent by trying to find an example solution that violates the Boolean equations.

Example: The Yakindu DAMOS block diagram editor supports custom block implementation based on the Mscript language (Section 5.5). It supports declarative specification of equations between input and output parameters of a block. A solver computes a closed, sequential solution that efficiently calculates the output of an overall block diagram.

Figure 5.4: This module contains variability expressed with presence conditions. The affected program elements are highlighted in a color (shade in the screenshot) that represents the condition. If the feature highRes is selected, the code uses a double instead of an int8_t. The log messages are only included if the logging feature is selected. Note that one can depend not just on single features (such as logging) but also on arbitrary expressions such as logging && highRes.
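The integrity check for presence conditions can be illustrated with a brute-force search for a violating configuration. Note that this Python sketch replaces the solver with plain enumeration of all feature combinations, which only works for a handful of features; it is not how mbeddr performs the check. The function name and the example conditions are assumptions; the feature names logging and highRes are taken from Fig. 5.4.

```python
from itertools import product


def find_dangling_configuration(features, reference_present, declaration_present):
    """Search for a feature configuration in which a reference is present
    but the symbol it refers to is not. Returns None if no such
    configuration exists, otherwise the first counter-example found."""
    for values in product([False, True], repeat=len(features)):
        config = dict(zip(features, values))
        if reference_present(config) and not declaration_present(config):
            return config
    return None


features = ["logging", "highRes"]

# Hypothetical case 1: the declaration is present if logging is selected,
# and the reference is present only if logging && highRes holds.
consistent = find_dangling_configuration(
    features,
    reference_present=lambda c: c["logging"] and c["highRes"],
    declaration_present=lambda c: c["logging"],
)

# Hypothetical case 2: the reference depends on highRes alone.
broken = find_dangling_configuration(
    features,
    reference_present=lambda c: c["highRes"],
    declaration_present=lambda c: c["logging"],
)
```

In the first case no counter-example exists. In the second case the search reports the configuration without logging but with highRes, in which the reference would dangle.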

Example: Another example for declarative programming is the type system DSL used by MPS itself. Language developers specify a set of type equations containing free type variables, among other things. A unification engine tries to solve the set of equations by assigning actual types to the free type variables so that the set of equations is consistent. We describe this approach in detail in Section 10.4.

Figure 5.5: An Mscript block specifies input and output arguments of a block (u and v) as well as configuration parameters (initialCondition and gain). The assertions specify constraints on the data the block works with. The eq statements specify how the output values are calculated from the input values. Stateful behaviors are supported, where the value for the n-th step depends on values from previous steps (e.g., n − 1).

5.2.4 Reactive/Event-based/Agent

In this paradigm, behavior is triggered based on received events. Events may be created by another entity or by the environment (through a device driver). Reactions are expressed by the creation of other events. Events may be globally visible or explicitly routed between entities, possibly using filters and/or using priority queues. This approach is often used in embedded systems that have to interact with the real world, where the real world produces events as it changes. A variant of this approach queries input signals at intervals controlled by a scheduler and considers changes in input signals as the events.

Refrigerators: The cooling algorithms are reactive programs that control the cooling hardware based on environment events. Such events include the opening of a refrigerator door, the crossing of a temperature threshold, or a timeout that triggers defrosting of a cooling compartment. Events are queued, and the queues are processed in intervals determined by a scheduler.

Debugging is simple if the timing/frequency of input events can be controlled. Visualizing incoming events and the code that is triggered as a reaction is relatively simple. If the timing of input events cannot be controlled, then debugging can be almost impossible, because humans are much too slow to fit "in between" events that may be generated by the environment in rapid succession. For this reason, various kinds of simulators are used to debug the behavior of reactive systems, and sophisticated diagnostics regarding event frequencies or queue filling levels may have to be integrated into the programs as they run in the real environment.

Refrigerators: The cooling language comes with a simulator (Fig. 5.6) based on an interpreter in which the behavior of a cooling algorithm can be debugged. Events are explicitly created by the user, on a timescale that is compatible with the debugging process.

Figure 5.6: The simulator for the cooling language shows the state of the system (commands, event queue, value of hardware properties, variables and tasks). The program can be single-stepped. The user can change the value of variables or hardware properties as a means of interacting with the program.
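The following Python sketch shows the core of such a simulator: events are queued, reactions create further events, and the user decides when an event is injected and when the queue is processed. The class, the event names and the handlers are hypothetical; the sketch does not reproduce the cooling language or its simulator.

```python
from collections import deque


class ReactiveProgram:
    """Events are queued; handlers react to events by creating other events."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}  # event name -> list of handler functions
        self.log = []       # processed events, for visualization

    def on(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def emit(self, event, **data):
        self.queue.append((event, data))

    def step(self):
        """Process exactly one queued event (single-stepping)."""
        if not self.queue:
            return False
        event, data = self.queue.popleft()
        self.log.append(event)
        for handler in self.handlers.get(event, []):
            handler(self, **data)
        return True

    def tick(self):
        """One scheduler interval: process the events queued at its start.
        Events created as reactions wait for the next interval."""
        for _ in range(len(self.queue)):
            self.step()


# Hypothetical reactions of a cooling program.
program = ReactiveProgram()
program.on("door_opened", lambda p: p.emit("light_on"))
program.on("temperature_above", lambda p, limit: p.emit("compressor_on"))

# In the simulator, the user creates the events explicitly.
program.emit("door_opened")
program.emit("temperature_above", limit=8)
program.tick()
```

After this first tick, the log contains the two injected events, and the queue holds the two reaction events, ready to be inspected before the next tick is triggered.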

5.2.5 Dataflow

The dataflow paradigm is centered around variables with dependencies (in terms of calculation rules) among them. As a variable changes, the variables that depend on the changing variable are recalculated. We know this approach mainly from two use cases. One is spreadsheets: cell formulas express dependencies to other cells. As the values in these other cells change, the dependent cells are updated. The other use case is data flow (or block) diagrams (Fig. 5.7), used in embedded software, extract-transform-load (ETL) data processing systems and enterprise messaging/complex event processing (CEP). There, the calculations or transformations are encapsulated in the blocks, and the lines represent dependencies: the output of one block "flows" into the input slot of another block.

Figure 5.7: Graphical notation for flow.

There are three different execution modes:

• The first one considers the data values as continuous signals. At the time one of the inputs changes, all dependent values are recalculated. The change triggers the recalculation, and the recalculation ripples through the dependency graph. This is the model used in spreadsheets.

• The second one considers the data values as quantized, unique messages. A new output message is calculated only if a message is available for all inputs. The recalculation synchronizes on the availability of a message at each input, and upon recalculation, these messages are consumed. This approach is often used in ETL and CEP systems.

• The third approach is time-triggered. Once again, the inputs are understood to be continuous signals, and a scheduler determines when a new calculation is performed. The scheduler also makes sure that the calculation "ripples through from left to right" in the correct order. This model is typically used in embedded systems.
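The first of the three execution modes, change-triggered recalculation as in spreadsheets, can be sketched in a few lines of Python. The Sheet class and the cell names are hypothetical, and the sketch makes two simplifying assumptions that a real engine would not: the dependency graph is acyclic, and a cell may be recalculated more than once per change.

```python
class Sheet:
    """Cells hold values; formulas declare which cells they depend on."""

    def __init__(self):
        self.values = {}
        self.formulas = {}  # cell -> (input cells, function)

    def define(self, cell, inputs, function):
        """Attach a formula to a cell. All inputs must already have values."""
        self.formulas[cell] = (inputs, function)
        self._recalculate(cell)

    def set(self, cell, value):
        """Change an input value; the change ripples through the graph."""
        self.values[cell] = value
        self._ripple(cell)

    def _recalculate(self, cell):
        inputs, function = self.formulas[cell]
        self.values[cell] = function(*(self.values[i] for i in inputs))
        self._ripple(cell)

    def _ripple(self, changed):
        # Recalculate every cell that depends directly on the changed one;
        # each recalculation in turn triggers its own dependents.
        for cell, (inputs, _) in self.formulas.items():
            if changed in inputs:
                self._recalculate(cell)


sheet = Sheet()
sheet.set("a", 2)
sheet.set("b", 3)
sheet.define("sum", ["a", "b"], lambda a, b: a + b)
sheet.define("double", ["sum"], lambda total: 2 * total)
sheet.set("a", 10)  # triggers sum, which in turn triggers double
```

The time-triggered mode would differ mainly in who calls the recalculation: instead of set triggering the ripple, a scheduler would recalculate all cells at fixed intervals in dependency order.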

Debugging these kinds of systems is relatively straightforward because the calculation is always in a distinct state. Dependencies and data flow, or the currently active block and the available messages, can easily be visualized in a block diagram notation. Note that the calculation rules themselves are considered black boxes here, whose internals may be built from any other paradigm, often functional. Integrating debuggers for the internals of boxes is a more challenging task.

5.2.6 State-based

The state-based paradigm describes a system’s behavior in terms of the states the system can be in, the transitions between these states, events that trigger these transitions and actions that are executed as states change. State machines are useful for systematically organizing the behavior of an entity. They can also be used to describe valid sequences of events, messages or procedure calls. State machines can be used in an event-driven mode in which incoming events actually trigger transitions and the associated actions. Alternatively a state machine can be run in a timed mode, in which a scheduler determines when event queues are checked and processed. Except for possible real-time issues, state machines are easy to debug by highlighting the contents of event queues and the current state8.

8 Apart from the imperative paradigm and simple expression languages, state machines are probably the paradigm that is most often used in DSLs.

mbeddr C: This language provides an extension that supports directly working with state machines. Events can be passed into a state machine from regular C code, or by mapping incoming messages in components to events in state machines that reside in components. Actions can contain arbitrary C code, unless the state machine is marked as verifiable, in which case actions may only create outgoing events or change state machine-local variables.

Refrigerators: The behavior of cooling programs is fundamentally state-based. A scheduler is used to execute the state machine at regular intervals. Transitions are triggered either by incoming, queued events or by changing property values of hardware building blocks. This language is an example where a behavioral paradigm is used without significant alterations, but working with domain-specific data structures: refrigerator hardware and its properties.

State-based behavior description is also interesting in the context of model checking. The model checker either determines that the state chart conforms to a set of specifications or provides a counter-example that violates the specifications. Specifications express something about sequences of states such as "It is not possible that two traffic lights show green at the same time" or "Whenever a pedestrian presses the request button, the pedestrian lights eventually will show green"9.

9 A good introduction to model checking can be found in: Berard, B., Bidoit, M., Finkel, A., Laroussinie, F., Petit, A., Petrucci, L., and Schnoebelen, P. Systems and Software Verification. Springer, 2001

In principle, any program can be represented as a state machine and can then be model checked. However, creating state machines from, say, a procedural C program is non-trivial, and the state machines also become very big very quickly. State-based programs are already state machines, and they are typically not that big either (after all, they have to be understood by the developer who creates and maintains them). Consequently, many realistically-sized state machines can be model checked efficiently.
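For small state machines, the essence of checking a safety property such as "two traffic lights never show green at the same time" is an exhaustive exploration of the reachable states. The Python sketch below does this with a breadth-first search and returns the event sequence leading to a violation as the counter-example. It is a deliberately naive illustration, not a real model checker: it handles only invariants (not liveness properties such as "eventually green"), and the two traffic light controllers are hypothetical.

```python
from collections import deque


def check_invariant(initial, transitions, invariant):
    """Explore all reachable states breadth-first.

    Returns None if the invariant holds in every reachable state, otherwise
    the shortest sequence of events that leads to a violating state."""
    seen = {initial}
    frontier = deque([(initial, [])])
    while frontier:
        state, events = frontier.popleft()
        if not invariant(state):
            return events
        for event, target in transitions.get(state, []):
            if target not in seen:
                seen.add(target)
                frontier.append((target, events + [event]))
    return None


# A state is a pair (car light, pedestrian light).
def never_both_green(state):
    return state != ("green", "green")


correct_controller = {
    ("green", "red"): [("request", ("red", "red"))],
    ("red", "red"): [("walk", ("red", "green"))],
    ("red", "green"): [("timeout", ("green", "red"))],
}

faulty_controller = {
    ("green", "red"): [("request", ("green", "green"))],
    ("green", "green"): [("timeout", ("green", "red"))],
}

ok = check_invariant(("green", "red"), correct_controller, never_both_green)
counter_example = check_invariant(("green", "red"), faulty_controller, never_both_green)
```

For the correct controller the search visits three states and finds no violation. For the faulty one it reports that a single request event is enough to reach the forbidden state.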

5.3 Combinations

The behavioral paradigm also plays a role in the context of language composition. If two to-be-composed languages use different behavioral paradigms, the composition can become really challenging. For example, combining a continuous system (which works with continuous streams of data) with a discrete event-based system requires temporal integration. We won’t discuss this topic in detail in this book10. However, it is obvious that combining systems that use the same paradigm is much simpler. Alternatively, some paradigms can be integrated relatively easily; for example, it is relatively simple to map a state-based system onto an imperative system.

10 It is still very much a research topic.

Many DSLs use combinations of various behavioral and structural paradigms described in this section11.

11 Note how this observation leads to the desire to better modularize and reuse some of the above paradigms. Room for research :-)

Some combinations are very typical:

• A data flow language often uses a functional, imperative or declarative language to describe the calculation rules that express the dependencies between the variables (the contents of the boxes in data flow diagrams or of cells in spreadsheets). Fig. 4.45 shows an example block diagram, and Fig. 5.5 shows an example implementation.

• State machines use expressions as transition guard conditions, as well as typically an imperative language for expressing the actions that are executed as a state is entered or left, or when a transition is executed. An example can be seen in Fig. 20.7.

• Reactive programming, in which "black boxes" react to events, often uses data flow or state-based programming to implement the behavior that determines the reactions.

• In purely structural languages, for example those for expressing components and their dependencies, a functional/expression language is often used to express pre- and post-conditions for operations. A state-based language is often used for protocol state machines, which determine the valid order of incoming events or operation calls.
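The second combination in the list above (a state machine whose guards are expressions and whose actions are imperative code) can be sketched as follows. This is only an illustration in Python, not the implementation of any of the tools mentioned; the class names, the `tick` event and the temperature thresholds are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Transition:
    event: str
    target: str
    # Guard: an expression over the context that must hold for the transition.
    guard: Callable[[dict], bool] = lambda ctx: True
    # Action: imperative code executed when the transition is taken.
    action: Callable[[dict], None] = lambda ctx: None


@dataclass
class StateMachine:
    state: str
    transitions: Dict[str, List[Transition]]
    context: dict = field(default_factory=dict)
    entry_actions: Dict[str, Callable[[dict], None]] = field(default_factory=dict)

    def fire(self, event: str) -> bool:
        """Take the first enabled transition for the event, if any."""
        for t in self.transitions.get(self.state, []):
            if t.event == event and t.guard(self.context):
                t.action(self.context)
                self.state = t.target
                entry = self.entry_actions.get(self.state)
                if entry is not None:
                    entry(self.context)
                return True
        return False


def compressor_on(ctx):
    ctx["compressor"] = True


def compressor_off(ctx):
    ctx["compressor"] = False


def log_switch(ctx):
    ctx["log"].append("switching")


# Hypothetical cooling program: thresholds are invented for illustration.
machine = StateMachine(
    state="idle",
    transitions={
        "idle": [Transition("tick", "cooling",
                            guard=lambda c: c["temp"] > 8, action=log_switch)],
        "cooling": [Transition("tick", "idle",
                               guard=lambda c: c["temp"] <= 4, action=log_switch)],
    },
    context={"temp": 5, "compressor": False, "log": []},
    entry_actions={"cooling": compressor_on, "idle": compressor_off},
)

fired_cold = machine.fire("tick")   # guard is false at 5 degrees: nothing happens
machine.context["temp"] = 10
fired_warm = machine.fire("tick")   # guard holds: switch to "cooling"
```

The state machine itself knows nothing about refrigerators: the guards and actions are plugged in from the outside, which is the point the surrounding text makes about combining a reusable behavioral paradigm with a domain-specific expression or action language.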

Note that these combinations can be used to make well-established paradigms domain-specific. For example, in the Yakindu State Chart Tools (Fig. 20.7), a custom DSL can be plugged into an existing, reusable state machine language and editor. One concrete example is an action language that references another DSL that describes UI structures. This allows the state machine to be used to orchestrate the behavior of the UI.

Some of the case studies used as examples in this part of the book also use combinations of several paradigms.

Pension Plans: The pension language uses functional abstractions with mathematical symbols for the core actuary mathematics. A functional language with a plain textual syntax is used for the higher-level pension calculation rules. A spreadsheet/data flow language is used for expressing unit tests for pension rules. Various nesting levels of namespaces are used to organize the rules, the most important of which is the pension plan. A plan contains calculation rules as well as test cases for those rules. Pension plans can specialize other plans as a means of expressing variants. Rules in a sub-plan can override rules in the plan from which the sub-plan inherits. Plans can be declared to be abstract, with abstract rules that have to be implemented in sub-plans. Rules are versioned over time, and the actual calculation formula is part of the version. Thus, a pension plan’s behavior can be made to be different for different points in time.

Refrigerators: The cooling behavior is described as a reactive system. Events are produced by hardware elements12. A state machine constitutes the top-level structure. Within it, an imperative language is used. Programs can inherit from another program, overriding states defined in the base program: new transitions can be added, and the existing transitions can be overridden as a way for an extended program to "plug into" the base program.

12 Technically it is of course the driver of the hardware element, but in terms of the model the events are associated with the hardware element directly.

6 Process Issues

Software development with DSLs requires a compatible development process. A lot of what’s required is similar to what’s required for working with any other reusable artifact such as a framework: a workable process must be established between those who build the reusable artifact and those who use it. Requirements have to flow in one direction, and a finished, stable, tested and documented product has to be delivered in the other direction. Also, using DSLs can be a fundamental change for all involved, especially the domain experts. In this chapter we provide some guidelines for the process.

6.1 DSL Development

6.1.1 Requirements for the Language

How do you find out what your DSL should express? What are the relevant abstractions and notations? This is a non-trivial issue; in fact it is one of the key issues in developing DSLs. It requires a lot of domain expertise, thought and iteration. The core problem is that you’re trying not just to understand one problem, but rather a class of problems. Understanding and defining the extent and nature of this class of problems can be a lot of work. There are three typical, fundamentally different cases.

The first one concerns technical DSLs where the source for a language is often an existing framework, library, architecture or architectural pattern (the inductive approach). The knowledge often already exists, and building the DSL is mainly about

factoring the knowledge into a language: defining a notation, putting it into a formal language, and building generators to generate (parts of) the implementation code. In the process, you often also want to put in place reasonable defaults for some of the framework features, thereby increasing the level of abstraction and making framework use easier.

mbeddr C: This was the approach taken by the extensible C case study. There is a lot of experience in embedded software development, and some of the most pressing challenges are the same throughout the industry. When the DSL was built, we talked to expert embedded software developers to find out what these central challenges were. We also used an inductive approach and looked at existing C code to identify idioms and patterns. We then defined extensions to C that provided linguistic abstractions for the most important patterns and idioms.

The second case addresses business domain DSLs. There you can often mine the existing (tacit) knowledge of domain experts (deductive approach). In domains like insurance, science or logistics, domain experts are absolutely capable of precisely expressing domain knowledge. They do it all the time, often using Excel or Word. Other domain artifacts can also be exploited in the same way: for example, hardware structures or device features are good candidates for abstractions in the respective domains. So are existing user interfaces: they face users directly, and so are likely to contain core domain abstractions. Other sources are standards for an industry, or training material. Some domains even have an agreed ontology containing concepts relevant to that domain and recognized as such by a community of stakeholders. DSLs can be (partly) derived from such domain ontologies.

Pension Plans: The company for which the pension DSL was built had a lot of experience with pension plans. This experience was mostly in the heads of (soon to be retiring) senior domain experts.
They also already had the core of the DSL: a "rules language". The people who defined the pension plans would write rules as Word documents to "formally" describe the pension plan behavior. This was not terribly productive because of the missing tool support, but it meant that the core of the DSL was known. We still had to run a long series of workshops to work out necessary changes to the language, clean up loose ends and discuss modularization and reuse in pension plans.

In the two cases discussed so far, it is pretty clear how the DSL is going to look in terms of core abstractions; discussions will be about details, notation, how to formalize things, viewpoints, partitioning and the like (although all these can be pretty non-trivial too!).

In the remaining third case, however, we are not so lucky. If no domain knowledge is easily available we have to do an actual domain analysis, digging our way through requirements, stakeholder "war stories" and existing applications. People may be knowledgeable, but they might be unable to conceptualize their domain in a structured way – it is then the job of the language designer to provide the structure and consistency that is necessary for defining a language. Co-evolving language and concepts (see below) is a successful technique, especially in this case. One of my most successful approaches in this case is to build "straw men": trying to understand something, factor it into some kind of regular structure, and then re-explain that structure back to the stakeholders.

Refrigerators: At the beginning of the project, all cooling algorithms were implemented in C. Specifications were written in Word documents as prose (with tables and some physical formulas). It was not really clear at the beginning what the right abstraction level would be for a DSL suitable for the thermodynamics experts. It took several iterations to settle on the asynchronous, state-based structure described earlier.

For your first DSL, try to catch case one or two. Ideally, start with case one, since the people who build the DSLs – software architects and developers – are often the same as the domain experts.

6.1.2 Iterative Development

Some people use DSLs as an excuse to reintroduce waterfall processes. They spend months and months developing languages, tools and frameworks. Needless to say, this is not a very successful approach. You need to iterate when developing the language.

Start by developing some deep understanding of a small part of the domain for which you build the DSL. Then build a little bit of language, build a little bit of generator and develop a small example model to verify what you just did. Ideally, implement all aspects of the language and processor for each new domain requirement before focusing on new requirements1.

1 IDE polishing is probably something you want to postpone a little, and not do as part of every iteration.

Novices to DSLs especially tend to get languages and meta models wrong because they are not used to "thinking meta". You can avoid this pitfall by immediately trying out your new language feature by building an example model and developing a compatible generator to verify that you can actually generate the relevant artifacts.

Refrigerators: To solidify our choices regarding language abstractions, we prototypically implemented several example refrigerators. During this process we found the need for more and more language abstractions. We noticed early on that we needed a way to test the example programs, so we implemented the interpreter and simulator relatively early. In each iteration, we extended the language as well as the interpreter, so the domain experts could experiment with the language even though we did not yet have a C code generator.

It is important that the language approaches some kind of stable state over time (Fig. 6.1). As you iterate, you will encounter the following situation: domain experts express requirements that may sound inconsistent. You add all kinds of exceptions and corner cases to the language. Your language grows in size and complexity. After a number of these exceptions and corner cases, ideally the language designer will spot the systematic nature behind them and refactor the language to reflect this deeper understanding of the domain. Language size and complexity are reduced. Over time, the amplitude of these changes in language size and complexity (the error bars in Fig. 6.1) should become smaller, and the language size and complexity should approach a stable level (ss in Fig. 6.1).

Figure 6.1: Iterating towards a stable language over time. It is a sign of trouble if the language complexity does not approach some kind of stable state over time.
Component Architecture: A nice example of spotting a systematic nature behind a set of special cases was the introduction of data replication as a core abstraction in the architecture DSL (we also discuss this in Section 18). After modeling a number of message-based communication channels, we noticed that the interfaces all had the same set of methods, just for different data structures. When we finally saw the pattern behind it, we created a new linguistic abstraction: data replication.

6.1.3 Co-evolve Concepts and Language

In cases in which you perform a real domain analysis, i.e. when you have to find out which concepts the language should contain, make sure you evolve the language in real-time as you discuss the concepts.

Defining a language requires formalization. It requires becoming very clear and unambiguous about the concepts that go into the language. In fact, building the language, because of the need for formalization, helps you become clear about the domain abstractions in the first place. Language construction acts as a catalyst for understanding the domain! I recommend actually building a language in real-time as you analyze your domain.

Refrigerators: This is what we did in the cooling language. Everybody learned a lot about the possible structure of refrigerators and the limited feature combinations (based on limitations imposed by the way in which some of the hardware devices work).

To make this feasible, your DSL tool must be lightweight enough to support language evolution during domain analysis workshops. Turnaround time should be minimal.

Refrigerators: The cooling DSL is built with Xtext. Xtext allows very fast turnaround regarding grammar evolution, and, to a lesser extent, scopes, validation and type systems. We typically evolved the grammar in real-time, during the language design workshops, together with the domain experts. We then spent a day offline finishing scopes, constraints and the type system, as well as the interpreter.

6.1.4 Let People Do What They are Good At

DSLs offer a chance to let everybody do what they are good at. There are several clearly defined roles, or tasks, that need to be done. Let me point out two, specifically.

Experts in a specific target technology can dig deep into the details of how to efficiently implement, configure and operate that technology. They can spend a lot of time testing, digging and tuning. Once they have found out what works best, they can put their knowledge into platforms and execution engines, efficiently spreading the knowledge across the team. For the latter task, they will collaborate with generator experts and language designers – our second example role.

Component Architecture: In building the language, an OSGi expert was involved in building the generation templates.

The language designer works with domain experts to define abstractions, notations and constraints to capture domain knowledge accurately. The language designer also works with the architect and the platform experts in defining code generators or interpreters. Be aware that language designers need to have some kind of predisposition: not everybody is good at "thinking meta", some people are more comfortable with concrete work. Make sure you use "meta people" to do the "meta work". And of course, the language designer must be fluent with the DSL tool used in the project.

The flip side is that you have to make sure that you actually have people on your team who are good at language design, know the domain and understand the target platforms, otherwise the benefits promised by using DSLs may not materialize.

6.1.5 Domain Users vs. Domain Experts

When building business DSLs, people from the domain can play two different roles. They can either participate in the domain analysis and the definition of the DSL itself, or they can use the DSL to create domain-specific models or programs.

It is useful to distinguish these two roles explicitly. The first role (language definition) must be filled by a domain expert. These are people who have typically been working in the domain for a long time, often in different roles, and who have a deep understanding of the relevant concepts, which they are able to express precisely and maybe even formally. The second group of people are the domain users. They are of course familiar with the domain, but they are typically not as experienced as the domain experts.

This distinction is relevant because you want to work with the domain experts when defining the language, but you want to build a language that is suitable for use by the domain users. If the experts are too far ahead of the users, the users might not be able to "follow", and you will not be able to roll out the language to the actual target audience. Hence, when defining the language, make sure that you actually cross-check with real domain users whether they are able to work with the language.

Pension Plans: The core domain abstractions were contributed by Herman. Herman was the most senior pension expert in the company. In workshops we worked with a number of other domain users who didn’t have as much experience. We used them to validate that our DSL would work for the average future user. Of course they also found actual problems with the language, so they contributed to the evolution of the DSL beyond just acting as guinea pigs.

6.1.6 DSL as a Product

The language, constraints, interpreters and generators are usually developed by one (smaller) group of people and used by another (larger) group of people. To make this work, consider the "language stuff" as a product developed by one group for use by another. Make sure there’s a well-defined release schedule, that development happens in short, predefined increments, that requirements and issues are reported and tracked, errors are fixed reasonably quickly, there is ample documentation and that support staff is available to help with problems and the unavoidable learning curve. These things are critical for acceptance!

A specific best practice is to exchange people: from time to time, make application developers part of the language team so that they can appreciate the challenges of "meta", and make people from the language development team participate in actual application development to make sure they understand if and how their work products suit the people who do the actual application development.

mbeddr C: One of our initial proof-of-concept projects didn’t really work out very well. So in order to try out our first C extensions and come up with a showcase for an upcoming exhibition, the language developers built the proof-of-concept themselves. As it turned out, this was really helpful. We didn’t just find a lot of bugs, we also experienced first-hand some of the usability challenges of the system at the time. It was easy for us to fix, because it was we who experienced the problems in the first place.

6.1.7 Documentation is still necessary

Building the DSLs and execution engines is not enough to make the approach successful. You have to communicate to the users how to use these things in real-world contexts. Specifically, here’s what you have to document: the language structure and syntax, how to use the editors and the generators, how and where to write manual code and how to integrate it into generated code, as well as platform/framework decisions (if applicable).

Keep in mind that there are other media than paper. Screencasts, videos that show flip chart discussions, or even a regular podcast that talks about how the tools change are good choices, too. Also keep in mind that hardly anybody reads reference documentation. If you want to be successful, make sure the majority of your documentation consists of example-driven or task-based tutorials.

Component Architecture: The documentation for the component architecture DSL contains a set of example applications. Each of them guides a new user through building an increasingly complex application. It explains installation of the DSL into Eclipse, concepts of the target architecture and how they map to language syntax, use of the editor and generator, as well as how to integrate manually written code into the generated base classes.

6.2 Using DSLs

6.2.1 Reviews

A DSL limits the user’s freedom in some respect: they can only express things that are within the limits of the DSL. Specifically, low-level implementation decisions are not under a DSL user’s control because they are handled by the execution engine. However, even with the nicest DSL, users can still make mistakes or misuse the DSL – the more expressive the DSL, the bigger this risk. So, as part of your development process, make sure you perform regular model reviews. This is critical, especially for the adoption phase, when people are still learning the language and the overall approach.

Reviews are easier on the DSL level than on the code level. Since DSL programs are more concise and support better separation of concerns than their equivalent specification in GPL code, reviews become more efficient.

If you notice recurring mistakes, things that people do in the "wrong" way regularly, you can either add a constraint check

that detects the problem automatically, or (maybe even better) consider this as input to your language designers: maybe what the users expect is actually correct, and the language needs to be adapted.

6.2.2 Compatible Organization

Done right, using DSLs requires a lot of cross-project work. In many settings the same language (module) will be used in several projects or contexts. While this is of course a big plus, it also requires that the organization is able to organize, staff, schedule and pay for cross-cutting work. A strictly project-focused organization will have a very hard time finding resources for these kinds of activities. DSLs, beyond the small ad-hoc utility DSL, are very hard to introduce into such environments.

In particular, make sure that the organizational structure, and the way project cost is handled, is compatible with cross-cutting activities. Any given project will not invest in assets that are reusable in other projects if the cost for developing the asset is billed only to the particular project. Assets that are useful for several projects (or the company as a whole) must also be paid for by those several projects (or the company in general).

6.2.3 Domain Users Programming?

Technical DSLs are intended for use by programmers. Application domain DSLs are targeted towards domain users, non-programmers who are knowledgeable in the domain covered by the DSL. Can they actually work with DSLs?

In many domains, usually those that have a scientific or mathematical flavor, users can precisely describe domain knowledge. In other domains you might want to aim for a somewhat lesser goal. Instead of expecting domain users and experts to independently specify domain knowledge using a DSL, you might want to pair a developer and a domain expert. The developer can help the domain expert to be precise enough to "feed" the DSL.
Because the notation is free of implementation clutter, the domain expert feels much more at home than when staring at GPL source code.

Initially, you might even want to reduce your aspirations to the point where the developer does the DSL coding based on discussions with domain users, then showing them the resulting model and asking confirming or disproving questions about it. Putting knowledge into formal models helps you point out decisions that need to be made, or language extensions that might be necessary. Executing the program, by generating code or running some kind of simulator, can also help domain users understand better what has been expressed with the DSL.

If you are not able to teach a business domain DSL to the domain users, it might not necessarily be the domain users’ fault. Maybe your language isn’t really suitable to the domain. If you encounter this problem, take it as a warning sign and consider changing the language.

6.2.4 DSL Evolution

A DSL that is successfully used will have to be evolved. Just as for any other software artifact, requirements evolve over time and the software has to reflect these changes. In the context of DSLs, the changes can be driven by several different concerns:

Target Platform Changes: The target platform may change because of the availability of new technologies that provide better performance, scalability or usability. Ideally, no changes to either the language or the models are necessary: a new execution engine for the changed target platform can be created. In practice it is not always so clean: the DSL may make assumptions about the target platform that are no longer true for the changed or new platform. These may have to be removed from the languages and existing models. Also, the new platform may support different execution options, and the existing models do not contain enough information to make the decision of which option to take. In this case, additional annotation models may become necessary2.

2 Despite the caveats discussed in this paragraph, a target platform change is typically relatively simple to handle.

Domain Changes: As the domain evolves, it is likely that the language has to evolve as well3. The problem then is: what do you do with existing models? You have two fundamental options: keep the old language and don’t change the models, or evolve the existing models to work with the new (version of the) language. The former is often not really practical, especially in the face of several such changes.

3 If you use a lot of in-language abstraction or a standard library, you may be lucky and the changes can be realized without changes to the language.

The amount of pain in evolving existing models depends a lot on the nature of the change4. The most pragmatic approach keeps the new version of the language backward compatible, so that existing models can still be edited and processed. Under this premise, adding new language concepts is never a problem. However, you must never just delete existing concepts or change them in an incompatible way.

4 It also depends a lot on the DSL tool. Different tools support model evolution in different ways.
Instead, these old concepts should be marked as deprecated, and the editor will show a corresponding warning in the IDE. The IDE may also provide a quick fix to change the old, deprecated concept to a new (version of the) concept, if such a mapping is straightforward. Otherwise the migration must be done manually. If you have access to all models, you may also run a batch transformation during a quiet period to migrate them all at once. Note that, although deprecation has a bad reputation from programming languages from which deprecated concepts are never removed, this is not necessarily comparable to DSLs: if, after a while, people still use the deprecated concepts, you can have the IDE send an email to the language developers, who can then work with the "offending user" to migrate the programs.

Note that for the above approach to work, you have to have a structured process for versioning the languages and tools, otherwise you will quickly end up in version chaos.
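The batch migration of deprecated concepts mentioned above can be sketched as a simple tree rewrite. The following Python sketch assumes that models are available as nested dictionaries; the concept names (`Timer`, `Delay`), the property names and the migration rule are all hypothetical and stand in for whatever mapping a real language would need. Real language workbenches provide their own migration mechanisms, which this sketch does not attempt to reproduce.

```python
def migrate_timer(node):
    """Hypothetical rule: the deprecated 'Timer' (in seconds) is replaced
    by the newer 'Delay' concept (in milliseconds)."""
    return {
        "concept": "Delay",
        "props": {"millis": node["props"]["seconds"] * 1000},
        "children": node["children"],
    }


# Deprecated concept name -> function producing the replacement node.
MIGRATIONS = {"Timer": migrate_timer}


def migrate(node, warnings):
    """Return a migrated copy of the model tree; the input is not modified."""
    children = [migrate(child, warnings) for child in node.get("children", [])]
    node = {**node, "children": children}
    rule = MIGRATIONS.get(node["concept"])
    if rule is None:
        return node
    # This is also the place where an editor would show a deprecation warning.
    warnings.append(f"deprecated concept '{node['concept']}' migrated")
    return rule(node)


old_model = {
    "concept": "Program",
    "props": {"name": "defrost"},
    "children": [
        {"concept": "Timer", "props": {"seconds": 30}, "children": []},
        {"concept": "State", "props": {"name": "idle"}, "children": []},
    ],
}

warnings = []
new_model = migrate(old_model, warnings)
```

Keeping the rules in a table keyed by concept name mirrors the backward-compatible strategy: concepts without an entry pass through unchanged, so adding new concepts to the language never requires touching the migration.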

DSL Tool Changes: The third change is driven by evolution of the DSL tool. Of course, the language definition (and potentially, the existing models) may have to evolve if the DSL tool changes in an incompatible way (which, one could argue, it shouldn’t!). This is similar to every other tool, library or framework you may use. People seem particularly afraid of the situation in which they have to switch to a completely new DSL tool because the current one is no longer supported, or a new one is just better. Of course it is very likely that you’ll have to completely redo the language definitions: there is no portability in terms of language definitions among DSL tools (not even among those that reside on Eclipse). However, if you have designed your languages well you will probably be able to automatically transform existing models into the new tool’s data structures5.

5 It is easy to see that over time the real value is in the models, and not so much in the language definitions and IDEs. Rebuilding those is hence not such a big deal, especially if we consider that a new, better tool may require less effort to build the languages.

One central pillar of using DSLs is the high degree to which they support separation of concerns and the expression of domain knowledge at a level of abstraction that makes the domain semantics obvious, thus avoiding complex reverse engineering problems. Consequently you can generate all kinds of artifacts from the models. This characteristic also means that it is relatively straightforward to write a generator that creates a representation of the model in a new tool’s data structures6.

6 For example, we have built a generic MPS to EMF exporter. It works for meta models as well as for the models.

6.2.5 Avoiding Uncontrolled Growth and Fragmentation

If you use DSLs successfully, there may be the danger of uncontrolled growth and diversification in languages, with the obvious problems for maintenance, training and interoperability7. To avoid this, there are two complementary approaches: an organizational one and a technical one.

7 While this may become a problem, this may also become a problem with libraries or frameworks... the solution is also the same, as we will see.

The organizational approach requires putting in place governance structures for language development. Maybe developers have to coordinate with a central entity before they are "allowed" to define a new language. Or an open-source like model is used, in which languages are developed in public and the most successful ones will survive and attract contributions. Maybe you want to limit language development to some central "language team"8. Larger organizations in which uncontrolled language growth and fragmentation might become a problem are likely to already have established processes for coordinating reusable or cross-cutting work. You should just plug into these processes.

8 Please don’t overdo this – don’t make it a bureaucratic nightmare to develop a language!

The technical approach (which should be used together with the organizational one) exploits language modularization, extension and composition. If (parts of) languages can be reused, the drive to develop something completely new (that does more or less the same as somebody else’s language) is reduced. Of course this requires that language reuse actually works with your tool of choice. It also requires that the potentially reusable languages are robust, stable and documented – otherwise nobody will use them. In a large organization I would assume that a few languages will be strategic: aligned with the needs of the whole organization, well-designed, well tested and documented, implemented by a central group, used by many developers and reusable by design9.
In addition, small teams may decide to develop their own smaller languages or extensions, reusing the strategic ones. Their focus is much more local, and the development requires much less coordination.

9 The development of these languages should be governed by the organizational approach discussed above.

Part III

DSL Implementation


This part of the book has been written together with Lennart Kats and Guido Wachsmuth, who contributed the material on Spoofax, and Christian Dietrich, who helped with the language modularization in Xtext.

In this part we describe language implementation with three language workbenches, which together represent the current state of the art: Spoofax, Xtext and MPS. All of them are Open Source, so you can experiment with them. For more example language implementations using more language workbenches, take a look at the Language Workbench Competition website10.

10 languageworkbenches.net

This part of the book does not cover a lot of design decisions or motivation for having things like constraints, type systems, transformations or generators. Conceptually these topics are introduced in Part II of the book on DSL design. This part really just looks at the "how", not the "what" or "why".

Each chapter contains examples implemented with all three tools. The ordering of the tools is different from chapter to chapter, based on the characteristics of each tool: if the example for tool A illustrates a point that is also relevant for tool B, then A is discussed before B. The examples are not intended to serve as a full tutorial for any of these tools, but as an illustration of the concepts and ideas involved with language implementation in general. However, they should give you a solid understanding of the capabilities of each of the tools, and the class of tools they stand for. Also, if a chapter does not explain topic X for tool Y, this does not imply that you cannot do X with Y – it just means that Y’s approach to X is not significantly different from things that have already been discussed in the chapter.

7 Concrete and Abstract Syntax

In this chapter we look at the definition of abstract and concrete syntax, and the mapping between the two in parser-based and projectional systems. We also discuss the advantages and drawbacks of these two approaches. We discuss the characteristics of typical AST definition formalisms. The meat of the chapter is made up of extensive examples for defining language structure and syntax with our three example tools.

The concrete syntax (CS) of a language is what the user interacts with to create programs. It may be textual, graphical, tabular or any combination thereof. In this book we focus mostly on textual concrete syntaxes; examples of other forms are briefly discussed in Section 4.7. In this chapter we refer to other forms where appropriate.

The abstract syntax (AS) of a language is a data structure that holds the core information in a program, but without any of the notational details contained in the concrete syntax: keywords and symbols, layout (e.g., whitespace), and comments are typically not included in the AS. In parser-based systems the syntactic information that doesn’t end up in the AS is often preserved in some "hidden" form so the CS can be reconstructed from the combination of the AS and this hidden information – this bidirectionality simplifies the creation of IDE features such as quick fixes or formatters.

As we have seen in the introduction, the abstract syntax is essentially a tree data structure. Instances that represent actual programs (i.e. sentences in the language) are hence often called an abstract syntax tree or AST. Most formalisms also support cross-references across the tree, in which case the data structure becomes a graph (with a primary containment hierarchy). It is still usually called an AST.

While the CS is the interface of the language to the user, the AS acts as the API to access programs by processing tools: it is used by developers of validators, transformations and code generators. The concrete syntax is not relevant in these cases.

To illustrate the relationship between the concrete and abstract syntax, consider the following example program:

var x: int;
calc y: int = 1 + 2 * sqrt(x)

This program has a hierarchical structure: definitions of x and y at the top; inside y there’s a nested expression. This structure is reflected in the corresponding abstract syntax tree. A possible AST is illustrated in Fig. 7.1 [1].

[1] We write possible because there are typically several ways of structuring the abstract syntax.

Figure 7.1: Abstract syntax tree for the above program. Boxes represent instances of language concepts, solid lines represent containment, dotted lines represent cross-references.
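To make the idea of an AS as a data structure more tangible, the following is a minimal Python sketch of one possible AST for the example program. All class and field names (VarDecl, CalcDecl, BinaryOp, VarRef and so on) are assumptions chosen for illustration; they are not taken from Xtext, Spoofax or MPS. Containment is modeled by fields that own their children, and the cross-reference from the use of x to its declaration is an ordinary object reference.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class VarDecl:                 # var x: int;
    name: str
    type: str


@dataclass
class NumLit:
    value: int


@dataclass
class VarRef:
    target: VarDecl            # cross-reference, not containment


@dataclass
class Call:
    function: str
    args: List["Expr"]


@dataclass
class BinaryOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[NumLit, VarRef, Call, BinaryOp]


@dataclass
class CalcDecl:                # calc y: int = <expression>
    name: str
    type: str
    init: Expr


@dataclass
class Module:
    decls: List[Union[VarDecl, CalcDecl]]


def build_example() -> Module:
    """Build the AST for: var x: int; calc y: int = 1 + 2 * sqrt(x)"""
    x = VarDecl("x", "int")
    init = BinaryOp("+", NumLit(1),
                    BinaryOp("*", NumLit(2), Call("sqrt", [VarRef(x)])))
    return Module([x, CalcDecl("y", "int", init)])


def render(e: Expr) -> str:
    """Naive text rendering; it adds no parentheses, so it only suits
    trees whose nesting already matches the usual precedences."""
    if isinstance(e, NumLit):
        return str(e.value)
    if isinstance(e, VarRef):
        return e.target.name
    if isinstance(e, Call):
        return f"{e.function}({', '.join(render(a) for a in e.args)})"
    return f"{render(e.left)} {e.op} {render(e.right)}"
```

Note that keywords such as var and calc, the colons and the whitespace appear nowhere in these classes; they belong to the concrete syntax only.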

There are two ways of defining the relationship between the CS and the AS as part of language development:

CS first: From a concrete syntax definition, an abstract syntax is derived, either automatically or using hints in the concrete syntax specification [2]. This is the default use for Xtext, where Xtext derives the Ecore meta model from an Xtext grammar.

AS first: We first define the AS. We then define the concrete syntax, referring to the AS in the definition of the concrete syntax [3]. For example, in Xtext it is possible to define a grammar for an existing meta model.

[2] This is more convenient, but the resulting AS may not be as clean as if it were defined manually; it may contain idiosyncrasies that result from the automatic derivation from the CS. For example, an Ecore meta model derived from an Xtext grammar will never contain interfaces, because these cannot be expressed with the Xtext grammar language. However, the use of interfaces may result in a meta model that is easier to process (richer typing). In this case it makes sense to use the AS first approach.

[3] This is often done if the AS structure already exists, has to conform to externally imposed constraints or is developed by another party than the language developer.

Once the language is defined, there are again two ways in which the abstract syntax and the concrete syntax can relate as the language is used to create programs [4]:

Parsing: In the parser-based approach, the abstract syntax tree is constructed from the concrete syntax of a program; a parser instantiates and populates the AS, based on the information in the program text. In this case, the (formal) definition of the CS is usually called a grammar [5]. Xtext and Spoofax use this approach.

Projection: In the projectional approach, the abstract syntax tree is built directly by editor actions, and the concrete syntax is rendered from the AST via projection rules. MPS is an example of a tool that uses projectional editing.

[4] We will discuss these two approaches in more detail in the next subsection.

[5] Sometimes the parser creates a concrete syntax tree, which is then transformed to an AST – however, we ignore this aspect in the rest of the book, as it is not essential.

Fig. 7.2 shows the typical combinations of these two dimensions. In practice, parser-based systems typically derive the AS from the CS – i.e. CS first. In projectional systems, the CS is usually annotated onto the AS data structures – i.e. AS first.

Figure 7.2: Dimensions of defining the concrete and abstract syntax of a language. Xtext is mentioned twice because it supports CS first and AS first, although CS first is the default. Note also that as of now there is no projectional system that uses CS first. However, JetBrains are currently experimenting with such a system.

7.1 Fundamentals of Free Text Editing and Parsing

Most programming environments rely on free text editing, where programmers edit programs at the text/character level to form (key)words and phrases.

A parser is used to check the program text (concrete syntax) for syntactic correctness, and create the AST by populating the AS data structures from information extracted from the textual source. Most modern IDEs perform this task in real-time as the user edits the program, and the AST is always kept in sync with the program text. Many IDE features – such as content assist, validation, navigation or refactoring support – are based on this synchronized AST.

A key characteristic of the free text editing approach is its strong separation between the concrete syntax (i.e. text) and the abstract syntax. The concrete syntax is the principal representation, used for both editing and persistence [6]. The abstract syntax is used under the hood by the implementation of the DSL, e.g., for providing an outline view, validation, and for transformations and code generation. The AS can be changed (by changing the mapping from the CS to an AS) without any effect on the CS and existing programs.

Figure 7.3: In parser-based systems, the user only interacts with the concrete syntax, and the AST is constructed from the information in the text via a parser.

[6] In projectional editing it is the other way round: the CS can be changed easily (by changing the projection rules) while keeping the AS constant.

Many different approaches exist for implementing parsers. Each may restrict the syntactic freedom of a language, or constrain the way in which a particular syntax must be specified. It is important to be aware of these restrictions, since not all languages can be comfortably implemented by every parser implementation approach, or even at all. You may have heard terms like context free, ambiguity, look-ahead, LL, (LA)LR or PEG. These all pertain to a certain class of parser implementation approaches. We provide more details on the various grammar and parser classes further on in this section.

7.1.1 Parser Generation Technology

In traditional compilers and IDEs (such as gcc or the Eclipse JDT), parsers are often written by hand as a big, monolithic program that reads a stream of characters and uses recursion to create a tree structure. However, manually writing a parser requires significant expertise in parsing and a significant development effort. For standardized programming languages that don’t change very often, and that have a large user community, this approach makes sense. It can lead to very fast parsers that also provide good error reporting and error recovery (the ability to continue parsing after a syntax error has been found).

In contrast, language workbenches, and most of today’s compilers, generate a parser from a grammar. A grammar is a syntax specification written in a DSL for formally defining textual concrete syntax. These generated parsers may not provide the same performance or error reporting/recovery as a hand-tailored parser constructed by an expert, but they provide bounded performance guarantees that make them (usually) more than fast enough for modern machines. Also, they generate a complete parser for the complete grammar – developers may forget corner cases if they write the parser manually. However, the most important argument for using parser generation is that the effort of building a parser is much lower than manually writing a custom parser [7]. Finally, it means that the developer who defines a language does not have to be an expert in parsing technology.

[7] The grammar definition is also much more readable and maintainable than the actual parser implementation, either custom-written or generated.

Parsing versus Scanning

Because of the complexity inherent in parsing, parser implementations tend to split the parsing process into a number of phases. In the majority of cases the text input is first separated into a sequence of tokens (i.e. keywords, identifiers, literals, comments or whitespace) by a scanner (sometimes also called lexer or tokenizer). The parser then constructs the actual AST from the token sequence [8]. This simplifies the implementation compared to directly parsing at the character level. A scanner is usually implemented using direct recognition of keywords and a set of regular expressions to recognize all other valid input as tokens.

[8] Note that many parser generators allow you to add arbitrary code (called actions) to the grammar, for example to check constraints or interpret the program. We strongly recommend against this: instead, a parser should only check for syntactic correctness and build the AST. All other processing should be built on top of the AST. This separation between AST construction and AST processing results in much more maintainable language implementations.

Both the scanner and parser can be generated from grammars (see below). A well-known example of a scanner (lexer) generation tool is lex [9]. Modern parsing frameworks, such as ANTLR [10], do their own scanner generation.

[9] dinosaur.compilertools.net/

[10] www.antlr.org/

Note that the word "parser" now has more than one meaning: it can either refer to the combination of the scanner and the parser, or to the post-scanner parser only. Usually the former meaning is intended (both in this book as well as in general) unless scanning and parsing are discussed specifically.

A separate scanning phase has direct consequences for the overall parser implementation, when the scanner is not aware of the context of its input. An example of a typical problem that arises from this is that keywords can’t be used as identifiers even though the use of a keyword frequently wouldn’t cause ambiguity in the actual parsing. The Java language is an example of this: it uses a fixed set of keywords, such as class and public, that cannot be used as identifiers.

A context-unaware scanner is also problematic when grammars are extended or composed. In the case of Java, this was seen with the assert and enum keywords that were introduced in Java 1.4 and Java 5, respectively. Any programs that used identifiers with those names (such as unit testing APIs) were no longer valid. For composed languages, similar problems arise, as constituent languages have different sets of keywords and can define incompatible regular expressions for lexicals such as identifiers and numbers.

A recent technique to overcome these problems is context-aware scanning, in which the lexer relies on the state of the parser to determine how to interpret the next token [11]. With scannerless parsing, there is no separate scanner at all. Instead, the parser operates at the character level and statefully processes lexicals and keywords, avoiding the problems of context-unaware scanning illustrated above. Spoofax (or rather, the underlying parser technology SDF) uses scannerless parsing.

[11] E. Van Wyk and A. Schwerdfeger. Context-Aware Scanning for Parsing Extensible Languages. In Intl. Conf. on Generative Programming and Component Engineering, GPCE 2007. ACM Press, 2007.

Grammars

Grammars are the formal definition for concrete textual syntax. They consist of production rules that define what valid textual input ("sentences") looks like [12]. Grammars are the basis for syntax definitions in text-based workbenches such as Spoofax and Xtext [13].

[12] They can also be used to "produce" valid input by executing them "the other way round", hence the name.

[13] In these systems, the production rules are enriched with information beyond the pure grammatical structure of the language, such as the semantical relation between references and declarations.

Fundamentally, production rules can be expressed in Backus-Naur Form (BNF) [14], written as S ::= P1 ... Pn. This grammar defines a symbol S by a series of pattern expressions P1 ... Pn.

[14] en.wikipedia.org/wiki/Backus-Naur_Form

Each pattern expression can refer to another symbol or can be a literal such as a keyword or a punctuation symbol. If there are multiple possible patterns for a symbol, these can be written as separate productions (for the same symbol), or the patterns can be separated by the | operator to indicate a choice. An extension of BNF, called Extended BNF (EBNF) [15], adds a number of convenience operators such as ? for an optional pattern, * to indicate zero or more occurrences, and + to indicate one or more occurrences of a pattern expression.

[15] en.wikipedia.org/wiki/Extended_Backus-Naur_Form

The following code is an example of a grammar for a simple arithmetic expression language using BNF notation. Basic expressions are built up of NUM number literals and the + and * operators [16].

[16] These are the + and * operators of the defined language, not those mentioned for EBNF above.

Exp ::= NUM
      | Exp "+" Exp
      | Exp "*" Exp

Note how expression nesting is described using recursion in this grammar: the Exp rule calls itself, so sentences like 2 + 3 * 4 are possible. This poses two practical challenges for parser generation systems: first, the precedence and associativity of the operators are not described by this grammar. Second, not all parser generators provide full support for recursion. For example, ANTLR cannot cope with left-recursive rules. We elaborate on these issues in the remainder of the section and in the Spoofax and Xtext examples.
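One way to get a feeling for what production rules and recursion mean is to run a grammar "forwards" and let it produce sentences. The following Python sketch does this for the Exp grammar. The dictionary encoding of the grammar and the function name sentences are assumptions made for this illustration; no parser generator works exactly like this.

```python
import itertools

# The Exp grammar encoded as data: each symbol maps to its alternatives,
# and each alternative is a sequence of symbols or literals.
GRAMMAR = {
    "Exp": [["NUM"], ["Exp", "+", "Exp"], ["Exp", "*", "Exp"]],
}


def sentences(symbol, depth):
    """Return all token sequences derivable from `symbol` when rules may
    be nested at most `depth` levels deep."""
    if symbol not in GRAMMAR:          # terminal: a literal or a token class
        return [[symbol]]
    if depth == 0:                     # nesting budget used up
        return []
    result = []
    for production in GRAMMAR[symbol]:
        parts = [sentences(s, depth - 1) for s in production]
        for combo in itertools.product(*parts):
            result.append([tok for part in combo for tok in part])
    return result


for sentence in sentences("Exp", 2):
    print(" ".join(sentence))
```

With a nesting depth of 2 this prints NUM, NUM + NUM and NUM * NUM. With larger depths some sentences are produced more than once, because the grammar allows several derivations for them.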

Grammar classes

BNF can describe any grammar that maps textual sentences to trees based only on the input symbols. These are called context-free grammars and can be used to parse the majority of modern programming languages [17]. In contrast, context-sensitive grammars are those that also depend on the context in which a partial sentence occurs, making them suitable for natural language processing but at the same time, making parsing itself a lot harder, since the parser has to be aware of a lot more than just the syntax.

[17] An exception is SAP’s ABAP language, which requires a custom, hand-written parser.

Parser generation was first applied in command-line tools such as yacc in the early seventies [18]. As a consequence of relatively slow computers, much attention was paid to the efficiency of the generated parsers. Various algorithms were designed that could parse text in a bounded amount of time and memory. However, these time and space guarantees could only be provided for certain subclasses of the context-free grammars, described by acronyms such as LL(1), LL(k), LR(1), and so on. A particular parser tool supports a specific class of grammars – e.g., ANTLR supports LL(k) and LL(*). In this naming scheme, the first L stands for left-to-right scanning, and the second L in LL and the R in LR stand for leftmost and rightmost derivation. The constant k in LL(k) and LR(k) indicates the maximum number (of tokens or characters) the parser will look ahead to decide which production rule it can recognize. The bigger k, the more syntactic forms can be parsed [19]. Typically, grammars for "real" DSLs tend to need only finite look-ahead and many parser tools effectively compute the optimal value for k automatically. A special case is LL(*), where k is unbounded and the parser can look ahead arbitrarily many tokens to make decisions.

[18] dinosaur.compilertools.net

[19] Bigger values of k may also reduce parser performance, though.

Supporting only a subclass of all possible context-free grammars poses restrictions on the languages that are supported by a parser generator. For some languages, it is not possible to write a grammar in a certain subclass, making that particular language unparseable with a tool that only supports that particular class of grammars. For other languages, a natural context-free grammar exists, but it must be written in a different, sometimes awkward or unintuitive way to conform to the subclass. This will be illustrated in the Xtext example, which uses ANTLR as the underlying LL(k) parser technology.

Parser generators can detect whether a grammar conforms to a certain subclass, reporting conflicts that relate to the implementation of the parsing algorithm [20]. Language developers can then attempt to manually refactor the grammar to address those errors [21]. As an example, consider a grammar for property or field access, expressions of the form customer.name or "Tim".length [22]:

Exp ::= ID
      | STRING
      | Exp "." ID

[20] You may have heard of shift/reduce or reduce/reduce conflicts for LR parsers, or first/first or first/follow conflicts and direct or indirect left recursion for LL parsers. We will discuss some of these in detail below.

[21] Understanding these errors and then refactoring the grammar to address them can be non-trivial, since it requires an understanding of the particular grammar class and the parsing algorithm.

[22] Note that we use ID to indicate identifier patterns and STRING to indicate string literal patterns in these examples.

This grammar uses left-recursion: the left-most symbol of one of the definitions of Exp is a call to Exp, i.e. it is recursive. Left-recursion is not supported by LL parsers such as ANTLR.

The left-recursion can be removed by left-factoring the grammar, i.e. by changing it to a form where all left recursion is eliminated. The essence of left-factoring is that the grammar is rewritten in such a way that all recursive production rules consume at least one token or character before going into the recursion. Left-factoring introduces additional rules that act as intermediaries and often makes repetition explicit using the + and * operators. Our example grammar from above uses recursion for repetition, which can be made explicit as follows:

Exp ::= ID
      | STRING
      | Exp ("." ID)+

The resulting grammar is still left-recursive, but we can introduce an intermediate rule to eliminate the recursive call to Exp:

Exp ::= ID
      | STRING
      | FieldPart ("." ID)+

FieldPart ::= ID | STRING

Unfortunately, this resulting grammar still has overlapping rules (first/first conflicts), as the ID and STRING symbols both match more than one rule. This conflict can be eliminated by removing the Exp ::= ID and Exp ::= STRING rules and making the + (one or more) repetition into a * (zero or more) repetition:

Exp ::= FieldPart ("." ID)*

FieldPart ::= ID | STRING

This last grammar describes the same language as the original grammar shown above, but conforms to the LL(1) grammar class [23]. In the general case, not all context-free grammars can be mapped to one of the restricted classes. Valid, unambiguous grammars exist that cannot be factored to any of the restricted grammar classes. In practice, this means that some languages cannot be parsed with LL or LR parsers.

[23] Unfortunately, it is also much more verbose. Refactoring "clean" context free grammars to make them conform to a particular grammar class usually makes the grammars larger and/or uglier.
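To see why the final form is convenient for an LL(1) parser, consider the following hand-written recursive-descent sketch in Python for Exp ::= FieldPart ("." ID)* and FieldPart ::= ID | STRING. It is an illustration only, not what ANTLR or Xtext generate; the token names, the tuple-based tree shape and the function names tokenize and parse_exp are assumptions. At every decision point a single token of look-ahead is enough, and the loop for the repetition rebuilds the left-nested tree that the original left-recursive grammar describes.

```python
import re

# One regular expression per token class; whitespace is skipped.
TOKEN = re.compile(
    r'\s*(?:(?P<ID>[A-Za-z_]\w*)|(?P<STRING>"[^"]*")|(?P<DOT>\.))')


def tokenize(text):
    """Scanner: turn the input text into (kind, lexeme) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse_exp(tokens):
    """Parser for: Exp ::= FieldPart ("." ID)*"""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else "EOF"

    def expect(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        lexeme = tokens[pos][1]
        pos += 1
        return lexeme

    # FieldPart ::= ID | STRING  (one token of look-ahead picks the branch)
    if peek() == "ID":
        tree = ("id", expect("ID"))
    else:
        tree = ("string", expect("STRING"))

    # ("." ID)*  (the loop replaces the left recursion)
    while peek() == "DOT":
        expect("DOT")
        tree = ("field", tree, expect("ID"))

    if peek() != "EOF":
        raise SyntaxError(f"unexpected {peek()} after expression")
    return tree
```

For the input customer.address.city this yields a tree in which the access to city contains the access to address, which in turn contains the identifier customer.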

General parsers

Research into parsing algorithms has produced parser generators specific to various grammar classes, but there has also been research in parsers for the full class of context-free grammars. A naive approach to avoid the restrictions of LL or LR parsers may be to add backtracking, so that if any input doesn’t match a particular production, the parser can go back and try a different production. Unfortunately, this approach risks exponential execution times or non-termination and usually exhibits poor performance.

There are also general parsing algorithms that can efficiently parse the full class. In particular, generalized LR (GLR) parsers [24] and Earley parsers [25] can parse in linear time O(n) in the common case. In the case of ambiguities, the time required can increase, but in the worst case they are bounded by cubic O(n^3) time. In practice, most programming languages have few or no ambiguities, ensuring good performance with a GLR parser. Spoofax is an example of a language workbench that uses GLR parsing.

[24] en.wikipedia.org/wiki/GLR_parser

[25] en.wikipedia.org/wiki/Earley_parser

Ambiguity

Grammars can be ambiguous, meaning that at least one valid sentence in the language can be constructed in more than one (non-equivalent) way from the production rules [26], corresponding to multiple possible ASTs. This obviously is a problem for parser implementation, as some decision has to be made on which AST is preferred.

[26] This also means that this sentence can be parsed in more than one way.

Consider again the expression language introduced above.

Exp ::= NUM
      | Exp "+" Exp
      | Exp "*" Exp

This grammar is ambiguous, since for a string 1 * 2 + 3 there are two possible trees (corresponding to different operator precedences).

[Figure: the two possible parse trees for 1 * 2 + 3. One groups the input as (1 * 2) + 3, the other as 1 * (2 + 3).]

The grammar does not describe which interpretation should be preferred. Parser generators for restricted grammar classes and generalized parsers handle ambiguity differently. We discuss both approaches below.
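The ambiguity can be made visible with a few lines of Python that enumerate every tree the grammar Exp ::= NUM | Exp "+" Exp | Exp "*" Exp allows for a given token sequence. This is a deliberately naive, exponential enumeration written for illustration; it is not how GLR or Earley parsers work internally, and the function name all_trees and the tuple-based tree shape are assumptions.

```python
def all_trees(tokens):
    """Return every parse tree for `tokens` under the ambiguous grammar
    Exp ::= NUM | Exp "+" Exp | Exp "*" Exp."""
    if len(tokens) == 1:
        # Exp ::= NUM
        return [tokens[0]] if tokens[0].isdigit() else []
    trees = []
    # Try every operator position as the root of the tree.
    for i in range(1, len(tokens) - 1):
        if tokens[i] in ("+", "*"):
            for left in all_trees(tokens[:i]):
                for right in all_trees(tokens[i + 1:]):
                    trees.append((tokens[i], left, right))
    return trees


for tree in all_trees("1 * 2 + 3".split()):
    print(tree)
```

For 1 * 2 + 3 the sketch prints two trees, one with * at the root and one with + at the root. Longer inputs produce rapidly growing numbers of trees, which is why a parser needs additional information to pick one.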

Ambiguity with Grammar Classes

LL and LR parsers are deterministic parsers: they can only return one possible tree for a given input. This means they can’t handle a grammar that has ambiguities, including our simple expression grammar. Determining whether a grammar is ambiguous is a classic undecidable problem. However, it is possible to detect violations of the LL or LR grammar class restrictions, in the form of conflicts. These conflicts do not always indicate ambiguities (as seen with the field access grammar discussed above), but by resolving all conflicts (if possible) an unambiguous grammar can be obtained.

Resolving grammar conflicts in the presence of associativity, precedence, and other risks of ambiguity requires carefully layering the grammar in such a way that it encodes the desired properties. To encode left-associativity and a lower priority for the + operator, we can rewrite the grammar as follows:

Expr ::= Expr "+" Mult
       | Mult
Mult ::= Mult "*" NUM
       | NUM

The resulting grammar is a valid LR grammar. Note how it puts the + operator in the highest layer to give it the lowest priority [27], and how it uses left-recursion to encode left-associativity of the operators. The grammar can be left-factored to a corresponding LL grammar as follows [28]:

Expr ::= Mult ("+" Mult)*
Mult ::= NUM ("*" NUM)*

[27] A + will end up further up in the expression tree than a *. This means that the * has higher precedence, since any interpreter or generator will encounter the * first.

[28] We will see more extensive examples of this approach in the section on Xtext (Section 7.5).
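The effect of the layering can be checked with a small hand-written parser for the LL form Expr ::= Mult ("+" Mult)* and Mult ::= NUM ("*" NUM)*. The following Python sketch is an illustration under simplifying assumptions: tokens are already separated by whitespace, trees are plain tuples, and the function names are made up for this example. Each grammar layer becomes one function, and each repetition becomes a loop that nests to the left.

```python
def parse_expr(tokens):
    """Parse a token list according to
    Expr ::= Mult ("+" Mult)*   and   Mult ::= NUM ("*" NUM)*"""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def num():
        tok = peek()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected NUM, found {tok!r}")
        return int(take())

    def mult():                      # lower layer: binds more tightly
        tree = num()
        while peek() == "*":
            take()
            tree = ("*", tree, num())
        return tree

    def expr():                      # top layer: lowest priority
        tree = mult()
        while peek() == "+":
            take()
            tree = ("+", tree, mult())
        return tree

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse_expr("1 * 2 + 3".split()))
print(parse_expr("1 + 2 * 3".split()))
```

In both results the + node is the root and the * node sits below it, regardless of the order in which the operators appear in the input. A chain such as 1 + 2 + 3 nests to the left, which corresponds to left-associativity.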

Ambiguity with Generalized Parsers

Generalized parsers accept grammars regardless of recursion or ambiguity. So our expression grammar is readily accepted as a valid grammar. In the case of an ambiguity, the generated parser simply returns all possible abstract syntax trees, e.g. one tree that groups the expression 1 * 2 + 3 as (1 * 2) + 3 and one that groups it as 1 * (2 + 3). The different trees can be manually inspected to determine what ambiguities exist in the grammar, or the desired tree can be programmatically selected. A way of programmatically selecting one alternative is disambiguation filters. For example, left-associativity can be indicated on a per-production basis:

Exp ::= NUM
      | Exp "+" Exp {left}
      | Exp "*" Exp {left}

This indicates that both operators are left-associative (using the {left} annotation from Spoofax). Operator precedence can be indicated with relative priorities or with precedence annotations:

Exp ::= Exp "*" Exp {left} >
Exp ::= Exp "+" Exp {left}

The > indicates that the * operator binds more strongly than the + operator. This kind of declarative disambiguation is commonly found in GLR parsers, but typically is not available in parsers that support only more limited grammar classes [29].

[29] As even these simple examples show, this style of specifying grammars leads to simpler, more readable grammars. It also makes language specification much simpler, since developers don’t have to understand the conflicts/errors mentioned above.

Grammar Evolution and Composition

Grammars evolve as languages change and new features are added. These features can be added by adding single, new productions, or by composing the grammar with an existing grammar. Composition of grammars is an efficient way of reusing grammars and quickly constructing or extending new grammars. As a basic example of grammar composition, consider once again our simple grammar for arithmetic expressions:

Expr ::= NUM
       | Expr "*" Expr
       | Expr "+" Expr

Once more operators are added and the proper associativities and precedences are specified, such a grammar forms an excellent unit for reuse [30]. As an example, suppose we want to compose this grammar with the grammar for field access expressions [31]:

Expr ::= ID
       | STRING
       | Expr "." ID

[30] For example, expressions can be used as guard conditions in state machines, for pre- and postconditions in interface definitions, or to specify derived attributes in a data definition language.

[31] Here we consider the case where two grammars use a symbol with the identical name Expr. Some grammar definition formalisms support mechanisms such as grammar mixins and renaming operators to work with grammar modules where the symbol names do not match.

In the ideal case, composing two such grammars should be trivial – just copy them into the same grammar definition file. However, reality is often less than ideal. There are a number of challenges that arise in practice, related to ambiguity and to grammar class restrictions [32].

[32] See Laurence Tratt’s article Parsing – the solved problem that isn’t. at tratt.net/laurie/tech_articles/articles/parsing_the_solved_problem_that_isnt.

• Composing arbitrary grammars risks introducing ambiguities that did not exist in either of the two constituent grammars. In the case of the arithmetic expressions and field access grammars, care must specifically be taken to indicate the precedence order of all operators with respect to all others. With a general parser, new priority rules can be added without changing the two imported grammars. When an LL or LR parser is used, it is often necessary to change one or both of the composed grammars to eliminate any conflicts. This is because in a general parser, the precedences are declarative (additional preference specification can simply be added at the end), whereas in LL or LR parsers the precedence information is encoded in the grammar structure (and hence invasive changes to this structure may be required).

• We have shown how grammars can be massaged with techniques such as left-factoring in order to conform to a certain grammar class. Likewise, any precedence order or associativity can be encoded by massaging the grammar to take a certain form. Unfortunately, all this massaging makes grammars very resistant to change and composition: after two grammars are composed together, the result is often no longer LL or LR, and another manual factorization step is required.

• Another challenge is in composing scanners. When two grammars that depend on a different lexical syntax are composed, conflicts can arise. For example, consider what happens when we compose the grammar of Java with the grammar of SQL:

for (Customer c : SELECT customer FROM accounts WHERE balance < 0) { ... }

The SQL grammar reserves keywords such as SELECT, even though they are not reserved in Java. Such a language change could break compatibility with existing Java programs which happen to use a variable named SELECT. A common programmatic approach to solve this problem is the introduction of easy-to-recognize boundaries, which trigger switches between different parsers. In general, this problem can only be avoided completely by a scannerless parser, which considers the lexical syntax in the context in which it appears; traditional parsers perform a separate scanning stage in which no context is considered.
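The scanner problem can be reproduced in a few lines. The following Python sketch contrasts a context-unaware scanner, which uses the union of both keyword sets everywhere, with a scanner that switches keyword sets at explicit boundaries. The boundary markers <| and |>, the tiny keyword sets and all function names are hypothetical and chosen only for this illustration; real embeddings of SQL in Java use their own conventions.

```python
import re

JAVA_KEYWORDS = {"for", "int", "class"}       # tiny subsets, for illustration
SQL_KEYWORDS = {"SELECT", "FROM", "WHERE"}

WORD = re.compile(r"[A-Za-z_]\w*")
# Words plus the hypothetical boundary markers <| and |>.
WORD_OR_BOUNDARY = re.compile(r"<\||\|>|[A-Za-z_]\w*")


def classify(word, keywords):
    return ("KEYWORD" if word in keywords else "ID", word)


def scan_context_unaware(text):
    """One global keyword table for the composed language."""
    keywords = JAVA_KEYWORDS | SQL_KEYWORDS
    return [classify(w, keywords) for w in WORD.findall(text)]


def scan_with_boundaries(text):
    """Switch the keyword table whenever a boundary marker is seen."""
    tokens, in_sql = [], False
    for lexeme in WORD_OR_BOUNDARY.findall(text):
        if lexeme == "<|":
            in_sql = True
        elif lexeme == "|>":
            in_sql = False
        else:
            keywords = SQL_KEYWORDS if in_sql else JAVA_KEYWORDS
            tokens.append(classify(lexeme, keywords))
    return tokens


legacy_java = "int SELECT = 0;"
print(scan_context_unaware(legacy_java))   # SELECT is taken for a keyword
print(scan_with_boundaries(legacy_java))   # SELECT stays an identifier
```

The second scanner still treats SELECT as a keyword inside an embedded fragment such as <| SELECT customer FROM accounts |>, so existing Java code keeps working while the embedded SQL is recognized. This is only a rough approximation of context-aware scanning, in which the parser state rather than a marker determines the set of valid tokens.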

7.2 Fundamentals of Projectional Editing

In parser-based approaches, users use text editors to enter character sequences that represent programs. A parser then checks the program for syntactic correctness and constructs an AST from the character sequence. The AST contains all the semantic information expressed by the program.

In projectional editors, the process happens the other way round: as a user edits the program, the AST is modified directly. A projection engine then creates some representation of the AST with which the user interacts, and which reflects the changes. This approach is well-known from graphical editors in general, and the model-view-controller (MVC) pattern specifically. When users edit a UML diagram, they don’t draw pixels onto a canvas, and a "pixel parser" then creates the AST. Rather, the editor creates an instance of uml.Class as you drag a class from the palette to the canvas. A projection engine renders the diagram, in this case drawing a rectangle for the class. Projectional editors generalize this approach to work with any notation, including textual.

Figure 7.4: In projectional systems, the user sees the concrete syntax, but all editing gestures directly influence the AST. The AST is not extracted from the concrete syntax, which means the CS does not have to be parseable.

This explicit instantiation of AST objects happens by picking the respective concept from the code completion menu using a character sequence defined by the respective concept (typically the "leading keyword" of the respective program construct, or the name of a referenced variable). If at any given program location two concepts can be instantiated using the same character sequence, then the projectional editor prompts the user to decide33. Once a concept is instantiated, it is stored in the AST as a node with a unique ID (UID). References between program elements are pointers to this UID, and the projected syntax that represents the reference can be arbitrary. The AST is actually an abstract syntax graph from the start because cross-references are first-class rather than being resolved after parsing34. The program is stored using a generic tree persistence mechanism, often XML35.

33 As discussed above, this is the situation where many grammar-based systems run into problems from ambiguity.

34 There is still one single containment hierarchy, so it is really a tree with cross-references.

35 And yes, the tools provide diff/merge based on the projected syntax, not based on XML.

Defining a projectional editor, instead of defining a grammar, involves the definition of projection rules that map language concepts to a notation. It also involves the definition of event handlers that modify the AST based on a user’s editing gestures. The way to define the projection rules and the event handlers is specific to the particular tool used.

The projectional approach can deal with arbitrary syntactic forms including traditional text, symbols (as in mathematics), tables or graphics. Since no grammar is used, grammar classes are not relevant here. In principle, projectional editing is simpler than parsing, since there is no need to "extract" the program structure from a flat textual source. However, as we will see below, the challenge in projectional editing lies in making the editing experience convenient36. Modern projectional editors, and in particular MPS, do a good job in meeting this challenge.

36 In particular, editing notations that look like text should be editable with the editing gestures known from text editors.
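The storage model described above can be sketched in a few lines of Python. This is only an illustration and not how MPS is implemented: the Node class and the index and project functions are hypothetical names chosen for this sketch. It shows nodes that carry a unique ID, a reference stored as a pointer to that UID, and projection rules that render text from the tree instead of parsing text into a tree.

```python
import itertools

_ids = itertools.count(1)  # hypothetical UID generator for this sketch

class Node:
    """An AST node: an instance of a concept, identified by a unique ID."""
    def __init__(self, concept, **props):
        self.uid = next(_ids)
        self.concept = concept
        self.props = props      # primitive properties, e.g. a name
        self.children = []      # containment: the single tree hierarchy
        self.refs = {}          # role -> UID of the referenced node

def index(root):
    """Map every UID in the containment tree to its node."""
    table = {root.uid: root}
    for child in root.children:
        table.update(index(child))
    return table

def project(node, table):
    """Projection rules: one textual rendering per concept."""
    if node.concept == "VarDecl":
        return f"int {node.props['name']};"
    if node.concept == "VarRef":
        # The reference shows whatever the target is called right now.
        return table[node.refs["variable"]].props["name"]
    if node.concept == "Block":
        return " ".join(project(c, table) for c in node.children)
    raise ValueError(f"no projection rule for {node.concept}")

block = Node("Block")
decl = Node("VarDecl", name="a")
ref = Node("VarRef")
ref.refs["variable"] = decl.uid      # pointer to the UID, not to the name
block.children += [decl, ref]

print(project(block, index(block)))  # int a; a
decl.props["name"] = "counter"       # rename the declaration only
print(project(block, index(block)))  # int counter; counter
```

Because the reference points to a UID, renaming the declaration changes how the reference is rendered without touching the reference node itself.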

7.3 Comparing Parsing and Projection

7.3.1 Editing Experience

In free text editing, any regular text editor will do. However, users expect a powerful IDE that includes support for syntax coloring, code completion, go-to-definition, find references, error annotations, refactoring and the like. Xtext and Spoofax provide IDE support that is essentially similar to what a modern IDE provides for mainstream languages (e.g. Eclipse for Java)37. However, you can always go back to any text editor to edit the programs.

37 We assume that you are familiar with modern IDEs, so we do not discuss their features in great detail in this section.

Figure 7.5: An mbeddr example program using five separate but integrated languages. It contains a module with an enum, a state machine (Counter) and a function (nextMode) that contains a decision table. Inside both of them developers can write regular C code. The IDE provides code completion for all language extensions (see the start/stop suggestions) as well as static error validation (Error... hover). The green trace annotations are traces to requirements that can be attached to arbitrary program elements. The red parts with the {resettable} next to them are presence conditions (in the context of product line engineering): the respective elements are only in a program variant if the configuration feature resettable is selected.

In projectional editing, this is different: a normal text editor is obviously not sufficient; a specialized editor has to be supplied to perform the projection (an example is shown in Fig. 7.5). As in free text editing, it has to provide the IDE support features mentioned above. MPS provides those. However, there is another challenge: for textual-looking notations, it is important that the editor tries to make the editing experience as text-like as possible, i.e. the keyboard actions we have become used to from free-text editing should work as far as possible. MPS does a decent job here, using, among others, the following strategies38:

38 The following list may be hard to relate to if you have never used a projectional editor. However, understanding this section in detail is not essential for the rest of the book.

• Every language concept that is legal at a given program location is available in the code completion menu. In naive implementations, users have to select the language concept based on its name and instantiate it. This is inconvenient. In MPS, languages can instead define aliases for language concepts, allowing users to "just type" the alias, after which the concept is immediately instantiated39.

• Side transforms make sure that expressions can be entered conveniently. Consider a local variable declaration int a = 2;. If this should be changed to int a = 2 + 3; the 2 in the init expression needs to be replaced by an instance of the binary + operator, with the 2 in the left slot and the 3 in the right. Instead of removing the 2 and manually inserting a +, users can simply type + on the right side of the 2. The system performs the tree restructuring that moves the + to the root of the subtree, puts the 2 in the left slot, and then puts the cursor into the right slot, so the user can enter the second argument. This means that expressions (or anything else) can be entered linearly, as expected. For this to work, operator precedence has to be specified, and the tree has to be constructed taking these precedences into account. Precedence is typically specified by a number associated with each operator, and whenever a side transform is used to build an expression, the tree is automatically reshuffled to make sure that those operators with a higher precedence number are further down in the tree.

• Delete actions are used to similar effect when elements are deleted. Deleting the 3 in 2 + 3 first keeps the plus, with an empty right slot. Deleting the + then removes the + and puts the 2 at the root of the subtree.

• Wrappers support instantiation of concepts that are actually children of the concepts allowed at a given location. Consider again a local variable declaration int a;. The respective concept could be LocalVariableDeclaration, a subconcept of Statement, to make it legal in method bodies (for example). However, users simply want to start typing int, i.e. selecting the content of the type field of the LocalVariableDeclaration. A wrapper can be used to support entering Types where LocalVariableDeclarations are expected. Once a Type is selected, the wrapper implementation creates a LocalVariableDeclaration, puts the Type into its type field, and moves the cursor into the name slot. Summing up, this means that a local variable declaration int a; can be entered by starting to type the int type, as expected.

• Smart references achieve a similar effect for references (as opposed to children). Consider pressing Ctrl-Space after the + in 2 + 3. Assume further that a couple of local variables are in scope and that these can be used instead of the 3. These should be available in the code completion menu. However, technically, a VariableReference has to be instantiated, whose variable slot is then made to point to any of the variables in scope. This is tedious. Smart references trigger special editor behavior: if in a given context a VariableReference is allowed, the editor first evaluates its scope to find the possible targets, then puts those targets into the code completion menu. If a user selects one, then the VariableReference is created, and the selected element is put into its variable slot. This makes the reference object effectively invisible in terms of the editing experience.

• Smart delimiters are used to simplify inputting list-like data, where elements are separated with a specific separator symbol. An example is argument lists in functions: once a parameter is entered, users can press comma, i.e. the list delimiter, to instantiate the next element.

39 By making the alias the same as the leading keyword (e.g. if for an IfStatement), users can "just type" the code.
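The side transform strategy from the list above can be sketched as a small tree rewrite. The following Python is an illustration under simplifying assumptions, not MPS code: the names Num, BinOp, type_operator and fill_slot are hypothetical, the cursor is assumed to sit at the right end of the expression, and all operators are treated as left-associative. Typing an operator wraps the largest right-hand subtree whose operators bind at least as tightly, so that operators with a higher precedence number end up further down in the tree.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Precedence numbers: a higher number binds tighter and sits lower in the tree.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str
    left: "Expr"
    right: Optional["Expr"] = None   # None models an empty slot

Expr = Union[Num, BinOp]

def type_operator(root, op):
    """Side transform: the user types `op` at the right end of the expression."""
    if isinstance(root, BinOp) and PRECEDENCE[root.op] < PRECEDENCE[op]:
        # The existing operator binds weaker: the new one goes below it.
        root.right = type_operator(root.right, op)
        return root
    # Otherwise the new operator becomes the root of this subtree.
    return BinOp(op, left=root)

def fill_slot(root, value):
    """The user types a number into the empty right slot."""
    node = root
    while isinstance(node.right, BinOp):
        node = node.right
    assert node.right is None, "no empty slot to fill"
    node.right = Num(value)
    return root

def show(e):
    if e is None:
        return "_"
    if isinstance(e, Num):
        return str(e.value)
    return f"({show(e.left)} {e.op} {show(e.right)})"

tree = Num(2)
tree = fill_slot(type_operator(tree, "+"), 3)
tree = type_operator(tree, "*")
print(show(tree))                       # (2 + (3 * _))
tree = fill_slot(tree, 4)
tree = fill_slot(type_operator(tree, "-"), 1)
print(show(tree))                       # ((2 + (3 * 4)) - 1)
```

The user types 2 + 3 * 4 - 1 strictly from left to right, and the tree is reshuffled after each operator so that the result matches what a parser with the same precedences would have built.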

Except for having to get used to the somewhat different way of editing programs, the strategies mentioned above (plus some others) result in a reasonably good editing experience. Traditionally, projectional editors have not used these or similar strategies, and projectional editors have acquired a bit of a bad reputation because of that. In the case of MPS this tool support is available, and hence MPS provides a productive and pleasant working environment.

7.3.2 Language Modularity

As we have seen in Section 4.6, language modularization and composition is an important building block in working with DSLs. Parser-based and projectional editors come with different trade-offs in this respect.

In parser-based systems the extent to which language composition can be supported depends on the supported grammar class. As we have said above, the problem is that the result of combining two or more independently developed grammars into one may become ambiguous, for example, because the same character sequence is defined as two different tokens. The resulting grammar cannot be parsed and has to be disambiguated manually, typically by invasively changing the composite grammar. This of course breaks modularity and hence is not an option. Parsers that do not support the full set of context-free grammars, such as ANTLR, and hence Xtext, have this problem. Parsers that do support the full set of context-free grammars, such as the GLR parser used as part of Spoofax, are better off. While a grammar may become ambiguous in the sense that a program may be parseable in more than one way, this can be resolved by declaratively specifying which alternative should be used. This specification can be made externally, without invasively changing either the composed or the component grammars, retaining modularity.

In projectional editors, language modularity and composition is not a problem at all40. There is no grammar, no parsing, no grammar classes, and hence no problem with composed grammars becoming ambiguous. Any combination of languages will be syntactically valid. In cases where a composed language would be ambiguous in a GLR-based system, the user has to make a disambiguating decision as the program is entered. For example, in MPS, if at a given location two language concepts are available with the same alias, just typing the alias won’t bind, and the user has to manually decide by picking one alternative from the code completion menu.

40 An example for a composed language is shown in Fig. 7.5. It contains code expressed in C, in a statemachines extension, a decision tables extension and in languages for expressing requirements traces and product line variability.

7.3.3 Notational Freedom

Parser-based systems process linear sequences of character symbols. Traditionally, the character symbols were taken from the ASCII character set, resulting in textual programs being made up from "plain text". With the advent of Unicode, a much wider variety of characters is available while still sticking to the linear sequence of characters approach. For example, the Fortress programming language41 makes use of this: Greek letters and a wide variety of different bracket styles can be used in Fortress programs. However, character layout is always ignored. For example it is not possible to use parsers to handle tabular notations, fraction bars or even graphics42.

41 en.wikipedia.org/wiki/Fortress_(programming_language)

42 There have been experimental parsers for two-dimensional structures such as tables and even for graphical shapes, but these have never made it beyond the experimental stage. Also, it is possible to approximate tables by using vertical bars and hyphens to some extent. JNario, described in Section 19, uses this approach.

In projectional editing, this limitation does not exist. A projectional editor never has to extract the AST from the concrete syntax; editing gestures directly influence the AST, and the concrete syntax is rendered from the AST. This mechanism is basically like a graphical editor and notations other than text can be used easily. For example, MPS supports tables, fraction bars and "big math" symbols43. Since these non-textual notations are handled in the same way as the textual ones (possibly with other input gestures), they can be mixed easily44: tables can be embedded into textual source, and textual languages can be used within table cells (see Fig. 7.6).

Figure 7.6: A table embedded in an otherwise textual program

43 The upcoming version 3.0 of MPS will also support graphical notations.

44 Of course, the price you pay is the somewhat different style of interacting with the editor, which, as we have said, approximates free text editing quite well, but not perfectly.

7.3.4 Language Evolution

If the language changes, existing instance models temporarily become outdated, in the sense that they were developed for the old version of the language. If the new language is not backward compatible, these existing models have to be migrated to conform to the updated language.

Since projectional editors store the models as structured data in which each program node points to the language concept it is an instance of, the tools have to take special care that such "incompatible" models can still be opened and then migrated, manually or by a script, to the new version of the language. MPS supports this feature, and it is also possible to distribute migration scripts with (updated) languages to run the migration automatically45.

45 It is also possible to define quick fixes that run automatically; so whenever a concept is marked as deprecated, this quick fix can trigger an automatic migration to a new concept.

Most textual IDEs do not come with explicit support for evolving programs as languages change. However, since a model is essentially a sequence of characters, it can always be opened in the editor. The program may not be parseable, but users can always update the program manually, or with global search and replace using regular expressions. More complex migrations may require explicit support via transformations on the AST.
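To make the idea of a migration on structured data more concrete, here is a minimal Python sketch. It is not an MPS migration script: the model is represented as plain dictionaries, and the concept names Delay and Wait as well as the properties seconds and ms are hypothetical, invented only for this illustration. The script walks the tree and rewrites every instance of the old concept into the new one.

```python
def migrate(node):
    """Rewrite instances of a deprecated concept in place, recursively."""
    if node["concept"] == "Delay":               # old concept (hypothetical)
        node["concept"] = "Wait"                 # new concept (hypothetical)
        node["ms"] = node.pop("seconds") * 1000  # property changed its unit
    for child in node.get("children", []):
        migrate(child)
    return node

model = {
    "concept": "Program",
    "children": [
        {"concept": "Delay", "seconds": 2},
        {"concept": "Beep"},
    ],
}

migrate(model)
print(model["children"][0])   # {'concept': 'Wait', 'ms': 2000}
```

Because each node names the concept it is an instance of, such a script can find all affected nodes without parsing anything; in a purely textual setting the same change would need a search-and-replace or a transformation on the parsed AST.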

7.3.5 Infrastructure Integration

Today’s software development infrastructure is typically text-oriented. Many tools used for diff and merge, or tools like grep and regular expressions, are geared towards textual storage. Programs written with parser-based textual DSLs (stored as plain text) integrate automatically and nicely with these tools.

In projectional IDEs, special support needs to be provided for infrastructure integration. Since the CS is not pure text, a generic persistence format is used, typically based on XML. While XML is technically text as well, it is not practical to perform diff, merge and the like on the level of the XML. Therefore, special tools need to be provided for diff and merge. MPS provides integration with the usual version control systems and handles diff and merge in the IDE, using the concrete, projected syntax46. Fig. 7.7 shows an example of an MPS diff. However, it clearly is a drawback of projectional editing (and the associated abstract syntax-based storage) that many well-known text utilities don’t work47.

46 Note that since every program element has a unique ID, move can potentially be distinguished from delete/create, providing richer semantics for diff and merge.

47 For example, web-based diffs in GitHub or Gerrit are not very helpful when working with MPS.

Also, copy and paste with textual environments may be a challenge. MPS, for example, supports pasting a projected program that has a textual-looking syntax into a text editor. For the way back (from a textual environment to the projectional editor), there is no automatic support. However, special support for specific languages can be provided via paste handlers. Such a paste handler is available for Java, for example: when a user pastes Java text into a Java program in MPS, a parser is executed that builds the respective MPS tree48.

48 While this works reasonably well for Java, it has to be developed specifically for each language used in MPS. If a grammar for the target language is available for a Java-based parser generator, it is relatively simple to provide such an integration.

Figure 7.7: The diff/merge tool presents MPS programs in their concrete syntax, i.e. text for textual notations. However, other notations, such as tables, would also be rendered in their native form.

7.3.6 Tool Lock-in

In the worst case, textual programs can be edited with any text editor. Unless you are prepared to edit XML, programs expressed with a projectional editor always require that editor to edit programs. As soon as you take IDE support into account though, both approaches lock users into a particular tool. Also, there is essentially no standard for exchanging language definitions between the various language workbenches49. So the effort of implementing a language is always lost if the tool must be changed.

49 There is some support for exchanging the abstract syntax based on formalisms such as MOF or Ecore, but most of the effort for implementing a language is in areas other than the abstract syntax.

7.3.7 Other

In parser-based systems, the complete AST has to be reconstructable from the CS. This implies that there can be no information in the tree that is not obtained from parsing the text.

This is different in projectional editors. For example, the textual notation could project only a subset of the information in the tree. The same information can be projected with different projections, each possibly tailored to a different stakeholder, and showing a different subset of the overall data. Since the tree uses a generic persistence mechanism, it can hold data that has not been planned for in the original language definition. All kinds of meta data (documentation, presence conditions, requirements traces) can be stored, and projected if required50.

50 MPS supports annotations, where additional data can be added to model elements of existing languages. The data can be projected inside the original program’s projection, all without changing the original language specification.

7.4 Characteristics of AST Formalisms

Most AST formalisms, aka meta meta models51, are ways to represent trees or graphs. Usually, such an AST formalism is "meta circular" in the sense that it can describe itself.

51 Abstract syntax and meta model are typically considered synonyms, even though they have different histories (the former comes from the parser/grammar community, whereas the latter comes from the modeling community). Consequently, the formalisms for defining ASTs are conceptually similar to meta meta models.

This section is a brief overview over the three AST formalisms relevant to Xtext, Spoofax and MPS. We will illustrate them in more detail in the respective tool example sections.

7.4.1 EMF Ecore

The Eclipse Modeling Framework52 (EMF) is at the core of all Eclipse Modeling tools. It provides a wide variety of services and tools for persisting, editing and processing models and abstract syntax definitions. EMF has grown to be a fairly large ecosystem within the Eclipse community and numerous projects use EMF as their basis.

52 www.eclipse.org/modeling/emf/

Its core component is Ecore, a variant of the EMOF standard53. Ecore acts as EMF’s meta meta model. Xtext uses Ecore as the foundation for the AS: from a grammar definition, Xtext derives the AS as an instance of Ecore. Ecore’s central concepts are: EClass (representing AS elements or language concepts), EAttribute (representing primitive properties of EClasses), EReference (representing associations between EClasses) and EObject (representing instances of EClasses, i.e. AST nodes). EReferences can have containment semantics or not and each EObject can be contained by at most one EReference instance. Fig. 7.8 shows a class diagram of Ecore.

53 en.wikipedia.org/wiki/Meta-Object_Facility

When working with EMF, the Ecore file plays a central role. From it, all kinds of other aspects are derived; specifically, a generic tree editor and a generated Java API for accessing an AST. It also forms the basis for Xtext’s model processing: The Ecore file is derived from the grammar, and the parser, when executed, builds an in-memory tree of EObjects representing the AST of the parsed program.

Figure 7.8: The Ecore meta model rendered as a UML diagram.
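The rule that each EObject can be contained by at most one EReference instance is easiest to see in a small model. The following Python sketch uses plain stand-in classes that merely borrow the names EClass and EObject; it is not the EMF API, and the add method and the links and container fields are assumptions made for this illustration. Adding an object to a containment reference detaches it from its previous container, whereas a non-containing reference leaves the container untouched.

```python
class EClass:
    """Stand-in for a meta class: attribute names plus named references."""
    def __init__(self, name, attributes=(), references=None):
        self.name = name
        self.attributes = tuple(attributes)
        self.references = dict(references or {})  # name -> is containment?

class EObject:
    """Stand-in for an AST node, i.e. an instance of an EClass."""
    def __init__(self, eclass, **attrs):
        self.eclass = eclass
        self.attrs = attrs
        self.links = {name: [] for name in eclass.references}
        self.container = None   # (owner, reference name) or None

    def add(self, reference, target):
        if self.eclass.references[reference]:     # containment reference
            if target.container is not None:      # detach from old owner
                owner, ref = target.container
                owner.links[ref].remove(target)
            target.container = (self, reference)
        self.links[reference].append(target)

program_cls = EClass("CoolingProgram", ["name"], {"states": True})
state_cls = EClass("State", ["name"])
change_cls = EClass("ChangeStateStatement", references={"targetState": False})

p1 = EObject(program_cls, name="A")
p2 = EObject(program_cls, name="B")
s = EObject(state_cls, name="s")
jump = EObject(change_cls)

p1.add("states", s)
jump.add("targetState", s)   # cross-reference: the container stays p1
p2.add("states", s)          # containment: s moves from p1 to p2

print(len(p1.links["states"]), len(p2.links["states"]))  # 0 1
print(s.container[0].attrs["name"])                       # B
```

This single-container rule is what keeps the containment hierarchy a tree even though cross-references turn the overall structure into a graph.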

7.4.2 Spoofax’ ATerm

Spoofax uses the ATerm format to represent abstract syntax. ATerm provides a generic tree structure representation format that can be serialized textually similar to XML or JSON. Each tree node is called an ATerm, or simply a term. Terms consist of the following elements: Strings ("Mr. White"), Numbers (15), Lists ([1,2,3]) and constructor applications (Order(5, 15, "Mr. White")) for labeled tree nodes with a fixed number of children.

The structure of valid ATerms is specified by an algebraic signature. Signatures are typically generated from the concrete-syntax definition, but can also be specified manually. A signature introduces one or more algebraic sorts, i.e. collections of terms. The sorts String, Int, and List54 are predefined.

54 List is a parameterized sort, i.e. it takes the sort of the list elements as a parameter.

User-defined sorts are inhabited by declaring term constructors and injections. A constructor has a name and zero or more subterms. It is declared by stating its name, the list of sorts of its direct subterms, and the sort of the constructed term. Constructor names may be overloaded. Injections are declared as nameless constructors. The following example shows a signature for expressions:

    signature
      sorts
        Exp
      constructors
        Plus  : Exp * Exp -> Exp
        Times : Exp * Exp -> Exp
              : Int -> Exp

The signature declares sort Exp with its constructors Plus and Times, which both require two expressions as direct subterms. Basic expressions are integers, as declared by the injection rule : Int -> Exp. Compared to XML or JSON, perhaps the most significant distinction is that ATerms rely on the order of subterms rather than on labels. For example, a product may be modeled in JSON as follows:

    {
      "product": {
        "itemnumber": 5,
        "quantity": 15,
        "customer": "Mr. White"
      }
    }

Note how this specification includes the actual data describing the particular product (the model), but also a description of each of the elements (the meta model). With XML, a product would be modeled in a similar fashion. An equivalent of the JSON above written in ATerm format would be the following:

    Order([ItemNumber(5), Quantity(15), Customer("Mr. White")])

However, this representation contains a lot of redundant information that also exists in the grammar. Instead, such a product can be written as Order(5, 15, "Mr. White"). This more concise notation tends to make it slightly more convenient to use in handwritten transformations.

The textual notation of ATerms can be used for exchanging data between tools and as a notation for model transformations or code generation rules. In memory, ATerms can be stored in a tool-specific way (i.e. simple Java objects in the case of Spoofax)55.

55 The generic structure and serializability of ATerms also allows them to be converted to other data formats. For example, the aterm2xml and xml2aterm tools can convert between ATerms and XML.

In addition to the basic elements above, ATerms support annotations to add additional information to terms. These are similar to attributes in XML. For example, it is possible to annotate a product number with its product name:

Order(5{ProductName("Apples")}, 15, "Mr. White")

Spoofax also uses annotations to add information about refer- ences to other parts of a model to an abstract syntax tree. While ATerms only form trees, the annotations are used to represent the graph-like references.
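The positional nature of terms and the role of annotations can be illustrated with a short Python sketch. This is not the ATerm library used by Spoofax: the classes Appl and Annotated and the function show are hypothetical names, and string escaping is deliberately ignored. The sketch builds the annotated order from the example above and serializes it to the textual notation.

```python
from dataclasses import dataclass

@dataclass
class Appl:
    """Constructor application: a labeled node with ordered subterms."""
    name: str
    args: tuple = ()

@dataclass
class Annotated:
    """Any term together with a list of annotation terms."""
    term: object
    annos: tuple = ()

def show(t):
    """Serialize a term to text (strings are not escaped in this sketch)."""
    if isinstance(t, Annotated):
        return show(t.term) + "{" + ",".join(show(a) for a in t.annos) + "}"
    if isinstance(t, Appl):
        return t.name + "(" + ", ".join(show(a) for a in t.args) + ")"
    if isinstance(t, list):
        return "[" + ",".join(show(x) for x in t) + "]"
    if isinstance(t, str):
        return '"' + t + '"'
    return str(t)   # numbers

order = Appl("Order", (
    Annotated(5, (Appl("ProductName", ("Apples",)),)),
    15,
    "Mr. White",
))

print(show(order))   # Order(5{ProductName("Apples")}, 15, "Mr. White")

# Subterms are identified by position, not by label.
item, quantity, customer = order.args
print(quantity, customer)   # 15 Mr. White
```

The meaning of each position (item number, quantity, customer) is not stored in the term itself; it comes from the signature, which is why the concise form carries less redundancy than the labeled JSON version.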

7.4.3 MPS’ Structure Definition

In MPS, programs are trees/graphs of nodes. A node is an instance of a concept which defines the structure, syntax, type system and semantics of its instance nodes56.

56 The term concept used in this book, to refer to language constructs including abstract syntax, concrete syntax and semantics, is inspired by MPS’ use of the term.

Like EClasses57, concepts are meta circular, i.e. there is a concept that defines the properties of concepts:

    concept ConceptDeclaration extends AbstractConceptDeclaration
                               implements INamedConcept

      instance can be root: false

      properties:
        helpURL  : string
        rootable : boolean

      children:
        InterfaceConceptReference   implementsInterfaces        0..n
        LinkDeclaration             linkDeclaration             0..n
        PropertyDeclaration         propertyDeclaration         0..n
        ConceptProperty             conceptProperty             0..n
        ConceptLink                 conceptLink                 0..n
        ConceptPropertyDeclaration  conceptPropertyDeclaration  0..n
        ConceptLinkDeclaration      conceptLinkDeclaration      0..n

      references:
        ConceptDeclaration          extendsConcept              0..1

57 Nodes correspond to EObjects in EMF, concepts resemble EClasses.

A concept may extend a single other concept and implement any number of interfaces58. It can declare references (non-containing) and children (containing). It may also have a number of primitive-type properties as well as a couple of "static" features. In addition, concepts can have behavior methods.

58 Note that interfaces can provide implementations for the methods they specify; they are hence more like Scala traits or mixins known from AOP and various programming languages.

While the MPS structure definition is proprietary to MPS and does not implement any accepted industry standard, it is conceptually very close to Ecore59.

59 This is illustrated by the fact that exporters and importers to and from Ecore have been written.

7.5 Xtext Example

Cooling programs60 represent the behavioral aspect of the refrigerator descriptions.

60 This and the other examples in this section refer back to the case studies introduced at the beginning of the book in Section 1.11.

Here is a trivial program that can be used to illustrate some of the features of the language. The program is basically a state machine.

    cooling program HelloWorld uses stdlib {

        var v: int
        event e

        init { set v = 1 }

        start:
            on e { state s }

        state s:
            entry { set v = 0 }

    }

The program declares a variable v and an event e. When the program starts, the init section is executed, setting v to 1. The system then (automatically) transitions into the start state. There it waits until it receives the e event. It then transitions to the state s, where it uses an entry action to set v back to 0. More complex programs include checks of changes of properties of hardware elements (aCompartment->currentTemp) and commands to the hardware (set aCompartment->isCooling = true), as shown in the next snippet:

    start:
        check ( aCompartment->currentTemp > maxTemp ) {
            set aCompartment->isCooling = true
            state initialCooling
        }
        check ( aCompartment->currentTemp <= maxTemp ) {
            state normalCooling
        }

    state initialCooling:
        check ( aCompartment->currentTemp < maxTemp ) {
            state normalCooling
        }

■ Grammar Basics In Xtext, the syntax is specified using an EBNF-like notation, a collection of productions that are typically called parser rules. These rules specify the concrete syntax of a program element, as well as its mapping to the AS. From the grammar, Xtext generates the abstract syntax represented in Ecore61.

It is also possible to first create the Ecore meta model and then define a grammar for it. While this is a bit more work, it is also more flexible, because not all possible Ecore meta models can be described implicitly by a grammar. For example, Ecore interfaces cannot be expressed from the grammar. A middle ground is to have Xtext generate the meta model while the grammar is still in flux and then switch to maintaining the meta model manually when the grammar stabilizes.

61 The entity that contains the meta classes is actually called an EPackage.

Here is the definition of the CoolingProgram rule:

    CoolingProgram:
        "cooling" "program" name=ID "{"
            (events+=CustomEvent | variables+=Variable)*
            (initBlock=InitBlock)?
            (states+=State)*
        "}";

Rules begin with the name (CoolingProgram in the example above), a colon, and then the rule body. The body defines the syntactic structure of the language concept defined by the rule. In our case, we expect the keywords cooling and program, followed by an ID. ID is a terminal rule that is defined in the parent grammar from which we inherit (not shown). ID is defined as an unbounded sequence of lowercase and uppercase characters, digits, and the underscore, although it may not start with a digit. This terminal rule is defined as follows:

    terminal ID:
        ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
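For readers more familiar with regular expressions, the ID terminal rule quoted above corresponds to the following pattern. The Python snippet is only an illustration of which strings the rule accepts; it is not part of Xtext, and the sample identifiers are made up.

```python
import re

# First character: a letter or underscore; then letters, digits or underscores.
ID = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")

samples = ["HelloWorld", "_tmp1", "v", "1st", "max-temp"]
results = {text: bool(ID.fullmatch(text)) for text in samples}

for text, ok in results.items():
    print(text, ok)
# HelloWorld True
# _tmp1 True
# v True
# 1st False
# max-temp False
```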

In pure grammar languages, one would typically write the following:

    "cooling" "program" ID "{" ... "}"

This expresses the fact that after the two keywords we expect an ID. However, Xtext grammars don’t just express the concrete syntax; they also determine the mapping to the AS. We have encountered two such mappings so far. The first one is implicit: the name of the rule will be the name of the derived meta class62. So we will get a meta class CoolingProgram. The second mapping we have encountered is name=ID. It specifies that the meta class gets a property name that holds the contents of the ID from the parsed program text. Since nothing else is specified in the ID terminal rule, the type of this property defaults to EString, Ecore’s version of a string data type.

62 If we start with the meta model and then define the grammar, it is possible to have grammar rule names that are different from meta class names.

The rest of the definition of a cooling program is enclosed in curly braces. It contains three elements: a collection of events and variables (the * specifies unbounded multiplicity), an optional init block (optionality is specified by the ?) and a list of states. Let us inspect each of these in more detail.

The expression (states+=State)* specifies that there can be any number of State instances in the program. The CoolingProgram meta class gets a property states of type State (the meta class derived from the State rule). Since we use the += operator, the states property will be typed to be a list of States. In the case of the optional init block, the meta class will have an initBlock property, typed as InitBlock (whose parser rule we don’t show here), with a multiplicity of 0..1. Events and variables are more interesting, since the vertical bar operator is used within the parentheses. The asterisk expresses the fact that whatever is inside the parentheses can occur any number of times63. Inside the parentheses we expect either a CustomEvent or a Variable, which is expressed with the |. Variables are assigned to the variables collection, events are assigned to the events collection. This notation means that we can mix events and variables in any order. The following alternative notation would first expect all events, and then all variables.

63 While there are exceptions, the use of a * usually goes hand in hand with the use of a +=.

(events+=CustomEvent)* (variables+=Variable)*

The definition of State is interesting, since State is intended to be an abstract meta class with several subtypes.

State: BackgroundState | StartState | CustomState;

The vertical bar operator is used here to express syntactic alternatives. This is translated to inheritance in the meta model. The definition of CustomState is shown in the following code snippet:

    CustomState:
        "state" name=ID ":"
            (invariants+=Invariant)*
            ("entry" "{" (entryStatements+=Statement)* "}")?
            ("eachTime" "{" (eachTimeStatements+=Statement)* "}")?
            (events+=EventHandler | signals+=SignalHandler)*;

StartState and BackgroundState, the other two subtypes of State, share some properties. Consequently, Xtext’s AS derivation algorithm pulls them up into the abstract State meta class. This way they can be accessed polymorphically. Fig. 7.9 shows the resulting meta model using EMF’s tree view.

■ References Let us now look at statements and expressions. States have entry and exit actions which are procedural statements that are executed when a state is entered and left, respectively. The set v = 1 in the example program above is an example. Statement itself is abstract and has the various kinds of statements as subtypes/alternatives:

    Statement:
        Statement | AssignmentStatement | PerformAsyncStatement |
        ChangeStateStatement | AssertStatement;

    ChangeStateStatement:
        "state" targetState=[State];

    AssignmentStatement:
        "set" left=Expr "=" right=Expr;

The ChangeStateStatement is used to transition into another state. It uses the keyword state followed by a reference to the target state. Notice how Xtext uses square brackets to express the fact that the targetState property points to an existing state, as opposed to containing a new one (which would be written as targetState=State); i.e. the square brackets express non-containing cross-references.

This is another example of where the Xtext grammar language goes beyond classical grammar languages, where one would write "state" targetStateName=ID;. Writing it in this way only specifies that we expect an ID after the state keyword. The fact that we call it targetStateName communicates to the programmer that we expect this text string to correspond to the name of a state; a later phase in model processing resolves the name to an actual state reference. Typically, the code to resolve the reference has to be written manually, because there is no way for the tool to derive from the grammar automatically the fact that this ID is actually a reference to a State. In Xtext, the targetState=[State] notation makes this explicit, so the resolution of the reference can be automatic. This approach also has the advantage that the resulting meta class types the targetState property to State (and not just to a string), which makes processing the models much easier.
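The name resolution step described above can be sketched as follows. This Python code illustrates the kind of linking code one would have to write by hand with a classical grammar; it is not Xtext’s linker, and the classes and the link function are hypothetical stand-ins. After parsing, each statement only holds the name as text; the linking phase replaces it with a typed reference to the State object, or reports an error if no such state exists.

```python
class State:
    def __init__(self, name):
        self.name = name

class ChangeStateStatement:
    def __init__(self, target_name):
        self.target_name = target_name   # the ID as written in the program
        self.target_state = None         # typed reference, set by the linker

def link(statements, states):
    """Resolve each target name to a State object; collect unresolved names."""
    by_name = {s.name: s for s in states}
    errors = []
    for stmt in statements:
        stmt.target_state = by_name.get(stmt.target_name)
        if stmt.target_state is None:
            errors.append(f"unresolved state '{stmt.target_name}'")
    return errors

states = [State("start"), State("s")]
statements = [ChangeStateStatement("s"), ChangeStateStatement("t")]

errors = link(statements, states)
print(statements[0].target_state is states[1])   # True
print(errors)                                    # ["unresolved state 't'"]
```

With the targetState=[State] notation the grammar already says which type the name refers to, so a tool can generate this step and the corresponding error reporting instead of leaving it to the language developer.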

Figure 7.9: Part of the Ecore meta model derived from the Xtext grammar for the cooling program. Grammar rules become EClasses, assignments in the grammar rules become EAttributes and EReferences. The notation A -> B symbolizes that A extends B.

Note that the cross-reference definition only specifies the target type (State) of the cross-reference, but not the concrete syntax of the reference itself. By default, the ID terminal is used for the reference syntax, i.e. a simple (identifier-like) text string is expected. However, this can be overridden by specifying the concrete syntax terminal behind a vertical bar in the reference64.

64 Notice that in this case the vertical bar does not represent an alternative; it is merely used as a separator between the target type and the terminal used to represent the reference.

In the following piece of code, the targetState reference uses the QID terminal as the reference syntax.

ChangeStateStatement: "state" targetState=[State|QID];

QID: ID ("." ID)*;

The other remaining detail is scoping. During the linking phase, where the text of ID (or QID) is used to find the target node, several objects with the same name might exist, or some target elements might not be visible based on the visibility rules of the language. To constrain the possible reference targets, scoping functions are used. These will be explained in the next chapter.

 Expressions The AssignmentStatement shown earlier is one of the statements that uses expressions. We repeat it here:

AssignmentStatement: "set" left=Expr "=" right=Expr;

The following snippet is a subset of the actual definition of expressions (we have omitted some additional expressions that don’t add anything to the description here).

Expr: ComparisonLevel;

ComparisonLevel returns Expression: AdditionLevel ((({Equals.left=current} "==") | ({LogicalAnd.left=current} "&&") | ({Smaller.left=current} "<")) right=AdditionLevel)?;

AdditionLevel returns Expression: MultiplicationLevel ((({Plus.left=current} "+") | ({Minus.left=current} "-")) right= MultiplicationLevel)*;

MultiplicationLevel returns Expression: PrefixOpLevel ((({Multi.left=current}"*") | ({Div.left=current} "/")) right=PrefixOpLevel)*;

PrefixOpLevel returns Expression: ({NotExpression} "!" "(" expr=Expr ")") | AtomicLevel;

AtomicLevel returns Expression: ({TrueLiteral} "true") | ({FalseLiteral} "false") | ({ParenExpr} "(" expr=Expr ")") | ({NumberLiteral} value=DECIMAL_NUMBER) | ({SymbolRef} symbol=[SymbolDeclaration|QID]);

To understand the above definition, we first have to explain in more detail how AST construction works in Xtext. Obviously, as the text is parsed, meta classes are instantiated and the AST is assembled. However, instantiation of the respective meta class happens lazily, upon the first assignment to one of its properties. If no assignment is performed at all, no object is created. For example, in the grammar rule TrueLiteral: "true"; no instance of TrueLiteral will ever be created, because there is nothing to assign. In this case, an action can be used to force instantiation: TrueLiteral: {TrueLiteral} "true";65. Unless otherwise specified, an assignment such as name=ID is always interpreted as an assignment on the object that has been created most recently. The current keyword can be used to access that object in case it itself needs to be assigned to a property of another AST object.

65 Notice that the action can instantiate meta classes other than those that are derived from the rule name (we could write TrueLiteral: {SomeOtherThing} "true";). While this would not make sense in this case, we’ll use this feature later.

Now we know enough about AST construction to understand how expressions are encoded and parsed. In the expression grammar above, for the rules with the Level suffix, no meta classes are created, because (as Xtext is able to find out statically) they are never instantiated. They merely act as a way to encode precedence. To understand this, let’s consider how 2 * 3 is parsed:

• The AssignmentStatement refers to the Expr rule in its left and right properties, so we "enter" the expression tree at the level of Expr (which is the root of the expression hierarchy).

• The Expr rule just calls the ComparisonLevel rule, which calls AdditionLevel, and so on. No objects are created at this point, since no assignment to any property is performed.

• The parser "dives down" until it finds something that matches the first symbol in the parsed text: the 2. This occurs on AtomicLevel when it matches the DECIMAL_NUMBER terminal. At this point it creates an instance of the NumberLiteral meta class and assigns the number 2 to the value property. It also sets the current object to point to the just-created NumberLiteral, since this is now the AST object created most recently.

• The AtomicLevel rule ends, and the stack is unwound. We’re back at PrefixOpLevel, in the second branch. Since nothing else is specified after the call to AtomicLevel, we unwind once more.

• We’re now back at the MultiplicationLevel. The rule is not finished yet and we try to match a * or a /. The match on * succeeds. At this point the assignment action on the left side of the * kicks in (Multi.left=current). This action creates an instance of Multi, and assigns the current (the NumberLiteral created before) to its left property. Then it makes the newly created Multi the new current. At this point we have a subtree with the * at the root, and the NumberLiteral in the left property.

• The rule hasn’t ended yet. We dive down to PrefixOpLevel and AtomicLevel once more, matching the 3 in the same way as the 2 before. The NumberLiteral for 3 is assigned to the right property as we unwind the stack.

• At this point we unwind the stack further, and since no more text is present, no more objects are created. The tree structure has been constructed as expected.

If we’d parsed 4 + 2*3 the + would have matched before the *, because it is "mentioned earlier" in the grammar (it is in a lower-precedence group, the AdditionLevel, so it has to end up "higher" in the tree). Once we’re at 4 +, we’d go down again to match the 2. As we unwind the stack after matching the 2 we’d match the *, creating a Multi again. The current at this point would be the 2, so it would be put onto the left side of the *, making the * the current. Unwinding further, that * would be put onto the right side of the +, building the tree just as we’d expect.

Notice how a rule at a given precedence level only ever delegates to rules at higher precedence levels. So higher precedence rules always end up further down in the tree. If we want to change this, we can use parentheses (see the ParenExpr in the AtomicLevel): inside those, we can again embed an Expr, i.e. we jump back to the lowest precedence level66.

66 This somewhat convoluted approach to parsing expressions and encoding precedence is a consequence of the LL(k) grammar class supported by ANTLR, which underlies Xtext. A cleaner approach would be declarative, where precedences are encoded explicitly and the order of the parsing rules is not relevant. Spoofax uses this approach. Xtext also supports syntactic predicates, which are annotations in the grammar that tell the parser which alternative to take in the case of an ambiguity. We don’t discuss this any further in the book.

Once you understand the basic approach, it is easy to add new expressions with a precedence similar to another one (just add it as an alternative to the respective Level rule) or to introduce a new precedence level (just interject a new Level rule between two existing ones)67.

67 Note that the latter requires an invasive change to the grammar; this prevents the addition of operators with a precedence level between existing precedence levels in a sub-language, whose definition cannot invasively change the language it extends.
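The walk-through above can be replayed with a small hand-written recursive-descent parser. The following Python sketch is not what Xtext generates; it only mirrors the structure of the AdditionLevel, MultiplicationLevel and AtomicLevel rules, with a local variable named current playing the role of Xtext’s current. Comparison and prefix operators are left out, a single Binary class stands in for Plus, Minus, Multi and Div, and parentheses return the inner expression directly instead of creating a ParenExpr node. All of these are simplifying assumptions.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class NumberLiteral:
    value: int


@dataclass
class Binary:
    op: str                      # "+", "-", "*" or "/"
    left: "Expr"
    right: Optional["Expr"] = None


Expr = Union[NumberLiteral, Binary]


class Parser:
    def __init__(self, text: str) -> None:
        self.tokens: List[str] = re.findall(r"\d+|[-+*/()]", text)
        self.pos = 0

    def peek(self) -> str:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ""

    def take(self) -> str:
        token = self.peek()
        self.pos += 1
        return token

    def addition_level(self) -> Expr:
        current = self.multiplication_level()
        while self.peek() in ("+", "-"):
            # Like {Plus.left=current}: create the node, hand it the old current
            node = Binary(self.take(), left=current)
            node.right = self.multiplication_level()
            current = node       # the new node becomes the current object
        return current

    def multiplication_level(self) -> Expr:
        current = self.atomic_level()
        while self.peek() in ("*", "/"):
            node = Binary(self.take(), left=current)
            node.right = self.atomic_level()
            current = node
        return current

    def atomic_level(self) -> Expr:
        token = self.take()
        if token == "(":
            inner = self.addition_level()   # jump back to the lowest level
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return NumberLiteral(int(token))


def parse(text: str) -> Expr:
    parser = Parser(text)
    tree = parser.addition_level()
    if parser.peek():
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


tree = parse("4 + 2*3")
```

Because addition_level only calls multiplication_level, and multiplication_level only calls atomic_level, the * necessarily ends up below the + in the resulting tree, which is exactly the precedence encoding described above.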

7.6 Spoofax Example

Mobl’s68 data modeling language provides entities, properties and functions. To illustrate the language, below are two data type definitions related to a shopping list app. It supports lists of items that can be favorited, checked, and so on, and which are associated with some Store.

68 Mobl is a DSL for defining applications for mobile devices. It is based on HTML 5 and is closely related to WebDSL, which has been introduced earlier (Section 1.11).

module shopping

entity Item {
  name     : String
  checked  : Bool
  favorite : Bool
  onlist   : Bool
  order    : Num
  store    : Store
}

In Mobl, most files start with a module header, which can be followed by a list of entity type definitions. In turn, each entity can have one or more property or function definitions (shown in the next example snippet). Modules group entities. Inside a module, one can only access entities from the same module or from imported modules.

■ Grammar Basics In Spoofax, the syntax of languages is described using SDF69. SDF is short for Syntax Definition Formalism, and is a modular and flexible syntax definition formalism that is supported by the SGLR70 parser generator. It can generate efficient, Java-based scannerless and generalized parsers, allowing Spoofax to support the full class of context-free grammars, grammar composition, and modular syntax definitions.

69 www.syntax-definition.org/

70 SGLR parsing is a Scannerless, Generalized extension of LR parsing.

An example of a production written in SDF is:

"module" ID Entity* -> Start {"Module"}

The pattern on the left-hand side of the arrow is matched by the symbol Start on the right-hand side71. After the right-hand side, SDF productions may specify annotations using curly brackets. Most productions specify a quoted constructor label that is used for the abstract syntax. This particular production creates a tree node with the constructor Module and two children that represent the ID and the list of Entities respectively. As discussed earlier, Spoofax represents abstract syntax trees as ATerms. Thus, the tree node will be represented as Module(..., [...]). In contrast to Xtext, the children are not named; instead, they are identified via the position in the child collection (the ID is first, the Entity list is second).

71 Note that SDF uses the exact opposite order for productions as the grammars we’ve discussed so far, switching the left-hand and right-hand side.

Spoofax generates the following signature from the production above:

signature
  sorts
    Start
  constructors
    Module: ID * List(Entity) -> Start
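The consequence of positional children is easiest to see in a small model. The following Python sketch is not the ATerm library; Term and show are hypothetical helpers that only imitate the textual form of ATerms. The Entity("Item", []) child is likewise an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Term:
    constructor: str
    children: List[Any]   # strings, lists or nested Terms, identified by position only

    def __str__(self) -> str:
        return f"{self.constructor}({', '.join(show(c) for c in self.children)})"


def show(node: Any) -> str:
    """Render a child in ATerm-like textual form."""
    if isinstance(node, list):
        return "[" + ", ".join(show(n) for n in node) + "]"
    if isinstance(node, str):
        return f'"{node}"'
    return str(node)


module = Term("Module", ["shopping", [Term("Entity", ["Item", []])]])

# There is no module.name or module.entities: access is by index.
module_name = module.children[0]
entities = module.children[1]
```

Code that processes such terms therefore has to know the order of the symbols in the production, which is what the generated signature documents.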

The left-hand side of an SDF production is the pattern it matches against. SDF supports symbols, literals and character classes in this pattern. Symbols are references to other productions, such as ID. Literals are quoted strings such as "module" that must appear in the input literally. Character classes specify a range of characters expected in the input, e.g. [A-Za-z] specifies that an alphabetic character is expected. We discuss character classes in more detail below.

The basic elements of SDF productions can be combined using operators. The A* operator shown above specifies that zero or more occurrences of A are expected. A+ specifies that one or more are expected. A? specifies that zero or one are expected. {A B}* specifies that zero or more A symbols, separated by B symbols, are expected. As an example, {ID ","}* is a comma-separated list of identifiers. {A B}+ specifies one or more A symbols separated by B symbols.

Fig. 7.10 shows an SDF grammar for a subset of Mobl’s entities and functions syntax. The productions in this grammar should have few surprises, but it is interesting to note how SDF groups a grammar in different sections. First, the context-free start symbols section indicates the start symbol of the grammar. Then, the context-free syntax section lists the context-free syntax productions, forming the main part of the grammar. Terminals are defined in the lexical syntax section.

module MoblEntities

context-free start symbols

  Module

context-free syntax

  "module" ID Decl* -> Module {"Module"}
  "import" ID -> Decl {"Import"}
  "entity" ID "{" EntityBodyDecl* "}" -> Decl {"Entity"}

  ID ":" Type -> EntityBodyDecl {"Property"}
  "function" ID "(" {Param ","}* ")" ":" ID "{" Statement* "}" -> EntityBodyDecl {"Function"}
  ID ":" Type -> Param {"Param"}
  ID -> Type {"EntityType"}

  "var" ID "=" Exp ";" -> Statement {"Declare"}
  "return" Exp ";" -> Statement {"Return"}

  Exp "." ID "(" Exp ")" -> Exp {"MethodCall"}
  Exp "." ID -> Exp {"FieldAccess"}
  Exp "+" Exp -> Exp {"Plus"}
  Exp "*" Exp -> Exp {"Mul"}
  ID -> Exp {"Var"}
  INT -> Exp {"Int"}

lexical syntax

  [A-Za-z][A-Za-z0-9]* -> ID
  [0-9]+ -> INT
  [\ \t\n] -> LAYOUT

Figure 7.10: A basic SDF grammar for a subset of Mobl. The grammar does not yet specify the associativity, priority or name bindings of the language.

■ Lexical Syntax As Spoofax uses a scannerless parser, all lexical syntax can be customized in the SDF grammar72. Most lexical syntax is specified using character classes such as [0-9]. Each character class is enclosed in square brackets, and can consist of ranges of characters (c1-c2), letters and digits (e.g. x or 4), non-alphabetic literal characters (e.g., _), and escapes (e.g., \n). A complement of a character class can be obtained using the ~ operator, e.g. ~[A-Za-z] matches all non-alphabetic characters. For whitespace and comments a special terminal LAYOUT can be used. SDF implicitly inserts LAYOUT between all symbols in context-free productions. This behavior is the key distinguishing feature between context-free and lexical productions: lexical symbols such as identifiers and integer literals cannot be interleaved with layout. The second distinguishing feature is that lexical syntax productions usually do not have a constructor label in the abstract syntax, as they form terminals in the abstract syntax trees (i.e. they don’t own any child nodes).

72 It provides default definitions for common lexical syntax elements such as strings, integers, floats, whitespace and comments, which can be reused by importing the module Commons.

■ Abstract Syntax To produce abstract syntax trees, Spoofax uses the ATerm format, described in Section 7.4.2. SDF combines the specification of concrete and abstract syntax, primarily through the specification of constructor labels. Spoofax allows users to view the abstract syntax of any input file. As an example, the following is the textual representation of an abridged abstract syntax term for the shopping module shown at the beginning of this section:

Module(
  "shopping",
  [ Entity(
      "Item",
      [Property("name", EntityType("String")),
       Property("checked", EntityType("Bool")),
       ...]
    )
  ]
)

Note how this term uses the constructor labels of the syntax above: Module, Entity and Property. The children of each node correspond to the symbols referenced in the production: the Module production first references an ID symbol for the module name and then includes a list of Decl symbols (lists are in square brackets). In addition to constructor labels, productions that specify parentheses can use the special bracket annotation:

"(" Exp ")" -> Exp {bracket}

The bracket annotation specifies that there should not be a separate tree node in the abstract syntax for the production. This means that an expression 1 + (2) would produce Plus("1","2") in the AST, and not Plus("1",Parens("2")).

 Precedence and Associativity SDF provides special support for specifying the associativity and precedence of operators or other syntactic constructs. As an example, let us consider the production of the Plus operator. So far, it has been defined as:

Exp "+" Exp -> Exp {"Plus"}

Based on this operator, a parser can be generated that can parse an expression such as 1 + 2 to a term Plus("1", "2"). However, the production does not specify if an expression 1 + 2 + 3 should be parsed to a term Plus("1", Plus("2", "3")) or Plus(Plus("1", "2"), "3"). If you try the grammar in Spoofax, it will show both interpretations using the special amb constructor:

amb([
  Plus("1", Plus("2", "3")),
  Plus(Plus("1", "2"), "3")
])

The amb node indicates an ambiguity and it contains all possible interpretations73. Ambiguities can be resolved by adding annotations to the grammar that describe the intended interpretation.

73 Whenever an ambiguity is encountered in a file, it is marked with a warning in the editor.

For the Plus operator, we can resolve the ambiguity by specifying that it is left-associative, using the left annotation:

Exp "+" Exp -> Exp {"Plus", left}
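The effect of the left annotation can be imitated outside of SDF. The Python sketch below does not model SGLR; it simply enumerates every tree that the unannotated Plus production allows for a chain of operands (the content of the amb node) and then filters out the trees that a left-associative reading forbids. The function names are hypothetical, and the filter covers only the single-operator case discussed here.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple[str, "Tree", "Tree"]]


def all_parses(operands: List[str]) -> List[Tree]:
    """Every way to read `a + b + c ...` with the production Exp "+" Exp -> Exp."""
    if len(operands) == 1:
        return [operands[0]]
    trees: List[Tree] = []
    for split in range(1, len(operands)):
        for left in all_parses(operands[:split]):
            for right in all_parses(operands[split:]):
                trees.append(("Plus", left, right))
    return trees


def is_left_associative(tree: Tree) -> bool:
    """`left` forbids a Plus as the direct right child of another Plus."""
    if isinstance(tree, str):
        return True
    _, left, right = tree
    return isinstance(right, str) and is_left_associative(left)


def show(tree: Tree) -> str:
    if isinstance(tree, str):
        return f'"{tree}"'
    return f"{tree[0]}({show(tree[1])}, {show(tree[2])})"


ambiguous = all_parses(["1", "2", "3"])                 # both interpretations
resolved = [t for t in ambiguous if is_left_associative(t)]
```

For three operands the enumeration yields exactly the two alternatives listed in the amb node, and the filter keeps only Plus(Plus("1", "2"), "3").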

In a similar fashion, SDF supports the definition of the precedence order of operators. For this, the productions can be placed into the context-free priorities section:

context-free priorities

  Exp "*" Exp -> Exp {"Mul", left} >
  Exp "+" Exp -> Exp {"Plus", left}

This example specifies that the Mul operator has a higher priority than the Plus operator, resolving the ambiguity that arises for an expression such as 1 + 2 * 3.

■ Reserved Keywords and Production Preference Parsers generated with SDF do not use a scanner, but include processing of lexical syntax in the parser. Since scanners operate without any context information, they will simply recognize any token that corresponds to a keyword in the grammar as a reserved keyword, irrespective of its location in the program. In SDF, it is also possible to use keywords that are not reserved, or keywords that are only reserved in a certain context. As an example, the following is a legal entity in Mobl:

entity entity {}

Since our grammar did not specify that entity is a reserved word, it can be used as a normal ID identifier. However, there are cases in which it is useful to reserve keywords, for example to prevent ambiguities. Consider what would happen if we added new productions for predefined type literals:

"Bool" -> Type {"BoolType"}
"Num" -> Type {"NumType"}
"String" -> Type {"StringType"}

If we were now to parse a type String, it would be ambiguous: it matches the StringType production above, but it also matches the EntityType production, as String is a legal entity identifier74.

74 So it is ambiguous because at the same location in a program both interpretations are possible.

Keywords can be reserved in SDF by using a production that rejects a specific interpretation:

"String" -> ID {reject}

This expresses that String can never be interpreted as an identifier. Alternatively, we can say that we prefer one interpretation over the other:

"String" -> Type {"StringType", prefer}

This means that this production is to be preferred if there are any other interpretations. However, since these interpretations cannot always be foreseen as grammars are extended, it is considered good practice to use the more specific reject approach instead75.

75 This is the situation where a projectional editor like MPS is more flexible, since instead of running into an ambiguity, it would prompt the user to decide which interpretation is correct as they type String.

■ Longest Match Most scanners apply a longest match policy for scanning tokens76. For most languages, this is the expected behavior, but in some cases longest match is not what users expect. SDF instead allows the grammar to specify the intended behavior.

76 This means that if it is possible to include the next character in the current token, the scanner will always do so.

In Spoofax, the default is specified in the Common syntax module using a lexical restrictions section:

lexical restrictions
  ID -/- [A-Za-z0-9]

This section restricts the grammar by specifying that an ID cannot be directly followed by a character that matches [A-Za-z0-9]. Effectively, it enforces a longest match policy for the ID symbol. SDF also allows the use of lexical restrictions for keywords. By default it does not enforce longest match, which means it allows the following definition of a Mobl entity:

entityMoblEntity {}

As there is no longest match, the parser can recognize the entity keyword even if it is not followed by a space. To avoid this behavior, we can specify a longest match policy for the entity keyword:

lexical restrictions
  "entity" -/- [A-Za-z0-9]
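A follow restriction behaves much like a negative lookahead in a regular expression. The following Python sketch uses the standard re module to imitate the two restrictions above; it is an analogy, not a description of how SGLR implements them. The lazy quantifier *? is used on purpose to mimic a scannerless parser that is free to end an ID after any prefix.

```python
import re

# "entity" as a plain literal: matches even if identifier characters follow.
KEYWORD = re.compile(r"entity")
# Imitates: "entity" -/- [A-Za-z0-9]
RESTRICTED_KEYWORD = re.compile(r"entity(?![A-Za-z0-9])")

# Without a restriction, an ID may end after any prefix (lazy match).
LAZY_ID = re.compile(r"[A-Za-z][A-Za-z0-9]*?")
# Imitates: ID -/- [A-Za-z0-9]; only the longest match survives.
RESTRICTED_ID = re.compile(r"[A-Za-z][A-Za-z0-9]*?(?![A-Za-z0-9])")

glued = "entityMoblEntity {}"
spaced = "entity MoblEntity {}"

keyword_in_glued = KEYWORD.match(glued) is not None
restricted_in_glued = RESTRICTED_KEYWORD.match(glued) is not None
restricted_in_spaced = RESTRICTED_KEYWORD.match(spaced) is not None

shortest_id = LAZY_ID.match("abc1 x").group()
longest_id = RESTRICTED_ID.match("abc1 x").group()
```

The unrestricted keyword pattern accepts the glued input, whereas the restricted one accepts only the input in which entity is followed by a non-identifier character. Likewise, the lookahead forces the lazy ID pattern to consume the whole identifier.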

 Name Bindings So far we have discussed purely syntax specification in SDF. Spoofax also allows the specification of name binding rules, which specify semantic relations between productions. We discuss how these relations are specified in Chapter 8.

7.7 MPS Example

We start by defining a simple language for state machines, roughly similar to the one used in the state machine extension77 to mbeddr C. Its core concepts include Statemachine, State, Transition and Trigger. The language supports the definition of state machines, as shown in the following piece of code:

77 In mbeddr, state machines can be embedded into C code, as we will see later.

module LineFollowerStatemachine {

  statemachine LineFollower {
    events unblocked()
           blocked()
           bumped()
           initialized()
    states (initial = initializing) {
      state initializing {
        on initialized [ ] -> running { }
      }
      state paused {
        on unblocked [ ] -> running { }
      }
      state running {
        on blocked [ ] -> paused { }
        on bumped [ ] -> crashed { }
      }
      state crashed {
      }
    }
  }
}

■ Concept Definition MPS is projectional, so we start with the definition of the AS. The code below shows the definition of the concept Statemachine. It contains a collection of States and a collection of InEvents. It also contains a reference to one of the states to mark it as the initial state. The alias is defined as statemachine, so typing this string inside a C module instantiates a state machine (it picks the Statemachine concept from the code completion menu). State machines also implement a couple of interfaces: IIdentifierNamedElement contributes a property name, IModuleContent makes the state machine embeddable in C Modules78.

78 The Module owns a collection of IModuleContents, just as the state machine contains states and events.

concept Statemachine extends BaseConcept
                     implements IModuleContent
                                ILocalVarScopeProvider
                                IIdentifierNamedElement
  children:
    State states 0..n
    InEvent inEvents 0..n

  references:
    State initial 1

  concept properties:
    alias = statemachine

A State (not shown) contains two StatementLists as entryActions and exitActions. StatementList is a concept defined by the com.mbeddr.core.statements language. To make it visible, our statemachine language extends com.mbeddr.core.statements. Finally, a State contains a collection of Transitions.

concept Transition
  children:
    Trigger trigger 1
    Expression guard 1
    StatementList actions 1

  references:
    State target 1

  concept properties:
    alias = on

Transitions contain a Trigger, a guard condition, transition actions and a reference to the target state. The trigger is an abstract concept; various specializations are possible: the default implementation is the EventTrigger, which references an Event79. The guard condition is an Expression, a concept reused from com.mbeddr.core.expressions80. The target state is a reference, i.e. we point to an existing state instead of owning a new one. actions is another StatementList that can contain arbitrary C statements used as the transition actions.

79 It expresses the fact that the referenced event triggers the transition.

80 A type system rule will be defined later to constrain this expression to be Boolean.

■ Editor Definition Editors, i.e. the projection rules, are made of cells. When defining editors, various cell types are arranged so that the resulting syntax has the desired structure. Fig. 7.11 shows the editor definition for the State concept. It uses an indent collection of cells with various style attributes to arrange the state keyword and name, the entry actions, the transitions and the exit actions in a vertical list. Entry and exit actions are shown only if the respective StatementList is not empty (a condition is attached to the respective cells, marked by the ? in front of the cell). An intention81 is used to add a new statement and hence make the respective list visible.

81 An intention is what Eclipse calls a Quick Fix, i.e. an entry in a little menu that transforms the program in some way. Intentions are explained in the next section.

Figure 7.11: The definition of the editor for the State concept. In MPS, editors are made from cells. In the editor definition you arrange the cells and define their contents; this defines the projection rule that is used when instances of the concept are rendered in the editor.

Fig. 7.12 shows the definition of the editor for a Transition. It arranges the keyword on, the trigger, the guard condition, target state and the actions in a horizontal list of cells, the guard surrounded by brackets, and an arrow (->) in front of the target state. The editor for the actions StatementList comes with its own set of curly braces.

Figure 7.12: The editor for transitions. Note how we embed the guard condition expression simply by referring to the guard child relationship. We "inherit" the syntax for expressions from the com.mbeddr.core.expressions language.

The %targetState% -> {name} part expresses the fact that in order to render the target state, the target state’s name attribute should be shown. We could use any text string to refer to the target state82.

82 We could even use the symbol X to render all target state references. The reference would still work, because the underlying data structure uses the target’s unique ID to establish the reference. It does not matter what we use to represent the target in the model. Using X for all references would of course be bad for human readability, but technically it would work.

Note how we use on both as the leading keyword for a transition and as the alias. This way, if a user types the on alias to instantiate a transition, it feels as if they typed the leading keyword of a transition (as in a regular text editor).

If a language extension defined a new concept SpecialTransition, they could use another alias to uniquely identify this concept in the code completion menu. The user decides which alias to type depending on whether they want to instantiate a Transition or a SpecialTransition. Alternatively, the SpecialTransition could use the same alias on. In this case, if the user types on, the code completion menu pops open and the user has to decide which of the two concepts to instantiate83. As we have discussed above, this means that there is never an ambiguity that cannot be handled, as long as the user is willing and able to make the decision of which concept should be instantiated. A third option would transform a Transition into a SpecialTransition on demand, for example if the user executes a specific extension, or types a specific string on the right side of a Transition.

83 The code completion menu by default shows from which language a language concept originates, so this is a way to distinguish the two. Alternatively, a short explaining text can be shown for each entry in the code completion menu that helps the user make the decision.

■ Intentions Intentions are MPS’ term for what is otherwise known as a Quick Fix: a little menu can be displayed on a program element that contains a set of actions that change the underlying program element (see Fig. 7.13). The intentions menu is opened via Alt-Enter. In MPS, intentions play an important role in the editor. In many languages, certain changes to the program can only be made via an intention84. Using the intentions menu in a projectional editor is idiomatic. For example, in the previous section we mentioned that we use them to add an entry action to a State.

Figure 7.13: The intentions menu for a local variable declaration. It can be opened via Alt-Enter. To select an action from the menu, you can just start typing the action label, so this is very keyboard-friendly.

84 This is mostly because building a just-type-along solution would be a lot of work in a projectional editor in some cases.

Here is the intention code:

intention addEntryActions for concept State {
  available in child nodes : true

  description(editorContext, node)->string {
    "Add Entry Action";
  }

  isApplicable(editorContext, node)->boolean {
    node.entryAction.isNull;
  }

  execute(editorContext, node)->void {
    node.entryAction.set new();
    editorContext.selectWRTFocusPolicy(node.entryAction);
  }
}

An intention is defined for a specific language concept (State in the example). It can then be invoked by pressing Alt-Enter on any instance of this concept. Optionally it is possible to also make it available in child nodes. For example, if you are in the guard expression of a transition, an intention for State with available in child nodes set to true will be available as well. The intention implementation also specifies an expression used as the title in the menu and an applicability condition. In the example the intention is only applicable if the corresponding state does not yet have any entry action (because otherwise you can just type in additional statements). Finally, the execute section contains procedural code that performs the respective change on the model. In this case we simply create a new instance of StatementList in the entryAction child. We also set the cursor into this new StatementList85.

85 Notice how we don’t have to specify any formatter or serializer for our language. Remember how a projectional editor always goes from AS to CS. So after changing the AS procedurally, the respective piece of the tree is simply rerendered to update the representation of the program in the editor. However, we do have to define an editor for each language concept.

■ Expressions Since we inherit the expression structure and syntax from the C core language, we don’t have to define expressions ourselves to be able to use them in guards. It is nonetheless interesting to look at their implementation in the language com.mbeddr.core.expressions.

Expressions are arranged into a hierarchy starting with the abstract concept Expression. All other kinds of expressions extend Expression, directly or indirectly. For example, PlusExpression extends BinaryExpression, which in turn extends Expression. BinaryExpressions have left and right child Expressions. This way, arbitrarily complex expressions can be built86.

86 Representing expressions as trees is a standard approach that we have seen with the Xtext example already; in that sense, the abstract syntax of mbeddr expressions (and more generally, the way to handle expressions in MPS) is not very interesting.

The editors are also straightforward: in the case of the + expression, they are a horizontal list of the editor for the left argument, the + symbol, and the editor for the right argument. As we have explained in the general discussion about projectional editing (Section 7.2), MPS supports linear input of hierarchical expressions using side transforms. The code below shows the right side transformation for expressions that transforms an arbitrary expression into a PlusExpression by putting the PlusExpression "on top" of the current node87.

87 Using the alias (i.e. the operator symbol) of the respective BinaryExpression and the inheritance hierarchy, it is possible to factor all side transformations for all binary operations into one single action implementation, resulting in much less implementation effort.

side transform actions makeArithmeticExpression

right transformed node: Expression tag: default_

actions :
  add custom items (output concept: PlusExpression)
    simple item
      matching text
        +
      do transform
        (operationContext, scope, model, sourceNode, pattern)->node< > {
          node expr = new node();
          sourceNode.replace with(expr);
          expr.left = sourceNode;
          expr.right.set new();
          return expr.right;
        }

The fact that you can enter expressions linearly leads to a problem not unlike the one found in grammars regarding operator precedence. If you enter 2 + 3 * 4 by typing these characters sequentially, there are two ways in which the tree could look, depending on whether + or * binds more tightly [88].

[88] Note how this really is a consequence of the linear input method; you could build the tree by first typing the + and then filling in the left and right arguments, in which case it would be clear that the * is lower in the tree and hence binds tighter. However, this is tedious and hence not an option in practice.

To deal with this problem, we proceed as follows: each subconcept of BinaryExpression has a numerical value associated with it that expresses its precedence. The higher the number, the higher the precedence (i.e. the lower in the tree). The action code shown above is changed to include a call to a helper function that rearranges the tree according to the precedence values.

    do transform
      (operationContext, scope, model, sourceNode, pattern)->node<> {
        node<PlusExpression> expr = new node<PlusExpression>();
        sourceNode.replace with(expr);
        expr.left = sourceNode;
        expr.right.set new();
        // rearranges tree to handle precedence
        PrecedenceHelper.rearrange(expr);
        return expr.right;
      }
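The source does not show the body of PrecedenceHelper, so the following is only a minimal Java sketch of the idea, not the mbeddr implementation. The classes Expr, Num, Bin and PrecedenceSketch are hypothetical stand-ins for MPS nodes, the precedence values are assumptions, and operators of equal precedence are assumed to be left-associative.

```java
// Hypothetical, simplified expression tree; not the MPS node API.
abstract class Expr {}

final class Num extends Expr {
    final int value;
    Num(int value) { this.value = value; }
    @Override public String toString() { return Integer.toString(value); }
}

final class Bin extends Expr {
    final String op;
    final Expr left;
    final Expr right;

    Bin(String op, Expr left, Expr right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }

    // Assumed values: a higher number binds more tightly (sits lower in the tree).
    int precedence() {
        return (op.equals("*") || op.equals("/")) ? 2 : 1;
    }

    @Override public String toString() {
        return "(" + left + " " + op + " " + right + ")";
    }
}

final class PrecedenceSketch {
    // Returns a tree in which no operator has a right child that binds
    // less tightly (or, for left-associativity, equally tightly).
    static Expr rearrange(Expr e) {
        if (!(e instanceof Bin)) {
            return e;
        }
        Bin b = (Bin) e;
        Expr left = rearrange(b.left);
        Expr right = rearrange(b.right);
        if (right instanceof Bin) {
            Bin r = (Bin) right;
            if (r.precedence() <= b.precedence()) {
                // Rotate: b(left, r(rl, rr)) becomes r(b(left, rl), rr).
                Bin lowered = new Bin(b.op, left, r.left);
                return rearrange(new Bin(r.op, lowered, r.right));
            }
        }
        return new Bin(b.op, left, right);
    }
}
```

Typing 2 * 3 + 4 linearly with the side transformation above first yields a tree in which * sits above +; the rotation moves the + to the top so that the * ends up lower in the tree.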

The rearrange method scans through an expression tree and checks for cases in which a binary expression with a higher precedence is an ancestor of a binary expression with a lower precedence value. If it finds one, it rearranges the tree to resolve the problem [89].

[89] Since the problem can only arise as a consequence of the linear input method, it is sufficient to include this rearrangement in the side transformation like the one shown above.

Context Restrictions

MPS makes strong use of polymorphism. If a language concept defines a child relationship to another concept C, then any subtype of C can also be used in this child relationship. For example, a function has a body which is typed to StatementList, which contains a list of Statements. So every subtype of Statement can be used inside a function body. In general, this is the desired behavior, but in some cases, it is not. Consider test cases. Here is a simple example:

    module UnitTestDemo imports nothing {

      test case testMultiply {
        assert (0) times2(21) == 42;
      }

      int8 times2(int8 a) {
        return 2 * a;
      }
    }

Test cases reside in a separate language com.mbeddr.core.unittest. The language defines the TestCase concept, as well as the assert statement. AssertStatement extends Statement, so by default, an assert can be used wherever a Statement is expected, once com.mbeddr.core.unittest is used in a program. However, this is not what we want: assert statements should only be usable inside a TestCase [90]. To support such a use case, MPS supports a set of constraints. Here is the implementation for AssertStatement:

    concept constraints AssertStatement {
      can be child
        (operationContext, scope, parentNode, link, childConcept)->boolean {
          parentNode.ancestor<concept = TestCase, +>.isNotNull;
        }
    }

[90] This is, among other reasons, because the transformation of the assert statement to C expects code generated from the TestCase to surround it.

This constraint checks that a TestCase is among the ancestors of a to-be-inserted AssertStatement. The constraint is checked before the new AssertStatement is inserted, and it prevents the insertion if the target location is not under a TestCase [91].

[91] This constraint is written from the perspective of the potential child element. For reasons of dependency management, it is also possible to write the constraint from the perspective of the parent or an ancestor. This is useful if a new container concept wants to restrict the use of existing child concepts without changing those concepts. For example, the Lambda concept, which contains a statement list as well, prohibits the use of LocalVariableRefs in any of its statements.

Tables and Graphics

The MPS projectional editor associates projection rules with language concepts. A projection rule consists of cells. Each cell represents a primitive rendering element. For example, a constant cell contains a constant text that is rendered as-is in the program. A property cell renders a property (for example, the name). Collection cells arrange other cells in some predefined or configurable layout. Among others, MPS has vertical and horizontal collections. To render concepts as a table, a suitable kind of cell is required: MPS provides the table cell for this. For example, the editor for the decision table is shown in Fig. 7.14 (and an example table is shown in Fig. 14.7).

Figure 7.14: The editor for a decision table contains a horizontal collection of cells. The first one contains the return type of the decision table, the second one contains the default value, and the last one contains the actual table, represented by the table cell.

However, this is only half of the story. The real definition of the table contents happens via a table model implementation inside the table cell. The inspector for the table cell contains a function that has to return a TableModel, an interface that determines the structure of the table [92]. Here is the code used in the decision table:

    (node, editorContext)->TableModel {
      return new XYCTableModel(node, link/DecTab : xExpr/,
                               link/DecTab : yExpr/,
                               link/DecTab : cExpr/,
                               editorContext);
    }

[92] This is similar to the approach used in Java Swing, but it is not exactly the same interface.

The XYCTableModel class is a utility class that ships with MPS for tables whose contents are represented by a concept that has three child collections: one for the contents of the row headers, one for the contents of the column headers and one for the remaining cells. We pass in the node that represents the table as well as the three child collections (and the editorContext). If none of the existing utility classes is suitable, you have to implement the TableModel interface yourself [93]. Here is the definition of the interface:

    public interface TableModel {
      int getColumnCount();
      int getRowCount();
      void deleteRow(int rowNumber);
      node<> getValueAt(int row, int column);
      void createElement(int row, int column);
      NodeSubstituteInfo getSubstituteInfo(int row, int column);
      void insertRow(int rowNumber);
      void deleteColumn(int columnNumber);
      void insertColumn(int columnNumber);
      int getMaxColumnWidth(int columnNumber);
    }

[93] Later versions of MPS will provide a higher-level approach to defining tables that is more consistent with the approach for editor definition in MPS, and which does not require Java programming for defining a table.

Note how the getValueAt method returns a node<>. The editor then renders the editor for that node into the respective table cell, supporting nesting of arbitrary other editors into tables.

A similar approach will be used for graphical notations. New kinds of cells (for example, rectangle and line) may be required [94]. The fundamentally interesting characteristic of projectional editors is that completely different styles of notations can be supported, as long as the necessary primitive cell types are available. The approach to editor definition remains unchanged. Because all the different notations are based on the same paradigm, the combination of different notational styles is straightforward.

[94] At the time of this writing, MPS does not yet support graphical notations; however, it is planned to add them in 2013.

8 Scoping and Linking

Linking refers to the resolution of name-based references to the referenced symbols in parser-based languages. In projectional systems this is not necessary, since every reference is stored as a direct pointer to the target element. However, in both cases we have to define which elements are actually visible from a given reference site. This information serves as the basis for code completion and to check existing references for their validity. The set of visible elements for a given reference is called its scope.

As we discussed in the previous chapter, the abstract syntax in its simplest form is a tree. However, the information represented by the program is semantically almost always a graph; i.e. in addition to the tree's containment hierarchy, it contains non-containment cross-references [1]. The challenge thus is: how to get from the "syntactic tree" to the "semantic graph", or how to establish the cross-links. There is a marked difference between the projectional and parser-based approach:

[1] Examples abound and include variable references, procedure calls and target states in transitions of state machines.

• In parser-based systems, the cross-references have to be resolved from the parsed text after the AST has been created. An IDE may provide the candidates in a code completion menu, but after selecting a target, the resulting textual representation of the reference must contain all the information to re-resolve the reference each time the program is parsed.

• In projectional editors in which every program element has a unique ID, a reference is represented as a pointer to that ID. Once a reference is established, it can always be re-resolved trivially based on the ID. The reference is established directly as the program is edited: the code completion menu shows candidate target elements for a reference, and selecting one of them creates the reference [2].

[2] The code completion menu shows some human-readable (qualified) name of the target, but the persisted program uses the unique ID once the user makes a selection.

Typically, a language's structure definition specifies which concepts constitute valid target concepts for any given reference (e.g., a Function, a Variable or a State), but this is usually not enough. Language-specific visibility rules determine which instances of these concepts are actually permitted as a reference target [3]. The collection of model elements which are valid targets of a particular semantic cross-reference is called the scope of that cross-reference. Typically, the scope of a particular cross-reference not only depends on the target concept of the cross-reference, but also on its surroundings, e.g. the namespace within which the element lives, the location inside the larger structure of the site of the cross-reference or something that's essentially non-structural in nature.

[3] For example, only the functions and variables in the local module or the states in the same state machine as the transition may be valid targets.

A scope, the collection of valid targets for a reference, has two uses. First, it can be used to populate the code completion menu in the IDE if the user presses Ctrl-Space at the reference site. Second, independent of the IDE, the scope is used for checking the validity of an existing reference: if the reference target is not among the elements in the scope, the reference is invalid.

Note: Instead of looking at scopes from the perspective of the reference (and hence calculating a set of candidate target elements), one can also look at scopes from the perspective of visibility. In this case, we (at least conceptually) compute for each location in the program the set of visible elements. A reference is then restricted to refer to any element from those visible at the particular location. Our notion is more convenient from the cross-reference viewpoint, however, as it centers around resolving particular cross-references one at a time. From an implementation point of view, both perspectives are exchangeable.

Scopes can be hierarchical, in which case they are organized as a stack of collections; confusingly, these collections are often called scopes themselves. During resolution of a cross-reference, the lowest or innermost collection is searched first. If the reference cannot be resolved to match any of its elements, the parent of the innermost collection is queried, and so forth.

The hierarchy often mimics the structure of the language itself: e.g., the innermost scope of a reference consists of all the elements present in the immediately-surrounding "block", while the outermost scope is the global scope. This provides a mechanism to disambiguate target elements having the same reference syntax (usually the target element's name) by always choosing the element from the innermost scope. This is often called shadowing, because the inner elements overshadow the (more) outer elements.
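The hierarchical lookup and the two uses of a scope can be sketched in a few lines of Java. This is a tool-independent illustration only: NestedScope and its methods are hypothetical names, and elements are represented as plain strings for brevity.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a scope is a named collection with an optional parent.
final class NestedScope {
    private final NestedScope outer; // null for the outermost (global) scope
    private final Map<String, String> elements = new LinkedHashMap<>();

    NestedScope(NestedScope outer) {
        this.outer = outer;
    }

    NestedScope define(String name, String element) {
        elements.put(name, element);
        return this;
    }

    // Search the innermost collection first, then its parent, and so forth.
    // The first hit wins, which is what produces shadowing.
    String resolve(String name) {
        for (NestedScope s = this; s != null; s = s.outer) {
            String hit = s.elements.get(name);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Use 1: names offered by code completion (a shadowed name is listed once).
    Set<String> visibleNames() {
        Set<String> names = new LinkedHashSet<>();
        for (NestedScope s = this; s != null; s = s.outer) {
            names.addAll(s.elements.keySet());
        }
        return names;
    }

    // Use 2: an existing reference is valid only if its target is in scope
    // under the name used at the reference site.
    boolean isValid(String name, String target) {
        return target != null && target.equals(resolve(name));
    }
}
```

With a global scope that defines x and g and an inner scope that defines x and sum, resolving x from the inner scope yields the inner element, while g is found in the parent.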

8.1 Scoping in Spoofax

In the previous chapter we described how to specify a grammar for a subset of the Mobl language. This chapter shows how to specify name resolution for this language by means of declarative name binding rules. Spoofax' name binding rules are based on five concepts: namespaces, definitions, references, scopes and imports. We will introduce each of these concepts separately, going from simple to more complicated examples.

8.1.1 Namespaces

To understand naming in Spoofax, the notion of a namespace is essential. In Spoofax, a namespace is a collection of names and is not necessarily connected to a specific language concept [4]. Different concepts can contribute names to a single namespace. For example, in Java, classes and interfaces contribute to the same namespace. Namespaces are declared in the namespace section of a language definition. For Mobl, we have separate namespaces for modules, entities, properties, functions and local variables.

    namespaces Module Entity Property Function Variable

[4] Some languages such as C# provide namespaces as a language concept to scope the names of declarations such as classes. It is important to distinguish these namespaces as a language concept from Spoofax' namespaces as a language definition concept. The two are not related.

8.1.2 Definitions and References

Once we have defined namespaces, we can define name bindings with rules of the form pattern : clause*, where pattern is a term pattern [5], and clause* is a list of name binding declarations about the language construct that matches with pattern. For example, the following rules declare definition sites for module and entity names. The patterns in these rules match module and entity declarations, binding variables m and e to module and entity names respectively. These variables are then used in the clauses on the right-hand sides [6].

    Module(m, _): defines non-unique Module m
    Entity(e, _): defines unique Entity e

[5] A term pattern is a term that may contain variables (x) and wildcards (_).

[6] In the first rule, the clause specifies any term matched by Module(m, _) to define a name m in the Module namespace. Similarly, the second rule specifies any term matched by Entity(e, _) to define a name e in the Entity namespace.

As an example, let us reconsider the example module from the previous chapter:

    module shopping

    entity Item {
      name : String
      ...
    }

The parser turns this into an abstract syntax tree, represented as a term:

    Module(
      "shopping",
      [ Entity(
          "Item",
          [Property("name", EntityType("String")), ...]
        )
      ]
    )

The patterns in name binding rules match subterms of this term, indicating definition and use sites. The whole term is a definition site of the module name shopping. The first name binding rule specifies this binding. Its pattern matches the term and binds m to "shopping". Similarly, the subterm Entity("Item", ...) is a definition site of the entity name Item. The pattern of the second name binding rule matches this term and binds e to "Item".

While entity declarations are unique definition sites, module declarations are non-unique definition sites. That is, multiple module declarations can share the same name. This allows Mobl users to spread the content of a module over several files, similar to Java packages. Definition sites are unique by default, so the unique keyword is optional and can be omitted. For example, the following rules declare unique definition sites for property and variable names:

    Property(p, _): defines Property p
    Param(p, _): defines Variable p
    Declare(v, _): defines Variable v

Note how Spoofax distinguishes the name of a namespace from the sort and the constructor of a program element: in the last rule above, the sort of the program element is Statement, its constructor is Declare, and it lives in the Variable namespace. By distinguishing these three things, it becomes easy to add or exclude program elements from a namespace [7].

[7] For example, return statements are also of the syntactic sort Statement, but do not live in any namespace. On the other hand, function parameters also live in the Variable namespace, even though (in contrast to variable declarations) they do not belong to the syntactic sort Statement.

Use sites which refer to definition sites of names can be declared similarly. For example, the following rule declares use sites of entity names:

    Type(t): refers to Entity t

Use sites might refer to different names from different namespaces. For example, a variable might refer either to a Variable or a Property. In Spoofax, this can be specified by exclusive resolution options:

    Var(x): refers to Variable x
            otherwise refers to Property x

The otherwise keyword signals ordered alternatives: only if the reference cannot be resolved to a variable will Spoofax try to resolve it to a property. As a consequence, variable declarations shadow property definitions. If this is not intended, constraints can be defined to report corresponding errors. We will discuss constraints in Section 9.3 in the next chapter.
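The effect of ordered alternatives can be illustrated outside Spoofax with a small Java sketch. It is not how Spoofax implements name resolution; OrderedResolution and its methods are hypothetical, and namespaces are modeled simply as sets of names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of "refers to A otherwise refers to B".
final class OrderedResolution {
    private final Map<String, Set<String>> definitions = new HashMap<>();

    OrderedResolution define(String namespace, String name) {
        definitions.computeIfAbsent(namespace, k -> new HashSet<>()).add(name);
        return this;
    }

    // Returns the first namespace, in the given order, that defines the name,
    // or null if the reference cannot be resolved at all.
    String resolve(String name, String... namespaceOrder) {
        for (String namespace : namespaceOrder) {
            Set<String> names = definitions.get(namespace);
            if (names != null && names.contains(name)) {
                return namespace;
            }
        }
        return null;
    }
}
```

If both a variable and a property are called x, resolving x with the order Variable, Property yields Variable; the property is only found when no variable of that name exists.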

8.1.3 Scoping

Simple Scopes

In Spoofax, scopes restrict the visibility of definition sites [8]. For example, an entity declaration scopes its property declarations, so they are not visible from outside the entity.

    entity Customer {
      name : String // Customer.name
    }

    entity Product {
      name : String // Product.name
    }

[8] Note that Spoofax' use of the word scope is different from the general meaning of the word in this chapter.

In this example, both name properties live in the Property namespace, but we can still distinguish them: if name is referenced in a function inside Customer, then it references the one in Customer, not the one in Product.

Scopes can be nested and name resolution typically looks for definition sites from inner to outer scopes. In Mobl, modules scope entities, entities scope properties and functions, and functions scope local variables. This can be specified in Spoofax in terms of scopes clauses:

    Module(m, _): defines Module m scopes Entity
    Entity(e, _): defines Entity e scopes Property, Function
    Function(f, _): defines Function f scopes Variable

As these examples illustrate, scopes are often also definition sites. However, this is not a requirement. For example, a block statement [9] has no name, but scopes variables:

    Block(_): scopes Variable

[9] A block statement groups statements syntactically to a single statement. For example, Java provides curly braces to group statements into a block statement.

Definition Sites with Limited Scope

So far we have seen examples in which definitions are visible in their enclosing scope: entities are visible in the enclosing module, properties and functions are visible in the enclosing entity, and parameters are visible in the enclosing function. However, this does not hold for variables declared inside a function. Their visibility is limited to statements after the declaration. Thus, we need to restrict the visibility in the name binding rule for Declare to the subsequent scope:

Declare(v, _): defines Variable v in subsequent scope
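What "subsequent scope" means for a use site can be made concrete with a tiny Java sketch. It is only an illustration of the visibility rule, not Spoofax code: SubsequentScope is a hypothetical helper, and a block is modeled as a list in which each entry is the variable declared by that statement (or null if the statement declares nothing).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a declaration is visible only to later statements.
final class SubsequentScope {
    // Returns the variables visible to the statement at useIndex, i.e. those
    // declared by statements strictly before it in the same block.
    static List<String> visibleAt(List<String> declaredByStatement, int useIndex) {
        List<String> visible = new ArrayList<>();
        int limit = Math.min(useIndex, declaredByStatement.size());
        for (int i = 0; i < limit; i++) {
            String declared = declaredByStatement.get(i);
            if (declared != null) {
                visible.add(declared);
            }
        }
        return visible;
    }
}
```

A statement never sees its own declaration or any later one, which is exactly the restriction that distinguishes local variables from, say, properties of an entity.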

Similarly, the iterator variable in a for loop is only visible in its condition, the update, and the loop’s body, but not in the initializing expression. This can be declared as follows:

    For(v, t, init, cond, update, body):
      defines Variable v in cond, update, body

Scoped References

Typically, use sites refer to names which are declared in their surrounding scopes. But a use site might also refer to definition sites which reside outside its scope. For example, a property name in a property access expression might refer to a property in another entity:

    entity Customer {
      name : String
    }

    entity Order {
      customer : Customer

      function getCustomerName(): String {
        return customer.name;
      }
    }

Here, name in customer.name refers to the property in entity Customer. The following name binding rule is a first attempt to specify this:

PropAccess(exp, p): refers to Property p in Entity e

But this rule does not specify which entity e is the right one. Interaction with the type system [10] is required in this case:

    PropAccess(exp, p):
      refers to Property p in Entity e
      where exp has type EntityType(e)

[10] We will discuss type systems in Section 10.5.

This rule essentially says: give me a property with the name p in entity e, where e is the type of the current expression exp.

Imports

Many languages offer import facilities to include definitions from another scope into the current scope. For example, a Mobl module can import other modules, making entities from the imported modules available in the importing module:

    module order

    import banking

    entity Customer {
      name : String
      account: BankAccount
    }

Here, BankAccount is not declared in the scope of module order. However, module banking declares an entity BankAccount, which is imported into module order. The type of property account should refer to this entity. This can be specified by the following name binding rule:

Import(m): imports Entity from Module m

This rule has two effects. First, m is interpreted as a name referring to a module. Second, every entity declared in this module becomes visible in the current scope.

8.1.4 References in Terms

Spoofax uses terms to represent abstract syntax. This enables many interesting features, for example generic tree traversals. But in contrast to object structures as used in MPS and Xtext, terms lack a native concept to represent cross-references. There are two approaches to handle cross-references when working with terms or similar tree structures. First, we can maintain a temporary environment with required information about defined elements during a transformation. This information can then be accessed at use sites. Second, we can maintain similar information in a global environment, which can be shared by various transformations.

Spoofax follows the second approach and stores all definitions and references in an in-memory data structure called the index [11]. By collecting all this summary information about files in a project together, it ensures fast access to global information (in particular, to-be-referenced names). The index is updated automatically when Spoofax model files change (or are deleted) and is persisted as Eclipse exits. All entries in the index have a URI which uniquely identifies the element across a project. These URIs are the basis for name resolution, and, by default, are constructed automatically, based on the name binding rules. As an example, consider the following entity:

    module storage

    entity Store {
      name : String
      address : Address
    }

[11] Spoofax also uses the index to store metadata about definitions, such as type information, as we show in the next chapter.

Following the name binding rules discussed so far, there are two scope levels in this fragment: one at the module level and one at the entity level. We can assign names to these scopes (storage and Store) by using the naming rules for modules and entities. By creating a hierarchy of these names, Spoofax creates URIs: the URI for Store is Entity://storage.Store, and the one for name is Property://storage.Store.name. URIs are represented internally as lists of terms that start with the namespace, followed by a reverse hierarchy of the path names [12]. For the name property of the Store entity in the storage module, this would be:

    [Property(), "name", "Store", "storage"]

[12] The reverse order used in the representation makes it easier to efficiently store and manipulate URIs in memory: every tail of such a list can share the same memory space.
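Why the reversed order allows tails to be shared can be seen in a minimal Java sketch of such a list. This is an illustration under assumptions, not Spoofax' actual data structure: UriPath and its methods are hypothetical, and the path is modeled as a linked list that points from the innermost name outwards.

```java
// Hypothetical sketch: a path stored innermost-first, so that all elements
// declared in the same scope share one and the same tail object.
final class UriPath {
    final String name;   // name segment contributed by this definition
    final UriPath outer; // enclosing path (the shared tail), null at the top

    UriPath(String name, UriPath outer) {
        this.name = name;
        this.outer = outer;
    }

    // Creating a nested definition allocates one cell and reuses the tail.
    UriPath child(String childName) {
        return new UriPath(childName, this);
    }

    // Outermost-first form, e.g. "storage.Store.name".
    String qualifiedName() {
        return outer == null ? name : outer.qualifiedName() + "." + name;
    }

    // External form, e.g. "Property://storage.Store.name".
    String asUri(String namespace) {
        return namespace + "://" + qualifiedName();
    }

    // Internal list form: namespace first, then the reversed path.
    String asTerm(String namespace) {
        StringBuilder sb = new StringBuilder("[" + namespace + "()");
        for (UriPath p = this; p != null; p = p.outer) {
            sb.append(", \"").append(p.name).append('"');
        }
        return sb.append("]").toString();
    }
}
```

In this sketch, the paths for the name and address properties of Store are two separate cells whose outer field refers to the very same ("Store", "storage") tail.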

Spoofax annotates each definition and reference with a URI to connect names with information stored in the index. References are annotated with the same URI as their definition. This way, information about the definition site is also available at the reference. We can inspect URIs in Spoofax' analyzed syntax view. This view shows the abstract syntax with all URIs as annotations [13]. Consider the following example with both named and anonymous blocks:

    module banking

    entity BankAccount {
      name : String
      number : Num

      function toString() : String {
        { // anonymous block
          var result = name + number.toString();
          return result;
        }
      }
    }

[13] To obtain this view, select Show Analyzed Syntax (selection) in the Transform menu of the Spoofax editor. Spoofax will open a new editor which updates automatically when the content of the original editor changes.

The analyzed abstract syntax for this example is the following:

    Module(
      "banking"{[Module(),"banking"]},
      [ Entity(
          "BankAccount"{[Entity(),"BankAccount","banking"]},
          [ Property(
              "name"{[Property(),"name","BankAccount","banking"]},
              StringType()
            ),
            Property(
              "number"{[Property(),"number","BankAccount","banking"]},
              NumType()
            ),
            Function(
              "toString"{[Function(),"toString","BankAccount","banking"]},
              [],
              StringType(),
              Block([
                Declare(
                  "result"{[Var(),"result",Anon(125),Anon(124),
                            "toString","BankAccount","banking"]},
                  Add(
                    Var("name"{[Property(),"name","BankAccount","banking"]}),
                    MethodCall(...,
                      "toString"{[Unresolved(Function()),
                                  "toString", "BankAccount", "banking"]})
                  )
                ),
                Return(
                  Var("result"{[Var(),"result",Anon(125),Anon(124),
                                "toString","BankAccount","banking"]})
                )
              ])
            )
          ]
        )
      ]
    )

Any references that cannot be resolved are annotated with a special Unresolved constructor. For example, a variable named nonexistent could be represented as:

    Var("nonexistent"{[Unresolved(Var()),"nonexistent",...]})

This makes it easy to recognize any unresolved references in constraints [14]: we can simply pattern-match against the Unresolved term.

[14] We discuss constraints in Section 9.3.

8.2 Scoping in Xtext

Xtext provides Java APIs for implementing all aspects of languages except the grammar [15]. Language developers typically provide Java classes that implement aspect-specific interfaces and then contribute those to Xtext using dependency injection [16]. For most language aspects, Xtext comes with various default implementations developers can build on. A lot of functionality is provided "out of the box" with minimal configuration, but it's easy to swap specific parts by binding another or a custom class through Guice.

[15] In fact, you can use any JVM-based language for implementing these language aspects, including Xtend.

[16] Xtext's internal configuration is based on dependency injection with Google Guice: code.google.com/p/google-guice/

8.2.1 Simple, Local Scopes

To implement scopes, language developers have to contribute a class that implements the IScopeProvider interface. It has one method called getScope that returns an IScope for a given reference. An IScope is basically a collection of candidate reference targets, together with the textual representation by which these may be referenced from the current reference site (the same target may be referenced by different text strings from different program locations). The getScope method has two arguments: the first one, context, is the current program element for which a reference should be scoped; the second one, reference, identifies the reference for which the scope needs to be calculated [17].

    public interface IScopeProvider {
      IScope getScope(EObject context, EReference reference);
    }

[17] The class EReference is the Ecore concept that represents references.

To make the scoping implementation easier, Xtext provides declarative scope providers through the AbstractDeclarativeScopeProvider base class: instead of having to inspect the reference and context object manually to decide how to compute the scope, the language implementor can express this information via the name of the method (using a naming convention). Two different naming conventions are available:

    // <X>, <R>: scoping the <R> reference of the <X> concept
    public IScope scope_<X>_<R>(<X> ctx, EReference ref);

    // <T>: the language concept we are looking for as a reference target
    // <C>: the concept from under which we try to look for the reference
    public IScope scope_<T>(<C> ctx, EReference ref);
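How a method can be selected purely by its name may be easier to picture with a small reflection-based Java sketch. This is not Xtext's implementation: all classes below are hypothetical stand-ins (no EMF or Xtext types are used), the "scope" is reduced to a string, and the sketch assumes that the more specific name is tried before the more general one and that the context type matches the parameter type exactly.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a model class; not an EMF type.
class ChangeStateStatement {}

// Provider that offers both the specific and the general variant.
class DemoScopeProvider {
    public String scope_ChangeStateStatement_targetState(ChangeStateStatement ctx) {
        return "specific";
    }
    public String scope_State(ChangeStateStatement ctx) {
        return "general";
    }
}

// Provider that offers only the general variant.
class GeneralOnlyProvider {
    public String scope_State(ChangeStateStatement ctx) {
        return "general";
    }
}

final class ConventionDispatch {
    // Builds candidate method names from the convention and invokes the
    // first one that the provider actually declares; null if none exists.
    static String scopeFor(Object provider, Object ctx,
                           String owner, String reference, String targetType) {
        String[] candidates = {
            "scope_" + owner + "_" + reference, // scope_<X>_<R>
            "scope_" + targetType               // scope_<T>
        };
        for (String name : candidates) {
            Method method;
            try {
                method = provider.getClass().getMethod(name, ctx.getClass());
            } catch (NoSuchMethodException e) {
                continue; // not declared, try the next naming convention
            }
            try {
                method.setAccessible(true);
                return (String) method.invoke(provider, ctx);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
        return null;
    }
}
```

The sketch deliberately omits what makes the real mechanism useful in practice, namely that the general variant is also considered for ancestors of the context element.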

Let’s assume we want to scope the targetState reference of the ChangeStateStatement. Its definition in the grammar looks like this:

ChangeStateStatement: "state" targetState=[State];

We can use the following two alternative methods:

    public IScope scope_ChangeStateStatement_targetState
        (ChangeStateStatement ctx, EReference ref ) {
      ...
    }

    public IScope scope_State(ChangeStateStatement ctx, EReference ref) {
      ...
    }

The first alternative is specific for the targetState reference of the ChangeStateStatement. It is invoked by the declarative scope provider only for that particular reference. The second alternative is more generic. It is invoked whenever we are trying to reference a State (or any subconcept of State) from any reference of a ChangeStateStatement and all its descendants in the AST. So we could write an even more general alternative, which scopes the visible States from anywhere in a CoolingProgram, independent of the actual reference [18].

    public IScope scope_State(CoolingProgram ctx, EReference ref) {
      ...
    }

[18] It is a good idea to always use the most general variants, unless you specifically want to scope one specific reference. Here is why: depending on the structure of your language, Xtext may have a hard time finding out the current location, and hence the reference that needs to be scoped. In this case, the tighter versions of the scoping method (scope_ChangeStateStatement_targetState in the example) might not be called in all the places you expect it to be called. This can be remedied either by changing the syntax (often not possible or not desired), or by using the more general variants of the scoping function scope_State( CoolingProgram ctx, ... ).

The implementation of the scopes is simple, and relatively similar in all three cases. We write Java code that crawls up the containment hierarchy until we arrive at a CoolingProgram (in the last alternative, we already get the CoolingProgram as an argument, so we don't need to move up the tree), and then construct an IScope that contains the States defined in that CoolingProgram. Here is a possible implementation:

    public IScope scope_ChangeStateStatement_targetState
        (ChangeStateStatement ctx, EReference ref ) {
      CoolingProgram owningProgram = Utils.ancestor( ctx, CoolingProgram.class );
      return Scopes.scopeFor(owningProgram.getStates());
    }

The Scopes class provides a couple of helper methods to create IScope objects from collections of elements. The simple scopeFor method will use the name of the target element as the text by which it will be referenced [19]. So if a state is called normalCooling, then we'd have to write state normalCooling in a ChangeStateStatement in order to change to that state. The text normalCooling acts as the reference: pressing Ctrl-F3 on that program element will go to the referenced state.

[19] You can pass in code that creates other strings than the name from the target element. This supports the previously mentioned feature of referencing the same program element with different strings from different reference sites.

8.2.2 Nested Scopes

The approach to scoping shown above is suitable for simple cases, such as the targetState reference. However, in languages with nested blocks a different approach is recommended. Here is an example of a program expressed in a language with nested blocks:

    var int x;
    var int g;

    function add( int x, int y ) {
      int sum = x + y; // 1
      return sum;
    }

    function addAll( int es ... ) {
      int sum = 0;
      foreach( e in es ) {
        sum += e; // 2
      }
      x = sum; // 3
    }

At the program location marked as 1, the local variable sum, the arguments x and y and the global variables x and g are visible, although the global variable x is shadowed by the argument of the same name. At 2, we can see x, g, sum and es, but also the iterator variable e. At 3, x refers to the global, since it is not shadowed by a parameter or local variable of the same name. In general, some program elements introduce blocks (often statement lists surrounded by curly braces). A block can declare new symbols. References from within these blocks can see the symbols defined in that block, as well as all ancestor blocks. Symbols in inner blocks typically hide symbols with the same name in outer blocks20.

20 The symbols in outer blocks are either not accessible at all, or a special name has to be used, for example, by prefixing them with some outer keyword (for example, outer.x).

Xtext’s scopes support this scenario. IScopes can reference outer scopes. If a symbol is not found in any given scope, that scope delegates to its outer scope (if it has one) and asks it for a symbol of the respective name. Since inner scopes are searched first, this implements shadowing as expected. Also, scopes are not just collections of elements. Instead, they are maps between a string and an element21. The string is used as the reference text. By default, the string is the same as the target element’s name property. So if a variable is called x, it can be referenced by the string x. However, this reference string can be changed as part of the scope definition. This can be used to make shadowed variables visible under a different name, such as outer.x if it is referenced from location 1.

21 In addition, the text shown in the code completion window can be different from the text that will be used as the reference once an element is selected. In fact, it can be a rich string that includes formatting, and it can contain an icon.

The following is pseudo-code that implements this behavior:

// recursive method to build nested scopes
private IScope collect( StatementList ctx ) {
    IScope outer = null
    if ( ctx is within another StatementList parent ) {
        outer = collect(parent)
    }
    IScope scope = new Scope( outer )
    for( all symbols s in ctx ) {
        scope.put( s.name, s )
        // outer is null for the outermost StatementList
        if ( outer != null && outer.hasSymbolNamed( s.name ) ) {
            scope.put( "outer."+s.name, outer.getSymbolByName( s.name ) )
        }
    }
    return scope
}

// entry method, according to naming convention
// in declarative scope provider
public IScope scope_Symbol( StatementList ctx ) {
    return collect( ctx )
}
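The delegation and aliasing described above can be tried out without any Xtext dependency. The following is only a minimal sketch in plain Java: the class NestedScope and its methods declare and resolve are hypothetical stand-ins, not Xtext’s IScope API, and elements are represented by plain strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a scope: a name-to-element map with an optional outer scope. */
final class NestedScope {
    private final NestedScope outer;                       // null for the outermost scope
    private final Map<String, String> symbols = new LinkedHashMap<>();

    NestedScope(NestedScope outer) {
        this.outer = outer;
    }

    /** Declares a symbol; a shadowed outer symbol stays reachable as "outer.<name>". */
    void declare(String name, String element) {
        if (outer != null) {
            String shadowed = outer.resolve(name);
            if (shadowed != null) {
                symbols.put("outer." + name, shadowed);
            }
        }
        symbols.put(name, element);
    }

    /** Looks in this scope first, then delegates to the outer scope (if there is one). */
    String resolve(String name) {
        String local = symbols.get(name);
        if (local != null) {
            return local;
        }
        return outer == null ? null : outer.resolve(name);
    }

    public static void main(String[] args) {
        NestedScope global = new NestedScope(null);
        global.declare("x", "global x");
        global.declare("g", "global g");

        NestedScope add = new NestedScope(global);         // body of function add
        add.declare("x", "argument x");
        add.declare("y", "argument y");
        add.declare("sum", "local sum");

        System.out.println(add.resolve("x"));              // argument x
        System.out.println(add.resolve("outer.x"));        // global x
        System.out.println(add.resolve("g"));              // global g
    }
}
```

Because the inner map is consulted before delegating, shadowing falls out of the lookup order; the outer.x alias is just one more entry in the inner map.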

8.2.3 Global Scopes

There is one more aspect of scoping that needs to be discussed. Programs can be separated into several files and references can cross file boundaries. That is, an element in file A can reference an element in file B. In earlier versions of Xtext file A had to explicitly import file B to make the elements in B available as reference targets, which caused two problems22. Since Xtext 1.0 both of these problems are solved using the index23. The index is a data structure that stores (String,IEObjectDescription)-pairs. The first element of each pair is the qualified name of the object, and the second one, the IEObjectDescription, contains information about a model element, including a URI (a kind of global pointer that also includes the file in which the element is stored) as well as arbitrary additional data provided by the language implementation. By default, all references are checked against this name in the index, not against the actual object. If the actual object has to be resolved, the URI stored in the index is used. Only then is the respective file loaded24. The index is updated whenever a file is changed25. This way, if an element is moved to a different file while keeping its qualified name (which is based on the logical program structure) constant, the reference remains valid. Only the URI in the index is updated.

22 First, for internal reasons, scalability was limited. Second, as a consequence of the explicit file imports, if the referenced element was moved into another file, the import statements in all referencing files had to be updated.

23 This is similar to Spoofax’ index discussed above.

24 This is what improved scalability; files are only loaded if a reference target is accessed, not to check a reference for validity.

25 Even when it has not been saved, so references against dirty editors work as expected.

There are two ways to customize what gets stored in the index, and how. The IQualifiedNameProvider returns a qualified name for each program element. If it returns null, the element is not stored in the index, which means it is not referenceable. The other way is the IDefaultResourceDescriptionStrategy, which allows language developers to build their own IEObjectDescription for program elements. This is important if custom user data has to be stored in the IEObjectDescription for later use during scoping.

The IGlobalScopeProvider is activated if a local scope returns null or no applicable methods can be found in the declarative scope provider class (or if they return null). By default, the ImportNamespacesAwareGlobalScopeProvider is configured26, which provides the possibility of referencing model elements outside the current file, either through their (fully) qualified name, or through their unqualified name if the respective namespace is imported using an import statement27.

26 As with any other Xtext configuration, the specific implementation is configured through a Guice binding.
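To make the role of the index more tangible, here is a deliberately simplified model in plain Java. It is a sketch under stated assumptions, not Xtext’s implementation: IndexSketch, Description, export, isValidReference and resolveToFile are hypothetical names, the URI is reduced to a string of the form file#fragment, and "loading a file" is only simulated by a counter.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, much simplified model of an index: qualified name -> description. */
final class IndexSketch {

    /** Stand-in for an object description: just a URI of the form file#fragment. */
    static final class Description {
        final String uri;

        Description(String uri) {
            this.uri = uri;
        }

        String file() {
            return uri.substring(0, uri.indexOf('#'));
        }
    }

    private final Map<String, Description> byQualifiedName = new HashMap<>();
    private int filesLoaded = 0;

    /** Called whenever a file changes: (re)registers an element under its qualified name. */
    void export(String qualifiedName, String uri) {
        byQualifiedName.put(qualifiedName, new Description(uri));
    }

    /** Validity check: looks only at the index and loads no file. */
    boolean isValidReference(String qualifiedName) {
        return byQualifiedName.containsKey(qualifiedName);
    }

    /** Resolution: only at this point is the file holding the target (notionally) loaded. */
    String resolveToFile(String qualifiedName) {
        Description d = byQualifiedName.get(qualifiedName);
        if (d == null) {
            return null;
        }
        filesLoaded++;
        return d.file();
    }

    int filesLoaded() {
        return filesLoaded;
    }

    public static void main(String[] args) {
        IndexSketch index = new IndexSketch();
        index.export("cooling.normalCooling", "a.cool#//@states.0");
        System.out.println(index.isValidReference("cooling.normalCooling")); // true
        System.out.println(index.filesLoaded());                             // 0

        // The state moves to another file but keeps its qualified name.
        index.export("cooling.normalCooling", "b.cool#//@states.3");
        System.out.println(index.isValidReference("cooling.normalCooling")); // true
        System.out.println(index.resolveToFile("cooling.normalCooling"));    // b.cool
    }
}
```

The point of the sketch is the separation of the two operations: checking a reference touches only the map, while resolving follows the URI, which is the only thing that changes when an element moves.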
27 That import statement is different from the one mentioned earlier: it makes the contents of the respective namespace visible; it does not refer to a particular file.

■ Polymorphic References In the cooling language, expressions also include references to entities such as configuration parameters, variables and hardware elements (compressors or fans defined in a different model). All of these referenceable elements extend SymbolDeclaration. This means that all of them can be referenced by the single SymbolRef construct.

AtomicLevel returns Expression:
    ...
    ({SymbolRef} symbol=[SymbolDeclaration|QID]);

The problem with this situation is that the reference itself does not encode the kind of thing that is referenced28. This makes writing code that processes the model cumbersome, since the target of a SymbolRef has to be taken into account when deciding how to treat (translate, validate) a symbol reference. A more natural design of the language would use different reference constructs for the different referenceable elements. In this case, the reference itself is specific to the referenced element, making processing much easier29:

28 By looking at the reference alone we only know that we reference some kind of symbol. We don’t know whether the reference points to a variable, a configuration parameter or a hardware element.

29 It would also make writing the scopes and extending the language simpler.

AtomicLevel returns Expression:
    ...
    ({VariableRef} var=[Variable]);
    ({ParameterRef} param=[Parameter]);
    ({HardwareBuildingBlockRef} hbb=[HardwareBuildingBlock]);

However, this is not possible with Xtext, since the parser cannot distinguish the three cases syntactically. As we can see from the (invalid) grammar above, in all three cases the reference syntax itself is just an ID. Only during the linking phase could the system check which kind of element is actually referenced, but this is too late for the parser, which needs an unambiguous grammar. The grammar could be disambiguated by using a different syntax for each element:

AtomicLevel returns Expression:
    ...
    ({VariableRef} var=[Variable]);
    ({ParameterRef} "%" param=[Parameter]);
    ({HardwareBuildingBlockRef} "#" hbb=[HardwareBuildingBlock]);

While this approach will technically work, it would lead to an awkward syntax and is hence typically not used. The only remaining alternative is to make all referenceable elements extend SymbolDeclaration and use a single reference concept, as shown above.
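The cost of the single reference concept shows up in every piece of code that processes a SymbolRef: it has to inspect the type of the target. The following plain Java sketch illustrates this. The nested classes are hypothetical stand-ins for the classes that would be generated from the grammar, and the prefixes produced by translate are invented purely for illustration.

```java
/** Sketch of processing code for a polymorphic reference; all names are illustrative. */
final class SymbolRefSketch {

    /** Hypothetical stand-ins for the AST classes derived from the grammar. */
    static abstract class SymbolDeclaration {
        final String name;

        SymbolDeclaration(String name) {
            this.name = name;
        }
    }

    static final class Variable extends SymbolDeclaration {
        Variable(String name) { super(name); }
    }

    static final class Parameter extends SymbolDeclaration {
        Parameter(String name) { super(name); }
    }

    static final class HardwareBuildingBlock extends SymbolDeclaration {
        HardwareBuildingBlock(String name) { super(name); }
    }

    /** The single reference construct: it only knows that it points to some symbol. */
    static final class SymbolRef {
        final SymbolDeclaration symbol;

        SymbolRef(SymbolDeclaration symbol) {
            this.symbol = symbol;
        }
    }

    /** A generator or validator has to look at the target to decide what to do. */
    static String translate(SymbolRef ref) {
        SymbolDeclaration target = ref.symbol;
        if (target instanceof Variable) {
            return "vars." + target.name;
        }
        if (target instanceof Parameter) {
            return "config." + target.name;
        }
        if (target instanceof HardwareBuildingBlock) {
            return "hw." + target.name;
        }
        throw new IllegalArgumentException("unknown kind of symbol: " + target.name);
    }

    public static void main(String[] args) {
        System.out.println(translate(new SymbolRef(new Variable("temperature"))));      // vars.temperature
        System.out.println(translate(new SymbolRef(new Parameter("maxTemp"))));         // config.maxTemp
        System.out.println(translate(new SymbolRef(new HardwareBuildingBlock("fan")))); // hw.fan
    }
}
```

With dedicated reference concepts, each branch of translate would instead be a separate method dispatched on the type of the reference itself.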

8.3 Scoping in MPS

Making references work in MPS requires several ingredients. First of all, as we have seen earlier, the reference is defined as part of the language structure. Next, an editor is defined that determines how the referenced element is rendered at the referencing site30. To determine which instances of the referenced concept are allowed, a scoping function has to be implemented. This simply returns a list of all the elements that are considered valid targets for the reference, as well as an optional text string used to represent the respective element in the code completion menu. As we explained above (Section 7.2), smart references are an important ingredient to make this work conveniently. They make sure that users can simply type the name (or whatever else is put into the code completion menu by the language developer) of the targeted element; once something is selected, the corresponding reference concept is instantiated, and the selected target is set.

30 The syntax used to represent the reference is defined by that editor and can be changed at any time, since the actual reference is implemented based on the target element’s UID.

■ Simple Scopes As an example, we begin with the scope definition for the target reference of the Transition concept. To recap, it is defined as:

concept Transition
  // ...
  references:
    State target 1

The scope itself is defined via the search scope constraint below. The system provides an anonymous search scope function that has a number of arguments that describe the context including the enclosing node and the referencing node. As the signature shows, the function has to return either an ISearchScope or simply a sequence of nodes of type State. The scope of the target state is the set of states of the state machine that (transitively) contains the transition. To implement this, the expression in the body of this function crawls up the containment hierarchy31 until it finds a Statemachine and then returns its states32.

link {target}
  referent set handler:
  search scope:
    (referenceNode, linkTarget, enclosingNode, ...)
        ->join(ISearchScope | sequence<node<State>>) {
      enclosingNode.ancestor<Statemachine>.states;
    }
  validator:
  presentation :

31 Note that for a smart reference, where the reference object is created only after selecting the target, the referenceNode argument is null! This is why we write the scope using the enclosingNode argument.

32 The code used to express scopes can be arbitrarily complex and is implemented in MPS’ BaseLanguage, an extended version of Java.

In addition to the search scope, language developers can provide code that should be executed if a new reference target is set (referent set handler), additional validation (validator), as well as customized presentation in the code completion menu (presentation)33.

33 This can be different from the text used to represent the reference once it is established. That text is controlled by the referencing concept’s editor.

■ Nested Scopes In a more complex, block-oriented language with nested scopes, a different implementation pattern is recommended34:

• All program elements that contribute elements that can be referenced (such as blocks, functions or methods) implement an interface IScopeProvider.

• The interface provides getVisibleElements(concept<> c), a method that returns all elements of type c that are available in that scope.

• The search scope function simply calls this method on the owning IScopeProvider, passing in the concept whose instances it wants to see (State in the above example).

• The implementation of the method recursively calls the method on its owning IScopeProvider, as long as there is one. It also removes elements that are shadowed from the result.

34 In this section we describe the approach as we have implemented it for mbeddr C. Since version 2.5, MPS supports this approach out of the box. For example, an interface similar to IScopeProvider ships with MPS, and scopes can be inherited from parent nodes.
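The pattern in the list above can be sketched in a few lines of plain Java. This is an illustration under simplifying assumptions, not MPS or mbeddr code: Block plays the role of a node implementing the scope provider interface, the concept is reduced to a string called kind, and shadowing is decided by name alone.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the scope provider pattern; all names are illustrative. */
final class ScopeProviderSketch {

    /** Hypothetical named element; kind stands in for the concept. */
    static final class Element {
        final String kind;
        final String name;

        Element(String kind, String name) {
            this.kind = kind;
            this.name = name;
        }
    }

    /** A block-like node that contributes elements and knows its owning provider. */
    static final class Block {
        final Block owner;                                 // null at the root
        final List<Element> declared = new ArrayList<>();

        Block(Block owner) {
            this.owner = owner;
        }

        Element declare(String kind, String name) {
            Element e = new Element(kind, name);
            declared.add(e);
            return e;
        }

        /** Own elements of the requested kind first, then the owner's, minus shadowed ones. */
        List<Element> getVisibleElements(String kind) {
            List<Element> result = new ArrayList<>();
            Set<String> seen = new HashSet<>();
            for (Element e : declared) {
                if (e.kind.equals(kind) && seen.add(e.name)) {
                    result.add(e);
                }
            }
            if (owner != null) {
                for (Element e : owner.getVisibleElements(kind)) {
                    if (seen.add(e.name)) {                // skip names already declared further in
                        result.add(e);
                    }
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        Block root = new Block(null);
        root.declare("var", "x");
        root.declare("var", "g");
        Block inner = new Block(root);
        inner.declare("var", "x");
        inner.declare("var", "sum");
        for (Element e : inner.getVisibleElements("var")) {
            System.out.println(e.name);                    // x, sum, g (one per line)
        }
    }
}
```

A search scope function would then simply find the closest enclosing Block and return the result of getVisibleElements for the concept it is interested in.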

This approach is used in the mbeddr C language, for example for local variables, because those are affected by shadowing from blocks. Here is the code for the variable reference of the LocalVariableReference concept:

link {variable}
  search scope:
    (referenceNode, linkTarget, enclosingNode, ... )
        ->join(ISearchScope | sequence>) {

      // find the statement that contains the future local variable ref
      node s = enclosingNode.ancestor;

      // find the first containing ILocalVariableScopeProvider, which is
      // typically the next higher statement that owns a StatementList.
      // An example would be a ForStatement or an IfStatement
      node scopeProvider = enclosingNode.ancestor;

      // In case we are not in a Statement or there
      // is no ILocalVarScopeProvider,
      // we return an empty list - no variables visible
      if (s == null || scopeProvider == null) {
        return new nlist;
      }

      // we now retrieve the position of the current Statement in the
      // context StatementList. This is important because we only want to
      // see those variables that are defined before the reference site
      int pos = s != scopeProvider ? s.index : LocalVarScope.NO_POSITION;

      // finally we query the scopeProvider for the visible local variables
      scopeProvider.getLocalVarScope(s, pos).getVisibleLocalVars();
    }

■ Polymorphic References We have explained above how references work in principle: they are real pointers to the referenced element, based on the target’s unique ID. In the section on Xtext we have seen how from a given location only one kind of reference for any given syntactic form can be implemented. Consider the following example, where we refer to a global variable a and an event parameter (timestamp) from within the guard condition expression:

int a;
int b;

statemachine linefollower {
  in event initialized(int timestamp);
  states (initial=initializing) {
    state initializing {
      on initialized [now() - timestamp > 1000 && a > 3] -> running
    }
    state running {
    }
  }
}

Both references to local variables and to event parameters use the same syntactic form: a text string that represents the name of the respective target element. As we have discussed above, in Xtext, this is implemented with a single reference concept, typically called SymbolReference, that can refer to any kind of Symbol. LocalVariableDeclaration and EventParameter would both extend Symbol, and scopes would make sure both kinds are visible from within guard expressions35.

35 The problem with this approach is that the reference itself contains no type information about what it references, it is simply a SymbolReference. Processing code has to inspect the type of the referenced symbol to find out what a particular SymbolReference actually means. It can also be a problem regarding modularity, because every referenceable concept must extend Symbol. Referenceable elements contributed by an independently developed language which we may want to embed into the C language will not extend Symbol, though! We discuss language modularization and composition in Section 16.2.

In MPS this is done differently. To solve the example above, one would create a LocalVariableReference and an EventParameterReference. The former references variables and the latter references event parameters. Both have an editor that renders the name of the referenced element, and each of them has its own scope definition36. The following is the respective code for the EventParameterReference expression:

concept EventParameterReference extends Expression
link {parameter}
  search scope:
    (referenceNode, linkTarget, enclosingNode, ...)
        ->join(ISearchScope | sequence>) {
      enclosingNode.ancestor.trigger.event.args;
    }

36 This retains modularity. Adding new kinds of references to existing expression languages can be done in a modular fashion, since the new reference expression comes with its own, independent scoping rule.

Entering the reference happens by typing the name of the referenced element (cf. the concept of smart references introduced above). In the case in which there are a LocalVariableDeclaration and an EventParameter of the same name, the user has to make an explicit decision, at the time of entry (the name won’t bind, and the code completion menu requires a choice). It is important to understand that, although the names are the same, the tool still knows whether a particular reference refers to a LocalVariableDeclaration or to an EventParameter, because the reference is encoded using the ID of the target37.

37 It may not, however, be obvious to the user, so use this approach with caution and/or use different syntax highlighting to distinguish the two. The real benefit of this approach is that if two independent language extensions define such scopes independently, there will not be any ambiguity if these extensions are used together in a single program.

9 Constraints

Constraints are Boolean expressions that must be true for every program expressed with a specific language. Together with type systems, which are discussed in the next chapter, they ensure the static semantics of a language. This chapter introduces the notion of constraints, some considerations regarding languages suitable for expressing constraints, and provides examples with our tools.

As we explained in the DSL Design part of the book, not all programs that conform to the structure (grammar, AS, meta model) of a language are valid. Language definitions include further restrictions that cannot be expressed purely by structure. Such additional restrictions are typically called constraints. Constraints are Boolean conditions that have to evaluate to true in order for the model to be correct ("does expr hold?")1. An error message is reported if the expression evaluates to false ("expr does not hold!"). Constraints are typically associated with a particular language concept ("for each instance of concept C, expr-with-C must hold")2. There are two major kinds of constraints we can distinguish: well-formedness and type systems. Examples for well-formedness constraints include:

• Uniqueness of names in lists of elements (e.g., functions in a namespace).

• Every non-start state of a state machine has at least one incoming transition.

• A variable is defined before it is used (statement ordering).

1 Constraints represent the static semantics of a language. The execution semantics are typically represented by transformations, generators or interpreters. We discuss those in Chapter 11.

2 In addition to just associating a constraint with a language concept, additional applicability conditions or match patterns may be used.

Type system rules are different in that they verify the correctness of types in programs, e.g., they make sure you don’t assign a float to an int. In expression languages particularly, type calculation and checking can become quite complicated, and therefore warrant special support. This is why we distinguish between constraints in general (covered in this chapter) and type systems (which we cover in the next chapter).

Constraints can be implemented with any language or framework that is able to query a model and report errors to the user. To make expressing constraints efficient3, it is useful if the language has the following characteristics:

• It should be able to effectively navigate and filter the model. Support for path expressions (as in aClass.operations.arguments.type as a way to find out the types of all arguments of all operations in a class) is extremely useful.

• Support for higher-order functions is useful, so that one can write generic algorithms and traversal strategies.

• A good collections language, often making use of higher-order functions, is very useful, so it is easily possible to filter collections, create subsets or get the set of distinct values in a list.

• Finally, it is helpful to be able to associate a constraint declaratively with the language concept (or structural pattern) for whose instances it should be executed.

3 Constraint checking should also be efficient in terms of speed and memory usage, even for large models. To this end, it is useful if the constraint language supports impact analysis, so we can find out efficiently which constraints have to be reevaluated for any given change to a program.

Here is an example constraint written in a pseudo-language. The expression states what must hold for a valid class: either it uses no complex numbers, or it imports the support library.

constraint for: Class
expression: this.operations.arguments.type.filter(ComplexNumber).isEmpty
            || this.imports.any(i|i.name == "ComplexNumberSupportLib")
message: "class "+this.name+" uses complex numbers, "+
         "so the ComplexNumberSupportLib must be imported"
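For comparison, the same rule can be approximated in plain Java with the streams API, which provides the path-like navigation and higher-order collection functions discussed above. This is only a sketch: ClassDecl, Operation and check are hypothetical names, and types and imports are represented as plain strings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

/** Sketch of the complex-number constraint over a tiny hand-written model. */
final class ComplexNumberConstraint {

    /** Hypothetical model element: an operation with the type names of its arguments. */
    static final class Operation {
        final List<String> argumentTypes;

        Operation(List<String> argumentTypes) {
            this.argumentTypes = argumentTypes;
        }
    }

    /** Hypothetical model element: a class with imports and operations. */
    static final class ClassDecl {
        final String name;
        final List<String> imports;
        final List<Operation> operations;

        ClassDecl(String name, List<String> imports, List<Operation> operations) {
            this.name = name;
            this.imports = imports;
            this.operations = operations;
        }
    }

    /** Returns an error message if the class violates the constraint, otherwise empty. */
    static Optional<String> check(ClassDecl c) {
        boolean usesComplex = c.operations.stream()
                .flatMap(op -> op.argumentTypes.stream())  // operations.arguments.type
                .anyMatch("ComplexNumber"::equals);
        boolean importsLib = c.imports.contains("ComplexNumberSupportLib");
        if (usesComplex && !importsLib) {
            return Optional.of("class " + c.name + " uses complex numbers, "
                    + "so the ComplexNumberSupportLib must be imported");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ClassDecl signal = new ClassDecl("Signal",
                Collections.<String>emptyList(),
                Arrays.asList(new Operation(Arrays.asList("int", "ComplexNumber"))));
        // Prints the error message, because the import is missing.
        System.out.println(check(signal).orElse("ok"));
    }
}
```

Compared to the declarative form, the association with the concept (here: the parameter type of check) and the reporting of the message have to be wired up by hand.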

Some kinds of constraints require specialized data structures to be built or maintained in sync with the program. Examples include dead code detection, missing returns in some branches of a method’s body, or read access to an uninitialized variable. To be able to find these kinds of errors statically, a dataflow graph has to be constructed from the program. It models the various execution paths through a (part of a) program. Once a dataflow graph is constructed, it can be used to check whether there exists a path from program start to a variable read without coming across a write to the same variable. We show an example of the use of a data flow graph in the MPS example (Section 9.2).

9.1 Constraints in Xtext

Just like scopes, constraints are implemented in Java or any other JVM language4. Developers add methods to a validator class generated by the Xtext project wizard. In the end, these validations plug into the EMF validation framework5.

4 As mentioned earlier, a language that provides higher-order functional abstractions such as Xtend is very useful for navigating and querying ASTs.

5 Other EMF EValidator implementations can be used in Xtext as well.

A constraint checking method is a Java method with the following characteristics: it is public, returns void, can have an arbitrary name, it has a single argument of the type for which the check should apply, and it has the @Check annotation. For example, the following method is a check that is invoked for all instances of CustomState (i.e. not for start states and background states). It checks that each such state can actually be reached by verifying that it has incoming transitions (expressed via a ChangeStateStatement):

@Check(CheckType.NORMAL)
public void checkOrphanEndState( CustomState ctx ) {
    CoolingProgram coopro = Utils.ancestor(ctx, CoolingProgram.class);
    // iterate over all descendants of the owning program
    TreeIterator<EObject> all = coopro.eAllContents();
    while ( all.hasNext() ) {
        EObject s = all.next();
        if ( s instanceof ChangeStateStatement ) {
            ChangeStateStatement css = (ChangeStateStatement) s;
            // found a transition into ctx: the state is reachable
            if ( css.getTargetState() == ctx ) return;
        }
    }
    error("no transition ever leads into this state",
          CoolingLanguagePackage.eINSTANCE.getState_Name());
}

The method retrieves the cooling program that owns the ctx state, then retrieves all of its descendants and iterates over them. If the descendant is a ChangeStateStatement and its targetState property references the current state, then we return: we have found a transition leading into the current state. If we don’t find one of these, we report an error. An error report contains the error message, a severity (INFO, WARNING, ERROR), the element to which it is attached6, as well as the particular feature7 of that element that should be highlighted. The CheckType.NORMAL in the annotation defines when this check should run:

• CheckType.NORMAL: run when the file is saved.

• CheckType.FAST: run after each model change (more or less after each keypress).

• CheckType.EXPENSIVE: run only if requested explicitly via the context menu.

6 The error message in Eclipse will be attached to this program element.

7 Feature is EMF’s term for properties, references and operations of EClasses.
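The check above scans all descendants of the cooling program once per state. A whole-program variant can collect every targeted state in one scan and report all orphans in a second one. The following plain Java sketch shows the idea on a hand-written model; State, changeStateTargets and findOrphans are hypothetical names, states are identified by name, and the start state is passed in explicitly. It is not Xtext validator code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of a whole-program orphan state check that needs only two scans. */
final class OrphanStateScan {

    /** Hypothetical state: its name plus the names of the states it can change to. */
    static final class State {
        final String name;
        final List<String> changeStateTargets;

        State(String name, List<String> changeStateTargets) {
            this.name = name;
            this.changeStateTargets = changeStateTargets;
        }
    }

    static List<String> findOrphans(List<State> states, String startState) {
        // Scan 1: collect every state that is the target of some state change.
        Set<String> targeted = new HashSet<>();
        for (State s : states) {
            targeted.addAll(s.changeStateTargets);
        }
        // Scan 2: every non-start state that was never targeted is an orphan.
        List<String> orphans = new ArrayList<>();
        for (State s : states) {
            if (!s.name.equals(startState) && !targeted.contains(s.name)) {
                orphans.add(s.name);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        List<State> states = Arrays.asList(
                new State("start", Arrays.asList("normalCooling")),
                new State("normalCooling", Arrays.asList("start")),
                new State("defrost", Collections.<String>emptyList()));
        System.out.println(findOrphans(states, "start"));  // [defrost]
    }
}
```

The work is linear in the size of the program instead of proportional to the number of states times the program size, at the price of a constraint that is attached to the whole program rather than to a single state.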

Note that neither Xtext nor any of the other tools supports impact analysis by default. Impact analysis is a strategy for finding out whether a particular constraint can potentially be affected by a particular change, and only evaluating the constraint if it can. Impact analysis can improve performance if this analysis is faster than evaluating the constraint itself. For local constraints this is usually not the case. Only for non-local constraints that cover large parts of the model (and possibly require loading additional fragments) is impact analysis important. Xtext uses a pragmatic approach in the sense that these constraints must be marked as EXPENSIVE by the user and run only on request (over lunch, during nightly build). As an example, let us get back to the example about orphan states. The implementation of the constraint checks orphan-ness separately for each state. In doing so, it gets all descendants of the cooling program for each state. This can be a scalability problem for larger programs. To address this issue, one would write a single constraint for the whole cooling program that identifies all orphan states in one or maybe two scans through the program. This constraint could then be marked as EXPENSIVE as programs get really big8.

8 In general, local constraints (as shown in the code above) are easier to write than the more optimized global constraints. However, the latter often perform better. Unless it is clear from the start that programs will become big, it is a good idea to first write local, simpler and maybe less efficient constraints, and then use profiling to detect performance bottlenecks later. As usual, premature optimization leads to code that is hard to maintain.

9.2 Constraints in MPS

9.2.1 Simple Constraints

MPS’ approach to constraints is very similar to Xtext’s9. The main difference is that the constraint is written in BaseLanguage, an extended version of Java that has some features that make constraints more concise. Here is the code for the same state unreachable constraint, which we can make use of in the state machines extension to C:

9 Note that in MPS, constraints are implemented as part of the type system, in Non-Typesystem Rules. MPS also has a language aspect called Constraints, but as we have seen before, this is used for scopes and context constraints.

checking rule stateUnreachable {
  applicable for concept = State as state
  do {
    if (!state.initial &&
        state.ancestor.descendants.
          where({~it => it.target == state; }).isEmpty) {
      error "orphan state - can never be reached" -> state;
    }
  }
}

Currently there is no way to control when a constraint is run10; it is decided based on some MPS-internal algorithm which tracks changes to a model and reevaluates constraints as necessary. However, pressing F5 in a program or explicitly running the model checker forces all constraints to be reevaluated.

10 In contrast to Xtext, constraints can also not be marked as FAST or EXPENSIVE.

9.2.2 Dataflow

As we have said earlier, dataflow analysis can be used to detect dead code, null access, unnecessary ifs (because it can be shown statically that the condition is always true or false) or read-before-write errors. The foundation for data flow analysis is the data flow graph. This is a data structure that describes the flow of data through a program’s code. Consider the following example:

int i = 42;
j = i + 1;
someMethod(j);

The 42 is "flowing" from the init expression in the local variable declaration into the variable i and then, after adding 1, into j, and then into someMethod. Data flow analysis consists of two tasks: building a data flow graph for a program, and then performing analysis on this data flow graph to detect problems in the program.

MPS comes with predefined data structures for representing data flow graphs, a DSL for defining how the graph can be derived from language concepts (and hence, programs) and a set of default analyses that can be integrated into your language11. We will look at all these ingredients in this section.

11 MPS also comes with a framework for developing custom analyses; however, this is beyond the scope of this book.

■ Building a Data Flow Graph Data flow is specified in the Dataflow aspect of language definitions. There you can add data flow builders (DFBs) for your language concepts. These are programs expressed in MPS’ data flow DSL that build the data flow graph for instances of those concepts in programs. Here is the DFB for LocalVariableDeclaration.

data flow builder for LocalVariableDeclaration {
  (node)->void {
    if (node.init != null) {
      code for node.init
      write node = node.init
    } else {
      nop
    }
  }
}

If the LocalVariableDeclaration has an init expression (it is optional!), then the DFB for the init expression has to be executed using the code for statement. Then we perform an actual data flow definition: the write node = node.init specifies that write access is performed on the current node. The statement also expresses that whatever value was in the init expression is now in the node itself. If there is no init expression, we still want to mark the LocalVariableDeclaration node as visited by the data flow builder using the nop statement – the program flow has come across this node12.

12 A subsequent analysis reports all program nodes that have not been visited by a DFB as dead code. So even if a node has no further effect on a program’s data flow, it has to be marked as visited using nop.

To illustrate a read statement, we can take a look at the LocalVariableRef expression which read-accesses the variable it references. Its data flow is defined as read node.var, where var is the name of the reference that points to the referenced variable. In an AssignmentStatement, we first execute the DFB for the rvalue and then "flow" the rvalue into the lvalue – the purpose of an assignment:

data flow builder for AssignmentStatement {
  (node)->void {
    code for node.rvalue
    write node.lvalue = node.rvalue
  }
}

For a StatementList, we simply mark the list as visited and then execute the DFBs for each statement in the list. We are now ready to inspect the data flow graph for the simple function below. Fig. 9.1 shows the data flow graph.

void trivialFunction() {
  int8 i = 10;
  i = i + 1;
}

Figure 9.1: An example of a data flow for a simple C function. You can access the data flow graph for a program element (e.g., a C function) by selecting Language Debug -> Show Data Flow Graph from the element’s context menu. This will render the data flow graph graphically and constitutes a good debugging tool when building your own data flow graphs and analyses.

Most interesting data flow analysis has to do with loops and branching. So specifying the correct DFBs for things like if, switch and for is important. As an example, we look at the DFB for the IfStatement. We start with the obligatory nop to mark the node as visited. Then we run the DFB for the condition, because that is evaluated in all cases. Then it becomes interesting: depending on whether the condition is true or false, we either run the thenPart or we jump to where the else if parts begin. Here is the code so far:

nop
code for node.condition
ifjump after elseIfBlock // elseIfBlock is a label defined later
code for node.thenPart
{ jump after node }

The ifjump statement means that we may jump to the specified label (i.e. we then execute the else ifs). If not (we just "run over" the ifjump), then we execute the thenPart. If we execute the thenPart, we are finished with the whole IfStatement – no else ifs or else parts are relevant, so we jump after the current node (the IfStatement) and we’re done. However, there is an additional catch: in the thenPart, there may be a return statement. So we may never actually arrive at the jump after node statement. This is why it is enclosed in curly braces: this says that the code in the braces is optional, so if the data flow does not visit it, that’s fine (and no dead code error is reported).

Let’s continue with the else ifs. We arrive at the label elseIfBlock if the condition was false, i.e. the above ifjump actually happened. We then iterate over the elseIfs and execute their DFB. After that, we run the code for the elsePart, if there is one. The following code can only be understood if we know that, if we execute one of the else ifs, then we jump after the whole IfStatement. This is specified in the DFB for the ElseIfPart, which we’ll illustrate below. Here is the rest of the code for the IfStatement’s DFB:

label elseIfBlock
foreach elseIf in node.elseIfs {
  code for elseIf
}
if (node.elsePart != null) {
  code for node.elsePart
}

We can now inspect the DFB for the ElseIfPart. We first run the DFB for the condition. Then we may jump to after that else if, because the condition may be false and we want to try the next else if, if there is one. Alternatively, if the condition is true, we run the DFB for the body of the ElseIfPart. Then two things can happen: either we jump to after the whole IfStatement (after all, we have found an else if that is true), or we don’t do anything at all anymore because the current else if contains a return statement. So we have to use the curly braces again for the jump to after the whole if. The code is below, and an example data flow graph is shown in Fig. 9.2.

    code for node.condition
    ifjump after node
    code for node.body
    { jump after node.ancestor<concept = IfStatement> }

Figure 9.2: A data flow graph for the statement if ( i > 0 ) j = 1; else j = 2;

The DFB for a for loop makes use of the fact that loops can be represented using conditional branching. Here is the code:

    code for node.iterator
    label start
    code for node.condition
    ifjump after node
    code for node.body
    code for node.incr
    jump after start

We first execute the DFB for the iterator (which is a subconcept of LocalVariableDeclaration, so the DFB shown above works for it as well). Then we define a label start so we can jump to this place from further down. We then execute the condition. Then we have an ifjump to after the whole loop (which covers the case in which the condition is false and the loop ends). In the other case (where the condition is still true) we execute the code for the body and the incr part of the for loop. We then jump to after the start label we defined above.

Analyses

MPS supports a number of data flow analyses out of the box [13]. The following utility class uses the unreachable code analysis:

    public class DataflowUtil {

      private Program prog;

      public DataflowUtil(node<> root) {
        // build a program object and store it
        prog = DataFlow.buildProgram(root);
      }

      public void checkForUnreachableNodes() {
        // grab all instructions that
        // are unreachable (predefined functionality)
        sequence<Instruction> allUnreachableInstructions =
            ((sequence<Instruction>) prog.getUnreachableInstructions());

        // remove those that may legally be unreachable
        sequence<Instruction> allWithoutMayBeUnreachable =
            allUnreachableInstructions.where({~instruction =>
              !(Boolean.TRUE.equals(instruction.getUserObject("mayBeUnreachable")));
            });

        // get the program nodes that correspond
        // to the unreachable instructions
        sequence<node<>> unreachableNodes =
            allWithoutMayBeUnreachable.select({~instruction =>
              ((node<>) instruction.getSource());
            });

        // output errors for each of those unreachable nodes
        foreach unreachableNode in unreachableNodes {
          error "unreachable code" -> unreachableNode;
        }
      }
    }

[13] These analyses operate only on the data flow graph, so the same analyses can be used for any language, once the DFBs for that language map programs to data flow graphs.

The class builds a Program object in the constructor. Programs are wrappers around the data flow graph and provide access to a set of predefined analyses on the graph. We will make use of one of them here in the checkForUnreachableNodes method. This method extracts all unreachable nodes from the graph (see comments in the code above) and reports errors for them. To actually run the check, we call this method from a non-typesystem rule for C functions:

    checking rule check_DataFlow {
      applicable for concept = Function as fct
      overrides false
      do {
        new DataflowUtil(fct.body).checkForUnreachableNodes();
      }
    }
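The effect of the optional braces on the unreachable code analysis can be illustrated outside MPS. The following Python sketch is not MPS code: the tuple encoding of instructions, the function names and the ret instruction are assumptions made purely for illustration. It models a DFB as a flat instruction list, computes which instructions can be reached, and then filters out those marked as legally unreachable, mirroring the mayBeUnreachable user object used above.

```python
# Illustrative model only; none of these names come from MPS.
# An instruction is (kind, target, may_be_unreachable).
# Kinds: "nop", "ifjump" (may jump or fall through), "jump", "ret".

def successors(program, i):
    kind, target, _ = program[i]
    if kind == "ret":
        return []                  # control leaves the function
    if kind == "jump":
        return [target]            # unconditional jump
    if kind == "ifjump":
        return [i + 1, target]     # fall through, or jump to the target
    return [i + 1]                 # plain instruction


def unreachable(program):
    """Indices of instructions that no path starting at index 0 can visit."""
    seen, todo = set(), [0]
    while todo:
        i = todo.pop()
        if i in seen or i >= len(program):
            continue               # an index past the end means "after the code"
        seen.add(i)
        todo.extend(successors(program, i))
    return [i for i in range(len(program)) if i not in seen]


def dead_code_errors(program):
    """Unreachable instructions, minus those wrapped in optional braces."""
    return [i for i in unreachable(program) if not program[i][2]]


# Models: if (c) { return; } else { x = 1; }  return;  y = 2;
program = [
    ("nop", None, False),     # 0: visit the if statement and its condition
    ("ifjump", 4, False),     # 1: condition false, continue at the else part
    ("ret", None, False),     # 2: the thenPart contains a return
    ("jump", 5, True),        # 3: { jump after node }, marked optional
    ("nop", None, False),     # 4: else part
    ("ret", None, False),     # 5: return after the if statement
    ("nop", None, False),     # 6: statement after the return
]

print(unreachable(program))
print(dead_code_errors(program))
```

In this model the optional jump and the statement after the final return are both unreachable, but only the latter is reported, because the former is flagged as legally unreachable.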

9.3 Constraints in Spoofax

Spoofax uses rewrite rules to specify all semantic parts of a language definition. In this section, we first provide a primer on rewrite rules. Next we show how they can be used to specify constraints in language definitions [14].

[14] Rewrite rules are used for all kinds of other purposes in Spoofax, and we will encounter them again, for example in the chapter on transformation and code generation, Section 11.4. This is why we explain them in some detail here.

9.3.1 Rewrite Rules

Rewrite rules are functions that operate on terms, transforming one term to another. Rewrite rules in Spoofax are provided as part of the Stratego program transformation language. A basic rewrite rule that transforms a term pattern term1 to a term pattern term2 has the following form:

    rule-name:
      term1 -> term2

Term patterns have the same form as terms: any term is a legal term pattern. In addition to the basic constructors, string literals, integer literals, and so on, they also support variables (e.g., v or name) and wildcards (indicated by _). As an example, the following rewrite rule rewrites an Entity to the list of properties contained in that entity:

    get-properties:
      Entity(name, properties) -> properties

So, for an Entity("User", [Property("name", "String")]), it binds "User" to the variable name, and [Property("name", "String")] to the variable properties. It then returns the collection properties. While rewrite rules can be viewed as functions, they have one important difference: they can be defined multiple times for different patterns [15]. In the case of get-properties, we could add another definition that works for property access expressions:

    get-properties:
      PropAccess(expr, property) -> property

[15] This is comparable to polymorphic overloading.

Rules can have complex patterns. For example, it is possible to write a rule that succeeds only for entities with only a name property [16]:

    is-name-only-entity:
      Entity(_, [Property("name", "String")]) -> True()

[16] Note how this rule uses a wildcard since it doesn’t care about the name of the entity.
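The idea that one rule name can have several definitions, each guarded by a pattern, can be mimicked in a general-purpose language. The following Python sketch is only an analogy, not Stratego: terms are encoded as tuples, and the Fail exception and all function names are assumptions introduced here. Each definition either matches and returns a term or fails; this sketch simply tries the definitions in the order they are listed.

```python
# Analogy only: a term is a tuple ("Constructor", arg1, arg2, ...).

class Fail(Exception):
    """Raised when a rule does not apply to a term."""


def get_properties_entity(term):
    # Entity(name, properties) -> properties
    if isinstance(term, tuple) and len(term) == 3 and term[0] == "Entity":
        _, name, properties = term
        return properties
    raise Fail


def get_properties_prop_access(term):
    # PropAccess(expr, property) -> property
    if isinstance(term, tuple) and len(term) == 3 and term[0] == "PropAccess":
        _, expr, prop = term
        return prop
    raise Fail


def get_properties(term):
    """One rule name, several definitions: the first one that matches wins."""
    for definition in (get_properties_entity, get_properties_prop_access):
        try:
            return definition(term)
        except Fail:
            continue
    raise Fail


def is_name_only_entity(term):
    # Entity(_, [Property("name", "String")]) -> True()
    if (isinstance(term, tuple) and len(term) == 3 and term[0] == "Entity"
            and term[2] == [("Property", "name", "String")]):
        return ("True",)
    raise Fail


user = ("Entity", "User", [("Property", "name", "String")])
print(get_properties(user))
print(is_name_only_entity(user))
```

Modeling failure as an exception keeps the "a rule either produces a term or fails" behavior explicit, which becomes relevant once rules carry conditions.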

Rewrite rules can be invoked using the syntax <rule-name> term [17]. The angle brackets make it easy to distinguish rule invocations from terms, and make it possible to use invocations in term expressions.

[17] For example, <get-properties> Entity("Unit", []) would return an empty list of properties.

Stratego provides a with clause that can be used for additional code that should be considered for rewrite rules. The with clause is most commonly used for assignments and calls to other rules. As an example, we can write the rule above using a with. This rule assigns the value of get-properties to a variable result and returns that as the result value of the rule:

    invoke-get-properties:
      Entity(name, properties) -> result
      with
        result := <get-properties> Entity(name, properties)

Rules can also have conditions. These can be specified using where [18]. These clauses typically use the operators listed in the following table:

    Expression   Description
    <e> t        Applies e to t, or fails if e is unsuccessful.
    v := t       Assigns a term expression t to a variable v.
    !t => p      Matches a term t against a pattern p, or fails.
    not(e)       Succeeds if e does not succeed.
    e1; e2       Sequence: apply e1. If it succeeds, apply e2.
    e1 <+ e2     Choice: apply e1; if it fails, apply e2 instead.

[18] If the pattern of a rule does not match, or if its conditions do not succeed, a rule is said to fail. As we will see later, whether rules succeed or fail helps guide the execution sequence of sets of rules.

An example of a rule with a where clause is the following:

    has-properties:
      Entity(name, properties) -> True()
      with
        properties := <get-properties> Entity(name, properties)
      where
        not(!properties => [])

This rule only succeeds for entities where the where condition not(!properties => []) holds [19]. That is, it succeeds as long as an entity does not have an empty list (indicated by []) of properties. Rewrite rules can have any number of where and with clauses, and they are evaluated in the order they appear.

[19] !x => y matches a term x against a pattern y. It does not mean logical negation.

Like functions or methods in other languages, rewrite rules can have parameters. Stratego distinguishes between parameters that pass other rules and parameters that pass terms, using a vertical bar to separate the two lists [20]. The Stratego standard library provides a number of higher-order rules, i.e. rules that take other rules as their argument. These rules are used for common operations on abstract syntax trees: for example, map(r) applies a rule r to all elements of a list:

    get-property-types:
      Entity(_, properties) -> types
      with
        types := <map(get-property-type)> properties

    get-property-type:
      Property(_, type) -> type

[20] Rules that take both rule and term parameters have a signature of the form rule(r|t), those with only rule parameters use rule(r), and those with only term parameters use rule(|t).

Rules like map specify a traversal on a certain term structure: they specify how a particular rule should be applied to a term and its subterms. Rules that specify traversals are also called strategies [21]. In Spoofax, strategies are used to control traversals in constraints, transformation, and code generation.

[21] This is where the name of the Stratego transformation language comes from.
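The sequence, choice and negation operators listed earlier, together with the higher-order rule map, can be approximated with ordinary higher-order functions. The Python sketch below is an analogy and not how Stratego is implemented; the Fail exception and every function name are assumptions chosen for illustration. A rule is modeled as a function that returns a term or raises Fail.

```python
class Fail(Exception):
    """Signals that a rule failed on a term."""


def seq(r1, r2):
    # e1; e2 : apply r1, then apply r2 to the result (fails if either fails)
    return lambda t: r2(r1(t))


def choice(r1, r2):
    # e1 <+ e2 : apply r1; only if it fails, apply r2 to the original term
    def rule(t):
        try:
            return r1(t)
        except Fail:
            return r2(t)
    return rule


def not_(r):
    # not(e) : succeeds, leaving the term unchanged, only if r fails
    def rule(t):
        try:
            r(t)
        except Fail:
            return t
        raise Fail
    return rule


def map_(r):
    # map(r) : apply r to every element of a list; fails if any application fails
    return lambda terms: [r(t) for t in terms]


def get_property_type(term):
    # Property(_, type) -> type
    if isinstance(term, tuple) and len(term) == 3 and term[0] == "Property":
        return term[2]
    raise Fail


def get_properties(term):
    # Entity(_, properties) -> properties
    if isinstance(term, tuple) and len(term) == 3 and term[0] == "Entity":
        return term[2]
    raise Fail


# Counterpart of get-property-types: traverse the property list with map
get_property_types = seq(get_properties, map_(get_property_type))

user = ("Entity", "User", [("Property", "name", "String"),
                           ("Property", "age", "Int")])
print(get_property_types(user))
```

Because every combinator returns another rule, traversals such as map compose with plain rules in the same way, which is the property the term "strategy" refers to.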

9.3.2 Basic Constraint Rules

Spoofax uses rules with the name constraint-error to indicate constraints that trigger errors, constraint-warning for warnings, and constraint-note for notes. To report an error, warning or information note, these rules have to be overwritten for the relevant term patterns. The following example is created by default by the Spoofax project wizard. It simply reports a note for any module named example:

    constraint-note:
      Module(name, _) -> (name, "This is just an example program.")
      where
        !name => "example"

The condition checks if the module name matches the string "example". The rule returns (via its right-hand side) a tuple with the tree node where the marker should appear and a string message that should be shown. All constraint rules have this form.

Most constraint rules use string interpolation for error messages. Interpolated strings have the form $[...] where variables can be escaped using [...]. The following example uses string interpolation to report a warning [22].

    constraint-warning:
      Entity(theName, _) -> (theName, $[Entity [theName] does not have a capitalized name])
      where
        not(<string-starts-with-capital> theName)

[22] The rule uses a standard library rule, string-starts-with-capital. These and other library rules are documented on the Spoofax website at www.spoofax.org/.

9.3.3 Index-Based Constraint Rules

Some constraint rules interact with the Spoofax index [23]. One way to do this is to use URI annotations on the abstract syntax. These are placed on each reference and definition.

[23] Notable examples include constraints that forbid references to undefined program elements and duplicate definitions. Newly created Spoofax projects provide default constraint rules for these cases, which can be customized.

For example, a reference to a Mobl variable v is represented as Var("v"). With an annotation, it reads as follows:

Var("v"{[Var(),"v","function","module"]})

The annotation is added directly to the name, surrounded with curly braces [24]. Unresolved references are represented by terms such as the following (notice the Unresolved term, surrounding the namespace):

    Var("u"{[Unresolved(Var()),"u","function","module"]})

[24] The annotation itself is a URI Var://module/function/v, represented as a list consisting of the namespace, the name and the path in reverse order.

In most statically typed languages, references that cannot be statically resolved indicate an error. The following constraint rule reports an error for these cases:

    constraint-error:
      x -> (x, $[Unable to resolve reference.])
      where
        !x => _{[Unresolved(t) | _]}

This rule matches any term x in the abstract syntax, and reports an error if it has an Unresolved annotation [25]. Note how the pattern _{[Unresolved(t) | _]} matches any term (indicated by the wildcard _) that has a list annotation where the head of the list is Unresolved(t) and the tail matches _.

[25] For dynamic languages, or languages with optional types, the constraint could be removed or relaxed. In those cases, name resolution may only play a role in providing editor services such as code completion.

In addition to annotations, the Spoofax index provides an API for inspecting naming relations in programs. The following table shows some of the key rules the index provides.

    index-uri               Gets the URI of a term.
    index-namespace         Gets the namespace of a term.
    index-lookup            Returns the first definition of a reference.
    index-lookup-all        Returns all definitions of a reference.
    index-get-files-of      Gets all files a definition occurred in.
    index-get-all-in-file   Gets all definitions for a given file path.
    index-get-current-file  Gets the path of the current file.

We can use the index API to detect duplicate definitions. In most languages, duplicate definitions are always disallowed. In the case of Mobl, duplicate definitions are not allowed for functions or entities, but they are allowed for variables, just as in JavaScript. The following constraint rule checks for duplicate entity declarations:

    constraint-error:
      Entity(name, _) -> (name, $[Duplicate definition])
      where
        defs := <index-lookup-all> name;
        <gt> (<length> defs, 1)

This rule matches any entity declaration. Then, it fires a helper rule is-duplicates-allowed. Next, the constraint rule deter- mines all definition sites of the entity name. If the list has more than one element, the rule reports an error. This is checked by comparing the length of the list with 1 by calling the gt ("greater than") rule. More sophisticated constraints and error messages can be specified using a type system, as we show in the next chapter.
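The logic of the duplicate check can be sketched independently of Spoofax. In this Python sketch the index is just a dictionary from names to definition sites; the function names, the tuple encoding and the file names are assumptions made for illustration and do not reflect the actual Spoofax index API. As in the rule above, an error is a pair of the offending name and a message.

```python
from collections import defaultdict


def build_index(entities):
    """Map each entity name to the list of places where it is defined."""
    index = defaultdict(list)
    for name, location in entities:
        index[name].append(location)
    return index


def duplicate_errors(entities):
    """Report (name, message) for every declaration of a name defined more than once."""
    index = build_index(entities)
    errors = []
    for name, _location in entities:
        definitions = index[name]        # all definition sites of this name
        if len(definitions) > 1:         # more than one site means a duplicate
            errors.append((name, "Duplicate definition"))
    return errors


# Hypothetical declarations: (entity name, file in which it is declared)
declarations = [("User", "a.mobl"), ("Task", "a.mobl"), ("User", "b.mobl")]
print(duplicate_errors(declarations))
```

Each declaration site of a duplicated name gets its own error, which matches the behavior of a constraint rule that is applied to every matching node of the tree.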

10 Type Systems

Type systems are a subset of constraints: they implement type calculations and type checks. These can be relatively complex, so special support beyond general-purpose constraint checking is useful. In this chapter we discuss what type systems do in general, we discuss various strategies for computing types, and we provide the usual examples with Xtext, MPS and Spoofax.

Let us start with a definition of type systems from Wikipedia:

    A type system may be defined as a tractable syntactic framework for classifying phrases according to the kinds of values they compute. A type system associates types with each computed value. By examining the flow of these values, a type system attempts to prove that no type errors can occur. The type system in question determines what constitutes a type error, but a type system generally seeks to guarantee that operations expecting a certain kind of value are not used with values for which that operation makes no sense.

In summary, type systems associate types with program elements and then check whether these types conform to predefined typing rules. We distinguish between dynamic type systems, which perform the type checks as the program executes, and static type systems, where type checks are performed ahead of execution, mostly based on type specifications in the program. This chapter focuses exclusively on static type checks [1].

[1] If a DSL uses dynamic typing, the type checks are performed at runtime based on the actual types of values. Many of the ways of expressing typing rules are similar in this case. However, all the DSLs I have built so far use static typing; the fact that you can actually have static type systems is a primary benefit of external DSLs. DSLs with dynamic type systems are probably better implemented as internal DSLs, relying on the dynamic type system of the host language. Internal DSLs are beyond the scope of this book.

10.1 Type Systems Basics

To introduce the basic concepts of type systems, let us go back to the example used at the beginning of the section on syntax. As a reminder, here is the example code, and Fig. 10.1 shows the abstract syntax tree.

    var x: int;
    calc y: int = 1 + 2 * sqrt(x)

Figure 10.1: Abstract syntax tree for the above program. Boxes represent instances of language concepts, solid lines represent containment, dotted lines represent cross-references.

Using this example, we can illustrate in more detail what type systems have to do:

Declare Fixed Types. Some program elements have fixed types. They don’t have to be derived or calculated: they are always the same and known in advance. Examples include the integer constants IntConst (whose type is IntType), the square root concept sqrt (whose type is DoubleType), as well as the type declarations themselves (the type of IntType is IntType, the type of DoubleType is DoubleType).

Derive Types. For some program elements, the type has to be derived from the types of other elements. For example, the type of a VarRef (the variable reference) is the type of the referenced variable. The type of a variable is the type of its declared type. In the example above, the type of x and the reference to x is IntType.

Calculate Common Types. Most type systems have some kind of type hierarchy. In the example, IntType is a subtype of DoubleType (so IntType can be used wherever DoubleType is expected). A type system has to support the specification of such subtype relationships. Also, the type of certain program elements may be calculated from the arguments passed to them; in many cases the resulting type will be the "more general" one based on the subtyping relationship. Examples include the Plus and Multi concepts: if the left and right arguments are two IntTypes, the result is an IntType. In the case of two DoubleTypes, the result is a DoubleType. If an IntType and a DoubleType are used, the result is a DoubleType, the more general of the two.

Type Checks. Finally, a type system has to check for type errors and report them to the user. To this end, a language specifies type constraints or type checks that are checked at editing time by the type system based on the calculated types. In the example, a type error would occur if something with a DoubleType were assigned to an IntType variable.
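The four tasks can be seen at work on the example expression. The following Python sketch is a toy model: the tuple encoding of the tree and all function names are assumptions, and it is not tied to any of the tools discussed later. It hard-codes the fixed types, derives the type of a variable reference from its declaration, computes the common type for Plus and Multi, and checks an assignment against a target type.

```python
# Toy model: only IntType and DoubleType exist, and IntType is a subtype of DoubleType.
SUPERTYPES = {"IntType": {"IntType", "DoubleType"}, "DoubleType": {"DoubleType"}}


def is_subtype(sub, sup):
    return sup in SUPERTYPES[sub]


def common_type(a, b):
    """The more general of two types (sufficient because the hierarchy is a chain)."""
    return b if is_subtype(a, b) else a


def typeof(node, declared):
    kind = node[0]
    if kind == "IntConst":                 # fixed type
        return "IntType"
    if kind == "Sqrt":                     # fixed type, regardless of the argument
        return "DoubleType"
    if kind == "VarRef":                   # derived from the referenced variable
        return declared[node[1]]
    if kind in ("Plus", "Multi"):          # common type of both arguments
        return common_type(typeof(node[1], declared), typeof(node[2], declared))
    raise ValueError(f"no typing rule for {kind}")


def check_assignment(target_type, expr, declared):
    """Type check: the expression type must be the target type or a subtype of it."""
    actual = typeof(expr, declared)
    if is_subtype(actual, target_type):
        return None
    return f"cannot assign {actual} to {target_type}"


declared = {"x": "IntType"}                # var x: int;
expr = ("Plus", ("IntConst", 1),           # 1 + 2 * sqrt(x)
        ("Multi", ("IntConst", 2), ("Sqrt", ("VarRef", "x"))))
print(typeof(expr, declared))
print(check_assignment("IntType", expr, declared))
```

With sqrt fixed to DoubleType, this sketch computes DoubleType for the whole expression, so assigning it to an IntType target is flagged by the check, while assigning it to a DoubleType target passes.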

The type of a program element is generally not the same as its language concept [2]. Different instances of the same concept can have different types: a + calculates its type as the more general of the two arguments. So the type of each + instance depends on the types of the arguments of that particular instance.

[2] For example, the concept (meta class) of the number 1 is IntConst and its type is IntType. The type of the sqrt is DoubleType and its concept is Sqrt. Only for type declarations themselves the two are (usually) the same: the type of an IntType is IntType.

Types are often represented with the same technology as the language concepts. As we will see, in the case of MPS types are just nodes, i.e. instances of concepts. In Xtext, we use EObjects, i.e. instances of EClasses as types. In Spoofax, any ATerm can be used as a type. In all cases, we can even define the concepts as part of the language. This is useful, because most of the concepts used as types also have to be used in the program text whenever types are explicitly declared (as in var x: int).

10.2 Type Calculation Strategies

Conceptually, the core of a type system can be considered to be a function typeof that calculates the type for a program element. This function can be implemented in any way suitable; after all, it is just program code. However, in practice, three approaches seem to be used most: recursion, unification and pattern matching. We will explore each of these conceptually, and then provide examples in the tool sections.

10.2.1 Recursion

Recursion is widely used in computer science and we assume that every reader is familiar with it. In the context of type systems, the recursive approach for calculating a type defines a polymorphic function typeof, which takes a program element and returns its type, while calling itself [3] to calculate the types of those elements on which its own type depends. Consider the following example grammar (using Xtext notation):

    LocalVarDecl:
      "var" name=ID (":" type=Type)? ("=" init=Expr)?;

[3] Or, most likely, one of the polymorphic overrides.

The following examples are structurally valid example sentences:

    var i: int          // 1
    var i: int = 42     // 2
    var i: int = 33.33  // 3
    var i = 42          // 4

Let’s develop the pseudo-code of the typeof function for the LocalVarDecl. A first attempt might look as follows:

    typeof( LocalVarDecl lvd ) {
      return typeof( lvd.type )
    }

    typeof( IntType it ) { return it }

    typeof( DoubleType dt ) { return dt }

Notice how typeof for LocalVarDecl recursively calls typeof for its type property. Recursion ends with the typeof functions for the types; they return themselves. This implementation successfully calculates the type of the LocalVarDecl, but it does not address the type check that makes sure that, if an init expression is specified, it has the same type as (or a subtype of) the type property. This could be achieved as follows:

    typeof( LocalVarDecl lvd ) {
      if isSpecified( lvd.init ) {
        assert typeof( lvd.init ) isSameOrSubtypeOf typeof( lvd.type )
      }
      return typeof( lvd.type )
    }

Notice (in the grammar) that the specification of the variable type (in the type property) is also optional. So we have created a somewhat more elaborate version of the function:

    typeof( LocalVarDecl lvd ) {
      if !isSpecified( lvd.type ) && !isSpecified( lvd.init )
        raise error

      if isSpecified( lvd.type ) && !isSpecified( lvd.init )
        return typeof( lvd.type )

      if !isSpecified( lvd.type ) && isSpecified( lvd.init )
        return typeof( lvd.init )

      // otherwise...
      assert typeof( lvd.init ) isSameOrSubtypeOf typeof( lvd.type )
      return typeof( lvd.type )
    }
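A runnable rendering of this pseudo-code might look as follows. It is a sketch under assumptions: the dataclass names, the representation of types as strings, the rule that a literal containing a dot is a double, and the use of TypeError for reporting are choices made here for illustration, not part of Xtext or any framework.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class NumberLiteral:
    text: str


@dataclass
class LocalVarDecl:
    name: str
    type: Optional[str] = None             # "int", "double", or not specified
    init: Optional[NumberLiteral] = None   # optional init expression


def is_same_or_subtype(sub: str, sup: str) -> bool:
    # int is assumed to be a subtype of double
    return sub == sup or (sub == "int" and sup == "double")


def typeof(element: Union[str, NumberLiteral, LocalVarDecl]) -> str:
    if isinstance(element, str):           # recursion ends: a type is its own type
        return element
    if isinstance(element, NumberLiteral):
        return "double" if "." in element.text else "int"
    if isinstance(element, LocalVarDecl):
        lvd = element
        if lvd.type is None and lvd.init is None:
            raise TypeError(f"{lvd.name}: neither type nor init given")
        if lvd.init is None:
            return typeof(lvd.type)
        if lvd.type is None:
            return typeof(lvd.init)        # infer the type from the init expression
        if not is_same_or_subtype(typeof(lvd.init), typeof(lvd.type)):
            raise TypeError(f"{lvd.name}: init is not compatible with declared type")
        return typeof(lvd.type)
    raise TypeError(f"no typing rule for {element!r}")


print(typeof(LocalVarDecl("i", "int")))                        # case 1
print(typeof(LocalVarDecl("i", "int", NumberLiteral("42"))))   # case 2
print(typeof(LocalVarDecl("i", init=NumberLiteral("42"))))     # case 4
```

Case 3 (var i: int = 33.33) raises a TypeError in this sketch, because double is not the same as or a subtype of int.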

10.2.2 Unification

Unification is the second well-known approach to type calculation. Let’s start with a definition from Wikipedia:

Unification is an operation . . . which produces from . . . logic terms a substitution which . . . makes the terms equal modulo some equational theory.

While this sounds quite sophisticated, we have all used unification in high school for solving sets of linear equations. The "equational theory" in this case is algebra. Here is an example:

    (1) 2 * x == 10
    (2) x + x == 10
    (3) x + y == 2 * x + 5

Substitution refers to assignment of values to x and y. A solution for this set of equations is x := 5, y := 10.

Using unification for type systems means that language developers specify a set of type equations which contain type variables (cf. the x and y) as well as type values (the numbers in the above example). Some kind of engine then tries to make all equations true by assigning type values to the type variables in the type equations. The interesting property of this approach is that there is no distinction between typing rules and type checks. We simply specify a set of equations that must be true for the types to be valid [4]. If an equation cannot be satisfied for any assignment of type values to type variables, a type error is detected. To illustrate this, we return to the LocalVarDecl example introduced above.

    var i: int          // 1
    var i: int = 42     // 2
    var i: int = 33.33  // 3
    var i = 42          // 4

[4] Consequently they can be evaluated "in both ways". They can be used for type checking, but they can also be used to compute "missing" types, i.e. support type inference. MPS (which uses this approach) also exploits this declarative nature of typing rules by supporting type-aware code completion (Ctrl-Shift-Space), where MPS computes the required type from the current context and then only shows code completion menu entries that fit the context regarding their type (and not just based on the structure).

The following two type equations constitute the complete type system specification. The :==: operator expresses type equation (left side must be the same type as right side), :<=: refers to subtype-equation (left side must be same type or subtype of right side, the pointed side of < points to the "smaller", the more specialized type) [5].

    typeof( LocalVarDecl.type ) :>=: typeof( LocalVarDecl.init )
    typeof( LocalVarDecl ) :==: typeof( LocalVarDecl.type )

[5] The operators are taken from MPS, which uses this unification for the type system.

Let us look at the four example cases. We use capital letters for free type variables. In the first case, the init expression is not given, so the first equation is ignored. The second equation can be satisfied by assigning T, the type of the variable declaration, to be int. The second equation acts as a type derivation rule and defines the type of the overall LocalVarDecl to be int.

    // var i: int
    typeof( int ) :>=: typeof( int )   // ignore
    typeof( T ) :==: typeof( int )     // T := int

In the second case the type and the init expression are given, and both have types that can be calculated independently of the equations specified for the LocalVarDecl (they are fixed). So the first equation has no free type variables, but it is true with the type values specified (two ints). Notice how in this case the equation acts as a type check: if the equation were not true for the two given values, a type error would be reported. The second equation works the same as above, deriving T to be int.

    // var i: int = 42
    typeof( int ) :>=: typeof( int )   // true
    typeof( T ) :==: typeof( int )     // T := int

The third case is similar to the second case; but the first equation, in which all types are specified, is not true, so a type error is raised.

    // var i: int = 33.33
    typeof( int ) :>=: typeof( double )   // error!
    typeof( T ) :==: typeof( int )        // T := int

Case four is interesting because no variable type is explicitly specified; the idea is to use type inference to derive the type from the init expression. In this case there are two free variables in the equations; substituting both with int solves both equations [6].

    // var i = 42
    typeof( U ) :>=: typeof( int )   // U := int
    typeof( T ) :==: typeof( U )     // T := int

[6] Notice how the unification approach automatically leads to support for type inference!
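To make the role of the "engine" more tangible, here is a deliberately small solver for the equations of the LocalVarDecl cases. It is a sketch under strong assumptions and not how the MPS type system engine works internally: the Var class, the tuple encoding of equations and the solve function are invented for illustration, and a type variable that appears in a subtype equation is simply bound to the known type on the other side, where a real engine would reason about bounds.

```python
class Var:
    """A free type variable such as T or U."""
    def __init__(self, name):
        self.name = name


def resolve(t, subst):
    # Follow bindings until we reach a concrete type or an unbound variable
    while isinstance(t, Var) and t.name in subst:
        t = subst[t.name]
    return t


def is_subtype(sub, sup):
    return sub == sup or (sub == "int" and sup == "double")


def solve(equations):
    """equations: list of (left, op, right), op being '==' or '>='.

    '>=' means: right is the same type as, or a subtype of, left.
    Returns (substitution, errors).
    """
    subst, errors = {}, []
    pending, progress = list(equations), True
    while pending and progress:
        progress, deferred = False, []
        for left, op, right in pending:
            l, r = resolve(left, subst), resolve(right, subst)
            if isinstance(l, Var) and isinstance(r, Var):
                deferred.append((left, op, right))   # not enough information yet
                continue
            progress = True
            if isinstance(l, Var):
                subst[l.name] = r                    # simplification: take the known side
            elif isinstance(r, Var):
                subst[r.name] = l
            elif op == "==" and l != r:
                errors.append(f"{l} is not {r}")
            elif op == ">=" and not is_subtype(r, l):
                errors.append(f"{r} is not a subtype of {l}")
        pending = deferred
    errors.extend("cannot infer a type" for _ in pending)
    return subst, errors


T, U = Var("T"), Var("U")
print(solve([("int", ">=", "int"), (T, "==", "int")]))      # var i: int = 42
print(solve([("int", ">=", "double"), (T, "==", "int")]))   # var i: int = 33.33
print(solve([(U, ">=", "int"), (T, "==", U)]))              # var i = 42
```

The same list of equations serves as a check when all types are known and as an inference rule when some are variables, which is the property the text attributes to the unification approach.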

To further illustrate how unification works, consider the following example, in which we specify the typing rules for array types and array initializers:

    var i: int[]
    var i: int[] = {1, 2, 3}
    var i = {1, 2, 3}

Compared to the LocalVarDecl example above, the additional complication in this case is that we need to make sure that all the initialization expressions (inside the curly braces) have the same or compatible types. Here are the typing equations:

    typevar T
    foreach ( e: init.elements )
      typeof(e) :<=: T
    typeof( LocalVarDecl.type ) :>=: new ArrayType(T)
    typeof( LocalVarDecl ) :==: typeof( LocalVarDecl.type )

We introduce an additional type variable T and iterate over all the expressions in the array initializer, establishing an equation between each of these elements and T. This results in a set of equations that each must be satisfied [7]. The only way to achieve this is for all array initializer members to be of the same (sub-)type. In the examples, this makes T int. The rest of the equations work as explained above. Notice that if we’d written var i = {1, 33.33, 3}, then T := double, but the equations would still work because the element equations use the subtype operator :<=: rather than :==:, and int is a subtype of double.

[7] This clearly illustrates that the :<=: operator is not an assignment, since if it were, only the last of the init.elements would be assigned to T, which clearly makes no sense.

10.2.3 Pattern Matching

In pattern matching we simply list the possible combinations of types in a big table. Cases that are not listed in the table will result in errors. For our LocalVarDecl example, such a table could look like this:

    typeof(type)   typeof(init)   typeof(LocalVarDecl)
    int            int            int
    int            -              int
    -              int            int
    double         double         double
    double         -              double
    -              double         double
    double         int            double

To avoid repeating everything for all valid types, variables could be used. T+ refers to T or subtypes of T.

    typeof(type)   typeof(init)   typeof(LocalVarDecl)
    T              T              T
    T              -              T
    -              T              T
    T              T+             T

Pattern matching is used for binary operators in MPS and also for matching terms in Spoofax.
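The table with variables can be turned into a lookup almost literally. The following Python sketch is illustrative only: the row encoding, the function names and the assumption that int is a subtype of double are choices made here, not taken from MPS or Spoofax. A combination that matches no row results in an error, as described above.

```python
def is_same_or_subtype(sub, sup):
    # int is assumed to be a subtype of double
    return sub == sup or (sub == "int" and sup == "double")


# Each row is (type pattern, init pattern). "T" is a variable, "T+" means
# T or a subtype of T, "-" means "not specified". The result is always T.
TABLE = [
    ("T", "T"),
    ("T", "-"),
    ("-", "T"),
    ("T", "T+"),
]


def match_row(row, declared, init):
    """Return the binding of T if (declared, init) matches the row, else None."""
    type_pat, init_pat = row
    # "-" matches only a missing entry, any other pattern only a present one
    if (type_pat == "-") != (declared is None):
        return None
    if (init_pat == "-") != (init is None):
        return None
    if declared is None:
        return init                        # row:  -   T   ->  T
    if init is None:
        return declared                    # row:  T   -   ->  T
    if init_pat == "T" and init == declared:
        return declared                    # row:  T   T   ->  T
    if init_pat == "T+" and is_same_or_subtype(init, declared):
        return declared                    # row:  T   T+  ->  T
    return None


def typeof_local_var_decl(declared, init):
    for row in TABLE:
        result = match_row(row, declared, init)
        if result is not None:
            return result
    raise TypeError("combination not listed in the table")


print(typeof_local_var_decl("int", "int"))
print(typeof_local_var_decl("double", "int"))
print(typeof_local_var_decl(None, "double"))
```

Calling typeof_local_var_decl("int", "double") raises a TypeError in this sketch, since no row covers a declared int with a double init expression.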

10.3 Xtext Example

Up to version 1.0, Xtext provided no support for implementing type systems [8]. In version 2.0 a type system integrated with the JVM’s type system is available [9]. It is not as versatile as it could be, since it is limited to JVM-related types and cannot easily be used for languages that have no relationship with the JVM, such as C or C++.

[8] Beyond implementing everything manually and plugging it into the constraints.

[9] We illustrate it to some extent in the section on language modularity (Section 16.2).

As a consequence of this limitation and the fact that Xtext is widely used, two third-party libraries have been developed: the Xtext Typesystem Framework (developed by the author [10]), and XTypes (developed by Lorenzo Bettini [11]). In the remainder of this section we will look at the Xtext Typesystem Framework [12].

[10] code.google.com/a/eclipselabs.org/p/xtext-typesystem/

[11] xtypes.sourceforge.net/

[12] For a comparison of the various type system implementation approaches for Xtext, see this SLE 2012 paper: L. Bettini, D. Stoll, and M. Voelter. Approaches and tools for implementing type systems in Xtext. In SLE 2012, 2012.

Xtext Typesystem Framework

The Xtext Typesystem Framework is based on the recursive approach. It provides an interface ITypesystem with a method typeof(EObject) which returns the type for the program element passed in as an argument. In its simplest form, the interface can be implemented manually with arbitrary Java code. To make sure type errors are reported as part of the Xtext validation, the type system framework has to be integrated into the Xtext validation framework manually:

    @Inject
    private ITypesystem ts;

    @Check(CheckType.NORMAL)
    public void validateTypes( EObject m ) {
      ts.checkTypesystemConstraints( m, this );
    }

As we have discussed in Section 10.1, many type systems rely on a limited set of typing strategies [13]. The DefaultTypesystem class implements ITypesystem and provides support for declaratively specifying these strategies. In the code below, a simplified version of the type system specification for the cooling language, the initialize method defines one type (the type of the IntType is a clone of itself) and defines one typing constraint (the expr property of the IfStatement must be a Boolean). Also, for types which cannot be specified declaratively, an operation type(..) can be implemented to programmatically define types. The example below shows this for the NumberLiteral.

    public class CLTypesystem extends DefaultTypesystem {

      private CoolingLanguagePackage cl = CoolingLanguagePackage.eINSTANCE;

      @Override
      protected void initialize() {
        useCloneAsType(cl.getIntType());
        ensureFeatureType(cl.getIfStatement(),
            cl.getIfStatement_Expr(), cl.getBoolType());
      }

      public EObject type( NumberLiteral s, TypeCalculationTrace trace ) {
        if ( s.getValue().contains(".")) {
          return create(cl.getDoubleType());
        }
        return create(cl.getIntType());
      }
    }

[13] Assigning fixed types, deriving the type of an element from one of its properties, calculating the type as the common type of its two arguments.

In addition to the API used in the code above, the Typesystem Framework also comes with a textual DSL to express typing rules (Fig. 10.2 shows a screenshot). From the textual type system specification, a generator generates the implementation of the Java class that implements the type system using the APIs [14].

[14] In that sense, the DSL is just a facade on top of a framework; however, it is a nice example of how a DSL can provide added value over a framework or API.

The DSL provides the following advantages compared to the specification in Java:

• Referential integrity and code completion with the target language meta model is provided.

• If the typing rules are incomplete, a static error is shown in the editor, as opposed to getting runtime errors during initialization of the framework (see the warning in Fig. 10.2).

• Ctrl-Click on a property jumps to the typing rule that defines the type for that property.

Type System for the Cooling Language

The complete type system for the cooling language is 200 lines of DSL code, and another 100 lines of Java code. We’ll take a look at some representative examples. Primitive types usually use a copy of themselves as their type [15]:

    typeof BoolType -> clone
    typeof IntType -> clone
    typeof DoubleType -> clone
    typeof StringType -> clone

[15] It has to be a copy as opposed to the element itself, because the actual program element must not be pulled out of the EMF containment tree.

Alternatively, since all primitive types extend an abstract meta class PrimitiveType, this could be shortened to the following, where the + operator specifies that the rule applies to the specified concept and all its subconcepts:

    typeof PrimitiveType + -> clone

Figure 10.2: The Xtext-based editor for the type system specification DSL provided by the Xtext Typesystem Framework. It is a nice example of the benefits of a DSL over an API (on which it is based), since it can statically show inconsistencies in the type system definition, has a more concise syntax and provides customized go-to-definition functionality.

For concepts that have a fixed type that is different from the concept itself (or a clone), the type can be specified explicitly:

typeof StringLiteral -> StringType

Type systems are most important, and most interesting, in the context of expressions. Since all expressions derive from the abstract Expr concept, we can declare that this class is abstract, and hence no typing rule is given16:

    typeof Expr -> abstract

16 However, the editor reports a warning if there are concrete subclasses of an abstract class for which no type is specified either.

The notation provided by the DSL groups typing rules and type checks for a single concept. The following is the typing information for the Plus concept. It declares the type of Plus to be the common type of the left and right arguments (the "more general" one) and then adds two constraints that check that the left and right argument are either ints or doubles17.

17 These rules do not support using + for concatenating strings and for concatenating strings with numbers (as in "a" + 1). However, support for this feature can be provided as well by using a coercion rule.

    typeof Plus -> common left right {
        ensureType left :<=: IntType, DoubleType
        ensureType right :<=: IntType, DoubleType
    }

The typing rules for Equals are also interesting. They specify that the resulting type is boolean, that the left and right arguments must be COMPARABLE, and that the left and right arguments must be compatible. COMPARABLE is a type characteristic: this can be considered as a collection of types. In this case it is IntType, DoubleType and BoolType. The :<=>: operator describes unordered compatibility: the types of the two properties left and right must either be the same, or left must be a subtype of right, or vice versa.

    characteristic COMPARABLE {
        IntType, DoubleType, BoolType
    }

    typeof Equals -> BoolType {
        ensureType left :<=: char(COMPARABLE)
        ensureType right :<=: char(COMPARABLE)
        ensureCompatibility left :<=>: right
    }

There is also support for ordered compatibility, as can be seen from the typing rule for AssignmentStatement below. It has no type (it is a statement), but the left and right argument must exhibit ordered compatibility: they either have to be the same types, or right must be a subtype of left, but not vice versa:

    typeof AssignmentStatement -> none {
        ensureCompatibility right :<=: left
    }
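The three relations used in the rules above (common type, ordered compatibility and unordered compatibility) can be illustrated with a few lines of plain Java. This is only a sketch under simplifying assumptions: TypeCompatibilitySketch and SketchType are hypothetical names and not part of the Xtext Typesystem Framework, and we assume that IntType is a subtype of DoubleType and that no other subtype relations exist.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration; not the API of the Xtext Typesystem Framework.
final class TypeCompatibilitySketch {

    // Plays the role of the COMPARABLE type characteristic.
    static final Set<SketchType> COMPARABLE =
            EnumSet.of(SketchType.INT, SketchType.DOUBLE, SketchType.BOOL);

    // "common left right": the more general of the two types, or null if they are unrelated.
    static SketchType common(SketchType left, SketchType right) {
        if (left.isSubtypeOf(right)) return right;
        if (right.isSubtypeOf(left)) return left;
        return null;
    }

    // "ensureType x :<=: T1, T2": the actual type must be a subtype of one of the allowed types.
    static boolean ensureType(SketchType actual, SketchType... allowed) {
        return Arrays.stream(allowed).anyMatch(actual::isSubtypeOf);
    }

    // ":<=:" ordered compatibility, e.g. "right :<=: left" for assignments.
    static boolean orderedCompatible(SketchType sub, SketchType sup) {
        return sub.isSubtypeOf(sup);
    }

    // ":<=>:" unordered compatibility: a subtype relation in either direction.
    static boolean unorderedCompatible(SketchType a, SketchType b) {
        return a.isSubtypeOf(b) || b.isSubtypeOf(a);
    }

    public static void main(String[] args) {
        System.out.println(common(SketchType.INT, SketchType.DOUBLE));
        System.out.println(orderedCompatible(SketchType.DOUBLE, SketchType.INT));
        System.out.println(unorderedCompatible(SketchType.DOUBLE, SketchType.INT));
    }
}

// Assumed stand-ins for the cooling language's primitive types.
enum SketchType {
    INT, DOUBLE, BOOL, STRING;

    // Assumption of this sketch: INT is a subtype of DOUBLE; every type is a subtype of itself.
    boolean isSubtypeOf(SketchType other) {
        return this == other || (this == INT && other == DOUBLE);
    }
}
```

Note how an assignment of a double to an int variable fails the ordered check, while the same pair of types passes the unordered check used for Equals.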

The framework uses the generation gap pattern, i.e. from the DSL-based type specification, a generator creates a class CLTypesystemGenerated (for the cooling language) that contains all the code that can be derived from the type system specification. Additional specifications that cannot be expressed with the DSL (such as the typing rule for NumberLiteral shown earlier, or type coercions) can be implemented in Java18.

18 The type system DSL is incomplete, since some aspects of type systems have to be coded in a lower level language (Java). However, in this case this is appropriate, since it keeps the type system DSL simple and, since the DSL users are programmers, it is not a problem for them to write a few lines of Java code.

10.4 MPS Example

MPS includes a DSL for type system rule definition. It is based on unification, and pattern matching for binary operators. We discuss each of them.

Unification  The type of a LocalVariableReference is calculated with the following typing rule19. It establishes an equation between the type of the LocalVariableReference itself and the variable it references. typeof is a built-in operator that returns the type for its argument.

    rule typeof_LocalVariableReference {
        applicable for concept = LocalVariableReference as lvr
        overrides false

        do {
            typeof( lvr ) :==: typeof( lvr.variable );
        }
    }

19 Since only the expression within the do {...} block has to be written by the developer, we’ll only show that expression in the remaining examples.
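To give an intuition for what solving such type equations means, the following Java sketch implements a minimal unification over named program nodes using a union-find structure. It is not how MPS implements its type system engine; UnificationSketch and all of its method names are hypothetical, and types are reduced to plain strings.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of solving type equations by unification; not the MPS engine.
final class UnificationSketch {

    // Union-find parent links: nodes in the same class must have the same type.
    private final Map<String, String> parent = new HashMap<>();
    // Concrete type known for the representative of a class, if any.
    private final Map<String, String> concrete = new HashMap<>();

    private String find(String node) {
        parent.putIfAbsent(node, node);
        String p = parent.get(node);
        if (!p.equals(node)) {
            p = find(p);
            parent.put(node, p); // path compression
        }
        return p;
    }

    // Corresponds to an equation such as: typeof( node ) :==: someConcreteType
    void equateWithType(String node, String type) {
        String root = find(node);
        String known = concrete.get(root);
        if (known != null && !known.equals(type)) {
            throw new IllegalStateException("type error: " + known + " vs. " + type);
        }
        concrete.put(root, type);
    }

    // Corresponds to an equation such as: typeof( a ) :==: typeof( b )
    void equate(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (ra.equals(rb)) return;
        String ta = concrete.get(ra);
        String tb = concrete.get(rb);
        if (ta != null && tb != null && !ta.equals(tb)) {
            throw new IllegalStateException("type error: " + ta + " vs. " + tb);
        }
        parent.put(ra, rb);
        if (tb == null && ta != null) concrete.put(rb, ta);
    }

    // Returns the type the equations determine for the node, or null if it is still unknown.
    String typeOf(String node) {
        return concrete.get(find(node));
    }

    public static void main(String[] args) {
        UnificationSketch solver = new UnificationSketch();
        solver.equate("lvr", "lvr.variable");              // equation from the typing rule
        solver.equateWithType("lvr.variable", "IntType");  // equation contributed elsewhere
        System.out.println(solver.typeOf("lvr"));
    }
}
```

The order in which equations are stated does not matter: the reference is related to its variable before the variable's type is known, and the solver still derives the type of the reference.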

The rule for the Boolean NotExpression contains two equations. The first one makes sure that the negated expression is Boolean. The second one types the NotExpression itself to be Boolean20.

    typeof( notExpr.expression ) :==: new node<BooleanType>();
    typeof( notExpr ) :==: <boolean>;

20 Just as in Xtext, in MPS types are instances of language concepts, so they can be instantiated like any other concept. MPS supports two ways of doing this. The first one (as shown in the first equation above) uses the BaseLanguage new expression. The second one uses a quotation, where a "piece of tree" can be inlined into program code. It uses the concrete syntax of the quoted construct (here: a BooleanType) in the quotation.

A more interesting example is the typing of structs. Consider the following C code:

    struct Person {
        char* name;
        int age;
    }

    int addToAge( Person p, int delta ) {
        return p.age + delta;
    }

At least two program elements have to be typed: the parameter p as well as the p.age expression. The type of the FunctionParameter concept is the type of its type property. This is not specific to the fact that the parameter refers to a struct.

    typeof( parameter ) :==: typeof( parameter.type );

Figure 10.3: Structure diagram of the language concepts involved in typing structs.

The language concept that represents the Person type in the parameter is a StructType. A StructType refers to the StructDeclaration whose type it represents, and extends Type, which acts as the super type for all types in mbeddr C21.

21 This is essentially a use of the Adapter pattern.

p.age is an instance of a StructAttributeReference. It is defined as follows (see Fig. 10.4 as well as the code below). It is an Expression, owns another expression property (on the left of the dot), as well as a reference to a StructAttribute (name or age in the example).

    concept StructAttributeReference extends Expression implements ILValue
      children:
        Expression context 1
      references:
        StructAttribute attribute 1

Figure 10.4: Structure diagram of the language concepts involved in references to struct attributes.

The typing rule for the StructAttributeReference is shown in the code below. First, the context, the expression on which we use the dot operator, has to be a GenericStructType, or a subtype thereof (i.e. a StructType which points to an actual StructDeclaration). Second, the type of the whole expression is the type of the referenced attribute (e.g., int in the case of p.age).

    typeof( structAttrRef.context ) :<=: new node<GenericStructType>();
    typeof( structAttrRef ) :==: typeof( structAttrRef.attribute );

This example also illustrates the interplay between the type system and other aspects of language definition, specifically scopes. The StructAttribute reference (on the right side of the dot) may only reference a StructAttribute that is part of the StructDeclaration that is referenced from the StructType. The following scope definition illustrates how we access the type of the expression from the scoping rule:

    link {attribute}
      search scope:
        (model, scope, referenceNode, linkTarget, enclosingNode)
                ->join( ISearchScope | sequence<node<StructAttribute>>) {
          node<> exprType = typeof( referenceNode.expression );
          if (exprType.isInstanceOf(StructType)) {
            return (exprType as StructType).struct.attributes;
          } else {
            return null;
          }
        }

Pattern Matching  As we will discuss in the chapter on language extension and composition, MPS supports incremental extension of existing languages. Extensions may also introduce new types, and, specifically, may allow existing operators to be used with these new types. This is facilitated by MPS’ use of pattern matching in the type system, specifically for binary operators such as +, > or ==. As an example, consider the introduction of complex numbers into C. It should be possible to write code like this:

    complex c1 = (1, 2i);
    complex c2 = (3, 5i);
    complex c3 = c1 + c2; // results in (4, 7i)

The + in c1 + c2 should be the + defined by the original C language22. Reusing the original + requires that the typing rules defined for PlusExpression in the original C language will now have to accept complex numbers; the original typing rules must be extended. To enable this, MPS supports overloaded operations containers. The following container, taken from the mbeddr C core language, defines the type of + and - if both arguments are int or double.

22 Alternatively, we could define a new + for complex numbers. While this would work technically (remember there are no parser ambiguity problems), it would mean that users, when entering a +, would have to decide between the original plus and the new plus for complex numbers. This would not be very convenient from a usability perspective. By reusing the original plus we avoid this problem.

    overloaded operations rules binaryOperation

    operation concepts: PlusExpression | MinusExpression
    left operand type: <int>
    right operand type: <int>
    operation type: (operation, leftOperandType, rightOperandType)->node<> {
      <int>;
    }

    operation concepts: PlusExpression | MinusExpression
    left operand type: <double>
    right operand type: <double>
    operation type: (operation, leftOperandType, rightOperandType)->node<> {
      <double>;
    }

To integrate these definitions with the regular typing rules, the following typing rule must be written23. The typing rules tie in with overloaded operation containers via the operation type construct:

    rule typeof_BinaryExpression {
        applicable for concept = BinaryExpression as binex

        do {
            node<> optype = operation type( binex , left , right );
            if (optype != null) {
                typeof(binex) :==: optype;
            } else {
                error "operator " + binex.concept.name + " cannot be applied to " +
                      left.concept.name + "/" + right.concept.name -> binex;
            }
        }
    }

23 Note that only one such rule must be written for all binary operations. Everything else will be handled with the overloaded operations containers.

The important aspect of this approach is that overloaded oper- ation containers are additive. Language extensions can simply contribute additional containers. For the complex number ex- ample, this might look like the following: we declare that as soon as one of the arguments is of type complex, the resulting type will be complex as well.

    operation concepts: PlusExpression | MinusExpression
    one operand type: <complex>
    operation type: (operation, leftOperandType, rightOperandType)->node<> {
      <complex>;
    }
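The additive nature of these containers can be mimicked in Java with a registry to which rules are contributed, and which the single typing rule for binary expressions queries. The following is a sketch only: OverloadedOperationsSketch, Rule, both and oneOperand are hypothetical names, types are reduced to strings, and none of this reflects the actual MPS implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;

// Hypothetical illustration of additive overloaded-operation rules; not MPS code.
final class OverloadedOperationsSketch {

    // A rule maps the operand types to a result type if it is applicable.
    interface Rule extends BiFunction<String, String, Optional<String>> {}

    private final List<Rule> rules = new ArrayList<>();

    // Languages and language extensions contribute rules; existing rules are never modified.
    void contribute(Rule rule) {
        rules.add(rule);
    }

    // Counterpart of the "operation type" construct: the first applicable rule wins.
    Optional<String> operationType(String left, String right) {
        for (Rule rule : rules) {
            Optional<String> result = rule.apply(left, right);
            if (result.isPresent()) return result;
        }
        return Optional.empty();
    }

    // Rule that applies if both operands have the given type.
    static Rule both(String operand, String result) {
        return (left, right) -> {
            if (left.equals(operand) && right.equals(operand)) return Optional.of(result);
            return Optional.empty();
        };
    }

    // Rule that applies as soon as one of the operands has the given type.
    static Rule oneOperand(String operand, String result) {
        return (left, right) -> {
            if (left.equals(operand) || right.equals(operand)) return Optional.of(result);
            return Optional.empty();
        };
    }

    public static void main(String[] args) {
        OverloadedOperationsSketch plus = new OverloadedOperationsSketch();
        plus.contribute(both("int", "int"));              // core language
        plus.contribute(both("double", "double"));        // core language
        plus.contribute(oneOperand("complex", "complex")); // extension
        System.out.println(plus.operationType("int", "complex").orElse("error"));
    }
}
```

The extension only adds a rule; the typing rule that calls operationType and the rules of the core language remain untouched.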

The type system DSL in MPS covers a large fraction of the type system rules encountered in practice. The type system for BaseLanguage, which is an extension of Java, is implemented in this way, as is the C type system in mbeddr. However, for exceptional cases, procedural BaseLanguage code can be used to implement typing rules as well.

10.5 Spoofax Example

Spoofax’ rewrite rules support both the recursive approach and pattern matching in specifying type systems. However, in most projects the recursive approach will be found. Therefore we will focus on it in the remainder of the section.

Typing Rules in Spoofax  For typing rules in Spoofax, the basic idea is to use rewrite rules to rewrite language constructs to their types. For example, the following rule rewrites integer numbers to the numeric type. This is an example of assigning a fixed type to a language element.

    type-of:
        Int(value) -> NumType()

Similarly, we can rewrite a + expression to the numeric type:

    type-of:
        Add(exp1, exp2) -> NumType()

However, it is good practice to assign types only to well-typed language constructs. Thus, we should add type checks for the subexpressions:

    type-of:
        Add(exp1, exp2) -> NumType()
        where
            <type-of> exp1 => NumType();
            <type-of> exp2 => NumType()

Spoofax allows for multiple typing rules for the same language construct. This is particularly useful for typing overloaded operators, since each case can be handled by a separate typing rule. For example, when the operator + is overloaded to support string concatenation, we can add the following typing rule:

    type-of:
        Add(exp1, exp2) -> StringType()
        where
            <type-of> exp1 => StringType();
            <type-of> exp2 => StringType()
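Readers who are more familiar with Java than with rewrite rules may find the following analogy helpful. A typing rule either succeeds with a type or fails, which we model with Optional; several rules for the same construct are simply tried one after the other. The class names (RewriteTypingSketch, IntLit, StrLit, Add) are hypothetical, and the rule for string literals is an assumption that does not appear in the Stratego rules above.

```java
import java.util.Optional;

// Hypothetical Java analogy for "rewriting a construct to its type"; not Stratego semantics in full.
final class RewriteTypingSketch {

    static abstract class Exp {}

    static final class IntLit extends Exp {
        final int value;
        IntLit(int value) { this.value = value; }
    }

    // Assumed construct for illustration; the text above does not show a rule for it.
    static final class StrLit extends Exp {
        final String value;
        StrLit(String value) { this.value = value; }
    }

    static final class Add extends Exp {
        final Exp left;
        final Exp right;
        Add(Exp left, Exp right) { this.left = left; this.right = right; }
    }

    // An empty result means that no typing rule succeeded, i.e. the expression is ill-typed.
    static Optional<String> typeOf(Exp e) {
        if (e instanceof IntLit) return Optional.of("NumType");
        if (e instanceof StrLit) return Optional.of("StringType");
        if (e instanceof Add) {
            Add add = (Add) e;
            Optional<String> left = typeOf(add.left);
            Optional<String> right = typeOf(add.right);
            // First rule for Add: both operands are numeric.
            if (is(left, "NumType") && is(right, "NumType")) return Optional.of("NumType");
            // Second rule for Add: both operands are strings (overloaded +).
            if (is(left, "StringType") && is(right, "StringType")) return Optional.of("StringType");
        }
        return Optional.empty();
    }

    private static boolean is(Optional<String> type, String expected) {
        return type.isPresent() && type.get().equals(expected);
    }

    public static void main(String[] args) {
        System.out.println(typeOf(new Add(new IntLit(1), new IntLit(2))));
        System.out.println(typeOf(new Add(new IntLit(1), new StrLit("a"))));
    }
}
```

Because the rules check the types of the subexpressions, an addition with one ill-typed or mismatching operand does not receive a type at all.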

Persistence of Typing Information  Spoofax stores information about the definition sites of names in an in-memory data structure called the index. This typically includes information about types. For example, the type of property and variable references is initially not available at these references, but only at the declaration. But when Spoofax discovers a declaration, it stores its type in the index. Since declaration and references are annotated with the same URI, this information can also be accessed at references. Consider the following name binding rules which also involve type information:

    Property(p, t):
        defines Property p of type t

    Param(p, t):
        defines Variable p of type t

These rules match property and parameter declarations, binding their name to p and their type to t. Spoofax stores this type in the index as information about the property or parameter name. In the typing rules for variable references and property accesses, we need to retrieve this type information from the index:

    type-of:
        Var(name) -> <index-lookup-type> name

    type-of:
        PropAccess(exp, name) -> <index-lookup-type> name

Both rules rewrite references to the type of their definition sites. First, the definition of a reference is looked up in the index. Next, this definition is rewritten to its type. This uses the index-lookup-type rule, which implements the actual type lookup in the index.

In the previous example, the type was explicitly declared in property and parameter declarations. But the type of a definition site is not always explicitly declared. For example, variable declarations in Mobl come with an initial expression, but without an explicit type24.

    var x = 42;

24 The type is inferred from the initial expression.

The type of x is the type of its initial expression 42, that is, NumType(). To make this type of x explicit, we need to calculate the type of the initial expression. The following name binding rule makes this connection between name binding and type system:

    Declare(v, e):
        defines Variable v of type t in subsequent scope
        where e has type t
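The role of the index can be pictured as a map from the URI of a definition to its type. The Java sketch below is only meant to convey this idea: TypeIndexSketch, its methods and the URI strings are hypothetical, the real Spoofax index stores considerably more information, and the literal typing is a deliberately crude assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of an index that maps definition URIs to types; not the Spoofax API.
final class TypeIndexSketch {

    private final Map<String, String> typeByUri = new HashMap<>();

    // Called when a declaration with an explicit type is discovered.
    void storeDefinition(String uri, String type) {
        typeByUri.put(uri, type);
    }

    // Called for a declaration such as "var x = 42;": the type comes from the initial expression.
    void declareFromInitializer(String uri, String initialExpression) {
        storeDefinition(uri, typeOfLiteral(initialExpression));
    }

    // Called at references, which carry the same URI as their declaration.
    Optional<String> lookupType(String uri) {
        return Optional.ofNullable(typeByUri.get(uri));
    }

    // Assumption of this sketch: integer literals are numeric, everything else is a string.
    static String typeOfLiteral(String literal) {
        return literal.matches("-?\\d+") ? "NumType" : "StringType";
    }

    public static void main(String[] args) {
        TypeIndexSketch index = new TypeIndexSketch();
        index.declareFromInitializer("variable://x", "42");
        System.out.println(index.lookupType("variable://x").orElse("unknown"));
    }
}
```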

Additional Types  In Spoofax, types are represented as terms. The constructors for these terms are specified in the syntax definition as labels to productions. Without the ability to define additional constructors, type systems are restricted to types which users can explicitly state in programs, for example in variable declarations. But many type systems require additional types which do not originate from the syntax of the language. Typical examples are top and bottom types in type hierarchies25. For example, Java’s type system has a special type for null values at the bottom of its type hierarchy, which cannot be used as a type in Java programs. Spoofax allows constructors for additional types in signatures to be defined:

    signature constructors
        FunType: List(Type) * Type -> Type

25 A top type is a supertype of every other type; a bottom type is a subtype of every other type.

This defines an additional constructor FunType for Type. In general, a constructor definition is of the form cons: Arg-1 * ...* Arg-n -> Sort, where cons is the constructor name, Sort is the sort this constructor contributes to, and Arg-1 to Arg-n are the sorts of its arguments. In the example, the first subterm should be a list of parameter types (List(Type)), while the second subterm should be the return type. We can employ the so-defined function type in the typing rules for function definitions and calls:

    Function(f, p*, t):
        defines Function f of type FunType(t*, t)
        where p* has type t*

    type-of:
        Call(name, arg*) -> type
        where
            <index-lookup-type> name => FunType(type*, type)

Type Constraints  Like any other constraint, type constraints are specified in Spoofax by rewrite rules which rewrite language constructs to errors, warnings or notes. For example, we can define a constraint on additions26:

    constraint-error:
        exp -> (exp, $[Operator + cannot be applied to arguments [<pp> type1], [<pp> type2].])
        where
            !exp => Add(exp1, exp2);
            <not(type-of)> exp;
            type1 := <type-of> exp1;
            type2 := <type-of> exp2

26 An expression is non-well-typed, or ill-typed, if it cannot be rewritten to a type. But reporting an error on all ill-typed expressions will make it hard to discover the root cause of the error, since every expression with an ill-typed subexpression is also ill-typed. That is why we also check the subexpressions of the addition to be well-typed. The types of the subexpressions are then used to construct a meaningful error message.

Type Compatibility  Whether two types are compatible is again defined by rewrite rules. These rules rewrite a pair of types to the second element of the pair, if the first one is compatible with it. In the simplest case, both types are the same:

    is-compatible-to:
        (type, type) -> type

This rule only succeeds if it gets a tuple with two types that are identical (they both match the same variable type). A type might also be compatible with any type with which its supertype is compatible:

    is-compatible-to:
        (subtype, type) -> type
        where
            supertype := <supertype> subtype;
            <is-compatible-to> (supertype, type)

Here, the subtype relation is defined by a rewrite rule, which rewrites a type to its supertype:

    supertype:
        IntType() -> FloatType()

This approach only works for type systems in which each type has at most one supertype. When a type system allows for multiple supertypes, we have to use lists of supertypes and need to adapt the rule for is-compatible-to accordingly:

    supertypes:
        IntType() -> [ FloatType() ]

    is-compatible-to:
        (subtype, type) -> type
        where
            supertype* := <supertypes> subtype;
            <fetch-elem(is-compatible-to(|type))> supertype*

Here, fetch-elem tries to find an element in a list of supertypes, which is compatible to type. It uses a variant of the rule is-compatible-to in order to deal with a list of types. This variant does not rewrite a pair of types, but only the first type. The second type is passed as a parameter to the rewrite rule. It can be defined in terms of the variant for pairs:

    is-compatible-to(|type2):
        type1 -> <is-compatible-to> (type1, type2)

The compatibility of types can easily be extended to compatibility of lists of types:

    is-compatible-to:
        (type1*, type2*) -> type*
        where
            type* := <zip(is-compatible-to)> (type1*, type2*)

A list type1* of types is compatible with another list type2* of types, if each type in type1* is compatible with the corresponding type in type2*. zip pairs up the types from both lists, rewrites each of these pairs by applying is-compatible-to to them, and collects the results in a new list type*. With the extension for lists, we can define a constraint for function calls, which ensures that the types of the actual arguments are compatible with the types of the formal parameters:

    constraint-error:
        Call(name, arg*) -> (arg*, $[Function [name] cannot be applied to arguments [<pp> arg-type*].])
        where
            fun-type := <index-lookup-type> name;
            !fun-type => FunType(para-type*, type);
            arg-type* := <map(type-of)> arg*;
            <not(is-compatible-to)> (arg-type*, para-type*)

11 Transformation and Generation

In the case of both transformation and generation, another artifact is created from a program, often a program in a less abstract language. This is in contrast to interpretation, which executes programs directly without creating intermediate artifacts. Transformation refers to the case in which the created artifact is an AST, and code generation refers to the case in which textual concrete syntax is created. In some systems, for example MPS, the two are unified into a common approach.

Transformation of models is an essential step in working with DSLs. We typically distinguish between two different cases: if models are transformed into other models, we call this model transformation. If models are transformed into text (usually source code, XML or other configuration files), we refer to code generation1. However, as we will see in the examples below, depending on the approach and tooling used, this distinction is not always easy to make, and the boundary becomes blurred.

1 As we discuss in Part I, we do not cover generation of byte code or machine code. This is mainly for the following reason: by generating the source code of a GPL, we can reuse this GPL’s compiler or interpreter, including all its optimizations (or platform independence). We’d have to rebuild these optimizations in the DSL’s generator. This is a lot of work, and requires skills that are quite different from those most DSL developers (including me) possess.

A fundamentally different approach to processing models is interpretation. While in the case of transformation and generation the model is migrated to artifacts expressed in a different language, in the case of interpretation no such migration happens. Instead, an interpreter traverses an AST and directly performs actions depending on the contents of the AST. Strictly speaking, we have already seen examples of interpretation in the sections on constraints and type systems: constraint and type checks can be seen as an interpreter where the actions performed as the tree is traversed are checks of various kinds. However, the term interpretation is typically only used for cases in which the actions actually execute the model. Execution refers to performing the actions that are associated with the language concepts as defined by the execution semantics of the concepts. We discuss interpretation in the next chapter2.

2 We elaborate on the trade-offs between transformation and generation versus interpretation in the chapter on language design (Section 4.3.5).

Note that when developing transformations and code generators, special care must be taken to preserve or record trace information that can be used for error reporting and debugging. In both cases, we have to be able to go back from the generated code to the higher-level abstractions it has been generated from, so that we can report errors in terms of the higher-level abstractions or show the higher-level source code during a debugging session. We discuss this challenge to some extent in Chapter 15.

11.1 Overview of the approaches

Classical code generation traverses a program’s AST and outputs programming language source code (or other text). In this context, a clear distinction is made between models and source code. Models are represented as an AST expressed with some preferred AS formalism (or meta meta model); an API exists for the transformation to interact with the AST. In contrast, the generated source code is treated as text, i.e. a sequence of characters. The tools of choice for transforming an AST into text are template languages. They support the syntactic mixing of model traversal code and to-be-generated text, separated by some escape character3. Since the generated code is treated merely as text, there is no language awareness (and corresponding tool support) for the target language while editing templates. Xtend4, the language used for code generation in Xtext, is an example of this approach.

3 Xpand and Xtend use «guillemets».

4 Xtend is also sometimes referred to as Xtend2, since it has evolved from the old oAW Xtend language. In this chapter we use Xtend to refer to Xtend2. It can be found at www.eclipse.org/xtend/.

Classical model transformation is the other extreme, in that it works with ASTs only and does not consider the concrete syntax of either the source or the target languages. The source AST is transformed using the source language AS API and a suitable traversal language. As the tree is traversed, the API of the target language AS is used to assemble the target model5. For this to work smoothly, most specialized transformation languages assume that the source and target models are built with the same AS formalism (e.g., EMF Ecore). Model transformation languages typically provide support for efficiently navigating source models, and for creating instances of the AS of the target language (tree construction). Examples for this approach once again include Xtext’s Xtend, as well as QVT Operational6 and ATL7. MPS can also be used in this way. A slightly different approach just establishes relations between the source and target models instead of "imperatively" constructing a target tree as the source is traversed. While this is often less intuitive to write down, the approach has the advantage that it supports transformation in both directions, and also supports model diff8. QVT relational9 is an example of this approach.

5 Note that in this chapter we look at transformation in the context of refinement, i.e. the target model is less abstract and more detailed than the source model. Model transformations can also be used for other purposes, including the creation of views, refactorings and reverse engineering.

6 en.wikipedia.org/wiki/QVT

7 www.eclipse.org/atl/

8 It does so by "relating" two instances of the same language and marking both as readonly; the engine then points out the difference between the two.

In addition to the two classical cases described above, there are also hybrid approaches that blur the boundaries between these two clear-cut extremes. They are based on the support for language modularization and composition, in the sense that the template language and the target language can be composed. As a consequence, the tooling is aware of the syntactic structure and the static semantics of the template language and the target language. Both MPS and Spoofax support this approach to various extents.

In MPS, a program is projected and every editing operation directly modifies the AST, while using a typically textual-looking notation as the "user interface". Template code (the code that controls the transformation process) and target-language code (the code you want to generate) can be represented as nested ASTs, each using its own textual syntax. MPS uses a slightly different approach based on a concept called annotations. Projectional editors can store arbitrary information in an AST. Specifically, they can store information that does not correspond to the language underlying a particular AST. MPS code generation templates exploit this approach: template code is fundamentally an instance of the target language. This "example model" is then annotated with template annotations that define how the example model relates to the source model, and which example nodes must be replaced by (further transformed) nodes from the source model. This allows any language to be "templatized" without changing the language definition itself. The MPS example below will elaborate on this approach.

Spoofax, with its Stratego transformation language, uses a similar approach based on parser technology. As we have already seen, the underlying grammar formalism supports flexible composition of grammars. So the template language and the target language can be composed, retaining tool support for both of these languages. Execution of the template directly constructs an AST of the target language, using the concrete syntax of the target language to specify its structure. The Spoofax example will provide details.
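The difference between the two classical extremes can be made concrete with a small Java sketch. It is deliberately tool-independent: Program and Define are hypothetical stand-ins for a source model and a target AST node, and the assumption that indices start at 0 is made up for the example. The first method produces text directly, whereas the second one builds a target tree that is only turned into text in a separate step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, tool-independent illustration of code generation versus model transformation.
final class GenerationVsTransformationSketch {

    // Source model: a much simplified appliance with named contents.
    static final class Program {
        final List<String> contents;
        Program(String... contents) { this.contents = Arrays.asList(contents); }
    }

    // Target model: an AST node that represents a C preprocessor definition.
    static final class Define {
        final String name;
        final int index;
        Define(String name, int index) { this.name = name; this.index = index; }
    }

    // Classical code generation: the result is a sequence of characters.
    static String generate(Program program) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < program.contents.size(); i++) {
            out.append("#define ").append(program.contents.get(i)).append(' ').append(i).append('\n');
        }
        return out.toString();
    }

    // Classical model transformation: the result is a tree built via the target AS API.
    static List<Define> transform(Program program) {
        List<Define> target = new ArrayList<>();
        for (int i = 0; i < program.contents.size(); i++) {
            target.add(new Define(program.contents.get(i), i));
        }
        return target;
    }

    // A separate, trivial unparser turns the target tree into text.
    static String unparse(List<Define> defines) {
        StringBuilder out = new StringBuilder();
        for (Define d : defines) {
            out.append("#define ").append(d.name).append(' ').append(d.index).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Program program = new Program("fan", "compressor");
        System.out.print(generate(program));
        System.out.print(unparse(transform(program)));
    }
}
```

Both paths yield the same text here, but only the second one gives subsequent processing steps a structured target model to analyze or transform further.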

11.2 Xtext Example

Since Xtext is based on EMF, generators can be built using any tool that can generate code from EMF models, including Acceleo10, Jet11 and of course Xtend. Xtend is a Java-like general-purpose language that removes some of Java’s syntactic noise (it has type inference, property access syntax, operator overloading) and adds syntactic sugar (extension methods, multiple-dispatch and closures). Xtend comes with an interpreter and a compiler, the latter generating Java source. Xtend is built with Xtext, so it comes with a powerful Xtext-based editor. One particularly interesting language feature in the context of code generators is Xtend’s template expressions. Inside these expressions, a complete template language is available (similar to the older Xpand language). Xtend also provides automatic whitespace management12. The functional abstractions provided by Xtend (higher-order functions in particular) make it very well suited for navigating and querying models. In the rest of this section we will use Xtend for writing code generators and model transformations.

10 www.acceleo.org/pages/home/en

11 www.eclipse.org/modeling/m2t/?project=jet

12 Indentation of template code is traditionally a challenge, because it is not clear whether whitespace in templates is intended to go into the target file, or is just used for indenting the template itself.

11.2.1 Generator

We will now look at generating the C code that implements cooling programs. Fig. 11.1 shows a screenshot of a typical generator. The generator is an Xtend class that implements IGenerator, which requires the doGenerate method to be implemented. The method is called for each model file that has changed13 (represented by the resource argument), and it has to output the corresponding generated code via the fsa (file system access) object14.

13 This is achieved by a Builder that comes with Xtext. Alternatively, Xtend generators can also be run from the command line, from another Java program or from ant and maven. Also, other strategies can be implemented in the Eclipse IDE itself, based on custom builder participants, buttons or menu entries.

14 Like any other Xtext language aspect, the generator has to be registered with the runtime module, Xtext’s main configuration data structure. Once this is done, the generator is automatically called for each changed resource associated with the respective language.

When generating code from models, there are two distinct cases. In the first case, the majority of the generated code is fixed; only some isolated parts of the code depend on the input model. In this case a template language is the right tool, because template control code can be "injected" into code that looks similar to what should be generated. In the other case there are fine-grained structures, such as expressions. Since these are basically trees, using template languages for these parts of programs seems unnatural and results in a lot of syntactic noise. A more functional approach is useful. Xtend can deal with both cases elegantly and we will illustrate both cases as part of this example.

Figure 11.1: The top-level structure of a generator written in Xtend is a class that implements the IGenerator interface, which requires the doGenerate method. Inside generator methods, template expressions (delimited with triple single quotes) are typically used. Inside those, guillemets (the small double angle brackets) are used to switch between to-be-generated code (gray background) and template control code. Note also the gray whitespace in the init function. Gray whitespace is whitespace that will end up in the generated code. White whitespace is used for indentation of template code; Xtend figures out which is which automatically.

We start with the high-level structure of the C code generated for a cooling program. The following piece of code illustrates Xtend’s power to navigate a model as well as the template syntax for text generation.

    def compile(CoolingProgram program) {
        '''
        <>
            <>
                #define <> <>
            <>
        <>
        // more ...
        '''
    }

The FOR loop iterates over the moduleImports collection of the program, follows the module reference of each of these, and then selects all Appliances from the resulting collection. The nested loop then iterates over the contents of each appliance and generates a #define for each of them15. After the #define we generate the name of the respective content element and then its index. From within templates, the properties and references of model elements (such as the name or the module or the contents) can simply be accessed using the dot operator. map and filter are collection methods defined by the Xtend standard library16.

15 Notice how the first two and the last two lines are enclosed in guillemets. Since we are in template expression mode (inside the triple single quotes) the guillemets escape to template control code. The #define is not in guillemets, so it is generated into the target file.

16 map creates a new collection from an existing collection where, for each element in the existing collection, the expression after the | creates the corresponding value for the new collection. filter once again creates a new collection from an existing one where only those elements are included that are an instance of the type passed as an argument to filter.

We also have to generate an enum for the states in the cooling program.

typedef enum states {
    null_state,
    <>
    <>
    <>
};

Here we embed a FOR loop inside the enum text. Note how we use the SEPARATOR keyword to put a comma between two subsequent states. In the FOR loop we access the concreteStates property of the cooling program. However, if you look at the grammar or the meta model, you will see that no concreteStates property is defined there. Instead, we call an extension method17; since it has no arguments, it looks like property access. The method is defined further down in the CoolingLanguageGenerator class and is essentially a shortcut for a complex expression:

17 An extension method is an additional method for an existing class, defined without invasively changing the definition of the class.

def concreteStates(CoolingProgram p) {
    p.states.filter(s | !(s instanceof BackgroundState) && !(s instanceof ShadowState))
}
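To make the traversal more tangible, here is a minimal Python sketch of the same idea: select the imported Appliances, emit one #define per content element, and join the enum literals with a separator. Python is used purely for illustration; the classes and field names below are hypothetical stand-ins modeled on the description above, not the classes generated by Xtext.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the model classes; not the real Xtext/EMF classes.
@dataclass
class Content:
    name: str
    index: int

@dataclass
class Appliance:
    contents: list = field(default_factory=list)

@dataclass
class OtherModule:
    pass

@dataclass
class State:
    name: str
    kind: str = "custom"   # "custom", "background" or "shadow"

@dataclass
class CoolingProgram:
    imported_modules: list = field(default_factory=list)
    states: list = field(default_factory=list)

def concrete_states(program):
    """Mimics the concreteStates helper: drop background and shadow states."""
    return [s for s in program.states if s.kind not in ("background", "shadow")]

def compile_defines(program):
    """One #define per content element of every imported Appliance."""
    appliances = [m for m in program.imported_modules if isinstance(m, Appliance)]
    return "\n".join(f"#define {c.name} {c.index}"
                     for a in appliances for c in a.contents)

def compile_state_enum(program):
    """Enum literals joined with a comma, comparable to SEPARATOR in the template."""
    literals = ["null_state"] + [s.name for s in concrete_states(program)]
    return "typedef enum states {\n  " + ",\n  ".join(literals) + "\n};"

program = CoolingProgram(
    imported_modules=[Appliance([Content("cc1", 0), Content("cc2", 1)]), OtherModule()],
    states=[State("start"), State("watch", "background"), State("cooling")],
)
print(compile_defines(program))
print(compile_state_enum(program))
```

Keeping concrete_states as a separate helper mirrors the role of the extension method: the generator code reads as if the model had that property, while the model classes themselves stay unchanged.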

The following code is part of the generator that generates the code for a state transition. It first generates code to execute the exit actions of the current state, then performs the state change (current_state = new_state;) and finally executes the entry actions of the new state (not shown):

if (new_state != current_state) {
    < 0>>
    // execute exit action for state if necessary
    switch (current_state) {
        <>
        case <>:
            <>
            <>
            <>
            break;
        <>
        default:
            break;
    }
    <>

    // The state change
    current_state = new_state;

    // similar as above, but for entry actions
}

The code first uses an IF statement to check whether the program has any states with exit actions (by calling the concreteStatesWitExitActions extension method). The subsequent switch statement is only generated if we have such states. The switch switches over the current_state, and then adds a case for each state with exit actions18. Inside the case we iterate over all the exitStatements and call compileStatement for each of them. compileStatement is marked with dispatch, which makes it a multimethod: it is polymorphically overloaded based on its argument19. For each statement in the cooling language, represented by a subclass of Statement, there is an implementation of this method. The next piece of code shows some example implementations.

18 The s.name expression in the case is actually a reference to the enum literal generated earlier for the particular state. From the perspective of Xtend, we simply generate text: it is not obvious from the template that the name corresponds to an enum literal. Potential structural or type errors are only revealed upon compilation of the generated code.

19 Note that Java can only perform a polymorphic dispatch based on the this pointer. Xtend can dispatch polymorphically over the arguments of methods marked as dispatch.

class StatementExtensions {

    def dispatch compileStatement(Statement s){
        // raise error if the overload for the abstract class is called
    }

    def dispatch compileStatement(AssignmentStatement s){
        s.left.compileExpr +" = " + s.right.compileExpr +";"
    }

    def dispatch compileStatement(IfStatement s){
        '''
        if( <> ){
            <>
            <>
            <>
        }< 0>> else {
            <>
            <>
            <>
        }<>
        '''
    }

    // more ...
}

The implementation of the overloaded methods simply returns the text string that represents the C implementation for the respective language construct20. Notice how the implementation for the IfStatement uses a template string, whereas the one for AssignmentStatement uses normal string concatenation.

20 The two examples shown are simple because the language construct in the DSL closely resembles the C code in the first place.

The compileStatement methods are implemented in the class StatementExtensions. However, from within the CoolingLanguageGenerator they are called using method syntax (st.compileStatement). This works because they are injected as extensions using the following statement:

@Inject extension StatementExtensions

Expressions are handled in the same way as statements. The injected class ExpressionExtensions defines a set of overloaded dispatch methods for Expression and all its subtypes. Since expressions are trees, a compileExpr method typically calls compileExpr recursively on the children of the expression, if it has any. This is the typical idiom to implement generators for expression languages21.

21 Earlier we distinguished between generating a lot of code with only specific parts being model-dependent, and fine-grained tree structures in expressions: this is an example of the latter.

def dispatch String compileExpr (Equals e){
    e.left.compileExpr + " == " + e.right.compileExpr
}
def dispatch String compileExpr (Greater e){
    e.left.compileExpr + " > " + e.right.compileExpr
}
def dispatch String compileExpr (Plus e){
    e.left.compileExpr + " + " + e.right.compileExpr
}
def dispatch String compileExpr (NotExpression e){
    "!(" + e.expr.compileExpr + ")"
}
def dispatch String compileExpr (TrueExpr e){
    "TRUE"
}
def dispatch String compileExpr (ParenExpr pe){
    "(" + pe.expr.compileExpr + ")"
}
def dispatch compileExpr (NumberLiteral nl){
    nl.value
}
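The recursive dispatch idiom is not specific to Xtend. As a rough analogy, the following Python sketch uses functools.singledispatch to select an implementation by the runtime type of the argument and recurses into the children. The node classes are hypothetical and only named after the concepts in the text; this is not how Xtend implements dispatch methods.

```python
from dataclasses import dataclass
from functools import singledispatch

# Hypothetical expression node classes, named after the concepts in the text.
@dataclass
class NumberLiteral:
    value: int

@dataclass
class Plus:
    left: object
    right: object

@dataclass
class Greater:
    left: object
    right: object

@dataclass
class NotExpression:
    expr: object

@singledispatch
def compile_expr(e):
    # Fallback, comparable to the overload for the abstract base class.
    raise TypeError(f"no generator rule for {type(e).__name__}")

@compile_expr.register
def _(e: NumberLiteral):
    return str(e.value)

@compile_expr.register
def _(e: Plus):
    return compile_expr(e.left) + " + " + compile_expr(e.right)

@compile_expr.register
def _(e: Greater):
    return compile_expr(e.left) + " > " + compile_expr(e.right)

@compile_expr.register
def _(e: NotExpression):
    return "!(" + compile_expr(e.expr) + ")"

tree = NotExpression(Greater(Plus(NumberLiteral(1), NumberLiteral(2)), NumberLiteral(3)))
print(compile_expr(tree))
```

Each registered function only knows how to render its own node type and delegates to compile_expr for its children, so adding a new expression type means adding one more function rather than touching existing ones.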

11.2.2 Model-to-Model Transformation

For model-to-model transformations, the same argument can be made as for code generation: since Xtext is based on EMF, any EMF-based model-to-model transformation engine can be used with Xtext models. Examples include ATL, QVT-O, QVT-R and Xtend22.

22 Of course you could use any JVM-compatible programming language, including Java itself. However, Java is really not very well suited, because of its clumsy support for model navigation and object instantiation. Scala and Groovy are much more interesting in this respect.

Model-to-model transformations are similar to code generators in the sense that they traverse the model. But instead of producing a text string as the result, they produce another AST. So the general structure of a transformation is similar. In fact, the two can be mixed. Let us go back to the first code example of the generator:

def compile(CoolingProgram program) {
    val transformedProgram = program.transform
    '''
    <>
        <>
            #define <> <>
        <>
    <>

    // more ...
    '''
}

We have added a call to a function transform at the beginning of the code generation process. This function creates a new CoolingProgram from the original one, and we store it in the transformedProgram variable. The code generator then uses the transformedProgram as the source from which it generates code. In effect, we have added a "preprocessor" model-to-model transformation to the generator23.

23 As discussed in the design section (Section 4.3), this is one of the most common uses of model-to-model transformations.

The transform function (see below) enriches the existing model. It creates a new state (EMERGENCY_STOP), creates a new event (emergency_button_pressed) and then adds a new transition to each existing state that checks whether the new event occurred, and if so, transitions to the new EMERGENCY_STOP state. Essentially, this adds emergency stop behavior to any existing state machine. Let’s look at the implementation:

class Transformation {

    @Inject extension CoolingBuilder
    CoolingLanguageFactory factory = CoolingLanguageFactory::eINSTANCE

    def CoolingProgram transform(CoolingProgram p ) {
        p.states += emergencyState
        p.events += emergencyEvent
        for ( s: p.states.filter(typeof(CustomState)).filter(s|s != emergencyState) ) {
            s.events += s.eventHandler [
                symbolRef [
                    emergencyEvent()
                ]
                changeStateStatement(emergencyState())
            ]
        }
        return p;
    }

    def create result: factory.createCustomState emergencyState() {
        result.name = "EMERGENCY_STOP"
    }

    def create result: factory.createCustomEvent emergencyEvent() {
        result.name = "emergency_stop_button_pressed"
    }
}

The two create methods create new objects, as the create prefix suggests24. However, simply creating objects could be done with a regular method as well:

24 The factory used in these methods is the way to create model elements in EMF. It is generated as part of the EMF code generator.

def emergencyState() {
    val result = factory.createCustomState
    result.name = "EMERGENCY_STOP"
    result
}

What is different in create methods is that they can be called several times, and they still only ever create one object (for each combination of actual argument values). The result of the first invocation is cached, and all subsequent invocations return the object created during the first invocation. Such a behavior is very useful in transformations, because it removes the need to keep track of already created objects. For example, in the transform method, we have to establish references to the state created by emergencyState and the event created by emergencyEvent. To do that, we simply call the same create extension again. Since it returns the same object as during the first call in the first two lines of transform, this actually establishes references to those already created objects25.

25 This is a major difference between text generation and model transformation. In text, two textual occurrences of a symbol are the same thing (in some sense, text strings are value objects). In model transformation the identity of elements does matter. It is not the same if we create one new state and then reference it, or if we create five new states. So a good transformation language helps keep track of identities. The create methods are a very nice way of doing this.

We can now look at the implementation of transform itself. It starts by adding the emergencyState and the emergencyEvent to the program26. We then iterate over all CustomStates except the emergency state we’ve just created. Notice how we just call the emergencyState function again: it returns the same object. We then use a builder to add the following code to each of the existing states.

26 These are the first calls to the respective functions, so the objects are actually created at this point.

on emergency_button_pressed {
    state EMERGENCY_STOP
}

This code could be constructed by procedurally calling the respective factory methods:

val eh = factory.createEventHandler
val sr = factory.createSymbolRef
sr.symbol = emergencyEvent
val css = factory.createChangeStateStatement
css.targetState = emergencyState
eh.statements += css
s.events += eh

The notation used in the actual implementation is more concise and resembles the tree structure of the code much more closely. It uses the well-known builder pattern. Builders are implemented in Xtend with a combination of closures and implicit arguments and a number of functions implemented in the CoolingBuilder class27. Here is the code:

27 Of course, if you add the line count and effort for implementing the builder, then using this alternative over the plain procedural one might not look so interesting. However, if you just create these builder functions once, and then create many different transformations, this approach makes a lot of sense.

class CoolingBuilder {

    CoolingLanguageFactory factory = CoolingLanguageFactory::eINSTANCE

    def eventHandler( CustomState it, (EventHandler)=>void handler ) {
        val res = factory.createEventHandler
        res
    }

    def symbolRef( EventHandler it, (SymbolRef)=>void symref ) {
        val res = factory.createSymbolRef
        it.events += res
    }

    def symbol( SymbolRef it, CustomEvent event ) {
        it.symbol = event
    }

    def changeStateStatement( EventHandler it, CustomState target ) {
        val res = factory.createChangeStateStatement
        it.statements += res
        res.targetState = target
    }
}

This class is imported into the generator with the @Inject extension construct, so the methods can be used "just so".
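The caching behavior of create methods, and why it matters for object identity in a transformation, can be approximated in Python with memoized factory functions. This is only a sketch under simplifying assumptions: the model classes are hypothetical, and functools.lru_cache caches per function for the whole process, which is coarser than what Xtend does.

```python
from dataclasses import dataclass, field
from functools import lru_cache

# Hypothetical, simplified model classes. eq=False keeps identity-based equality
# and hashing, which is what matters for model elements.
@dataclass(eq=False)
class Event:
    name: str

@dataclass(eq=False)
class Handler:
    event: Event
    target: "State"

@dataclass(eq=False)
class State:
    name: str
    handlers: list = field(default_factory=list)

@dataclass(eq=False)
class Program:
    states: list = field(default_factory=list)
    events: list = field(default_factory=list)

@lru_cache(maxsize=None)
def emergency_state():
    # Created on the first call; later calls return the cached object.
    return State("EMERGENCY_STOP")

@lru_cache(maxsize=None)
def emergency_event():
    return Event("emergency_button_pressed")

def transform(p):
    """Add the emergency state and event, then a handler to every other state."""
    p.states.append(emergency_state())
    p.events.append(emergency_event())
    for s in p.states:
        if s is not emergency_state():
            s.handlers.append(Handler(emergency_event(), emergency_state()))
    return p

prog = transform(Program(states=[State("idle"), State("cooling")]))
print([s.name for s in prog.states])
```

Every handler ends up pointing at the one state object that was appended to the program, without the transformation having to store that object in a variable and pass it around.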

11.3 MPS Example

MPS comes with a textgen language for text generation. It is typically just used at the end of the transformation chain, where code expressed in GPLs (like Java or C) is generated to text so it can be passed to existing compilers. Fig. 11.2 shows the textgen component for mbeddr C’s IfStatement. MPS’ text generation language basically appends text to a buffer. We won’t discuss this aspect of MPS any further, since MPS textgen is basically a wrapper language around a StringBuffer. However, this is perfectly adequate for the task at hand, since it is only used in the last stage of generation where the AST is essentially structurally identical to the generated text28.

28 If you want to generate text that is structurally different, then the textgen language is a bit of a pain to use; in this case, the MPS philosophy recommends that you build a suitable intermediate language (such as for XML, or even for a particular schema).

Figure 11.2: The AST-to-text generator for an if statement. It first checks if the condition happens to be a true literal, in which case the if statement is optimized away and only the thenPart is output. Otherwise we generate an if statement, the condition in parentheses, and then the thenPart. We then iterate over all the elseIfs; an elseIf has its own textgen, and we delegate to that one. We finally output the code for the else part.

DSLs and language extensions typically use model-to-model transformations to "generate" code expressed in a low-level programming language29. Writing transformations in MPS involves two ingredients. Templates define the actual transformation. Mapping configurations define which template to run when and where. Templates are valid sentences of the target language. Macros are used to express dependencies on and queries over the input model. For example, when the guard condition (a C expression) should be generated into an if statement in the target model, you first write an if statement with a dummy condition in the template. The following would work: if (true) {}. Then the nodes that should be replaced by the transformation with nodes from the input model are annotated with macros. In our example, this would look like this: if (COPY_SRC[true]){}. Inside the COPY_SRC macro you put an expression that describes which elements from the input model should replace the dummy node true: we use node.guard to replace it with the guard condition of the input node (expected to be of type Transition here). When the transformation is executed, the true node will be replaced by what the macro expression returns, in this case the guard of the input transition. We will explain this process in detail below.

29 The distinction between code generators and model-to-model transformations is much less clear in this case. While it is a model-to-model transformation (we map one AST onto another) the transformations look very much like code generators, since the concrete syntax of the target language is used in the "template".

Template-based Translation of the State Machine

State machines live inside modules. Just like structs, they can be instantiated. The following code shows an example. Notice the two global variables c1 and c2, which are instances of the same state machine Counter.

module Statemachine imports nothing {

    statemachine Counter {
        in events
            start()
            step(int[0..10] size)
        out events
            started()
            resetted()
            incremented(int[0..10] newVal)
        local variables
            int[0..10] currentVal = 0
            int[0..10] LIMIT = 10
        states ( initial = start )
            state start {
                on start [ ] -> countState { send started(); }
            }
            state countState {
                on step [currentVal + size > LIMIT] -> start { send resetted(); }
                on step [currentVal + size <= LIMIT] -> countState {
                    currentVal = currentVal + size;
                    send incremented(currentVal);
                }
                on start [ ] -> start { send resetted(); }
            }
    }

    Counter c1;
    Counter c2;

    void aFunction() {
        trigger(c1, start);
    }
}

State machines are translated to the following lower-level C entities (this high level structure is clearly discernible from the two main templates shown in Fig. 11.3 and Fig. 11.7):

• An enum for the states (with a literal for each state).

• An enum for the events (with a literal for each event).

• A struct declaration that contains an attribute for the current state, as well as attributes for the local variables declared in the state machine.

• And finally, a function that implements the behavior of the state machine using a switch statement. The function takes two arguments: one named instance, typed with the struct mentioned in the previous item, and one named event that is typed to the event enum mentioned above. The function checks whether the instance’s current state can handle the event passed in, evaluates the guard, and if the guard is true, executes exit and entry actions and updates the current state.
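To illustrate the shape of these four entities, the following Python sketch hand-translates the Counter state machine in the same style: two enums, a data record standing in for the struct, and one execute function that dispatches first on the current state and then on the event. It is an illustrative analogy only, not the C code mbeddr generates; the names, the sent list for recording out events, and passing size as a plain parameter are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class CounterState(Enum):      # enum for the states
    start = auto()
    countState = auto()

class CounterEvent(Enum):      # enum for the in events
    start = auto()
    step = auto()

LIMIT = 10                     # a local variable in the DSL; a constant in this sketch

@dataclass
class CounterData:             # plays the role of the generated struct
    current_state: CounterState = CounterState.start
    currentVal: int = 0
    sent: list = field(default_factory=list)   # records out events (illustration only)

def counter_execute(instance, event, size=0):
    """Dispatch on current state, then on event; at most one transition fires."""
    if instance.current_state is CounterState.start:
        if event is CounterEvent.start:
            instance.current_state = CounterState.countState
            instance.sent.append("started")
            return
    elif instance.current_state is CounterState.countState:
        if event is CounterEvent.step:
            if instance.currentVal + size > LIMIT:        # guard of first transition
                instance.current_state = CounterState.start
                instance.sent.append("resetted")
                return
            if instance.currentVal + size <= LIMIT:       # guard of second transition
                instance.currentVal += size
                instance.sent.append(("incremented", instance.currentVal))
                return
        if event is CounterEvent.start:
            instance.current_state = CounterState.start
            instance.sent.append("resetted")
            return

c1 = CounterData()
counter_execute(c1, CounterEvent.start)
counter_execute(c1, CounterEvent.step, size=6)
counter_execute(c1, CounterEvent.step, size=6)
print(c1.current_state, c1.currentVal, c1.sent)
```

Because all state lives in the record passed as the first argument, several instances such as c1 and c2 can share the same execute function.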

Figure 11.3: The MPS generator that inserts two enum definitions and a struct into the module which contains the StateMachine.

The MPS transformation engine works in phases. Each phase transforms models expressed in some languages to other models expressed in the same or other languages. Model elements for which no transformation rules are specified are copied from one phase to the next. Reduction rules are used to intercept program elements and transform them as generation progresses through the phases. Fig. 11.4 shows how this affects state machines. A reduction rule is defined that maps state machines to the various elements we mentioned above. Notice how the surrounding module remains unchanged, because no reduction rule is defined for it.
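The "reduce where a rule exists, copy otherwise" behavior of a single phase can be sketched in a few lines of Python. The Node class, the rule table and the generated names are hypothetical simplifications for illustration; they do not reflect the MPS generator API.

```python
from dataclasses import dataclass, field

# Hypothetical generic AST node: a concept name, a node name and children.
@dataclass
class Node:
    concept: str
    name: str = ""
    children: list = field(default_factory=list)

def reduce_statemachine(sm):
    """Reduction rule: one state machine becomes two enums, a struct and a function."""
    return [
        Node("EnumDeclaration", sm.name + "__states"),
        Node("EnumDeclaration", sm.name + "__events"),
        Node("StructDeclaration", sm.name + "__data"),
        Node("Function", sm.name + "__execute"),
    ]

# Rules are looked up by the concept of the input node.
RULES = {"Statemachine": reduce_statemachine}

def run_phase(node):
    """Apply a rule where one exists; otherwise copy the node and recurse."""
    rule = RULES.get(node.concept)
    if rule is not None:
        return rule(node)
    copy = Node(node.concept, node.name)
    for child in node.children:
        copy.children.extend(run_phase(child))
    return [copy]

module = Node("Module", "Statemachine", [
    Node("Statemachine", "Counter"),
    Node("GlobalVariable", "c1"),
])
(result,) = run_phase(module)
print([(c.concept, c.name) for c in result.children])
```

The module and the global variable have no rule and are therefore copied, while the state machine is replaced by the four lower-level elements.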

Figure 11.4: State machines are transformed (via a model-to-model transformation, if you will) into two enums, a struct and a function. These are then transformed to text via the regular com.mbeddr.core textgen.

Let us look in more detail at the template in Fig. 11.3. It reduces a Statemachine, the input node, to two enums and a struct. We use template fragments (marked with <TF ... TF>) to highlight those parts of the template that should actually be used to replace the input node as the transformation executes. The surrounding module dummy is scaffolding: it is only needed because enums and structs must live in ImplementationModules in any valid instance of the mbeddr C language30.

30 Since the templates are projected example instances of the target language, the template has to be a valid instance of any MPS-implemented language.

We have to create an enum literal for each state and each event. To achieve this, we iterate over all states (and events, respectively). This is expressed with the LOOP macros in the template in Fig. 11.3. The expression that determines what we iterate over is entered in the Inspector, MPS’ equivalent to a properties window; Fig. 11.5 shows the code for iterating over the states31. For the literals of the events enum we use a similar expression (node.events;).

31 Note that the only really interesting part of Fig. 11.5 is the body of the anonymous function (node.states;), which is why from now on we will only show this part.

Figure 11.5: The inspector is used to provide the implementation details for the macros used in the templates. This one belongs to a LOOP macro, so we provide an expression that returns a collection, over which the LOOP macro iterates.

The LOOP macro iterates over collections and then creates an instance of the concept it is attached to for each iteration. In the case of the two enums, the LOOP macro is attached to an EnumLiteral, so we create an EnumLiteral for each event and state we iterate over. However, these various EnumLiterals must all have different names. In fact, the name of each literal should be the name of the state/event for which it is created. We can use a property macro, denoted by the $ sign, to achieve this. A property macro is used to replace values of properties32. In this case we use it to replace the name property of the generated EnumLiteral with the name of the state/event over which we loop. Here is the implementation expression of the property macro:

32 The node macros used above (COPY_SRC) replace whole nodes, as opposed to primitive properties of nodes.

node.cEnumLiteralName();

In the code above, cEnumLiteralName is a behavior method33. It concatenates the name of the parent Statemachine with the string __state_ and the name of the current state (in order to get a unique name for each state):

33 Behavior methods are defined as part of the behavior aspect of the concepts, such as State.

concept behavior State {

    public string cEnumLiteralName() {
        return this.parent : Statemachine.name + "__state_" + this.name;
    }

}

The first of the struct attributes is also interesting. It is used to store the current state. It has to be typed to the state enum that is generated from this particular state machine. The type of the attribute is an EnumType; EnumTypes extend Type and reference the EnumDeclaration whose type they represent. How can we establish the reference to the correct EnumDeclaration? We use a reference macro (->$) to retarget the reference. Fig. 11.6 shows the macro’s expression.

Figure 11.6: A reference macro has to return either the target node, or the name of the target node. The name is then resolved using the scoping rules for the particular reference concept.

Note how a reference macro expects either the target node (here: an EnumDeclaration) as the return value, or a string. That string would be the name of the target element. Our implementation returns the name of the states enum generated in the same template. MPS then uses the target language’s scoping rules to find and link to the correct target element34.

34 This is not a global name lookup! Since MPS knows the reference is an EnumLiteralRef expression, the scope of that concept is used. As long as the name is unique within the scope, this is completely deterministic. Alternatively, the actual node can be identified and returned from the reference macro using mapping labels. However, using names is much more convenient and works also for cross-model references, where mapping labels don’t work.

Let us now address the second main template, Fig. 11.7, which generates the execute function. The switch expression is interesting. It switches over the current state of the current state machine instance. That instance is represented by the instance parameter passed into the function. It has a __currentState field. Notice how the function that contains the switch statement in the template has to have the instance argument, and how its type, the struct, has to have the __currentState attribute in the template. If the respective elements were not there in the template, we couldn’t write the template code! Since there is a convention that in the resulting function the argument will also be called instance, and the attribute will also be called __currentState, we don’t have to use a reference macro to retarget the two.

Figure 11.7: The transformation template for generating the switch-based implementation of a StateMachine. Looking at the template fragment markers (<TF ... TF>) reveals that we only generate the switch statement, not the function that contains it. The reason is that we need to be able to embed the state machine switch into other function-like concepts as well (e.g., operations in components defined in the mbeddr components extension), so we have separated the generation of the function from the generation of the actual state machine behavior. See the text for more details.

Inside the switch we LOOP over all the states of the state machine and generate a case, using the state’s corresponding enum literal. Inside the case, we embed another switch that switches over the event argument. Inside this inner switch we iterate over all transitions that are triggered by the event we currently iterate over:

context state.transitions.where({~it => it.trigger.event == node; });

We then generate an if statement that checks at runtime whether the guard condition for this transition is true. We copy in the guard condition using the COPY_SRC macro attached to the true dummy node. The COPY_SRC macro copies the original node, but it also applies additional reduction rules for this node (and all its descendants), if there are any. For example, in a guard condition it is possible to reference event arguments. The reference to size in the step transition is an example:

statemachine Counter {
    in events
        step(int[0..10] size)
    ...
    states ( initial = start )
        ...
        state countState {
            on step [currentVal + size > LIMIT] -> start { send resetted(); }
        }
}

Event arguments are mapped to the resulting C function via a void* array. A reference to an event argument (EventArgRef) hence has to be reduced to accessing the n-th element in the array (where n is the index of the event argument in the list of arguments). Fig. 11.8 shows the reduction rule. It accesses the array, casts the element to a pointer to the type of the argument, and then dereferences everything35.

35 The reduction rule creates code that looks like this (for an int event attribute): *((int*)arguments[0]).
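The index lookup and the cast-then-dereference pattern can be spelled out as a small Python function that produces the C expression as text. Note the simplification: the real reduction rule builds AST nodes of the target language rather than strings, and the function name and the (name, type) pair encoding are assumptions made for this sketch.

```python
def reduce_event_arg_ref(arg_name, event_args):
    """Return the C expression for a reference to an event argument.

    event_args is the ordered list of (name, c_type) pairs of the triggering event.
    """
    names = [name for name, _ in event_args]
    index = names.index(arg_name)          # position n of the argument in the list
    c_type = event_args[index][1]
    # Access the array, cast to a pointer of the argument type, then dereference.
    return f"*(({c_type}*)arguments[{index}])"

step_args = [("size", "int")]
print(reduce_event_arg_ref("size", step_args))
```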

Figure 11.8: The reduction rule for references to event arguments (to be used inside guard conditions of transitions).

Inside the if statement, we have to generate the code that has to be executed if a transition fires. We first copy in all the exit actions of the current state. Once again, the int8 exitActions; is just an arbitrary dummy statement that will be replaced by the statements in the exit actions (COPY_SRCL replaces a node with a list of nodes). The respective expression is this:

(node, genContext, operationContext)->sequence> {
    if (node.parent : State.exitAction != null){
        return node.parent : State.exitAction.statements;
    }
    new sequence>(empty);
}

We then do the same for the transition actions of the current transition, set the instance->__currentState to the target state of the transition using a reference macro (node.targetState.cEnumLiteralName();), and then we handle the entry actions of the target state. Finally we return, because at most one transition can fire as a consequence of calling the state machine execute function.

As the last example I want to show how the TriggerSMStatement is translated. It injects an event into a state machine:

Counter c1; void aFunction() { trigger(c1, start); }

It must be translated to a call to the generated state machine execute function that we have discussed above. For simplicity, we explain a version of the TriggerSMStatement that does not include event arguments. Fig. 11.9 shows the template.

Figure 11.9: The reduction rule for a trigger statement. It is transformed to a call to the function that implements the behavior of the state machine that is referenced in the first argument of the trigger statement.

We use a dummy function someMethod so that we can embed a function call, because we have to generate a function call to the execute function generated from the state machine. Only the function call is surrounded with the template fragment markers (and will be generated). The function we call in the template code is the smExecuteFunction. It has the same signature as the real, generated state machine execute function. We use a reference macro to retarget the function reference in the function call. It uses the following expression, which returns the name of the function generated for the state machine referenced in the statemachine expression of the trigger statement:

StatemachineType.machine.cFunctionName();

Note how the first argument to the trigger statement can be any expression (local variable reference, global variable reference, a function call). However, we know (and enforce via the type system) that the expression’s type must be a StatemachineType, which has a reference to the Statemachine whose instance the expression represents. So we can cast the expression’s type to StatemachineType, access the machine reference, and get the name of the execute function generated from that state machine36.

36 This is an example of where we use the type system in the code generator, and not just for checking a program for type correctness.

The second argument of the trigger statement is a reference to the event we want to trigger. We can use another reference macro to find the enum literal generated for this event. The macro code is straightforward:

node.event.cEnumLiteralName();

Procedural Transformation of a Test Case

Instead of using the template-based approach shown above, transformations can also be written manually against the MPS API. To do this, a mapping script is called from the mapping configuration (instead of the mapping configuration containing rules). Such a mapping script can contain arbitrary BaseLanguage code that operates on the output model of the transformation37.

37 This is essentially similar to the implementation code for intentions or refactorings (Section 13.6); those also modify a model using the MPS Node API.

As part of the mbeddr project, we have built a Builder extension to BaseLanguage. In the example below, we will build the following code:

module SomeModule imports nothing {

    exported test case testCase1 { }

    exported int32 main(int32 argc, string*[] argv) {
        return test testCase1;
    }
}

The code below builds the code above. Notice how by default we work with concepts directly (as when we mention StatementList or ExecuteTestExpression). However, we can also embed expressions in the builder using the #(..) expression. Nodes created by the builder can be named (as in tc:TestCase) so they can be used as a reference target later (testcase -> tc).

node immo = build ImplementationModule
    name = #(aNamePassedInFromAUserDialog)
    contents += tc:TestCase
        name = "testCase1"
        type = VoidType
    contents += #(MainFunctionHelper.createMainFunction())
        body = StatementList
            statements += ReturnStatement
                expression = ExecuteTestExpression
                    tests += TestCaseRef
                        testcase -> tc

Builders in MPS are a first-class extension to BaseLanguage, which means that the IDE can provide support. For example, if a concept has a mandatory child (e.g. the body in a Function), the IDE will report an error if no node is put into this child slot. Code completion can be provided as well38.

38 Users do not have to build the helper functions we have seen for Xtext/Xtend above. On the other hand, the MPS builder extension is specific to building MPS node trees, whereas the approach taken by Xtext/Xtend is generic, as long as users define the helper functions.

11.4 Spoofax Example

In Spoofax, model-to-model transformations and code generation are both specified by rewrite rules39. This allows for the seamless integration of model-to-model transformation steps into the code generation process; the clear distinction between model-to-model transformation and code generation vanishes40. We look at the various approaches supported by Spoofax in this chapter.

39 Rewrite rules were introduced in Section 9.3.1.

40 Similar to MPS, it is also possible to express model-to-model transformations using the concrete syntax of the target language, even though this requires a bit more setup and care.

Code Generation by String Interpolation

Pure code generation from abstract syntax trees to text can be realized by rewriting to strings. The following simple rules rewrite types to their corresponding representation in Java. For entity types, we use their name as the Java representation:

to-java: NumType() -> "int"
to-java: BoolType() -> "boolean"
to-java: StringType() -> "String"
to-java: EntType(name) -> name

Typically, more complex rules are recursive and use string interpolation41 to construct strings from fixed and variable parts. For example, the following two rewrite rules generate Java code for entities and their properties:

41 We used string interpolation already before to compose error messages in Section 9.3.

to-java:
    Entity(x, ps) -> $[
        class [x] {
            [ps']
        }
    ]
    with
        ps' := <map(to-java)> ps

to-java:
    Property(x, t) -> $[
        private [t'] [x];

        public [t'] get_[x]() {
            return [x];
        }

        public void set_[x] ([t'] [x]) {
            this.[x] = [x];
        }
    ]
    with
        t' := <to-java> t

String interpolation takes place inside $[...] brackets and allows us to combine fixed text with variables that are bound to strings. Variables can be inserted using brackets [...] without a dollar sign42. Instead of variables, we can also directly use the results from other rewrite rules that yield strings or lists of strings:

42 You can also use any other kind of bracket: {...}, <...>, and (...) are allowed as well.

to-java:
    Entity(x, ps) -> $[
        class [x] {
            [<map(to-java)> ps]
        }
    ]

Indentation is important, both for the readability of rewrite rules and of the generated code: the indentation leading up to the $[...] brackets is removed, but any other indentation beyond the bracket level is preserved in the generated output. In this way we can indent the generated code, as well as our rewrite rules. Applying to-java to the initial shopping entity will yield the following Java code:

class Item {

    private String name;

    public String get_name() {
        return name;
    }

    public void set_name (String name) {
        this.name = name;
    }

    private boolean checked;

    public boolean get_checked() {
        return checked;
    }

    public void set_checked (boolean checked) {
        this.checked = checked;
    }

    private int order;

    public int get_order() {
        return order;
    }

    public void set_order (int order) {
        this.order = order;
    }
}

When we prefer camel case in method names, we need to slightly change our code generation rules, replacing get_[x] and set_[x] by get[<to-upper-first> x] and set[<to-upper-first> x]. We also need to specify the following rewrite rule:

    to-upper-first:
      s -> s'
      where
        [first|chars] := <explode-string> s ;
        upper         := <to-upper> first ;
        s'            := <implode-string> [upper|chars]

explode-string turns a string into a list of characters, to-upper upper-cases the first character, and implode-string turns the characters back into a string [43].

[43] All these strategies are part of Stratego’s standard library, documented at releases.strategoxt.org/docs/api/libstratego-lib/stable/docs/.
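For readers more at home in Java than in Stratego, the same explode, upper-case, implode sequence can be sketched as follows. The class and method names are hypothetical and only serve the illustration; they are not part of Spoofax.

```java
// Hypothetical Java counterpart of the to-upper-first rule shown above.
final class UpperFirstDemo {

    // Upper-cases the first character of s and leaves the rest untouched.
    static String toUpperFirst(String s) {
        if (s.isEmpty()) {
            // Assumption: we return the empty string unchanged. In the Stratego
            // rule the pattern [first|chars] would not match an empty list.
            return s;
        }
        char[] chars = s.toCharArray();              // "explode" into characters
        chars[0] = Character.toUpperCase(chars[0]);  // upper-case the first one
        return new String(chars);                    // "implode" back into a string
    }

    public static void main(String[] args) {
        // Accessor names as a camel-case generator would build them.
        System.out.println("get" + toUpperFirst("name"));
        System.out.println("set" + toUpperFirst("checked"));
    }
}
```

The sketch only covers the naming step; the surrounding template is unchanged apart from calling the helper where the property name is inserted.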

■ Editor Integration

To integrate the code generation into our editor, we first have to define the following rewrite rule:

    generate-java:
      (selected, position, ast, path, project-path) -> (filename, result)
      with
        filename := <guarantee-extension(|"java")> path ;
        result   := <to-java> ast

While we are free to choose the name of this rule [44], the patterns on the left- and right-hand side need to follow Spoofax’ convention for editor integration. On the left-hand side, it matches the current selection in the editor, its position in the abstract syntax tree, the abstract syntax tree itself (ast), the path of the source file in the editor [45], and the path of the project this file belongs to. As the right-hand side shows, the rule produces the name of the generated file and its content as a string. The file name is derived from the source file’s path, while the file content is generated from the abstract syntax tree.

[44] The rule is subsequently registered as a builder to make it known to Spoofax; see below.

[45] By default, generation happens on a per-file basis. We can also just generate code for the current selection: to do so, we can replace the last line by result := <to-java> selected.

Once we have defined this rule, we can register it as a builder in editor/Lang-Builders.esv. Here, we add the following rule:

    builder: "Generate Java code (selection)" = generate-java (openeditor) (realtime)

This defines a label for our transformation, which is added to the editor’s Transform menu. Additional options, such as (openeditor) and (realtime), can be used to customize the behaviour of the transformation. The following table lists the available options.

    Option        Description
    (openeditor)  Opens the generated file in an editor.
    (realtime)    Re-generates the file as the source is edited.
    (meta)        Excludes the transformation from the deployed plugin.
    (cursor)      Always transforms the tree node at the cursor.

■ Code Generation by Model Transformation

Rewrite rules with string interpolation support a template-based approach to code generation. Thus, they share two typical problems of template languages. First, they are not syntax safe, that is, they do not guarantee the syntactical correctness of the generated code: we might accidentally generate Java code which cannot be parsed by a Java compiler. Such errors can only be detected by testing the code generator. Second, they inhibit subsequent transformation steps. For example, we might want to optimize the generated Java code, generate Java bytecode from it, and finally optimize the generated Java bytecode. At each step, we would first need to parse the generated code from the previous step before we can apply the actual transformation.

Both problems can be avoided by generating abstract syntax trees instead of concrete syntax, i.e. by using model-to-model transformations instead of code (text) generation. This can be achieved by constructing terms on the right-hand side of rewrite rules:

    to-java: NumType()    -> IntBaseType()
    to-java: BoolType()   -> BooleanBaseType()
    to-java: StringType() -> ClassType("java.lang.String")
    to-java: EntType(t)   -> ClassType(t)

    to-java:
      Entity(x, ps) -> Class([], x, ps')
      with
        ps' := <mapconcat(to-java)> ps

    to-java:
      Property(x, t) -> [field, getter, setter]
      with
        t'     := <to-java> t ;
        field  := Field([Private()], t', x) ;
        getter := Method([Public()], t', $[get_[x]], [], [Return(VarRef(x))]) ;
        assign := Assign(FieldRef(This(), x), VarRef(x)) ;
        setter := Method([Public()], Void(), $[set_[x]], [Param(t', x)], [assign])

When we generate ASTs instead of concrete syntax, we can easily compose transformation steps into a transformation chain by using the output of transformation n as the input for transformation n + 1. But this chain will still result in abstract syntax trees. To turn them back into text, the final tree has to be pretty-printed (or serialized). Spoofax generates a language-specific rewrite rule pp-<LanguageName>-string, which rewrites an abstract syntax tree into a string according to a pretty-printer definition [46].

[46] We will discuss these pretty-printer definitions and how they can be customized in Section 13.4.
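The idea of a transformation chain that ends in a pretty-printer is independent of Stratego. The following minimal Java sketch illustrates it under simplifying assumptions: the AST consists only of a class with fields, the second step (adding an id field) is an arbitrary stand-in for "some later transformation", and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AstGenerationDemo {

    // Minimal target-language AST (hypothetical): a class with typed fields.
    static final class Field {
        final String type;
        final String name;
        Field(String type, String name) { this.type = type; this.name = name; }
    }

    static final class ClassDecl {
        final String name;
        final List<Field> fields;
        ClassDecl(String name, List<Field> fields) { this.name = name; this.fields = fields; }
    }

    // Maps DSL type names to Java type names; entity types keep their name.
    static String mapType(String dslType) {
        switch (dslType) {
            case "Num":  return "int";
            case "Bool": return "boolean";
            default:     return dslType;
        }
    }

    // Step 1: model-to-model transformation from an entity to a class AST.
    static ClassDecl toJava(String entity, Map<String, String> properties) {
        List<Field> fields = new ArrayList<>();
        for (Map.Entry<String, String> p : properties.entrySet()) {
            fields.add(new Field(mapType(p.getValue()), p.getKey()));
        }
        return new ClassDecl(entity, fields);
    }

    // Step 2: a later transformation that consumes the AST directly; no parsing needed.
    static ClassDecl addIdField(ClassDecl c) {
        List<Field> fields = new ArrayList<>();
        fields.add(new Field("int", "id"));
        fields.addAll(c.fields);
        return new ClassDecl(c.name, fields);
    }

    // Final step: only here is the tree turned into text.
    static String prettyPrint(ClassDecl c) {
        StringBuilder sb = new StringBuilder("class " + c.name + " {\n");
        for (Field f : c.fields) {
            sb.append("  private ").append(f.type).append(' ').append(f.name).append(";\n");
        }
        return sb.append("}\n").toString();
    }

    // Runs the whole chain for the Item entity used in this chapter.
    static String demo() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("name", "String");
        props.put("checked", "Bool");
        props.put("order", "Num");
        return prettyPrint(addIdField(toJava("Item", props)));
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Because every step before prettyPrint consumes and produces trees, steps can be added or reordered without re-parsing, and syntactically malformed output can only originate in the pretty-printer.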

■ Concrete Object Syntax

Both template-based and term-based approaches to code generation have distinctive benefits. While template-based generation with string interpolation allows for concrete syntax in code generation rules, AST generation guarantees syntactical correctness of the generated code and enables transformation chains. To combine the benefits of both approaches, Spoofax can parse user-defined concrete syntax quotations at compile-time, checking their syntax and replacing them with equivalent abstract syntax fragments [47].

[47] This feature is not as easy to use as it seems from this description: you need to think ahead about what you want to be able to quote, what to unquote, what character sequences to use for that and how to avoid ambiguities. You need a good conceptual understanding of the mapping between concrete and abstract syntax. A partial solution for the problems is an approach called interactive disambiguation, and is discussed in L. C. L. Kats, K. T. Kalleberg, and E. Visser. Interactive disambiguation of meta programs with concrete object syntax. In SLE 2012.

For example, a Java return statement can be expressed as |[ return |[x]|; ]|, rather than the abstract syntax form Return(VarRef(x)). Here, the outer |[...]| surrounds Java syntax: it quotes Java fragments inside Stratego code. Furthermore, |[x]| refers to a Stratego variable x, matching the expression in the return statement. In this case, |[...]| is an antiquote, switching back to Stratego syntax in a Java fragment.

To enable this functionality, we have to customize Stratego, Spoofax’ transformation language. This requires four steps. First, we need to combine Stratego’s syntax definition with the syntax definitions of the source and target languages. There, it is important to keep the sorts of the languages disjoint. This can be achieved by renaming sorts in an imported module, which we do in the following example for the Java and the Stratego module:

    module Stratego-Mobl-Java

    imports Mobl
    imports Java     [ ID   => JavaId ]
    imports Stratego [ Id   => StrategoId
                       Var  => StrategoVar
                       Term => StrategoTerm ]

Second, we need to define quotations, which will enclose concrete syntax fragments of the target language in Stratego rewrite rules [48]. We add a syntax rule for every sort of concrete syntax fragment that we want to use in our rewrite rules:

    exports
      context-free syntax

        "|[" Module         "]|" -> StrategoTerm {"ToTerm"}
        "|[" Import         "]|" -> StrategoTerm {"ToTerm"}
        "|[" Entity         "]|" -> StrategoTerm {"ToTerm"}
        "|[" EntityBodyDecl "]|" -> StrategoTerm {"ToTerm"}

        "|[" JClass         "]|" -> StrategoTerm {"ToTerm"}
        "|[" JField         "]|" -> StrategoTerm {"ToTerm"}
        "|[" JMethod        "]|" -> StrategoTerm {"ToTerm"}
        "|[" JFeature*      "]|" -> StrategoTerm {"ToTerm"}

[48] These define the |[ ... ]| escapes mentioned above.

With these rules, we allow quoted Mobl and Java fragments wherever an ordinary Stratego term is allowed. The first four rules concern Mobl, our example source language [49]. As quotes, we use |[...]|. The second set of rules works similarly for Java, our example target language. All syntax rules extend Term from the Stratego grammar, which we renamed to StrategoTerm during import. We use ToTerm as a constructor label. This allows Stratego to recognize places where we use concrete object syntax inside Stratego code. It will then lift the subtrees at these places into Stratego code. For example, the abstract syntax of |[ return |[x]|; ]| would be ToTerm(Return(...)). Stratego lifts this to NoAnnoList(Op("Return", [...])), which is the abstract syntax tree for the term Return(x).

[49] We like to use concrete syntax for modules (Module), import declarations (Import), entities (Entity), and properties and functions of entities (EntityBodyDecl).

Third, we need to define antiquotations, which will enclose Stratego code in target language concrete syntax fragments. Here, we add a syntax rule for every sort where we want to inject Stratego code into concrete syntax fragments:

    exports
      context-free syntax
        "|[" StrategoTerm "]|" -> JavaId {"FromTerm"}

This rule allows antiquoted Stratego terms to be used wherever a Java identifier can be used [50]. We use FromTerm as a constructor in the abstract syntax tree. Like ToTerm, Stratego uses this to recognize places where we switch between concrete object syntax and Stratego code. For example, the abstract syntax of |[ return |[x]|; ]| would be

    ToTerm(Return(FromTerm(Var("x"))))

Stratego lifts this to the following, which is the abstract syntax tree for the term Return(x):

    NoAnnoList(Op("Return", [Var("x")]))

[50] We renamed ID from the Java grammar to JavaId during import.

Finally, we need to create a <filename>.meta file for every transformation file <filename>.str with concrete syntax fragments. In this file, we tell Spoofax to use our customized Stratego syntax definition:

    Meta([ Syntax("Stratego-Mobl-Java") ])

Now, we can use concrete syntax fragments in our rewrite rules [51]:

    to-java:
      |[ entity |[x]| { |[ps]| } ]| ->
      |[ class |[x]| { |[<mapconcat(to-java)> ps]| } ]|

    to-java:
      |[ |[x]| : |[t]| ]| ->
      |[ private |[t']| |[x]|;

         public |[t']| |[x]|() { return |[x]|; }

         public void |[x]|(|[t']| |[x]|) { this.|[x]| = |[x]|; } ]|
      with
        t' := <to-java> t

[51] Since Spoofax replaces concrete syntax fragments with equivalent abstract syntax fragments, indentation in the fragments is lost. But the generated code will still be indented by the pretty-printer.

Using concrete object syntax in Stratego code combines the benefits of string interpolation and code generation by model transformation. With string interpolation, we can use the syntax of the target language in code generation rules, which makes them easy to write. However, it is also easy to make syntactic errors, which are only detected when the generated code is compiled. With code generation by model transformation, we can check if the generated abstract syntax tree corresponds to the grammar of the target language. In fact, we can check each transformation rule and detect errors early. With concrete object syntax, we can now use the syntax of the target language in code generation. This syntax is checked by the parser which is derived from the combined grammars of the target language and Stratego.

This comes at the price of adding quotation and antiquotation rules manually. These rules might be generated from a declarative, more concise embedding definition in the future. However, we cannot expect full automation here, since choices for the syntactic sorts involved in the embedding and of quotation and antiquotation symbols require an understanding of Stratego, the target language, and the transformation we want to write. These choices have to be made carefully in order to avoid ambiguities [52].

[52] This is the core difference to MPS’ projectional editor in this respect. In MPS, generator macros can be attached to any program element expressed in any language. There is no need to define quotations and antiquotations for each combination.

In general, there is room for more improvements of the embedding of the target language into Stratego. When the target language comes with a Spoofax editor, we want to get editor services like code completion, hover help, and content folding in the embedded editor as well. Until now, only syntax highlighting has been supported, using the Stratego coloring rules. Keywords of the target language will be highlighted like Stratego keywords and embedded code fragments will be given a gray background color.

12 Building Interpreters

Interpreters are programs that execute DSL programs by directly traversing the DSL program and performing the semantic actions associated with the respective program elements. The chapter contains examples for interpreters with Xtext, MPS and Spoofax.

Interpreters are programs that read a model, traverse the AST and perform actions corresponding to the execution semantics of the language constructs whose instances appear in the AST [1]. What an interpreter implementation looks like depends a lot on the programming language used for implementing it. Also, the complexity of the interpreter directly reflects the complexity of the language it processes in terms of size, structure and semantics [2]. The following list explains some typical ingredients that go into building interpreters for functional and procedural languages. It assumes a programming language that can polymorphically invoke functions or methods.

[1] They may also produce text (in which case such an interpreter is typically called a generator), and/or inspect the structure and check constraints (in which case they are called a validator). In this section we focus on interpreters that directly execute the program.

[2] For example, building an interpreter for a pure expression language with a functional programming language is almost trivial. In contrast, the interpreters for languages that support parallelism can be much more challenging.

■ Expressions

For program elements that can be evaluated to values, i.e., expressions, there is typically a function eval that is defined for the various expression concepts in the language, i.e. it is polymorphically overridden for subconcepts of Expression. Since nested expressions are almost always represented as nested trees in the AST, the eval function calls itself with the program elements it owns, and then performs some semantic action on the result [3]. Consider the expression 3 * 2 + 5. Since the + is at the root of the AST, eval(Plus) would be called (by some outside entity). It is implemented to add the values obtained by evaluating its arguments. So it calls eval(Multi) and eval(5). Evaluating a number literal is trivial, since it simply returns the number itself. eval(Multi) would call eval(3) and eval(2), multiplying their results and returning the result of the multiplication as its own result, allowing plus to finish its calculation.

[3] The expression generator we saw in the previous chapter exhibited the same structure: the eval template for some kind of expression calls the eval templates for its children.

Note: Technically, eval could be implemented as a method of the AST classes in an object-oriented language. However, this is typically not done, since the interpreter should be kept separate from the AST classes, for example, because there may be several interpreters, or because the interpreter is developed by other people than the AST.

■ Statements

Program elements that don’t produce a value only make sense in programming languages that have side effects. In other words, execution of such a language concept produces some effect either on global data in the program (reassignable variables, object state) or on the environment of the program (sending network data or rendering a UI). Such program elements are typically called statements. Statements are either arranged in a list (typically called a statement list) or arranged recursively nested as a tree (an if statement has a then and an else part which are themselves statements or statement lists). To execute those, there is typically a function execute that is overloaded for all of the different statement types [4]. Note that statements often contain expressions and more statement lists (as in if (a > 3) { print a; a=0; } else { a=1; }), so an implementation of execute may call eval and perform some action based on the result (such as deciding whether to execute the then-part or the else-part of the if statement). Executing the then-part and the else-part boils down to calling execute on the respective statement lists.

[4] It is also overloaded for StatementList, which iterates over all statements and calls execute for each one.
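The interplay of eval and execute can be sketched in a few lines of Java. This is a minimal illustration, not the code of any of the tools discussed in this book: the AST classes, the Print statement and the output list are hypothetical, and the interpreter is deliberately kept outside the AST classes, as suggested above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EvalExecuteDemo {

    // --- AST classes (hypothetical); they carry no interpreter logic ---
    interface Expr {}
    static final class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
    }
    static abstract class Binary implements Expr {
        final Expr left, right;
        Binary(Expr left, Expr right) { this.left = left; this.right = right; }
    }
    static final class Plus extends Binary { Plus(Expr l, Expr r) { super(l, r); } }
    static final class Multi extends Binary { Multi(Expr l, Expr r) { super(l, r); } }
    static final class Greater extends Binary { Greater(Expr l, Expr r) { super(l, r); } }

    interface Stmt {}
    static final class Print implements Stmt {
        final Expr expr;
        Print(Expr expr) { this.expr = expr; }
    }
    static final class If implements Stmt {
        final Expr cond;
        final List<Stmt> thenPart, elsePart;
        If(Expr cond, List<Stmt> thenPart, List<Stmt> elsePart) {
            this.cond = cond; this.thenPart = thenPart; this.elsePart = elsePart;
        }
    }

    // --- Interpreter ---
    final List<String> output = new ArrayList<>();  // stands in for a side effect

    // Evaluates an expression by first evaluating the children it owns.
    Object eval(Expr e) {
        if (e instanceof Num) return ((Num) e).value;
        Binary b = (Binary) e;   // in this sketch, everything else is binary
        Object l = eval(b.left);
        Object r = eval(b.right);
        if (e instanceof Plus) return (Integer) l + (Integer) r;
        if (e instanceof Multi) return (Integer) l * (Integer) r;
        if (e instanceof Greater) return (Integer) l > (Integer) r;
        throw new IllegalArgumentException("unknown expression " + e);
    }

    // Executes a statement; If calls eval to decide which branch to run.
    void execute(Stmt s) {
        if (s instanceof Print) {
            output.add(String.valueOf(eval(((Print) s).expr)));
        } else if (s instanceof If) {
            If i = (If) s;
            boolean cond = (Boolean) eval(i.cond);
            executeAll(cond ? i.thenPart : i.elsePart);
        } else {
            throw new IllegalArgumentException("unknown statement " + s);
        }
    }

    // Plays the role of execute for a statement list.
    void executeAll(List<Stmt> stmts) {
        for (Stmt s : stmts) execute(s);
    }

    // if (3 * 2 + 5 > 10) { print 3 * 2 + 5; } else { print 0; }
    static List<String> demo() {
        Expr sum = new Plus(new Multi(new Num(3), new Num(2)), new Num(5));
        Stmt program = new If(new Greater(sum, new Num(10)),
                Arrays.<Stmt>asList(new Print(sum)),
                Arrays.<Stmt>asList(new Print(new Num(0))));
        EvalExecuteDemo interpreter = new EvalExecuteDemo();
        interpreter.execute(program);
        return interpreter.output;
    }

    public static void main(String[] args) {
        System.out.println(demo());  // prints [11]
    }
}
```

The instanceof chains stand in for the polymorphic dispatch mentioned above; a language with dispatch on argument types would not need them.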

■ Environments

Languages that support assignment to variables (or modify any other global state) require an environment during execution to remember the values for the variables at each point during program execution [5]. The interpreter must keep some kind of global hash table, known as the environment, to keep track of symbols and their values, so it can look them up when evaluating a reference to that symbol. Many (though not all) languages that support assignable variables allow reassignment to the same variable (as we do in a = a + 1;). In this case, the environment must be updateable.

[5] Consider int a = 1; a = a + 1;. In this example, the a in a + 1 is a variable reference. When evaluating this reference, the system must "remember" that it has assigned 1 to that variable in the previous statement.

Notice that in a = a + 1 both mentions of a are references to the same variable, and both a and a + 1 are expressions. However, only a (and not a + 1) can be assigned to: writing a + 1 = 10 * a; would be invalid. The notion of an lvalue is introduced to describe this. lvalues can be used "on the left side" of an assignment. Variable references are typically lvalues (if they don’t point to a const variable). Complex expressions usually are not [6].

[6] Unless they evaluate to something that is in turn an lvalue. An example of this would be *(someFunc(arg1, arg2)) = 12; in C, assuming that someFunc returns a pointer to an integer.
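A minimal sketch of such an environment, including the lvalue check, might look as follows. We assume an integer-only language in which variable references are the only lvalues; the class names and the choice to throw exceptions for unassigned symbols and invalid assignment targets are assumptions made for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class EnvironmentDemo {

    // Hypothetical expression AST.
    interface Expr {}
    static final class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
    }
    static final class VarRef implements Expr {
        final String name;
        VarRef(String name) { this.name = name; }
    }
    static final class Plus implements Expr {
        final Expr left, right;
        Plus(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    // The environment: maps each symbol to its current value.
    final Map<String, Integer> environment = new HashMap<>();

    // In this sketch only variable references can be assigned to.
    static boolean isLvalue(Expr e) {
        return e instanceof VarRef;
    }

    int eval(Expr e) {
        if (e instanceof Num) return ((Num) e).value;
        if (e instanceof Plus) return eval(((Plus) e).left) + eval(((Plus) e).right);
        if (e instanceof VarRef) {
            // Look up what an earlier assignment has "remembered".
            Integer value = environment.get(((VarRef) e).name);
            if (value == null) {
                throw new IllegalStateException("unassigned symbol " + ((VarRef) e).name);
            }
            return value;
        }
        throw new IllegalArgumentException("unknown expression " + e);
    }

    // Evaluates only the right side; the left side merely names the slot to update.
    void assign(Expr target, Expr value) {
        if (!isLvalue(target)) {
            throw new IllegalArgumentException("left side is not an lvalue");
        }
        environment.put(((VarRef) target).name, eval(value));
    }

    // int a = 1; a = a + 1;
    static int demo() {
        EnvironmentDemo interpreter = new EnvironmentDemo();
        VarRef a = new VarRef("a");
        interpreter.assign(a, new Num(1));
        interpreter.assign(a, new Plus(a, new Num(1)));
        return interpreter.eval(a);
    }

    public static void main(String[] args) {
        System.out.println(demo());  // prints 2
    }
}
```

Note that the right-hand a + 1 is evaluated before the map is updated, so the reference to a still sees the old value 1.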

■ Call Stacks

The ability to call other entities (functions, procedures, methods) introduces further complexity, especially regarding parameter and return value passing, and the values of local variables. Assume the following function:

    int add(int a, int b) {
        return a + b;
    }

When this function is called via add(2, 3), the actual arguments 2 and 3 have to be bound to the formal arguments a and b. An environment must be established for the execution of add that keeps track of these assignments [7]. Now consider the following recursive function:

    int fac(int i) {
        return i == 0 ? 1 : i * fac(i - 1);
    }

[7] If functions can also access global state (i.e. symbols that are not explicitly passed in via arguments), then this environment must delegate to the global environment in case a referenced symbol cannot be found in the local environment.

In this case, each recursive call to fac requires that a new environment is created, with its own binding for the formal variables. However, the original environment must be "remembered" because it is needed to complete the execution of the outer fac after a recursively called fac returns. This is achieved using a stack of environments. A new environment is pushed onto the stack as a function is called (recursively), and the stack is popped, returning to the previous environment, as a called function returns. The return value, which is often expressed using some kind of return statement, is usually placed into the inner environment using a special symbol or name (such as __ret__). It can then be picked up from there as the inner environment is popped.
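The stack discipline can be illustrated with a small Java sketch. To keep it short, the body of fac is hard-coded in Java rather than interpreted from an AST; what matters is that every read of i goes through the environment on top of the stack. The class name, the maxDepth counter and the lookup helper are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class CallStackDemo {

    static final String RETURN_SYMBOL = "__ret__";

    // One environment per active function invocation; the top is the current one.
    final Deque<Map<String, Integer>> stack = new ArrayDeque<>();
    int maxDepth = 0;  // only recorded to make the stack growth observable

    // Looks up a symbol in the environment of the current invocation.
    int lookup(String symbol) {
        return stack.peek().get(symbol);
    }

    // Simulates a call of: int fac(int i) { return i == 0 ? 1 : i * fac(i - 1); }
    int callFac(int actual) {
        // Bind the formal argument i to the actual value in a fresh environment.
        Map<String, Integer> env = new HashMap<>();
        env.put("i", actual);
        stack.push(env);
        maxDepth = Math.max(maxDepth, stack.size());

        int result;
        if (lookup("i") == 0) {
            result = 1;
        } else {
            int inner = callFac(lookup("i") - 1);
            // The inner environment has been popped; our own binding of i is current again.
            result = lookup("i") * inner;
        }

        // The "return statement" stores the value under the special symbol ...
        stack.peek().put(RETURN_SYMBOL, result);
        // ... and the caller picks it up as the environment is popped.
        return stack.pop().get(RETURN_SYMBOL);
    }

    public static void main(String[] args) {
        System.out.println(new CallStackDemo().callFac(5));  // prints 120
    }
}
```

For fac(5) the stack grows to six environments (i = 5 down to i = 0) and is empty again once the outermost call has returned.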

12.1 Building an Interpreter with Xtext

This example describes an interpreter for the cooling language [8]. It is used to allow DSL users to "play" with the cooling programs before or instead of generating C code. The interpreter can execute test cases (and report success or failure) as well as simulate the program interactively. Since no code generation and no real target hardware is involved, the turn-around time is much shorter and the required infrastructure is trivial: only the IDE is needed to run the interpreter. The execution engine, as the interpreter is called here, has to handle the following language aspects:

Note: The example discussed in this section is built using the Xtext interpreter framework that ships with the Xtext typesystem framework discussed earlier: code.google.com/a/eclipselabs.org/p/xtext-typesystem/

[8] The interpreter in this section is built with Java because this is how we did it in the actual project. However, since the interpreter just operates on the EMF API, it can be written with any JVM language. In particular, Xtend would be well suited because of its support for functional programming and its generally more concise syntax, especially when working with EMF abstract syntax trees.

• The DSL supports expressions and statements, for example in the entry and exit actions of states. These have to be supported in the way described above.

• The top-level structure of a cooling program is a state machine. So the interpreter has to deal with states, events and transitions.

• The language supports deferred execution (i.e. perform a set of statements at a later time), so the interpreter has to keep track of deferred parts of the program.

• The language supports writing tests for cooling programs, including mock behavior for hardware elements. A set of constructs exists to express this mock behavior (specifically, ramps to change temperatures over time). These background tasks must be handled by the interpreter as well.

■ Expressions and Statements

We start our description of the execution engine inside out, by looking at the interpreter for expressions and statements first. As mentioned above, for interpreting expressions, there is typically an overloaded eval operation that contains the implementation of expression evaluation for each subtype of a generic Expression concept. However, Java doesn’t have polymorphically overloaded member methods [9]. We compensate for this by generating a dispatcher that calls a different method for each subtype of Expression [10]. The generation of this dispatcher is integrated with Xtext via a workflow fragment, i.e. the dispatcher is generated during the overall Xtext code generation process. The fragment is configured with the abstract meta classes for expressions and statements. The following code shows the fragment configuration:

    fragment = de.itemis.interpreter.generator.InterpreterGenerator {
        expressionRootClassName = "Expression"
        statementRootClassName = "Statement"
    }

[9] Java only supports polymorphic dispatch on the this pointer, but not on method arguments.

[10] If the interpreter had been built with Xtend instead, we would not have had to generate the dispatcher for the StatementExecutor or the ExpressionEvaluator, since Xtend provides polymorphic dispatch on method arguments. However, the fundamental logic and structure of the interpreter would have been similar.

This fragment generates an abstract class that acts as the basis for the interpreter for the particular set of statements and expressions. As the following piece of code shows, the expression evaluator class contains an eval method that uses instanceof checks to dispatch to a method specific to the subclass, thereby emulating polymorphically overloaded methods [11]. The specific methods throw an exception and are expected to be overridden by a manually written subclass that contains the actual interpreter logic for the particular language concepts [12]:

    public abstract class AbstractCoolingLanguageExpressionEvaluator
            extends AbstractExpressionEvaluator {

        public AbstractCoolingLanguageExpressionEvaluator(ExecutionContext ctx) {
            super(ctx);
        }

        public Object eval(EObject expr, LogEntry parentLog)
                throws InterpreterException {
            LogEntry localLog = parentLog.child(LogEntry.Kind.eval, expr,
                    "evaluating " + expr.eClass().getName());
            if (expr instanceof Equals)
                return evalEquals((Equals) expr, localLog);
            if (expr instanceof Unequals)
                return evalUnequals((Unequals) expr, localLog);
            if (expr instanceof Greater)
                return evalGreater((Greater) expr, localLog);
            // the others...
        }

        protected Object evalEquals(Equals expr, LogEntry log)
                throws InterpreterException {
            throw new MethodNotImplementedException(expr,
                    "evalEquals not implemented");
        }

        protected Object evalUnequals(Unequals expr, LogEntry log)
                throws InterpreterException {
            throw new MethodNotImplementedException(expr,
                    "evalUnequals not implemented");
        }

        // the others...
    }

[11] A similar class is generated for the statements. Instead of eval, the method is called execute and it does not return a value. In every other respect the StatementExecutor is similar to the ExpressionEvaluator.

[12] The class also uses a logging framework (based on the LogEntry class) that can be used to create a tree-shaped trace of expression evaluation, which, short of building an actual debugger, is very useful for debugging and understanding the execution of the interpreter.
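Why a dispatcher is needed at all can be seen in a small, self-contained sketch. Java selects among overloaded methods at compile time, based on the static type of the argument, so an overload for a subtype is not chosen when the argument is typed as the supertype. The class names below are hypothetical and unrelated to the generated code above.

```java
final class OverloadDispatchDemo {

    static class Expression {}
    static class Equals extends Expression {}
    static class Greater extends Expression {}

    static class Evaluator {
        // Overloads: which one is called is decided by the static argument type.
        String eval(Expression e) { return "generic"; }
        String eval(Equals e) { return "equals"; }
        String eval(Greater e) { return "greater"; }

        // Hand-written dispatcher: inspects the runtime type and casts,
        // so that the specific overload is selected.
        String dispatch(Expression e) {
            if (e instanceof Equals) return eval((Equals) e);
            if (e instanceof Greater) return eval((Greater) e);
            return eval(e);
        }
    }

    public static void main(String[] args) {
        Expression e = new Equals();  // static type Expression, runtime type Equals
        Evaluator evaluator = new Evaluator();
        System.out.println(evaluator.eval(e));      // prints "generic"
        System.out.println(evaluator.dispatch(e));  // prints "equals"
    }
}
```

Since AST nodes are usually handled through references of a general type, a direct call would always end up in the generic overload; the instanceof chain restores the intended behaviour.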

Before we dive into the details of the interpreter code below, it is worth mentioning that the "global data" held by the execution engine is stored and passed around using an instance of EngineExecutionContext. For example, it contains the environment that keeps track of symbol values, and it also has access to the type system implementation class for the language. The ExecutionContext is available through the eec() method in the StatementExecutor and ExpressionEvaluator.

Let us now look at some example method implementations. The following code shows the implementation of evalNumberLiteral, which evaluates number literals such as 2 or 2.3 or -10.2. To recap, the following grammar is used for defining number literals:

    Atomic returns Expression:
        ...
        ({NumberLiteral} value=DECIMAL_NUMBER);

    terminal DECIMAL_NUMBER:
        ("-")? ('0'..'9')* ('.' ('0'..'9')+)?;

With this in mind, the implementation of evalNumberLiteral should be easily understandable. We first retrieve the actual value from the NumberLiteral object, and we find the type of the number literal using the typeof function in the type system [13]. Based on this distinction, evalNumberLiteral returns either a Java Double or Integer as the value of the NumberLiteral. In addition, it creates log entries that document these decisions.

[13] The type system basically inspects whether the value contains a dot or not and returns either a DoubleType or IntType.

    protected Object evalNumberLiteral(NumberLiteral expr, LogEntry log) {
        String v = ((NumberLiteral) expr).getValue();
        EObject type = eec().typesystem.typeof(expr, new TypeCalculationTrace());
        if (type instanceof DoubleType) {
            log.child(Kind.debug, expr, "value is a double, " + v);
            return Double.valueOf(v);
        } else if (type instanceof IntType) {
            log.child(Kind.debug, expr, "value is a int, " + v);
            return Integer.valueOf(v);
        }
        return null;
    }

The evaluator for NumberLiteral was simple because number literals are leaves in the AST and have no children, and no recursive invocations of eval are required. This is different for the LogicalAnd, which has two children in the left and right properties. The following code shows the implementation of evalLogicalAnd.

    protected Object evalLogicalAnd(LogicalAnd expr, LogEntry log) {
        boolean leftVal = ((Boolean) evalCheckNullLog(expr.getLeft(), log))
                .booleanValue();
        if (!leftVal) return false;
        boolean rightVal = ((Boolean) evalCheckNullLog(expr.getRight(), log))
                .booleanValue();
        return rightVal;
    }

The first statement calls the evaluator for the left property. If leftVal is false we return without evaluating the right argument [14]. If it is true we evaluate the right argument [15]. The value of the LogicalAnd is then the value of rightVal.

[14] Most programming languages never evaluate the right argument of a logical and if the left one is false and the overall expression can never become true.

[15] The argument evaluation uses a utility method called evalCheckNullLog, which automatically creates a log entry for this recursive call and stops the interpreter if the value passed in is null (which would mean the AST is somehow broken).

So far, we haven’t used the environment, since we haven’t worked with variables and their (changing) values. Let’s now look at how variable assignment is handled. We first look at the AssignmentStatement, which is implemented in the StatementExecutor:

    protected void executeAssignmentStatement(AssignmentStatement s,
            LogEntry log) {
        Object l = s.getLeft();
        Object r = evalCheckNullLog(s.getRight(), log);
        SymbolRef sr = (SymbolRef) l;
        SymbolDeclaration symbol = sr.getSymbol();
        eec().environment.put(symbol, r);
        log.child(Kind.debug, s, "setting " + symbol.getName() + " to " + r);
    }

The first two lines get the left argument as well as the value of the right argument. Note how only the right value is evaluated: the left argument is a symbol reference (ensured through a constraint, since only SymbolRefs are lvalues in this language). We then retrieve the symbol referenced by the symbol reference and create a mapping from the symbol to the value in the environment, effectively "assigning" the value to the symbol during the execution of the interpreter.

The implementation of the ExpressionEvaluator for a symbol reference (if it is not used as an lvalue) is shown in the following code. We use the same environment to look up the value for the symbol. We then check whether the value is null (i.e. nothing has been assigned to the symbol as yet). In this case we return the default value for the respective type and log a warning [16], otherwise we return the value.

    protected Object evalSymbolRef(SymbolRef expr, LogEntry log) {
        SymbolDeclaration s = expr.getSymbol();
        Object val = eec().environment.get(s);
        if (val == null) {
            EObject type = eec().typesystem.typeof(expr, new TypeCalculationTrace());
            Object neutral = intDoubleNeutralValue(type);
            log.child(Kind.debug, expr,
                    "looking up value; nothing found, using neutral value: " + neutral);
            return neutral;
        } else {
            log.child(Kind.debug, expr, "looking up value: " + val);
            return val;
        }
    }

[16] This is specialized functionality in the cooling language; in most other languages, we would probably just return null, since nothing seems to have been assigned to the symbol yet.

The cooling language does not support function calls, so we demonstrate function calls with a similar language that supports them. In that language, function calls are expressed as symbol references that have argument lists. Constraints make sure that argument lists are only used if the referenced symbol is actually a FunctionDeclaration [17]. Here is the grammar.

    FunctionDeclaration returns Symbol:
        {FunctionDeclaration} "function" type=Type name=ID
        "(" (params+=Parameter ("," params+=Parameter)* )? ")"
        "{" (statements+=Statement)* "}";

    Atomic returns Expression:
        ...
        {SymbolRef} symbol=[Symbol|QID]
            ("(" (actuals+=Expr)? ("," actuals+=Expr)* ")")?;

[17] This is a consequence of the symbol reference problem discussed in Section 8.2.

The following is the code for the evaluation function for the symbol reference. It must distinguish between references to variables and to functions [18].

    protected Object evalSymbolRef(SymbolRef expr, LogEntry log) {
        Symbol symbol = expr.getSymbol();
        if (symbol instanceof VarDecl) {
            return log(symbol, eec().environment.getCheckNull(symbol, log));
        }
        if (symbol instanceof FunctionDeclaration) {
            FunctionDeclaration fd = (FunctionDeclaration) symbol;
            return callAndReturnWithPositionalArgs("calling " + fd.getName(),
                    fd.getParams(), expr.getActuals(), fd.getElements(),
                    RETURN_SYMBOL, log);
        }
        throw new InterpreterException(expr,
                "interpreter failed; cannot resolve symbol reference "
                + expr.eClass().getName());
    }

[18] This is once again a consequence of the fact that all references to any symbol are handled via the SymbolRef class. We discussed this in Section 7.5.

The code for handling the FunctionDeclaration uses a predefined utility method callAndReturnWithPositionalArgs19. It accepts as arguments the list of formal arguments of the called function, the list of actual arguments (expressions) passed in at the call site, the list of statements in the function body, a symbol that should be used for the return value in the environment, as well as the obligatory log. The utility method is implemented as follows:

19 It is part of the interpreter framework used in this example.

    protected Object callAndReturnWithPositionalArgs(String name, EList formals, EList actuals, EList bodyStatements) {
        eec().environment.push(name);
        for( int i=0; i

Remember that each invocation of a function has to get its own environment to handle the local variables for the particular invocation. We can see this in the first line of the implementation above: we first create a new environment and push it onto the call stack. Then the implementation iterates over the actual arguments, evaluates each of them and "assigns" them to the formals by creating an association between the formal argument symbol and the actual argument value in the new environment. It then uses the StatementExecutor to execute all the statements in the body of the function. Notice that as the executed function deals with its own variables and function calls, it uses the new environment created, pushed onto the stack and populated by this method. When the execution of the body has finished, we retrieve the return value from the environment. The return statement in the function has put it there under a name we have prescribed, the RETURN_SYMBOL, so we know how to find it in the environment. Finally, we pop the environment, restoring the caller’s state of the world, and return the return value as the resulting value of the function call expression.
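The call protocol described above (push a fresh environment, bind formals to actual values, run the body, read the return value, pop) can be sketched in plain Java. This is only an illustration under simplifying assumptions and not the code of the interpreter framework: the class CallStackSketch, the Body interface and the use of strings as symbols are invented for the example; only the name RETURN_SYMBOL is taken from the text.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the interpreter's stack of environments.
class CallStackSketch {
    static final String RETURN_SYMBOL = "__return";

    // A function body is modeled as code that reads and writes the current frame.
    interface Body {
        void execute(Map<String, Object> frame);
    }

    private final Deque<Map<String, Object>> frames = new ArrayDeque<Map<String, Object>>();

    Object call(List<String> formals, List<Object> actualValues, Body body) {
        if (formals.size() != actualValues.size()) {
            throw new IllegalArgumentException("argument count mismatch");
        }
        Map<String, Object> frame = new HashMap<String, Object>();
        frames.push(frame);                        // each invocation gets its own environment
        try {
            for (int i = 0; i < formals.size(); i++) {
                frame.put(formals.get(i), actualValues.get(i));   // "assign" actuals to formals
            }
            body.execute(frame);                   // the body works on the new frame only
            return frame.get(RETURN_SYMBOL);       // the return statement stored the result here
        } finally {
            frames.pop();                          // restore the caller's state of the world
        }
    }

    int depth() {
        return frames.size();
    }

    // Calls a two-argument "add" function with the values 2 and 3.
    static Object demo() {
        CallStackSketch stack = new CallStackSketch();
        return stack.call(Arrays.asList("a", "b"), Arrays.<Object>asList(2, 3),
            frame -> frame.put(RETURN_SYMBOL, (Integer) frame.get("a") + (Integer) frame.get("b")));
    }
}
```

Popping in a finally block is a design choice of this sketch: it keeps the stack balanced even if the body throws.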

States, Events and the Main program

Changing a state20 from within a cooling program is done by executing a ChangeStateStatement, which simply references the state that should be entered. Here is the interpreter code in StatementExecutor:

20 State as in state machine, not as in program state.

    protected void executeChangeStateStatement(ChangeStateStatement s, LogEntry log) {
        engine.enterState(s.getTargetState(), log);
    }

    public void enterState(State targetState, LogEntry logger ) throws TestFailedException, InterpreterException, TestStoppedException {
        logger.child( Kind.info, targetState, "entering state "+targetState.getName());
        context.currentState = targetState;
        executor.execute(targetState.getEntryStatements(), logger);
        throw new NewStateEntered();
    }

executeChangeStateStatement calls back to an engine method that handles the state change21. The method sets the current state to the target state passed into the method (the current state is kept track of in the execution context). It then executes the set of entry statements of the new state. After this it throws an exception NewStateEntered, which stops the current execution step. The overall engine is step driven, i.e. an external "timer" triggers distinct execution steps of the engine. A state change always terminates the current step.

21 Since this is a more global operation than executing statements, it is handled by the engine class itself, and not by the StatementExecutor.

The main method step() triggered by the external timer can be considered the main program of the interpreter.
It looks like this:

    public int step(LogEntry logger) {
        try {
            context.currentStep++;
            executor.execute(getCurrentState().getEachTimeStatements(), stepLogger);
            executeAsyncStuff(logger);
            if ( !context.eventQueue.isEmpty() ) {
                CustomEvent event = context.eventQueue.remove(0);
                LogEntry evLog = logger.child(Kind.info, null, "processing event from queue: "+event.getName());
                processEventFromQueue( event, evLog );
                return context.currentStep;
            }
            processSignalHandlers(stepLogger);
        } catch ( NewStateEntered ignore ) {}
        return context.currentStep;
    }

It first increments a counter that keeps track of how many steps have been executed since the interpreter was started. It then executes the each time statements of the current state. This is a statement list defined by a state that needs to be re-executed in each step while the system is in the respective state. It then executes asynchronous tasks. We’ll explain those below. Next it checks if an event is in the event queue. If so, it removes the first event from the queue and executes it. After processing an event the step is always terminated. Lastly, if there was no event to be processed, we process signal handlers (the check statements in the cooling programs).

Processing events checks whether the current state declares an event handler that can deal with the currently processed event. If so, it executes the statement list associated with this event handler.

    private void processEventFromQueue(CustomEvent event, LogEntry logger) {
        for ( EventHandler eh: getCurrentState().getEventHandlers()) {
            if ( reactsOn( eh, event ) ) {
                executor.execute(eh.getStatements(), logger);
            }
        }
    }
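The interplay between step(), the event queue and the exception that aborts a step can be tried out in isolation. The following self-contained sketch mirrors the control flow described above under simplifying assumptions: states and events are plain strings, the handler table is hard-coded, and the class StepEngineSketch is hypothetical; only the exception name NewStateEntered follows the text.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class StepEngineSketch {
    // Thrown to abort the current step after a state change.
    static class NewStateEntered extends RuntimeException {}

    final List<String> trace = new ArrayList<String>();
    final LinkedList<String> eventQueue = new LinkedList<String>();
    String currentState = "idle";
    int currentStep = 0;

    void enterState(String target) {
        currentState = target;
        trace.add("enter " + target);            // stands in for the entry statements
        throw new NewStateEntered();             // a state change ends the step
    }

    int step() {
        try {
            currentStep++;
            trace.add("each-time " + currentState);
            if (!eventQueue.isEmpty()) {
                handleEvent(eventQueue.removeFirst());
                return currentStep;              // an event always terminates the step
            }
            trace.add("signals " + currentState); // only reached if no event was pending
        } catch (NewStateEntered ignore) {
            // nothing to do: the step simply ends here
        }
        return currentStep;
    }

    // Hypothetical handler table: in "idle", the event "start" switches to "cooling".
    private void handleEvent(String event) {
        if (currentState.equals("idle") && event.equals("start")) {
            enterState("cooling");
        }
    }

    static List<String> demo() {
        StepEngineSketch engine = new StepEngineSketch();
        engine.eventQueue.add("start");
        engine.step();
        engine.step();
        return engine.trace;
    }
}
```

Using an exception for the state change keeps enterState callable from arbitrarily deep inside statement execution while still unwinding to the step boundary.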

The DSL also supports executing code asynchronously, i.e. after a specified number of steps (representing logical program time). The grammar looks as follows:

PerformAsyncStatement: "perform" "after" time=Expr "{" (statements+=Statement)* "}";

The following method interprets the PerformAsyncStatements:

    protected void executePerformAsyncStatement(PerformAsyncStatement s, LogEntry log) throws InterpreterException {
        int inSteps = ((Integer)evalCheckNullLog(s.getTime(), log)).intValue();
        eec().asyncElements.add(new AsyncPerform(eec().currentStep + inSteps, "perform async", s, s.getStatements()));
    }

It registers the statement list associated with the PerformAsyncStatement in the list of async elements in the execution context. The call to executeAsyncStuff at the beginning of the step method described above checks whether the time has come and executes those statements:

    private void executeAsyncStuff(LogEntry logger) {
        List<AsyncElement> stuffToRun = new ArrayList<AsyncElement>();
        for (AsyncElement e: context.asyncElements) {
            if ( e.executeNow(context.currentStep) ) {
                stuffToRun.add(e);
            }
        }
        for (AsyncElement e : stuffToRun) {
            context.asyncElements.remove(e);
            e.execute(context, logger.child(Kind.info, null, "Async "+e));
        }
    }
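To see the scheduling in action, here is a small stand-alone sketch of the same idea. It is an illustration only: the class AsyncSketch, its string-labelled tasks and the choice of "due step reached or passed" as the meaning of executeNow are assumptions, since the text does not show how AsyncElement decides that its time has come.

```java
import java.util.ArrayList;
import java.util.List;

class AsyncSketch {
    static final class AsyncTask {
        final int dueStep;
        final String label;

        AsyncTask(int dueStep, String label) {
            this.dueStep = dueStep;
            this.label = label;
        }

        // Assumed semantics: run once the logical time has been reached.
        boolean executeNow(int currentStep) {
            return currentStep >= dueStep;
        }
    }

    final List<AsyncTask> pending = new ArrayList<AsyncTask>();
    final List<String> executed = new ArrayList<String>();
    int currentStep = 0;

    // Corresponds to "perform after <inSteps> { ... }".
    void performAfter(int inSteps, String label) {
        pending.add(new AsyncTask(currentStep + inSteps, label));
    }

    void step() {
        currentStep++;
        // Collect first, then remove and run: modifying 'pending' while
        // iterating over it would throw a ConcurrentModificationException.
        List<AsyncTask> due = new ArrayList<AsyncTask>();
        for (AsyncTask t : pending) {
            if (t.executeNow(currentStep)) {
                due.add(t);
            }
        }
        for (AsyncTask t : due) {
            pending.remove(t);
            executed.add(currentStep + ":" + t.label);
        }
    }

    static List<String> demo() {
        AsyncSketch s = new AsyncSketch();
        s.performAfter(2, "fan off");
        s.performAfter(1, "log");
        s.step();
        s.step();
        s.step();
        return s.executed;
    }
}
```

The two-loop structure is the same as in executeAsyncStuff; it also allows a task to register new async tasks while it runs.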

12.2 An Interpreter in MPS

Building an interpreter in MPS is essentially similar to building an interpreter in Xtext and EMF. All concepts would apply in the same way22. However, since MPS’ BaseLanguage is itself built with MPS, it can be extended. So instead of using a generator to generate the dispatcher that calls the eval methods for the expression classes, suitable modular language extensions can be defined in the first place.

22 Instead of EObjects you would work with the node<> types that are available in MPS to access ASTs.

A Dispatch Expression

For example, BaseLanguage could be extended with support for polymorphic dispatch (similar to what Xtend does with dispatch methods). An alternative solution involves a dispatch expression, a kind of "pimped switch". Fig. 12.1 shows an example.

Figure 12.1: An extension to MPS’ BaseLanguage that makes writing interpreters simpler. The dispatch statement has the neat feature that, on the right side of the ->, the $ reference to the ex expression is already downcast to the type mentioned on the left of the ->.
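Plain Java has no such construct, but the effect of a type-based dispatch with an automatically downcast argument can be approximated with a small helper. The sketch below is not the MPS extension; Dispatcher, Num and Plus are hypothetical names, and the lambda parameter plays the role of the $ expression.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class DispatchSketch {
    // Hypothetical mini AST.
    static final class Num {
        final int value;
        Num(int value) { this.value = value; }
    }

    static final class Plus {
        final Object left, right;
        Plus(Object left, Object right) { this.left = left; this.right = right; }
    }

    static final class Dispatcher<R> {
        private final Map<Class<?>, Function<Object, R>> cases = new LinkedHashMap<>();

        // Registers a case; the handler receives the argument already cast to T.
        <T> Dispatcher<R> on(Class<T> type, Function<T, R> handler) {
            cases.put(type, o -> handler.apply(type.cast(o)));
            return this;
        }

        // Runs the first case whose type matches the argument.
        R dispatch(Object ex) {
            for (Map.Entry<Class<?>, Function<Object, R>> c : cases.entrySet()) {
                if (c.getKey().isInstance(ex)) {
                    return c.getValue().apply(ex);
                }
            }
            throw new IllegalArgumentException("no case for " + ex.getClass().getName());
        }
    }

    static int eval(Object ex) {
        return new Dispatcher<Integer>()
            .on(Num.class, n -> n.value)                          // n is a Num, no cast needed
            .on(Plus.class, p -> eval(p.left) + eval(p.right))    // p is a Plus
            .dispatch(ex);
    }
}
```

For example, DispatchSketch.eval(new DispatchSketch.Plus(new DispatchSketch.Num(2), new DispatchSketch.Num(40))) evaluates the nested expression without a single explicit downcast in the case bodies.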

The dispatch expression tests whether the argument ex is an instance of the type referenced in the cases. If so, the code on the right side of the arrow is executed. Notice the special expression $ used on the right side of the arrow. It refers to the argument ex, but it is already downcast to the type on the left of the case’s arrow. This way, writing annoying downcasts for each property access can be avoided.

Note that this extension is modular in the sense that the definition of BaseLanguage was not changed. Instead, an additional language module was defined that extends BaseLanguage. This module can be used as part of the program that contains the interpreter, making the dispatch statement available there23. Also, the $ expression is restricted to only be usable on the right side of the ->, allowing the overall base language namespace to be kept clean24.

23 We discuss language modularization and composition in Section 4.6.

24 Xbase/Xtend comes with a similar "pimped" switch statement directly. It supports switching over integer and enum literals (as Java’s switch does), but it also supports switching over types plus arbitrary Boolean guard conditions. This makes it very suitable for building interpreters, since, as we have seen, an interpreter typically dispatches its behavior based on the metaclass of the node it processes, plus optionally some of its context or children (which can be expressed with a guard condition in Xtend’s switch).

Showing Results in the Editor

Since MPS is a projectional editor, it can show things in the editor that are read-only. For example, the result of an interpreter run can be integrated directly into the editor. In Fig. 12.2, the bottom table contains test cases for the calculate rule. Users enter a number for squareMeters, numberOfRooms and the expected result, and in the last column, the editor shows the actual result of the computation (colored red or green depending on whether the actual result matches the expected result).

Figure 12.2: This screenshot shows an mbeddr-based demo application in which users can specify insurance rules. The system includes an interpreter that executes the test cases directly in the IDE.

The interpreter is integrated via the evaluate button25. Its action listener triggers the computation:

25 In MPS, an editor can embed Swing components, and these Swing components can react to their own events and modify the model in arbitrary ways.

    component provider: (editorContext, node)->JComponent {
        JButton evalButton = new JButton("evaluate");
        final node suite = node;
        evalButton.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent p0) {
                command {
                    foreach tc in suite.cases {
                        float result = new RateCalculator(tc).calculate( suite.ancestor.rateCalculation);
                        tc.actualResult.value = result + "";
                    }
                }
            }
        });
        return evalButton;
    }

The coloring of the cells (and making them read-only) is done via query-based formatting:

    style {
        editable : false
        background-color : (node, editorContext)->Color {
            node.ancestor.isOK() ? Color.GREEN : Color.RED;
        }
    }

12.3 An Interpreter in Spoofax

So far we have seen procedural/functional interpreters in which the program’s execution state is separate from the program itself. As the interpreter runs, it updates the execution state. Another approach to writing interpreters is state-based interpreters, where the execution of the interpreter is expressed as a set of transformations between program states. State-based interpreters can be specified with rewrite rules in Spoofax, realizing transitions between execution states. This requires:

• A representation of states. The simplest way to represent states is to use terms, but we can also define a new DSL for representing the states in concrete syntax.

• An initialization transformation from a program in the DSL to the initial state of the interpreter.

• A step transformation from the current state of the interpreter to the next state of the interpreter.

In the remainder of the section, we develop an interpreter for a subset of Mobl. We start with a simple interpreter for expressions, which we then extend to handle statements.

12.3.1 An Interpreter for Expressions

If we want to evaluate simple arithmetic expressions without variables, the expression itself is the state of the interpreter26. Thus, no extra term signatures are needed and the initialization transformation is given by identity. For the step transformation, we can define rewrite rules for the different expression kinds:

26 Remember how in the design chapter we discussed building debuggers for purely functional languages, and in particular, expression languages. We argued that a debugger is trivial, because there is no real "flow" of the program; instead, the expression can be debugged by simply showing the values of all intermediate expressions in a tree-like form. We exploit the same "flowless" nature of pure expression languages when building this interpreter.

    eval: Add(Int(x), Int(y)) -> Int(z) where z := (x, y)
    eval: Mul(Int(x), Int(y)) -> Int(z) where z := (x, y)

    eval: Not(True()) -> False()
    eval: Not(False()) -> True()

    eval: And(True(), True()) -> True()
    eval: And(True(), False()) -> False()
    eval: And(False(), True()) -> False()
    eval: And(False(), False()) -> False()

    eval: LazyAnd(True(), True()) -> True()
    eval: LazyAnd(False(), _) -> False()
    eval: LazyAnd(_, False()) -> False()

We can orchestrate these rules in two different styles. First, we can define an interpreter which performs only a single evaluation step by applying one rule in each step:

    eval-one: exp -> <oncebu(eval)> exp

Here, oncebu tries to apply eval at one position in the tree, starting from the leaves (the bu in oncebu stands for bottom-up). We could also use oncetd, traversing the tree top-down. However, evaluations are likely to happen at the bottom of the tree, which is why oncebu is the better choice. The result of eval-one will be a slightly simpler expression, which might need further evaluation. Alternatively, we can directly apply as many rules as possible, trying to evaluate the whole expression:

    eval-all: exp -> <bottomup(try(eval))> exp

Here, bottomup tries to apply eval at every node, starting at the leaves. The result of eval-all will be the final result of the expression.
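The difference between the two orchestration styles can be made concrete outside of Stratego. The following Java sketch re-implements the two traversals for a tiny term language with only Int and Add; it is an approximation for illustration, and the names RewriteSketch, onceBottomUp and bottomUpTry are made up. A rule that does not apply is modeled by returning null, which corresponds to a failing strategy.

```java
class RewriteSketch {
    // Minimal expression terms: Int(n) and Add(l, r).
    static abstract class Exp {}

    static final class Int extends Exp {
        final int n;
        Int(int n) { this.n = n; }
        public String toString() { return "Int(" + n + ")"; }
    }

    static final class Add extends Exp {
        final Exp l, r;
        Add(Exp l, Exp r) { this.l = l; this.r = r; }
        public String toString() { return "Add(" + l + ", " + r + ")"; }
    }

    // The eval rule: Add(Int(x), Int(y)) -> Int(x + y); null means "rule does not apply".
    static Exp eval(Exp e) {
        if (e instanceof Add) {
            Add a = (Add) e;
            if (a.l instanceof Int && a.r instanceof Int) {
                return new Int(((Int) a.l).n + ((Int) a.r).n);
            }
        }
        return null;
    }

    // Small step: apply eval at exactly one position, trying children before the parent.
    static Exp onceBottomUp(Exp e) {
        if (e instanceof Add) {
            Add a = (Add) e;
            Exp l = onceBottomUp(a.l);
            if (l != null) return new Add(l, a.r);
            Exp r = onceBottomUp(a.r);
            if (r != null) return new Add(a.l, r);
        }
        return eval(e);
    }

    // Big step: rewrite all children first, then try eval at the current node.
    static Exp bottomUpTry(Exp e) {
        Exp rebuilt = e;
        if (e instanceof Add) {
            Add a = (Add) e;
            rebuilt = new Add(bottomUpTry(a.l), bottomUpTry(a.r));
        }
        Exp result = eval(rebuilt);
        return result != null ? result : rebuilt;
    }

    // (1 + 2) + 3
    static Exp sample() {
        return new Add(new Add(new Int(1), new Int(2)), new Int(3));
    }
}
```

Applied to sample(), onceBottomUp only reduces the inner addition, whereas bottomUpTry reduces the whole term to a single Int.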

12.3.2 An Interpreter for Statements

If we want to evaluate statements, we need states which capture the values of variables and the list of statements which need to be evaluated. We can define these states with a signature for terms:

    signature constructors

              : ID * IntValue -> VarValue
              : ID * BoolValue -> VarValue

        State : List(VarValue) * List(Statement) -> State

The first two rules define binary tuples which combine a variable name (ID) and a value (either IntValue or BoolValue). The last rule defines a binary constructor State, which combines a list of variable values with a list of statements27. We first have to adapt the evaluation of expressions to handle variable references in expressions.

27 This will represent the program as it evolves over time. Starting with a statement list, the program state over time can be represented as a collection of variable values and a remaining list of yet to be interpreted statements. This is what State represents.

    eval(|varvals): exp -> <eval> exp
    eval(|varvals): VarRef(var) -> <lookup> (var, varvals)

    eval-one(|s): exp -> <oncebu(eval(|s))> exp
    eval-all(|s): exp -> <bottomup(try(eval(|s)))> exp

The first two rules take the current list of variable values of the interpreter (varvals) as a parameter. The first rule integrates the existing evaluation rules, which do not need any state information. The second rule looks up the current value val of the variable var in the list of variable values. The last two rules define small-step and big-step interpreters of expressions, just as before.

We can now define evaluation rules for statements. These rules rewrite the current state into a new state:

    eval:
        (varvals, [Declare(var, exp)|stmts]) -> (varvals', stmts)
        where
            val := exp;
            varvals' := ((var, val), varvals)

    eval:
        (varvals, [Assign(VarRef(var), exp)|stmts]) -> (varvals', stmts)
        where
            val := exp;
            varvals' := ((var, val), varvals)

    eval:
        (varvals, [Block(stmts1)|stmts2]) -> (varvals, (stmts1, stmts2))

The first rule handles variable declarations. On the left-hand side, it matches the current list of variable values varvals, the declaration statement Declare(var, exp), and the list of remaining statements stmts. It evaluates the expression to a value and updates the list of variable values. The new state (on the right-hand side of the rule) consists of the updated variable values and the list of remaining statements28. The second rule, which handles assignments, is quite similar. The third rule handles block statements by concatenating the statements from a block with the remaining statements. The following rule handles an if statement:

28 Note how on the left side of the rule the [head|tail] notation is used for the statements, where stmts is the tail. By "returning" stmts on the right, we automatically remove the first element from the list (it was the head).

    eval:
        (varvals, [If(exp, thenStmt, elseStmt)|stmts]) -> (varvals, [stmt|stmts])
        where
            val := exp;
            if !val => True() then
                stmt := thenStmt
            else
                !val => False();
                stmt := elseStmt
            end

First, it evaluates the condition. Depending on the result, it chooses the next statement to evaluate. When the result is True(), the statement from the thenStmt branch is chosen. Otherwise the result has to be False() and the statement from the elseStmt branch is chosen. If the result is neither True() nor False(), the rule will fail. This ensures that the rule only works when the condition can be evaluated to a Boolean value. The following rule handles while loops:

    eval:
        (varvals, [While(exp, body)|stmts]) -> (varvals, stmts')
        where
            val := exp;
            if !val => True() then
                stmts' := [body, While(exp, body)|stmts]
            else
                !val => False();
                stmts' := stmts
            end

Again, the condition is evaluated first. If it evaluates to True(), the list of statements is updated to the body of the loop, the while loop again, followed by the remaining statements. If it evaluates to False(), only the remaining statements need to be evaluated.

The eval rules already define a small-step interpreter, going from one evaluation state to the next. We can define a big-step interpreter by adding a driver, which repeats the evaluation until it reaches a final state:

    eval-all: state -> <repeat(eval)> state
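The state-based style carries over to other languages as well. As a rough sketch under simplifying assumptions, the following Java code represents a state as a pair of variable values and remaining statements, implements one small step for assignments, blocks and while loops, and adds a driver that repeats the step until no statement is left. All names (StatementInterpreterSketch, State, step, run) are hypothetical, and expressions are modeled as Java functions over the variable values instead of terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class StatementInterpreterSketch {
    interface Stmt {}

    static final class Assign implements Stmt {
        final String var;
        final Function<Map<String, Integer>, Integer> exp;
        Assign(String var, Function<Map<String, Integer>, Integer> exp) { this.var = var; this.exp = exp; }
    }

    static final class While implements Stmt {
        final Function<Map<String, Integer>, Boolean> cond;
        final Stmt body;
        While(Function<Map<String, Integer>, Boolean> cond, Stmt body) { this.cond = cond; this.body = body; }
    }

    static final class Block implements Stmt {
        final List<Stmt> stmts;
        Block(List<Stmt> stmts) { this.stmts = stmts; }
    }

    // A state pairs the variable values with the statements still to be executed.
    static final class State {
        final Map<String, Integer> varvals;
        final List<Stmt> stmts;
        State(Map<String, Integer> varvals, List<Stmt> stmts) { this.varvals = varvals; this.stmts = stmts; }
    }

    // One small step; returns null in a final state (no statement left).
    static State step(State s) {
        if (s.stmts.isEmpty()) return null;
        Stmt head = s.stmts.get(0);
        List<Stmt> rest = s.stmts.subList(1, s.stmts.size());
        Map<String, Integer> vars = new HashMap<>(s.varvals);
        List<Stmt> next = new ArrayList<>();
        if (head instanceof Assign) {
            Assign a = (Assign) head;
            vars.put(a.var, a.exp.apply(s.varvals));
        } else if (head instanceof Block) {
            next.addAll(((Block) head).stmts);       // block statements come before the rest
        } else if (head instanceof While) {
            While w = (While) head;
            if (w.cond.apply(s.varvals)) {           // body first, then the loop again
                next.add(w.body);
                next.add(w);
            }
        }
        next.addAll(rest);
        return new State(vars, next);
    }

    // Big-step driver: repeat the small step until a final state is reached.
    static State run(State s) {
        State current = s;
        for (State n = step(current); n != null; n = step(current)) {
            current = n;
        }
        return current;
    }

    // x := 0; while (x < 3) { x := x + 1 }
    static int demo() {
        List<Stmt> program = Arrays.<Stmt>asList(
            new Assign("x", v -> 0),
            new While(v -> v.get("x") < 3,
                new Block(Arrays.<Stmt>asList(new Assign("x", v -> v.get("x") + 1)))));
        return run(new State(new HashMap<String, Integer>(), program)).varvals.get("x");
    }
}
```

Because every step produces a new State and leaves the old one untouched, intermediate states can be kept around, which is what makes stepping and undo straightforward in this style.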

12.3.3 More Advanced Interpreters

We can extend the interpreter to handle function calls and objects in a similar way as we did for statements. First, we always have to think about the states of the extended interpreter. Functions will require a call stack, objects will require a heap. Next, we need to consider how the old rules can deal with the new states. Adjustments might be needed. For example, when we support objects, the heap needs to be passed to expressions. Expressions which create objects will change the heap, so it is not enough to pass the heap in; we also have to propagate the changes back to the caller.

12.3.4 IDE Integration

We can integrate interpreters as builders into the IDE. For big-step interpreters, we can simply calculate the overall execution result and show it to the user. For small-step interpreters, we can use the initialization transformation in the builder. This will create an initial state for the interpreter. When we define a concrete syntax for these states, they can be shown in an editor. The transition transformation can then be integrated as a refactoring on states, changing the current state to the next one. In this way, the user can control the execution, undo steps, or even modify the current state.

13 IDE Services

In this chapter we discuss various services provided by the IDE. This includes code completion, syntax coloring, pretty-printing, go-to-definition and find references, refactoring, outline views, folding, diff and merge, tooltips and visualization. In contrast to the other chapters, this one is not organized by tool, but rather by IDE service. We then provide examples for each service with one or more of the tools. Note that debugging is discussed in Chapter 15.

In this chapter we illustrate typical services provided by the IDE that are not automatically derived from the language definition itself and for which additional configuration or programming is required. Note that we are not going to show every service with every example tool1.

1 If we don’t show how X works in some tool, this does not mean that you cannot do X with this particular tool.

13.1 Code Completion

Code completion is perhaps the most essential service provided by an IDE. We already saw that code completion is implicitly influenced by scopes: if you press Ctrl-Space at the location of a reference, the IDE will show you the valid targets of this reference (as defined in the scope) in the code completion menu. Selecting one establishes the reference.

Customizing Code Completion for a Reference in Xtext

Consider the cooling DSL. Cooling programs can reference symbols. Symbols can be hardware building blocks, local variables or configuration parameters. It would be useful in the code completion menu to show what kind of symbol a particular symbol is (Fig. 13.1).

Figure 13.1: Code completion in the cooling language is customized to specify what kind of symbol a particular reference target is (the kind is shown in parentheses after the name).

To customize code completion, you have to implement a method in the ProposalProvider for your language. The method name has to correspond to a rule/property whose code completion menu you want to customize2. In this example, we want to customize the symbol property of the Atomic expression:

2 This also works for containment references or primitive properties and not just for references (which are affected by scopes).

    Atomic returns Expression:
        ...
        ({SymbolRef} symbol=[appliances::SymbolDeclaration|QID]);

The method takes various arguments; the first one, model, represents the program element for which the symbol property should be completed.

    public class CoolingLanguageProposalProvider extends AbstractCoolingLanguageProposalProvider {

        @Override
        public void completeAtomic_Symbol(EObject model, Assignment assignment, ContentAssistContext context, ICompletionProposalAcceptor acceptor) {
            ...
        }
    }

Let us now look at the actual implementation of the method3. In line three we get the scope for the particular reference so we can iterate over all the elements and change their appearance in the code completion menu. To be able to get the scope, we need the EReference for the particular reference. The first two lines in this method are used to this end.

3 For this particular example above, where we only change the string representation of the targets computed by the scope (as opposed to changing the actual contents of the code completion menu), we could have overridden the getStyledDisplayString method instead. Its task is to provide the textual representation for contents of the scope.

    CrossReference crossReference = ((CrossReference)assignment.getTerminal());
    EReference ref = GrammarUtil.getReference(crossReference);
    IScope scope = getScopeProvider().getScope(model, ref);
    Iterable<IEObjectDescription> candidates = scope.getAllElements();
    for (IEObjectDescription od: candidates) {
        String ccText = od.getName()+" ("+od.getEClass().getName()+")";
        String ccInsert = od.getName().toString();
        acceptor.accept(createCompletionProposal(ccInsert, ccText, null, context));
    }

Once we have the scope, we can iterate over all its contents (i.e. the target elements)4. Inside the loop we then use the name of the target object plus the name of the EClass to construct the string to be shown in the code completion menu (ccText)5. The last line then calls the accept method on the ICompletionProposalAcceptor to finally create a proposal. Note how we also pass in ccInsert, which is the text to be inserted into the program if the particular code completion menu item is selected.

4 Note how the scope does not directly contain the target EObjects, but rather IEObjectDescriptions. This is because the code completion is resolved against the index, a data structure maintained by Xtext that contains all referenceable elements. This approach has the advantage that the target resource, i.e. the file that contains the target element, does not have to be loaded just to be able to reference into it.

5 Note that we could use a rich string to add some nice formatting to the string.

An Example with MPS

The contents of the code completion menu for references can be customized in MPS as well. It is instructive to look at this in addition to Xtext for two reasons. The first one is brevity. Consider the following code, where we customize the code completion menu for function calls:

    link {function}
        ...
        presentation : (parameterNode, visible, smartReference, inEditor, ...)->string {
            parameterNode.signatureInfo();
        }

To customize the contents of the code completion menu, you have to provide the expression that calculates the text in the presentation section of the scope provider6. In this example we call a method that calculates a string that represents the complete signature of the function.

The second reason why this is interesting in MPS is that we don’t have to specify the text that should be inserted if an element is selected from the code completion menu: the reference is established based on the UUID of the target node, and the editor of the referencing node determines the presentation of this reference7.

6 Styled/Rich strings are not supported here.

7 In the example of the function call, it projects the name of the called function (plus the actual arguments).

Code Completion for Simple Properties

In Xtext, code completion can be provided for any property of a rule, not just for references (i.e. also for children or for primitive properties such as strings or numbers). The mechanism to do that is the same as the one shown above. Instead of using the scope (only references have scopes) one could use a statically populated list of strings as the set of proposals, or one could query a database to get a list of candidate values8.

8 Note that if we use this approach to provide code completion for primitive properties this does not affect the constraint check (in contrast to references, where a scope affects the code completion menu and the constraint checks). Users can always type something that is not in the code completion menu. A separate constraint check may have to be written.

In MPS, the mechanism is different. Since this is a pure editor customization and has nothing to do with scopes, this behavior is customized in the editor definition. Consider a LocalVariableDeclaration (as in int x = 0;) where we want to customize the suggested name of the variable. So if you press Ctrl-Space in the name field of the variable, we want to suggest one or more reasonable names for the variable. Fig. 13.2 shows the necessary code. An editor cell may have a cell menu (the menu you see when you press Ctrl-Space). It consists of several parts. Each part contributes a set of menu entries. In the example in Fig. 13.2, we add a cell menu part of type property values, in which we simply return a list of values (one, in the example; we use the name of the type of the local variable, prefixed by an a).

Figure 13.2: A cell menu for the name property of a LocalVariableDeclaration. In the editor definition (top window) we select the cell that renders the name. In the inspector we can then define additional properties for the selected cell. In this case we contribute an additional cell menu that provides the suggested names.

Editor Templates

Templates are more complex syntactic structures that can be selected from the code completion menu. For example, the code completion menu may contain an if-then-else entry, which, if you select it, gets expanded into the following code in the program:

    if ( expr ) {

    } else {

    }

Xtext provides templates for this purpose. These can be defined either as part of the language, or by the user in the IDE. Fig. 13.3 shows the if-then-else example as defined in the IDE. In MPS there are several ways to address this. One is simply an intention (explained in more detail in Section 7.7). It will not be activated via Ctrl-Space, but rather via Alt-Enter. In every other respect it is identical: the intention can insert arbitrary code into the program. Alternatively we can use a cell menu (already mentioned above). Fig. 13.4 shows the code for a cell menu that also creates the if-then-else structure illustrated above.

13.2 Syntax Coloring

There are two cases for syntax coloring: syntactic highlighting and semantic highlighting. Syntactic highlighting is used to color keywords, for example. These keywords are readily available from the grammar. No customization is necessary beyond configuring the actual color. Semantic coloring colors code fragments based on some query over the AST structure. For example, in a state machine, if a state is unreachable (no incoming transitions) the state may be colored in gray instead of black.

Figure 13.3: Template definitions contain a name (the text shown in the code completion menu), a description, as well as the context and the pattern. The context refers to a grammar rule. The template will show up in the code completion menu at all locations where that grammar rule would be valid as well. The pattern is the actual text that will be inserted into the editor if the template is selected. It can contain variables. Once inserted, the user can use TAB to step through the variables and replace them with text. In the example, we define the condition expression as a variable.

Figure 13.4: A cell menu to insert the if/then/else statement. Note how we contribute two menu parts. The first one inserts the default code completion contents for Statement. The second one provides an if/then/else statement under the menu text if-then-else. Notice how we can use a quotation (concrete syntax expression) in the cell menu code. Because of MPS’s support for language composition, the editor even provides code completion etc. for the contents of the quotation in the editor for the cell menu.

An Example with MPS

Let us first look at syntax coloring in MPS, starting with purely syntactic highlighting. Fig. 13.5 shows a collage of several ingredients: at the top we see the editor for GlobalVariableDeclaration. GlobalVariableDeclaration implements the interface IModuleContent. IModuleContents can be exported (which means they can be seen by modules importing the current one), so we define an editor component (a reusable editor fragment) for IModuleContent that renders the exported flag. This editor component is embedded into the editor of GlobalVariableDeclaration, and is also embedded into the editor of all other concepts that implement IModuleContent. The editor component simply defines a keyword exported that is rendered in dark green and in bold font. This can be achieved by specifying the respective style properties for the editor cell9.

9 Groups of style definitions can also be modularized into style sheets and reused for several cells.

Figure 13.5: In MPS, syntax coloring is achieved by associating one or more style properties with the elements at hand. In this case we assign a darkGreen text foreground color as well as a bold font style.

Semantic highlighting works essentially the same way. Instead of using a constant (darkGreen) for the color, we embed a query expression. The code in Fig. 13.6 renders the state keyword of a State in a Statemachine gray if that particular state has no incoming transitions.

Figure 13.6: A style query that renders the associated cell in gray if the state (to which the cell belongs) has no incoming transitions. We first find out if the state has incoming transitions by finding the Statemachine ancestor of the state, finding all the Transitions in the subtree under the Statemachine, and then checking if one exists whose targetState is the current state (node). We then use the result of this query to color the cell appropriately.
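The query behind this kind of semantic highlighting is independent of MPS. As a rough illustration in Java, with a hypothetical Transition class, states represented by their names and colors represented by strings, it amounts to the following:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HighlightQuerySketch {
    static final class Transition {
        final String source, target;
        Transition(String source, String target) { this.source = source; this.target = target; }
    }

    // The query: does any transition in the state machine target this state?
    static boolean hasIncoming(String state, List<Transition> allTransitions) {
        for (Transition t : allTransitions) {
            if (t.target.equals(state)) {
                return true;
            }
        }
        return false;
    }

    // The style decision: gray for states without incoming transitions, black otherwise.
    static Map<String, String> colors(List<String> states, List<Transition> transitions) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String s : states) {
            result.put(s, hasIncoming(s, transitions) ? "black" : "gray");
        }
        return result;
    }

    static Map<String, String> demo() {
        return colors(
            Arrays.asList("off", "on", "orphan"),
            Arrays.asList(new Transition("off", "on"), new Transition("on", "off")));
    }
}
```

In a real editor the query is re-evaluated whenever the model changes, so its cost matters; scanning all transitions per state, as done here, is only acceptable for small state machines.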

An Example with Xtext

Xtext uses a two-phase approach. First, you have to define the styles you want to apply to parts of the text. This is done in the highlighting configuration of the particular language:

    public class CLHighlightingConfiguration extends DefaultHighlightingConfiguration {

        public static final String VAR = "var";

        @Override
        public void configure(IHighlightingConfigurationAcceptor acceptor) {
            super.configure(acceptor);
            // register the style under the identifier VAR
            acceptor.acceptDefaultHighlighting(VAR, "variables", varTextStyle());
        }

        private TextStyle varTextStyle() {
            TextStyle t = defaultTextStyle().copy();
            t.setColor(new RGB(100, 100, 200));
            t.setStyle(SWT.ITALIC | SWT.BOLD);
            return t;
        }
    }

The varTextStyle method creates a TextStyle object. The method configure then registers this style with the framework using a unique identifier (the constant VAR). The reason for registering it with the framework is that the styles can be changed by the user in the running application using the preferences dialog (Fig. 13.7).

Figure 13.7: Preferences dialog that allows users to change the styles registered with the framework for a highlighting configuration.

We now have to associate the style with program syntax10. The semantic highlighting calculator for the target language is used to this end11. It requires the provideHighlightingFor method to be implemented. Highlighting references to variables (not the variables themselves!) with the style defined above works the following way:

    public void provideHighlightingFor(XtextResource resource,
            IHighlightedPositionAcceptor acceptor) {
        EObject root = resource.getContents().get(0);
        TreeIterator<EObject> eAllContents = root.eAllContents();
        while (eAllContents.hasNext()) {
            EObject ref = eAllContents.next();
            if (ref instanceof SymbolRef) {
                SymbolDeclaration sym = ((SymbolRef) ref).getSymbol();
                if (sym instanceof Variable) {
                    // move from the AST element to its concrete syntax node
                    ICompositeNode n = NodeModelUtils.findActualNodeFor(ref);
                    acceptor.addPosition(n.getOffset(), n.getLength(),
                            CLHighlightingConfiguration.VAR);
                }
            }
        }
    }

10 A particularly nice feature of Xtext syntax coloring is that styles are combined if more than one style applies to a given program element.

11 Even though it is called semantic highlighting calculator, it is used for syntactic and semantic highlighting. It simply associates concrete syntax nodes with styles; it does not matter how it establishes the association (statically or based on the structure of the AST).

The method gets passed in an XtextResource, which represents a model file. From it we get the root element and iterate over all its contents. If we find a SymbolRef, we continue with coloring. Notice that in the cooling language we reference any symbol (variable, event, hardware element) with a SymbolRef, so we now have to check whether we reference a Variable or not12. If we have successfully identified a reference to a variable, we now have to move from the abstract syntax tree (on which we have worked all the time so far) to the concrete syntax tree, so we can identify particular tokens that shall be colored13. We use a utility method to find the ICompositeNode that represents the SymbolRef in the concrete syntax tree. Finally we use the acceptor to perform the actual highlighting using the position of the text string in the text. We pass in the VAR style defined before14.

12 This is the place where we could perform any other structural or semantic analysis (such as the check for no incoming transitions) as well.

13 The concrete syntax tree in Xtext is a complete representation of the parse result, including keywords, symbols and whitespace.

14 Notice how we color the complete reference. Since it is only one text string anyway, this is just as well. If we had more structured concrete syntax (as in state someState {}), and we only wanted to highlight parts of it (e.g., the state keyword), we’d have to do some further analysis on the ICompositeNode to find out the actual concrete syntax node for the keyword.

An Example with Spoofax

Spoofax supports syntax coloring on the lexical and the syntactic level. At the lexical level, tokens such as keywords, identifiers, or integers are colored. This is the most common use case of syntax coloring. At the syntactic level, we can color larger code fragments, for example to highlight embeddings. In Spoofax, syntax coloring is specified declaratively as part of the editor specification. For the lexical level, Spoofax predefines the token classes keyword, identifier, string, number, var, operator and layout. For each of these, we can specify a color (either by name or by RGB values) and optionally a font style (bold, italic, or both). Spoofax generates the following default specification:

    module MoblLang-Colorer.generated

    colorer Default, token-based highlighting

      keyword    : 127 0 85 bold
      identifier : default
      string     : blue
      number     : darkgreen
      var        : 255 0 100 italic
      operator   : 0 0 128
      layout     : 63 127 95 italic

    colorer System colors

      darkgreen = 0 128 0
      green     = 0 255 0
      darkblue  = 0 0 128
      blue      = 0 0 255
      ...
      default   = _

The generated specification can be customized on the lexical level, but also extended on the syntactic level. These extensions are based on syntactic sorts and constructor names. For example, the following specification will color numeric types in declarations in dark green:

    module DSLbook-Colorer

    imports DSLbook-Colorer.generated

    colorer

      Type.NumType: darkgreen

Here Type is a sort from the syntax definition, while NumType is the constructor for the integer type. There are other rules for Type in the Mobl grammar, for example for the string type. When we want other types also to be colored dark green, we can either add more rules to the colorer specification, or replace the current definition with Type._, where _ acts as a wildcard and all types will be colored dark green, independent of their constructor. Similarly, we can use a wildcard for sorts. For example, _.NumType will include all nodes with a constructor NumType, independent of their syntactic sort.

In the current example, predefined types like int and entity types are all colored dark green, but only the predefined types will appear in bold face. This is because Spoofax combines specified colors and fonts. The rule on the syntactic level specifies only a color, but no font. Since the predefined types are keywords, they will get the font from the keyword specification, which is bold. In contrast, entity types are identifiers, which will get the default font from the identifier specification.
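The combination of a token-level style with a syntactic-level style can be pictured with a small Java sketch. The Style class and its combine method are hypothetical and only model the behavior described above (a part left unspecified by the syntactic rule falls back to the token-level rule); they do not reflect how Spoofax implements this internally.

```java
// Hypothetical model of a style whose parts may be left unspecified (null).
final class Style {
    final String color; // null means "not specified by this rule"
    final String font;  // null means "not specified by this rule"

    Style(String color, String font) {
        this.color = color;
        this.font = font;
    }

    // Parts specified by the syntactic-level rule win; parts it leaves
    // open fall back to the token-level rule.
    static Style combine(Style tokenLevel, Style syntacticLevel) {
        String color = syntacticLevel.color != null
                ? syntacticLevel.color : tokenLevel.color;
        String font = syntacticLevel.font != null
                ? syntacticLevel.font : tokenLevel.font;
        return new Style(color, font);
    }
}
```

With a keyword style of (127 0 85, bold), an identifier style without a font, and a Type rule that only sets darkgreen, a predefined type such as int ends up dark green and bold, while an entity type ends up dark green with the default font.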

13.3 Go-to-Definition and Find References

Following a reference (go to definition, Ctrl-Click) as well as finding references to a given program element works automatically without any customization in any of the language workbenches. However, one might want to change the default behavior, for example because the underlying program element is not a reference at all (but you still want to go somewhere when Ctrl-Clicking on it).

Customizing the Target with Xtext

Let us first look at how to change the target of the go-to-definition functionality. Strictly speaking, we don’t change go-to-definition at all. We just define a new hyperlinking functionality. Go-to-Definition is just the default hyperlinking behavior15. As a consequence:

• You can define hyperlinking for elements that are not references in terms of the grammar (a hyperlink can be provided for any program element).

• You can have several hyperlinks for the same element. If you Ctrl-Hover on it, a little menu opens up and you can select the target you are interested in.

15 Hyperlinking gets its name from the fact that, as you mouse over an element while keeping the Ctrl key depressed, you see the respective element in blue and underlined. You can then click on it to follow the hyperlink.

To add hyperlinks to a language concept, Xtext provides the IHyperlinkHelper interface, which can be implemented by language developers to customize hyperlinking behavior. It requires one method, createHyperlinksTo, to be implemented16. A typical implementation looks as follows:

    public void createHyperlinksTo(XtextResource from, Region region,
            EObject to, IHyperlinkAcceptor acceptor) {
        if (to instanceof TheEConceptIAmInterestedIn) {
            EObject target = // find the target of the hyperlink
            super.createHyperlinksTo(from, region, target, acceptor);
        } else {
            super.createHyperlinksTo(from, region, to, acceptor);
        }
    }

16 Typically, language developers will inherit from one of the existing base classes, such as the TypeAwareHyperlinkHelper.

Customized Finders in MPS

In many cases, there are different kinds of references to any given element. For example, for an Interface in the mbeddr C components extension, references to that interface can either be sub-interfaces (ISomething extends IAnother) or components, which can either provide an interface (so other components can call the interface’s operations), or they can require an interface, in which case the component itself calls operations defined by the interface. When finding references, we may want to distinguish between these different cases.

MPS provides finders to achieve this. Fig. 13.8 shows the resulting Find Usages dialog for an Interface after we have implemented two custom finders in the language: one for components providing the interface, and one for components requiring the interface.

Figure 13.8: The Find Usages dialog for Interfaces. The two additional Finders in the top left box are contributed by the language.

Implementing finders is simple, since, as usual, MPS provides a DSL for specifying them. The following code shows the implementation.

    finder findProviders for concept Interface
      description: Providers

      find(node, scope)->void {
        nlist<> refs = execute NodeUsages ( node , );
        foreach r in refs.select(it|it.isInstanceOf(ProvidedPort)) {
          add result r.parent ;
        }
      }

      getCategory(node)->string {
        "Providers";
      }

Figure 13.9: The result dialog of running Find Usages with our customized finders. Note the Providers and Users categories; these correspond to the strings returned from getCategory in the two finders.

We specify a name for the finder (findProviders) as well as the concept to which it applies (i.e., the concept whose references it will find; Interface in the example). We then have to implement the find method. Notice how in the first line of the implementation we delegate to an existing finder, Node Usages, which finds all references. We then check whether the referencing element is a ProvidedPort, and if so, we add the parent of the port, i.e. a Component, to the result17. Finally, getCategory returns a string that is used to structure the result. Fig. 13.9 shows an example result.

17 Note how we make use of extensions to the MPS BaseLanguage to concisely specify finders: execute and add result are only available in the finder specification language.

Customizing the Target with Spoofax

Spoofax provides a default hyperlinking mechanism from references to declarations. Alternative hyperlinking functionality can be implemented in rewrite rules. The names of these rules need to be specified in the editor specification. For example, the following specification tells Spoofax to use a custom rewrite rule to hyperlink this expressions to the surrounding class:

    references
      reference Exp.This : resolve-this

On the left-hand side of the colon the reference rule specifies a syntactic sort and a constructor, for which the hyperlinking should be customized18. On the right-hand side of the colon, the rule names a rewrite rule which implements the hyperlinking:

    resolve-this:
      (link, position, ast, path, project-path) -> target
      where
        Entity(t) := link ;
        target := t

18 As in colorer specifications, we can use _ as a wildcard for syntactic sorts and constructors.

This rule determines the type of a this expression and links it to the declaration of this type19.

19 Like all rewrite rules implementing hyperlinking functionality, the rule needs to follow a Spoofax-defined signature: on the left-hand side, it matches a tuple consisting of the link, the position of this node in the abstract syntax tree, the tree itself (ast), the path of the current file, and the path of the current Eclipse project (project-path). On the right-hand side, it returns the target of the hyperlink.

13.4 Pretty-Printing

Pretty-printing refers to the reverse activity from parsing20. A parser transforms a character sequence into an abstract syntax tree. A pretty printer (re-)creates the text string from the AST. As the term pretty printing suggests, the resulting text should be pretty, i.e. whitespace must be managed properly.

20 This is also known as serialization or formatting.

So when and where is a formatter useful? There is the obvious use case: users somehow mess up formatting, and they want to press Ctrl-Shift-F to clean it up. However, there is a more essential reason. If the AST is modified by a transformation, the updated text has to be rendered correctly. An AST is modified, for example, as part of a quick fix (see Section 13.5) or by a graphical editor that operates in parallel to a text editor on the same AST.

Pretty-Printing in MPS

In MPS, pretty-printing is a non-issue. The editor always pretty-prints as part of the projection21. However, version 3.0 of MPS will support the definition of several different editors for a single concept. They may be fundamentally different (e.g., providing a textual and a graphical syntax for state machines) or just provide different "layouts" for a single notation (different positions of the opening curly brace, for example). More generally, there is no reason why a projectional editor may not provide a certain degree of freedom regarding layout. Users may be able to press ENTER to start a new line in a long expression, or press TAB to indent a statement22. However, MPS does currently not support this.

21 On the flip side, MPS users do not have the option of changing the layout or formatting of a program, since the projection rules implement the one true way of formatting. Of course, this can be considered a plus or a minus, depending on the context.

22 Note that such layout information must be stored with the program, or possibly in a separate layout model. This is especially true if the notation is graphical, which will always allow some degree of custom layout and positioning of shapes.

Pretty-Printing in Spoofax

Spoofax generates a language-specific rewrite rule pp--string which rewrites an abstract syntax tree into a string according to a pretty-printer definition (expressed in the Box language). Spoofax generates a default pretty-printer definition from the syntax definition of a language. For example, Spoofax generates the following pretty-printer definition for Mobl:

    [
      Module                   -- KW["module"] _1 _2,
      Module.2:iter-star       -- _1,
      Import                   -- KW["import"] _1,
      Entity                   -- KW["entity"] _1 KW["{"] _2 KW["}"],
      Entity.2:iter-star       -- _1,
      Property                 -- _1 KW[":"] _2,
      Function                 -- KW["function"] _1 KW["("] _2 KW[")"] KW[":"] _3 KW["{"] _4 KW["}"],
      Function.2:iter-star-sep -- _1 KW[","],
      Function.4:iter-star     -- _1,
      Param                    -- _1 KW[":"] _2,
      EntType                  -- _1,
      NumType                  -- KW["int"],
      BoolType                 -- KW["boolean"],
      StringType               -- KW["string"],
      Declare                  -- KW["var"] _1 KW["="] _2 KW[";"],
      Assign                   -- _1 KW["="] _2 KW[";"],
      Return                   -- KW["return"] _1 KW[";"],
      Call                     -- _1 KW["."] _2 KW["("] _3 KW[")"],
      PropAccess               -- _1 KW["."] _2,
      Plus                     -- _1 KW["+"] _2,
      Mul                      -- _1 KW["*"] _2,
      Var                      -- _1,
      Int                      -- _1
    ]

In the Box language, rules consist of constructors (i.e. abstract syntax elements or language concepts) on the left-hand side of a rule and a sequence of boxes and numbers on the right-hand side. The basic box construct is a simple string, representing a string in the output. Furthermore, two kinds of box operators can be applied to sub-boxes: layout operators specify the layout of sub-boxes in the surrounding box, and font operators specify which font should be used. In the example, all strings are embedded in KW[...] boxes. KW is a font operator, classifying the sub-boxes as keywords of the language23.

23 Since font operators are only meaningful when pretty-printing to HTML or LaTeX, we do not dive into the details here.

Numbers on the right-hand side can be used to combine boxes from the subtrees: a number n refers to the boxes from the n-th subtree. When the syntax definition contains nested constructs, additional rules are generated for pretty-printing the corresponding subtrees. On the left-hand side, these rules have selectors, which consist of a constructor, a number selecting a particular subtree, and the type of the nesting. The following table shows all nesting constructs in syntax definitions and their corresponding types in pretty-printing rules.

    Construct                                  Selector Type
    optionals                        S?        opt
    non-empty lists                  S+        iter
    possibly empty lists             S*        iter-star
    separated lists                  S1 S2+    iter-sep
    possibly empty separated lists   S1 S2*    iter-star-sep
    alternatives                     S1 | S2   alt
    sequences                        (S1 S2)   seq

In addition to the generated rules, users can define their own pretty-printing rules. Spoofax first applies the user-defined rules to turn an abstract syntax tree into a hybrid tree which is only partially pretty-printed. It then applies the default rules to pretty-print the remaining parts. For example, we could define our own pretty-printing rule for Mobl modules:

Module -- V vs=1 is=4 [ H [KW["module"] _1] _2]

This rule lets Spoofax pretty-print the term Module("shopping", [Entity(...), Entity(...)]) as

    module shopping

        entity ...

        entity ...

The V box operator places sub-boxes vertically. In the example, it places the entities underneath the module shopping line. The desired vertical separation between the sub-boxes can be specified by the spacing option vs. Its default value is 0; that is, no blank lines are added between the boxes. In the example, a blank line is enforced by vs=1. For indenting boxes in a vertical combination, the spacing option named is can be specified. All boxes except the first will be indented accordingly. In the example, the module shopping line is unindented, while the entities are indented by 4 spaces.

The H box operator lays out sub-boxes horizontally. In the example, it is used to lay out the module keyword and its name in the same line. The desired horizontal separation between the sub-boxes can be specified by the spacing option hs. Its default value is 1; that is, a single space is added between the boxes.
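To make the effect of the spacing options concrete, here is a minimal Java sketch of the two layout operators. The class BoxLayout and its methods h and v are hypothetical; they only imitate the behavior of H and V described above, and they simplify matters by treating every entity as a direct sub-box of the V box. This is not the Box implementation used by Spoofax.

```java
import java.util.ArrayList;
import java.util.List;

class BoxLayout {
    private static String spaces(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append(' ');
        return sb.toString();
    }

    // H box: place the parts on one line, separated by hs spaces.
    static String h(int hs, String... parts) {
        return String.join(spaces(hs), parts);
    }

    // V box: stack the sub-boxes, with vs blank lines between them and
    // every sub-box except the first indented by is spaces.
    static List<String> v(int vs, int is, List<List<String>> boxes) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < boxes.size(); i++) {
            if (i > 0) {
                for (int k = 0; k < vs; k++) lines.add("");
            }
            String indent = (i == 0) ? "" : spaces(is);
            for (String line : boxes.get(i)) lines.add(indent + line);
        }
        return lines;
    }
}
```

Calling v(1, 4, ...) with the sub-boxes h(1, "module", "shopping"), "entity ..." and "entity ..." yields the module line, a blank line, and the two entity lines indented by four spaces and separated by a blank line, which matches the layout shown for the Module rule.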

Pretty-Printing in Xtext

In Xtext, the use of whitespace can be specified in a language’s Formatter. Formatters use a Java API to specify whitespace policies for a grammar. Consider an example from the cooling language. Assume we enter the following code:

state Hello : entry { if true {}}

If we run the formatter (e.g., by pressing Ctrl-Shift-F in the IDE), we want the resulting text to be formatted like this:

    state Hello:
        entry {
            if true {}
        }

The following formatter code implements this.

    protected void configureFormatting(FormattingConfig c) {
        CoolingLanguageGrammarAccess f =
                (CoolingLanguageGrammarAccess) getGrammarAccess();

        c.setNoSpace().before(
                f.getCustomStateAccess().getColonKeyword_3());
        c.setIndentationIncrement().after(
                f.getCustomStateAccess().getColonKeyword_3());
        c.setLinewrap().before(
                f.getCustomStateAccess().getEntryKeyword_5_0());

        c.setLinewrap().after(
                f.getCustomStateAccess().getLeftCurlyBracketKeyword_5_1());
        c.setIndentationIncrement().after(
                f.getCustomStateAccess().getLeftCurlyBracketKeyword_5_1());

        c.setLinewrap().before(
                f.getCustomStateAccess().getRightCurlyBracketKeyword_5_3());
        c.setIndentationDecrement().before(
                f.getCustomStateAccess().getRightCurlyBracketKeyword_5_3());
    }

In the first line we get the CoolingLanguageGrammarAccess object, an API to refer to the grammar of the language itself. This API is the basis for an internal Java DSL for expressing formatting rules. Let’s look at the first block of three lines. In the first line we express that there should be no space before the colon in the CustomState rule. Line two states that we want to have indentation after the colon. The third line specifies that the entry keyword should be on a new line. The next two blocks of two lines manage the indentation of the entry action code. In the first block we express a line wrap and incremented indentation after the opening curly brace. The second block expresses a wrap before the closing curly brace, as well as a decrement in the indentation level24.

24 As you can see, specifying the formatting for a complete grammar can require a lot of code! In my opinion, there are two approaches to improve this: one is reasonable defaults or global configurations. Curly braces, for example, are typically formatted the same way. Second, a more efficient way of specifying the formatting should be provided. Annotations in the grammar, or a DSL for specifying the formatting (such as the Box language used by Spoofax) should go a long way.

13.5 Quick Fixes

A quick fix is a semi-automatic fix for a constraint violation. It is semi-automatic in the sense that it is made available to the user in a menu, and after selecting the respective quick fix from the menu, the code that implements the quick fix rectifies the problem that caused the constraint violation25.

25 Notice that a quick fix only makes sense for problems that have one or more "obvious" fixes. This is not true for all problems.

Quick Fixes in Xtext

Xtext supports quick fixes for constraint violations. Quick fixes can either be implemented on the concrete syntax (i.e. via text replacement) or on the abstract syntax (i.e. via a model modification and subsequent serialization).
As an example, consider the following constraint defined in the cooling language’s CoolingLanguageJavaValidator:

    public static final String VARIABLE_LOWER_CASE = "VARIABLE_LOWER_CASE";

    @Check
    public void checkVariable(Variable v) {
        if (!Character.isLowerCase(v.getName().charAt(0))) {
            // the third argument identifies the kind of problem
            warning("Variable name should start with a lower case letter",
                    al.getSymbolDeclaration_Name(), VARIABLE_LOWER_CASE);
        }
    }

Based on our discussion of constraint checks (Section 9.1), this code should be fairly self-explanatory. What is interesting is the third argument to the warning method: we pass in a constant to uniquely identify the problem. The quick fix will be tied to this constant. The following code is the quick fix, implemented in the CoolingLanguageQuickfixProvider26. Notice how in the @Fix annotation we refer to the same constant that was used in the constraint check.

26 This code resides in the UI part of the language, since, in contrast to the constraint check, it is relevant only in the editor.

    @Fix(CoolingLanguageJavaValidator.VARIABLE_LOWER_CASE)
    public void capitalizeName(final Issue issue, IssueResolutionAcceptor acceptor) {
        acceptor.accept(issue, "Decapitalize name", "Decapitalize the name.",
                "upcase.png", new IModification() {
            public void apply(IModificationContext context)
                    throws BadLocationException {
                IXtextDocument xtextDocument = context.getXtextDocument();
                // replace the first letter of the name with its lower case version
                String firstLetter = xtextDocument.get(issue.getOffset(), 1);
                xtextDocument.replace(issue.getOffset(), 1,
                        firstLetter.toLowerCase());
            }
        });
    }

Quick fix methods accept the Issue that caused the problem as well as an IssueResolutionAcceptor that is used to register the fixes so they can be shown in the quick fix menu. The core of the fix is the anonymous instance of IModification that, when executed after it has been selected by the user, fixes the problem. In our example, we grab the document that contains the problem and use a text replacement API to replace the first letter of the offending variable with its lower case version.

Working on the concrete syntax level is ok for simple problems like this one. More complex problems should be solved on the abstract syntax though27. For these cases, one can use an instance of ISemanticModification instead:

    @Fix(CoolingLanguageJavaValidator.VARIABLE_LOWER_CASE)
    public void fixName(final Issue issue, IssueResolutionAcceptor acceptor) {
        acceptor.accept(issue, "Decapitalize name", "Decapitalize the name",
                "upcase.png", new ISemanticModification() {
            public void apply(EObject element, IModificationContext context) {
                ((Variable) element).setName(
                        Strings.toFirstLower(issue.getData()[0]));
            }
        });
    }

27 Imagine a problem that requires changes to the model in several places. Often it is easy to navigate to these places via the abstract syntax (following references, climbing up the tree), but finding the respective locations on the concrete syntax would be cumbersome and brittle.

A quick fix using an ISemanticModification basically works the same way; however, inside the apply method we now use the EMF Java API to fix the problem28.

28 Notice that after the problem is solved, the changed AST is serialized back into text. Depending on the scope of the change, a formatter has to be implemented for the language to make sure the resulting serialized text looks nice.

Quick Fixes in MPS

Quick fixes in MPS work essentially the same way as in Xtext. Of course there are only quick fixes that act on the abstract syntax; the concrete syntax is projected in any case. Here is a constraint that checks that the name of an element that implements INameAllUpperCase actually consists of only upper case letters:

    checking rule check_INameAllUpperCase {
      applicable for concept = INameAllUpperCase as a
      do {
        if (!(a.name.equals(a.name.toUpperCase()))) {
          warning "name should be all upper case" -> a;
        }
      }
    }

The quick fix below upper-cases the name if necessary. The quick fix is associated with the constraint check by simply referencing the fix from the error message. Quick fixes are executed by selecting them from the intentions menu (Alt-Enter)29.

29 As discussed in Section 7.7, MPS also has intentions. These are essentially quick fixes that are not associated with an error. Instead, they can be invoked on any instance of the concept for which the intention is declared.

    quick fix fixAllUpperCase
      arguments:
        node node
      description(node)->string {
        "Fix name";
      }
      execute(node)->void {
        node.name = node.name.toUpperCase();
      }
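The common pattern behind both tools, a check that reports an identifiable problem and a fix that is looked up by that identifier, can be sketched without any framework. Everything in the following Java sketch (the Issue class, the FIXES registry, and the toy check that looks for names after the text "var ") is hypothetical and invented for illustration; it borrows only the lower-case variable rule from the Xtext example and is neither Xtext nor MPS API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical, framework-free model of an issue reported by a check.
final class Issue {
    final String code; // identifies the kind of problem
    final int offset;  // where the offending text starts
    Issue(String code, int offset) { this.code = code; this.offset = offset; }
}

class QuickFixDemo {
    static final String VARIABLE_LOWER_CASE = "VARIABLE_LOWER_CASE";

    // Fixes are registered under the issue code they resolve.
    static final Map<String, BiFunction<String, Issue, String>> FIXES = new HashMap<>();
    static {
        // Concrete-syntax fix: lower-case the character at the issue offset.
        FIXES.put(VARIABLE_LOWER_CASE, (text, issue) ->
                text.substring(0, issue.offset)
                + Character.toLowerCase(text.charAt(issue.offset))
                + text.substring(issue.offset + 1));
    }

    // Toy check: report names after "var " that start with an upper case letter.
    static List<Issue> check(String text) {
        List<Issue> issues = new ArrayList<>();
        int from = 0;
        int idx;
        while ((idx = text.indexOf("var ", from)) >= 0) {
            int nameStart = idx + 4;
            if (nameStart < text.length()
                    && Character.isUpperCase(text.charAt(nameStart))) {
                issues.add(new Issue(VARIABLE_LOWER_CASE, nameStart));
            }
            from = nameStart;
        }
        return issues;
    }

    // Applying a fix: look it up by the issue's code and run it on the text.
    static String applyFix(String text, Issue issue) {
        return FIXES.get(issue.code).apply(text, issue);
    }
}
```

Keeping the check and the fix coupled only through the issue code is what allows the fix to live in a different part of the language implementation than the check.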

Model Synchronization via Quick Fixes

A particularly interesting feature of MPS’ quick fixes is that they can be executed automatically in the editor. This can be used for synchronizing different parts of a model: a constraint check detects an inconsistency in the model, and the automatically executed quick fix resolves the inconsistency.

Here is an example where this makes sense. Consider the interfaces and components extension to C. An interface declares a couple of operations, each with its own unique signature. A component that provides the interface has to provide implementations for each of the operations, and each implementation must have the same signature as the operation it implements. A constraint checks the consistency between interfaces and implementing components. An automatically executed quick fix adds missing (empty) operation implementations and synchronizes their signatures with the signatures of the operations in the interface.
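The synchronization idea can be sketched in a few lines of Java. The classes InterfaceDecl, ComponentDecl and SignatureSync are hypothetical and reduce operations to a map from operation name to signature string (implementation bodies are left out entirely), so this is an illustration of the check-then-fix cycle under those assumptions rather than the mbeddr implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: operation name -> signature string.
class InterfaceDecl {
    final Map<String, String> operations = new LinkedHashMap<>();
}

class ComponentDecl {
    final Map<String, String> implementations = new LinkedHashMap<>();
}

class SignatureSync {
    // The constraint: every operation of the interface is implemented
    // by the component with an identical signature.
    static boolean isConsistent(InterfaceDecl itf, ComponentDecl comp) {
        for (Map.Entry<String, String> op : itf.operations.entrySet()) {
            if (!op.getValue().equals(comp.implementations.get(op.getKey()))) {
                return false;
            }
        }
        return true;
    }

    // The automatically executed fix: add missing implementations and
    // overwrite diverging signatures; other entries are left untouched.
    static void synchronize(InterfaceDecl itf, ComponentDecl comp) {
        if (isConsistent(itf, comp)) {
            return;
        }
        comp.implementations.putAll(itf.operations);
    }
}
```

An editor would run synchronize whenever the constraint fails, so the user never sees the inconsistency for longer than one edit.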

13.6 Refactoring

Refactoring addresses changing the program structure without changing its behavior. It is typically used to "clean up" the program structure after it has gotten messy over time. While DSLs and their programs tend to be simpler than GPL programs, refactorings are still useful.

Renaming in Xtext

One of the most essential refactorings is renaming a program element30. Xtext comes with rename refactoring out of the box; every language supports rename refactoring automatically31. The only thing the user has to remember is to not just type a new name, but instead invoke the Rename refactoring, for example via Ctrl-Alt-R.

30 The reason why it is a refactoring (and not just typing a new name) is because all references to this element have to be updated. In textual languages, such references are by name, and if the name of the target element changes, so has the text of the reference.

31 In fact, it works across languages and even integrates with tools that are not implemented with Xtext, such as the JDT or XMI-persisted EMF models such as GMF diagrams.

Note that in a projectional editor such as MPS, renaming is not even a refactoring. A reference is established with the UUID of the target element. Renaming it does not lead to any structural change. And since the editor for the referencing element defines how to render the reference, it will just display the updated name if it changes.

Renaming in Spoofax

Like code generators, refactorings need to be specified in the editor specification and implemented with rewrite rules. For example, the following specification specifies a refactoring for renaming entities:

    refactorings
      refactoring Decl.Entity : "Rename Entity" = rename-entity (cursor)
        shortcut : "org.eclipse.jdt.ui.edit.text.java.rename.element"
        input
          identifier : "new name" = ""

The specification starts with a syntactic sort and a constructor, on which the refactoring should be available, followed by a label for the refactoring in the context menu, the implementing rewrite rule, and two options. In the example, Spoofax is instructed to use the current cursor position to determine the node on which the refactoring should be applied. The specification further defines a shortcut for the refactoring, which should be the same key binding as the one used in the JDT for renaming. Finally, it defines an interactive input dialog, with a label "new name" and an empty default input. The refactoring itself is implemented in a rewrite rule:

    rename-entity:
      (newname, Entity(name, elems), position, ast, path, project-path) ->
      ([(ast, new-ast)], errors, [], [])
      with
        new-ast := ast;
        [Entity(), oldname|path] := name;
        if [Entity(), newname|path] then
          errors := [(name, $[Entity of name [newname] already exists.])]
        else
          errors := []
        end

    rename-entity-local(|old-name, new-name):
      Entity(old-name, elems) -> Entity(new-name, elems)

    rename-entity-local(|old-name, new-name):
      EntType(old-name) -> EntType(new-name)

As we have seen already for other editor services, rewrite rules for refactorings have to adhere to a specific interface (i.e. signature). On the left-hand side, the example rule matches a tuple consisting of the input from the refactoring dialog (newname), the node on which the refactoring is applied, its position in the abstract syntax tree, the tree itself (ast), and the paths of the current file and the project. On the right-hand side, it yields a tuple consisting of a list of changes in the abstract syntax tree and lists of fatal errors, normal errors and warnings.

For simplicity, the example rule changes the whole abstract syntax tree into a new one and provides only duplicate definition errors. The new abstract syntax tree is obtained by traversing the old one in a topdown fashion, trying to apply the rewrite rules rename-entity-local32. These rules take the old and new entity name as parameters. They ensure that declarations and references to entities are renamed. The first rule rewrites entity declarations, while the second one rewrites types of the form EntType(name), where name refers to an entity. An error is detected if an entity with the new name already exists. To detect this, we match the annotated URI of the old name, change it to the new name, and look it up. If we find an entity, the renamed entity would clash with the one just found.

32 try(s) tries to apply a strategy s to a term. Thereby, it never fails. If s succeeds, it will return the result of s. Otherwise, it will return the original term.
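The interplay of a rule that may fail, the try combinator, and the clash check can be imitated in Java. The following sketch is an assumption-laden simplification: the Term class, the flat list standing in for the abstract syntax tree (so no real topdown traversal), and the plain name comparison standing in for the URI lookup are all hypothetical and are not Spoofax or Stratego APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical flat term model: kind is "Entity" (declaration) or
// "EntType" (reference to an entity); other kinds are left untouched.
final class Term {
    final String kind;
    final String name;
    Term(String kind, String name) { this.kind = kind; this.name = name; }
}

class RenameEntity {
    // A rewrite rule may fail: it applies only to matching terms.
    static Optional<Term> renameLocal(Term t, String oldName, String newName) {
        boolean matches = (t.kind.equals("Entity") || t.kind.equals("EntType"))
                && t.name.equals(oldName);
        return matches ? Optional.of(new Term(t.kind, newName)) : Optional.empty();
    }

    // try(s): apply the rule, or return the original term if it fails.
    static Term tryRename(Term t, String oldName, String newName) {
        return renameLocal(t, oldName, newName).orElse(t);
    }

    // Returns the error messages; 'result' receives the renamed terms.
    static List<String> rename(List<Term> ast, String oldName, String newName,
                               List<Term> result) {
        List<String> errors = new ArrayList<>();
        for (Term t : ast) {
            // Clash check: an entity with the new name already exists.
            if (t.kind.equals("Entity") && t.name.equals(newName)) {
                errors.add("Entity of name " + newName + " already exists.");
            }
            result.add(tryRename(t, oldName, newName));
        }
        return errors;
    }
}
```

As in the rewrite rule, the new tree is always computed and the errors are reported alongside it; whether the change is actually applied is left to the caller.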

Introduce Local Variable in MPS

A very typical refactoring for a procedural language such as C is to introduce a new local variable. Consider the following code:

    int8 someFunction(int8 v) {
        int8 y = somethingElse(v * FACTOR);
        if ( v * FACTOR > 20 ) {
            return 1;
        } else {
            return 0;
        }
    }

As you can see, the first two lines contain the same expression (v * FACTOR) twice. A nicer version of this code might look like this: int8 someFunction(int8 v) { int8 product = v * FACTOR; int8 y = somethingElse(product); if ( product > 20 ) {

return 1; 33 } else { In the meantime, the MPS refactoring return 0; API has changed to better separate UI } aspects (keystrokes, choosers) and the } refactoring itself. This is important to support integration of refactorings with The Introduce Local Variable refactoring performs this change. IntelliJ IDEA and, in the future, Eclipse. MPS provides a DSL for refactorings, based on which the im- However, I decided to keep the old style, since it is a little more concise and plementation is about 20 lines of code. We’ll go through it in easier to follow in the context of this steps33. We start with the declaration of the refactoring itself. example. 332 dslbook.org

    refactoring introduceLocalVariable ( "Introduce Local Variable" )
    keystroke: <ctrl+alt>+<V>
    target: node<Expression>
      allow multiple: false
      isApplicableToNode(node)->boolean {
        node.ancestor<concept = Statement>.isNotNull;
      }

The code above specifies the name of the refactoring (introduceLocalVariable), the label used in the refactoring menu, the keystroke to execute it directly (Ctrl-Alt-V) as well as the target, i.e. the language concept on which the refactoring can be executed. In our case, we want to refactor Expressions, but only if these expressions are used in a Statement34. We find out about that by checking whether the Expression has a Statement among its ancestors in the tree.

34 We cannot refactor an expression if it is used, for example, as the init expression for a global constant.

Next, we define a parameter for the refactoring:

    parameters:
      varName chooser: type: string
                       title: Name of the new Variable
    init(refactoringContext)->boolean {
      return ask for varName;
    }

The parameter represents the name of the newly introduced variable. In the refactoring's init block we ask the user for this parameter35. We are now ready to implement the refactoring algorithm itself in the refactor block. We first declare two local variables that represent the expression on which we invoked the refactoring (we get it from the refactoringContext36) and the Statement under which this expression is located. Finally, we get the index of the Statement37.

    node<Expression> targetExpr = refactoringContext.node;
    node<Statement> targetStmt = targetExpr.ancestor<concept = Statement>;
    int index = targetStmt.index;

35 The ask for expression returns false if the user selects Cancel in the dialog that prompts the user for the name. The execution of the refactoring stops in this case.

36 If the refactoring was declared to allow multiple, we can use refactoringContext.nodes to access all of the selected nodes.

37 .index returns the index of an element in the collection that owns the element. It is available on any node.

Next, we iterate over the statement in which the expression lives and all of its siblings. As we do that, we look for all expressions that are structurally similar to the one we're executing the refactoring on (using MatchingUtil.matchNodes). We remember a matching expression if it occurs in the statement that contains our target expression or in a later one.

    nlist<Expression> matchingExpressions = new nlist<Expression>;
    sequence<node<>> siblings =
        targetStmt.siblings.union(new singleton<node<>>(targetStmt));
    foreach s in siblings {
      if (s.index >= index) {
        foreach e in s.descendants<concept = Expression> {
          if (MatchingUtil.matchNodes(targetExpr, e)) {
            matchingExpressions.add(e);
          }
        }
      }
    }

The next step is to actually introduce the new local variable. We create a new LocalVariableDeclaration using the API. We set the name to the one we've asked the user for (varName), we set its type to a copy of the type calculated by the type system for the target expression, and we initialize the variable with a copy of the target expression itself. We then add this new variable to the list of statements, just before the one which contains our target expression. We use the add prev-sibling built-in function for that.

    node<LocalVariableDeclaration> lvd = new node<LocalVariableDeclaration>();
    lvd.name = varName;
    lvd.type = targetExpr.type.copy;
    lvd.init = targetExpr.copy;
    targetStmt.add prev-sibling(lvd);

There is one more step we have to do. We have to replace all the occurrences of our target expression with a reference to the newly introduced local variable. We had collected the matchingExpressions above, so we can now iterate over this collection38:

    foreach e in matchingExpressions {
      node ref = new node();
      ref.var = lvd;
      e.replace with(ref);
    }

38 Note how the actual replacement is done with the replace with built-in function. It comes in very handy, since we don't have to manually find out in which property or collection the expression lives in order to replace it.

All in all, building refactorings is straightforward with MPS’ refactoring support. The implementation effort is reduced to essentially the algorithmic complexity of the refactoring itself. Depending on the refactoring, this can be non-trivial.
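The algorithm walked through above (collect structurally matching expressions, insert a declaration before the target statement, replace the matches with references) can be illustrated outside MPS. This is a minimal sketch under stated assumptions: Expr, Stmt and introduceLocal are hypothetical names, the structural comparison stands in for MatchingUtil.matchNodes, and the type int8 is hard-coded where MPS would ask the type system.

```java
import java.util.ArrayList;
import java.util.List;

final class IntroduceLocalSketch {
    // Hypothetical expression node: a label plus children; leaves have no children.
    static final class Expr {
        String label;
        final List<Expr> kids = new ArrayList<>();
        Expr(String label, Expr... kids) { this.label = label; this.kids.addAll(List.of(kids)); }

        Expr copy() {
            Expr c = new Expr(label);
            for (Expr k : kids) c.kids.add(k.copy());
            return c;
        }
        // Structural comparison, standing in for MatchingUtil.matchNodes.
        boolean matches(Expr o) {
            if (!label.equals(o.label) || kids.size() != o.kids.size()) return false;
            for (int i = 0; i < kids.size(); i++)
                if (!kids.get(i).matches(o.kids.get(i))) return false;
            return true;
        }
        // Turn this node into a variable reference in place (the "replace with" step).
        void becomeRef(String name) { label = name; kids.clear(); }

        @Override public String toString() {
            if (kids.isEmpty()) return label;
            if (kids.size() == 2) return "(" + kids.get(0) + " " + label + " " + kids.get(1) + ")";
            List<String> args = new ArrayList<>();
            for (Expr k : kids) args.add(k.toString());
            return label + "(" + String.join(", ", args) + ")";
        }
    }

    // Hypothetical statement: some leading text plus one expression.
    static final class Stmt {
        final String prefix; final Expr expr;
        Stmt(String prefix, Expr expr) { this.prefix = prefix; this.expr = expr; }
        @Override public String toString() { return prefix + expr + ";"; }
    }

    static void collect(Expr e, Expr target, List<Expr> out) {
        if (target.matches(e)) { out.add(e); return; }
        for (Expr k : e.kids) collect(k, target, out);
    }

    static void introduceLocal(List<Stmt> block, int index, Expr target, String varName) {
        // 1. Remember matches in the target statement and in all later statements.
        List<Expr> matching = new ArrayList<>();
        for (int i = index; i < block.size(); i++) collect(block.get(i).expr, target, matching);
        // 2. Build the declaration from a copy, before the matches are overwritten.
        Stmt decl = new Stmt("int8 " + varName + " = ", target.copy());
        // 3. Replace every match with a reference, then insert the declaration.
        for (Expr e : matching) e.becomeRef(varName);
        block.add(index, decl);
    }

    static String demo() {
        Expr first = new Expr("*", new Expr("v"), new Expr("FACTOR"));
        Expr second = new Expr("*", new Expr("v"), new Expr("FACTOR"));
        List<Stmt> block = new ArrayList<>();
        block.add(new Stmt("int8 y = ", new Expr("somethingElse", first)));
        block.add(new Stmt("return ", new Expr(">", second, new Expr("20"))));
        introduceLocal(block, 0, first, "product");
        StringBuilder out = new StringBuilder();
        for (Stmt s : block) out.append(s).append('\n');
        return out.toString();
    }

    public static void main(String[] args) { System.out.print(demo()); }
}
```

Collecting the matches before mutating anything mirrors the order used in the MPS implementation: the target expression is itself one of the matches, so it must be copied into the declaration before it is replaced.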

13.7 Labels and Icons

Labels and icons for language concepts are used in several places, among them the outline view and the code completion menu.

Labels and Icons in Xtext

Labels and icons are defined in the language's LabelProvider, which is generated by Xtext for each language by default. To define the label text, you simply override the text method for your element, which returns either a String or a StyledString (which includes formatting information). For the icon, override the image method. Here are a couple of examples from the cooling language39:

    public class CoolingLanguageLabelProvider extends DefaultEObjectLabelProvider {

        String text(CoolingProgram prg) {
            return "program "+prg.getName();
        }

        String image(CoolingProgram prg) {
            return "program.png";
        }

        String text(Variable v) {
            return v.getName()+": "+v.getType();
        }

        String image(Variable v) {
            return "variable.png";
        }
    }

39 Because the label and the image are defined via methods, you can change the text and the icon dynamically, based on some property of the model.

Labels and Icons in MPS

Labels are defined by overriding the getPresentation behavior method on the respective concept. This allows the label to also be adjusted dynamically. The icon can be selected in the inspector (see Fig. 13.10) if we select a language concept. The icon is fixed and cannot be changed dynamically.

Figure 13.10: Assigning an icon to a language concept.

13.8 Outline

The outline provides an overview of the contents of some part of the overall model, typically a file. By default, it usually shows more or less the AST, down to a specific level (the implementations of functions or methods are typically not shown). The contents of the outline view must be user-definable; at the very least, we have to define where to stop the tree. Also, the tree structure may be completely different from the nesting structure of the AST: the elements may have to be grouped based on their concept (first show all variables, then all functions) or they may have to be sorted alphabetically.

Customizing the Structure in Xtext

Xtext provides an OutlineTreeProvider for your language that can be used to customize the outline view structure (labels and icons are taken from the LabelProvider discussed above). As an example, let us customize the outline view for cooling programs to look like the one shown in Fig. 13.11.

Figure 13.11: A customized outline view for cooling programs in Xtext

The tree view organizes the contents of a file by first showing all programs and then all tests. To do this, we provide a suitable implementation of _createChildren:

    protected void _createChildren(DocumentRootNode parentNode, Model m) {
        for (EObject prg : m.getCoolingPrograms()) {
            createNode(parentNode, prg);
        }
        for (EObject t : m.getTests()) {
            createNode(parentNode, t);
        }
    }

Inside the method, we first grab all the CoolingPrograms from the root element Model and create a node for each of them using the createNode API, which takes the parent (in terms of the outline view) and the program element that should be represented by the new outline node40. We then do the same for tests.

40 The text and icon for the outline node is taken from the label provider discussed in the previous section.

Inside a program, we want to show variables and states in separate sections, i.e. under separate intermediate nodes (see Fig. 13.11). Here is how this works:

    protected void _createChildren(IOutlineNode parentNode, CoolingProgram p) {
        TextOnlyOutlineNode vNode = new TextOnlyOutlineNode(parentNode,
            imageHelper.getImage("variable.png"), "variables");
        for (EObject v: p.getVariables()) {
            createNode(vNode, v);
        }
        TextOnlyOutlineNode sNode = new TextOnlyOutlineNode(parentNode,
            imageHelper.getImage("state.png"), "states");
        for (EObject s: p.getStates()) {
            createNode(sNode, s);
        }
    }

We introduce intermediate nodes that do not represent a program element; they are used purely for structuring the tree. The TextOnlyOutlineNode is a class we created; it simply extends the class AbstractOutlineNode provided by Xtext.

    public class TextOnlyOutlineNode extends AbstractOutlineNode {

        protected TextOnlyOutlineNode(IOutlineNode parent, Image image, Object text) {
            super(parent, image, text, false);
        }
    }
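The grouping idea is independent of Xtext and can be shown with a few lines of plain Java. The sketch below is illustrative only: Node and outlineFor are hypothetical names, not Xtext classes, and the element labels are made-up sample data.

```java
import java.util.ArrayList;
import java.util.List;

final class OutlineSketch {
    // Hypothetical outline node; it registers itself with its parent on creation.
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();
        Node(Node parent, String text) {
            this.text = text;
            if (parent != null) parent.children.add(this);
        }
        void render(String indent, StringBuilder out) {
            out.append(indent).append(text).append('\n');
            for (Node c : children) c.render(indent + "  ", out);
        }
    }

    // One text-only group node per kind of child, then one node per element.
    static Node outlineFor(String program, List<String> variables, List<String> states) {
        Node root = new Node(null, "program " + program);
        Node vNode = new Node(root, "variables");
        for (String v : variables) new Node(vNode, v);
        Node sNode = new Node(root, "states");
        for (String s : states) new Node(sNode, s);
        return root;
    }

    static String demo() {
        StringBuilder out = new StringBuilder();
        outlineFor("Standard", List.of("temp: int", "open: bool"), List.of("start", "cooling"))
            .render("", out);
        return out.toString();
    }

    public static void main(String[] args) { System.out.print(demo()); }
}
```

The group nodes carry only a text, which is the role TextOnlyOutlineNode plays in the Xtext version; the outline tree is thereby decoupled from the containment structure of the model.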

Xtext provides alphabetical sorting for outlines by default. There is also support for styling the outline (i.e. using styled labels as opposed to simple text) as well as for filtering the tree.

The Outline in Spoofax

With Spoofax, outlines can be specified declaratively in the editor specification. Abstract syntax tree nodes, which should appear in the outline, are selected based on their syntactic sort and constructor names. For example, the following outline specification will include all entity declarations41:

    module MoblLang-Outliner

    imports MoblLang-Outliner.generated

    outliner Entity Outliner
      Decl.Entity

41 As in the specification of other editor services, we can use _ as a wildcard for sorts and constructors. For example, Decl._ will include imports and entities in the outline. Similarly, _.Property will include all nodes with a constructor Property, independent of their syntactic sort.

Spoofax analyses the syntax definition and tries to come up with a reasonable default outline specification. We can then either extend the generated specification with our own rules, or create a new one from scratch.

The Outline in MPS

MPS does not have a customizable outline view. It shows the AST of the complete program as part of the project explorer, but the structure cannot be customized. However, it is of course possible to add arbitrary additional views (called tools in MPS) to MPS. The MPS tutorial at bit.ly/xU78ys shows how to implement your own outline view.

13.9 Code Folding

Code folding refers to the small minuses in the gutter of an editor that let you collapse code regions (see Fig. 13.12). The editor shows an ellipsis (...) for the folded parts of the code. Clicking on the + or on the ellipsis restores the full code.

Folding in Xtext

Xtext automatically provides folding for all language concepts that stretch over more than one line42. To turn off this default behavior, you have to implement your own subclass of DefaultFoldingRegionProvider and override the method isHandled in a suitable way. For example, to not provide folding for CustomStates, you could do the following:

    public class CLFoldingRegionProvider extends DefaultFoldingRegionProvider {

        @Override
        protected boolean isHandled(EObject eObject) {
            if ( eObject instanceof CustomState ) {
                return false;
            }
            return super.isHandled(eObject);
        }
    }

Figure 13.12: Code folding in Xtext. If you hover over the folded code, a pop-up shows the hidden code.

42 Xtext also automatically provides hovers for the folded contents that show the text that is "folded away".

Folding in Spoofax

Spoofax allows folding to be specified declaratively in the editor specification. Very similar to the specification of outlines, folding is specified in terms of syntactic sort and constructor names:

    module Mobl-Folding

    folding
      Module._
      Decl.Entity
      _.Function

As for outlines, Spoofax analyses the syntax definition and tries to come up with a reasonable default specification. We can then either extend the generated specification with our own rules, disable particular specifications by adding a (disabled) annotation, or discard it completely. A (folded) annotation tells Spoofax to fold a node by default in a newly opened editor, which is typically seen for import sections.

Folding in MPS

In MPS, folding can be activated for any collection. For example, in a state machine, each state contains a vertical list of transitions. To enable folding for this collection, we set the uses folding property for the collection to true43. Once we've set the property to true, we have to provide a cell that is rendered if the user requests the code to be folded. This allows the text shown as an ellipsis to be customized beyond just showing three dots. As Fig. 13.13 shows, we use a read only model access cell, which allows us to access the underlying model and return an arbitrary string. In the example, we output the number of "hidden" transitions.

43 It can also be set to query, in which case code can be written that determines at runtime whether folding should be enabled or not. For example, folding could be enabled if there are more than three transitions in the state.

Figure 13.13: After uses folding has been set to true for the transitions collection of a State, we have to specify a folded cell which is shown in case the collection is folded. In this case we use an R/O model access as the folded cell which can return a (read-only) string that is projected if the collection is folded.

MPS provides a second mechanism that can be used to the same effect. Since MPS is a projectional editor, some parts of the editor may be projected conditionally. Fig. 13.14 shows a list/tree of requirements. After pressing Ctrl-Shift-D on a requirement, the editor shows the requirement's details (Fig. 13.15). This effect of "expanding editors" is implemented by making the detail part optional in the sense that the projection rule only shows it conditionally. Fig. 13.16 shows the editor definition.

Figure 13.14: A list/tree of requirements in mbeddr.

13.10 Tooltips/Hover

A tooltip, or hover, is a small, typically yellow window that is shown if the user hovers the mouse over a program element. A hover may show the documentation of the target element, or, when hovering over a reference, some information about the referenced element.

Figure 13.15: Optionally, the requirement details can be shown inline.

Figure 13.16: The part of the editor that includes the details pane is only projected if the open property is true. This property is toggled using Ctrl-Shift-D.

Xtext

In Xtext, hovers/tooltips can be customized to various extents. The simplest customization retains the default hover structure (a one-line summary plus a more extensive documentation) and just changes the respective texts.

To change the one-line summary, you override the getFirstLine method in DefaultEObjectHoverProvider and return a custom string. The program element for which the hover should be created is represented by the EObject passed into the method. To customize the documentation, you override getDocumentation in IEObjectDocumentationProvider44.

44 Just as in any other case of Xtext framework configuration, you have to register the two classes in the UI module.

Spoofax

Spoofax supports tooltips directly. Tooltips are provided by rewrite rules, which need to be defined as hovers in the editor specification:

    hover _: editor-hovering

This line tells Spoofax to use a rewrite rule editor-hovering to retrieve tooltips for all kinds of abstract syntax tree nodes45. When we want to define different rewrite rules for particular constructors, we need to provide a hover specification for each constructor, replacing _ by _. followed by the name of the constructor.

45 Note the use of the wildcard to express "all kinds of AST nodes".

The specified rewrite rules have to follow the typical editor interface on their left-hand side and need to yield strings on their right-hand sides. The strings are then used as tooltips. For example, the following rewrite rule will provide type information for any typeable node:

    editor-hovering:
      (target, position, ast, path, project-path) -> target

Figure 13.17: If a user selects a reference to a requirement in a requirements trace (Arg2 in the example), the inspector shows information about the referenced requirement.

MPS

MPS does not support tooltips at this time. However, there is an acceptable workaround: any additional information for a program element can be shown in the inspector. For example, if users click on a reference to a requirement in program code, the inspector shows information about the referenced requirement (see Fig. 13.17).

Looking at the editor definition for a RequirementRef, you can see that the actual editor (top in Fig. 13.18) shows only the name of the referenced element. The bottom part, the inspected cell layout, projects the details about the referenced element.

Figure 13.18: The editor definition for the RequirementRef projects details about the referenced element in the inspector. Notice the use of the $swing component$ as a means to embed the Swing JTextArea that shows the prose description of the requirement. The code behind the $swing component$ also populates the text area and writes back the changed text to the AST.

13.11 Visualizations

To provide an overview of the structure of programs, read-only graphical representations are useful. Note that these are not necessarily a workaround for not having graphical editors: visualizations can provide real added value.

MPS

In MPS we have integrated ZGRViewer46, a Java-based renderer for GraphViz47 dot files. Fig. 13.19 shows an example.

46 zvtm.sourceforge.net/zgrviewer.html

47 www.graphviz.org/

Figure 13.19: Clicking on a node in the Graph Viewer opens the respective program element in the MPS editor.

As part of the transformations, we map the model to another model expressed in a graph description language. This model is then generated into a dot file. The graph viewer scans the output directory for dot files and shows them in the tree view at the top. Double-clicking on a graph node in the tree opens a rendered dot file in the graph view.

Xtext

In Xtext, Jan Koehnlein's Generic Graph View48 can be used to render diagrams of Xtext models in real-time: the Generic Graph View is an interpreter, so changes in the model lead to updates in the graph immediately. The mapping from the model to the graph is expressed with an Xtext-based mapping DSL that extends Xbase (Xtext's reusable expression language), which means you can use Xbase expressions to traverse and query the model you want to visualize (in Fig. 13.20 an example would be the this.eSuperTypes() expression). In addition, a separate styling DSL supports the definition of shapes, colors and line styles. Double-clicking a node in the graph opens the corresponding program element in the Xtext editor.

48 github.com/JanKoehnlein/Generic-Graph-View

13.12 Diff and Merge

Highlighting the differences between versions of a program and allowing the resolution of conflicts is important in the context of version control integration. For tools like Xtext or Spoofax that store models as plain text this is a non-issue: existing diff/merge tools can be used, be they in the IDE or on the command line.

Figure 13.20: A model-to-graph mapping and a style definition expressed with the Generic Graph Viewer DSLs by Jan Koehnlein.

For projectional editors such as MPS, the story is more complicated. Since they store the programs based on the abstract syntax (e.g., using XML), diff and merge have to be performed on the concrete projected syntax. MPS provides this feature (see Fig. 7.7 for an example). MPS also annotates the editor with gutter annotations that highlight whether a part of the program has changed relative to the last checkout.

14 Testing DSLs

All the aspects of DSL implementation we have discussed so far need to be tested to keep them stable. In this chapter we address testing of the language syntax, the constraints and the semantics, as well as some of the editor services, based on examples with Xtext, MPS and Spoofax. We conclude the chapter with a brief look at "testing" a language for appropriateness relative to the domain.

DSL testing is a multi-faceted problem, since it needs to address all the aspects of the DSL implementation we have discussed so far. In particular, this includes the syntax, the constraints and type system, as well as the execution semantics (i.e. transformations or interpreters). Here are some examples:

• Can the syntax cover all required sentences? Is the concrete syntax "correct"?

• Do the scopes work correctly?

• Do the constraints work? Are all "wrong" programs actually detected, and is the right error message attached to the right program element?

• Are the semantics correct? Do transformations, generators and interpreters work correctly?

• Can all programs relevant to the users actually be expressed? Does the language cover the complete domain?

An important ingredient to testing is that test execution can be automated via scripts, so they can be run as part of automatic builds. All the test strategies shown below can be executed from ant scripts. However, we don't describe in detail how this works for each of the tools.

14.1 Syntax Testing

Testing the syntax is simple in principle. Developers simply try to write a large set of relevant programs and see if they can be expressed with the language. If not, the concrete syntax is incomplete. We may also want to try to write "wrong" programs and check that the errors are detected, and that meaningful error messages are reported.

An Example with Xtext

The following piece of code is the fundamental code that needs to be written in Xtext to test a DSL program using the Xtext testing utilities1. It is a JUnit 4 test case (with special support for the Xtext infrastructure2), so it can be run as part of Eclipse's JUnit integration.

    @RunWith(XtextRunner.class)
    @InjectWith(CoolingLanguageInjectorProvider.class)
    public class InterpreterTests extends XtextTest {

        @Test
        public void testET0() throws Exception {
            testFileNoSerializer("interpreter/engine0.cool", "tests.appl", "stdparams.cool" );
        }

    }

1 code.google.com/a/eclipselabs.org/p/xtext-utils/wiki/Unit_Testing; more testing utilities for Xtext can be found in the org.eclipse.xtext.junit4 package.

2 It is tested with Xtext 2.x.

The single test method loads the interpreter/engine0.cool program, as well as two more files which contain elements referenced from engine0.cool. The testFileNoSerializer method loads the file, parses it and checks constraints. If either parsing or constraint checking fails, the test fails3.

3 There is also a testFile method which, after loading and parsing the file, reserializes the AST to the text file, writes it back, and loads it again, thus comparing the two ASTs. This way, the (potentially adapted) formatter is tested. Note that for testing the formatting, a text comparison is the only way to go, even though we argue against text comparison in general.

On a more fine-grained level it is often useful to test partial sentences instead of complete sentences or programs. The following piece of Xtext example code tests the CustomState parser rule:

    @Test
    public void testStateParserRule() throws Exception {
        testParserRule("state s:", "CustomState" );
        testParserRule("state s: entry { do fach1->anOperation }", "CustomState" );
        testParserRule("state s: entry { do fach1->anOperation }", "State" );
    }

The first line asserts that the string state s: can be parsed with the CustomState parser rule. The second line passes in a more complex state, one with a command in an entry action. Line three tries the same text with the State rule, which itself calls the CustomState rule4.

4 These tests really just test the parser. No linking or constraints checks are performed. This is why we can "call" anOperation on the fach1 object, although anOperation is not defined as a callable operation anywhere.

An Example with Spoofax

Spoofax supports writing tests for language definitions using a testing language. Consider the following test suite:

    module example
    language MoblEntities

    test empty module [[module foo]] parse succeeds
    test missing layout (module name) [[modulefoo]] parse fails

The first two lines specify the name of the test suite and the language under test. The remaining lines specify positive and negative test cases concerning the language's syntax. Each test case consists of a name, the to-be-tested code fragment in double square brackets, and a specification that determines what kind of test should be performed (parsing) and what the expected outcome is (succeeds or fails). We can also specify the expected abstract syntax based on the ATerm textual notation:

    test empty module (AST) [[module foo]] parse to Module("foo", [])

Instead of specifying a complete abstract syntax tree, we can specify only the interesting parts in a pattern. For example, if we only want to verify that the definition list of an empty module is indeed empty, we can use _ as a wildcard for the module name:

    test empty module (AST) [[module foo]] parse to Module(_, [])

Abstract syntax patterns are particularly useful for testing operator precedence and associativity:

    test multiply and add [[1 + 2 * 3]] parse to Add(_, Mul(_, _))
    test add and multiply [[1 * 2 + 3]] parse to Add(Mul(_, _), _)
    test add and add      [[1 + 2 + 3]] parse to Add(Add(_, _), _)

Alternatively, we can specify an equivalent concrete syntax fragment instead of an abstract syntax pattern:

    test multiply and add [[1 + 2 * 3]] parse to [[1 + (2 * 3)]]
    test add and multiply [[1 * 2 + 3]] parse to [[(1 * 2) + 3]]
    test add and add      [[1 + 2 + 3]] parse to [[(1 + 2) + 3]]
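To see why such tests pin down precedence and associativity, it can help to replay them against a tiny hand-written parser. The following Java sketch is an illustration under stated assumptions, not a rendering of how Spoofax evaluates tests: the grammar, the Int(...) leaf constructor and the string-based matches helper are invented for this example, and the matcher treats every underscore as a wildcard, so it would not cope with constructor names that contain one.

```java
final class PrecedenceSketch {
    private final String src;
    private int pos = 0;

    private PrecedenceSketch(String src) { this.src = src.replace(" ", ""); }

    // Returns the AST in an ATerm-like notation, e.g. Add(Int(1), Mul(Int(2), Int(3))).
    static String parse(String src) {
        PrecedenceSketch p = new PrecedenceSketch(src);
        String ast = p.sum();
        if (p.pos != p.src.length())
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        return ast;
    }

    // sum := product ('+' product)*   (the loop makes '+' left-associative)
    private String sum() {
        String left = product();
        while (peek() == '+') { pos++; left = "Add(" + left + ", " + product() + ")"; }
        return left;
    }

    // product := number ('*' number)* (sits below sum, so '*' binds tighter than '+')
    private String product() {
        String left = number();
        while (peek() == '*') { pos++; left = "Mul(" + left + ", " + number() + ")"; }
        return left;
    }

    private String number() {
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (start == pos) throw new IllegalArgumentException("number expected at " + pos);
        return "Int(" + src.substring(start, pos) + ")";
    }

    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    // Matches an AST string against a pattern in which _ stands for any one subterm.
    static boolean matches(String pattern, String ast) {
        String p = pattern.replace(" ", ""), a = ast.replace(" ", "");
        int i = 0, j = 0;
        while (i < p.length()) {
            if (p.charAt(i) == '_') {
                // Skip one balanced subterm: stop at ',' or ')' on nesting depth 0.
                int depth = 0;
                while (j < a.length()) {
                    char c = a.charAt(j);
                    if (depth == 0 && (c == ',' || c == ')')) break;
                    if (c == '(') depth++;
                    if (c == ')') depth--;
                    j++;
                }
                i++;
            } else {
                if (j >= a.length() || p.charAt(i) != a.charAt(j)) return false;
                i++; j++;
            }
        }
        return j == a.length();
    }

    public static void main(String[] args) {
        System.out.println(parse("1 + 2 * 3"));
        System.out.println(matches("Add(_, Mul(_, _))", parse("1 + 2 * 3")));
    }
}
```

If the sum and product levels were swapped in this parser, the first two patterns would no longer match, which is exactly the kind of regression the Spoofax tests above are meant to catch.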

A test suite can be run from the Transform menu. This will open the Spoofax Test Runner View, which provides information about failing and succeeding test cases in a test suite. Fig. 14.1 shows an example. Additionally, we can also get instant feedback while editing a test suite. Tests can also be evaluated outside the IDE, for example as part of a continuous integration setup.

Figure 14.1: Spoofax Test Runner View showing success and failure of test cases in a test suite.

Syntax Testing with MPS

Syntax testing in the strict sense is not useful or necessary with MPS, since it is not possible to "write text that does not parse". Invalid programs cannot even be entered. However, it is useful to write a set of programs which the language developer considers relevant. While it is not possible to write syntactically invalid programs, the following scenario is possible (and useful to test): a user writes a program with the language in version 1. The language evolves to version 2, making that program invalid. In this case, the program contains unbound language concepts or "holes". By running the model checker (interactively or via ant), such problems can be detected. Fig. 14.2 shows an example.

Figure 14.2: Top: an interface expressed in the mbeddr C components extension. Bottom: The same interface after we have removed the parameters collection in an Operation. The error reports that the model contains child nodes in a child collection that does not exist.

14.2 Constraints Testing

Testing of constraints is essential, especially for languages with complex constraints, such as those implied by type systems. The goal of constraints testing is to ensure that the correct error messages are annotated to the correct program elements, if those program elements have a constraint or type error.

An Example with Xtext

A special API is necessary to be able to verify that a program which makes a particular constraint fail actually annotates the corresponding error message to the respective program element. This way, tests can then be written which assert that a given program has a specific set of error annotations.

The unit testing utilities mentioned above also support testing constraints. The utilities come with an internal Java DSL that supports checking for the presence of error annotations after parsing and constraint-checking a model file.

    @Test
    public void testTypesOfParams() throws Exception {
        testFileNoSerializer("typesystem/tst1.cool", "tests.appl", "stdparams.cool");
        assertConstraints( issues.sizeIs(3) );                                //1
        assertConstraints( issues.forElement(Variable.class, "v1").           //2
            theOneAndOnlyContains("incompatible type") );                     //2
        assertConstraints( issues.under(Variable.class, "w1").                //3
            errorsOnly().sizeIs(2).oneOfThemContains("incompatible type") );  //3
    }

We first load the model file that contains constraint errors (in this case, type system errors). Then we assert the total number of errors in the file to be three (line 1)5. Next, in line 2, we check that the instance of Variable named v1 has exactly one error annotation, and that it has the text "incompatible type" in the error message. Finally, in line 3 we assert that there are exactly two errors anywhere under (i.e. in the subtree below) a Variable named w1, and one of these contains "incompatible type" in the error message.

5 This makes sure that the file does not contain additional errors beyond those asserted in the rest of the test case.

Using the fluent API style shown by these examples, it is easy to express errors and their locations in the program. If a test fails, a meaningful error message is output that supports localizing (potential) problems in the test. The following is the error reported if no error message is found that contains the substring incompatible type:

    junit.framework.AssertionFailedError: failed
      - failed oneOfThemContains: none of the issues
        contains substring ’incompatible type’
        at junit.framework.Assert.fail(Assert.java:47)
        at junit.framework.Assert.assertTrue(Assert.java:20)
        ...

A test may also fail earlier in the chain of filter expressions if, for example, there is no Variable named v1 in the program. More output is provided in this case:

    junit.framework.AssertionFailedError: failed
      - no elements of type
        com.bsh.pk.cooling.coolingLanguage.Variable named ’v1’ found
      - failed oneOfThemContains: none of the issues
        contains substring ’incompatible type’
        at junit.framework.Assert.fail(Assert.java:47)
        ...
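The way such a filter chain can accumulate diagnostic output is easy to reproduce in a few lines. The sketch below is a hypothetical re-creation in plain Java, written only to illustrate the fluent style; Issue, Issues and their methods are assumptions and do not reproduce the API of the Xtext testing utilities (the subtree filter under, for instance, is omitted).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class IssueAssertSketch {
    // Hypothetical issue: severity, owning element (type and name) and message.
    static final class Issue {
        final boolean error; final String type, name, message;
        Issue(boolean error, String type, String name, String message) {
            this.error = error; this.type = type; this.name = name; this.message = message;
        }
    }

    // Fluent filter chain that records suspicious steps so failures can be explained.
    static final class Issues {
        private final List<Issue> items;
        private final List<String> trace;
        Issues(List<Issue> items) { this(items, new ArrayList<>()); }
        private Issues(List<Issue> items, List<String> trace) { this.items = items; this.trace = trace; }

        private Issues filter(String step, Predicate<Issue> p) {
            List<Issue> kept = new ArrayList<>();
            for (Issue i : items) if (p.test(i)) kept.add(i);
            List<String> t = new ArrayList<>(trace);
            if (kept.isEmpty()) t.add("no issues left after " + step);
            return new Issues(kept, t);
        }
        Issues forElement(String type, String name) {
            return filter("forElement(" + type + ", " + name + ")",
                i -> i.type.equals(type) && i.name.equals(name));
        }
        Issues errorsOnly() { return filter("errorsOnly", i -> i.error); }

        Issues sizeIs(int n) { return check(items.size() == n, "sizeIs(" + n + "): found " + items.size()); }
        Issues oneOfThemContains(String s) {
            boolean hit = false;
            for (Issue i : items) hit |= i.message.contains(s);
            return check(hit, "oneOfThemContains: none of the issues contains substring '" + s + "'");
        }
        private Issues check(boolean ok, String failure) {
            if (ok) return this;
            List<String> all = new ArrayList<>(trace);
            all.add("failed " + failure);
            throw new AssertionError("failed - " + String.join(" - ", all));
        }
    }

    static Issues sample() {
        List<Issue> found = new ArrayList<>();
        found.add(new Issue(true, "Variable", "v1", "incompatible type: int vs. bool"));
        found.add(new Issue(true, "Variable", "w1", "incompatible type: bool vs. int"));
        found.add(new Issue(true, "Variable", "w1", "unknown symbol"));
        return new Issues(found);
    }

    static boolean demoSuccess() {
        sample().sizeIs(3);
        sample().forElement("Variable", "v1").sizeIs(1).oneOfThemContains("incompatible type");
        sample().forElement("Variable", "w1").errorsOnly().sizeIs(2).oneOfThemContains("incompatible type");
        return true;
    }

    static String demoFailure() {
        try {
            sample().forElement("Variable", "x").oneOfThemContains("incompatible type");
            return "no failure";
        } catch (AssertionError e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) { System.out.println(demoFailure()); }
}
```

Each filter returns a new, narrower collection and carries the trace along, so a failing final check can report which earlier step already came up empty.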

Scopes can be tested in the same way: we can write example programs where references point to valid targets (i.e. those in scope) and invalid targets (i.e. not in scope). Valid references must not have errors, invalid references must have errors6.

6 The absence or presence of these errors can be tested in the same way as the constraint checking tests discussed above.

An Example with MPS

MPS comes with the NodesTestCase for testing constraints and type system rules (Fig. 14.3). It supports special annotations to express assertions on types and errors, directly in the program. For example, the third line of the nodes section in Fig. 14.3 reads var double d3 = d without annotations. This is a valid variable declaration in mbeddr C. After this has been written down, annotations can be added. They are rendered in green (gray in the hardcopy version of the book). Line three asserts that the type of the variable d is double, i.e. it tests that variable references assume the type of the referenced variable. In line four we assign a double to an int, which is illegal according to the typing rules. The error is detected, hence the red underline. We use another annotation to assert the presence of the error.

Figure 14.3: Using MPS NodesTestCase, assertions about types and the presence of errors can be directly annotated on programs written in any language.

In addition to using these annotations to check types and typing errors, developers can also write more detailed test cases about the structure or the types of programs. In the example we assert that the var reference of the node referred to as dref points to the node labeled as dnode. Note how labels (green, underlined) are used to add names to program elements so they can be referred to from test expressions. This approach can be used to test scopes. If two variables with the same name are defined (e.g., because one of them is defined in an outer block, and we assume that the inner variable shadows the outer variable of the same name), we can use this mechanism to check that a reference actually points to the inner variable. Fig. 14.4 shows an example.

Figure 14.4: Labels and test methods can be used to check that scoping works. In this example, we check shadowing of variables.

An Example with Spoofax
In Spoofax’ testing language, we can write test cases which specify the number of errors and warnings in a code fragment:

    test duplicate entities [[
      module foo
      entity X {}
      entity X {}
    ]] 1 error

    test lower case entity name [[
      module foo
      entity x {}
    ]] 1 warning

Additionally, we can specify parts of the error or warning messages using regular expressions:

    test duplicate entities [[
      module foo
      entity X {}
      entity X {}
    ]] 1 error /duplicate/

Here, /duplicate/ is a regular expression that matches error messages like "Duplicate definition of X". As in Xtext and MPS, we can test scopes by means of correct and incorrect references. Alternatively, we can specify the source and target of a link in a test case:

    test property reference [[
      module foo
      entity X {
        [[p]]: int
        function f(q: int) {
          r: int = 0;
          return [[p]];
        }
      }
    ]] resolve #2 to #1

    test parameter reference [[
      module foo
      entity X {
        p: int
        function f([[p]]: int) {
          r: int = 0;
          return [[p]];
        }
      }
    ]] resolve #2 to #1

    test variable reference [[
      module foo
      entity X {
        p: int
        function f(q: int) {
          [[p]]: int = 0;
          return [[p]];
        }
      }
    ]] resolve #2 to #1

These cases use double square brackets to select parts of the program and specify the expected reference resolving in terms of these selections.

14.3 Semantics Testing

Fundamentally, testing the execution semantics of a program involves writing assertions against the execution of a program[7]. In the simplest case this can be done the following way:

• Write a DSL program, based on an understanding of what the program is expected to do.

• Generate the program into its executable representation.

• Manually write unit tests (in the target language) that assert the generated program’s behavior based on the understanding of what the DSL program should do.

[7] While this sounds fancy, this is close to what we do in unit testing "normal" programs. We write a system in a programming language X, and then write more code in X that states assertions about what the system does.

Notice how we do not test the structure or syntax of the generated artifact. Instead we test its meaning, which is exactly what we want to test. An important variation of this approach is the following: instead of writing the unit tests manually in the target language, we can also write the tests in the DSL, assuming the DSL has syntax to express such tests[8]. Writing the test cases on DSL level results in more concise and readable tests[9].

[8] Many DSLs are explicitly extended to support writing tests.

[9] Testing DSL programs by running tests expressed in the same language runs the risk of doubly-negating errors. If the generators for the tests and the core program are wrong in a "consistent" way, errors in either one may not be found. However, this problem can be alleviated by running a large enough number of tests and/or by having the generators for the core system and for the test cases written by different (groups of) people.

The same approach can be used to test execution semantics based on an interpreter, although it may be a little more difficult to manually write the test cases in the target language; the interpreter must provide a means to "inspect" its execution so that we can check whether it is correct. If the tests are written in the DSL and the interpreter executes them along with the core program, the approach works well.

Strictly speaking, the approach discussed here tests the semantics of a specific program. As always in testing, we have to write many of these tests to make sure we have covered all[10] of the possible execution paths through a generator or interpreter. If we do that, the set of tests implicitly tests the generator or interpreter, which is the goal we want to achieve in semantics testing.

[10] There may actually be an infinite number of possible execution paths, so we have to limit ourselves to a reasonable set of tests.
If we have several execution backends, such as an interpreter and a compiler, it must be ensured that both have the same semantics. This can be achieved by writing the tests in the DSL and then executing them in both backends. By executing enough tests, we can get a high degree of confidence that the semantics of the backends are aligned.
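The idea of running one test against several backends can be illustrated with a small Java sketch. Everything in it is hypothetical and only serves as an illustration: the mini expression language (Expr, Num, Bin), the tree-walking interpreter, and the tiny stack machine that stands in for a generator plus its target platform. None of these names come from the tools discussed in this chapter.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical mini-DSL: integer expressions built from numbers, + and *.
abstract class Expr {
    static Expr num(int v) { return new Num(v); }
    static Expr add(Expr l, Expr r) { return new Bin('+', l, r); }
    static Expr mul(Expr l, Expr r) { return new Bin('*', l, r); }
}

final class Num extends Expr {
    final int value;
    Num(int value) { this.value = value; }
}

final class Bin extends Expr {
    final char op;
    final Expr left, right;
    Bin(char op, Expr left, Expr right) { this.op = op; this.left = left; this.right = right; }
}

final class Backends {
    // Backend 1: a tree-walking interpreter.
    static int interpret(Expr e) {
        if (e instanceof Num) return ((Num) e).value;
        Bin b = (Bin) e;
        int l = interpret(b.left), r = interpret(b.right);
        return b.op == '+' ? l + r : l * r;
    }

    // Backend 2: a "generator" that emits code for a small stack machine ...
    static List<String> compile(Expr e) {
        List<String> code = new ArrayList<>();
        emit(e, code);
        return code;
    }

    private static void emit(Expr e, List<String> code) {
        if (e instanceof Num) { code.add("PUSH " + ((Num) e).value); return; }
        Bin b = (Bin) e;
        emit(b.left, code);
        emit(b.right, code);
        code.add(b.op == '+' ? "ADD" : "MUL");
    }

    // ... and the machine that executes the generated code.
    static int execute(List<String> code) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (String insn : code) {
            if (insn.startsWith("PUSH ")) {
                stack.push(Integer.parseInt(insn.substring(5)));
            } else {
                int r = stack.pop(), l = stack.pop();
                stack.push(insn.equals("ADD") ? l + r : l * r);
            }
        }
        return stack.pop();
    }

    // One test, two backends: both must produce the expected value.
    static boolean agreeOn(Expr program, int expected) {
        return interpret(program) == expected && execute(compile(program)) == expected;
    }

    public static void main(String[] args) {
        // 1 + 2 * (3 + 4)
        Expr p = Expr.add(Expr.num(1), Expr.mul(Expr.num(2), Expr.add(Expr.num(3), Expr.num(4))));
        System.out.println(agreeOn(p, 15));
    }
}
```

The test data (the program and its expected value) is written once; only the way it is executed differs per backend. A bug that exists in just one backend makes agreeOn fail, whereas a bug shared by both backends is caught only because the expected value is stated independently of either of them.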

Testing an Interpreter with Xtext
The cooling language provides a way of expressing test cases for the cooling programs within the cooling language itself[11]. These tests are executed with an interpreter inside the IDE, and they can also be executed on the level of the C program, by generating the program and the test cases to C[12]. The following code shows one of the simplest possible cooling programs, as well as a test case for that program:

    cooling program EngineProgram0 for Einzonengeraet uses stdlib {
        var v: int
        event e1

        init { set v = 1 }

        start:
            entry { set v = v * 2 }
            on e1 { state s2 }

        state s2:
            entry { set v = 0 }
    }

    test EngineTest0 for EngineProgram0 {
        assert-currentstate-is ^start    // 1
        assert-value v is 2              // 2
        step                             // 3
        event e1                         // 4
        step                             // 5
        assert-currentstate-is s2        // 6
        assert-value v is 0              // 7
    }

[11] Strictly speaking, tests are a separate viewpoint to keep them out of the actual programs.

[12] Note that this approach is not restricted to testing interpreters or generators: it can also be used to test whether a program written in the DSL works correctly. This is in fact why the interpreter and the test sub-language have been built in the first place: DSL users should be able to test the programs written in the DSL.

The test first asserts that, when the program starts, it is in the start state (line 1 in the comments in the test script). We then assert that v is 2. The only reasonable way in which v can become 2 is that the code in the init block, as well as the code in the entry action of the start state, have been executed[13]. We then perform one step in the execution of the program in line 3. At this point nothing should happen, since no event was triggered. Then we trigger the event e1 (line 4) and perform another step (line 5). After this step, the program must transition to the state s2, whose entry action sets v back to 0. We assert both of these in lines 6 and 7.

[13] Note that this arrangement even checks that the init block is executed before the entry action of the start state, since otherwise v would be 0!

These tests can be run interactively from the IDE, in which case assertion failures are annotated as error marks on the program, or from within JUnit. The following piece of code shows how to run the tests from JUnit.

    @RunWith(XtextRunner.class)
    @InjectWith(CoolingLanguageInjectorProvider.class)
    public class InterpreterTests extends PKInterpreterTestCase {

        @Test
        public void testET0() throws Exception {
            testFileNoSerializer("interpreter/engine0.cool", "tests.appl", "stdparams.cool");
            runAllTestsInFile((Model) getModelRoot());
        }
    }

The code above is basically a JUnit test that inherits from a base class that helps with loading models and running the interpreter. We call the runAllTestsInFile method, passing in the model’s root element. runAllTestsInFile is defined by the PKInterpreterTestCase base class, which in turn inherits from XtextTest, which we have seen before. The method iterates over all tests in the model and executes them by creating and running a TestExecutionEngine[14].

[14] The TestExecutionEngine is a wrapper around the interpreter for cooling programs that we have discussed before.

    protected void runAllTestsInFile(Model m) {
        CLTypesystem ts = new CLTypesystem();
        EList<CoolingTest> tests = m.getTests();
        for (CoolingTest test : tests) {
            TestExecutionEngine e = new TestExecutionEngine(test, ts);
            final LogEntry logger = LogEntry.root("test execution");
            LogEntry.setMostRecentRoot(logger);
            e.runTest(logger);
        }
    }

The cooling programs are generated to C for execution in the refrigerator. To make sure the generated C code has the same semantics as the interpreter, we simply generate C code from the test cases as well. In this way the same tests are executed against the generated C code. By ensuring that all of them work in the interpreter and the generator, we ensure that both behave in the same way.
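To make the step and event semantics exercised by EngineTest0 more tangible, here is a minimal Java sketch of what an interpreter for EngineProgram0 has to do. It is not the cooling interpreter itself. The class names and the assumptions it encodes are ours: v defaults to 0, the init block runs before the entry action of the start state, and an event is queued until the next step.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical hand-written stand-in for an interpreter running EngineProgram0.
final class EngineProgram0 {
    String currentState;
    int v;                                          // var v: int (assumed to default to 0)
    private final Deque<String> pendingEvents = new ArrayDeque<>();

    EngineProgram0() {
        v = 1;                                      // init { set v = 1 }
        enter("start");                             // then enter the start state
    }

    private void enter(String state) {
        currentState = state;
        if (state.equals("start")) v = v * 2;       // start: entry { set v = v * 2 }
        if (state.equals("s2")) v = 0;              // s2: entry { set v = 0 }
    }

    // "event e1" in the test: the event is only queued (an assumption of this sketch).
    void event(String name) { pendingEvents.add(name); }

    // "step" in the test: process at most one pending event.
    void step() {
        String e = pendingEvents.poll();
        if (e == null) return;                      // no event was triggered, nothing happens
        if (currentState.equals("start") && e.equals("e1")) enter("s2");   // on e1 { state s2 }
    }
}

final class EngineTest0 {
    // Mirrors the seven lines of the DSL test; returns "ok" or a description of the first failure.
    static String run() {
        EngineProgram0 p = new EngineProgram0();
        if (!p.currentState.equals("start")) return "line 1: expected state start";
        if (p.v != 2) return "line 2: expected v == 2 but was " + p.v;
        p.step();                                   // line 3
        p.event("e1");                              // line 4
        p.step();                                   // line 5
        if (!p.currentState.equals("s2")) return "line 6: expected state s2";
        if (p.v != 0) return "line 7: expected v == 0 but was " + p.v;
        return "ok";
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because the assertions only look at the observable state (current state and variable values), the same test script could be replayed against generated C code, provided the generated program exposes that state in a comparable way.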

Testing a Generator with MPS
The following is a test case expressed using the testing extension of mbeddr C. It contributes test cases to modules[15]. testMultiply is the actual test case. It calls the to-be-tested function times2 several times with different arguments and then uses an assert statement to check for the expected value.

[15] Instead of using a separate viewpoint for expressing test cases, these are inlined into the same program in this case. However, the language for expressing test cases is a modular extension to C, to keep the core C clean.

    module UnitTestDemo {

        int32 main(int32 argc, int8*[ ] argv) {
            return test testMultiply;
        }

        test case testMultiply {
            assert (0) times2(21) == 42;
            assert (1) times2(0) == 0;
            assert (2) times2(-10) == -20;
        }

        int8 times2(int8 a) {
            return 2 * a;
        }
    }

Note that, while this unit testing extension can be used to test any C program, we use it a lot to test the generator. Consider the following example:

    assert (0) 4 * 3 + 2 == 14;

One problem we had initially in mbeddr C was to make sure that the expression tree that was created while manually entering expressions like 4 * 3 + 2 is built correctly in terms of operator precedence. If the tree was built incorrectly, the generated code could end up as 4 * (3 + 2), resulting in 20. So we’ve used tests like these to implicitly test quite intricate aspects of our language implementation[16].

[16] This is also the reason why the unit test extension was the first extension we built for C: we needed it to test many other aspects of the language.
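The connection between tree shape and test result can be shown with a small Java sketch. Note the assumption: mbeddr builds its expression trees in a projectional editor, not with a parser. The hypothetical recursive-descent parser below (PrecedenceDemo) is only a stand-in that produces a tree for the same expression, so that we can look at its shape and at the value it evaluates to.

```java
// Hypothetical recursive-descent parser for integer expressions with +, * and parentheses.
final class PrecedenceDemo {
    private final String src;
    private int pos = 0;

    private PrecedenceDemo(String src) { this.src = src.replace(" ", ""); }

    // Fully parenthesized rendering, which makes the shape of the tree visible.
    static String parseToString(String src) { return new PrecedenceDemo(src).sum().show(); }

    static int evaluate(String src) { return new PrecedenceDemo(src).sum().eval(); }

    // sum := product ('+' product)*      (lowest precedence)
    private Node sum() {
        Node left = product();
        while (peek() == '+') { pos++; left = new Node('+', left, product()); }
        return left;
    }

    // product := atom ('*' atom)*        (binds tighter than '+')
    private Node product() {
        Node left = atom();
        while (peek() == '*') { pos++; left = new Node('*', left, atom()); }
        return left;
    }

    // atom := digits | '(' sum ')'
    private Node atom() {
        if (peek() == '(') {
            pos++;                        // skip '('
            Node inner = sum();
            pos++;                        // skip ')' (no error handling in this sketch)
            return inner;
        }
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        return new Node(Integer.parseInt(src.substring(start, pos)));
    }

    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    private static final class Node {
        final char op;                    // 'n' marks a number literal
        final int value;
        final Node left, right;

        Node(int value) { this.op = 'n'; this.value = value; this.left = null; this.right = null; }

        Node(char op, Node left, Node right) { this.op = op; this.value = 0; this.left = left; this.right = right; }

        int eval() {
            if (op == 'n') return value;
            return op == '+' ? left.eval() + right.eval() : left.eval() * right.eval();
        }

        String show() {
            return op == 'n' ? Integer.toString(value) : "(" + left.show() + " " + op + " " + right.show() + ")";
        }
    }
}
```

With this sketch, parseToString("4 * 3 + 2") yields ((4 * 3) + 2) and evaluate returns 14. An implementation that grouped the operands as 4 * (3 + 2) would evaluate to 20, so a single assertion on the value is enough to detect the wrong tree shape for this input.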

We have built much more elaborate support for testing various other extensions. It is illustrative to take a look at two of them. The next piece of code shows a test for a state machine:

    exported test case test1 {
        initsm(c1);
        assert (0) isInState;
        test statemachine c1 {
            start   -> countState
            step(1) -> countState
            step(2) -> countState
            step(7) -> countState
            step(1) -> initialState
        }
    }

c1 is an instance of a state machine. After initializing it, we assert that it is in the initialState. We then use a special test statemachine statement, which consists of event/state pairs: after triggering the event (on the left side of the ->) we expect the state machine to go into the state specified on the right side of the ->. We could have achieved the same goal by using sequences of trigger and assert statements, but the syntax used here is much more concise.

The second example concerns mocking. A mock is a part of a program that can be used in place of the real one. It simulates some kind of environment of the unit under test, and it can also verify that some other part of the system under test behaves as expected[17]. We use this with the components extension. The following is a test case that checks if the client uses the PersistenceProvider interface correctly. Let’s start by taking a look at the interface:

    interface PersistenceProvider {
        boolean isReady()
        void store(DataPacket* data)
        void flush()
    }

[17] See Wikipedia for a more elaborate explanation of mocks: en.wikipedia.org/wiki/Mock_object

The interface is expected to be used in the following way: clients first have to call isReady, and only if that method returns true are they supposed to call store, and then after any number of calls to store, they have to call flush. Let us assume now we want to check if a certain client component uses the interface correctly[18]. Assuming the component provides an operation run that uses the persistence provider, we could write the following test:

    exported test case runTest {
        client.run();
        // somehow check it behaved correctly
    }

[18] Our components language actually also supports protocol state machines which support the declarative specification of valid call sequences.

To check whether the client behaves correctly, we can use a mock. Our mock specifies the incoming method calls it expects to see during the test. We have provided a mocking extension to components to support the declarative specification of such expectations. Here is the mock:

    exported mock component PersistenceMock {
        ports:
            provides PersistenceProvider pp
        expectations:
            total no. of calls: 4
            sequence {
                0: pp.isReady return false;
                1: pp.isReady return true;
                2: pp.store {
                    0: parameter data: data != null
                }
                3: pp.flush
            }
    }

The mock provides the PersistenceProvider interface, so any other component that requires this interface can use this component as the implementation. But instead of actually implementing the operations prescribed by PersistenceProvider, we specify the sequence of invocations we expect to see. We expect a total number of 4 invocations. The first one is expected to be to isReady. We return false, expecting the client to try again later. If it does, we return true and expect the client to continue with persisting data. We can now validate the mock as part of the test case:

    exported test case runTest {
        client.run();
        validate mock persistenceMock
    }
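The bookkeeping behind such a mock can be sketched in plain Java. This is not the code the mbeddr generator produces; it is a hand-written illustration under our own assumptions. The Java rendering of the interface, the class PersistenceMock, its validate method and the two example clients are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical Java rendering of the PersistenceProvider interface.
interface PersistenceProvider {
    boolean isReady();
    void store(Object data);
    void flush();
}

// Hand-written counterpart of the mock: records calls and compares them with the expected sequence.
final class PersistenceMock implements PersistenceProvider {
    private static final String[] EXPECTED = { "isReady", "isReady", "store", "flush" };
    private final List<String> calls = new ArrayList<>();
    private final List<String> violations = new ArrayList<>();
    private int readyCalls = 0;

    private void record(String call) {
        int index = calls.size();
        calls.add(call);
        if (index >= EXPECTED.length) {
            violations.add("unexpected call " + index + ": " + call);
        } else if (!EXPECTED[index].equals(call)) {
            violations.add("call " + index + ": expected " + EXPECTED[index] + " but got " + call);
        }
    }

    public boolean isReady() {
        record("isReady");
        return ++readyCalls > 1;          // first call returns false, later calls return true
    }

    public void store(Object data) {
        record("store");
        if (data == null) violations.add("store: parameter data must not be null");
    }

    public void flush() { record("flush"); }

    // Counterpart of "validate mock": true only if exactly the expected calls were seen.
    boolean validate() {
        return violations.isEmpty() && calls.size() == EXPECTED.length;
    }
}

// Two hypothetical clients: one follows the protocol, the other ignores isReady.
final class Clients {
    static void wellBehaved(PersistenceProvider pp) {
        while (!pp.isReady()) { /* not ready yet, try again */ }
        pp.store("packet");
        pp.flush();
    }

    static void careless(PersistenceProvider pp) {
        pp.store("packet");
        pp.flush();
    }
}
```

Running Clients.wellBehaved against the mock produces exactly the four expected calls, so validate returns true. Clients.careless starts with store where isReady was expected, which is recorded as a violation and makes validate return false.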

If the persistenceMock saw behavior different from the one specified above, the validate mock statement will fail, and with it the whole test[19].

[19] The generator for mock components translates the expectations into implementations of the interface methods that track and record invocations. The validate mock statement works with this recorded data to determine whether the expectations were met.

One particular challenge with this approach to semantics testing is that, if an assertion fails, you get some kind of assertion XYZ failed at ABC output from the running test case. To understand and fix the problem, you will have to navigate back to the assert statement in the DSL program. If you have many failed assertions, or just generally a lot of test program output, this can be tedious and error-prone. For example, the following piece of code shows the output from executing an mbeddr test case on the command line:

    ./TestHelperTest
    $$runningTest: running test () @TestHelperTest:test_testCase1:0#767515563077315487
    $$FAILED: ***FAILED*** (testID=0) @TestHelperTest:f:0#9125142491355884683
    $$FAILED: ***FAILED*** (testID=1) @TestHelperTest:f:1#9125142491355901742

We have built a tool in mbeddr that simplifies finding the message source. You can paste arbitrary text that contains error messages into a text area (such as the example above) on the left in Fig. 14.5. Pressing the Analyze button will find the nodes that created a particular message[20]. You can then click on the node to select it in the editor.

[20] This process is based on the unique node ID; this is the long number that follows the # in the message text.
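The first half of this analysis, extracting the node IDs from arbitrary pasted text, can be sketched in a few lines of Java. The class FailureLocator is hypothetical and is not part of mbeddr. The message shape it expects (a line starting with $$FAILED: and ending with # followed by digits) is an assumption derived solely from the sample output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that finds the node IDs of failed assertions in test output.
final class FailureLocator {
    // Assumed message shape, taken from the sample output: "$$FAILED: ... #<node id>"
    private static final Pattern FAILED = Pattern.compile("^\\$\\$FAILED:.*#(\\d+)\\s*$");

    // Returns the node IDs in the order in which the failures appear in the output.
    static List<String> failedNodeIds(String output) {
        List<String> ids = new ArrayList<>();
        for (String line : output.split("\\R")) {
            Matcher m = FAILED.matcher(line);
            if (m.matches()) ids.add(m.group(1));
        }
        return ids;
    }
}
```

The IDs are kept as strings because they are only used as lookup keys. Resolving an ID to a node and selecting that node in the editor is the part that requires the MPS APIs, which this sketch deliberately leaves out.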

Figure 14.5: The mbeddr error output analyzer parses test output and supports navigating to the source of error messages in the MPS program.

Testing Interpreters and Generators with Spoofax
Spoofax’ testing language also supports testing transformations. We use it to test interpreters, assuming that the interpreter is implemented as a transformation from programs to program results. For example, the following tests address a transformation eval-all, which interprets expressions:

    test evaluate addition       [[1+2]]       run eval-all to [[3]]
    test evaluate multiplication [[3*4]]       run eval-all to [[12]]
    test complex evaluation      [[1+2*(3+4)]] run eval-all to [[15]]

To test generators, we can rely on Spoofax’ testing support for builders. For example, the following tests use a builder generate-and-execute, which generates code from expressions, runs the code, and returns the result of the run as a string:

    test generate addition       [[1+2]]       build generate-and-execute to "3"
    test generate multiplication [[3*4]]       build generate-and-execute to "12"
    test generate evaluation 1   [[1+2*(3+4)]] build generate-and-execute to "15"

Structural Testing
What we suggested in the previous subsection tests the execution semantics of programs written in DSLs, and, if we have enough of these tests, the correctness of the transformation, generator or interpreter. However, there is a significant limitation to this approach: it only works if the DSL actually specifies behavior! If the DSL only specifies structures and cannot be executed, the approach does not work. In this case you have to perform a structural test. In principle, this is simple: you write an example model, you generate it[21], and then you inspect the resulting model or test for the expected structures. Depending on the target formalism, you can use regular expressions, XPath expressions or OCL-like expressions to automate the inspection[22].

[21] I have never seen an interpreter to process languages that only specify structures.

[22] Note that you really should only use this if you cannot use semantics testing based on execution. Inspecting the generated C code for syntactic correctness, based on the input program, would be much more work. And if we evolve the generator to generate better (faster, smaller, more robust) code, tests based on the execution semantics will still work, while those that test the structure may fail because line numbers or variable names change.

Structural testing can also be useful to test model-to-model transformations[23]. Consider the example in Section 11.2.2. There, we inserted additional states and transitions into whatever input state machine our transformation processed. Testing this via execution invariably tests the model-to-model transformation as well as the generator (or interpreter). If we wanted to test the model-to-model transformation in isolation, we have to use structural testing, because the result of that transformation itself is not yet executable. The following piece of Xtend code could be used to check that, for a specific input program, the transformation works correctly:

    // run transformation
    val tp = p.transform

    // test result structurally
    val states = tp.states.filter(typeof(CustomState))
    assert( states.filter(s|s.name.equals("EMERGENCY_STOP")).size == 1 )
    val emergencyState = states.findFirst(s|s.name.equals("EMERGENCY_STOP"))
    states.findFirst(s|s.name.equals("noCooling")).eAllContents.
        filter(typeof(ChangeStateStatement)).
        exists(css|css.targetState == emergencyState)

[23] If a program is transformed to an executable representation in several steps, then the approach discussed above tests all transformations in total, so it is more like an integration test, and not a unit test. Depending on the complexity and the reuse potential of the transformation steps, it may make sense to test each transformation in isolation.

This program first runs the transformation, and then finds all CustomStates (those that are not start or stop states). We then assert that in those states there is exactly one with the name EMERGENCY_STOP, because we assume that the transformation has added this state. We then check that in the (one and only) noCooling state there’s at least one ChangeStateStatement whose target state is the emergencyState we had retrieved above[24].

[24] Notice that we don’t write an algorithmic check that closely resembles the transformation itself. Rather, we test a specific model for the presence of specific structures. For example, we explicitly look for a state called noCooling and check that this one has the correct ChangeStateStatement.
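The same kind of structural test can be sketched in self-contained Java. The model classes are deliberately simplistic and entirely hypothetical: a State only has a name and a list of target states, and EmergencyStopTransformation is our own stand-in for the transformation from Section 11.2.2, not its actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, minimal state machine model: a state and the states it can change to.
final class State {
    final String name;
    final List<State> targets = new ArrayList<>();
    State(String name) { this.name = name; }
}

final class EmergencyStopTransformation {
    // Copies the input states, then adds one EMERGENCY_STOP state that every copy can change to.
    static List<State> transform(List<State> input) {
        List<State> result = new ArrayList<>();
        for (State s : input) result.add(new State(s.name));
        for (int i = 0; i < input.size(); i++) {
            for (State target : input.get(i).targets) {
                result.get(i).targets.add(result.get(input.indexOf(target)));
            }
        }
        State emergency = new State("EMERGENCY_STOP");
        for (State s : result) s.targets.add(emergency);
        result.add(emergency);
        return result;
    }
}

final class StructuralCheck {
    // Looks for specific structures in a specific model, without re-implementing the transformation.
    static boolean holdsFor(List<State> model) {
        List<State> emergencyStates = new ArrayList<>();
        State noCooling = null;
        for (State s : model) {
            if (s.name.equals("EMERGENCY_STOP")) emergencyStates.add(s);
            if (s.name.equals("noCooling")) noCooling = s;
        }
        return emergencyStates.size() == 1
                && noCooling != null
                && noCooling.targets.contains(emergencyStates.get(0));
    }
}
```

The check is intentionally tied to one example model (it knows about a state called noCooling). It therefore cannot accidentally repeat a mistake made in the transformation logic, at the price of covering only that one input.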

14.4 Formal Verification

Formal verification can be used in addition to semantics testing in some cases. The fundamental difference between testing and verification is this: in testing, each test case specifies one particular execution scenario. To get reasonable coverage of the whole model or transformation, you have to write and execute a lot of tests. This can be a lot of work, and, more importantly, you may not think about certain (exceptional) scenarios, and hence you may not test them. Bugs may go unnoticed.

Verification checks the whole program at once. Various non-trivial algorithms are used to do that[25], and understanding these algorithms in detail is beyond the scope of this book. Also, formal verification has inherent limitations (e.g., the halting problem) that can only be addressed by testing. So testing and verification each have sweet spots: neither can fully replace the other. However, it is very useful to know that verification approaches exist, especially since, over the last couple of years, they have become scalable enough to address real-world problems. In this section we look at two examples from mbeddr: model checking and SMT solving.

[25] These include model checking, SAT solving, SMT solving or abstract execution.

Model Checking State Machines
Model checking is a verification technique for state machines. Here is how it works in principle[26]:

• Some functionality is expressed as a state machine.

• You then specify properties about the behavior of the state machine. Properties are expressions that have to be true for every execution of the state machine[27].

• You then run the model checker with the state machine and the properties as input.

• The output of the model checker either confirms that your properties hold, or it shows a counterexample[28].

[26] In this section we can only scratch the surface; to learn more about model checking, we recommend Berard, B., Bidoit, M., Finkel, A., Laroussinie, F., Petit, A., Petrucci, L., and Schnoebelen, P. Systems and Software Verification. Springer, 2001.

[27] For example, such a property could state that whenever you go to state X, you will have been in state Y directly beforehand.

[28] It may also run out of memory, in which case you have to reformulate or modularize your program and try again.

Conceptually, the model checker performs an exhaustive search during the verification process. Obviously, the more complex your state machine is, the more possibilities the checker has to address, a problem known as state space explosion. With finite memory, this limits scalability, because at some point you will run out of memory, or the verification will run for an unacceptably long time. In reality the model checker does not perform an exhaustive search; clever algorithms have been devised that are semantically equivalent to an exhaustive search, but don’t actually perform one[29]. This makes model checking scalable and fast enough for real-world problems, although there is still a limit in terms of input model complexity[30].

[29] For example, bounded model checking searches for counterexamples only for a bounded set of execution steps. If no counterexample is found within the bounds, the model checker assumes (rightly or not) that the property holds.

[30] Sometimes input models have to be reformulated in a way that makes them better suited for model checking. All in all, model checking is not a trivial technique. However, if and when it is better integrated with the development tools (as we show here), it could be used in many places where today it isn’t even considered.

The interesting aspect of model checking is that the properties you specify are not just simple Boolean expressions such as each state must have at least one outgoing transition, unless it is a stop state. Such a check can be performed statically, as part of the constraint checks. The properties addressed by model checkers are more elaborate and are typically expressed in (various flavors of) temporal logic[31]. Here are some examples, expressed in plain English:

[31] There are various formalisms for temporal logic, including LTL, CTL and CTL+. Each of those have particular characteristics and limitations, but they are beyond the scope of this book.

• It is always true that after we have been in state X we will eventually reach state Y. This is a Fairness property. It ensures that the state machine does not get stuck in some state forever. For example, state Y may be the green light for pedestrians, and state X could be the green light for cars.

• Wherever we are in the state machine, it is always possible to get into state X. This is a Liveness property. An example could be a system that you must always be able to turn off.

• It is not ever possible to get into state X without having gone through state Y before. This is a Safety property. Imagine a state machine where entering state X turns the pedestrian lights green and entering state Y turns the car lights red.

The important property of these temporal logic specifications is that quantifiers such as always, whenever and there exists are available. Using these, one can specify global truths about the execution of a system, rather than about its structure[32].

[32] If a state machine has no guard conditions, some of these properties can be reduced to problems that can be solved by "just looking", i.e. by inspecting the structure of the state machine. For example, if the only transition entering some state X originates from some state Y, then it is clear that we always come through Y before entering X. However, in the presence of several transitions and guard conditions (which may take into account all kinds of other things, such as the values of variables or event arguments), these verifications become much more complex and cannot be solved by "just looking".

Model checking does come with its challenges. The input language for specifying state machines as well as specifying the properties is not necessarily easy to work with. Interpreting the results of the model checker can be a challenge. And for some of the tools, the usability is really bad[33].

[33] The SPIN/Promela model checker comes to mind here!

To make model checking more user friendly, the mbeddr C language provides a nice syntax for state machines, then generates the corresponding representation in the input language of the model checker[34]. The results of running the model checker are also reinterpreted in the context of the higher-level state machine. Tool integration is provided as well: users can select the context menu on a state machine and invoke the model checker. The model checker input is generated, the model checker is executed, and the replies are rendered in a nice table in MPS. Finally, we have abstracted the property specification language by providing support for the most important idioms[35]; these can be specified relatively easily (for example never or always eventually reachable). Also, a number of properties are automatically checked for each state machine.

[34] We use the NuSMV model checker: nusmv.fbk.eu/.

[35] Taken from the well-known properties patterns collection at patterns.projects.cis.ksu.edu/

Let’s look at an example. The following code shows a state machine that represents a counter. We can send the step event into the state machine, and as a consequence, it increments the currentVal counter by the size parameter passed with the event. If the currentVal becomes greater than LIMIT, the counter wraps around. We can also use the start event to reset the counter to 0.

    verifiable statemachine Counter {
        in events
            start()
            step(int[0..10] size)
        local variables
            int[0..100] currentVal = 0
            int[0..100] LIMIT = 10
        states ( initial = initialState )
        state initialState {
            on start [ ] -> countState { }
        }
        state countState {
            on step [currentVal + size > LIMIT] -> initialState { }
            on step [currentVal + size <= LIMIT] -> countState { currentVal = currentVal + size; }
            on start [ ] -> initialState { }
        }
    }

Since this state machine is marked as verifiable, we can run the model checker from the context menu[36]. Fig. 14.6 shows the result of running the model checker. Here is a subset of the properties it has checked successfully (it performs these checks for all states/transitions by default):

    State 'initialState' can be reached                            SUCCESS
    Variable 'currentVal' is always between its defined bounds     SUCCESS
    State 'countState' has deterministic transitions               SUCCESS
    Transition 0 of state 'initialState' is not dead               SUCCESS

[36] Marking a state machine as verifiable also enforces a few restrictions on the state machine (for example, each state machine local variable may only be assigned once during a transition). State machines restricted in this way are easier to model check.

The first one reports that NuSMV has successfully proven that the initialState can be reached somehow. The second one reports that the variable currentVal stays within its bounds (notice how currentVal is a bounded integer). The third one reports that in countState it never happens that more than one transition is ready to fire at any time. Finally, it reports that no transitions are dead, i.e. each of them is actually used at some point.

Let’s provoke an error. We change the two transitions in countState to the following:

    on step [currentVal + size >= LIMIT] -> initialState { }
    on step [currentVal + size <= LIMIT] -> countState { currentVal = currentVal + size; }

Figure 14.6: The result of running the model checker on a verifiable state machine is directly shown in the IDE, listing each property in a table. If a property fails, the lower part of the result view shows an example execution of the state machine (states and local variable values) that leads to the property being violated.

We have changed the > to a >= in the first transition[37]. Running the model checker again, we get, among others, the following messages:

    State 'countState' contains nondeterministic transitions       FAIL 4

[37] You perhaps don’t even recognize the difference, which is exactly why you would want to use a formal verification!

This means that there is a case in which the two transitions are non-deterministic, i.e. both are possible based on the guard, and it is not clear which one should be fired. The 4 at the end means that the execution trace to this problem contains four steps. Clicking on the failed property check reveals the problematic execution trace:

    State initialState
        LIMIT        10
        currentVal   0
    State initialState
        in_event: start   start()
        LIMIT        10
        currentVal   0
    State countState
        in_event: step    step(10)
        LIMIT        10
        currentVal   0
    State initialState
        LIMIT        10
        currentVal   10

This is one (of potentially many) execution traces of this state machine that leads to the non-determinism: currentVal + size is 10, and because of the >=, both guards are true, so both transitions could fire. In addition to these default properties, it is also possible to specify custom properties. Here are two examples, expressed using the property patterns mentioned earlier:

    verification conditions
        never LIMIT != 10
        always eventually reachable initialState

The first one expresses that we want the model checker to prove that a specific Boolean condition will never be true. In our example, we check that the LIMIT really is a constant and is never (accidentally) changed. The second one specifies that wherever we are in the execution of the state machine, it is still possible (after an arbitrary number of steps) to reach the initialState. Both properties hold for the example state machine.
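For a state machine as small as Counter, the conceptual idea of an exhaustive search can be written down directly. The following Java sketch is not how NuSMV works internally, and it is not generated by mbeddr. It is a hypothetical explicit-state exploration (CounterChecker) that enumerates all reachable combinations of state and currentVal and checks three of the default properties: reachability of states, the bounds of currentVal, and determinism of the step transitions. LIMIT is modeled as a Java constant, so the custom properties shown above are not covered.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical explicit-state checker for the Counter state machine.
final class CounterChecker {
    static final int LIMIT = 10;

    final Set<String> reachableStates = new HashSet<>();
    int maxCurrentVal = 0;
    String nondeterminismWitness = null;   // first configuration found with two enabled step transitions

    // strictGuard = true models the original guard (>), false models the modified one (>=).
    static CounterChecker explore(boolean strictGuard) {
        CounterChecker result = new CounterChecker();
        Set<String> seen = new HashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.add("initialState:0");
        while (!work.isEmpty()) {
            String config = work.poll();
            if (!seen.add(config)) continue;             // configuration already explored
            int colon = config.indexOf(':');
            String state = config.substring(0, colon);
            int val = Integer.parseInt(config.substring(colon + 1));
            result.reachableStates.add(state);
            result.maxCurrentVal = Math.max(result.maxCurrentVal, val);
            if (state.equals("initialState")) {
                work.add("countState:" + val);           // on start [ ] -> countState
                continue;
            }
            work.add("initialState:" + val);             // on start [ ] -> initialState
            for (int size = 0; size <= 10; size++) {     // step(int[0..10] size)
                boolean wraps = strictGuard ? val + size > LIMIT : val + size >= LIMIT;
                boolean counts = val + size <= LIMIT;
                if (wraps && counts && result.nondeterminismWitness == null) {
                    result.nondeterminismWitness =
                            "countState, currentVal=" + val + ", step(" + size + ")";
                }
                if (wraps) work.add("initialState:" + val);
                if (counts) work.add("countState:" + (val + size));
            }
        }
        return result;
    }
}
```

With the original guard, the exploration finds no configuration in which both step transitions are enabled. With >=, the first witness it reports is countState with currentVal 0 and step(10), which corresponds to the trace shown above. The search terminates because there are only finitely many configurations. A real model checker avoids enumerating them one by one, which is what makes it applicable to much larger state spaces.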

SAT/SMT Solving

SAT solving, which is short for satisfiability solving, concerns the satisfiability of sets of Boolean equations. Users specify a set of Boolean equations and the solver tries to assign truth values to the free variables so as to satisfy all specified equations. SAT solving is an NP-complete problem, so there is no analytic approach: exhaustive search (implemented, of course, in much cleverer ways) is the way to address these problems. SMT solving (Satisfiability Modulo Theories) is an extension of SAT solving that allows other constructs in addition to logical operators; the most frequently used are linear arithmetic, arrays and bit-vectors.

As an example, SMT solving can be used to verify mbeddr’s decision tables. A decision table has a set of Boolean conditions as row headers, a set of Boolean conditions as column headers, as well as arbitrary values in the content cells. A decision table essentially represents nested if statements: the result value of the table is that content cell whose row and column headers are both true. Fig. 14.7 shows an example. SMT solving can be used to check whether all cases are handled. It can detect whether combinations of the relevant variables exist for which no combination of row header and column header expressions matches; in this case, the decision table would not return any value.

Figure 14.7: An example decision table in mbeddr C. SMT solving is used to check it for consistency and completeness.

SAT and SMT solvers have some of the same challenges as model checkers regarding scalability: a low-level and limited input language and the challenge of interpreting and understanding the output of a solver. Hence we use the same approach to solve the problem: from higher-level models (such as the decision table) we generate the input to the solver, run it, and then report the result in the context of the high-level language.
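As a rough illustration of the completeness and consistency checks on a decision table, the following Python sketch enumerates a small, bounded input domain by brute force. This deliberately swaps in exhaustive enumeration for an SMT solver, which would reason symbolically over unbounded values. The table contents, the inputs x and y, and the helper names are all made up for this example.

```python
from itertools import product

# Hypothetical decision table: row and column headers are Boolean conditions
# over the inputs (x, y); a cell is selected when its row and column both hold.
ROWS = [lambda x, y: x < 0, lambda x, y: x > 0]   # note: the case x == 0 is missing
COLS = [lambda x, y: y <= 5, lambda x, y: y > 5]


def selected_cells(rows, cols, x, y):
    """Count how many (row, column) cells are selected for the input (x, y)."""
    return sum(1 for r in rows for c in cols if r(x, y) and c(x, y))


def uncovered_inputs(rows, cols, domain):
    """Completeness: inputs for which the table would return no value."""
    return [(x, y) for x, y in product(domain, repeat=2)
            if selected_cells(rows, cols, x, y) == 0]


def ambiguous_inputs(rows, cols, domain):
    """Consistency: inputs for which more than one cell is selected."""
    return [(x, y) for x, y in product(domain, repeat=2)
            if selected_cells(rows, cols, x, y) > 1]
```

With the domain range(-3, 4), uncovered_inputs(ROWS, COLS, range(-3, 4)) lists exactly the inputs with x == 0, which mirrors the kind of counterexample a solver-backed check would report in terms of the table.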

Model Checking and Transformations

A problem with model verification approaches in general is that they verify only the model. They can detect inconsistencies or property violations as a consequence of flaws in the program expressed with a DSL. However, even if we find no flaws in the model on the DSL level, the generator or interpreter used to execute the program may still introduce problems. In other words, the behavior of the actual running system may be different from the (proven correct) behavior expressed in the model. There are three ways to address this:

• You can test your generator manually using the strategies suggested in this chapter. Once you trust the generator based on a sufficiently large set of tests, you then only have to verify the models, since you know they will be translated correctly.

• Some tools, for example the UPPAAL model checker [38], can also generate test cases [39]. These are stimuli to the model, together with the expected reactions. You can generate those into your target language and then run them in your target language. This is essentially an automated version of the first approach.

• Finally, you can verify the generated code. For example, there are model checkers for C. You can then verify that the properties that hold on the DSL level also hold on the level of the generated code. This approach runs into scalability issues relatively quickly, since the state space of a C program is much larger than the state space of a well-crafted state machine [40]. However, you can use this approach to verify the generated code based on a sufficient set of relatively small test cases, making sure that these cover all aspects of the generator. Once you’ve built trust in the generator in this way, you can resort to verifying just the DSL models (which scales better).

[38] www.uppaal.com/

[39] One can generate test cases by using the model checking technology: just specify that some property is false, and the model checker will provide a counterexample that illustrates the execution of the state machine up to the point where the property is true.

[40] Remember that we use formalisms such as state machines instead of low-level code specifically to allow more meaningful validation.

14.5 Testing Editor Services

Testing IDE services such as code completion (beyond scopes), quick fixes, refactorings or outline structure has some of the challenges of UI testing in general. There are three ways of approaching this:

• The language workbench may provide specific APIs to hook into UI aspects to facilitate writing tests for those.

• You can use generic UI testing tools [41] to simulate typing and clicking in the editor, and checking the resulting behavior.

• Finally, you can isolate the algorithmic aspects of the IDE behavior (e.g., in refactorings or quick fixes) into separate modules (classes) and then unit test those with the techniques discussed in the rest of this chapter, independent of the actual UI.

[41] Such as Eclipse Jubula: www.eclipse.org/jubula/

In practice, I try to use the third alternative as much as possible: for non-trivial IDE functionality in quick fixes and refactorings, I isolate the behavior and write unit tests. For simple things I don’t do any automated tests. For the actual UI, I typically don’t do any automated tests at all, for three reasons: (1) it is simply too cumbersome and not worth the trouble; (2) as we use the editor to try things out, we implicitly test the UI; and (3) language workbenches are frameworks which, if you get the functionality right (via unit tests), provide generic UIs that can be expected to work.

In the remainder of this subsection we show examples of the case in which the language workbench provides specific APIs to test the IDE aspects of languages.
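Before turning to those workbench-specific APIs, here is a minimal sketch of the third alternative: the algorithmic core of a rename refactoring is written as a pure function on a toy model and tested without any editor. The Entity class, rename_entity and the test are hypothetical names chosen for illustration; a real language workbench would operate on its own AST types.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Toy model element: an entity with properties (property name -> type name)."""
    name: str
    properties: dict = field(default_factory=dict)


def rename_entity(entities, old_name, new_name):
    """Pure rename refactoring: returns a new model and touches no editor state."""
    if any(e.name == new_name for e in entities):
        raise ValueError(f"name clash: {new_name}")
    renamed = []
    for e in entities:
        # Update every property whose type refers to the renamed entity.
        props = {p: (new_name if t == old_name else t)
                 for p, t in e.properties.items()}
        renamed.append(Entity(new_name if e.name == old_name else e.name, props))
    return renamed


def test_rename_updates_references():
    """Unit test for the refactoring logic, independent of any UI."""
    model = [Entity("Customer"), Entity("Contract", {"client": "Customer"})]
    result = rename_entity(model, "Customer", "Client")
    assert result == [Entity("Client"), Entity("Contract", {"client": "Client"})]
    # The input model is left untouched, so the UI layer can offer a preview.
    assert model[0].name == "Customer"
```

The UI-specific part (finding the selected element, asking for the new name, applying the result to the editor) then shrinks to a thin wrapper that is cheap to try out by hand.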

An Example with MPS

In a parser-based system, you can always type anything. So even if the IDE functionality (particularly regarding code completion) is broken, you can still type the desired code. Also, the editing experience of typing code is always the same, fundamentally: you type linear sequences of characters. In a projectional editor, this is not the case: you can only enter things that are available in the code completion menu, and the editing experience itself relies on the editor implementation [42]. Hence it is important to be able to test editor behavior.

MPS supports this with a special DSL for editor testing (see Fig. 14.8). Note how MPS’ language composition facilities allow embedding the subject DSL into the DSL for describing the editor test case.

[42] For example, you can only type 1 + 2 linearly (e.g. first the 1, then the +) if the respective right transformation for number literals is implemented.

Figure 14.8: This test tests whether code completion works correctly. We start with an "empty" variable declaration in the before slot. It is marked with cell, a special annotation used in UI tests to mark the editor cell that has the focus for the subsequent scripted behavior. In the result slot, we describe the state of the editor after the script code has been executed. The script code then simulates typing the word myVariable, pressing TAB, pressing CTRL-SPACE, typing boo (as a prefix of boolean) and pressing ENTER.

An Example with Xtext/Xpect

Xpect is a framework for integration testing of Xtext DSLs developed by Moritz Eysholdt [43]. It can be used for testing various language aspects, among them, for example, code completion [44]. It does so by embedding test expectations as comments inside the program to be tested.

[43] https://github.com/meysholdt/Xpect

[44] Xtext itself comes with a set of helpers for IDE service testing such as content assist or builder tests. They run with, and often even without, an SWT display (i.e., headless).

Here is an example based on a Hello World grammar (literally):

Model: greetings+=Greeting*;

Greeting: 'Hello' name=ID '!';

The following piece of code shows an example program that includes Xpect statements that test whether code completion works as expected:

// XPECT_TEST org.example.MyJUnitContentAssistTest END_TEST

// XPECT contentAssist at |Hel --> Hello
Hello Peter!

// XPECT contentAssist at |! --> !
Hello Heiko!

The Xpect processor processes all comments that start with XPECT. In this case, we test the content assist (i.e., code completion) functionality. Let us look at the details:

• The contentAssist refers to the kind of test to be executed (details on this below).

• at is a keyword for improved readability and has no semantic impact.

• |Hel and |! instruct the test to search for occurrences of Hel and ! somewhere in the code after the XPECT statement. The pipe | marks the assumed cursor position relative to Hel and ! where the content assist should be triggered.

• The part after --> marks the expectation of the test. In the first test, content assist is expected to suggest the keyword Hello, and in the second test the exclamation point is expected.
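The mechanics described in this list can be sketched in a few lines of Python. This is only an illustration of the idea (locate the cursor offset, call a named test method, compare strings); it is not Xpect's Java implementation or API. The regular expression, parse_expectations, run_expectations and the toy content assist are all hypothetical.

```python
import re

# One expectation comment: "// XPECT <method> at <marker> --> <expected>"
XPECT_LINE = re.compile(
    r"//[ \t]*XPECT[ \t]+(?P<method>\w+)[ \t]+at[ \t]+(?P<marker>\S+)"
    r"[ \t]+-->[ \t]*(?P<expected>.*)"
)


def parse_expectations(document):
    """Return [(method, cursor_offset, expected), ...] for all XPECT comments."""
    results = []
    for match in XPECT_LINE.finditer(document):
        marker = match.group("marker")
        pipe = marker.index("|")            # cursor position within the marker
        needle = marker.replace("|", "")
        # Search for the marker text in the code after the XPECT comment.
        found = document.index(needle, match.end())
        results.append((match.group("method"), found + pipe,
                        match.group("expected").strip()))
    return results


def run_expectations(document, test_methods):
    """Call the named test method per expectation and compare plain strings."""
    failures = []
    for method, offset, expected in parse_expectations(document):
        actual = test_methods[method](document, offset)
        if actual != expected:
            failures.append((method, offset, expected, actual))
    return failures


def toy_content_assist(document, offset):
    """Stand-in for real content assist of the Hello World grammar."""
    line_start = document.rfind("\n", 0, offset) + 1
    return "Hello" if offset == line_start else "!"


DOC = """// XPECT contentAssist at |Hel --> Hello
Hello Peter!

// XPECT contentAssist at |! --> !
Hello Heiko!
"""
```

Calling run_expectations(DOC, {"contentAssist": toy_content_assist}) returns an empty list of failures for this document. Because the comparison is purely string-based, adding a new kind of test amounts to adding another entry to the method table.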

Xpect is in fact a generic infrastructure for integration tests. As you can see from the example above, the test references a JUnit test class [45]: org.example.MyJUnitContentAssistTest. The term contentAssist is actually the name of a test method inside that class. Everything from an Xpect comment after the XPECT keyword is passed as parameters into the test method. The test method can do whatever it wants as long as it produces a string as the output. This string is then compared with the expectation, the text behind the -->. While contentAssist is predefined in Xpect-provided unit test base classes, you can define your own methods. Since the actual testing is based on string comparison, the system is easily extensible.

[45] Xpect implements a custom JUnit runner which allows you to execute Xpect tests as JUnit tests; integration into IDEs such as Eclipse and CI environments is ensured.

[46] The general idea behind Xpect is the separation of test data, test expectations and test setup from implementation details. The test data consists of DSL documents written in the language that you want to test. The test expectations are anything you might want the test to verify and which can be expressed as a string. The setup may declare other DSL documents that the test depends on, including Eclipse project setups. Since all these details are hidden, the DSL user can potentially understand or even write the test cases, not just the DSL developer.

The following language aspects can be tested with Xpect [46]:

• The AST created from a DSL document.

• Messages and locations of error and warning markers.

• Names returned by scopes.

• Proposal items suggested by content assist features (as the example above shows).

• Textual diffs that were created by applying refactorings or quick fixes.

• Textual output of a code generator (but use with caution, since generated text may be too fragile).

• Results from interpreters or execution of generated code.

By embedding test expectations into the subject programs, Xpect implicitly solves the navigation problem [47] by allowing you to use the | to select offsets inside a DSL document. Xpect also makes it easy to locate the offending expectation in case a test fails: since all expectations are represented as strings, if a test fails, the Eclipse JUnit view provides a comparison dialog that shows all differences between the actual test result and the test expectation. Xpect also makes it easy to specify even large test data scenarios, possibly consisting of multiple languages and multiple Eclipse projects. Finally, since Xpect code is embedded into the DSL code to be tested, you can use your DSL’s Xtext editor to edit your test data. In plain JUnit tests you would have to embed snippets of your DSL documents into Java string literals, which won’t provide any tool support for your language at all [48].

[47] If you want to test language properties such as content assist or scoping, you will have to navigate/point/refer to a model element after you have created the test data. This leads to boilerplate code in Java-based tests.

[48] Notice, however, that when editing the program to be tested, there is no tool support for the Xpect syntax and the expectations. The reason is that Xtext does not support language embedding: there is no way to easily define a composed language from the subject DSL and Xpect. While this limitation is not a problem for Xpect itself (after all, its syntax is extremely simple), it may be a problem for expectations with a more complex syntactic structure. Of course, a specialized editor could be developed (based on the default Xtext editor) that provides code completion for the Xpect code in the DSL program comments. But that would require hand-coding and would be quite a bit of work.

An Example with Spoofax

Spoofax’ testing language supports testing editor services such as reference resolution and content completion. For reference resolution, we mark a definition and a use site in a test case with [[...]]. We can refer to these markers by numbers #1 and #2, specifying which marked element should refer to the other marked element. For example, the following test cases mark the name of an entity A in its declaration and in the type of a property:

test entity type reference (1) [[
  module foo

  entity [[A]] {}
  entity B {
    a: [[A]]
  }
]] resolve #2 to #1

test entity type reference (2) [[
  module foo

  entity B {
    a: [[A]]
  }

  entity [[A]] {}
]] resolve #1 to #2

The first test case addresses a backward reference, where the second marked name should resolve to the first marked name. The second test case addresses a forward reference, where the first marked name should resolve to the second marked name.

For content completion, we mark only one occurrence, and specify one of the expected completions:

test entity type reference (1) [[
  module foo

  entity SomeEntity {}

  entity A {
    a: [[S]]
  }
]] complete to "String"

test entity type reference (2) [[
  module foo

  entity SomeEntity {}

  entity A {
    a: [[S]]
  }
]] complete to "SomeEntity"

Refactorings are tested in a similar fashion. The selected part of the code is indicated with square brackets, and the name of the refactoring is specified in the test:

test Rename refactoring [[
  entity [[Customer]] {

  }
  entity Contract {
    client : Customer
  }
]] refactor rename("Client") to [[
  entity Client {

  }
  entity Contract {
    client : Client
  }
]]
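To illustrate how such marker-based tests can be evaluated, the following Python sketch strips the inner [[...]] markers from a fragment, remembers their positions, and checks a "resolve #i to #j" expectation against a toy resolver. This is an assumption-laden illustration, not Spoofax's implementation: extract_markers, toy_resolve and check_resolve are hypothetical, and the outer [[ ]] delimiters of the test are assumed to have been removed already.

```python
import re

MARKER = re.compile(r"\[\[(.*?)\]\]")


def extract_markers(fragment):
    """Strip [[...]] markers; return (plain_text, [(start, end), ...])."""
    parts, spans, last, length = [], [], 0, 0
    for m in MARKER.finditer(fragment):
        before = fragment[last:m.start()]
        parts.append(before)
        length += len(before)
        # Position of the marked text within the marker-free program.
        spans.append((length, length + len(m.group(1))))
        parts.append(m.group(1))
        length += len(m.group(1))
        last = m.end()
    parts.append(fragment[last:])
    return "".join(parts), spans


def toy_resolve(text, span):
    """Toy resolver: find the 'entity <name>' declaration for the name at span."""
    name = text[span[0]:span[1]]
    decl = re.search(r"entity\s+(%s)\b" % re.escape(name), text)
    return decl.span(1) if decl else None


def check_resolve(fragment, use, definition):
    """Check 'resolve #use to #definition' (marker numbers are 1-based)."""
    text, spans = extract_markers(fragment)
    return toy_resolve(text, spans[use - 1]) == spans[definition - 1]


BACKWARD = "module foo\n\nentity [[A]] {}\nentity B {\n  a: [[A]]\n}\n"
FORWARD = "module foo\n\nentity B {\n  a: [[A]]\n}\n\nentity [[A]] {}\n"
```

Here check_resolve(BACKWARD, 2, 1) and check_resolve(FORWARD, 1, 2) correspond to the two reference resolution tests shown earlier in this section; a completion or refactoring test could reuse extract_markers to find the selection.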

14.6 Testing for Language Appropriateness

A DSL is only useful if it can express what it is supposed to express. A bit more formally, one can say that the coverage of the DSL relative to the target domain should be 100%. In practice, this question has many more facets, though:

• Do we actually understand completely the domain the DSL is intended to cover?

• Can the DSL cover this domain completely? What does "completely" even mean? Is it ok to have parts of the system written in LD−1, or do we have to express everything with the DSL?

• Even if the DSL covers the domain completely: are the abstractions chosen appropriate for the model purpose?

• Do the users of the DSL like the notation? Can the users work efficiently with the notation?

It is not possible to answer these questions with automated tests. Manual reviews and validation relative to the (explicit or tacit) requirements for the DSL have to be performed. Getting these aspects right is the main reason why DSLs should be developed incrementally and iteratively.

15 Debugging DSLs

Debugging is relevant in two ways in the context of DSLs and language workbenches. First, the DSL developer may want to debug the definition of a DSL, including constraints, scopes or transformations and interpreters. Second, programs written in the DSL may have to be debuggable by the end user. We address both aspects in this chapter.

15.1 Debugging the DSL Definition

Debugging the definition of the DSL boils down to a language workbench providing a debugger for the languages used for language definition. In this section we look at understanding and debugging the structure and concrete syntax, the definition of scopes, constraints and type systems, as well as debugging interpreters and transformations.

15.1.1 Understanding and Debugging the Language Structure

In parser-based systems, the transformation from text to the AST performed by the parser is itself a non-trivial process and has a potential for errors. Debugging the parsing process can be important.

Xtext

Xtext uses ANTLR [1] under the hood. In other words, an ANTLR grammar is generated from the Xtext grammar, and it performs the actual parsing [2]. So understanding and debugging the Xtext parsing process means understanding and debugging the ANTLR parsing process.

[1] antlr.org

[2] It contains actions that construct the AST based on the mapping expressed in the Xtext grammar.

There are two ways to do this. First, since ANTLR generates a Java-based parser, you can debug the execution of ANTLR (as part of Xtext) itself [3]. Second, you can have Xtext generate a debug grammar, which contains no action code (so it does not populate the AST). However, it can be used to debug the parsing process with ANTLRWorks [4]. ANTLRWorks comes with an interactive debugger for ANTLR grammars.

[3] For obvious reasons, this is tedious and really just a last resort.

[4] www.antlr.org/works

MPS

In MPS there is no transformation from text to the AST since it is a projectional editor. However, there are still means of helping to better understand the structure of an existing program. For example, any program element can be inspected in the Explorer. Fig. 15.1 shows the explorer contents for a trivial C function:

int8 add(int8 x, int8 y) {
  return x + y;
}

Figure 15.1: The MPS explorer shows the structure of a program as a tree. The explorer also shows the concept for each program element as well as the type, if an element has one.

MPS provides similar support for understanding the projection rules. For any program node MPS can show the cell structure as a tree. The tree contains detailed information about the cell hierarchy, the program element associated with each cell, and the properties of the cell (height, width, etc.).

15.1.2 Debugging Scopes, Constraints and Type Systems

Depending on the level of sophistication of a particular language, a lot of non-trivial behavior can be contained in the code that determines scopes, checks constraints or computes types. In fact, in many languages, these are the most sophisticated aspects of language definition. Consequently, there is a need for debugging those.

Xtext

In Xtext all aspects of a language except the grammar and the abstract syntax are defined via Java [5] programs using Xtext APIs. This includes scopes, constraints, type system rules and all IDE aspects. Consequently, all these aspects can be debugged by using a Java debugger. To do this, you can simply launch the Eclipse Application that contains the language and editor in debug mode and set breakpoints at the relevant locations [6].

[5] You may also use other JVM languages such as Xtend. If such a language has a debugger, then you can obviously also use that debugger for debugging Xtext DSL implementations.

[6] It is easy to criticize Xtext for the fact that it does not use DSLs for defining DSLs. However, in the context of debugging this is good, because no special debuggers are necessary.

MPS

MPS comes with a similar facility, in the sense that a second instance of MPS can be run "inside" the current one. This inner instance can be debugged from the outer one. This approach can be used for all those aspects of MPS-defined languages that are defined in terms of the BaseLanguage, MPS’ version of Java. For example, scopes can be debugged this way: in Fig. 15.2 we debug the scope for a LocalVariableRef.

Figure 15.2: The debugger that can debug MPS while it "executes" a language is aware of all the relevant extensions to BaseLanguage. For example, in this screenshot we debug a scope constraint. Notice how in the Variables view program nodes (such as the Statement s) are shown on the abstraction level of the node, not in terms of its underlying Java data structure representation.

A related feature of MPS is the ability to analyze exception stack traces. To implement a language, MPS generates Java code from language definitions and then executes this Java code. If an exception occurs in language implementation code it produces a Java stack trace. This stack trace can be pasted into a dialog in MPS. MPS then produces a version of the stack trace in which the code locations in the stack trace (which are relative to the generated Java) have been translated to locations in the DSL definition (expressed in BaseLanguage). The locations can be clicked directly, opening the MPS editor at the respective location.

Relative to the type system, MPS comes with two dedicated debug facilities (beyond debugging a new instance of MPS inside MPS mentioned above). First, pressing Ctrl-Shift-T on any program element will open a dialog that shows the type of the element. If the element has a type system error, the dialog also lets the user navigate to the rule that reported the error. The second facility is much more sophisticated. For any program node, MPS can show the type system trace (Fig. 15.3 shows a simple example). Remember how the MPS type system relies on a solver to solve the type system equations associated with program elements (specified by the language developer for the respective concepts). This means that each program has an associated set of type system equations, which contain explicitly specified types as well as type variables. The solver tries to find type values for these variables such that all type system equations become true. The type system trace essentially visualizes the state of the solver, including the values it assigns to type variables, as well as which type system rules are applied to which program element [7].

Figure 15.3: This example shows the solver state for the Argument x. It first applies the rule typeof_ITyped (Argument implements ITyped), which expresses that the type of the element (type variable c) is the same as the element’s type property (type variable d). It then applies the typeof_Type rule to the argument’s type itself. This rule expresses the fact that the type of a Type is a clone of itself. Consequently, the type variable d can be set to int8. In consequence this means that the type variable c (which represents the type of the Argument) is also int8. Note that this is a trivial example. Type system traces can become quite involved and are not always easy to understand.

[7] In the design part we discussed how declarative languages may come with a debugger that fits the particular declarative paradigm used by a particular declarative language (Section 5). The type system trace is an example of this idea.
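The idea of a solver that records which rule bound which type variable can be sketched as follows. This is a deliberately tiny equation solver in Python, written for illustration only; it is not the MPS type system engine. The rule names and the variables c and d are taken from the figure description above, while the solve function and the trace format are assumptions.

```python
def solve(variables, equations):
    """Solve equations of the form (rule, left, right) and record a trace.

    Each side is either a type variable (a name in `variables`) or a concrete
    type name. Returns (assignment, trace).
    """
    binding = {}   # type variable -> variable or concrete type
    trace = []

    def find(term):
        # Follow bindings until an unbound variable or a concrete type is reached.
        while term in binding:
            term = binding[term]
        return term

    for rule, left, right in equations:
        l, r = find(left), find(right)
        if l == r:
            trace.append(f"{rule}: {left} == {right} already holds")
        elif l in variables:
            binding[l] = r
            trace.append(f"{rule}: {l} := {r}")
        elif r in variables:
            binding[r] = l
            trace.append(f"{rule}: {r} := {l}")
        else:
            trace.append(f"{rule}: type error, {l} != {r}")
    return {v: find(v) for v in variables}, trace


# Equations for the Argument x of type int8, as described for Fig. 15.3.
EQUATIONS = [
    ("typeof_ITyped", "c", "d"),    # type of the element == its type property
    ("typeof_Type", "d", "int8"),   # the type of a Type is (a clone of) itself
]
```

Calling solve({"c", "d"}, EQUATIONS) binds both variables to int8 and returns a two-line trace, one entry per applied rule. A debugger in the style of the type system trace would attach each trace entry to the program element the rule was applied to.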

15.1.3 Debugging Interpreters and Transformations

Debugging an interpreter is simple: since an interpreter is just a program written in some programming language that processes and acts on the DSL program, debugging the interpreter simply uses the debugger for the language in which the interpreter is written (assuming there is one) [8].

Debugging transformations and generators is typically not quite as trivial, for two reasons. First, transformations and generators are typically written in DSLs optimized for this task. So a specialized debugger is required. Second, if multi-step transformations are used, the intermediate models may have to be accessible, and it should be possible to trace a particular element through the multi-step transformation.

[8] While it is technically simple to debug an interpreter, it is not necessarily simple to follow what’s going on, because the interpreter is a meta program. We discuss this in Section 4.3.

Xtext

Xtext can be used together with any EMF-compatible code generator or transformation engine. However, since Xtext ships with Xtend, we look at debugging Xtend transformations. Model-to-model transformations and code generators in Xtend look very similar: both use Xtend to navigate over and query the model, based on the AST. The difference is that, as a side effect, model-to-model transformations create new model elements and code generators create strings, typically using rich strings (aka template expressions).

Any Xtend program can be debugged using Eclipse "out of the box". In fact, you can debug an Xtend program either on the Xtend level or on the level of the generated Java source [9]. Since interpreters and generators are just regular Xtend programs, they can be debugged in this way as well. Fig. 15.4 shows an example of debugging a template expression. Xtend is fundamentally an object-oriented language, so the step-through metaphor for debuggers works. If Xtend is used for code generation or transformation, debugging boils down to stepping through the code that builds the target model [10].

[9] Xtend generates Java source code that is subsequently compiled.

[10] I emphasize this because the next example uses a different approach.

MPS

In MPS, working with several chained transformations is normal, so MPS provides support for debugging the transformation process. This support includes two ingredients. The first one is showing the mapping partitioning. For any given model, MPS automatically computes the order in which transformations are executed, based on the relative priorities specified for the generators involved. The mapping partitioning reports the overall transformation schedule to the user.

Figure 15.4: Debugging a code generator written in Xtend. The debugger can even step through the template expressions. The Variables view shows the EMF representation (i.e. the implementation) of program elements.

This is useful in understanding which transformations are executed in which order, and in particular, to debug transformation priorities. Let us investigate a simple example C program that contains a message definition and a report statement. The report statement is transformed to printf statements:

module Simple imports nothing {

  message list messages {
    INFO aMessage() active: something happened
  }

  exported int32 main(int32 argc, int8*[] argv) {
    report (0) messages.aMessage();
    return 0;
  }
}

Below is the mapping configuration for this program:

[ 1 ] com.mbeddr.core.modules.gen.generator.template.main.removeCommentedCode
[ 2 ] com.mbeddr.core.util.generator.template.main.reportingPrintf
[ 3 ] com.mbeddr.core.buildconfig.generator.template.main.desktop
      com.mbeddr.core.modules.gen.generator.template.main.main

This particular model is generated in three phases. The first one removes commented code to make sure it does not show up in the resulting C text file. The second phase runs the generator that transforms report statements into printfs. Finally, the desktop generator generates a make file from the build configuration, and the last step generates the C text from the C tree [11].

[11] It is called desktop because it generates the code for a desktop computer. It can be exchanged to generate code or make files for arbitrary embedded devices and compilers.

By default, MPS runs all generators until everything is either discarded or transformed into text. While intermediate models exist, they are not shown to the user. For debugging purposes though, these intermediate, transient models can be retained for inspection. Each of the phases is represented by one or more transient models. As an example, here is the program after the report statement has been transformed:

module Simple imports nothing {

  exported int32 main(int32 argc, int8*[] argv) {
    printf("$$ aMessage: something happened ");
    printf("@ Simple:main:0#240337946125104144 \n ");
    return 0;
  }
}

MPS also supports tracing an element through the intermediate models. Fig. 15.5 shows an example. Users can select a program element in the source, target or an intermediate model and trace it to the respective other ends of the transformation.

Figure 15.5: The generation trace functionality in MPS allows users to trace how a particular program element is transformed through a chain of transformations. The generation tracer also shows the transformation rules involved in the transformation.

Note how this approach to debugging transformations is very different from the Xtend example above: instead of stepping through the transformation code [12], MPS provides a static representation of the transformation in terms of the intermediate models and the element traces through them.

[12] As part of the general Debug-MPS-in-MPS functionality, MPS transformations can also be debugged in a more imperative fashion. This is useful, for example, to debug more complex logic used inside transformation templates.
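How a transformation schedule can be derived from pairwise priorities is easy to sketch with a topological sort. The following Python example (Python 3.9 or later, for graphlib) is an approximation for illustration only; the actual MPS partitioning algorithm and its priority kinds are richer. The function partition, the shortened generator names and the three "runs before" pairs are assumptions chosen so that the result matches the three phases listed above.

```python
from graphlib import TopologicalSorter


def partition(generators, priorities):
    """Group generators into phases, given (earlier, later) priority pairs.

    Generators without an ordering constraint between them share a phase.
    Raises graphlib.CycleError if the priorities contradict each other.
    """
    sorter = TopologicalSorter({g: set() for g in generators})
    for earlier, later in priorities:
        sorter.add(later, earlier)   # 'later' must wait for 'earlier'
    sorter.prepare()
    phases = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        phases.append(ready)
        sorter.done(*ready)
    return phases


GENERATORS = ["removeCommentedCode", "reportingPrintf", "desktop", "main"]
PRIORITIES = [                       # assumed priorities, for illustration
    ("removeCommentedCode", "reportingPrintf"),
    ("reportingPrintf", "desktop"),
    ("reportingPrintf", "main"),
]
```

With these inputs, partition(GENERATORS, PRIORITIES) yields three phases, the last one containing both desktop and main. Contradictory priorities surface as a cycle error, which is the kind of problem a report of the mapping partitioning helps to diagnose.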

15.2 Debugging DSL Programs

To find errors in DSL programs, we can either debug them on the level of the DSL program or in its LD−1 representation (i.e. in the generated code or the interpreter). Debugging on LD−1 is useful if you want to find problems in the execution engine, or, to some extent, if the language users are programmers and they have an intimate understanding of the LD−1 representation of the program. However, for many DSLs it is necessary to debug on the level of the DSL program, either because the users are not familiar with the LD−1 representation [13], or because the LD−1 is so low-level and complex that it bears no obvious resemblance to the DSL program.

The way to build debuggers for DSLs of course depends on the DSL itself. For example, for DSLs that only describe structures, debugging does not make much sense in the first place. For DSLs that describe behavior, the debugging approach depends on the behavioral paradigm used in the DSL. We have discussed this in Section 5. In this section we focus mostly on the imperative paradigm [14].

Building a debugger poses two challenges. The first one is the debugger UI: creating all the buttons and views for controlling the debugger and for showing variables and threads. The second challenge concerns the control of and data exchange with the program to be debugged. The first challenge is relatively simple to solve, since many IDE frameworks (including Eclipse and MPS) already come with debugger frameworks. The second challenge can be a bit more tricky. If the DSL is executed by an interpreter, the situation is simple: the interpreter can be run and controlled directly from the debugger. It is easy to implement single-stepping and variable watches, for example, since the interpreter can directly provide the respective interfaces [15]. On the other hand, if the DSL program is transformed into code that is executed in some other environment outside of our control, it may even be impossible to build a debugger, because there is no way to influence and inspect the running program. Alternatively, it may be necessary to build a variant of the code generator which generates a debug version of the program that contains specialized code to interact with the debugger. For example, values of variables may be stored in a special data structure inspectable by the debugger, and at each program location where the program may have to stop (in single-step mode or as a consequence of a breakpoint) code is inserted that explicitly suspends the execution of the program, for example by sleeping the current thread. However, such an approach is often limited and ugly; in the end, an execution infrastructure must provide debug support to enable robust debugging.

[13] This is true particularly for DSLs targeted at domain experts.

[14] An example for the functional paradigm was provided in Section 5.3, and the type system tracer described above is an example of a debugger for a declarative language.

[15] This is especially true if the interpreter is written in the same language as the IDE: no language integration issues have to be addressed in this case.
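For the interpreter case, single-stepping, breakpoints and variable watches fall out almost for free, as the following Python sketch suggests. It assumes a trivially small imperative DSL in which a program is a list of assignments; the class SteppingInterpreter and its methods are hypothetical and not taken from any of the tools discussed here.

```python
class SteppingInterpreter:
    """Interprets a list of (variable, expression) assignments one at a time.

    Each expression is a function from the current environment to a value.
    """

    def __init__(self, program):
        self.program = program
        self.pc = 0                # index of the next statement to execute
        self.env = {}              # variable name -> current value
        self.breakpoints = set()   # statement indices at which to suspend

    def finished(self):
        return self.pc >= len(self.program)

    def step(self):
        """Single-step: execute exactly one statement."""
        target, expression = self.program[self.pc]
        self.env[target] = expression(self.env)
        self.pc += 1

    def resume(self):
        """Run until the next statement is a breakpoint or the program ends."""
        while not self.finished():
            self.step()
            if self.pc in self.breakpoints:
                break

    def watch(self, name):
        """Variable watch: current value, or None if not yet assigned."""
        return self.env.get(name)


PROGRAM = [
    ("a", lambda env: 2),
    ("b", lambda env: env["a"] * 10),
    ("c", lambda env: env["a"] + env["b"]),
]
```

A debugger UI would call step, resume and watch in response to button clicks. For generated code running outside our control, the same interface has to be simulated by instrumentation in the generated program, which is what makes that case so much harder.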

15.2.1 Print Statements – a Poor Man’s Debugger

As the above discussion suggests, building full-blown debuggers may be a lot of work. It is worth exploring whether a simpler approach is good enough. The simplest such approach is to extend the DSL with language concepts that simply print interesting aspects of the executing program to the console or a log file. For example, the values of variables may be output this way. The mbeddr report statement is an example of this approach. A report statement takes a message text plus a set of variables. It then outputs the message and the values of these variables. The target of the report statement can be changed. By default, it reports to the console. However, since certain target devices may not have any console16, alternative transformations can be defined for report statements that, for example, could output the data to an error memory or a serial line. A particularly interesting feature of report statements is that the transformation that handles them knows where in the program the report statement is located and can add this information to the output17.

16 mbeddr addresses embedded software development, and small microcontrollers may not have a console.

17 The go-to-error-location functionality discussed in Fig. 14.5 is based on this approach.

An approach based on print statements is sometimes clumsy, because it requires factoring out the expression to be printed18, and it only works for an imperative language in the first place. For languages that make use of sophisticated expressions, a different approach is recommended. Consider the following example:

    Collection[Type] argTypes = aClass.operations.arguments.type;

18 It also requires changing the actual program. In fact, any debug approach that requires any kind of change to the program to be debugged (or the compiled machine code), requires that it is known in advance if a particular program is supposed to be debugged. While this is a significant limitation, it is true for almost all compiled languages ("compile with debug options"), and hence accepted.

If you wanted to print the list of operations and arguments, you would have to change the program to something like this:

    print("operations: " + aClass.operations);
    print("arguments: " + aClass.operations.arguments);
    Collection[Type] argTypes = aClass.operations.arguments.type;

A much simpler alternative uses inlined reporting expressions:

    Collection[Type] argTypes = aClass.operations.print("operations:")
                                      .arguments.print("arguments:").type;

To make this convenient to use, the print function has to return the object it is called on (the one before the dot), and it must be typed accordingly if a language with static type checking is used19.

19 The original openArchitectureWare Xtend did it this way.
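The typing requirement can be illustrated in Java with a generic pass-through function. This is only a sketch under the assumption of a statically typed host language with generics; the class InlineReport and its LOG list are hypothetical, and Java needs a static helper where the DSL uses a method on the receiver.

```java
import java.util.ArrayList;
import java.util.List;

final class InlineReport {
    // Collected output; a real implementation would write to a console or a log file.
    static final List<String> LOG = new ArrayList<>();

    // Reports the value and returns it unchanged. The type parameter T preserves
    // the static type of the argument, so the call can be wrapped around any
    // subexpression without breaking type checking of the enclosing expression.
    static <T> T print(String label, T value) {
        LOG.add(label + " " + value);
        return value;
    }
}
```

For example, `int n = InlineReport.print("items:", java.util.Arrays.asList(1, 2, 3)).size();` still type checks, because the result of print is statically known to be a list.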

15.2.2 Automatic Program Tracing

As languages and programs become more complex, an automated tracing of program execution may be useful. In this approach, all execution steps in a program are automatically traced and logged into a tree-like data structure. The refrigerator cooling language uses this approach. Here is an example program:

    cooling program HelloWorld {
      var temp: int
      start:
        entry { state s1 }
      state s1:
        check temp < 10 { state s2 }
      state s2:
    }

Upon startup, it enters the start state and immediately transitions to state s1. It remains in s1 until the variable temp becomes less than 10. It then transitions to s2. Below is a test for this program that verifies this behavior:

    test HelloWorldTest for HelloWorld {
      prolog {
        set temp = 30
      }
      step
      assert-currentstate-is s1
      step
      mock: set temp = 5
      step
      assert-currentstate-is s2
    }

Fig. 15.6 shows the execution trace. It shows the execution of each statement and the evaluation of each expression. The log viewer is a tree table, so the various execution steps can be selectively expanded and collapsed. Users can double-click on an entry to select the respective program element in the source node. By adding special comments to the source, the log can be structured further20.

20 Obviously, the approach does not scale for big programs. However, by isolating problems into smaller, representative programs, it does provide a degree of usefulness.

The execution engine for the programs is an interpreter, which makes it particularly simple to collect the trace data21. All interpreter methods that execute statements or evaluate expressions take a LogEntry object as an additional argument. The methods then add children to the current LogEntry that describe whatever the method did, and then pass the child to any other interpreter methods it calls. As an example, here is the implementation of the AssignmentStatement:

    protected void executeAssignmentStatement(AssignmentStatement s, LogEntry log) {
        LogEntry c = log.child(Kind.info, context, "executing AssignmentStatement");
        Object l = s.getLeft();
        Object r = eval(s.getRight(), c);
        eec().environment.put(symbol, r);
        c.child(Kind.debug, context, "setting " + symbol.getName() + " to " + r);
    }

21 If, instead of an interpreter, a code generator were used, the same approach could essentially be used. Instead of embedding the tracing code in the interpreter, the code generator would generate code that would build the respective trace data structure in the executing program. Upon termination, the data structure could be dumped to an XML file and subsequently loaded by the IDE for inspection.
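The following self-contained sketch shows the same pattern in isolation. It is not the cooling language's interpreter: TraceEntry is a simplified, hypothetical stand-in for LogEntry (no kind, no context), and the tiny expression language with Num and Plus exists only for illustration. The point is how each evaluation method creates a child entry and hands that child to the methods it calls, so that the trace mirrors the nesting of the evaluation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the LogEntry class used by the interpreter.
final class TraceEntry {
    final String message;
    final List<TraceEntry> children = new ArrayList<>();

    TraceEntry(String message) {
        this.message = message;
    }

    // Creates a child entry, attaches it, and returns it so that callees can nest under it.
    TraceEntry child(String message) {
        TraceEntry c = new TraceEntry(message);
        children.add(c);
        return c;
    }

    // Renders the tree with two spaces of indentation per nesting level.
    String render() {
        StringBuilder sb = new StringBuilder();
        render(sb, 0);
        return sb.toString();
    }

    private void render(StringBuilder sb, int depth) {
        for (int i = 0; i < depth; i++) sb.append("  ");
        sb.append(message).append('\n');
        for (TraceEntry c : children) c.render(sb, depth + 1);
    }
}

// Tiny expression interpreter that threads a TraceEntry through every evaluation step.
final class TracingEvaluator {
    interface Expr {}

    static final class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
    }

    static final class Plus implements Expr {
        final Expr left, right;
        Plus(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    static int eval(Expr e, TraceEntry log) {
        if (e instanceof Num) {
            int v = ((Num) e).value;
            log.child("Num -> " + v);
            return v;
        }
        Plus p = (Plus) e;
        TraceEntry c = log.child("evaluating Plus");
        // The child entry, not the parent, is passed on to the nested evaluations.
        int result = eval(p.left, c) + eval(p.right, c);
        c.child("result " + result);
        return result;
    }
}
```

A tree viewer such as the one in Fig. 15.6 can then be driven directly from the children lists; render is merely a textual substitute for it.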

15.2.3 Simulation as an Approximation for Debugging

The interpreter for the cooling programs mentioned above is of course not the final execution engine: C code is generated that is executed on the actual target refrigerator hardware. However, as we discussed in Section 4.3.7, we can make sure the generated code and the interpreter are semantically identical by running a sufficiently large number of tests. If we do this, we can use the interpreter to test the programs for logical errors.

Figure 15.6: The log viewer represents a program’s execution as a tree. The first column contains a timestamp and the tree nesting structure. The second column contains the kind (severity) of the log message. A filter can be used to show only messages above a certain severity. The third column shows the language concept with which the trace step is associated (double-clicking on a row selects this element in the editor). Finally, the last column contains the information about the semantic action that was performed in the respective step.

The interpreter can also be used interactively, in which case it acts as a simulator for the executing program. It shows all variables in the program, the events in the queue, the running tasks, as well as the values of properties of hardware elements and the current state. It also provides a button to single-step the program, to run it continuously, or to run it until it hits a breakpoint. In other words, although the simulator does not use the familiar22 UI of a debugger, it actually is a debugger23!

22 ... familiar to programmers, but not to the target audience!

23 As we have said above, the fact that it runs in the interpreter instead of the generated code is not a problem if we ensure the two are semantically identical. Of course we cannot find bugs in the implementation (i.e. in the generator) in this way. But to detect those, debugging on the level of the generated C code is more useful anyway.

If you already have the interpreter24, expanding it into a simulator/debugger is relatively simple. Essentially only three things have to be done:

• First, the execution of the program must be controllable from the outside. This involves setting breakpoints, single-stepping through the program and stopping execution if a breakpoint is hit. In our example case, we do not single-step through statements, but only through steps25. Breakpoints are essentially Boolean flags associated with program elements: if the execution processes a statement that has the breakpoint flag set to true, execution stops.
• Second, we have implemented an Observer infrastructure for all parts of the program state that should be represented in the simulator UI. Whenever one of them changes (as a side effect of executing a statement in the program), an event is fired. The UI registers as an observer and updates itself in accordance with the event.
• Third, values from the program state must be changeable from the outside. As a value in the UI (such as the temperature of a cooling compartment) is changed by the user, the value is updated in the state of the interpreter as well.

24 We discuss how to build one in Section 12.

25 Remember that the language is time-triggered anyway, so execution is basically a sequence of steps triggered by a timer. In the simulator/debugger, the timer is replaced with the user pressing the Next Step button for single stepping.
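The three ingredients listed above can be sketched in a few lines of Java. This is an illustration only: StepSimulator and all of its members are hypothetical names, steps are modeled as plain callbacks rather than interpreted program elements, and we assume that a breakpoint on the step the simulator is currently positioned at is ignored when resuming, so that resume always makes progress.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical simulator core; not taken from the cooling language implementation.
final class StepSimulator {
    interface Step { void execute(StepSimulator sim); }
    interface StateObserver { void changed(String name, Object newValue); }

    private final Map<String, Object> state = new LinkedHashMap<>();
    private final List<StateObserver> observers = new ArrayList<>();
    private final Set<Integer> breakpoints = new HashSet<>(); // one flag per step index
    private final List<Step> steps = new ArrayList<>();
    private int next = 0; // index of the step that will be executed next

    void addStep(Step s) { steps.add(s); }
    void addObserver(StateObserver o) { observers.add(o); }

    // (1) Control from the outside: breakpoints are flags on program elements.
    void setBreakpoint(int stepIndex, boolean enabled) {
        if (enabled) breakpoints.add(stepIndex); else breakpoints.remove(stepIndex);
    }

    // (2) and (3): every change, whether caused by a step or by the user
    // editing a value in the UI, goes through set and notifies the observers.
    void set(String name, Object value) {
        state.put(name, value);
        for (StateObserver o : observers) o.changed(name, value);
    }

    Object get(String name) { return state.get(name); }

    // Executes exactly one step; returns false if the program has finished.
    boolean singleStep() {
        if (next >= steps.size()) return false;
        steps.get(next++).execute(this);
        return true;
    }

    // Runs until the step to be executed next carries a breakpoint.
    // Returns the index of that step, or -1 if the program ran to completion.
    int resume() {
        while (singleStep()) {
            if (next < steps.size() && breakpoints.contains(next)) return next;
        }
        return -1;
    }
}
```

Routing both program-initiated and user-initiated changes through the same set method is a deliberate simplification: it guarantees that the UI and the interpreter state cannot drift apart.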

15.2.4 Automatic Debugging for Xbase-based DSLs

DSLs that use Xbase, Xtext’s reusable expression language, get debugging mostly26 for free. This is because of the tight integration of Xbase with the JVM. We describe this integration in more detail in Section 16.3; here is the essence.

26 "Mostly" because in a few cases you have to add trace information manually in transformations.

A DSL that uses Xbase typically defines its own structural and high-level behavioral aspects, but uses Xbase for the fine-grained, expression-level and statement-level27 behavior. For example, in a state machine DSL, states, events and transitions would be concepts defined by the DSL, but the guard conditions and the action code would reuse Xbase expressions.

27 Technically, Xtend doesn’t have statements, and things like if or switch are expressions.

When mapping this DSL to Java28, the following approach is used: the structural and high-level behavioral aspects are mapped to Java not by generating Java text, but by mapping the DSL AST to a Java AST29. For the reused Xbase aspects (the finer-grained behavior) a Java generator (called the Xbase compiler) already exists, which we simply call from our generator.

28 Xbase-based DSLs must be mapped to Java to benefit from Xbase in the way discussed in this section. If you generate code other than Java, Xbase cannot be used sensibly.

29 The Java AST serves as a hub for all Xbase languages including Xtend, your own DSL or Xcore. All those work nicely together. The JVM level serves as an interoperability layer.

Essentially, we do not create a code generator, but a model-to-model transformation from the DSL AST to the Java AST. As part of this transformation (performed by the JVMModel inferrer), trace links between the DSL code and the Java code are established. In other words, the relationship between the Java code and the DSL code is well known. This relationship is exploited in the debugging process. Xbase-based DSLs use the Java debugger for debugging. In addition to showing the generated Java code, the debugger can also show the DSL code, based on the trace information collected by the JVMModel inferrer. In the same way, if a user sets a breakpoint in the DSL code, the trace information is used to determine where to set the breakpoint in the generated Java code.

15.2.5 Debuggers for an Extensible Language

This section describes in some detail the architecture of an extensible debugger for an extensible language30. We illustrate the approach with an implementation based on mbeddr, an extensible version of C implemented with MPS. We also show the debuggers for non-trivial extensions of C.

30 Extensible languages were defined and discussed in Section 4.6. We show in detail in Section 16.2 how this works with MPS.

■ Requirements for the Debugger

Debuggers for imperative languages support at least the following features: breakpoints suspend execution on arbitrary statements; single-step execution steps over statements, and into and out of functions or other callables; and watches show values of variables, arguments or other aspects of the program state. Stack frames visualize the call hierarchy of functions or other callables.

When debugging a program that contains extensions, the breakpoints, stepping, watches and call stacks at the extension level differ from their counterparts at the base level. The debugger has to perform the mapping from the base level to the extension level (Fig. 15.7). We distinguish between the tree representation of a program in MPS and the generated text that is used by the C compiler and the debugger backend. A program in the tree representation can be separated into parts expressed in the base language (C in this case) and parts expressed using extensions. We refer to the latter as the extension level or DSL level (see Fig. 15.7). An extensible tree-level debugger for mbeddr that supports debugging on the base level and the extension level addresses the following requirements:

Figure 15.7: An extension-aware debugger maps the debug behavior from the base-level to the extension-level (an extension may also be mapped onto other extensions; we ignore this aspect in this section).

Modularity: Language extensions in mbeddr are modular, so debugger extensions must be modular as well. No changes to the base language must be necessary to enable debugging for a language extension.

Framework Genericity: In addition, new language extensions must not require changes to the core debugger infrastructure (not just the base language).

Simple Debugger Definition: Creating language extensions is an integral part of using mbeddr. Hence, the development of a debugger for an extension should be simple and not require too much knowledge about the inner workings of the framework, or even the C debugger backend.

Limited Overhead: Because mbeddr targets embedded software development, we have to limit the additional, debugger-specific code generated into the binary. Such code would increase the size of the binary, potentially making debugging on a small target device infeasible.

Debugger Backend Independence: Embedded software projects use different C debuggers, depending on the target device. This prevents modifying the C debugger itself: changes would have to be re-implemented for every C debugger used.

■ An Example Extension

We start out by developing a simple extension to the mbeddr C language31. The foreach statement can be used to conveniently iterate over C arrays. Users have to specify the array as well as its size. Inside the foreach body, it acts as a reference to the current iteration’s array element32.

    int8 s = 0;
    int8[] a = {1, 2, 3};
    foreach (a sized 3) {
      s += it;
    }

31 We assume that you know the basics of MPS language development, for example from reading the earlier implementation chapters in this book.

32 Note that for the sake of the example, we don’t consider nested foreach statements, so we don’t have to deal with unique names for various (generated) variables.

The code generated from this piece of extended C looks as follows. The foreach statement is expanded into a regular for statement and an additional variable __it:

    int8 s = 0;
    int8[] a = {1, 2, 3};
    for (int __c = 0; __c < 3; __c++) {
      int8 __it = a[__c];
      s += __it;
    }

To make the foreach extension modular, it lives in a separate language module named ForeachLanguage. The new language extends C, since we will refer to concepts defined in C (see Fig. 15.8).

■ Developing the Language Extension

In the new language, we define the ForeachStatement. To make it usable wherever C expects Statements (i.e. in functions), it extends C’s Statement. As Fig. 15.8 shows, ForeachStatements have three children: an Expression that represents the array, an Expression for the array length, and a StatementList for the body. Expression and StatementList are both defined in C.

Figure 15.8: UML class diagram showing the structure of the ForeachLanguage. Concepts from the C base language are in white boxes, new concepts are gray.

The editor is shown in Fig. 15.9. It consists of a horizontal list of cells: the foreach keyword, the opening parenthesis, the embedded editor of the array child, the sized keyword, the embedded editor of the len expression, the closing parenthesis and the editor of the body.

Figure 15.9: The editor definition of the foreach statement and its relationship to an example instance.

As shown in the code snippet below, the array must be of type ArrayType, and the type of len must be int64 or any of its shorter subtypes.

    rule typeof_ForeachStatement for ForeachStatement as fes do {
      typeof( fes.len ) :<=: ;
      if (!(fes.array.type.isInstanceOf(ArrayType))) {
        error "array required" -> fes.array;
      }
    }

As shown above, the generator translates a ForeachStatement to a regular for statement that iterates over the elements with a counter variable __c (Fig. 15.10). Inside the for body, we create a variable __it that refers to the array element at position __c. We then copy in the other statements from the body of the foreach.

The ItExpression extends C’s Expression to make it usable where expressions are expected. The editor consists of a single cell with the keyword it. A constraint ensures that the ItExpression is used only inside the body of a foreach:

    concept constraints ItExpression {
      can be child
        (context, scope, parentNode, link, childConcept)->boolean {
          parentNode.ancestor.isNotNull && parentNode.ancestor.isNotNull;
        }
    }

The type of it must be the base type of the array (e.g. int in the case of int[]), as shown in the code below:

    node basetype = typeof(it.ancestor.array) :ArrayType.baseType;
    typeof(it) :==: basetype.copy;

The foreach generator already generated a local variable __it into the body of the for loop. We can thus translate an ItExpression into a LocalVariableReference that refers to __it.

Figure 15.10: The foreach generator template. A ForeachStatement is replaced by the code that is framed when the template is executed; the dummy function around it just provides context. The COPY_SRC and COPY_SRCL macros contain expressions (not shown) that determine with what the nodes in square brackets (e.g., 10, int8 x;) are replaced during a transformation.

■ Developing the Debug Behavior

The specification of the debugger extension for foreach resides completely in the ForeachLanguage; this keeps the debugger definition for the extension local to the extension language.

To set a breakpoint on a concept, it must implement the IBreakpointSupport marker interface. Statement already implements this interface, so ForeachStatement implicitly implements this interface as well.

Stepping behavior is implemented via ISteppable. The ForeachStatement implements this interface indirectly via Statement, but we have to override the methods that define the step over and step into behavior. Assume the debugger is suspended on a foreach and the user invokes step over. If the array is empty or we have finished iterating over it, a step over ends up on the statement that follows after the whole foreach statement. Otherwise we end up on the first line of the foreach body (s += it;)33.

33 This is the first line of the mbeddr program, not the first line of the generated base program (which would be int8 __it = a[__c];).

The debugger cannot guess which alternative will occur, since it would need to know the state of the program and to evaluate the expressions in the (generated) for. Instead we set breakpoints on each of the possible next statements and then resume execution until we hit one of them. The implementations of the ISteppable methods specify strategies for setting breakpoints on these possible next statements. The contributeStepOverStrategies method collects strategies for the step over case:

    void contributeStepOverStrategies(list res) {
      ancestor
      statement list: this.body
    }

The method is implemented using a domain-specific language for debugger specification, which is part of the mbeddr debugger framework34. It is an extension of MPS’ BaseLanguage, a Java-based language used for expressing behavior in MPS. The ancestor statement delegates to the foreach’s ancestor; this will lead to a breakpoint on the subsequent statement. The second line leads to a breakpoint on the first statement of the body statement list.

34 This simplifies the implementation of debuggers significantly. It is another example of where using DSLs for defining DSLs is a good idea.

Since the array and len expressions can be arbitrarily complex and may contain invocations of callables (such as function calls), we have to specify the step into behavior as well. This requires the debugger to inspect the expression trees in array and len and find any expression that can be stepped into. Such expressions implement IStepIntoable. If any are found, the debugger has to step into each of those, in turn. Otherwise the debugger falls back to step over. An additional method configures the expression trees which the debugger must inspect:

    void contributeStepIntoStrategies(list res) {
      subtree: this.array
      subtree: this.len
    }

By default, the Watch window contains all C symbols (global and local variables, arguments) as supplied by the native C debugger35. To customize watches, a concept has to implement IWatchProvider. Here is the code for foreach, also expressed in the debugger definition DSL:

    void contributeWatchables(list unmapped, list mapped) {
      hide "__c"
      map "__it" to "it"
        type: this.array.type : ArrayType.baseType
        category: WatchableCategories.LOCAL_VARIABLES
        context: this
    }

35 In the case of the foreach, this means that it is not available, but __it and __c are. This is exactly the wrong way around: the watch window should show it, but not __it and __c.

The first line hides __c. The rest maps a base-level C variable to a watchable. It finds a C variable named __it (inserted by the foreach generator) and creates a watch variable named it. At the same time, it hides the base-level variable __it. The type of it is the base type of the array over which we iterate. We assign the it watchable to the local variables section and associate the foreach node with it (double-clicking on the it in the Watch window will highlight the foreach in the code). Stepping into the foreach body does not affect the call stack, since the concept represents no callable (for details, see the next paragraph). So we do not have to implement any stack frame related functionality.
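The effect of the hide and map statements can be mimicked with a small Java helper. This is a sketch and not the mbeddr API: WatchMapper and its methods are hypothetical, variables are reduced to name/value pairs, and the type, category and context information of the real watchables is left out.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper that mirrors the "hide" and "map ... to ..." statements.
final class WatchMapper {
    private final Set<String> hidden = new LinkedHashSet<>();
    private final Map<String, String> renamed = new LinkedHashMap<>();

    WatchMapper hide(String baseName) {
        hidden.add(baseName);
        return this;
    }

    // Mapping a base-level variable implies that it no longer shows up under its C name.
    WatchMapper map(String baseName, String extensionName) {
        renamed.put(baseName, extensionName);
        return this;
    }

    // Turns the variables reported by the native debugger into extension-level watchables.
    Map<String, Object> apply(Map<String, Object> baseLevel) {
        Map<String, Object> watchables = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : baseLevel.entrySet()) {
            String name = e.getKey();
            if (hidden.contains(name)) continue;
            watchables.put(renamed.getOrDefault(name, name), e.getValue());
        }
        return watchables;
    }
}
```

For the foreach example, `new WatchMapper().hide("__c").map("__it", "it")` applied to the C variables s, a, __c and __it yields the watchables s, a and it.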

■ Debugger Framework Architecture

The central idea of the debugger architecture is this: from the C code in MPS and its extensions (tree level) we generate C text (text level). This text is the basis for the debugging process by a native C debugger. We then use trace data to find out how the generated text maps back to the tree level in MPS.

At the core of the execution architecture is the Mapper. It is driven by the Debugger UI (and through it, the user) and controls the C debugger via the Debug Wrapper. It uses the Program Structure and the Trace Data. The Mapper also uses a language’s debug specification, discussed in the next subsection. Fig. 15.11 shows the components and their interfaces.

Figure 15.11: The Mapper is the central component of the debugger execution architecture. It is used by the Debugger UI and, in turn, uses the Debug Wrapper, the Program Structure and the Trace Data.

The IDebugControl interface is used by the Debugger UI to control the Mapper. For example, it provides a resume operation. IBreakpoints allows the UI to set breakpoints on program nodes. IWatches lets the UI retrieve the data items for the Watch window. The Debug Wrapper essentially provides the same interfaces, but on the level of C (prefixed with LL, for "low level"). In addition, ILLDebugControl lets the Mapper find out about the program location of the C Debugger when it is suspended at a breakpoint. IASTAccess lets the Mapper access program nodes. Finally, ITraceAccess lets the Mapper find out the program node (tree level) that corresponds to a specific line in the generated C source text (text level), and vice versa.

To illustrate the interactions of these components, we describe a step over. After the request has been handed over from the UI to the Mapper via IDebugControl, the Mapper performs the following steps:

• Asks the current node’s concept for its step over strategies; these define all possible locations where the debugger could end up after the step over.
• Queries TraceData for the corresponding lines in the generated C text for those program locations.
• Uses the debugger’s ILLBreakpoints to set breakpoints on those lines in the C text.
• Uses ILLDebugControl to resume program execution. It will stop at any of the breakpoints just created.
• Uses ILLDebugControl to get the C call stack.
• Queries TraceData to find out, for each C stack frame, the corresponding nodes in the tree-level program.
• Collects all relevant IStackFrameContributors (see the next section). The Mapper uses these to construct the tree-level call stack.
• Gets the currently visible symbols and their values via ILLWatchables.
• Queries the nodes for all WatchableProviders and uses them to create a set of watchables.
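The first four steps of this list, which implement the actual stepping, can be sketched as follows. The interfaces in this sketch are drastically simplified, hypothetical stand-ins for ISteppable, ITraceAccess, ILLBreakpoints and ILLDebugControl; their real signatures in mbeddr are not shown in this chapter. We also assume that the breakpoints set for stepping are temporary and are removed once the debugger has stopped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of how the Mapper could realize a step over.
final class StepOverSketch {
    // Tree level: ids of the nodes on which execution may stop next.
    interface Steppable { List<String> stepOverTargets(); }

    // Trace data: maps a tree-level node to the first line of its generated C text.
    interface TraceAccess { int firstLineOf(String nodeId); }

    // Text level: the operations needed from the native C debugger.
    interface LowLevelDebugger {
        void setBreakpoint(int line);
        void clearBreakpoint(int line);
        int resume(); // runs to the next breakpoint and returns the C line reached
    }

    // Returns the C line on which the program is suspended after the step over.
    static int stepOver(Steppable current, TraceAccess trace, LowLevelDebugger debugger) {
        List<Integer> temporary = new ArrayList<>();
        for (String nodeId : current.stepOverTargets()) {
            int line = trace.firstLineOf(nodeId);
            debugger.setBreakpoint(line);
            temporary.add(line);
        }
        int reached = debugger.resume();
        // Assumption: breakpoints that only exist to implement stepping are removed again.
        for (int line : temporary) debugger.clearBreakpoint(line);
        return reached;
    }
}
```

The remaining steps of the list then translate the state of the suspended C program (call stack and symbols) back to the tree level.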

At this point, execution returns to the Debugger UI, which then gets the current location and watchables from the Mapper to highlight the statement on which the debugger is suspended and populate the Watch window.

In our implementation, the Debugger UI, Program Repository and Trace Data are provided by MPS. In particular, MPS builds a trace from the program nodes (tree level) in MPS to the generated text-level source. The Debug Wrapper is part of mbeddr and relies on the Eclipse CDT Debug Bridge36, which provides a Java API to gdb37 and other C debuggers.

36 www.eclipse.org/cdt

37 www.gnu.org/software/gdb/documentation/

■ Debugger Specification

The debugger specification resides in the respective language module. As we have seen in the foreach example, the specification relies on a set of interfaces and a number of predefined strategies, as well as the debugger specification DSL.

The interface IBreakpointSupport is used to mark language concepts on which breakpoints can be set. C’s Statement implements this interface. Since all statements inherit from Statement we can set breakpoints on all statements by default. When the user sets a breakpoint on a program node, the mapper uses ITraceAccess to find the corresponding line in the generated C text. A statement defined by an extension may be expanded to several base-level statements, so ITraceAccess actually returns a range of lines; the breakpoint is set on the first one.

Stack frames represent the nesting of invoked callables at runtime38. We create stack frames for a language concept if it has callable semantics. The only callables in C are functions, but in mbeddr, test cases, state machine transitions and component methods are callables as well. Callable semantics on extension level do not necessarily imply a function call on the base level. There are cases in which an extension-level callable is not mapped to a function and where a non-callable is mapped to a function. Consequently, the C call stack may differ from the extension call stack shown to the user. Concepts with callable semantics on the extension level or base level implement IStackFrameContributor. The interface provides operations that determine whether a stack frame has to be created in the debugger UI and what the name of the stack frame should be.

38 A callable is a language concept that contains statements and can be called from multiple call sites.
Stepping behavior is configured via the IStackFrameContributor, ISteppable, ISteppableContext, IStepIntoable and IDebugStrategy interfaces. Fig. 15.12 shows an overview. The methods defined by these interfaces return strategies that determine where the debugger may have to stop next if the user selects a stepping operation (remember that the debugger framework sets breakpoints to implement stepping). New strategies can be added without changing the generic execution aspect of the framework.

Figure 15.12: The structure of language concepts implementing the stepping-related interfaces. The boxes represent language concepts implementing the interfaces discussed in the text. Those concepts define the containments, so this figure represents a typical setup.

Stepping relies on ISteppable contributing step over and step into strategies. Many ISteppables are embedded in an ISteppableContext (e.g., Statements in StatementLists). Strategies may delegate to the containing ISteppableContext to determine where to stop next (the ancestor strategy in the foreach example).

For step into behavior, an ISteppable specifies those subtrees in which instances of IStepIntoable may be located (the array and len expressions in the foreach case). The debugger searches these subtrees at debug-time and collects all instances of IStepIntoable. An IStepIntoable represents a callable invocation (e.g., a FunctionCall), and the returned strategies suspend the debugger within the callable.

Step out behavior is provided by implementors of IStackFrameContributor (mentioned earlier). Since a callable can be called from many program locations, the call site for a particular invocation cannot be determined by inspecting the program structure; a call stack is needed. We use the ordered list of IStackFrameContributors, from which the tree-level call stack is derived, to realize the step out behavior. By "going back" (or "out") in the stack, the call site for the current invocation is determined. For step out, the debugger locates the enclosing IStackFrameContributor and asks it for its step out strategies.

Strategies implement IDebugStrategy and are responsible for setting breakpoints to implement a particular stepping behavior. Language extensions can either implement their own strategies or use predefined ones. These include setting a breakpoint on a particular node, searching for IStepIntoables in expression subtrees (step into), or delegating to the outer stack frame (step out).

To support watches, language concepts implement IWatchProvider if they directly contribute one or more items into the Watch window. An IWatchProviderContext contains zero or more watch providers. Typically these are concepts that own statement lists, such as Functions or IfStatements. If the debugger is suspended on any particular statement, we can find all visible watches by iterating through all ancestor IWatchProviderContexts and asking them for their IWatchProviders. Fig. 15.13 shows the typical structure of the concepts.

Figure 15.13: Typical structure of language concepts implementing the watches-related interfaces

An IWatchProvider implements the contributeWatchables operation. It has access to the C variables available in the native C debugger. Based on those, it creates a set of watchables. The method may hide a base-level C variable (because it is irrelevant to the extension level), promote a C variable to a watchable, or create additional watchables based on the values of C variables. The representation of a watchable often depends on the variable’s type as expressed in the extension program. This type may be different from the one in the C program. For example, we represent values of type Boolean with true and false, even though they are represented as ints in C. As the watchable is created, we specify the type that should be used. Types that should be used in this way must implement IMappableType. Its method mapVariable is responsible for computing a type-appropriate representation of a value.
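As an illustration of the Boolean example, here is a minimal sketch of such a type mapping in Java. The interface MappableTypeSketch is a hypothetical analogue of IMappableType; the signature of the real mapVariable operation is not shown in this chapter, so we simply assume that the raw value arrives as the text reported by the C debugger.

```java
// Hypothetical analogue of IMappableType; the real mbeddr interface may look different.
interface MappableTypeSketch {
    // Computes the extension-level representation of a raw base-level value.
    String mapVariable(String rawCValue);
}

// Booleans are ints in the generated C: zero means false, anything else means true.
final class BooleanTypeSketch implements MappableTypeSketch {
    @Override
    public String mapVariable(String rawCValue) {
        return Long.parseLong(rawCValue.trim()) != 0 ? "true" : "false";
    }
}
```

With this mapping, a C variable whose debugger value is 0 is shown as false in the Watch window, and any other integer value is shown as true.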

More Examples

To illustrate mbeddr’s approach to extensible debuggers further, we have implemented the debugging behavior for mbeddr C and all default extensions. We discuss some interesting cases in this section.

We encounter many cases where we cannot know statically which piece of code will be executed when stepping into a callable. Consider polymorphic calls on interfaces. The mbeddr components extension provides interfaces with operations, as well as components that provide and use these interfaces. The component methods that implement interface operations are generated to base-level C functions. The same interface can be implemented by different components, each implementation ending up in a different C function. A client component only specifies the interface it uses, not the component. Hence we cannot know statically which C function will be called if an operation is invoked on the interface. However, we do know statically all components that implement the interface, so we know all possible C functions that may be invoked. A strategy implemented specifically for this case sets breakpoints on the first line of each of these functions to make sure we stop in the first line of any of them if the user steps into a method invocation39.

39 We encounter a similar challenge in state machines: as an event is fired into a state machine, we do not know which transition will be triggered. Consequently we set breakpoints in all transitions (translated to case branches in a switch statement) of the state machine.

In many cases a single statement on the extension level is mapped to several statements or whole blocks on the base level. Stepping over the single extension-level statement must step over the whole block or list of statements in terms of C. An example is the assert statement used in test cases. It is mapped to an if statement. The debugger has to step over the complete if statement, independent of whether the condition in the if evaluates to true or false. Note that we get this behavior for free40: the assert statement sets a breakpoint on the base-level counterpart of the next tree-level statement. It is irrelevant how many lines of C text further down this is.

40 Remember that we never actually step over statements, we always set breakpoints at the next possible code locations where the debugger may have to stop next.

Extensions may provide custom data types that are mapped to one or more data types or structures in the generated C. The debugger has to reconstruct the representation in terms of the extension from the base-level data. For example, the state of a component is represented by a struct that has a member for each of the component fields. Component operations are mapped to C functions. In addition to the formal arguments declared for the respective operation, the generated C function also takes this struct as an argument. However, to support the polymorphic invocations discussed earlier, the type of this argument is void*. Inside the operation, the void* is cast down to allow access to the component-specific members. The debugger performs the same downcast to be able to show watchables for all component fields.
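The step-into strategy for polymorphic calls described above boils down to collecting one breakpoint location per candidate C function. The following Java sketch illustrates the idea under simplifying assumptions: the class and method names (StepIntoSketch, CFunction, breakpointsForStepInto) are hypothetical, and locations are represented as plain function:line strings rather than whatever the real debugger backend uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: all names are hypothetical, not mbeddr's API.
class StepIntoSketch {

    // A generated C function and the line on which its body starts.
    static final class CFunction {
        final String name;
        final int firstLine;

        CFunction(String name, int firstLine) {
            this.name = name;
            this.firstLine = firstLine;
        }
    }

    // implementations maps an interface operation to the C functions
    // generated from every component method that implements it.
    // The result holds one breakpoint location per candidate function.
    static List<String> breakpointsForStepInto(
            String operation, Map<String, List<CFunction>> implementations) {
        List<String> locations = new ArrayList<>();
        List<CFunction> candidates =
                implementations.getOrDefault(operation, Collections.emptyList());
        for (CFunction f : candidates) {
            // We cannot know which function will run, so we stop in all of them.
            locations.add(f.name + ":" + f.firstLine);
        }
        return locations;
    }
}
```

The same shape would cover the state machine case mentioned in the margin note, with transitions taking the place of component methods.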

Discussion

To evaluate the suitability of our solution for our purposes, we revisit the requirements described earlier.

Modularity: Our solution requires no changes to the base language or its debugger implementation to specify the debugger for an extension. Also, independently developed extensions retain their independence if they contain debugger specifications41.

41 In particular, MPS’ capability of incrementally including language extensions in a program without defining a composite language first is preserved in the face of debugger specifications.

Framework Genericity: The extension-dependent aspects of the debugger behavior are extensible. In particular, stepping behavior is factored into strategies, and new strategies can be implemented by a language extension. Also, the representation of watch values can be customized by making the respective type implement IMappableType in a suitable way.

Simple Debugger Definition: This challenge is solved by the debugger definition DSL. It supports the definition of stepping behavior and watches in a declarative way, without concerning the user with implementation details of the framework or the debugger backend.

Limited Overhead: Our solution generates no debugger-specific code at all (except the debug symbols added by compiling the C code with debug options). Instead we rely on trace data to map the extension level to base level and ultimately to text. This is a trade-off: first, the language workbench must be able to provide trace information. Second, the generated C text cannot be modified by a text processor before it is compiled and debugged, since this would invalidate the trace data42. Our approach has another advantage: we do not have to change existing transformations to generate debugger-specific code. This keeps the transformations independent of the debugger.

42 The C preprocessor works, it is handled correctly by the compiler and debugger.

Debugger Backend Independence: We use the Eclipse CDT Debug Bridge to wrap the particular C debugger, so we can use any compatible debugger without changing our infrastructure. Our approach requires no changes to the native C debugger itself, but since we use breakpoints for stepping, the debugger must be able to handle a reasonable number of breakpoints43. The debugger also has to provide an API for setting and deleting breakpoints, for querying the currently visible symbols and their values, as well as for querying the code location where the debugger suspended.

43 Most C debuggers support this, so this is not a serious limitation.

15.2.6 What’s Missing?

The support from language workbenches for building debuggers for the DSLs defined with the language workbench is not where it should be. In the face of extensible languages or language composition especially, the construction of debuggers is still a lot of work. The example discussed in Section 15.2.5 above is not part of MPS in general, but instead a framework that has been built specifically for mbeddr, although we believe that the architecture can be reused more generally.

Also, ideally, programs should be debuggable at any abstraction level: if a multi-step transformation is used, then users should be able to debug the program at any intermediate step. Debug support for Xbase-based DSLs is a good example of this, but it is only one translation step, and it is a solution that is specifically constructed for Xbase and Java, and not a generic framework. So there is a lot of room for innovation in this space.

16 Modularization, Reuse and Composition

Language modularization, extension and composition are important ingredients in the efficient use of DSLs, just as reuse in general is important to software development. We discuss the need for modularization, extension and composition in the context of DSL design in Section 4.6, where we introduce the four classes of modularization, extension and composition. In this chapter, we look at the implementation approaches taken by our example tools.

16.1 Introduction

When modularizing and composing languages, the following challenges have to be addressed:

• The concrete and the abstract syntaxes of the languages have to be combined. Depending on the kind of composition, this requires the embedding of one syntax into another. This, in turn, requires modular grammars, or more generally, ways of specifying concrete syntax that avoid ambiguities.

• The static semantics, i.e. the constraints and the type system, have to be integrated. For example, in the case of language extension, new types have to be "made valid" for existing operators.

• The execution semantics have to be combined as well. In practice, this may mean mixing the code generated from the composed languages, or composing the generators or interpreters.

• Finally, the IDE services that provide code completion, syntax coloring, static checks and other relevant services have to be extended and composed as well.

In this chapter we show how each of those is addressed with the respective tools. We don’t discuss the general problems any further, since those have been discussed in Part II of the book on DSL design.

16.2 MPS Example

With MPS, two of the challenges outlined above (composability of concrete syntax and modular IDEs) are solved to a large degree. Modular type systems are reasonably well supported. Semantic interactions are hard to solve in general, but can be handled reasonably in many relevant cases, as we show in this section as well. However, as we will see, in many cases languages have to be designed explicitly for reuse to make them reusable. After-the-fact reuse, without consideration during the design of the reusable language, is possible only in limited cases. However, this is true for reuse in software generally.

Figure 16.1: entities is the central language. uispec defines UI forms for the entities. uispec_validation adds validation rules, and composes a reusable expressions language. relmapping provides a reusable database mapping language, relmapping_entities adapts it to the entities language. rbac is a reusable language for specifying permissions; rbac_entities adapts this language to the entities language.

We describe language modularization, extension and composition with MPS based on a set of examples1. At the center of this section is a simple entities language, the usual entity-with-property-and-references "Hello World" example. We then build additional languages to illustrate extension and composition. Fig. 16.1 illustrates these additional languages. The uispec (user interface specification) language illustrates referencing with entities. relmapping (relational database mapping) is an example of reuse with separated generated code. rbac (role-based access control) illustrates reuse with intermixed generated code. uispec_validation demonstrates extension (of the uispec language) and embedding with regard to the expressions language.

1 These are new examples that have not been used before in the book. This is to keep them simple; showcasing these things, for example, with mbeddr would add a lot of additional context and complexity that is not necessary to illustrate the essential techniques.

16.2.1 Implementing the Entities Language

Below is some example code expressed in the entities language. Modules are root nodes. They live as top-level elements in models2.

2 Referring back to the terminology introduced in Section 3.3, root nodes (and their descendants) are considered fragments, while the models are partitions. Technically, they are XML files.

    module company {
      entity Employee {
        id : int
        name : string
        role : string
        worksAt : Department
        freelancer : boolean
      }
      entity Department {
        id : int
        description : string
      }
    }

Structure and Syntax

Fig. 16.2 shows a class diagram of the structure of the entities language. Each box represents a language concept.

Figure 16.2: The abstract syntax of the entities language. Entities have attributes, which have types and names. EntityType extends Type and refer- ences Entity. This "adapts" entities to types (cf. the Adapter pattern).

The following code shows the definition of the Entity concept3. Entity implements the INamedConcept interface to inherit a name property. It declares a list of children of type Attribute in the attributes collection. Fig. 16.3 shows the definition of the editor for Entity.

3 This is not the complete definition, concepts can have more characteristics. This is simplified to show the essentials.

    concept Entity extends BaseConcept implements INamedConcept
      can be root: true
      children:
        Attribute attributes 0..n

Type System

For the entities language, we specify two simple typing rules. The first one specifies that the type of the primitives (int, string) is a clone of themselves:

    rule typeof_Type {
      applicable for concept = Type as t
      overrides false
      do {
        typeof(t) :==: t.copy;
      }
    }

Figure 16.3: The editor for Entity. The outermost cell is a vertical list ([/ .. /]). In the first line, we use a horizontal list that contains the "keyword" entity, the value of the name property and an opening curly brace. In the second line we use indentation and a vertical arrangement of the contents of the attributes collection. Finally, the third line contains the closing curly brace.

The only other typing rule is an equation that defines the type of the attribute as a whole to be the type of the attribute’s type property, defined as

typeof(attribute) :==: typeof(attribute.type);

Generator

From entities models we generate Java Beans expressed in MPS’ BaseLanguage. For the entities language, we need a root mapping rule4 and reduction rules5. The root mapping rule is used to generate a Java class from an Entity. The reduction rules transform the various types (int, string, etc.) to their Java counterparts. Fig. 16.4 shows a part of the mapping configuration for the entities language.

4 Root mapping rules are used to create new top-level artifacts from existing top-level artifacts (mapping a fragment to another fragment).

5 Reduction rules are in-place transformations. Whenever the transformation engine encounters an instance of the specified source concept somewhere in a program tree, it replaces that source node with the result of the associated template.

Figure 16.4: The mapping configuration for the entities language. The root mapping rule for Entity specifies that instances of Entity should be transformed with the map_Entity template. The reduction rules use inline templates. For example, the IntType is replaced with the Java int and the EntityRefType is reduced to a reference to the class generated from the target entity. The ->$ is a reference macro. It contains code (not shown) that "rewires" the reference to Double to a reference to the class generated from the referenced Entity.

Fig. 16.5 shows the map_Entity template. It generates a Java class. Inside the class, we generate a field for each entity attribute. To do this we first create a prototype field in the class (private int aField;). Then we use macros to "transform" this prototype into an instance for each Entity attribute. We first attach a LOOP macro to the whole field. Based on its expression node.attributes;, we iterate over the attributes in the current entity. We then use a COPY_SRC macro to transform the type. COPY_SRC copies the input node (the inspector specifies the current attribute’s type as the input here) and applies reduction rules. So instances of the types defined as part of the entities language are transformed into a Java type using the reduction rules defined in the mapping configuration above. Finally we use a property macro (the $ sign) to change the name property of the field we generate from the dummy value aField to the name of the attribute we currently transform (once again via an expression in the inspector).
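To make the interplay of the root mapping rule, the LOOP macro, COPY_SRC and the property macro more tangible, here is a rough Java analogy. It is only a sketch: MPS templates transform trees rather than concatenating text, and the names EntityToBeanSketch, reduceType and generate are hypothetical helpers invented for this illustration.

```java
import java.util.Map;

// Illustrative text-based analogy; MPS itself works on trees, not strings.
class EntityToBeanSketch {

    // Counterpart of the reduction rules: map an entities type to a Java
    // type. Anything that is not a primitive is treated as a reference to
    // another entity and reduced to the name of the class generated from it.
    static String reduceType(String entitiesType) {
        switch (entitiesType) {
            case "int":
                return "int";
            case "string":
                return "String";
            case "boolean":
                return "boolean";
            default:
                return entitiesType;
        }
    }

    // Counterpart of the root mapping rule: one Java class per entity.
    // attributes maps attribute names to entities type names.
    static String generate(String entityName, Map<String, String> attributes) {
        StringBuilder out = new StringBuilder();
        out.append("public class ").append(entityName).append(" {\n");
        // Counterpart of the LOOP macro: one field per attribute.
        for (Map.Entry<String, String> a : attributes.entrySet()) {
            out.append("  private ")
               .append(reduceType(a.getValue())) // COPY_SRC plus reduction rules
               .append(' ')
               .append(a.getKey())               // property macro replaces aField
               .append(";\n");
        }
        out.append("}\n");
        return out.toString();
    }
}
```

Calling generate for the Employee entity with an insertion-ordered map of its attributes yields a class body with one private field per attribute, with string reduced to String and worksAt typed as Department.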

Figure 16.5: The template for creating a Java class from an Entity. The running text explains the details. The «placeholder» is a special concept used later.

16.2.2 Referencing

We define a language uispec for defining user interface forms based on the entities. Below is an example model. Note how the form is another, separate fragment. It is a dependent fragment, since it references elements from another fragment (expressed in the entities language). Both fragments are homogeneous, since they consist of sentences expressed in a single language.

    form CompanyStructure
      uses Department
      uses Employee
      field Name: textfield(30) -> Employee.name
      field Role: combobox(Boss, TeamMember) -> Employee.role
      field Freelancer: checkbox -> Employee.freelancer
      field Office: textfield(20) -> Department.description

Structure and Syntax

The uispec language extends6 the entities language. This means that concepts from the entities language can be used in the definition of language concepts in the uispec language. Fig. 16.6 shows the abstract syntax as a UML diagram.

6 MPS uses the term "extension" whenever the definition of one language uses or refers to concepts defined in another language. This is not necessarily an example of language extension as defined in this book.

Figure 16.6: The abstract syntax of the uispec language. Dotted lines represent classes from another language (here: the entities language). A Form contains EntityReferences that connect to an entities model. A form also contains Fields, each referencing an Attribute from an Entity and containing a Widget.

A Form owns a number of EntityReferences, which in turn reference the Entity concept. Also, Fields refer to the attribute that is intended to be edited via the field. Below is the definition of the Field concept. It owns a Widget and refers to an Attribute.

    concept Field extends BaseConcept implements <none>
      properties:
        label : string
      children:
        Widget widget 1
      references:
        Attribute attribute 1

Type System

The language enforces limitations over which widget can be used with which attribute type7. The necessary typing rule is defined in the uispec language and references types from the entities language8. The following is the code for the type check.

7 A checkbox widget requires a Boolean type, a ComboWidget requires a string type.

8 This is valid, since in referencing, the referencing language has a dependency on the referenced language.

    checking rule checkTypes {
      applicable for concept = Field as field
      overrides false
      do {
        if (field.widget.isInstanceOf(CheckBoxWidget)
            && !(field.attribute.type.isInstanceOf(BooleanType))) {
          error "only use checkbox with booleans" -> field.widget;
        }
        if (field.widget.isInstanceOf(ComboWidget)
            && !(field.attribute.type.isInstanceOf(StringType))) {
          error "only use combobox with strings" -> field.widget;
        }
      }
    }

Generation

The defining characteristic of language referencing is that the two languages only reference each other, and the instance fragments are dependent, but homogeneous. No syntactic integration is necessary in this case. In this example, the generated code exhibits the same separation. From the Form definition, we generate a Java class that uses Java Swing to build the UI form. It uses the beans generated from the entities: the classes are instantiated, and the getters and setters are called9. The generators are separate, but they are dependent because they share information. Specifically, the uispec generator knows about the names of the generated entity classes, as well as the names of the setters and getters. This dependency is implemented by defining a couple of behavior methods on the Attribute concept that are called from both generators:

9 To get values from the bean when the form is populated and to set the values into the bean if they have been changed in the form.

    concept behavior Attribute {
      public string qname() { this.parent:Entity.name + "." + this.name; }
      public string setterName() { "set" + this.name.toFirstUpper(); }
      public string getterName() { "get" + this.name.toFirstUpper(); }
    }
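The naming convention encoded in these behavior methods is the only knowledge the two generators share. As a plain Java sketch of the same string logic (the class AttributeNamesSketch and its toFirstUpper helper are hypothetical stand-ins written for this illustration, not MPS code):

```java
// Illustrative sketch of the naming convention shared by both generators.
class AttributeNamesSketch {

    // Hypothetical stand-in for the toFirstUpper operation used in the
    // behavior methods: upper-cases the first character only.
    static String toFirstUpper(String s) {
        return s.isEmpty() ? s : Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    // Qualified name of an attribute: Entity.attribute
    static String qname(String entityName, String attributeName) {
        return entityName + "." + attributeName;
    }

    static String setterName(String attributeName) {
        return "set" + toFirstUpper(attributeName);
    }

    static String getterName(String attributeName) {
        return "get" + toFirstUpper(attributeName);
    }
}
```

Because both the entities generator and the uispec generator derive names through the same functions, a change to the convention has to be made in one place only.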

The original entities fragment is still sufficient for the transformation that generates the Java Bean. The uispec fragment is not sufficient for generating the UI; it needs the entities fragment. This is not surprising, since dependent fragments can never be sufficient for a transformation: the transitive closure of all dependencies has to be made available.

16.2.3 Extension

We extend the MPS base language with block expressions and placeholders. These concepts make writing generators that generate base language code much simpler. Fig. 16.7 shows an example.

Figure 16.7: Block expressions (in blue, gray in print) are basically anonymous inline methods. Upon transformation, a method is generated that contains the block content, and the block expression is replaced with a call to this method. Block expressions are used mostly when implementing generators; this screenshot shows a generator that uses a block expression.

Structure and Syntax

A block expression is a block that can be used where an Expression is expected10. The block can contain any number of statements; yield can be used to "return" values from within the block11. An optional name property of a block expression is used as the name of the generated method. The generator of the block expression in Fig. 16.7 transforms it into this structure:

10 M. Bravenboer, R. Vermaas, J. J. Vinju, and E. Visser. Generalized type-based disambiguation of meta programs with concrete object syntax. In GPCE, pages 157–172, 2005

11 So, in some sense, a block expression is an "inlined method", or a closure that is defined and called directly.

    // the argument to setName is what was the block expression,
    // it is replaced by a method call to the generated method
    aEmployee.setName(retrieve_name(aEmployee, widget0));

    ...

    // this is the method generated from the block expression
    public String retrieve_name(Employee aEmployee, JComponent widget0) {
      String newValue = ((JTextField) widget0).getText();
      return newValue;
    }

The jetbrains.mps.baselanguage.exprblocks language extends BaseLanguage. To make a block expression valid where BaseLanguage expects an Expression, BlockExpression extends Expression12.

12 Consequently, fragments that use the exprblocks language can now use BlockExpressions in addition to the concepts provided by the BaseLanguage. The fragments become heterogeneous, because languages are syntactically mixed.

    concept BlockExpression extends Expression implements INamedConcept
      children:
        StatementList body 1

Type System

The type of the yield statement is the type of the expression that is yielded, specified by this equation13:

    typeof(yield) :==: typeof(yield.result);

13 The type of yield 1; would be int.

Since the BlockExpression is used as an Expression, it has to have a type as well. However, its type is not explicitly specified, so it has to be calculated as the common supertype of the types of all yields. The following typing rule computes this type. We use the :>=: operator to express that the result type must be the same or a supertype of the right argument.

    typevar resultType;
    for (node y : blockExpr.descendants) {
      resultType :>=: typeof(y.result);
    }
    typeof(blockExpr) :==: resultType;

Figure 16.8: We use a weaving rule to create an additional method for a BlockExpression. A weaving rule processes an input element (BlockExpression) by creating another node in a different place. The context function defines this other place. In this case, it simply gets the class in which we have defined the block expression.
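The effect of the typing rule for block expressions (the result type is a common supertype of all yielded types) can be illustrated outside of MPS. The sketch below uses Java's own class hierarchy as a stand-in for the BaseLanguage type lattice and ignores interfaces; the MPS type system solves such inequations with its own engine, so this is an analogy under those assumptions, not a description of its algorithm. The name YieldTypeSketch is hypothetical.

```java
import java.util.List;

// Illustrative analogy: Java classes stand in for BaseLanguage types.
class YieldTypeSketch {

    // Returns the most specific class that is the same as, or a superclass
    // of, every yielded type (interfaces are ignored in this sketch).
    static Class<?> commonSupertype(List<Class<?>> yieldedTypes) {
        if (yieldedTypes.isEmpty()) {
            throw new IllegalArgumentException("a block needs at least one yield");
        }
        Class<?> candidate = yieldedTypes.get(0);
        // Walk up the superclass chain of the first type until the
        // candidate covers all yielded types.
        while (candidate != null) {
            boolean coversAll = true;
            for (Class<?> t : yieldedTypes) {
                if (!candidate.isAssignableFrom(t)) {
                    coversAll = false;
                    break;
                }
            }
            if (coversAll) {
                return candidate;
            }
            candidate = candidate.getSuperclass();
        }
        return Object.class;
    }
}
```

For a block that yields an Integer in one branch and a Double in another, the sketch computes Number; for a single yield it returns the yielded type itself.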

Generator

The generator for BlockExpressions reduces the new concept to pure BaseLanguage: it performs assimilation. It transforms a heterogeneous fragment (using BaseLanguage and exprblocks) to a homogeneous fragment (using only BaseLanguage). The first step is the creation of the additional method for the block expression. Fig. 16.8 shows the definition of the weaving rule; Fig. 16.9 shows the template used in that weaving rule. The template in Fig. 16.9 creates the method. It assigns a mapping label14 to the created method. The mapping label creates a mapping between the BlockExpression and the created method. We will use this label to refer to this generated method when we generate the method call that replaces the BlockExpression (shown in Fig. 16.10).

14 In earlier MPS examples, we had used name-based resolution of references. We use a mapping label here to show how this works.

A second concept introduced by the exprblocks language is the PlaceholderStatement. This extends Statement so that it can be used inside method bodies. It also has a name. It is used to mark locations at which subsequent generators can add additional code. These subsequent generators will use a reduction rule to replace the placeholder with whatever they want to put at this location. It is a means of building extensible generators. Both BlockExpression and PlaceholderStatement will be used in subsequent examples in this chapter.

Figure 16.9: The generator creates a method from the BlockExpression. It uses COPY_SRC macros to replace the string type with the computed return type of the block expression, inserts a computed name, adds a parameter for each referenced variable outside the block, and inserts all the statements from the block expression into the body of the method (using the COPY_SRCL macro that iterates over all of the statements in the ExpressionBlock). The blockExprToMethod mapping label is used later in the method call.

Figure 16.10: Here we generate the call to the previously generated method. We use the mapping label blockExprToMethod (not shown; happens inside the ->$ macro) to refer to the correct method. We pass in the environment variables as actual arguments.

A particularly interesting feature of MPS is the ability to use several extensions of the same base language in a given program without defining a combining language. For example, a user could decide to use the block expression language defined above together with the dispatch extension discussed in Section 12.2. This is a consequence of MPS’ projectional nature15.

15 These same benefits are also exploited in the case of embedding multiple independent languages. Note that there are also potential semantic issues that are independent of parsing vs. projection.

Let us consider the potential cases for ambiguity:

Same Concept Name: The used languages may define concepts with the same name as the host language. This will not lead to ambiguity because concepts have a unique ID as well. A program element will use this ID to refer to the concept whose instance it represents.

Same Concrete Syntax: The projection of a concept is not relevant to the functioning of the editor. The program would still be unambiguous to MPS even if all elements had the same notation. Of course, it would be confusing to the users.

Same Alias: If two concepts that are valid at the same location use the same alias, then, as the user types the alias, it is not clear which of the two concepts should be instantiated. This problem is solved by MPS opening the code completion window and requiring the user to explicitly select which alternative to choose. Once the user has made the decision, the unique ID is used to create an unambiguous program tree.
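The reason why equal names and equal aliases are harmless can be shown with a toy registry in which concepts are identified by ID only. This is a minimal sketch under stated assumptions: the class ConceptRegistrySketch, its methods and the ID format are invented for illustration and do not reflect how MPS stores concepts internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: not MPS's internal representation.
class ConceptRegistrySketch {

    static final class Concept {
        final String id;    // unique across all languages
        final String name;  // may clash between languages
        final String alias; // what the user types in the editor

        Concept(String id, String name, String alias) {
            this.id = id;
            this.name = name;
            this.alias = alias;
        }
    }

    private final Map<String, Concept> conceptsById = new LinkedHashMap<>();

    void register(Concept c) {
        if (conceptsById.putIfAbsent(c.id, c) != null) {
            throw new IllegalArgumentException("duplicate concept id: " + c.id);
        }
    }

    // Program nodes refer to their concept by ID, so this is unambiguous.
    Concept lookup(String id) {
        return conceptsById.get(id);
    }

    // All concepts matching a typed alias. If more than one is returned,
    // the editor has to ask the user to pick one (code completion).
    List<Concept> candidatesForAlias(String alias) {
        List<Concept> result = new ArrayList<>();
        for (Concept c : conceptsById.values()) {
            if (c.alias.equals(alias)) {
                result.add(c);
            }
        }
        return result;
    }
}
```

Two concepts with the same name and alias from different languages can be registered side by side; only the moment of typing the alias needs a user decision, after which the chosen ID is stored in the tree.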

16.2.4 Reuse with Separated Generated Code

Language reuse covers the case in which a language has been developed independently of the context in which it should be reused. The respective fragments remain homogeneous. In this section, we cover two alternative cases: the first case (in this subsection) addresses a persistence mapping language. The generated code is separate from the code generated from the entities language. The second case (discussed in the next subsection) describes a language for role-based access control. The generated code has to be "woven into" the entities code to check permissions when setters are called.

Structure and Syntax

relmapping is a reusable language for mapping arbitrary data to relational tables. The relmapping language supports the definition of relational table structures, but leaves the actual mapping to the source data unspecified. As you adapt the language to a specific reuse context, you have to specify this mapping. The following code shows the reusable part: a database is defined that contains tables with columns. Columns have (database-specific) data types.

    database CompanyDB
      table Departments
        number id
        char descr
      table People
        number id
        char name
        char role
        char isFreelancer

Fig. 16.11 shows the structure of the relmapping language. The abstract concept ColumnMapper serves as a hook: if we reuse this language in a different context, we extend this hook in a context-specific way.

Figure 16.11: A Database contains Tables which contain Columns. A column has a name and a type. A column also has a ColumnMapper. This is an abstract concept that determines where the column gets its data from. It is a hook intended to be specialized in sublanguages that are context-specific.

The relmapping_entities language extends relmapping and adapts it for reuse with the entities language16. To this end, it provides a subconcept of ColumnMapper, the AttributeColMapper, which references an Attribute from the entities language as a means of expressing the mapping from the attribute to the column. The column mapper is projected on the right of the field definition, resulting in the following (heterogeneous) code fragment17:

    database CompanyDB
      table Departments
        number id <- Department.id
        char descr <- Department.description
      table People
        number id <- Employee.id
        char name <- Employee.name
        char role <- Employee.role
        char isFreelancer <- Employee.freelancer

16 Such a language could be called an Adapter language, in reference to the Adapter pattern from the GoF book.

17 This "mixed syntax" is fairly trivial, since the AttributeColMapper just references an attribute with a qualified name (Entity.attribute). However, arbitrary additional syntax could be added, and we could use arbitrary concepts from the entities language mixed into the relmapping fragment.

Type System

The type of a column is the type of its type property. In addition, the type of the column must also conform to the type of the column mapper, so the concrete ColumnMapper subtype must provide a type mapping as well. This "typing hook" is implemented as an abstract behavior method typeMappedToDB on the ColumnMapper18. With this in mind, the typing rules of the relmapping language look as follows:

18 It is acceptable from a dependency perspective to have this typing hook, since relmapping is designed to be reusable.

    typeof(column) :==: typeof(column.type);
    typeof(column.type) :==: typeof(column.mapper);
    typeof(columnMapper) :==: columnMapper.typeMappedToDB();

The AttributeColMapper concept from the relmapping_entities language implements this method by mapping ints to numbers and everything else to chars.

    public node<> typeMappedToDB() overrides ColumnMapper.typeMappedToDB {
      node<> attrType = this.attribute.type.type;
      if (attrType.isInstanceOf(IntType)) {
        // ints are mapped to the database number type
        return new node();
      }
      // everything else is mapped to the database char type
      return new node();
    }

Generator

The generated code is also separated into a reusable part (a class generated by the relmapping language’s generator) and a context-specific subclass of that class, generated by the relmapping_entities language. The generic base class contains code for creating the tables and for storing data in those tables. It contains abstract methods that are used to access the data to be stored in the columns19.

19 So the dependency structure of the generated fragments, as well as the dependencies of the respective generators, resembles the dependency structure of the languages: the generated fragments are dependent, and the generators are dependent as well (they share the name, and implicitly, the knowledge about the structure of the class generated by the reusable relmapping generator). A relmapping fragment (without the concrete column mappers) is sufficient for generating the generic base class.

    public abstract class CompanyDBBaseAdapter {

      private void createTableDepartments() {
        // SQL to create the Departments table
      }

      private void createTablePeople() {
        // SQL to create the People table
      }

      public void storeDepartments(Object applicationData) {
        Insert i = new Insert("Departments");
        i.add("id", getValueForDepartments_id(applicationData));
        i.add("descr", getValueForDepartments_descr(applicationData));
        i.execute();
      }

      public void storePeople(Object applicationData) {
        // like above
      }

      public abstract String getValueForDepartments_id(Object applicationData);

      public abstract String getValueForDepartments_descr(Object applicationData);

      // abstract getValue methods for the People table
    }

The subclass, generated by the generator in the relmapping_entities language, implements the abstract methods defined by the generic superclass. The interface, represented by the applicationData object, has to be kept generic so that any kind of user data can be passed in20.

20 Note how this class references the Beans generated from the entities: the generator for entities and the generator for relmapping_entities are dependent, the information shared between the two generators is the names of the classes generated from the entities. The code generated from the relmapping language is designed to be extended by code generated from a sublanguage (the abstract getValue methods). This is acceptable, since the relmapping language itself is designed to be extended to adapt it to a new reuse context.

    public class CompanyDBAdapter extends CompanyDBBaseAdapter {

      public String getValueForDepartments_id(Object applicationData) {
        Object[] arr = (Object[]) applicationData;
        Department o = (Department) arr[0];
        String val = o.getId() + "";
        return val;
      }

      public String getValueForDepartments_descr(Object applicationData) {
        Object[] arr = (Object[]) applicationData;
        Department o = (Department) arr[0];
        String val = o.getDescription() + "";
        return val;
      }
    }

16.2.5 Reuse with Interwoven Generated Code

rbac is a language for specifying role-based access control, to specify access permissions for entities defined with the entities language21.

21 Like the example in the previous subsection, the models expressed in the two languages are separate and homogeneous. However, the code generated from the rbac specification has to be mixed with the code generated from the entities.

Here is some example code:

    RBAC
    users:
        user mv : Markus Voelter
        user ag : Andreas Graf
        user ke : Kurt Ebert
    roles:
        role admin : ke
        role consulting : ag, mv
    permissions:
        admin, W : Department
        consulting, R : Employee.name

■ Structure and Syntax The structure of rbac is shown in Fig. 16.12. Like relmapping, it provides a hook, in this case, Resource, to adapt it to context languages: the sublanguage rbac_entities provides two subconcepts of Resource, namely AttributeResource to refer to an attribute, and EntityResource to refer to an Entity, to define permissions for entities and their attributes.

Figure 16.12: Language structure of the rbac language. An RBACSpec contains Users, Roles and Permissions. Users can be members in several roles. A permission assigns a right to a Resource.

■ Type System No type system rules apply here.

■ Generator What distinguishes this case from the relmapping case is that the code generated from the rbac_entities language is not separated from the code generated from entities. Instead, inside the setters of the Java beans, a permission check is required.

    public void setName(String newValue) {
        // check permissions (from rbac_entities)
        if (!new RbacSpecEntities().currentUserHasWritePermission("Employee.name")) {
            throw new RuntimeException("no permission");
        }
        this.name = newValue;
    }
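The setter relies on a generated class that answers the permission query. The following is a minimal sketch of what such a class might look like, filled with the users, roles and permissions from the example rbac specification. Only the names RbacSpecEntities and currentUserHasWritePermission appear in the text; the static currentUser field, the hasPermission helper, and the rule that a permission on an entity also covers its attributes are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the class the rbac_entities generator might emit.
class RbacSpecEntities {
    // One permission line from the rbac spec: role, right (R or W), resource.
    static final class Permission {
        final String role;
        final char right;
        final String resource;
        Permission(String role, char right, String resource) {
            this.role = role;
            this.right = right;
            this.resource = resource;
        }
    }

    // Roles and their members, taken from the example specification.
    static final Map<String, Set<String>> ROLES = Map.of(
        "admin", Set.of("ke"),
        "consulting", Set.of("ag", "mv"));

    static final List<Permission> PERMISSIONS = List.of(
        new Permission("admin", 'W', "Department"),
        new Permission("consulting", 'R', "Employee.name"));

    // Assumption: the current user is held in a static field.
    static String currentUser = "mv";

    boolean currentUserHasWritePermission(String resource) {
        return hasPermission(currentUser, 'W', resource);
    }

    // Assumption: a permission on an entity also covers its attributes.
    static boolean hasPermission(String user, char right, String resource) {
        for (Permission p : PERMISSIONS) {
            boolean member = ROLES.getOrDefault(p.role, Set.of()).contains(user);
            boolean covers = resource.equals(p.resource)
                || resource.startsWith(p.resource + ".");
            if (member && p.right == right && covers) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, a call to setName by user mv would fail, because the consulting role only has read access to Employee.name.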

The generated fragment is homogeneous (all Java code), but it is multi-sourced, since several generators contribute to the same fragment. To implement this, several approaches are possible:

• We could use AspectJ22. This would allow us to generate separate Java artifacts (all single-sourced) and then use the aspect weaver to "mix" them. However, we don’t want to introduce the complexity of yet another tool, AspectJ, here, so we will not use this approach.

22 www.eclipse.org/aspectj/

• An interceptor23 framework could be added to the generated Java Beans, with the generated code contributing specific interceptors (effectively building a custom AOP solution). We will not use this approach either, since it would require the addition of a whole interceptor framework to the entities implementation. This seems like overkill.

• We could "inject" additional code generation templates to the existing entities generator from the rbac_entities generator. This would make the generators woven, as opposed to just dependent. Assuming that this would work in MPS, it would be the most elegant solution – but it does not.

• We could define a hook in the generated Java Beans code and then have the rbac_entities generator contribute code to this hook. This is the approach we will use. The generators remain dependent, they have to agree on the way the hook works.

23 en.wikipedia.org/wiki/Interceptor_pattern

Notice that only the AspectJ solution can work without any preplanning from the perspective of the entities language, because it avoids mixing the generated code artifacts (it is handled "magically" by AspectJ). All other solutions require the original entities generator to "expect" certain extensions. In our case, we have modified the original generator in the entities language to contain a PlaceholderStatement (see Fig. 16.13). In every setter, the placeholder acts as a hook at which subsequent generators can add statements. So while we have to pre-plan that we want to extend the generator at this location, we don’t have to predefine how.

Figure 16.13: This generator fragment creates a setter method for each attribute of an entity. The LOOP iterates over all attributes. The $ macro computes the name of the method, and the COPY_SRC macro on the argument type computes the type. The placeholder marks the location where the permission check will be inserted by a subsequent generator.

The rbac_entities generator contains a reduction rule for PlaceholderStatements. So when it encounters a placeholder (put there by the entities generator) it removes it and inserts the code that checks for the permission (Fig. 16.14). To make this work, we have to make sure that this generator runs after the entities generator (since the entities generator has to create the placeholder first) but before the BaseLanguage generator (which transforms BaseLanguage code into Java text for compilation). We use generator priorities to achieve this24.

24 Generator priorities express a partial ordering (run before, run after, etc.) between pairs of generators. Upon generation, MPS computes an overall schedule that determines which generators run in which order.
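The interplay of the three generators can be illustrated outside MPS with a small Java sketch. It is not MPS code: statements are modeled as plain objects, the three methods merely stand in for the entities generator, the rbac_entities reduction rule and the text generator, and all class and method names are hypothetical. The sketch also assumes that a placeholder which no generator reduces is silently dropped when text is produced.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; all names are hypothetical.
class PlaceholderDemo {
    interface Statement {}

    // An ordinary generated statement, kept as text for simplicity.
    static final class Code implements Statement {
        final String text;
        Code(String text) { this.text = text; }
    }

    // The hook left behind by the first generator.
    static final class Placeholder implements Statement {
        final String id;
        Placeholder(String id) { this.id = id; }
    }

    // Stage 1 (stands in for the entities generator): setter body with a hook.
    static List<Statement> generateSetterBody(String attribute) {
        List<Statement> body = new ArrayList<>();
        body.add(new Placeholder("checkPermission"));
        body.add(new Code("this." + attribute + " = newValue;"));
        return body;
    }

    // Stage 2 (stands in for the rbac_entities reduction rule): replaces
    // matching placeholders with the permission check, keeps everything else.
    static List<Statement> reducePlaceholders(List<Statement> body, String resource) {
        List<Statement> result = new ArrayList<>();
        for (Statement s : body) {
            if (s instanceof Placeholder && ((Placeholder) s).id.equals("checkPermission")) {
                result.add(new Code(
                    "if (!new RbacSpecEntities().currentUserHasWritePermission(\""
                    + resource + "\")) { throw new RuntimeException(\"no permission\"); }"));
            } else {
                result.add(s);
            }
        }
        return result;
    }

    // Stage 3 (stands in for the text generator). Assumption: placeholders
    // that were not reduced produce no output.
    static String toText(List<Statement> body) {
        StringBuilder sb = new StringBuilder();
        for (Statement s : body) {
            if (s instanceof Code) {
                sb.append(((Code) s).text).append('\n');
            }
        }
        return sb.toString();
    }
}
```

The ordering constraint from the text is visible here: reducePlaceholders only has something to replace if it runs after generateSetterBody and before toText.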

Figure 16.14: This reduction rule replaces PlaceholderStatements with a permission check. Using the condition, we only match those placeholders whose identifier is pre-set (notice how we have defined this identifier in Fig. 16.13). The inserted code queries another generated class that contains the actual permission check. A runtime exception is thrown if the check fails.

16.2.6 Embedding

■ Structure and Syntax uispec_validation extends uispec: it is a sublanguage of the uispec language. It supports writing code such as the following in the UI form specifications:

    form CompanyStructure
        uses Department
        uses Employee

        field Name: textfield(30) -> Employee.name
            validate lengthOf(Employee.name) < 30
        field Role: combobox(Boss, TeamMember) -> Employee.role
        field Freelancer: checkbox -> Employee.freelancer
            validate if (isSet(Employee.worksAt)) Employee.freelancer == false
                     else Employee.freelancer == true
        field Office: textfield(20) -> Department.description

Writing the expressions is supported by embedding an expressions language. Fig. 16.15 shows the structure. To be able to use the expressions, the user has to use a ValidatedField instead of a Field. ValidatedField is also defined in uispec_validation, and is a subconcept of Field.

Figure 16.15: The uispec_validation language defines a subtype of uispec.Field that contains an Expression from an embeddable expression language. The language also defines a couple of additional expressions, specifically the AttributeRefExpr, which can be used to refer to attributes of entities.

To support the migration of existing models that use Field instances, we provide an intention: the user can press Alt-Enter on a Field and select Make Validated Field. This transforms an existing Field into a ValidatedField, so that validation expressions can be entered25. The core of the intention is the following script, which performs the actual transformation:

    execute(editorContext, node)->void {
        node<ValidatedField> vf = new node<ValidatedField>();
        vf.widget = node.widget;
        vf.attribute = node.attribute;
        vf.label = node.label;
        node.replace with(vf);
    }

25 Alternatively it could be arranged (with 5 lines of code) that users could simply type validate on the right of a field definition to trigger the transformation code above.

The uispec_validation language extends the uispec language. We also extend the existing, embeddable expressions language, so we can use Expressions in the definition of our language26. ValidatedField has a property expr that contains the validation expression.

26 Embedding requires that both composed languages (uispec and expressions) remain independent. The uispec_validation language acts as an adapter. In contrast to the reuse examples, this adapter also adapts the concrete syntax.

As a consequence of polymorphism, we can use any existing subconcept of Expression in validations. So without doing anything else, we could write 20 + 40 > 10, since integer literals and the + and > operators are defined as part of the composed expressions language. However, to write anything useful, we have to be able to reference entity attributes from within expressions27. To achieve this, we create the AttributeRefExpr, as shown in Fig. 16.15. We also create LengthOf and IsSetExpression as further examples of how to adapt an embedded language to its new context – i.e. the uispec and entities languages.

27 We argued in the design part that, in order to make an embedded language useful with its host language, it has to be extended: the following is an example of this.

The AttributeRefExpr references an Attribute from the entities language; however, it may only reference those attributes of those entities that are used in the form in which we define the validation expression. The following is the code for the search scope:

    (model, scope, referenceNode, linkTarget, enclosingNode)
            ->join(ISearchScope | sequence<node<Attribute>>) {
        nlist<Attribute> res = new nlist<Attribute>;
        node<Form> form = enclosingNode.ancestor;
        if ( form != null ){
            for (node<EntityReference> er : form.usedEntities) {
                res.addAll(er.entity.attributes);
            }
        }
        res;
    }

Notice that the actual syntactic embedding of the expressions language in the uispec_validation language is no problem at all as a consequence of how projectional editors work. We simply define Expression to be a child of the ValidatedField.

■ Type System The general challenge here is that primitive types such as int and string are defined in the entities language and in the embeddable expressions language. Although they have the same names, they are not the same types. So the two sets of types must be mapped. Here are a couple of examples. The type of the IsSetExpression is by definition expressions.BooleanType. The type of the LengthOf, which takes an AttrRefExpression as its argument, is expressions.IntType. The type of an attribute reference is the type of the attribute’s type property:

    typeof(attrRef) :==: typeof(attrRef.attr.type);

However, consider the following code:

    field Freelancer: checkbox -> Employee.freelancer
        validate if (isSet(Employee.worksAt)) Employee.freelancer == false
                 else Employee.freelancer == true

This code states that if the worksAt attribute of an employee is set, then its freelancer attribute must be false; otherwise it must be true (freelancers don’t workAt anything). It uses the == operator from the expressions language. However, that operator expects two arguments with expressions.BooleanType, but the type of the Employee.freelancer is entities.BooleanType28. In effect, we have to override the typing rules for the expressions language’s == operator. Here is how we do it.

28 It is probably a good idea to use the same set of primitive types (and expressions) for all languages to avoid mappings like these. This could be achieved by requiring all languages to use a common base language (similar to Xbase). However, if the to-be-composed languages are developed by independent people, then it is hard to enforce a common base language. So the ability to have such mappings is useful.

In the expressions language, we define overloaded operation rules. We specify the resulting type for an EqualsExpression depending on its argument types. Here is the code in the expressions language that defines the resulting type to be boolean if the two arguments are Equallable29:

    operation concepts: EqualsExpression
        left operand type: new node<Equallable>()
        right operand type: new node<Equallable>()
    operation type: (operation, leftOperandType, rightOperandType)->node< > {
        <boolean>;
    }

29 In addition to this code, we have to specify that expressions.BooleanType is a subtype of Equallable, so this rule applies if we use equals with two expressions.BooleanType arguments.

We have to tie this overloaded operation specification into a regular type inference rule:

    rule typeof_BinaryExpression {
        applicable for BinaryExpression as binex
        do {
            node<> opType = operation type( binex , left , right );
            if (opType != null) {
                typeof(binex) :==: opType;
            } else {
                error "operator " + binex.concept.name +
                      " cannot apply to these argument types " +
                      left.concept.name + "/" + right.concept.name -> binex;
            }
        }
    }

To override these typing rules to work with entities.BooleanType, we simply provide another overloaded operation specification in the uispec_validation language:

    operation concepts: EqualsExpression
        one operand type: // the entities.BooleanType!
    operation type: (op, leftOperandType, rightOperandType)->node< > {
        ; // the expressions.BooleanType
    }
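The idea of overloaded operation rules, with an adapter language contributing an additional rule, can be sketched in plain Java. This is only an illustration of the lookup, not how MPS implements it: types are represented as strings, Equallable is simplified to expressions.BooleanType, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.BiPredicate;

// Hypothetical sketch of a table of overloaded operation rules.
class OverloadedOperations {
    // One rule: for an operator and a condition on the operand types,
    // the operation has the given result type.
    static final class Rule {
        final String operator;
        final BiPredicate<String, String> applies;
        final String resultType;
        Rule(String operator, BiPredicate<String, String> applies, String resultType) {
            this.operator = operator;
            this.applies = applies;
            this.resultType = resultType;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void register(String operator, BiPredicate<String, String> applies, String resultType) {
        rules.add(new Rule(operator, applies, resultType));
    }

    // Returns the result type of the first matching rule, or empty if no
    // rule applies (the typing rule would then report an error).
    Optional<String> operationType(String operator, String left, String right) {
        for (Rule r : rules) {
            if (r.operator.equals(operator) && r.applies.test(left, right)) {
                return Optional.of(r.resultType);
            }
        }
        return Optional.empty();
    }

    static OverloadedOperations withExampleRules() {
        OverloadedOperations ops = new OverloadedOperations();
        String exprBool = "expressions.BooleanType";
        String entityBool = "entities.BooleanType";
        // Rule from the expressions language (Equallable simplified to BooleanType).
        ops.register("==", (l, r) -> l.equals(exprBool) && r.equals(exprBool), exprBool);
        // Rule contributed by uispec_validation: one operand is entities.BooleanType.
        ops.register("==", (l, r) -> l.equals(entityBool) || r.equals(entityBool), exprBool);
        return ops;
    }
}
```

With both rules registered, Employee.freelancer == false resolves to expressions.BooleanType, while a comparison between an integer and a boolean still finds no rule.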

■ Generator The generator has to create BaseLanguage code, which is then subsequently transformed into Java text. To deal with the transformation of the expressions language, we can do one of two things:

• Either we can use the expressions language’s existing to-text generator and wrap the expressions in some kind of TextHolderStatement30.

• Alternatively, we can write a (reusable) transformation from expressions code to BaseLanguage code; these rules would get used as part of the transformation of uispec and uispec_validation code to BaseLanguage.

30 Remember that we cannot simply embed text in BaseLanguage, since that would not work structurally: no concept in BaseLanguage expects "text" as children: a wrapper is necessary.

Since many DSLs will probably transform code to BaseLanguage, it is worth the effort to write a reusable generator from expressions to BaseLanguage31. So we choose this second alternative. The generated Java code is multi-sourced, since it is generated by two independent code generators.

31 In fact, it would be useful if a simple language with types and expressions came with MPS. This language could either be used as part of BaseLanguage as well (so no transformation would be needed) or the transformation to BaseLanguage could ship with MPS as well.

Expression constructs from the reusable expressions language and those of BaseLanguage are almost identical, so this generator is trivial. We create a new language project expressions.blgen and add reduction rules32. Fig. 16.16 shows some of these reduction rules.

32 In MPS all "meta stuff" is called a language. So even though expressions.blgen only contains a generator it is still a language in MPS terminology.

Figure 16.16: A number of reduction rules that map the reusable expressions language to BaseLanguage (Java). Since the languages are very similar, the mapping is trivial. For example, a PlusExpression is mapped to a + in Java, the left and right arguments are reduced recursively through the COPY_SRC macro.

In addition, we also need reduction rules for the new expressions that we have added specifically in the uispec_validation language (AttrRefExpression, isSetExpression, LengthOf). These transformations are defined in uispec_validation, since this language is not reusable – it is specifically designed to integrate the uispec and the expressions languages. As an example, Fig. 16.17 shows the rule for handling the AttrRefExpression. The validation code itself is "injected" into the UI form via the same placeholder reduction as in the case of the language rbac_entities.

Figure 16.17: References to entity attributes are mapped to a call to their getter method. The template fragment (inside the <TF ... TF>) uses two reference macros (->$) to "rewire" the object reference to the Java bean instance, and the toString method call to a call to the getter.

Language extension can also be used to prevent the use of specific base language concepts in the sublanguage, possibly in certain contexts. As an example, we restrict the use of some operators provided by the reusable expression language inside validation rules in uispec_validation. This can be achieved by implementing a can be ancestor constraint on ValidatedField.

    can be ancestor:
        (operationContext, scope, node, childConcept)->boolean {
            return !(childConcept == concept/GreaterEqualsExpression/ ||
                     childConcept == concept/LessEqualsExpression/);
        }

16.2.7 Annotations

In a projectional editor, the CS of a program is projected from the AST. A projectional system always goes from AS to CS, never from CS to AS (as parsers do). This means that the CS does not have to contain all the data necessary to build the AST (which is necessary in the case of parsers). This has two consequences:

• A projection may be partial, in the sense that the AS contains data that is not shown in the CS. The information may, for example, only be changeable via intentions (discussed in Section 7.7), or the projection rule may project some parts of the program only in some cases, controlled by some kind of configuration data.

• It is also possible to project additional CS that is not part of the CS definition of the original language. Since the CS is never used as the information source, such additional syntax does not confuse the tool (in a parser-based tool the grammar would have to be changed to take into account this additional syntax to avoid derailing the parser).

In this section we discuss the second alternative, since it constitutes a form of language composition: the additional CS is composed with the original CS defined for the language. The mechanism MPS uses for this is called annotations. We have seen annotations when we discussed templates33: an annotation is something that can be attached to arbitrary program elements and can be shown together with the CS of the annotated element. In this section we use annotations to implement an alternative approach for the entity-to-database mapping. Using this approach, we can store the mapping from entity attributes to database columns directly in the Entity, resulting in the following code:

    module company

    entity Employee {
        id : int -> People.id
        name : string -> People.name
        role : string -> People.role
        worksAt : Department -> People.departmentID
        freelancer : boolean -> People.isFreelancer
    }

    entity Department {
        id : int -> Departments.id
        description : string -> Departments.descr
    }

33 The generator macros $, ->$ or COPY_SRC are implemented via the annotation mechanism.

This is a heterogeneous fragment, consisting of code from the entities language, as well as the annotation code (e.g., -> People.id). From a CS perspective, the column mapping is "embedded" in the Entity. In the AST the mapping information is also actually stored in the entities model. However, the definition of the entities language does not know that this additional information is stored and projected "inside" entities. No modification to the entities language is necessary.

■ Structure and Syntax We define an additional language relmapping_annotations that extends the entities language as well as the relmapping language. In this language we define the following concept:

    concept AttrToColMapping extends NodeAnnotation
        references:
            Column column 1
        properties:
            role = colMapping
        concept links:
            annotated = Attribute

The AttrToColMapping concept extends NodeAnnotation, a concept predefined by MPS34. Concepts that extend the concept NodeAnnotation have to provide a role property and an annotated concept link. Structurally, an annotation is a child of the node it annotates. So the Attribute has a new child of type AttrToColMapping, and the reference that contains the child is called @colMapping – the value of the role property prefixed by an @. The annotated concept link points to the concept to which this annotation can be added. AttrToColMappings can be annotated to instances of Attribute.

34 In fact, this concept is called NodeAttribute in MPS. For historical reasons there is a somewhat confused terminology around "attribute" and "annotation". We’ll stick with the term "annotation" in this chapter.

Figure 16.18: The editor for the AttrToColMapping embeds the editor of the concept it is annotated to (using the attributed node cell). It then projects the reference to the referenced column. This gives the editor of the annotation control of if and how the editor of the annotated element is projected.

While structurally the annotation is a child of the annotated node, in the CS the relationship is reversed: the editor for AttrToColMapping wraps the editor for Attribute, as Fig. 16.18 shows.

Since the annotation is not part of the original language, it cannot just be typed in: it must be attached to nodes via an intention. The intention simply adds a new instance of AttrToColMapping to the @colMapping property of an Attribute, so we don’t show the code here.

■ Type System The same typing rules are necessary as in the relmapping_entities language described previously. They reside in relmapping_annotations.

■ Generator The generator is also broadly similar to the previous example with relmapping_entities. It takes the entities model as the input, and then uses the column mappings in the annotations to create the entity-to-database mapping.

The annotations introduced above were typed to be specific to certain target concepts (Attribute in this case). A particularly interesting use of annotations includes those that can be annotated to any language concept (formally targeting BaseConcept). In this case, there is no dependency between the language that contains the annotation and the language that is annotated. This is very useful for "meta data", as well as anything that can be processed generically.

16.3 Xtext Example

This section of the book has been written together with Christian Dietrich. Contact him via [email protected].

In this section we look at an example roughly similar to the one for MPS discussed in the previous section. We start out with a DSL for entities. Here is an example program:

    module company {
        entity Employee {
            id : int
            name : string
            role : string
            worksAt : Department
            freelancer : boolean
        }
        entity Department {
            id : int
            description : string
        }
    }

The grammar is straightforward and should be clear if you have read the implementation part of this book so far.

    grammar org.xtext.example.lmrc.entity.EntityDsl
        with org.eclipse.xtext.common.Terminals

    generate entityDsl "http://www.xtext.org/example/lmrc/entity/EntityDsl"

Module: "module" name=ID "{" entities+=Entity* "}";

Entity: "entity" name=ID "{" attributes+=Attribute* "}";

Attribute: name=ID ":" type=AbstractType;

Named: Module|Entity|Attribute;

AbstractType: BooleanType|IntType|StringType|EntityReference;

BooleanType: {BooleanType} "boolean";

IntType: {IntType} "int";

StringType: {StringType} "string";

EntityReference: ref=[Entity|FQN];

FQN: ID ("." ID)*;

16.3.1 Referencing

Referencing describes the case in which programs written in one DSL reference (by name) program elements written in another DSL35. The example we use is the UI specification language, in which a Form defined in the UI model refers to Entities from the language defined above, and Fields in a form refer to entity Attributes. Here is some example code:

35 Both programs reside in different fragments and no syntactic composition is required.

    form CompanyStructure
        uses Department // reference to Department Entity
        uses Employee // reference to Employee Entity

        field Name: textfield(30) -> Employee.worksAt
        field Role: combobox(Boss, TeamMember) -> Employee.role
        field Freelancer: checkbox -> Employee.freelancer
        field Office: textfield(20) -> Department.description

■ Structure Referencing concepts defined in another language relies on importing the target meta model and then defining references to concepts defined in this meta model36. Here is the header of the grammar of the uispec language:

    grammar org.xtext.example.lmrc.uispec.UispecDsl
        with org.eclipse.xtext.common.Terminals

    import "http://www.xtext.org/example/lmrc/entity/EntityDsl" as entity
    generate uispecDsl "http://www.xtext.org/example/lmrc/uispec/UispecDsl"

36 Note that the references to entities and fields do not technically reference into an entity source file. Instead, these references refer to the EMF objects in the AST that has been parsed from the source file. So, a similar approach can be used to refer to other EObjects. It does not matter whether these are created via Xtext or not. This is reflected by the fact that the grammar of the uispec language does not refer to the grammar of the entity language, but to the derived meta model.

Importing a meta model means that the respective meta classes can now be used. Note that the meta model import does not make the grammar rules visible, so the meta classes can only be used in references and as base types (as we will see later). In the case of referencing, we use them in references:

    EntityReference:
        "uses" entity=[entity::Entity|FQN];

    Field:
        "field" label=ID ":" widget=Widget
        "->" attribute=[entity::Attribute|FQN];

To make this work, no change is required in the entities language37. However, the workflow generating the uispec language has to be changed. The genmodel file for the meta model has to be registered in the StandaloneSetup38.

    bean = StandaloneSetup {
        ...
        registerGenModelFile = "platform:/resource/org.xtext.example.lmrc.entity/src-gen/org/xtext/example/lmrc/entity/EntityDsl.genmodel"
    }

37 This is true as long as the referenced elements are in the index. The index is used by Xtext to resolve references against elements that reside in a different model file. By default, all elements that have a name attribute are in the index. Entity and Attribute have names, so this works automatically.

38 This is necessary so that the EMF code generator, when generating the meta classes for the uispec language, knows where the generated Java classes for the entities languages are located. This is an EMF technicality and we won’t discuss it in any further detail.

We have to do one more customization to make the language work smoothly. The only Attributes that should be visible are those from the entities referenced in the current Form’s uses clauses, and they should be referenced with a qualified name (Employee.role instead of just role). Scoping has to be customized to achieve this:

    public IScope scope_Field_attribute(Field context, EReference ref) {
        Form form = EcoreUtil2.getContainerOfType(context, Form.class);
        // collect the attributes of all entities named in the form's uses clauses
        List<Attribute> visibleAttributes = new ArrayList<Attribute>();
        for (EntityReference useClause : form.getUsedEntities()) {
            visibleAttributes.addAll(useClause.getEntity().getAttributes());
        }
        // compute the qualified name (Entity.attribute) for each attribute
        Function<Attribute, QualifiedName> nameComputation =
            new Function<Attribute, QualifiedName>() {
                @Override
                public QualifiedName apply(Attribute a) {
                    return QualifiedName.create(
                        ((Entity) a.eContainer()).getName(), a.getName());
                }
            };
        return Scopes.scopeFor(visibleAttributes, nameComputation, IScope.NULLSCOPE);
    }

This scoping function performs two tasks: first, it finds all the Attributes of all used entities. We collect them into a list visibleAttributes. The second part of the scoping function defines a Function object39 that represents a function from Attribute to QualifiedName. In the implementation method apply we create a qualified name made from two parts: the entity name and the attribute name (the dot between the two is default behavior for the QualifiedName class). When we create the scope itself in the last line we pass in the list of attributes, as well as the function object. Xtext’s scoping framework uses the function object to determine the name by which each of the attributes is referenceable from this particular context.

39 Note how we have to use the ugly function object notation, because Java does not provide support for closures or lambdas at this point! Alternatively you could do this with Xtend, which does support closures.
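Java 8 and later do provide lambdas, so the name computation can be written more compactly today. The following self-contained sketch illustrates the two tasks of the scoping function without depending on Xtext: the Entity and Attribute classes are simple stand-ins for the generated EMF classes, the scope is modeled as a map from qualified name to attribute, and the names ScopeSketch and scopeFor are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Stand-ins for the generated EMF classes; this is not the Xtext API.
class ScopeSketch {
    static final class Entity {
        final String name;
        final List<Attribute> attributes = new ArrayList<>();
        Entity(String name) { this.name = name; }
        // Adds an attribute and returns this entity to allow chaining.
        Entity attribute(String attrName) {
            attributes.add(new Attribute(this, attrName));
            return this;
        }
    }

    static final class Attribute {
        final Entity container;
        final String name;
        Attribute(Entity container, String name) {
            this.container = container;
            this.name = name;
        }
    }

    // Task 1: collect the attributes of all used entities.
    // Task 2: make each one referenceable under "Entity.attribute".
    static Map<String, Attribute> scopeFor(List<Entity> usedEntities) {
        Function<Attribute, String> nameComputation =
            a -> a.container.name + "." + a.name;
        Map<String, Attribute> scope = new LinkedHashMap<>();
        for (Entity e : usedEntities) {
            for (Attribute a : e.attributes) {
                scope.put(nameComputation.apply(a), a);
            }
        }
        return scope;
    }
}
```

A form that only uses Employee would then see Employee.role, but neither the unqualified role nor any attribute of Department.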

■ Type System As we discussed in Section 20.2, dealing with type systems in the referencing case is not particularly challenging, since the type system of the referencing language can be built with knowledge of the type system of the referenced language.

■ Generators The same is true for generators. Typically they just share knowledge about the naming of generated code elements.

16.3.2 Reuse

Referencing concerns the case in which the referencing language is built with knowledge about the referenced language, so it can have direct dependencies. In the example above, the uispec language uses Entity and Attribute from the entities language. It directly imports the meta model, so it has a direct dependency. In the case of reuse, such a direct dependency is not allowed. Our goal is to combine two independent languages. To illustrate this case, we again use the same example as in the MPS section.

■ Structure We first introduce a db language, a trivial DSL for defining relational table structures. These can optionally be mapped to a data source, but the language makes no assumption about how this data source looks (and which language is used to define it). Consequently, the grammar has no dependency on any other, and imports no other meta model:

    grammar org.xtext.example.lmrc.db.DbDsl
        with org.eclipse.xtext.common.Terminals

    generate dbDsl "http://www.xtext.org/example/lmrc/db/DbDsl"

Root: Database;

Database: "database" name=ID tables+=Table*;

Table: "table" name=ID columns+=Column* ;

Column: type=AbstractDataType name=ID (mapper=AbstractColumnMapper)?;

AbstractColumnMapper: {AbstractColumnMapper}"not mapped";

AbstractDataType: CharType | NumberType;

CharType: {CharType}"char";

NumberType: {NumberType}"number";

Just as in the MPS example, the Column rule has an optional mapper property of type AbstractColumnMapper. Since it is not possible to explicitly mark rules as generating abstract meta classes, we simply define the syntax to be not mapped40. This language has been designed for reuse, because it has this hook AbstractColumnMapper, which can be customized later. But the language is still independent. In the next step, we want to be able to reference Attributes from the entities language:

    database CompanyDB

    table Departments
        number id <- Department.id
        char descr <- Department.description

    table People
        number id <- Employee.id
        char name <- Employee.name
        char role <- Employee.role
        char isFreelancer <- Employee.freelancer

40 Since the mapper property in Column is optional, you don’t ever have to type this.

To make this possible, we create a new language db2entity that extends the db language and references the entities language41. This is reflected by the header of the db2entity language (notice the with clause):

    grammar org.xtext.example.lmrc.db2entity.Db2EntityDsl
        with org.xtext.example.lmrc.db.DbDsl

    import "http://www.xtext.org/example/lmrc/db/DbDsl" as db
    import "http://www.xtext.org/example/lmrc/entity/EntityDsl" as entity
    generate db2EntityDsl "http://www.xtext.org/example/lmrc/db2entity/Db2EntityDsl"

41 Notice that we only extend DbDsl. The entities meta model is just referenced. This is because Xtext can only extend one base grammar. For this reason we cannot embed language concepts from the entities language in a db2entity program, we can only reference them. However, for this particular example, this is sufficient.

We now have to override the AbstractColumnMapper rule defined in the db language:

AbstractColumnMapper returns db::AbstractColumnMapper: {EntityColumnMapper} "<-" entity=[entity::Attribute|FQN];

We create a rule that has the same name as the rule in the supergrammar. So when the new grammar calls the AbstractColumnMapper rule, our new definition is used. Inside, we define the new syntax we would like to use, and as part of it, we reference an Attribute from the imported entity meta model. We then use the {EntityColumnMapper} action to force instantiation of an EntityColumnMapper object: this also implicitly leads to the creation of an EntityColumnMapper class in the generated db2entity meta model. Since our new rule returns a db::AbstractColumnMapper, this new meta class extends AbstractColumnMapper from the db meta model – which is exactly what we need42.

42 There are two more things we have to do to make it work. First, we have to register the two genmodel files in the db2entity’s StandaloneSetup bean in the workflow file. Second, we have to address the fact that in Xtext, the first rule in a grammar file is the entry rule for the grammar, i.e. the parser starts consuming a model file using this rule. In our db2entity grammar, the first rule is AbstractColumnMapper, so this won’t work. We simply copy the first rule (Root) from the db language.

■ Type System The primary task of the type system in this example would be mapping the primitive types used in the entities language to those used in the db language to make sure we only map those fields to a particular column that are type-compatible. Just like the column mapper itself, this code lives in the adapter language. It is essentially just a constraint that checks for type compatibility.
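Such a compatibility constraint might look roughly like the following Java sketch. The text does not spell out the type mapping; the table below is an assumption read off the example db2entity program (number id <- Department.id, char descr <- Department.description, char isFreelancer <- Employee.freelancer). Types are represented by their keyword strings, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the type compatibility check in the adapter language.
class TypeCompatibility {
    // Assumed mapping from entities types to db column types, derived from
    // the example program rather than from a documented rule.
    static final Map<String, String> ENTITY_TO_DB = Map.of(
        "int", "number",
        "string", "char",
        "boolean", "char");

    // Returns an error message if the attribute type cannot be stored in a
    // column of the given type, or empty if the mapping is compatible.
    static Optional<String> check(String columnName, String columnType,
                                  String attributeName, String attributeType) {
        String expected = ENTITY_TO_DB.get(attributeType);
        if (expected == null) {
            return Optional.of("attribute " + attributeName + " of type "
                + attributeType + " cannot be mapped to a column");
        }
        if (!expected.equals(columnType)) {
            return Optional.of("column " + columnName + " has type " + columnType
                + ", but attribute " + attributeName + " requires " + expected);
        }
        return Optional.empty();
    }
}
```

In an Xtext validator, a check of this kind would be invoked for every Column that has an EntityColumnMapper, and a non-empty result would be reported as an error on the column.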

 Generator Let us assume there is a generator that generates Java Beans from the entities. Further, we assume that there is a generator that generates all the persistence management code from DbDsl programs, except the part of the code that fetches the data from whatever the data source is. Essentially we leave the same "hole" as we do with the AbstractColumnMapper in the grammar. And just in the same way as we define the EntityColumnMapper in the adapter language, we have to adapt the executing code. We can use two strategies. The first one uses the composition techniques of the target language, i.e. Java. The DbDsl generator could, for example, generate an abstract class that has an abstract method getColumnData for each of the table columns. The generator for the adapter language would generate a concrete subclass that implements these methods to grab the data from entities43. This way the modularity (entities, db, db2entity) is propagated into the generated artifacts as well. No generator composition is required44.

43 This is how we did it in the MPS example.

44 In a Java/Enterprise world this would most likely be the way we’d do it in practice. The next alternative is a bit constructed.

However, consider a situation in which we have to generate inlined code, for reasons of efficiency, e.g., in some kind of embedded system. In this case the DbDsl generator would have to be built in an extensible way. Assuming we use Xtend for generation, this can be done easily by using dependency injection45.

45 Sometimes people complain about the fact that Xtend is a general purpose language, and not some dedicated code generation language. However, the fact that one can use abstract classes, abstract methods and dependency injection is a nice example of how and why a general purpose language (with some dedicated support for templating) is useful for building generators.

Here is how you would do that:

• In the generator that generates persistence code from a DbDsl program, the code that generates the inlined "get data for column" code delegates to a class that is dependency-injected46. The Xtend class we delegate to would be an abstract class that has one abstract method generateGetDataCodeFor(Column c).

46 This is a nice illustration of building a generator that is intended to be extended in some way.

    abstract class GetDataGenerator {
        def String generateGetDataCodeFor(Column c)
    }

    class DbDslGenerator implements IGenerator {

        @Inject GetDataGenerator gdg

        def someGenerationMethod(Column c) {
            // ...
            val getDataCode = gdg.generateGetDataCodeFor(c)
            // then embed getDataCode somewhere in the
            // template that generates the DbDsl code
        }
    }

• The generator for the adapter language would contain a subclass of this abstract class that implements the generateGetDataCodeFor generator method in a way suitable to the entities language.

• The adapter language would also set up Google Guice dependency injection in such a way as to use a singleton instance of this subclass when instances of the abstract class are expected.
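The three steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions: Column is a hypothetical stand-in for the db meta class, EntityGetDataGenerator and the getter naming scheme are invented for the example, and the delegate is passed through the constructor instead of being injected by Guice so that the sketch has no dependencies.

```java
// Hypothetical stand-in for the Column meta class of the db language.
final class Column {
    final String name;
    Column(String name) { this.name = name; }
}

// The extension point: the base generator only knows this abstract class.
abstract class GetDataGenerator {
    abstract String generateGetDataCodeFor(Column c);
}

// Base generator; the delegate is handed in (Guice would inject it instead).
final class DbDslGenerator {
    private final GetDataGenerator gdg;

    DbDslGenerator(GetDataGenerator gdg) { this.gdg = gdg; }

    String generateColumnAccess(Column c) {
        // The inlined "get data" code is embedded into the surrounding template.
        return "Object " + c.name + "Value = " + gdg.generateGetDataCodeFor(c) + ";";
    }
}

// Contributed by the adapter language: knows how to read data from entities.
final class EntityGetDataGenerator extends GetDataGenerator {
    @Override
    String generateGetDataCodeFor(Column c) {
        // Assumed naming scheme: column "name" is read via entity.getName().
        String getter = "get" + Character.toUpperCase(c.name.charAt(0)) + c.name.substring(1);
        return "entity." + getter + "()";
    }
}
```

With Guice, the constructor wiring would be replaced by a binding in the adapter language's module that maps GetDataGenerator to the entity-specific subclass in singleton scope. The base generator stays unaware of which subclass it receives.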

16.3.3 Extension

We have already seen the mechanics of extension in the previous example, since, as a way of building the reuse infrastructure, we have extended the db language. In this section we look at extension in more detail. Extension is defined as syntactic integration with explicit dependencies. However, as we discussed in Section 4.6.2 there are two use cases that feel different:

1. In one case we provide (small scale, local, fine grained) additional syntax to an otherwise unchanged language47.

2. The other case is where we create a completely new language, but reuse some of the syntax provided by the base language. This use case feels like embedding (we embed syntax from the base language in our new language), but with regard to the classification according to syntactic integration and dependencies, it is still extension. Embedding would prevent explicit dependencies. In this section we look at extension with an embedding flavor.

47 The db2entity language shown above is an example of this. The db2entity programs look essentially like programs written in the db base language, but in one (or few) particular place, something is different. In the example, the syntax for referencing attributes is such a small scale change.

To illustrate an extension-with-embedding flavor, we will show how to embed Xbase expressions in a custom DSL. Xbase is a reusable expression language that provides primitive types, various unary and binary operators, functions and closures. As we will see, it is very tightly integrated with Java48. As an example, we essentially create another entity language; thanks to Xbase, we will be able to write:

    entity Person {
        lastname : String
        firstname : String
        String fullName(String from) {
            return "Hello " + firstname + " " + lastname + " from " + from
        }
    }

48 This is a mixed blessing. As long as you stay in a JVM world (use Java types, generate Java code), many things are very simple. However, as soon as you go outside of the JVM world, a lot of things become quite complex, and it is questionable whether using Xbase makes sense in this case at all.

Below is the essential part of the grammar. Note how it extends the Xbase grammar (the with clause) and how it uses various elements from Xbase throughout the code (those whose names start with an X).

    grammar org.xtext.example.lmrc.entityexpr.EntityWithExprDsl
        with org.eclipse.xtext.xbase.Xbase

    generate entityWithExprDsl
        "http://www.xtext.org/example/lmrc/entityexpr/EntityWithExprDsl"

    Module:
        "module" name=ID "{"
            entities+=Entity*
        "}";

    Entity:
        "entity" name=ID "{"
            attributes+=Attribute*
            operations+=Operation*
        "}";

    Attribute:
        name=ID ":" type=JvmTypeReference;

    Operation:
        type=JvmTypeReference name=ID
        "(" (parameters+=FullJvmFormalParameter
            (',' parameters+=FullJvmFormalParameter)*)? ")"
        body=XBlockExpression;

Let’s look at some of the details. First, the type properties of the Attribute and the Operation are not defined by our grammar; instead we use a JvmTypeReference. This makes all Java types legal at this location49. We use an XBlockExpression as the body of Operation, which essentially allows us to use the full Xbase language inside the body of the Operation. To make the parameters visible, we use the FullJvmFormalParameter rule50.

49 Limiting this to the primitive types (or some other subset of the JVM types) requires a scoping rule.

50 Above we wrote that Xbase is tightly integrated with the JVM and Java. The use of FullJvmFormalParameter and JvmTypeReference is a sign of this. However, the next piece of code makes this even clearer.

In addition to using Xbase language concepts in the definition of our grammar, we also tie the semantics of our language to Java and the JVM. To do this, the JvmModelInferrer, shown below, maps a model expressed with this language to a structurally equivalent Java "model". By doing this, we get a number of benefits "for free", including scoping, typing and a code generator. Let us look at this crucial step in some detail.

    class EntityWithExprDslJvmModelInferrer extends AbstractModelInferrer {

        @Inject extension IQualifiedNameProvider
        @Inject extension JvmTypesBuilder

        def dispatch void infer(Entity entity,
                                IAcceptor acceptor,
                                boolean isPrelinkingPhase) {
            ...
        }
    }

This Xtend class extends AbstractModelInferrer and implements its infer method to create structurally equivalent Java code as an EMF tree, and registers it with the acceptor. The method is marked as dispatch, so it can be polymorphically overwritten for various language concepts. We override it for the Entity concept. We have also injected the IQualifiedNameProvider and JvmTypesBuilder. The latter provides a builder API for creating all kinds of JVM objects, such as fields, setters, classes or operations. The next piece of code makes use of such a builder:

    acceptor.accept(
        entity.toClass( entity.fullyQualifiedName ) [
            documentation = entity.documentation
            ...
        ]
    )

At the top level, we map the Entity to a Class. toClass is one of the builder methods defined in the JvmTypesBuilder. The class we create should have the same name as the entity; the name of the class is passed into the constructor. The second argument, written conveniently behind the parentheses, is a closure. Inside the closure, we set the documentation of the created class to be the documentation of the entity51. Next we create a field, a getter and a setter for each of the attributes of the Entity and add them to the Class’ members collection:

    for ( attr : entity.attributes ) {
        members += attr.toField(attr.name, attr.type)
        members += attr.toGetter(attr.name, attr.type)
        members += attr.toSetter(attr.name, attr.type)
    }

51 Inside a closure, there is a variable it that refers to the target element (the class in this case). It is possible to omit the it, so when we write documentation = ... this actually means it.documentation = ....

toField, toGetter and toSetter are all builders contributed by the JvmTypesBuilder. To better understand what they do, here is the implementation of toSetter52.

    public JvmOperation toSetter(EObject sourceElement, final String name,
                                 JvmTypeReference typeRef) {
        JvmOperation res = TypesFactory.eINSTANCE.createJvmOperation();
        res.setVisibility(JvmVisibility.PUBLIC);
        res.setSimpleName("set" + nullSaveName(Strings.toFirstUpper(name)));
        res.getParameters().add(toParameter(sourceElement,
                nullSaveName(name), cloneWithProxies(typeRef)));
        if (name != null) {
            setBody(res, new Functions.Function1() {
                public CharSequence apply(ImportManager p) {
                    return "this." + name + " = " + name + ";";
                }
            });
        }
        return associate(sourceElement, res);
    }

52 Note that the first argument supplied by the object in front of the dot, i.e. the Attribute, is passed in as the first argument, sourceElement.

The method first creates a JvmOperation and sets the visibility and the name. It then creates a parameter that uses the typeRef passed in as the third argument as its type. As you can see, all of this happens via model-to-model transformation. This is important, because these created objects are used implicitly in scoping and typing. The body, however, is created textually; it is not needed for scoping or typing: it is used only in code generation53. The last line is important: it associates the source element (the Attribute in our case) with the created element (the setter Operation we just created). As a consequence of this association, the Xbase scoping and typing framework can work its magic of providing support for our DSL without any further customization!

53 Since that is a to-text transformation anyway, it is good enough to represent the body of the setter as text already at this level.

Let’s now continue our look at the implementation of the JvmModelInferrer for the Entity. The last step before our detour was that we created fields, setters and getters for all attributes of our Entity. We have to deal with the operations of our Entity next.

    for ( op : entity.operations ) {
        members += op.toMethod(op.name, op.type) [
            for (p : op.parameters) {
                parameters += p.toParameter(p.name, p.parameterType)
            }
            body = op.body
        ]
    }

This code should be easy to understand. We create a method for each Operation using the respective builder method, pass in the name and type, create a parameter for each of the parameters of our source operation and then assign the body of the created method to be the body of the operation in our DSL program. The last step is particularly important. Notice that we don’t clone the body, we assign the object directly. Looking into the setBody method (the assignment is actually mapped to a setter in Xtend), we see the following:

    void setBody(JvmExecutable logicalContainer, XExpression expr) {
        if (expr == null) return;
        associator.associateLogicalContainer(expr, logicalContainer);
    }

The associateLogicalContainer method is what makes the automatic support for scoping and typing happen54:

• Because the operation is the container of the expression, the expression’s type and the operation’s type must be compatible.

• Because the expression(s) live inside the operation, the parameters of the operation, as well as the current class’s fields, setters and getters are in scope automatically.

54 This approach of mapping a DSL to Java "code" via this model transformation works nicely as long as it maps to Java code in a simple way. In the above case of entities, the mapping is trivial and obvious. If the semantic gap becomes bigger, the JvmModelInferrer becomes more complicated. However, what is really nice is this: within the model inferrer, you can of course use Xtend’s template syntax to create implementation code. So it is easy to mix model transformation (for those parts of a mapping that are relevant to scoping and type calculation) and then use traditional to-text transformation using Xtend’s powerful template syntax for the detailed implementation aspects.

 Generator The JVM mapping shown above already constitutes the full semantic mapping to Java. We map entities to Java classes and fields to members and getters/setters. We do not have to do anything else to get a generator: we can reuse the existing Xbase-to-Java code generator.

If we build a language that cannot easily be mapped to a JVM model, we can still reuse the Xbase expression compiler, by injecting the JvmModelGenerator and then delegating to it at the respective granularity. You can also change or extend the behavior of the default JvmModelGenerator by overriding its _internalDoGenerate(EObject, IFileSystemAccess) method for your particular language concept55.

55 Notice that you also have to make sure via Guice that your subclass is used, and not the default JvmModelGenerator.

 Extending Xbase In the above example we embedded the (otherwise unchanged) Xbase language into a simple DSL. Let’s now look at how to extend Xbase itself by adding new literals and new operators. We start by defining a literal for dates:

    XDateLiteral:
        'date' ':' year=INT '-' month=INT '-' day=INT;

These new literals should be literals in terms of Xbase, so we have to make them subtypes of XLiteral. Notice how we override the XLiteral rule defined in Xbase. We have to repeat its original contents; there is no way to "add" to the literals56.

    XLiteral returns xbase::XExpression:
        XClosure | XBooleanLiteral | XIntLiteral | XNullLiteral |
        XStringLiteral | XTypeLiteral | XDateLiteral;

56 Similarly, if you want to remove concepts, you have to overwrite the rule with the concept to be removed missing from the enumeration.

We use the same approach to add an additional operator that uses the === symbol57:

    OpEquality:
        '==' | '!=' | '===';

57 The triple equals represents identity.

The === operator does not yet exist in Xtend, so we have to specify the name of the method that should be called if the operator is used in a program58. The second line of the initializeMapping method maps the new operator to a method named operator_identity:

    public class DomainModelOperatorMapping extends OperatorMapping {

        public static final QualifiedName IDENTITY = create("===");

        @Override
        protected void initializeMapping() {
            super.initializeMapping();
            map.put(IDENTITY, create("operator_identity"));
        }
    }

58 Xtend supports operator overloading by mapping operators to methods that can be overridden.

We implement this method in a new class named ObjectExtensions2, shown here59:

    public class ObjectExtensions2 {
        public static boolean operator_identity(Object a, Object b) {
            return a == b;
        }
    }

59 The existing class ObjectExtensions contains the implementations for the existing == and != operators, hence the name.
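To see why a separate identity operator is useful, the following plain-Java sketch contrasts reference identity with value equality. The class and method names are hypothetical; operatorIdentity has the same body as operator_identity above, and operatorEquals only approximates what == means in Xtend, on the assumption that it delegates to equals.

```java
final class IdentityDemo {
    // Same body as operator_identity in the text: reference comparison.
    static boolean operatorIdentity(Object a, Object b) {
        return a == b;
    }

    // For contrast: null-safe value equality (assumed to be what == does in Xtend).
    static boolean operatorEquals(Object a, Object b) {
        return java.util.Objects.equals(a, b);
    }
}
```

Two distinct String objects with the same characters are equal under operatorEquals but not under operatorIdentity; an object is always identical to itself.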

Through the operator_identity operation, we have expressed all the semantics: the Xbase generator will generate a call to that operation in the generated Java code60. We have also implicitly specified the typing rules: through the mapping to the operator_identity operation, the type system uses the types specified in this operation. The type of === is boolean, and there are no restrictions on the two arguments; they are typed as java.lang.Object61.

60 Alternatively, as a performance improvement, you could use the @Inline annotation to inline the function in the generated code.

61 If customizations are required, these could be done by overriding the _expectedType operation in XbaseTypeProvider.

We also want to override the existing minus operator for the new date literals to calculate the time between two dates. We don’t have to specify the mapping to a method name, since the mapping for minus is already defined in Xbase. However, we have to provide an overloaded implementation of the operator_minus method for dates:

    public class DateExtensions {
        public static long operator_minus(Date a, Date b) {
            long resInMilliSeconds = a.getTime() - b.getTime();
            return millisecondsToDays( resInMilliSeconds );
        }
    }
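The listing above calls a millisecondsToDays helper that is not shown. A self-contained sketch of one possible implementation follows; the class name DateDifference and the use of TimeUnit are assumptions made for this example, not the book's code.

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

final class DateDifference {
    // One possible implementation of the helper that the text leaves undefined.
    // TimeUnit truncates toward zero, so partial days are dropped.
    static long millisecondsToDays(long millis) {
        return TimeUnit.MILLISECONDS.toDays(millis);
    }

    // Mirrors operator_minus from the text: whole days between b and a.
    static long operatorMinus(Date a, Date b) {
        return millisecondsToDays(a.getTime() - b.getTime());
    }
}
```

Because the result is based on elapsed milliseconds, it counts 24-hour periods rather than calendar days; a language that needs calendar semantics would have to choose a different helper.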

To make Xtend aware of these new classes, we have to register them. To do so, we extend the ExtensionClassNameProvider. It associates the classes that contain the operator implementation methods with the types to which these methods apply:

    public class DomainModelExtensionClassNameProvider
            extends ExtensionClassNameProvider {

        @Override
        protected Multimap<Class<?>, Class<?>> simpleComputeExtensionClasses() {
            Multimap<Class<?>, Class<?>> result =
                super.simpleComputeExtensionClasses();
            result.put(Object.class, ObjectExtensions2.class);
            result.put(Date.class, DateExtensions.class);
            return result;
        }
    }

We now have to extend the type system: it has to be able to derive the types for date literals. We create a type provider that extends the default XbaseTypeProvider62:

62 Don’t forget to register this class with Guice, just like all the other DSL-specific subclasses of framework classes.

    @Singleton
    public class DomainModelTypeProvider extends XbaseTypeProvider {

        @Override
        protected JvmTypeReference type(XExpression expression,
                JvmTypeReference rawExpectation, boolean rawType) {
            if (expression instanceof XDateLiteral) {
                return _type((XDateLiteral) expression, rawExpectation, rawType);
            }
            return super.type(expression, rawExpectation, rawType);
        }

        protected JvmTypeReference _type(XDateLiteral literal,
                JvmTypeReference rawExpectation, boolean rawType) {
            return getTypeReferences().getTypeForName(Date.class, literal);
        }
    }

Finally we have to extend the Xbase compiler so that it can handle date literals:

    public class DomainModelCompiler extends XbaseCompiler {
        protected void _toJavaExpression(XDateLiteral expr, IAppendable b) {
            b.append("new java.text.SimpleDateFormat(\"yyyy-MM-dd\").parse(\"" +
                expr.getYear() + "-" + expr.getMonth() + "-" + expr.getDay() + "\")");
        }
    }

 Active Annotations Xtext’s Xtend language comes with Active Annotations. They use the same syntax as regular Java annotations63. However, they can influence the translation process from Xtend to Java64. Each annotation is essentially associated with a model-to-model transformation that creates the necessary Java code. This allows the execution semantics of the respective Xtend class to be influenced.

63 Java annotations are markers you can attach to various program elements such as classes, fields, methods or arguments. For example, the @Override annotation declares that a method overrides a similar method in the superclass. Another example is @NotNull on an argument, which expresses the fact that that argument may not be null. Annotations may also capture metadata: the @Author( name = .., date = ..) annotation expresses who wrote a class. Annotations may be standardized (e.g. @Override) or may be implemented by users. In the standard case the annotations are typically processed by the compiler (e.g., checking that there actually is a method with the same signature in the superclass or modifying the generated bytecode to report an error if a NotNull argument is null at runtime). Custom annotations are processed in some way by some external tool (e.g. by checking certain properties of the code, or by modifying the class bytecode via some bytecode processor).

64 This feature is actually introduced in version 2.4; at the time of this writing, only prototypes are available, so some details about what I describe here may be different in the final release. This is also why we don’t show source code here.

At the time of this writing, the most impressive active annotation (prototype) I have seen involves GWT programming65. The prototype implements the following two annotations:

65 This has been built by Sven Efftinge and Oliver Zeigermann. The slides that describe the system are here: slidesha.re/Shb3SO. The code is at github: github.com/DJCordhose/todomvc-xtend-gwt.

Services From a simple Xtend class that contains the server-side implementation methods, the annotation generates the necessary remote interface and the other boilerplate that enables the remote communication infrastructure.

UI Forms In GWT, a UI form is defined by an XML file that defines the structure, as well as by a Java class that implements the behavior. The behavior includes the event handlers for the UI elements defined in the XML file. To this end, the class has to have fields that correspond (in name and type) to the UI elements defined in the XML. By using an annotation, this duplication can be avoided: the annotation implementation inspects the associated XML and automatically introduces the necessary fields.

Active annotations will provide a number of additional features. First, they can implement custom validations and quick fixes for the IDE. Second, they can change the scope and the type system, with the IDE being aware of that66. Third, you can pass JVM types or expressions into annotations:

    @Pre( b != 0 ) def divide(int a, int b) {
        return a / b
    }

66 The transformation defined by the active annotation that maps the annotated Xtend construct to Java code is run not just during code generation, but also during editing (like any other JVM model inferrer). Since scoping and the type system of Xtend are based on the inferred JVM model, the annotation transformation can affect these as well.

It is possible to define whether the expression is passed in as an AST (b != 0), or whether the result of the evaluation of the expression is passed in (true or false).

While the syntactic limitations of annotations limit the kinds of language extensions that can be built in this way, the current prototypes show that nonetheless some quite interesting language extensions are possible67.

67 This is due in particular to the fact that the IDE is aware of the transformation (influencing typing and code completion) as a consequence of the real-time transformation with a JvmModelInferrer.

16.3.4 Embedding

Embedding is not supported by Xtext. The reason is that, as we can see from Section 4.6.4, the adapter language would have to inherit from two base languages. However, Xtext only supports extending one base grammar.

We have shown above how to embed Xbase expressions into a custom DSL. However, as we have discussed, this is an example of extension with embedding flavor: we create a new DSL into which we embed the existing Xbase expressions. So we only have to extend from one base language, namely Xbase. An example of embedding would be to take an existing, independent SQL language and embed it into the entity DSL created above. This is not possible.

The same is true for the combination (in the same program) of several independently developed extensions to the same base language. In that case, too, the composite grammar would have to inherit from several base languages68.

68 While this sounds like a rather academic problem to have, the mbeddr case study referred to throughout this book shows where and how the combination of independent language extensions is useful in practice. mbeddr could not have been built with Xtext for this reason.

16.4 Spoofax Example

In this section we look at an example roughly similar to that for MPS and Xtext discussed in the previous sections. We start with Mobl’s data modeling language, which we have already seen in previous chapters.

To understand some of the discussions later, we first have to understand how Spoofax organizes languages. In Spoofax, language definitions are typically modularized (they declare their module at the top of a file). For example, Mobl’s syntax definition comes with a module for entities, which imports modules for statements and expressions:

    module MoblEntities
    imports MoblStatements MoblExpressions

All syntax definition modules reside in the syntax directory of a Spoofax project. Typically, subdirectories are used to organize the modules of different sublanguages. For example, we can have subdirectories entity for Mobl’s entity definition language, screen for Mobl’s screen definition language, and common for definitions shared by both languages:

    module entity/MoblEntities
    imports entities/MoblStatements entities/MoblExpressions

    module screen/MoblScreens
    imports common/Lexical

    module common/MoblExpressions
    imports common/Lexical

As the example shows, the directory structure is reflected in module names. You can read them as relative paths from the syntax directory to the module.

Similarly to syntax definitions, rewrite rules for program analysis, editor services, program transformation, and code generation are organized in modules, which are imported from Mobl’s main module. The various modules for program analysis, editor services and program transformation are organized in subdirectories:

    module mobl
    imports
        analysis/names analysis/types analysis/checks
        editor/complete editor/hover editor/refactor
        trans/desugar trans/normalize
        generate

16.4.1 Referencing

We will illustrate references to elements written in another DSL with Mobl’s screen definition language. The following code uses Mobl’s sublanguage for data definition. It defines an entity Task with some properties69.

    entity Task {
        name : String
        description : String
        done : Bool
        date : DateTime
    }

69 The example should look familiar: we have discussed the language in several examples throughout the book already.

The next piece of code shows a screen definition written in Mobl’s screen definition language. It defines a root screen for a list of tasks, using the name of a Task as a label for list elements.

    screen root() {
        header("Tasks")
        group {
            list(t in Task.all()) {
                item { label(t.name) }
            }
        }
    }

There are two references to the data model: Task refers to an Entity, and name refers to a property of that Entity. In general, a Screen defined in the UI model refers to Entities from Mobl’s entity language, and Fields in a screen refer to Properties in an entity.

 Structure When referencing elements of another language, both languages typically share a definition of identifiers. For example, the screen definition language imports the same lexical module as does the data modeling language, via the expression module:

    module entity/MoblEntities
    imports ... entity/MoblExpressions

    module entity/MoblExpressions
    imports ... common/Lexical

    module screen/MoblScreens
    imports common/Lexical
    exports context-free syntax

        "list" ID "in" Collection "{" Item* "}" -> List       {"ScreenList"}
        ID "." "all" "(" ")"                    -> Collection {"Collection"}
        "item" "{" ItemPart* "}"                -> Item       {"Item"}
        "label" "(" ID "." ID ")"               -> ItemPart   {"LabelPart"}

However, Spoofax also supports the use of different, typically overlapping identifier definitions70. In this case, the referencing language needs to import the identifier definition of the referenced language.

70 This requires scannerless parsing, since a scanner cannot handle overlapping lexical definitions.

 Name Binding Independent of the identifiers used in both languages, the reference has to be resolved. Definition sites are already defined by the referenced language. The corresponding references must be defined in the referencing language by using the namespaces from the referenced language. The previous syntax definition fragment of the screen definition language specifies lists and items in these lists. The following fragment shows the corresponding name binding specifications:

    module screen/names
    imports entity/names
    namespaces Item
    rules

        ScreenList(i, coll, i*):
            defines Item i of type t
            where coll has type t

        Collection(e):
            refers to Entity e

        LabelPart(i, p):
            refers to Item i
            refers to Property p in Entity e
            where i has type e

Here, the screen definition language declares its own namespace Item for items, which are declared in the list head, introducing a variable for the current item of a collection. For example, the screen definition we have seen earlier defines an item t71. When we describe the collection, we can refer to entities. The corresponding namespace Entity is defined in the data modeling language. The screen definition language uses the same namespace, to resolve the references into the referred language72.

71 It is the t in the list element that references a Task.

72 Similarly, the screen definition language uses the namespace Property from the data modeling language, to refer to properties of entities.

 Type System Similar to the name resolution, the type system of the referencing language needs to be defined with the knowledge of the type system of the referenced language.

    constraint-error:
        LabelPart(item, property) -> (property, "Label has to be a string.")
        where
            type := property
            not (!type => !StringType())

To be able to check whether a label is a string, this constraint has to determine the type of the property used as a label.

 Generators Generators of the referencing language also need to be defined with the knowledge of the generators of the referenced language. Typically, they just share knowledge about the naming and typing of generated code elements.

    to-java:
        LabelPart(item, property) ->
            |[ new Label([java-item].[java-prop]); ]|
        where
            label := "label";
            java-item := item;
            java-prop := property

This rule generates Java code for a LabelPart. The generated code should create a new label with a text which should be determined from a property of an item. To generate the property access, the rule relies on the same scheme as rules from the generator of the entity definition part, by making calls to rules from this generator73.

73 It first generates the Java variable name from item. Second, it generates a getter call from property. Finally, it composes these fragments with its own Java code.

16.4.2 Reuse

As discussed in the previous sections, referencing concerns the case in which the referencing language is built with knowledge about the referenced language, so that it can have direct dependencies74. In the case of reuse, such direct dependency is not allowed: our goal is to combine two independent languages.

74 In the example above, the screen definition language uses the namespaces and types from the entity language directly.

 Structure To illustrate this case, we again use the same example as in the MPS section. We first introduce a trivial DSL for defining relational table structures. These can optionally be mapped to a data source, but the language makes no assumption about what this data source looks like (and which language is used to define it). Consequently, the grammar has no dependency on any other one:

    module DBTables

        "database" ID Table*     -> Database     {"DB"}
        "table" ID Column*       -> Table        {"DBTable"}
        DataType ID ColumnMapper -> Column       {"DBColumn"}
                                 -> ColumnMapper {"DBMissingMapper"}

        "char"                   -> DataType     {"DBCharType"}
        "number"                 -> DataType     {"DBNumberType"}

Again, the Column rule has an optional ColumnMapper which works as the hook for reuse. The reusable language only provides a rule for a missing column mapper75. In the next step, we want to be able to reference properties from Mobl’s data modeling language from a table definition:

    database TaskDB
    table Tasks
        char name <- Task.name
        char description <- Task.description

75 Rules for a concrete mapper can be added later in a sublanguage.

To do this, we define an adapter module, which imports the reusable table module and the data modeling language. So far, ColumnMapper is only an abstract concept, without a useful definition. The adapter module now defines a rule for ColumnMapper, which defines the concrete syntax of an actual mapper that can reference properties from the data modeling language:

module MoblDBAdapter

imports DBTables MoblEntities

context-free syntax

  "<-" ID "." ID -> ColumnMapper {"PropertyMapper"}

There is only one rule in this module, which defines a concrete mapper. On the right-hand side, it uses the same sort as the rule in the table module (ColumnMapper). On the left-hand side, it refers to a property (second ID) in an entity (first ID).
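The hook idea does not depend on SDF. The following Python fragment is only a rough sketch of it under stated assumptions: the registry `MAPPER_PARSERS`, the function names and the tuple-based tree shape are invented for illustration and are not part of Spoofax. The base parser knows only the missing mapper; the adapter later registers a parser for the `<- Entity.property` form without touching the base code.

```python
import re

# Hypothetical registry: owned by the reusable table language, extended by adapters.
MAPPER_PARSERS = []


def parse_mapper(text):
    """Parse the ColumnMapper part of a column; an empty mapper is the default."""
    if text.strip() == "":
        return ("DBMissingMapper",)
    for parser in MAPPER_PARSERS:
        result = parser(text)
        if result is not None:
            return result
    raise SyntaxError("no mapper syntax matches: " + repr(text))


def parse_column(line):
    """Parse 'DataType ID ColumnMapper' into a DBColumn tuple."""
    match = re.match(r"\s*(char|number)\s+(\w+)(.*)$", line)
    if match is None:
        raise SyntaxError("not a column: " + repr(line))
    data_type = {"char": "DBCharType", "number": "DBNumberType"}[match.group(1)]
    return ("DBColumn", (data_type,), match.group(2), parse_mapper(match.group(3)))


# --- adapter, written later and independently of the code above ---
def parse_property_mapper(text):
    """Concrete mapper syntax: '<-' ID '.' ID (first ID entity, second ID property)."""
    match = re.fullmatch(r"\s*<-\s*(\w+)\s*\.\s*(\w+)\s*", text)
    if match is None:
        return None
    return ("PropertyMapper", match.group(1), match.group(2))


MAPPER_PARSERS.append(parse_property_mapper)

print(parse_column("char name <- Task.name"))
print(parse_column("number priority"))
```

The column name `priority` in the second call is made up; it only shows that a column without a mapper still parses with the base language alone.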

■ Name Binding The actual reference to entity and property names from the imported data modeling language needs to be specified in the name binding module of the adapter:

module adapter/names

imports entity/names

rules

  PropertyMapper(e, p):
    refers to Entity e
    refers to Property p in Entity e
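To make the two-step resolution concrete, here is a minimal Python sketch of what such a binding rule asks for: first find the entity, then find the property inside that entity. The symbol table `ENTITIES`, the function name and the property types are assumptions made for this illustration; Spoofax derives the real lookup from the declarative rule above.

```python
# Hypothetical symbol table: entity name -> {property name -> type name}.
# Only Task.name and Task.description appear in the example; the types are assumed.
ENTITIES = {"Task": {"name": "String", "description": "String"}}


def resolve_property_mapper(entity_name, property_name, entities=ENTITIES):
    """Resolve 'Entity.property'; return the property's type or raise NameError."""
    if entity_name not in entities:
        raise NameError(f"unknown entity {entity_name}")
    properties = entities[entity_name]
    if property_name not in properties:
        raise NameError(f"unknown property {property_name} in entity {entity_name}")
    return properties[property_name]


print(resolve_property_mapper("Task", "name"))
```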

■ Type System In our example, the types of the database language need to be connected to the primitive types used in Mobl76. Constraints ensure we only map those fields to a particular column that are type-compatible:

module reuse/dbtables/analysis/types

rules

  constraint-error:
    DBColumn(type, _, mapper) -> (mapper, "Incompatible type")
    where type' := <type-of> mapper ;
          <not(compatible-types)> (type, type')

  compatible-types: _ -> <fail>

76 More generally, the type system needs to connect types from the abstract but reusable language to types from the language which actually reuses it.

The code above is defined in the generic implementation of the database language. It assumes that a mapper has a type and checks if this type is compatible with the declared column type. It defines a default rule for type compatibility, which always fails. The connection to the type system of the entity language can now be made in an adapter module:

module analysis/types/adapter

imports
  module analysis/types
  module reuse/dbtables/analysis/types

rules

  type-of: PropertyMapper(e, p) -> <type-of> p

  compatible-types: (DBCharType(), StringType()) -> <id>
  compatible-types: (DBNumberType(), NumType()) -> <id>

The first rule defines the type of a PropertyMapper to be the type of the property. Then, two rules define type compatibility for Mobl’s String type with the char type in the table language, and Mobl’s Num type with the table language’s number type.

■ Generator As in the Xtext example, we can use two strategies to reuse a generator for the database language. The first strategy relies on composition techniques of the target language, if that language provides such composition facilities77. The second strategy we discussed in the Xtext example addressed the generation of inlined code, which requires an extensible generator of the reusable language. With rewrite rules, this can be easily achieved in Spoofax. The reusable generator calls a dedicated rule for generating the inlined code, but defines only a failing implementation of this rule:

rules

  db-to-java:
    Column(t, n, mapper) ->
      [Field([PRIVATE], t', n),
       Method([PROTECTED], BoolType, n', params, stmts)]
    where n' := n ;
          params := <to-fetch-method-parameters> mapper ;
          stmts := <to-fetch-statements(|n)> mapper

  to-fetch-method-parameters: _ -> <fail>
  to-fetch-statements(|n) : _ -> <fail>

77 As in the MPS example, the code generator of the database language generates an abstract Java class for fetching data, while Mobl’s original code generator generates Java classes from entities. We can then define an additional generator, which generates a concrete subclass that fetches data from entities.

This rule generates code for columns. It generates a private field and a protected method to fetch the content. This method needs a name, parameters and an implementation. We assume that the method name is provided by the generic generator. For the other parts (in particular, the implementation of the methods), the generic generator only provides failing placeholder rules. These have to be implemented in a concrete reuse setting by the adapter language generator:

module generate/java/adapter

imports generate/java reuse/table/generate

rules

  to-fetch-method-parameters:
    PropertyMapper(entity, property) -> [Param(type, "entity")]
    where type := entity

  to-fetch-statements(|field-name):
    PropertyMapper(entity, property) ->
      [Assign(VarRef(field-name), MethodCall(VarRef("entity"), m, [])),
       Return(True())]
    where m := property

This adapter code generates a single parameter for the fetch method. It is named entity and its type is provided by a rule from the entity language generator. The rule maps entity names to Java types. For the implementation body, the second rule generates an assignment and a return statement. The assignment calls the getter method for the property. Again, the name of this getter method is provided by the entity language generator.
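The interplay of failing placeholder rules and adapter rules can be imitated in a general-purpose language. The Python sketch below is an analogy, not a rendering of Stratego semantics: a `RuleFailed` exception stands in for rule failure, and the rule table, the function names, the `fetch_` prefix for the method name and the `get` prefix for getters are all assumptions made for this illustration.

```python
class RuleFailed(Exception):
    """Signals that no rule applies, loosely similar to a failing rewrite rule."""


# Hypothetical rule tables; the generic generator ships them empty,
# which corresponds to the failing placeholder rules.
RULES = {"to_fetch_method_parameters": [], "to_fetch_statements": []}


def apply_rule(name, *args):
    """Try the registered alternatives in order; fail if none succeeds."""
    for rule in RULES[name]:
        try:
            return rule(*args)
        except RuleFailed:
            continue
    raise RuleFailed(name)


def column_to_java(column_type, column_name, mapper):
    """Generic generator: a private field plus a protected fetch method."""
    params = apply_rule("to_fetch_method_parameters", mapper)
    stmts = apply_rule("to_fetch_statements", column_name, mapper)
    return "\n".join([
        f"private {column_type} {column_name};",
        f"protected boolean fetch_{column_name}({', '.join(params)}) {{",
        *[f"    {s}" for s in stmts],
        "}",
    ])


# --- adapter generator: supplies the missing rules for PropertyMapper ---
def property_mapper_params(mapper):
    if mapper[0] != "PropertyMapper":
        raise RuleFailed("not a PropertyMapper")
    _, entity, _ = mapper
    return [f"{entity} entity"]


def property_mapper_statements(field_name, mapper):
    if mapper[0] != "PropertyMapper":
        raise RuleFailed("not a PropertyMapper")
    _, _, prop = mapper
    getter = "get" + prop[0].upper() + prop[1:]   # assumed getter naming scheme
    return [f"{field_name} = entity.{getter}();", "return true;"]


RULES["to_fetch_method_parameters"].append(property_mapper_params)
RULES["to_fetch_statements"].append(property_mapper_statements)

print(column_to_java("String", "name", ("PropertyMapper", "Task", "name")))
```

A column without a concrete mapper still makes `column_to_java` fail, which mirrors the behaviour of the generic generator when no adapter is present.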

16.4.3 Extension

Because of Spoofax’ module system and rule-based nature, language extension feels like ordinary language development. When we want to add a new feature for a language, we simply create new modules for syntax, name binding, type system and code generation rules. These modules import the existing modules as needed. In the syntax definition, we can extend a syntactic sort with new definition rules. In the type system, we add additional type-of rules for the new language constructs and define constraints for well-typedness. Finally, we add new generator rules, which can handle the new language constructs.

Since, in the case of extension, the extending language has a dependency on, and is developed with, knowledge of the base language, it can be designed in a way that will not lead to parsing ambiguities. However, the composition of different, independent extensions of the same base language might lead to ambiguities. These ambiguities will only occur between the extensions. The base language will stay unambiguous, since each module is only imported once.

16.4.4 Restriction

In the easiest case, restriction can be handled on the level of syntax rules. SDF’s import directives allow not only for renaming of sorts, but also for replacing complete syntax rules. To remove a rule completely, we can replace it with a dummy rule for a sort, which is not used anywhere. The following example restricts Mobl to a version without property access expressions78:

module MoblWithoutPropertyAccess

imports Mobl[Exp "." ID -> Exp => -> UnusedDummySort]

78 In more complex cases, only parts of a syntax rule need to be restricted. For example, we might restrict Mobl’s screen definition language to support only unparameterized screen definitions.

16.4.5 Embedding

Embedding can be easily achieved in Spoofax. In general, the procedure is very similar to reuse. We will discuss the embedding of HQL79 into Mobl as an example here80.

79 HQL is a declarative query language for entities stored in a relational database. It resembles SQL in syntax.

80 We have discussed embedding already in Section 11.4, where we embedded the target language into the Stratego transformation language.

■ Structure Embedding requires an additional syntax definition module which imports the main modules of the host and guest language and defines additional syntax rules that realize the embedding. In target language embedding into Stratego, this was achieved with quotations and antiquotations. The following module is an initial attempt to embed HQL into Mobl:

module Mobl-HQL

imports Mobl Hql

context-free syntax

  QueryRule           -> Exp {cons("HqlQuery")}
  DeleteStatement ";" -> Statement {cons("HqlStatement")}

  "~" Exp -> Expression {cons("DslExp")}

The module imports syntax definitions of host and guest languages. It embeds HQL queries as Mobl expressions and HQL’s delete statement as a Mobl statement without any quotations. Furthermore, it allows us to use quoted Mobl expressions inside HQL queries, using the tilde as a quotation symbol.

There are two issues in this module. First, we might accidentally merge sorts with the same name in host and guest language81. Since both languages are developed independently, we cannot assume mutually exclusive names in their syntax definitions. One way to avoid name clashes is to rename sorts manually during import:

module Mobl-HQL

imports Mobl
        Hql [ QueryRule => HqlQueryRule
              DeleteStatement => HqlDeleteStatement
              Expression => HqlExpression
              ... ]

81 Sort names like Expression, Exp or Statement are quite likely to be used in several languages.

This can be quite cumbersome, since we have to rename all sorts, not only the embedded ones. Alternatively, we can rely on Spoofax to generate a renamed version of a language definition. This Mix is a parameterized syntax definition, where Spoofax replaces each sort by a parameterized sort82:

module HqlMix[Context]

imports Hql [ QueryRule => QueryRule[[Context]]
              DeleteStatement => DeleteStatement[[Context]]
              Expression => Expression[[Context]]
              ... ]

82 This is a bit like generics in programming languages such as Java.

The parameter allows us to distinguish sorts from the host and the guest language. We can then import this module with an actual parameter and use the parameterized sorts in the embedding:

module Mobl-HQL

imports Mobl
        HqlMix[HQL]

context-free syntax

  QueryRule[[HQL]]           -> Exp {cons("HqlQuery")}
  DeleteStatement[[HQL]] ";" -> Statement {cons("HqlStatement")}

  "~" Exp -> Expression[[HQL]] {cons("MoblExp")}

The second issue is ambiguity: we have to integrate HQL queries into the precedence rules for Mobl expressions. To do this, we do not have to repeat all rules: preceding and succeeding rules are sufficient83:

  Assignment -> Exp
> QueryRule[[HQL]] -> Exp
> "if" "(" Exp ")" Exp "else" Exp -> Exp

83 This is necessary because precedence is specified relative to other rules using context free priorities - the > operator introduced earlier.

■ Name Binding The name bindings of host and embedded language are never connected. For example, only Mobl expressions can refer to Mobl variables. If an HQL query relies on a Mobl variable, it accesses it as an embedded Mobl expression.

■ Type System The type system needs to connect the types from the host and guest languages. This can be achieved by adding typing rules for embedded and antiquoted constructs. For example, we need to connect the HQL type of a query to a Mobl type of the embedding expression:

module mobl-hql/types

imports mobl/types hql/types

  type-of:
    HqlQuery(query) -> mobl-type
    where hql-type := <type-of> query ;
          mobl-type := <hql-to-mobl-type> hql-type

  type-of:
    MoblExp(exp) -> hql-type
    where mobl-type := <type-of> exp ;
          hql-type := <mobl-to-hql-type> mobl-type

  hql-to-mobl-type: JDBC_Integer() -> NumType()
  hql-to-mobl-type: JDBC_Float() -> NumType()
  hql-to-mobl-type: JDBC_Bit() -> BoolType()

  mobl-to-hql-type: NumType() -> JDBC_Float()
  mobl-to-hql-type: BoolType() -> JDBC_Bit()

The first rule determines the type of an embedded HQL query and maps it to a corresponding Mobl type. The second rule determines the type of an antiquoted Mobl expression and maps it to a corresponding HQL type. The remaining rules exemplify actual mappings between HQL and Mobl types84.

84 Additional constraints could check for incompatible types which cannot be mapped into the host or guest language.

■ Generator There are two strategies for code generation for embedded languages. If the guest language provides a suitable code generator, we can combine it with the code generator of the host language. First, we need rules which generate code for embedded constructs. These rules have to extend the host generator by delegating to the guest generator. Next, we need rules which generate code for antiquoted constructs. These rules have to extend the guest generator by delegating to the host generator.

Another strategy is to define a model-to-model transformation which desugars (or "assimilates") embedded constructs to constructs of the host language. This transformation is then applied first, before the host generator is applied to generate code. The embedding of a target language into Stratego is an example of this approach. The embedded target language will be represented by abstract syntax trees for code generation fragments. These trees need to be desugared into Stratego pattern matching constructs. For example, the embedded |[return |[x]|; ]| will yield the following abstract syntax tree:

ToJava( Return( FromJava( Var("x") ) ) )

In ordinary Stratego without an embedded target language, we would have written the pattern Return(x) instead. The corresponding abstract syntax tree looks like this:

NoAnnoList( App( Op("Return"), [Var("x")] ) )

The desugar transformation now needs to transform the first abstract syntax tree into the second one:

desugar-all: x -> <bottomup(try(desugar-embedded))> x

desugar-embedded: ToJava(e) -> <ast-to-pattern> e

ast-to-pattern:
  ast -> pattern
  where
    if !ast => FromJava(e) then
      pattern := e
    else
      c := <get-constructor> ast ;
      args := <get-arguments> ast ;
      ps := <map(ast-to-pattern)> args ;
      pattern := NoAnnoList(App(Op(c), ps))
    end

The first rule drives the desugaring of the overall tree. It tries to apply desugar-embedded in a bottom-up traversal. The only rule for desugaring embedded target language code matches the embedded code and applies ast-to-pattern to it. If this is applied to an antiquote, the contained subnode is already a regular Stratego pattern. Otherwise, the node has to be an abstract syntax tree of the target language. It is deconstructed into its constructor and subtrees, which are desugared into patterns as well. The resulting patterns and the constructor are then used to construct the overall pattern.

Part IV

DSLs in Software Engineering


This part of the book looks at how DSLs can be used in various aspects of software engineering. In particular, we look at requirements engineering, software architecture, developer utilities, implementation, product line engineering and business DSLs. Some of the chapters also serve as case studies for interesting, non-trivial DSLs. Note that this part has many contributory authors, so there may be slight variations in style.

17 DSLs and Requirements

This chapter looks at the role of DSLs in requirements engineering. In particular it explores the use of DSLs to specify requirements formally, at ways of representing requirements as models and at traceability between implementation artifacts and requirements.

17.1 What are Requirements?

Wikipedia defines a requirement as follows:

A requirement is a singular documented need of what a particular product or service should be or perform.

Wiktionary says:

[A requirement] specifies a verifiable constraint on an implementation that it shall undeniably meet or (a) be deemed unacceptable, or (b) result in implementation failure, or (c) result in system failure.

In my own words I would probably define a requirement as

. . . a statement about what a system should do, and with what quality attributes, without presupposing a specific implementation.

Requirements are supposed to tell the programmers what the system they are about to implement should do1. Requirements are a means of communicating from humans (people who know what the system should do) to other humans (those that have to implement it).

1 However, a requirement typically does not prescribe how a developer has to implement some functionality: architecture, design, the use of patterns and idioms and the choice of a suitable implementation technology and language are up to the developer.

Of course, as we all know, there are a number of challenges with this:

• Those who implement the requirements may have a different background than those who write them, making misunderstandings between the two groups likely.

• Those who write the requirements may not actually really know what they want the system to do, at least initially. Requirements change over the course of a project, particularly as people start to "play" with early versions of the system2.

• Usually requirements are written in plain English (or whatever language you prefer). Writing things down precisely and completely in a non-formal language is next to impossible3.

2 As we all know, only when we actually play or experiment with something do we really understand all the details, uncover corner cases and appreciate the complexity in the system.

3 Writing any large prose document consistently and free from bugs is hard. I am sure you will find problems in this book :-)

Traditional requirements documents are a means to communicate from people to people. However, in the end this is not really true. In an ideal world, the requirements (in the brain of the person who writes them down) should be communicated directly to the computer, without the intermediate programmer, to avoid the misunderstandings mentioned above. If we look at the problem in this way, requirements now become formal, computer-understandable.

Wikipedia has a nice list of characteristics that requirements should possess. Here is a slightly adapted version of this list:

Complete The requirement is fully stated in one place with no missing information. This makes the requirement easy to consume, because readers do not have to build the complete picture from scattered information.

Consistent The requirement does not contradict any other requirement and is fully consistent with all authoritative external documentation4.

Cohesive & Atomic The requirement is atomic, i.e., it does not contain conjunctions5. This ensures that traceability from implementation artifacts back to the requirements is relatively simple.

Current The requirement has not been made obsolete by the passage of time. Outdated requirements should be removed or marked as outdated.

Feasible The requirement can be implemented within the constraints of the project6.

Unambiguous The requirement is concisely stated without recourse to technical jargon, acronyms (unless defined elsewhere in the requirements document), or other esoteric verbiage. It expresses objective facts, not subjective opinions. It is subject to one and only one interpretation. Vague subjects, adjectives, prepositions, verbs and subjective phrases are avoided. Negative statements and compound statements are prohibited7.

Mandatory The requirement represents a stakeholder-defined characteristic the absence of which will result in a deficiency that cannot be ameliorated. An optional requirement is a contradiction in terms8.

Verifiable The implementation of the requirement can be determined through one of four possible methods: inspection, demonstration, test or analysis. Otherwise it is hard to tell if a system actually fulfills a requirement or not9.

4 This is extremely hard to achieve with prose, because there is no "compiler" that finds inconsistencies.

5 For example "The postal code field must validate American and Canadian postal codes" should be written as two separate requirements: (1) "The postal code field must validate American postal codes" and (2) "The postal code field must validate Canadian postal codes".

6 Of course, as the person who writes the requirements, you may not be able to judge, since you may not know the project constraints, the effort to implement the requirement, or whether the implementation technology is able to address the requirement. This is one reason why interaction with the implementers is critical.

7 All of these things are intended to make the prose as precise as possible to avoid misunderstandings. However, we all know how hard this is to achieve with prose.

8 Although requirements may have priorities that define how important a requirement is relative to others. The implementation process should implement high-priority requirements first, if possible.

9 Ideally, all requirements can be tested in an automatic way, in the sense that acceptance tests can be specified, and these can be re-executed over and over again.

If requirements are written as pure prose, then making sure all these characteristics are met boils down mostly to a manual review process. Of course, this is tedious and error-prone, and requirements end up in the sorry state we all know.

To get one step better, you can use controlled natural language10 in which words like must, may or should have a well defined meaning and are used consciously. Using tables and – to some extent – state machines, is also a good way to make some of the data less ambiguous; these formalisms also help to verify requirements for consistency and completeness.

10 en.wikipedia.org/wiki/Controlled_natural_language
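Even a very small checker shows what controlled natural language buys over free prose. The Python sketch below is only an illustration under simplifying assumptions: it demands exactly one of the keywords must, may or should per requirement and flags the conjunctions "and" and "or" as a hint that the requirement may not be atomic. The function name and both word lists are made up for this example; real controlled languages define much richer rules.

```python
import re

MODALS = {"must", "may", "should"}   # keywords with an agreed meaning
CONJUNCTIONS = {"and", "or"}         # hint at a non-atomic requirement


def check_requirement(text):
    """Return a list of findings for one prose requirement (empty means OK)."""
    words = re.findall(r"[a-z]+", text.lower())
    findings = []
    modal_count = sum(1 for w in words if w in MODALS)
    if modal_count != 1:
        findings.append(
            f"expected exactly one of {sorted(MODALS)}, found {modal_count}")
    used = sorted(set(words) & CONJUNCTIONS)
    if used:
        findings.append(f"possibly not atomic, contains: {', '.join(used)}")
    return findings


for requirement in [
    "The postal code field must validate American and Canadian postal codes",
    "The postal code field must validate American postal codes",
]:
    print(requirement, "->", check_requirement(requirement))
```

Applied to the postal code example from the list above, the combined requirement is flagged, while each of the two split requirements passes.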
To manage large sets of requirements, tools should be used to support unique identification and naming of requirements, as well as the expression of relationships and hierarchies among requirements11. However, the requirements themselves are still expressed as plain text, so the fundamental problems mentioned above are not improved significantly. In this chapter we will give you some ideas and examples on how this situation can be improved with DSLs12.

11 Example tools include DOORS, Requisite Pro, the Eclipse Requirements Modeling Framework (RMF), and itemis’ Yakindu.

12 Note that I don’t suggest in this chapter that all requirements should be captured with DSLs. Instead, DSLs can be one important ingredient for a well thought out requirements management approach.

17.2 Requirements versus Design versus Implementation

Traditionally, we try to establish a clear line between requirements, architecture and design, and implementation. For example, a requirement may state that the system be 99.99% reliable. The design may use hot-standby and fail-over to continue service if a component breaks. The implementation would then select a specific standby/fail-over technology to realize the design. We make this distinction because we want to establish different roles in the software engineering process. For example, product management writes the requirements, a systems architect comes up with the architecture and design, and then a programmer writes the actual code and chooses the technologies13. Different organizations may even be involved, leading to a separation between requirements and architecture that is driven by organizational constraints14: the OEM writes the requirements, a systems integrator does the architecture, and some outsourcing company does the coding.

13 A different approach may have the architect selecting the implementation technology and the programmer doing the design work as well.

14 In effect, everything the OEM does is called requirements (by definition!), what the integrator does is called architecture, and what the outsourcer does is called implementation.

In such a scenario it is important to draw precise boundaries between the activities. However, in some sense the boundaries are arbitrary. Consequently, the distinction between requirements and architecture is arbitrary as well: we could just as well state the following:

Requirement The system shall be 99.99% reliable by using hot-standby and fail-over to continue service if something breaks.

Architecture/Design We use two application servers running on two machines, using the XYZ messaging middleware as a replication engine for the hot-standby. We use a watchdog for detecting if the primary machine breaks, so we can fail over to the second one.

Implementation . . . all the code that is necessary to implement the design above.

From software development we know that it is very hard to get requirements right. In the real world, you have to elaborate on the requirements incrementally15.

15 You write some requirements, then you write a prototype and check if the requirements make sense, then you refine the requirements, write a (maybe more detailed) prototype, and so on.

In systems engineering this approach is also very well established. For example, when satellites are built, the scientists come up with initial scientific requirements, for example, regarding the resolution a satellite-based radar antenna looking at the earth should have. Let’s look at some of the consequences:

• A given resolution requires a specific size of the antenna, and a specific amount of energy being sent out. (Actually, the two influence each other as well).

• A bigger antenna results in a heavier satellite, and more radar energy requires more solar panel area – increasing the size and weight even further.

• At some point, the size and weight of the satellite cannot be further increased, because a given launch vehicle reaches its limits – a different launch vehicle might be required.

• A bigger launch vehicle will be much more expensive, or you might have to change the launch provider. For example, you might have to use a Soyuz instead of an Ariane.

• A Soyuz launched at Baikonur cannot reach the same orbits as an Ariane launched from Kourou, because of the higher latitude. As a consequence, the satellite might be "further away" from the area you want to inspect with your radar, negating the advantages gained by the bigger antenna16.

16 Actually, they now launch Soyuz vehicles from Kourou as well, for just that reason.

. . . and this has just looked at size and weight! Similar problems exist with heat management, pointing accuracy or propulsion.

As you can see, a change in any particular requirement can lead to non-trivial consequences you will only detect if you think about the implementation of the requirement. A strictly sequential approach (first write all the requirements, then think about the implementation) will not work.

So what do the systems engineers do? They come up with a model of the satellite. Using mathematical formulas, they describe how the different properties discussed above relate. These might be approximations or based on past experience – after all, the real physics can be quite complex. They then run a trade-off analysis. In other words, they change the input values until a workable compromise is reached. Usually this is a manual process, but sometimes parts of it can be automated.

This example illustrates three things. First, requirements elicitation is incremental. Second, models can be a big help to precisely specify requirements and then "play" with them. And third, the boundary between requirements and design is blurred, and the two influence each other. Fig. 17.1 shows a multi-step approach to requirements definition, intertwined with incrementally more refined designs.
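To give a feel for what "playing" with such a model means, here is a deliberately naive Python sketch of a trade-off analysis. Everything in it is an assumption made for illustration: the formulas, the coefficients, the assumed base mass of 500 kg and the launcher payload limits are invented placeholders and say nothing about real satellites or real launch vehicles.

```python
from dataclasses import dataclass


@dataclass
class Launcher:
    name: str
    max_payload_kg: float


# All coefficients below are invented placeholders, not engineering data.
def antenna_diameter_m(resolution_m):
    """A finer resolution (smaller number) needs a larger antenna."""
    return 30.0 / resolution_m


def satellite_mass_kg(resolution_m):
    """Bigger antenna and more radar power both add mass."""
    diameter = antenna_diameter_m(resolution_m)
    antenna_mass = 40.0 * diameter ** 2
    radar_power_w = 2000.0 / resolution_m
    solar_panel_mass = 0.05 * radar_power_w
    return 500.0 + antenna_mass + solar_panel_mass   # 500 kg assumed base mass


def best_feasible_resolution(launcher, candidates_m):
    """Vary the input and keep the finest resolution the launcher can lift."""
    feasible = [r for r in candidates_m
                if satellite_mass_kg(r) <= launcher.max_payload_kg]
    return min(feasible) if feasible else None


candidates = [1.0, 2.0, 5.0, 10.0]
for launcher in [Launcher("small", 2000.0), Launcher("large", 10000.0)]:
    print(launcher.name, best_feasible_resolution(launcher, candidates))
```

Even in this toy form the pattern is visible: the resolution requirement cannot be fixed without looking at the design, because the choice of launcher decides which resolutions are achievable at all.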

17.3 Using DSLs for Requirements Engineering

So here is the approach for using DSLs we suggest: identify a couple of core areas of the system to be built that lend themselves to specification with a formal language17. Then develop one or more DSLs to express these areas and use them to describe the system. The rest of the system – i.e., the areas for which a DSL-based description makes no sense – is described textually, with the usual tools18.

17 The trade-off between antenna resolution and size/weight mentioned above is such an area.

18 We will discuss the integration between the textual requirements and the DSL-based requirements below.

Figure 17.1: Requirements and Design influence each other and are thus best done iteratively, and in parallel.

Once a suitable DSL has been found and implemented, those people who have the requirements in mind can directly express them – the lossy human-to-human communication is no longer a problem. Various constraint checks, tests and simulations can be used to allow the requirements owners to "play" with the requirements models to see if they really express what they had in mind.

Of course there is one significant caveat: we first have to build this DSL. So how do we go about that? We could have somebody write prose requirements and hand them over to the DSL developer . . . back to square one!

There is a much better approach, though. Since today’s language workbenches support extremely rapid prototyping, you can actually build the DSLs interactively with the requirements owner. Since you are not capturing the specific requirements, but rather try to define how specific requirements are described, you essentially perform a domain analysis: you try to understand the degrees of freedom in the domain to be able to represent the domain with the DSL19.

19 This is similar to what has been done with "analysis models" (back in the day . . . ). However, instead of drawing UML analysis diagrams, you capture the domain into a language definition, which makes it executable, in the sense that you can always turn around and have the requirements owner try to express specific requirements with the DSL you’re building, verifying the suitability of the DSL.

Here is the process I have used successfully many times:

1. Have the requirements owner explain some particular aspect of the domain.

2. Try to understand that aspect and change your DSL so it can express that aspect.

3. Make the requirements owner try to express a couple of specific, but representative, requirements with the DSL.

4. Most likely you will run into problems: some things cannot be expressed with the DSL. If so, go back to 1 and reiterate. A complete iteration should take no more than 60 minutes.

5. After half a day, stop working with the requirements owner and clean up/refactor the DSL.

6. Start another of the language design sessions with the requirements owner and iterate – over time, you should get closer to the DSL for the domain.

Once the DSL is finished, the requirements owners will be able to express domain requirements without involvement of the software developers.

This approach to requirements engineering is very close to regular DSL usage. We identify an aspect of the domain that lends itself to formalization, iteratively build a language, and then let the domain experts – who are the requirements owners for many of the business requirements – express the system directly. The classical requirements document is gone20.

20 In many ways the refrigerator and pension plan examples from Part II of this book are examples of this approach, and so is the Health domain example discussed in Chapter 22.

Using DSLs to specify (some parts of the) requirements formally helps achieve some of the desirable characteristics for requirements discussed above. The following lists only those characteristics for which DSLs make a difference.

Consistent Consistency is enforced by the language. If the DSL is crafted correctly, no inconsistent requirements can be expressed.

Feasible Specific requirements are checked for feasibility by being expressible with the DSL: they are within the scope of what the DSL – hence, the domain for which we write the requirements – is intended for.

Unambiguous A description of requirements – or application functionality in general – with a DSL is always unambiguous, provided the DSL has well-defined semantics.

Verifiable Constraints, tests, verification or simulation can be used to verify the requirements regarding various properties. Inspection and review is simplified, because DSL programs are less verbose than implementation code, and clearer than prose.

17.4 Integration with Plain Text Requirements

You will probably not be able to describe all the requirements of a system using the approach described above. There will always be aspects that cannot be formalized, or that are so specific that the effort for building a DSL does not pay off. You therefore have to find some way of integrating plain text requirements with DSL code. Here are some approaches to how this can be done.

17.4.1 Embedding DSL Code in a Requirements Tool

One approach is to mix prose requirements with formalized, DSL-based requirements. We show examples with Xtext and MPS.

■ Xtext Example Eclipse-based tooling for requirements engineering is being developed as part of the VERDE21 and ProR22 research projects. This includes Eclipse RMF23, a "classical" requirements engineering tool, in which textual requirements are classified, structured and put into relationships with each other24. The requirements structure is represented as an EMF model, to make integration with other model-based artifacts simple. In addition to plain text, requirements can have parameters with well-defined types. The types of these parameters can be primitive (string, int), but they can also be any other Ecore meta class, so any additional model structure can be embedded into a requirement. Integration with Xtext is available, which provides textual concrete syntax for these data structures. In other words, it is possible to enrich prose requirements specifications with additional formal specifications expressed in arbitrary DSLs. Fig. 17.2 shows a screenshot.

21 www.itea-verde.org/

22 www.pror.org/

23 eclipse.org/RMF

24 The tool actually implements the OMG’s ReqIF standard.

MPS Example We have built a similar solution for MPS in the mbeddr project. The solution supports collecting trees of requirements, where each requirement has an ID, a kind and a short summary. Fig. 17.3 shows an example. In addition, the one-line summary can be expanded to reveal additional details (Fig. 17.4). There users can enter a detailed prose description, as well as additional constraints among requirements (requires also, conflicts with). In the Additional Specifications section, users can enter arbitrary DSL programs: since MPS supports language modularization and composition (Section 16.2), embedding arbitrary languages with arbitrary syntax into the requirements language is trivial and works out of the box. It is also possible to associate a specific additional specification with a particular requirements kind. This means that if a requirement has a particular kind, the additional data associated with that kind must be present in the Additional Specifications section25.

25 For example, a requirement kind timing may require an instance of TimingSpecification in the Additional Specifications section.

Figure 17.2: An Xtext DSL embedded in a requirements engineering tool based on the Eclipse Requirements Modeling Framework (RMF).

17.4.2 Requirements Traceability

Requirements traceability establishes links, or traces, between implementation (or design or test) artifacts and requirements. This allows each (part of) an artifact to be traced back to the requirements that drive the artifact26. Once such pointers are available, various analyses become possible. For example, it is easy to find out whether each requirement has been implemented (or tested), and we know which implementation artifacts may have to change if a requirement changes.

26 Traces are typically typed (as in implements, tests or refines), so various different relationships can be established.

Requirements traceability has two challenges. The first one is social, the second one is technical. The social problem is that, while traces are easy to analyze once they are available, they still have to be established manually. This requires discipline by the people, typically developers, whose job it is to establish the traces.

The technical problem is how to actually establish the pointers. In a world in which requirements – as well as design, implementation and test artifacts – are all model-based, establishing these pointers becomes trivial. In mixed environments with many different tools built on many different foundations, it can become arbitrarily complicated. Again, we show tooling examples for Xtext and MPS.

Figure 17.3: The requirements language supports creating trees of requirements. Each requirement has an ID, a kind and short text. In addition, a detailed description and constraints among requirements can be added (Fig. 17.4).

Figure 17.4: The detail view of the requirements language. In the Additional Specifications section users can enter arbitrary programs written in arbitrary DSLs. Constraints can be used to make sure that specific additional specifications are always added to requirements of a particular kind.

Xtext Example The VERDE project mentioned above also develops a traceability framework for Eclipse. Various trace kinds can be defined, which can subsequently be used to establish links between arbitrary EMF-based models27. The traceability links are kept external to the actual models, so no modifications to existing meta models or languages are required28.

27 In addition to generic EMF-based tracing, tool adapters can be plugged in to support trace links from/into a number of specific tools or formats such as plain text files, non-EMF UML tools or AUTOSAR models.
28 This is also the reason why non-EMF artifacts can be integrated, even though they require a specific tool adapter.

MPS Example As part of mbeddr, we have implemented a generic tracing framework based on MPS’ language annotations (discussed in Section 16.2.7). Arbitrary models – independent of the language used to create them29 – can be annotated with traceability links, as shown in the left part of Figure 17.5. A context menu action adds a new trace to any model element: Ctrl-Space allows the selection of one or more requirements at which to point. Each trace has a kind. The traced requirements are color coded to reflect their trace status (Fig. 17.5, top right). Finally, the Find Usages functionality of MPS has been customized to show traces directly (Fig. 17.5, bottom right).

29 Of course this only works for arbitrary MPS-based languages.

Figure 17.5: Left: A C function with an attached requirement trace. Traces can be attached to arbitrary program nodes, supporting tracing at arbitrary levels of detail. Right, top: The requirements can be color coded to reflect whether they are traced at all (gray), implemented (blue) and tested (green). Untraced requirements are red. Right, bottom: The Find Usages dialog shows the different kinds of traces as separate categories. Programs can also be shown without the traces using conditional projection. We discuss this mechanism in the section on product line variability (Section 7.7).
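The analyses that typed traces enable (finding requirements that are not yet implemented or tested, and finding the artifacts affected by a changed requirement) can be illustrated with a minimal Python sketch. It is not how the VERDE or mbeddr tooling works internally; the Trace class, the function names and the sample requirement IDs are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """A typed link from an artifact to a requirement (hypothetical structure)."""
    artifact: str      # e.g. the name of a function or a test
    requirement: str   # requirement ID
    kind: str          # "implements", "tests" or "refines"


def untraced(requirements, traces, kind):
    """Return the requirement IDs that have no trace of the given kind."""
    covered = {t.requirement for t in traces if t.kind == kind}
    return [r for r in requirements if r not in covered]


def impacted_artifacts(traces, requirement):
    """Return the artifacts that may have to change if the requirement changes."""
    return sorted({t.artifact for t in traces if t.requirement == requirement})


# Illustrative data only; none of these names come from a real project.
requirements = ["R1", "R2", "R3"]
traces = [
    Trace("computeDelay", "R1", "implements"),
    Trace("testComputeDelay", "R1", "tests"),
    Trace("updateScreen", "R2", "implements"),
]
```

Because the traces are kept in a separate collection rather than inside the artifacts, the sketch also mirrors the idea of keeping traceability links external to the models.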

18 DSLs and Software Architecture

In this chapter we explore the relationship between DSLs and software architecture. In particular we establish the notion of an Architecture DSL (or ADSL), which is a DSL specifically created for a specific architecture. We first discuss ADSLs based on an extensive example, then discuss the conceptual details. The chapter also looks at embedding business logic DSLs, as well as at the role software components can play in the context of DSLs.

18.1 What is Software Architecture?

 Definitions Software architecture has many definitions, from various groups of people. Here are a few. Wikipedia defines software architecture in the following way:

The software architecture of a program or computing system is the structure or structures of the system, which comprise software elements, the externally visible properties of those elements, and the relationships between them.

This is a classic definition that emphasizes the (coarse grained) structure of a software system (as opposed to the behavior), observable by analyzing an existing system1. A definition by Boehm builds on this:

A software system architecture comprises a collection of software and system components, connections, and constraints, a collection of system stakeholders’ need statements as well as a rationale which demonstrates that the components, connections, and constraints define a system that, if implemented, would satisfy the collection of system stakeholders’ need statements.

1 As you may suspect from my emphasis here, I disagree. Architecture is not limited to structure, and it is not limited to coarse-grained things. I will get back to this later.

In addition to the structure, this definition also emphasizes the relevance of the stakeholders and their needs, and ties the structure of the system to what the system is required to do. Hayes-Roth introduce another aspect:

The architecture of a complex software system is its style and method of design and construction.

Instead of looking at the structures, they emphasize that there are different architectural styles and they emphasize the "method of design and construction"2. Eoin Woods takes it one step further:

Software architecture is the set of design decisions which, if made incorrectly, may cause your project to be cancelled.

2 This can be interpreted as emphasizing the process of how something is built, and not just the structures of the built artifact once it is finished.

He emphasizes the design decisions that lead to a given system. So he doesn’t look at the system as a set of structures, but rather considers the architecture as a process – the design decisions – that leads to a given system. While I don’t disagree with any of these definitions – they all emphasize different aspects of the architecture of a software system – I would like to propose another one:

Software architecture is everything that needs to be consistent throughout a software system.

This definition is useful because it includes structures and behavior; it doesn’t say anything about coarse-grained versus detailed3 and it implies that there needs to be some kind of process or method to achieve the consistency. As we will see, DSLs can help with achieving this consistency.

3 A locking protocol for accessing a shared resource is possibly a detail, but it is important to implement it consistently to avoid deadlocks and other concurrency faults.

Figure 18.1: Conceptual structure of a software architecture.

Terminology Fig. 18.1 looks at software architecture in more detail. At the center of the diagram you can see the conceptual architecture. This defines architectural concepts from which concrete systems are built (we will come back to this in the next section). Examples could include: task, message, queue, component, port, or replicated data structure. Note that these concepts are independent of specific implementation technologies: it is the technology mapping that defines how the architecture concepts are implemented based on specific technologies4. The programming model is the way in which the architectural concepts are implemented in a given programming language5. Applications are implemented by instantiating the architectural concepts, and implementing them based on the programming model.

4 Separating the two has the advantage that the conceptual discussions are not diluted by what specific technologies may or may not be able to do, and how. The decision of what technology to map to is then driven by non-functional concerns, i.e. whether and how a given technology can provide the required quality of service.
5 Ideally, the programming model should not change as long as the architectural concepts remain stable – even if the technology mapping or the implementation technologies change.

The Importance of Concepts I want to reemphasize the importance of the conceptual architecture. When asking people about the architecture of systems, one often gets answers like: "It’s a Web service architecture", or "It’s an XML architecture" or "It’s a JEE architecture". Obviously, all this conveys is that a certain technology is used. When talking about architectures per se, we should talk about architectural concepts and how they relate to each other. Only as a second step should a mapping to one or more technologies be discussed. Here are some of these fundamental architectural concepts:

Modularize Break big things down into smaller things, so they can be understood (and potentially reused) more easily. Examples: procedures, classes, components, services, user stories.

Encapsulate Hide the innards of a module so that they can be changed without affecting clients. Examples: private members, facade pattern, components, layers/rings/levels.

Contracts Describe the external interface of a module clearly. Examples: interfaces, pre- and post-conditions, protocol state machines, message exchange patterns, published APIs.

Decoupling Reduce dependencies in time, data structure or contention. Examples: message queues, eventual consistency, compensating transactions.

Isolate crosscuts Encapsulate handling of cross-cutting concerns. Examples: aspect orientation, interceptors, application servers, exception handling.

As we will see in the following section, architecture DSLs can specify architectures unambiguously while emphasizing these fundamentals, instead of technologies.

18.2 Architecture DSLs

An Architecture DSL (ADSL) is a language that expresses a system’s architecture directly. "Directly" means that the language’s abstract syntax contains constructs for all the ingredients of the conceptual architecture. The language can hence be used to describe a system on the architectural level without using low-level implementation code, but still in an unambiguous way6. Code generation is used to generate representations of the application architecture in the implementation language(s), automating the technology mapping7. Finally, the programming model is defined with regards to the generated code plus possibly additional frameworks8.

6 Meaning: it is more useful than Powerpoint architecture, even though that is also high level and technology independent :-).
7 Of course, the decisions about the mapping have to be made manually. But once they are made (and generators are implemented), the mapping can be executed automatically for instances of the architecture.
8 Architecture DSLs are typically incomplete. Only the architecturally relevant structures and behaviors are expressed with the DSL. The application logic is implemented in a GPL.

I do not advocate the definition of a generic, reusable language such as the various ADLs, or UML (see below). Based on our experience, the approach works best if you define the ADSL in real time as you understand, define and evolve the conceptual architecture of a system9. The process of defining the language actually helps the architecture/development team to better understand, clarify and refine the architectural abstractions, as the language serves as a (formalized) ubiquitous language that lets you reason about and discuss the architecture.

9 In other words, like any DSL it should be closely aligned with the domain it is supposed to represent. In this case, the domain is the architecture of a system, platform or product line.

18.2.1 An Example ADSL

This section contains an example of an Architecture DSL taken from a real system in the domain of airport management systems10.

10 I was helping that customer build an architecture DSL for their architecture. They hired me because they needed to get a grip on their architecture independent of specific technologies, since the system was supposed to last for 20 years.

Background The customer decided they wanted to build a new airport management system. Airlines use systems like these to track and publish information about whether airplanes have landed at airports, whether they are late, and to track the technical status of the aircraft11. The system also populates the online-tracking system on the Web and information monitors at airports. This system is in many ways a typical distributed system: there is a central data center to do some of the heavy number crunching, but there are additional machines distributed over relatively large areas. Consequently you cannot simply shut down the whole system, which leads to a requirement to be able to work with different versions of parts of the system at the same time. Different parts of the system will be built with different technologies: Java, C++, C#. This is not an untypical requirement for large distributed systems either. Often you use Java technology for the backend, and .NET technology for a Windows front end. The customer had decided that the backbone of the system would be a messaging infrastructure, and they were evaluating different messaging tools for performance and throughput.

11 The actual system dealt with a different transportation domain, not aircraft. I had to anonymize the example a bit. However, the rest of the story happened exactly as it appears here.

While my customer knew many of the requirements and had made specific decisions about some architectural aspects, they didn’t have a well-defined conceptual architecture. It showed: when the team were discussing their system, they stumbled into disagreements about architectural concepts all the time because they had no language for the architecture. Also, they didn’t have a good idea of how to maintain the architecture over the 20 years of expected lifetime in the face of changing technologies.

Getting Started To solve this issue, an architecture DSL was developed. We started with the notion of a component. At that point the notion of components is defined relatively loosely, simply as the smallest architecturally relevant building block, a piece of encapsulated application functionality12. We also assume that components can be instantiated, making components the architectural equivalent to classes in object-oriented programming. To enable components to interact with each other, we also introduce the notion of interfaces, as well as ports, which are named communication endpoints typed with an interface. Ports have a direction13 (provides, requires) as well as a cardinality. Based on this initial version of the ADSL, we could write the following example code14:

    component DelayCalculator {
        provides aircraft: IAircraftStatus
        provides managementConsole: IManagementConsole
        requires screens[0..n]: IInfoScreen
    }

    component Manager {
        requires backend[1]: IManagementConsole
    }

    component InfoScreen {
        provides default: IInfoScreen
    }

    component AircraftModule {
        requires calculator[1]: IAircraftStatus
    }

12 In many cases it is a good idea to start with the notion of a component, although components may turn out to look different in different systems. Since components are so important, we discuss them in Section 18.3.
13 It is important not to just state which interfaces a component provides, but also which interfaces it requires, because we want to be able to understand (and later analyze with a tool) component dependencies. This is important for any system, but is especially important for the versioning requirement.
14 The following is an example instance that makes use of the architecture DSL. In this section we discuss the ADSL based on an example and don’t discuss the implementation in Xtext. The DSL is not very challenging in terms of Xtext.

We then looked at instantiation. There are many aircraft, each running an AircraftModule, and there are even more InfoScreens. So we need to express instances of components15. We also introduce connectors to define actual communication paths between components (and their ports).

    instance dc: DelayCalculator
    instance screen1: InfoScreen
    instance screen2: InfoScreen
    connect dc.screens to (screen1.default, screen2.default)

15 Note that these are logical instances. Decisions about pooling and redundant physical instances had not been made yet.
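To make the roles of ports, directions and cardinalities more tangible, here is a minimal Python sketch of the kind of constraint check a tool could run on a connector. It is not the Xtext implementation used in the project; the class names, the check_connect function and the mgr instance are assumptions introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Port:
    name: str
    interface: str
    direction: str              # "provides" or "requires"
    lower: int = 1              # cardinality bounds, only relevant for required ports
    upper: Optional[int] = 1    # None stands for "n" (unbounded)


@dataclass
class Component:
    name: str
    ports: Dict[str, Port]


def component(name, *ports):
    """Convenience constructor that indexes the ports by name."""
    return Component(name, {p.name: p for p in ports})


def check_connect(instances, source, targets):
    """Validate `connect source to (targets...)`; return a list of error strings.

    `instances` maps instance names to components; `source` and each target
    are (instance name, port name) pairs.
    """
    errors = []
    src = instances[source[0]].ports[source[1]]
    if src.direction != "requires":
        errors.append(f"{source[0]}.{source[1]} is not a required port")
    count = len(targets)
    if count < src.lower or (src.upper is not None and count > src.upper):
        errors.append(f"{source[0]}.{source[1]} cardinality violated by {count} target(s)")
    for inst, port in targets:
        tgt = instances[inst].ports[port]
        if tgt.direction != "provides" or tgt.interface != src.interface:
            errors.append(f"{inst}.{port} does not provide {src.interface}")
    return errors


delay_calculator = component(
    "DelayCalculator",
    Port("aircraft", "IAircraftStatus", "provides"),
    Port("managementConsole", "IManagementConsole", "provides"),
    Port("screens", "IInfoScreen", "requires", lower=0, upper=None),
)
info_screen = component("InfoScreen", Port("default", "IInfoScreen", "provides"))
manager = component("Manager", Port("backend", "IManagementConsole", "requires"))

# "mgr" is a hypothetical instance that does not appear in the original example.
instances = {"dc": delay_calculator, "screen1": info_screen,
             "screen2": info_screen, "mgr": manager}
```

Calling check_connect(instances, ("dc", "screens"), [("screen1", "default"), ("screen2", "default")]) corresponds to the connect statement above and reports no errors, whereas leaving the backend[1] port of mgr unconnected is reported as a cardinality violation.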

Organizing the System At some point it became clear that in order to not get lost in all the components, instances and connectors, we need to introduce some kind of namespace. It became equally clear that we’d need to distribute things to different files:

    namespace com.mycompany {

        namespace datacenter {

            component DelayCalculator {
                provides aircraft: IAircraftStatus
                provides managementConsole: IManagementConsole
                requires screens[0..n]: IInfoScreen
            }

            component Manager {
                requires backend[1]: IManagementConsole
            }

        }

        namespace mobile {

            component InfoScreen {
                provides default: IInfoScreen
            }

            component AircraftModule {
                requires calculator[1]: IAircraftStatus
            }

        }
    }

It is also a good idea to keep component and interface definitions (essentially type definitions) separate from system definitions (connected instances), so we introduced the concept of compositions, which make a group of instances and connectors identifiable by a name:

    namespace com.mycompany.test {

        composition testSystem {
            instance dc: DelayCalculator
            instance screen1: InfoScreen
            instance screen2: InfoScreen
            connect dc.screens to (screen1.default, screen2.default)
        }
    }

Dynamic Connectors Of course in a real system, the DelayCalculator would have to dynamically discover all the available InfoScreens at runtime. There is not much point in manually describing those connections. So, we specify a query that is executed at runtime against some kind of naming/trader/lookup/registry infrastructure16. It is re-executed every 60 seconds to find the InfoScreens that have just come on line.

    namespace com.mycompany.production {

        instance dc: DelayCalculator

        // InfoScreen instances are created and started in other configurations
        dynamic connect dc.screens query {
            type = IInfoScreen
            status = active
        }
    }

16 Note how the specific realization depends on the technology. In the model, we just express the fact that we have to be able to run the queries specified in the model. We will map this to an implementation later.

A similar approach can be used to address load balancing or fault tolerance. A static connector can point to a primary as well as a backup instance. Or a dynamic query can be re-executed when the currently used instance becomes unavailable. To support registration of instances with the naming or registry service, we add additional syntax to their definition. A registered instance automatically registers itself with the registry, using its name (qualified through the namespace) and all provided interfaces. Additional parameters can be specified, and the following example registers a primary and a backup instance for the DelayCalculator:

    namespace com.mycompany.datacenter {

        registered instance dc1: DelayCalculator {
            registration parameters {role = primary}
        }

        registered instance dc2: DelayCalculator {
            registration parameters {role = backup}
        }
    }
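What a runtime realization of registered instances and dynamic queries might look like can be sketched in a few lines of Python. The text deliberately leaves the actual mapping to a naming/trader/lookup/registry technology open, so everything here (the Registry class, its methods, the failover helper and the name com.mycompany.mobile.screen1) is an assumption for illustration, not the project's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Registration:
    name: str                  # instance name, qualified through the namespace
    interfaces: List[str]      # all provided interfaces
    status: str = "active"
    parameters: Dict[str, str] = field(default_factory=dict)


class Registry:
    """A toy in-memory stand-in for a naming/lookup infrastructure."""

    def __init__(self):
        self._entries: Dict[str, Registration] = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def set_status(self, name, status):
        self._entries[name].status = status

    def query(self, interface, status="active", **parameters):
        """Return the names of entries providing `interface` that match all filters."""
        return sorted(
            e.name for e in self._entries.values()
            if interface in e.interfaces
            and e.status == status
            and all(e.parameters.get(k) == v for k, v in parameters.items())
        )


def resolve_with_backup(registry, interface):
    """Prefer an active primary instance; fall back to an active backup."""
    for role in ("primary", "backup"):
        found = registry.query(interface, role=role)
        if found:
            return found[0]
    return None


registry = Registry()
provided = ["IAircraftStatus", "IManagementConsole"]
registry.register(Registration("com.mycompany.datacenter.dc1", provided,
                               parameters={"role": "primary"}))
registry.register(Registration("com.mycompany.datacenter.dc2", provided,
                               parameters={"role": "backup"}))
# Hypothetical InfoScreen instance, registered from another configuration.
registry.register(Registration("com.mycompany.mobile.screen1", ["IInfoScreen"]))
```

A dynamic connector such as the one for dc.screens would then amount to calling registry.query("IInfoScreen") periodically, and re-running resolve_with_backup when the current target becomes unavailable gives a simple form of failover.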

Interfaces So far we hadn’t really defined what an interface is. We knew that we’d like to build the system based on a messaging infrastructure. Here’s our first idea: an interface is a collection of messages, where each message has a name and a list of typed parameters17. After discussing this notion of interfaces for a while, we noticed that it was too simplistic. We needed to be able to define the direction of a message: does it flow in or out of the port? More generally, which kinds of message interaction patterns are there? We identified several; here are examples of oneway and request-reply18:

    interface IAircraftStatus {

        oneway message reportPosition(aircraft: ID, pos: Position)

        request-reply message reportProblem {
            request (aircraft: ID, problem: Problem, comment: String)
            reply (repairProcedure: ID)
        }
    }

17 This also requires the ability to define data structures, but in the interests of brevity we won’t show that.
18 A oneway message has no result. It is sent by the client infrastructure to a receiver with best effort. The client does not know if the message ever arrives. A request-reply message is exactly what it seems: the client sends a message and expects a reply. Note that at this point we do not specify whether we expect the reply as a callback, by polling a Future object, or whether the client blocks until the reply arrives.

We talked a long time about various message interaction pat- terns. After a while it turned out that one of the core use cases for messages was to push updates of various data structures out to various interested parties. For example, if a flight was delayed because of a technical problem with an aircraft, then this information had to be pushed out to all the InfoScreens in the system. We prototyped several of the messages necessary for "broadcasting" complete updates, incremental updates and removal of data items. And then it hit us: we were working with the wrong abstraction!

Data Replication While messaging is a suitable transport abstraction for these things, architecturally we’re really talking about replicated data structures. It basically works the same way for all of those structures:

• You define a data structure (such as FlightInfo).

• The system then keeps track of a collection of such data structures.

• This collection is updated by a few components and typi- cally read by many other components.

• The update strategies from publisher to receiver always include full update of all items in the collection, incremental updates of just one or a few items, and removal of items.

Once we understood that in addition to messaging, there’s this additional core abstraction in the system, we added this to our Architecture DSL and were able to write something like the following. We define data structures and replicated items. Components can then publish or consume those replicated data structures. We state that the publisher publishes the replicated data whenever something changes in the local data structure. However, the InfoScreen only needs an update every 60 seconds (as well as a full load of data when it is started up).

    struct FlightInfo {
        from: Airport
        to: Airport
        scheduled: Time
        expected: Time
    }

    replicated singleton flights {
        flights: FlightInfo[]
    }

    component DelayCalculator {
        publishes flights { publication = onchange }
    }

    component InfoScreen {
        consumes flights { update 60 }
    }
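The step from a replicated data declaration to the underlying messages is a small model-to-model transformation. The following Python sketch shows the idea under stated assumptions: the message naming scheme, the parameter lists and the choice of oneway messages are invented for illustration and are not taken from the actual system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replicated:
    name: str        # name of the replicated collection, e.g. "flights"
    item_type: str   # type of the items it holds, e.g. "FlightInfo"


@dataclass
class Message:
    name: str
    parameters: List[str]   # parameters rendered as "name: Type"


def derive_messages(rep):
    """Derive one message each for full update, incremental update and removal."""
    item = rep.item_type
    return [
        Message(f"{rep.name}FullUpdate", [f"items: {item}[]"]),
        Message(f"{rep.name}IncrementalUpdate", [f"changed: {item}[]"]),
        Message(f"{rep.name}Remove", ["removed: ID[]"]),
    ]


def render(msg):
    """Render a derived message in the notation of the interface examples."""
    return f"oneway message {msg.name}({', '.join(msg.parameters)})"


flights = Replicated("flights", "FlightInfo")
derived = [render(m) for m in derive_messages(flights)]
```

The point of the sketch is the direction of the derivation: the architect states the intent (replicate flights), and the lower-level messages follow mechanically from it rather than being written by hand.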

This is much more concise compared to a description based on messages. We can automatically derive the kinds of messages needed for full update, incremental update and removal, and create these messages in the model using a model transformation. The description reflects much more clearly the actual architectural intent: it expresses better what we want to do (replicate data) compared to a lower-level description of how we want to do it (sending around update messages)19.

19 Once again, this shows the advantage of a DSL that is closely aligned with its domain, and that can be evolved together with the understanding of the domain. With a fixed language, we most likely would have had to shoehorn the data replication abstractions into messaging, somehow. And by doing so, we would have lost the opportunity to generate meaningful code based on the actual architectural intent.

Interface Semantics While replication is a core concept for data, there is of course still a need for messages, not just as an implementation detail, but also as a way to express architectural intent20. It is useful to add more semantics to an interface, for example to define valid sequencing of messages. A well-known way to do that is to use protocol state machines. Here is an example. It expresses the fact that you can only report positions and problems once an aircraft is registered. In other words, the first thing an aircraft has to do is register itself.

    interface IAircraftStatus {

        oneway message registerAircraft(aircraft: ID)

        oneway message unregisterAircraft(aircraft: ID)

        oneway message reportPosition(aircraft: ID, pos: Position)

        request-reply message reportProblem {
            request (aircraft: ID, problem: Problem, comment: String)
            reply (repairProcedure: ID)
        }

        protocol initial = new {
            state new {
                registerAircraft => registered
            }
            state registered {
                unregisterAircraft => new
                reportPosition
                reportProblem
            }
        }
    }

20 Not all communication is actually data replication.

Initially, the protocol state machine is in the new state where the only valid message is registerAircraft. If this is received, we transition into the registered state. In registered, you can either unregisterAircraft and go back to new, or receive a reportProblem or reportPosition message, in which case you will remain in the registered state.
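The protocol state machine of IAircraftStatus translates almost directly into a transition table. The Python sketch below shows one way a runtime could enforce the message sequencing; the ProtocolChecker class and the ProtocolViolation exception are hypothetical names, and the text does not say how the real system enforces the protocol.

```python
# Transition table: state -> {message -> next state}.
# Messages listed without "=>" in the DSL keep the machine in its current state.
PROTOCOL = {
    "new": {"registerAircraft": "registered"},
    "registered": {
        "unregisterAircraft": "new",
        "reportPosition": "registered",
        "reportProblem": "registered",
    },
}


class ProtocolViolation(Exception):
    """Raised when a message is not valid in the current protocol state."""


class ProtocolChecker:
    def __init__(self, transitions, initial):
        self.transitions = transitions
        self.state = initial

    def receive(self, message):
        """Accept a message if it is valid in the current state and move on."""
        allowed = self.transitions[self.state]
        if message not in allowed:
            raise ProtocolViolation(f"{message} is not valid in state {self.state}")
        self.state = allowed[message]
        return self.state


checker = ProtocolChecker(PROTOCOL, initial="new")
```

With this table, a reportPosition sent before registerAircraft raises ProtocolViolation, which is the behavior the protocol block is meant to express.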

Versioning We mentioned above that the system is distributed geographically. This means it is not feasible to update all the software for all parts of the system (e.g., all InfoScreens or all AircraftModules) at the same time. As a consequence, there might be several versions of the same component running in the system. To make this feasible, many non-trivial things need to be put in place in the runtime system. But the basic requirement is this: you have to be able to mark versions of components, and you have to be able to check them for compatibility with older versions. The following piece of code expresses the fact that the DelayCalculatorV2 is a new implementation of DelayCalculator. newImplOf means that no externally visible aspects change, which is why no ports and other externally exposed details of the component are declared21.

    component DelayCalculator {
        publishes flights { publication = onchange }
    }

    newImplOf component DelayCalculator: DelayCalculatorV2

21 For all intents and purposes, it’s the same thing – just maybe a couple of bugs are fixed.

To evolve the externally visible signature of a component, one can write this:

    component DelayCalculator {
        publishes flights { publication = onchange }
    }

    newVersionOf component DelayCalculator: DelayCalculatorV3 {
        publishes flights { publication = onchange }
        provides somethingElse: ISomething
    }

The keyword newVersionOf allows us to add additional provided ports (such as the somethingElse port) and to remove required ports22.

22 You cannot add additional required ports or remove any of the provided ports, since that would destroy the "plug-in compatibility". Constraints make sure that these rules are enforced on the model level.

This wraps up our case study23. In the next subsection we recap the approach and provide additional guidance.
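The compatibility rules for newVersionOf (provided ports may be added, required ports may be removed, but not the other way around) are simple set comparisons. Here is a minimal Python sketch of such a model-level constraint; the dictionary representation, the function name and the ILog interface are assumptions for illustration, and the ports are taken from the first DelayCalculator example rather than from the publishes-based variant.

```python
def check_new_version(old, new):
    """Check the newVersionOf rules; return a list of error strings.

    `old` and `new` are dicts with "provides" and "requires" entries, each a
    set of (port name, interface) pairs.
    """
    errors = []
    for port, _ in sorted(old["provides"] - new["provides"]):
        errors.append(f"provided port removed: {port}")
    for port, _ in sorted(new["requires"] - old["requires"]):
        errors.append(f"required port added: {port}")
    return errors


delay_calculator = {
    "provides": {("aircraft", "IAircraftStatus"),
                 ("managementConsole", "IManagementConsole")},
    "requires": {("screens", "IInfoScreen")},
}

# Adds a provided port and drops a required one: allowed.
delay_calculator_v3 = {
    "provides": delay_calculator["provides"] | {("somethingElse", "ISomething")},
    "requires": set(),
}

# Hypothetical incompatible version: drops a provided port, adds a required one.
incompatible = {
    "provides": {("aircraft", "IAircraftStatus")},
    "requires": {("screens", "IInfoScreen"), ("log", "ILog")},
}
```

A check for newImplOf would be stricter still: the old and new signatures would have to be equal.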

23 While writing this chapter, I contacted the customer, and we talked about the approach in hindsight. As it turns out, they are still happily using the approach as the foundation of their system. They generate several hundred thousand lines of code from the models, and they have to migrate to a new version of Xtext real soon now :-)

18.2.2 Architecture DSL Concepts

What we did in a Nutshell Using the approach shown here, we were able to quickly get a grip on the overall architecture of the system. All of the above was actually done in the first day of the workshop. We defined the grammar, some important constraints, and a basic editor (without many bells and whistles). We were able to separate what we wanted the system to do from how it would achieve it: all the technology discussions were now merely an implementation detail of the conceptual descriptions given here24. We also achieved a clear and unambiguous definition of what the different architectural concepts mean. The generally nebulous concept of component has a formal, well-defined meaning in the context of this system.

24 I don’t want to give the impression that technology decisions are unimportant. However, they should be an explicit second step, and not mixed with the concepts. This is particularly important in light of the fact that the system should live for more than 20 years, during which technologies will change more than once.

The approach discussed in this chapter therefore recommends the definition of a formal language for your project’s or system’s conceptual architecture. You develop the language as the understanding of your architecture grows. The language therefore always represents the complete understanding about your architecture in a clear and unambiguous way.

Component Implementation By default, architecture DSLs are incomplete; component implementation code is written manually against the generated API, using well-known composition techniques such as inheritance, delegation or partial classes. However, there are other alternatives for component implementation that do not use a GPL, but instead use formalisms that are specific to certain classes of behavior: state machines, business rules or workflows. You can also define and use a domain-specific language for certain classes of functionality in a specific business domain. If such an approach is used, then the code generator for the application domain DSL has to act as the "implementor" of the components (or whatever other architectural abstractions are defined in the architecture DSL). The code generated from the business logic DSL must fit into the code skeletons generated from the architecture DSL. The code composition techniques discussed for incomplete DSLs in Section 4.5.1 can be used here25.

25 Be aware that the discussion in this section is only really relevant for application-specific behavior, not for all implementation code. Huge amounts of implementation code are related to the technical infrastructure (e.g., remoting, persistence, workflow) of an application. This can be derived from the architectural models, and generated automatically.

Standards, ADLs and UML Describing architecture with formal languages is not a new idea. Various communities recommend using Architecture Description Languages (ADLs) or the Unified Modeling Language (UML). However, all of those approaches advocate using existing, generic languages for specifying architecture, although some of them, including UML, can be customized to some degree26.

26 I am not going to elaborate on the reasons why I think profiles are usually not a good alternative to a DSL. It should become clear from the example above and the book in general why I think this is true.

Unfortunately, efforts like that completely miss the point. We have not experienced much benefit in shoehorning an architecture description into the (typically very limited, as well as too generic) constructs provided by predefined languages – one of the core activities of the approach explained in this chapter is the process of actually building your own language to capture your system’s specific conceptual architecture.

 Visualization In this project, as well as in many other ones, we have used textual DSLs. We have argued in this book why textual DSLs are superior in many cases, and these arguments apply here as well. However, we did use visualization to show the relationships between the building blocks, and to communicate the architecture to stakeholders who were not willing to dive into the textual models. Figure 18.2 shows an example, created with Graphviz.

Figure 18.2: This shows the relationship between the ingredients of the example system described above. It contains namespaces, components, interfaces and data structures, as well as the provides and requires relationships between components and interfaces.

 Code Generation Now that we have a formal model of the conceptual architecture (the language) and also a formal description of the system(s) we are building – i.e., the models defined using the language – we can exploit this by generating code27:

• We generate an API against which the implementation is coded. That API can be non-trivial, taking into account the various messaging paradigms, replicated state, etc. The generated API allows developers to code the implementation in a way that does not depend on any technological decisions: the generated API hides those from the component implementation code. We call this generated API, and the set of idioms to use it, the programming model.

• Remember that we expect some kind of component container or middleware platform to run the components, so we also generate the code that is necessary to run the components (including their technology-neutral implementation) on the implementation technology of choice. We call this layer of code the technology mapping code (or glue code). It typically also contains the configuration files for the various platforms involved28.

27 It should have become clear from the chapter that the primary benefit of developing the architecture DSL (and using it) is just that: understanding concepts by removing any ambiguity and defining them formally. It helps you understand your system and get rid of unwanted technology interference. But of course, exploiting these assets further by generating code is useful.

28 As a side effect, the generators capture best practices in working with the technologies you’ve decided to use.

It is of course completely feasible to generate APIs for several target languages (supporting component implementation in various languages) and/or to generate glue code for several target platforms (supporting the execution of the same component on different middleware platforms). This nicely supports potential multi-platform requirements29.

29 Even if you don’t have explicit requirements to support several platforms, it may still be useful to be able to do so. For one, it makes future migration to a new platform simpler. Or a second platform may simply be a unit test environment running on a local PC, without the expensive distribution infrastructure of the real system.

Another important point is that you typically generate in several phases: a first phase uses type definitions (components, data structures, interfaces) to generate the API code so you can write the component implementation code. A second phase generates the glue code and the system configuration code. Consequently, it is often sensible to separate type definitions from system definitions into several different viewpoints30: these are used at different times in the overall process, and also often created, modified and processed by different people.

30 Viewpoints were discussed in Section 4.4.

In summary, the generated code supports an efficient and technology-independent implementation and hides much of the underlying technological complexity, making development more efficient and less error-prone.
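To make the split between programming model and glue code more tangible, here is a minimal Java sketch. It is not the output of any real generator: the names DelayCalculator, DelayListener, DelayCalculatorBase and DelayCalculatorImpl are hypothetical, and the "platform" is simply an in-process call, assumed only for illustration. The point is that the handwritten class depends on nothing but the generated API.

```java
import java.util.ArrayList;
import java.util.List;

/** Plays the role of generated glue code for a trivial in-process "platform" (assumption). */
class CodeGenSketch {
    static List<String> run() {
        List<String> log = new ArrayList<>();
        DelayCalculatorImpl component = new DelayCalculatorImpl();
        // The glue code wires the required interface; a lambda stands in for real middleware.
        component.wireListener((flight, minutes) -> log.add(flight + ":" + minutes));
        component.report("LH123", 600, 625);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}

// ---- Generated API (hypothetical generator output, the "programming model") ----

/** Provided interface of the component. */
interface DelayCalculator {
    int delayInMinutes(int scheduledMinute, int actualMinute);
}

/** Required interface; an implementation is supplied by the glue code. */
interface DelayListener {
    void delayChanged(String flight, int minutes);
}

/** Generated skeleton: holds the wiring, contains no technology-specific code. */
abstract class DelayCalculatorBase implements DelayCalculator {
    private DelayListener listener;

    final void wireListener(DelayListener listener) {
        this.listener = listener;
    }

    protected final DelayListener listener() {
        return listener;
    }
}

// ---- Handwritten implementation, coded only against the generated API ----

class DelayCalculatorImpl extends DelayCalculatorBase {
    @Override
    public int delayInMinutes(int scheduledMinute, int actualMinute) {
        return Math.max(0, actualMinute - scheduledMinute);
    }

    void report(String flight, int scheduledMinute, int actualMinute) {
        listener().delayChanged(flight, delayInMinutes(scheduledMinute, actualMinute));
    }
}
```

Swapping the target platform would then mean regenerating only the glue class, while DelayCalculatorImpl stays untouched. This sketch uses inheritance as the composition technique; delegation would work just as well.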

 What Needs to be Documented? I advertise the above approach as a way to formally describe a system’s conceptual and application architecture. So, this means it serves as some kind of documentation, right? Right, but it does not mean that you don’t have to document anything else. The following things still need to be documented31:

31 There are probably more aspects of an architecture that might be worth documenting, but the following two are the most important ones when working with Architecture DSLs.

Rationales/Architectural Decisions The DSLs describe what your architecture looks like, but they do not explain why it looks the way it does. You still need to document the rationales for architectural and technological decisions. Note that the grammar of your architecture DSL is a really good baseline for such documentation. Each of the constructs is the result of architectural decisions. So, if you explain for each grammar element why it is there (and why other alternatives have not been chosen) you are well on your way to documenting the important architectural decisions. A similar approach can be used for the application architecture, i.e. the programs written with the DSL.

User Guides A language grammar can serve as a well-defined and formal way of capturing an architecture, but it is not a good teaching tool. So you need to create tutorials for your users (i.e., the application programmers) that explain how to use the architecture and the DSL. This includes what and how to model (using your DSL) and also how to generate code and how to use the programming model (how to fill in the implementation code into the generated skeletons).

18.3 Component Models

As I have hinted at already, I think that component-based software architectures are extremely useful. There are many (more or less formal) definitions of what a component is. They range from a building block of software systems, through something with explicitly defined context dependencies, to something that contains business logic and is run inside a container.

Our understanding (notice we are not saying we have a real definition) is that a component is the smallest architectural building block. When defining a system’s architecture, you typically don’t look inside components. Components have to specify all their architecturally relevant properties declaratively (using meta data, or models). As a consequence, components become analyzable and composable by tools. Typically they run inside a container that serves as a framework to act on the runtime-relevant parts of the meta data. The component boundary is the level at which the container can provide technical services such as logging, monitoring or fail-over. The component also provides well-defined APIs to its implementation to address cross-cutting architectural concerns such as locking32.

32 A long time ago I coauthored the book Server Component Patterns. While the EJB-based technology examples are no longer relevant, many of the patterns identified and discussed in the book still are. The book was not written with a MDD/DSL background, but the patterns fit very well with such an approach.

I don’t have any specific requirements for what meta data a component actually contains (and hence, which properties are described). The concrete notion of components has to be defined for each (system/platform/product line) architecture separately: this is exactly what we do with the language approach introduced above.

Based on my experience, it is almost always useful to start by modeling the component structure of the system to be built. To do that, we start by defining what a component actually is – that is, by defining a meta model for component-based development. Independent of the project’s domain, these meta models are quite similar, with a set of specific variation points. We show parts of these meta models here to give you a head start when defining your own component architecture.

18.3.1 Three Typical Viewpoints

It is useful to look at a component-based system from several viewpoints (Section 4.4). Three viewpoints form the backbone of the description.

 The Type Viewpoint The Type viewpoint describes component types, interfaces and data structures33. A component provides a number of interfaces and references a number of required interfaces (often through ports, as in the example above). An interface owns a number of operations, each with a return type, parameters and exceptions. Fig. 18.3 shows this.

33 Referring back to the terminology introduced in Section 3.3, this viewpoint is independent. It is also sufficient for generating all the code that is needed for developers to implement the application logic.

Figure 18.3: The types viewpoint describes interfaces, data types and components. These are referred to as types because they can be instantiated (in the composition viewpoint) to make up the actual application.

To describe the data structures with which the components work (the meta model is shown in Fig. 18.4), we start out with the abstract concept Type. We use primitive types as well as complex types. A ComplexType has a number of named and typed attributes. There are two kinds of complex types. Data transfer objects are simple (C-style) structs that are used to exchange data among components. Entities have a unique ID and can be persisted (this is not visible from the meta model). Entities can reference each other and can thus build more complex data graphs. Each reference specifies whether it is navigable in only one or in both directions. A reference also specifies the cardinalities of the entities at the respective ends, and whether the reference has containment semantics.

Figure 18.4: The structure of data types of course depends a lot on the application. A good starting point for data definitions is the well-known entity relationship model, which is essentially represented by this meta model.

 The Composition Viewpoint The Composition viewpoint, illustrated in Fig. 18.5, describes component instances and how they are connected34. Using the Type and Composition viewpoints, we can define logical models of applications.

34 The Composition viewpoint is dependent: it depends on the Type viewpoint. It is also sufficient for generating stub implementations of the system for unit testing and for doing dependency analyses and other checks for completeness and consistency.

A Configuration consists of a number of ComponentInstances, each referencing its type (from the Type viewpoint). An instance has a number of wires (or connectors): a Wire can be seen as an instance of a ComponentInterfaceRequirement. Note the constraints defined in the meta model:

• For each ComponentInterfaceRequirement defined in the instance’s type, we need to supply a wire.

• The type of the component instance at the target end of a wire needs to provide the interface to which the wire’s component interface requirement points.
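The two constraints above are easy to check mechanically once the meta model exists. The following Java sketch shows one possible way to do so. It uses a deliberately simplified meta model in which interfaces are plain strings; the class names loosely follow the text, but the code itself is an illustration written for this purpose, not something produced by the tooling discussed in this chapter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class CompositionCheck {
    /** Type viewpoint: a component type with provided and required interface names. */
    static final class ComponentType {
        final String name;
        final Set<String> provided;
        final Set<String> required;

        ComponentType(String name, Set<String> provided, Set<String> required) {
            this.name = name;
            this.provided = provided;
            this.required = required;
        }
    }

    /** Composition viewpoint: an instance of a type plus its outgoing wires. */
    static final class Instance {
        final String name;
        final ComponentType type;
        final List<Wire> wires = new ArrayList<>();

        Instance(String name, ComponentType type) {
            this.name = name;
            this.type = type;
        }
    }

    /** A wire satisfies one required interface by pointing at a target instance. */
    static final class Wire {
        final String requiredInterface;
        final Instance target;

        Wire(String requiredInterface, Instance target) {
            this.requiredInterface = requiredInterface;
            this.target = target;
        }
    }

    /** Returns one message per violated constraint; an empty list means the configuration is valid. */
    static List<String> validate(List<Instance> configuration) {
        List<String> errors = new ArrayList<>();
        for (Instance instance : configuration) {
            // Constraint 1: every required interface of the instance's type needs a wire.
            for (String required : instance.type.required) {
                boolean wired = instance.wires.stream()
                        .anyMatch(w -> w.requiredInterface.equals(required));
                if (!wired) {
                    errors.add(instance.name + ": no wire for required interface " + required);
                }
            }
            // Constraint 2: the target's type must provide the interface the wire stands for.
            for (Wire wire : instance.wires) {
                if (!wire.target.type.provided.contains(wire.requiredInterface)) {
                    errors.add(instance.name + ": " + wire.target.name
                            + " does not provide " + wire.requiredInterface);
                }
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        ComponentType screenType = new ComponentType("InfoScreen", Set.of(), Set.of("IDelayInfo"));
        ComponentType calcType = new ComponentType("DelayCalculator", Set.of("IDelayInfo"), Set.of());
        Instance screen = new Instance("screen1", screenType);
        Instance calc = new Instance("calc1", calcType);
        System.out.println(validate(List.of(screen, calc))); // reports the missing wire
        screen.wires.add(new Wire("IDelayInfo", calc));
        System.out.println(validate(List.of(screen, calc))); // no violations left
    }
}
```

In a language workbench these checks would typically be written as constraints in the DSL's validation framework, so that violations show up directly in the editor.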

Figure 18.5: The Composition viewpoint describes the composition of a logical system from the component types defined in the Types viewpoint.

 The System Viewpoint The system viewpoint describes the system infrastructure onto which the logical system defined with the two previous viewpoints is deployed (Fig. 18.6), as well as the mapping of the Composition viewpoint onto this execution infrastructure35.

35 In some cases it may make sense to separate the infrastructure definition from the mapping to the Composition viewpoint (for example, if the infrastructure is configured by some operations department).

Figure 18.6: The System viewpoint maps the logical system defined in the Composition viewpoint to a specific hardware and systems software configuration.

A system consists of a number of nodes, each one hosting containers. A container hosts a number of component instances. Note that a container also defines its kind – representing technologies such as OSGi, JEE, Eclipse or Spring. Based on this data, together with the data in the Composition viewpoint, you can generate the necessary "glue" code to run the components in that kind of container, including container and remote communication configuration code, as well as scripts to package and deploy the artifacts for each container.

You may have observed that the dependencies among the viewpoints are well-structured. Since you want to be able to define several compositions using the same components and interfaces, and since you want to be able to run the same compositions on several infrastructures, dependencies are only legal in the directions shown in Figure 18.7.

Figure 18.7: Viewpoint dependencies are set up so that the same components (Type viewpoint) can be assembled into several logical applications (Composition viewpoint), and each of those can be mapped to several execution infrastructures (System viewpoint).

18.3.2 Aspect Models

The three viewpoints described above are a good starting point for modeling and building component-based systems. However, in many cases these three models are not enough. Additional aspects of the system have to be described using specific aspect models that are arranged "around" the three core viewpoint models. The following aspects are typically handled in separate aspect models:

• Persistence
• Authorization and Authentication (for enterprise systems)
• Forms, layout, page flow (for Web applications)
• Timing, scheduling and other quality of service aspects (especially in embedded systems)
• Packaging and deployment
• Diagnostics and monitoring

The idea of aspect models is that the information is not added to the three core viewpoints, but rather is described using a separate model with a suitable concrete syntax. Again, the meta model dependencies are important: the aspects may depend on the core viewpoint models and maybe even on one another, but the core viewpoints must not depend on any of the aspect models. Figure 18.8 illustrates a simplified persistence aspect meta model.

Figure 18.8: An example meta model for a persistence aspect. It maps data defined in the Type viewpoint to tables and columns.

18.3.3 Variations

The meta models described above cannot be used in exactly this way in every project. Also, in many cases the notion of what constitutes a Component needs to be adapted or extended. As a consequence, there are many variations of these meta models. In this section we discuss a few of them36.

36 Consider them as an inspiration for your own case. In the end, the actual system architecture drives these meta models.

 Messaging Instead of operations and their typical call/block semantics, you may want to use messages, together with the well-known message interaction patterns. The example system in this chapter used messaging.

 No Interfaces Operations could be added directly to the components. As a consequence, of course, you cannot reuse the interface’s "contracts" separately, independently of the supplier or consumer components, and you cannot implement interface polymorphism.

 Component Kinds Often you’ll need different kinds of components, such as domain components, data access components, process components or business rules components. Depending on this component classification, you can define the valid dependency structures between components (e.g., a domain component may access a data access component, but not the other way round)37.

37 You will typically also use different ways of implementing component functionality, depending on the component kind.

 Layers Another way of managing dependencies is to mark each component with a layer tag, such as domain, service, GUI or facade, and define constraints on how components in these layers may depend on each other.
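Whether you classify components by kind or tag them with a layer, the resulting constraint is a table of allowed dependencies that a model validator can consult. The Java sketch below illustrates this for component kinds. The concrete rule table is an assumption chosen to match the example in the text (a domain component may access a data access component, but not the other way round); your own architecture will define different rules.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

class KindRules {
    enum Kind { PROCESS, DOMAIN, DATA_ACCESS }

    // Hypothetical rule table: for each kind, the kinds it is allowed to depend on.
    private static final Map<Kind, Set<Kind>> ALLOWED = new EnumMap<>(Kind.class);
    static {
        ALLOWED.put(Kind.PROCESS, EnumSet.of(Kind.DOMAIN));
        ALLOWED.put(Kind.DOMAIN, EnumSet.of(Kind.DOMAIN, Kind.DATA_ACCESS));
        ALLOWED.put(Kind.DATA_ACCESS, EnumSet.noneOf(Kind.class));
    }

    /** True if a component of kind 'from' may require an interface provided by kind 'to'. */
    static boolean mayDependOn(Kind from, Kind to) {
        return ALLOWED.get(from).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(mayDependOn(Kind.DOMAIN, Kind.DATA_ACCESS)); // allowed by the table
        System.out.println(mayDependOn(Kind.DATA_ACCESS, Kind.DOMAIN)); // rejected by the table
    }
}
```

A validator for the Type viewpoint would call mayDependOn for every required interface of a component, using the kind of the component that provides it. Layer tags can be handled with the same structure by replacing the enum.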

 Configuration Parameters A component might have a number of configuration parameters – comparable to command line arguments in console programs – that help configure the behavior of components. The parameters and their types are defined in the type model, and values for the parameters can be specified later, for example in the models for the Composition or the System viewpoints.

 Component Characteristics You may want to express whether a component is stateless or stateful, whether it is thread-safe or not, and what its lifecycle should look like (for example, whether it is passive or active, whether it wants to be notified of lifecycle events such as activation, and so on).

 Asynchronicity Even if you use operations (and not messaging), it is not always enough to use simple synchronous communication. Instead, one of the various asynchronous communication paradigms, such as those described in the Remoting Patterns book38, might be applicable. Because using these paradigms affects the APIs of the components, the pattern to be used has to be marked up in the model for the Type viewpoint, as shown in Fig. 18.9. It is not enough to define it in the Composition viewpoint39.

38 M. Voelter, M. Kircher, and U. Zdun. Remoting Patterns. Wiley, 2004

39 If your application uses messaging interfaces instead of the RPC-style interfaces discussed in this section, then an API can be designed that remains independent of the communication paradigm.

Figure 18.9: A ComponentInterfaceRequirement may describe which communication paradigm should be applied.

 Events In addition to the communication through interfaces, you might need (asynchronous) events using a static or dynamic publisher/subscriber infrastructure40.

40 The "direction of flow" of these events is typically the opposite of the dependencies discussed above: this allows them to be used for notifications or callbacks.

 Dynamic Connection The Composition viewpoint connects component instances statically. This is not always feasible. If dynamic wiring is necessary, the best way is to embed the information that determines which instance to connect to at runtime into the static wiring model. So, instead of specifying in the model that instance A must be wired to instance B, the model only specifies that A needs to connect to a component with the following properties: it needs to provide a specific interface, and for example offer a certain reliability. At runtime, the wire is "dereferenced" to a suitable instance using an instance repository41.

41 We have seen this in the example system described above.

 Hierarchical Components Hierarchical components, as illustrated in Figure 18.10, are a very powerful tool. Here a component is structured internally as a composition of other component instances. This allows a recursive and hierarchical decomposition of a system to be supported. Ports define how components may be connected: a port has an optional protocol definition that allows for port compatibility checks that go beyond simple interface equality. While this approach is powerful, it is also non-trivial, since it blurs the formerly clear distinction between Type and Composition viewpoints.
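Returning briefly to the Dynamic Connection variation: the following Java sketch shows what dereferencing a wire against an instance repository might look like. Everything in it is assumed for illustration, including the class names, the use of a single numeric reliability value as the selection property, and the first-match strategy. A real system would likely support richer properties and react to instances appearing and disappearing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class DynamicWiring {
    /** A component instance that is currently available at runtime. */
    static final class RunningInstance {
        final String name;
        final String providedInterface;
        final double reliability;

        RunningInstance(String name, String providedInterface, double reliability) {
            this.name = name;
            this.providedInterface = providedInterface;
            this.reliability = reliability;
        }
    }

    /** What the static wiring model stores instead of a concrete target instance. */
    static final class WireSpec {
        final String requiredInterface;
        final double minReliability;

        WireSpec(String requiredInterface, double minReliability) {
            this.requiredInterface = requiredInterface;
            this.minReliability = minReliability;
        }
    }

    /** Registry of running instances, consulted when a wire is dereferenced. */
    static final class InstanceRepository {
        private final List<RunningInstance> instances = new ArrayList<>();

        void register(RunningInstance instance) {
            instances.add(instance);
        }

        /** Returns the first registered instance that satisfies all properties of the wire. */
        Optional<RunningInstance> resolve(WireSpec spec) {
            return instances.stream()
                    .filter(i -> i.providedInterface.equals(spec.requiredInterface))
                    .filter(i -> i.reliability >= spec.minReliability)
                    .findFirst();
        }
    }

    public static void main(String[] args) {
        InstanceRepository repository = new InstanceRepository();
        repository.register(new RunningInstance("calcA", "IDelayInfo", 0.90));
        repository.register(new RunningInstance("calcB", "IDelayInfo", 0.999));
        WireSpec spec = new WireSpec("IDelayInfo", 0.99);
        System.out.println(repository.resolve(spec).map(i -> i.name).orElse("<unresolved>"));
    }
}
```

Because the selection criteria live in the model, the validator can still check statically that at least the required interface exists, even though the concrete target is only known at runtime.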

Figure 18.10: Hierarchical components support the recursive hierarchical decomposition of systems. In particular, a HierarchicalComponent consists internally of connected instances of other components.

 Structuring Finally, it is often necessary to provide additional means of structuring complex systems. The terms business component or subsystem are often used. Such a higher-level structure consists of a set of components (and related artifacts such as interfaces and data types). Optionally, constraints define which kinds of components may be contained in a specific kind of higher-level structure. For example, you might want to define a business component to always consist of exactly one facade component and any number of domain components.

19 DSLs as Programmer Utility

This chapter was written by Birgit Engelmann and Sebastian Benz. You can reach them via [email protected] and [email protected].

In this chapter we look at an example of a DSL that is used as a developer utility. This means that the DSL is not necessarily used to implement the core application itself, but plays an important role as part of the development process. The example discussed in this chapter is Jnario, an Xtext-based DSL for behavior-driven development and unit testing.

Tests play an important role in software development1: they ensure that an application behaves as expected. However, the benefit of tests goes beyond checking whether a system’s output corresponds to the input. They provide valuable feedback about the quality of your application’s design. For example, if a class is hard to test, this can be an indication of too many dependencies or responsibilities, which in turn hinders future maintainability. Furthermore, tests are an effective means for developers to elaborate, together with the stakeholders, how the desired features should work. If done right, tests created in such a way can be used as executable specifications of the application’s behavior.

1 We have covered testing DSLs and the programs written with DSLs earlier in Chapter 14. This DSL is used to specify and test application code.

In particular, when tests are used in this latter way, using a general-purpose programming language for writing tests is not very suitable, because it is hard for non-developers to read, understand and perhaps even write the specified behavior2. It is desirable to describe an application’s behavior using a textual format that can be understood by business users, developers and testers. At the same time, it should be possible to make these specifications executable in order to use them to check automatically whether the application fulfills its expected behavior.

2 Even when stakeholders are not involved in the specification and testing of the system, this is still a domain that warrants its own DSL.

19.1 The Context

The core idea of Jnario is to describe features of your application using scenarios. For each scenario, preconditions, events and expected outcomes are described textually with Given, When and Then steps. Here is an example acceptance specification for adding values with a calculator. It is written using Cucumber3, a popular Java-based tool for behavior-driven development.

    Feature: Basic Arithmetic

      Scenario: Addition
        Given a just turned on calculator
        When I enter "50"
        And I enter "70"
        And press add
        Then the result should be "120"

3 cukes.info

This format was introduced six years ago by Dan North4, and since then has been widely adopted by practitioners. There are tools for most programming languages to turn these scenarios into executable specifications. This is accomplished by mapping each step to corresponding execution logic. For example, in Java/Cucumber you need to declare a separate class with a method for each step:

    public class CalculatorStepdefs {
        private Calculator calc;

        // The regular expression must match the step text used in the scenario.
        @Given("^a just turned on calculator$")
        public void a_just_turned_on_calculator() {
            calc = new Calculator();
        }

        ...
    }

4 dannorth.net/introducing-bdd/

The method’s annotation contains a regular expression matching the text in the Given part of a scenario. The method’s body contains the actual implementation of the step. The tool then executes a scenario by instantiating the class and executing each step with the associated method. The problem with this approach is that the overhead of adding new steps is quite high, since for every new step, a new implementing method has to be created as well.

19.2 Jnario Described

To make it simpler to define executable scenarios for Java applications, we decided to create Jnario5. In Jnario you can directly add the necessary code below your steps. In our example we create a calculator, pass in the parameters defined in the steps (via the implicit variable args) and check the calculated result:

    ...
    Feature: Basic Arithmetic

      Scenario: Addition
        Calculator calculator
        Given a just turned on calculator
          calculator = new Calculator()
        When I enter "50"
          calculator.enter(args.first)
        And I enter "70"
        And press add
          calculator.add
        Then the result should be "120"
          result => args.first

5 www.jnario.org

This reduces the effort of writing scenarios, because there are no separate step definitions to maintain. It is still possible to reuse existing step definitions in scenarios. The editor even provides code completion showing all existing steps. In our example, the step And I enter "70" reuses the code of the step When I enter "50" with a different parameter value. A step is resolved by removing all keywords and parameter values from its description and then matching the normalized description to the descriptions of steps with an implementation.
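The resolution rule just described can be approximated in a few lines of Java. Note that this is only a sketch of the idea and not Jnario's actual implementation: the set of keywords, the treatment of quoted strings as the only kind of parameter, and the class name StepResolver are all assumptions made for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

class StepResolver {
    // Leading step keywords to strip (assumed set).
    private static final String KEYWORD = "^(Given|When|Then|And|But)\\s+";
    // Parameter values are assumed to be double-quoted strings such as "50".
    private static final String PARAMETER = "\"[^\"]*\"";

    // Normalized description -> implementation code (kept as plain text in this sketch).
    private final Map<String, String> implemented = new LinkedHashMap<>();

    /** Removes the keyword and blanks out all parameter values. */
    static String normalize(String step) {
        return step.trim()
                .replaceFirst(KEYWORD, "")
                .replaceAll(PARAMETER, "\"\"")
                .trim();
    }

    /** Registers a step that comes with an implementation. */
    void define(String step, String code) {
        implemented.put(normalize(step), code);
    }

    /** Finds the implementation of a step that has the same normalized description. */
    Optional<String> resolve(String step) {
        return Optional.ofNullable(implemented.get(normalize(step)));
    }

    public static void main(String[] args) {
        StepResolver resolver = new StepResolver();
        resolver.define("When I enter \"50\"", "calculator.enter(args.first)");
        // Different keyword and parameter value, same normalized description.
        System.out.println(resolver.resolve("And I enter \"70\"").orElse("<no implementation>"));
    }
}
```

Since both steps normalize to the same description, the second one picks up the implementation of the first, with args bound to its own parameter values when the step is executed.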

Figure 19.1: Hiding the step implementation in the Editor.

You might think now that mixing code and text in your specs makes everything pretty unreadable. Actually, this is not a problem, as you can hide the code in the editor to improve the readability, as shown in Fig. 19.1. This is accomplished using the code folding feature of Eclipse for hiding the step implementations. When editing a step in an editor, its implementation will automatically be shown6.

6 Mixing prose scenario descriptions and the implementation code is also not violating the separation of concerns. This is because we do not mix scenarios and application code. The scenario implementation code really is part of the scenario, so mixing the code with the prose makes sense.

Feature definitions in Jnario compile to plain Java JUnit tests, which can be directly executed from within Eclipse. Figure 19.2 shows the execution result of the Calculator feature in Eclipse.

Figure 19.2: Executing Jnario features from within Eclipse.

Scenarios are good for writing high-level acceptance specifications, but writing scenarios for data structures or algorithms quickly becomes tedious. This is why Jnario provides another language for writing unit specifications. This language removes all the boilerplate from normal JUnit tests, helping you to focus on what is important: specifying facts about your implementation. A fact can be as simple as a single expression asserting a simple behavior:

    package demo

    import java.util.Stack

    describe Stack {
      fact new Stack().size => 0
    }

We use => to describe the expected result of an expression. More complex facts can have an additional description:

    describe Stack {
      fact "size increases when adding elements" {
        val stack = new Stack
        stack.add("something")
        stack.size => 1
      }
    }

Objects can behave differently in different contexts. For example, calling pop on a non-empty stack decreases its size. However, if the stack is empty, pop results in an exception. In Jnario we can explicitly specify such contexts using the context keyword:

    describe "A Stack" {
      val stack = new Stack
      context "empty" {
        fact stack.size => 0
        fact stack.pop throws Exception
      }
      context "with one element" {
        before stack.add("element")
        fact stack.size => 1
        fact "pop decreases size" {
          stack.pop
          stack.size => 0
        }
      }
    }

You can execute unit specifications as normal JUnit tests. Note that Jnario uses the description as test name or, if not present, the actual expression being executed. If you look at the executed tests in Fig. 19.3, you can see that your specifications effectively document the behavior of your code.

Figure 19.3: Executing Jnario unit specs from within Eclipse.

Jnario is not just another testing framework. It is actually two domain-specific languages specifically made for writing executable specifications. The main advantage of using a DSL for writing tests is that it becomes possible to adapt the syntax to the skills of the intended users. In our case this means that specifications written in Jnario can be understood by users without a programming background.

Another advantage of using a DSL for writing tests is the possibility of adding features that are not available in a general purpose programming language, but are really helpful when writing tests. If you think about current testing frameworks, they usually "stretch" the syntax of the underlying programming language to be able to write expressive tests. Compare that to a programming language in which the syntax is specifically designed for the purpose of writing tests. For example, a common scenario is to test a class with different sets of input parameters. Writing such tests in Jnario is really easy, as it has a special table syntax:

    describe "Adding numbers" {
      def additions {
        | a  | b  | sum |
        | 1  | 2  | 3   |
        | 4  | 5  | 9   |
        | 10 | 11 | 20  |
        | 21 | 21 | 42  |
      }
      fact additions.forEach[a + b => sum]
    }

Tables in Jnario are type safe: the type of a column will be automatically inferred to the common supertype of all cells in a column. You can easily iterate over each row in a table and write assertions by accessing the column values as variables. If you execute the example specification, it will fail with the following error:

    java.lang.AssertionError: examples failed

    | a  | b  | sum |
    | 1  | 2  | 3   |
    | 4  | 5  | 9   |
    | 10 | 11 | 20  | X (1)
    | 21 | 21 | 42  |

    (1) Expected a + b => sum but
          a + b is 21
          a is 10
          b is 11
          sum is 20

This demonstrates another advantage of Jnario: it tries to give you as much information as possible about which assertion failed, and why. A failed assertion prints the values of all evaluated sub-expressions. This means you don’t need to debug your tests any further in order to find the exact reason why an assertion failed.

These are just some examples that demonstrate the advantages of a test-centric domain-specific language. Having full control over the syntax of the language and its translation into Java code allows us to add features that are helpful when writing tests, but which would never make sense in a general purpose language7.

7 They would also be very hard to implement with a GPL and its associated tooling!
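To appreciate what the table syntax and the detailed failure messages save you from writing, consider a hand-written Java counterpart of the addition table. This is not the code Jnario generates; it is a plain-Java sketch, with the class name AdditionTable and the message layout chosen freely, in which both the iteration over the rows and the reporting of sub-expression values have to be spelled out manually.

```java
import java.util.ArrayList;
import java.util.List;

class AdditionTable {
    // Columns: a, b, sum. The third row is deliberately wrong, as in the example above.
    static final int[][] ROWS = {
        {1, 2, 3},
        {4, 5, 9},
        {10, 11, 20},
        {21, 21, 42},
    };

    /** Checks a + b == sum for each row and describes every failing row in detail. */
    static List<String> check(int[][] rows) {
        List<String> failures = new ArrayList<>();
        for (int[] row : rows) {
            int a = row[0];
            int b = row[1];
            int sum = row[2];
            if (a + b != sum) {
                // The values of all sub-expressions must be collected by hand.
                failures.add("Expected a + b => sum but a + b is " + (a + b)
                        + ", a is " + a + ", b is " + b + ", sum is " + sum);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        check(ROWS).forEach(System.out::println);
    }
}
```

Every new assertion needs its own reporting code in this style, and the column names exist only as local variables. In the DSL both come for free from the table declaration.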

19.3 Implementation

Both Jnario languages – the Feature and Spec languages – are developed with Xtext. Xtext is used for a number of reasons.

Eclipse Integration Jnario targets Java developers, which makes Eclipse the best tooling choice, since it is currently the most widely used Java IDE.

Standalone Support Although Xtext provides tight Eclipse integration, language features such as the parser and compiler are not restricted to Eclipse. This is important, as it should be possible to compile Jnario specifications with Maven or from the command line.

Resuable Expression Language Implementing a DSL with a cus- tom expression language requires a lot of effort. Xtext pro- vides Xbase, a statically-typed expression language based on Java, which can be integrated into DSLs with relatively little effort. This eliminates the need to implement a custom expression language and ensures compatibility with existing 8 As we mentioned earlier in this book, Java code8. Xbase supports extension methods, type inference, lambda expressions and other useful extensions of Java. It 19.3.1 Language Definition compiles to Java source code, and is 100% compatible with the Java type Xtext’s Xbase provides a reusable expression language that can system. be embedded into DSLs. The Xtend language, which also ships with Xtext, is a general-purpose programming language for the JVM. Xtend enriches Xbase with additional concepts, such as classes, fields and methods. In Jnario we needed similar con- cepts, which is why we decided to base the Jnario languages on Xtend. Additionally, reusing Xtend had the advantage that we could reuse a lot of the existing tooling for Jnario. In Jnario we introduced new expressions, for example more expressive assertions, which can be used in both languages.

In order to avoid reimplementing these in the feature and the specs language, we created a base language with all common features used by Jnario. The resulting language hierarchy is shown in Fig. 19.4. This example demonstrates that, by carefully modularizing language features, it is possible in Xtext to effectively reuse languages together with their tooling in different contexts. Referring back to the design part of the book, these are all examples of language extension.

Figure 19.4: Jnario language hierarchy. Xbase and Xtend ship with Xtext, and Xtend reuses the Xbase expression language by extending Xbase, and embedding its expressions into concepts like classes, fields or operations. Since Jnario Feature and Jnario Specs both use a set of common Xtend extensions, an intermediate language Jnario Base was developed that contains these common concepts. Jnario Feature and Jnario Specs then both extend Jnario Base and add their own specific language concepts.

As we mentioned earlier, Jnario has assertions with improved error messages. An example is the assert statement, which consists of the assert keyword followed by an expression evaluating to a Boolean:

    assert x != null

Adding a new expression to the Xtext base language works by overriding the corresponding grammar rules. In our example, we added the new assertion expression as a primary expression⁹:

    XPrimaryExpression returns xbase::XExpression:
        XConstructorCall |
        XBlockExpression |
        ...
        Assertion;

⁹ As we discussed in the chapter on modularity with Xtext (Chapter 16.3), this requires repeating the original rule XPrimaryExpression and specifying its new contents. In this case, we add Assertion to the list of alternatives.

The Assertion expression itself is defined as a separate rule. Notice how it again embeds concepts from Xbase, namely XExpression:

    Assertion returns xbase::XExpression:
        'assert' expression=XExpression;

Tables are another example of an extension of Xtend. Defining the grammar for a table in Xtext is pretty straightforward¹⁰. A table has a list of columns and rows. The columns and the cells in a row are separated by '|':

    Table:
        'def' name=ID '{'
            ('|' (columns+=Column)*
            (rows+=Row)*)?
        '}';

    Column:
        name=ValidID '|';

    Row:
        '|' {Row} (cells+=XExpression '|')*;

¹⁰ Note that the grammar does not actually specify that the rows and columns are laid out in the two-dimensional structure we associate with tables. However, this is not really a problem since users will naturally use that notation. Also, a formatter could be implemented to format the table correctly. In addition, an editor template could be defined to create an example table that is formatted in a meaningful way.

Cells in the table can contain arbitrary expressions. We reused the typing infrastructure provided by Xtext to calculate the type of each column in the table: a column’s type is the common supertype of all expressions in the respective column. Here is the essential part of the code:

    @Inject extension ITypeProvider
    @Inject extension TypeConformanceComputer

    def getType(ExampleColumn column) {
        val cellTypes = column.cells.map[type]
        return cellTypes.commonSuperType
    }

In the next section we will have a look at how we map these custom concepts to the generated Java classes.

19.3.2 Code Generation

Jnario specifications are compiled to plain Java JUnit4 tests. Xtext provides a framework for mapping DSL concepts to corresponding JVM concepts. An example mapping for the feature language is shown in Fig. 19.5.

Scenarios and Backgrounds¹¹ are mapped to Java classes, in which each step is mapped to a JUnit test method. The additional @Test annotations required by JUnit are added during the mapping process. The expressions can be completely transformed by the existing Xtend compiler. However, in order to support custom expressions such as assert, the Xtend compiler needs to be extended.

¹¹ Backgrounds contain steps being executed before each scenario. They are usually used to set up preconditions required by all scenarios.

Figure 19.5: Generation from Jnario Feature DSL (left column) to Java classes with JUnit tests (right column).

Consider the following example:

    val x = "hello"
    val y = "world"
    assert x + y == "hello world"

The execution of these statements will result in the following error message:

    java.lang.AssertionError: Expected x + y == "hello world" but
        x + y is "helloworld"
        x is "hello"
        y is "world"

Note that the error message contains the original failing expression x + y == "hello world" rendered as text, together with the values of all subexpressions. Creating such error messages is not possible in plain Java, as the underlying AST cannot be accessed. However, in a DSL like Jnario we can include this information when generating the Java code from our specifications¹². To do so, we subclass the XtendCompiler and add a custom generator for our assertion expression:

    @Override
    public void doInternalToJavaStatement(XExpression obj,
            ITreeAppendable appendable, boolean isReferenced) {
        if (obj instanceof Assertion) {
            // Our new concept: delegate to the custom generator below.
            toJavaStatement((Assertion) obj, appendable, isReferenced);
        } else {
            // Everything else is handled by the existing Xtend compiler.
            super.doInternalToJavaStatement(obj, appendable, isReferenced);
        }
    }

¹² This "transformation-time reflection" is a natural consequence of the fact that a generator can inspect a program’s structure (and has access to the AST) during the transformation. We came across this effect earlier when we noted that mbeddr outputs the source/location of a reported message together with the message.

The custom generator translates an assertion into the corresponding Java code:

    private void toJavaStatement(Assertion assertion, ITreeAppendable b,
            boolean isReferenced) {
        // The getter follows from the grammar feature expression=XExpression.
        XExpression expr = assertion.getExpression();
        // Compile the expression; this creates the temporary variables.
        toJavaStatement(expr, b, true);
        b.newLine();
        b.append("org.junit.Assert.assertTrue(");
        // First argument: the message, built from the serialized expression
        // and the values of its subexpressions.
        b.append("Expected ");
        b.append(serialize(expr));
        b.append(" but ");
        appendSubExpressionValues(expr, b);
        b.append(", ");
        // Second argument: the result of the asserted expression.
        toJavaExpression(expr, b);
        b.append(");");
    }

We first get the actual expression of the assertion, which is then compiled by invoking toJavaStatement(expr, ...). The Xtend compiler automatically creates a temporary variable for each subexpression¹³. The values of these variables will be used later to generate the error message. The assertion itself is mapped to the JUnit method Assert.assertTrue(message, result). The key to showing the expression in a textual form as part of the error message is that the message is built by serializing the expression as it is written in the Jnario file together with the values of the subexpressions taken from previously created temporary variables. Finally, the expression that yields the result is compiled by invoking toJavaExpression(expr, ...).

¹³ This is why the method is called toJavaStatement and not toJavaExpression.

19.3.3 Tooling

For Jnario, good tooling is essential, since the hurdles of learning a new language should be as low as possible, and being able to work efficiently is a major goal of Jnario.

The Feature language aims at being readable both for software developers and for domain experts who have no programming background. The best way to support both groups is to provide two views on a Jnario Feature file. The editor uses specialized folding to show or hide the implementation code of the scenarios and features¹⁴. Specialized folding means we used the existing folding component from Xtext and extended it to disable the normal folding functionality. We then introduced a button to trigger the folding of the implementation logic. In addition, syntax highlighting comes in handy when referring to existing steps. We used different colors to show whether a step is implemented (i.e. has its own code associated with it), not implemented (i.e. needs to be implemented before the specification can be run), or is a reference to an existing step (in which case the existing step’s implementation code is automatically reused).

¹⁴ This means that domain experts only see (and edit) the prose description of scenarios; they don’t have to care about the implementation code, and it therefore feels like working with a text editor.

To improve productivity further, editor features such as quick fixes for importing necessary libraries or for creating new class files were added. Auto completion for steps in features is supported, since you can reuse other steps by referencing them, even if they are not implemented. Another important productivity feature is the debug support of Xtext, which lets you step through your DSL file instead of using the generated code¹⁵.

¹⁵ This facility did not have to be adapted for Jnario; it worked out of the box. We discussed to some extent how Xtext DSL debugging works in Section 15.2.

19.3.4 Testing

Testing is an important part of developing a DSL. This of course applies to Jnario as well. However, Jnario is a special case, since it itself is a language for writing tests. This gave us the chance to bootstrap the implementation of Jnario¹⁶. The advantage of bootstrapping is that bugs or missing features quickly become apparent just from using the language to build tests for the language (we ate our own dog food).

¹⁶ This means that all tests for Jnario are written in Jnario.

Here is an example in which we test Jnario’s throws expression:

    Scenario: Testing exceptions with Jnario
        Given a unit spec
        '''
            describe "expecting exceptions in Jnario" {
                fact new Stack().pop throws EmptyStackException
            }
        '''
        Then it should execute successfully

When we execute this scenario, the given unit specification will first be parsed and compiled into Java source code¹⁷. The Java source code will then be compiled into a Java class file using the normal Java compiler. The generated class is then loaded via the Java classloader and executed with JUnit. This process is greatly simplified by the testing infrastructure provided by Xtext, which provides APIs for compiling and classloading DSL artifacts. The advantage of this approach is that the whole compiler chain is tested, and tests are relatively stable, as they are independent of changes in the internal implementation.

¹⁷ Note that in this approach, the Jnario code inside the triple single quotes is treated as a string and no IDE support for the Jnario language is available. This is in contrast to some of the tests described for Spoofax in Section 14.1.

19.4 Summary

This chapter introduced Jnario, a DSL for testing, which was developed based on Xtext. Jnario is a good example of a DSL that is targeted towards developers with the goal of easing the development process. It is also a good example of an Xtext-based language that makes use of Xtext’s reusable expression language Xbase. Jnario is currently used in various domains, for example to specify and test the behavior of automotive control units, web applications and Eclipse-based applications.

20 DSLs in the Implementation

In this chapter we discuss the use of DSLs in the context of software implementation, based on an extensive case study in embedded software development: the mbeddr system that has been discussed in the book before. mbeddr supports extension of C with constructs useful for embedded software. In this chapter we show how language extension can address the challenges of embedded software development, and report on our experience in building these extensions.

This section of the book is based on a paper written together with Daniel Ratiu, Bernd Kolb and Bernhard Schaetz for the SPLASH/Wavefront 2012 conference.

20.1 Introduction

In this section we discuss the use of DSLs in the context of implementation. There it is crucial that the DSLs are tightly integrated with the application code that is typically written in a GPL. Language embedding and extension (Section 4.6) are obviously useful approaches. In this chapter we discuss the mbeddr system¹ which supports domain-specific extensions to C².

¹ mbeddr.com

² mbeddr is built with MPS, so you should make sure that you have read and understood the MPS-based implementation examples in Part III of the book.

The amount of software embedded in devices is growing and the development of embedded software is challenging. In addition to functional requirements, strict operational requirements have to be fulfilled as well. These include reliability (a device may not be accessible for maintenance after deployment), safety (a system may endanger life or property if it fails), efficiency (the resources available to the system may be limited) or real-time constraints (a system may have to run on a strict schedule prescribed by the system’s environment). Addressing these challenges requires all of the following: abstraction techniques should not lead to excessive runtime overhead; programs should be easily analyzable for faults before deployment; and various kinds of annotations, for example for describing and type checking physical units, must be integrated into the code. Process issues such as requirements traceability have to be addressed, and developers face a high degree of variability, since embedded systems are often developed in the context of product lines³.

³ We discuss product lines and DSLs specifically in Chapter 21.

Current approaches for embedded software development can be divided roughly into programming and modeling. The programming approach mostly relies on C, sometimes C++, and Ada in rare cases⁴. However, because of C’s limited support for defining custom abstractions, this can lead to software that is hard to understand, maintain and extend. Furthermore, C’s ability to work with very low-level abstractions, such as pointers, makes C code very expensive to analyze statically⁵. The alternative approach uses modeling tools with automatic code generation. The modeling tools provide predefined, higher-level abstractions such as state machines or data flow component diagrams⁶. Example tools include ASCET-SD⁷ or Simulink⁸. Using higher-level abstractions leads to more concise programs and simplified fault detection using static analysis and model checking (for example using the Simulink Design Verifier⁹). Increasingly, DSLs are used for embedded software, and studies show that DSLs substantially increase productivity in embedded software development. However, most real-world systems cannot be described completely and adequately with a single modeling tool or DSL, and the integration effort between manually written C code and perhaps several modeling tools and DSLs becomes significant.

⁴ Ada is used mainly in safety-critical systems, and traditionally in avionics, space and defense. This has historic reasons, but is also driven by the fact that some Ada compilers are certified to very high levels of reliability.

⁵ C is very good for developing low-level aspects of software systems. It is not so good for large-scale software engineering.

⁶ A big problem with the vast majority of the tools in this space is that they cannot be extended in meaningful ways; they are hard to adapt to a particular domain, platform or system context.

⁷ www.etas.com/

A promising solution to this dilemma lies in much tighter integration between low-level C code and higher-level abstractions specific to embedded software. We achieve this with an extensible C programming language. The advantages of C can be maintained: existing legacy code can be easily integrated, reused, and evolved, and the need for efficient code is immediately addressed by relying on C’s low-level programming concepts. At the same time, domain-specific extensions such as state machines, components or data types with physical units can be made available as C extensions. This improves productivity via more concise programs, it helps improve quality in a constructive way by avoiding low-level implementation errors up front, and leads to system implementations that are more amenable to analysis. By directly embedding the extensions into C, the mismatch and integration challenge between domain-specific models and general-purpose code can be removed. An industry-strength implementation of this approach must also include IDE support for C and the extensions: syntax highlighting, code completion, error checking, refactoring and debugging.

The LW-ES research project, run by itemis AG, fortiss GmbH, BMW Car IT and Sick AG, explores the benefits of language engineering in the context of embedded software development with the mbeddr system¹⁰.

¹⁰ The code is open source and available via mbeddr.com.

20.2 Challenges in Embedded Software

In this section we discuss a set of challenges we address with the mbeddr approach. We label the challenges Cn so we can refer to them from Section 20.3.2, where we show how they are addressed by mbeddr¹¹.

¹¹ While these are certainly not all the challenges embedded software developers face, based on our experience with embedded software and feedback from various domains (automotive, sensors, automation) and organizations (small, medium and large companies), these are among the most important ones.

C1: Abstraction without Runtime Cost Domain-specific concepts provide more abstract descriptions of the system under development. Examples include data flow blocks, state machines, or data types with physical units. On the one hand, adequate abstractions have a higher expressive power that leads to programs that are shorter and easier to understand and maintain. On the other hand, by restricting the freedom of programmers, domain-specific abstractions also enable constructive quality assurance. For embedded systems, where runtime efficiency is a prime concern, abstraction mechanisms are needed that can be resolved before or during compilation, and not at runtime.

C2: C considered Unsafe While C is efficient and flexible, several of C’s features are often considered unsafe¹². Consequently, the unsafe features of C are prohibited in many organizations. Standards for automotive software development such as MISRA limit C to a safer language subset. However, most C IDEs are not aware of these and other, organization-specific restrictions, so they are enforced with separate checkers that are often not well integrated with the IDE. This makes it hard for developers to comply with these restrictions efficiently.

¹² For example, unconstrained casting via void pointers, using ints as Booleans, or the weak typing implied by unions can result in runtime errors that are hard to track down.

C3: Program Annotations For reasons such as safety or efficiency, embedded systems often require additional data to be associated with program elements. Examples include physical units, coordinate systems, data encodings or value ranges for variables. These annotations are typically used by specific, often custom-built analysis or generation tools. Since C programs can only capture such data informally as comments or pragmas, the C type system and IDE cannot check their correct use in C programs. They may also be stored separately (for example, in XML files) and linked back to the program using names or other weak links¹³.

¹³ Even with tool support that checks the consistency of these links and helps navigate between code and this additional data, the separation of core functionality and the additional data leads to unnecessary complexity and maintainability problems.

C4: Static Checks and Verification Embedded systems often have to fulfill strict safety requirements. Industry standards for safety such as ISO-26262, DO-178B or IEC-61508 demand that for high safety certification levels various forms of static analyses are performed on the software. These range from simple type checks to sophisticated property checks, for example by model checking. Since C is a very flexible and relatively weakly-typed language, the more sophisticated analyses are very expensive. Using suitable domain-specific abstractions (for example, state machines) leads to programs that can be analyzed much more easily.

C5: Process Support There are at least two cross-cutting and process-related concerns relevant to embedded software development. First, many certification standards (such as those mentioned above) require that code be explicitly linked to requirements such that full traceability is available. Today, requirements are often managed in external tools, and maintaining traceability to the code is a burden to the developers and often done in an ad hoc way, for example via comments. Second, many embedded systems are developed as part of product lines with many distinct product variants, where each variant consists of a subset of the (parts of) artifacts that comprise the product line. This variability is usually captured in constraints expressed over program parts such as statements, functions or states. Most existing tools come with their own variation mechanism, if variability is supported at all. Integration between program parts, the constraints and the variant configuration (for example via feature models) is often done through weak links, and with little awareness of the semantics of the underlying language¹⁴. As a consequence, variant management is a huge source of accidental complexity.

¹⁴ For example, the C preprocessor, which is often used for this task, performs simple text replacement or removal controlled by the conditions in #ifdefs.

An additional concern is tool integration. The diverse requirements and limitations of C discussed so far often lead to the use of a wide variety of tools in a single development project. Most commercial off-the-shelf (COTS) tools are not open enough to facilitate seamless and semantically meaningful integration with other tools, leading to significant accidental tool integration complexity. COTS tools often also do not support meaningful language extension, severely limiting the ability to define and use custom domain-specific abstractions.
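To make the preprocessor-based variability mentioned under C5 more concrete, here is a minimal sketch in plain C. The feature macro FEATURE_CHECKSUM and the function frame_size are hypothetical names chosen for illustration only; they do not come from mbeddr or from any particular product line.

```c
#include <stddef.h>

/* Hypothetical feature switch. A variant is selected by defining (or not
   defining) this macro, typically on the compiler command line. */
#define FEATURE_CHECKSUM

/* Size of a message frame: the payload plus an optional checksum byte. */
size_t frame_size(size_t payload_len) {
    size_t size = payload_len;
#ifdef FEATURE_CHECKSUM
    /* This line is removed from the program text if the macro is undefined. */
    size += 1;
#endif
    return size;
}
```

With the macro defined as above, frame_size(4) yields 5. The preprocessor treats the guarded region purely as text: it has no notion that the region is a statement, and would remove half an expression just as readily. This is the lack of semantic awareness that the discussion of variant management refers to.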

20.3 The mbeddr Approach

Language engineering provides a holistic approach to solving these challenges. In this section we illustrate how mbeddr addresses the challenges with an extensible version of the C programming language, growing a stack of language extensions (see Fig. 20.2, and Section 4.6.2 for a discussion of language extension). The following section explores which ways Wm of extending C are necessary to address the challenges Cn. Section 20.3.2 then shows examples that address each of the challenges and ways of extending C.

The semantics of an extension are typically defined by a transformation back to the base language. For example, in an extension that provides state machines, these may be transformed to a switch/case-based implementation in C. Extensions can be stacked (Fig. 20.2), where a higher-level extension extends (and transforms back to) a lower-level extension instead of C. At the bottom of this stack resides plain C in text form, and a suitable compiler. Fig. 20.1 shows an example in which a module containing a component that contains a state machine is transformed to C, and then compiled.

Figure 20.1: Higher-level abstractions such as state machines or components are reduced incrementally to their lower-level equivalent, reusing the transformations built for lower-level extensions.

As we have seen in Section 16.2, MPS supports modular language extension, as well as the use of independently developed language extensions in the same system. For example, in mbeddr a user can include an extension that provides state machines and an extension that provides physical units in the same program without first defining a combined language statemachine-with-units. This is very useful, because it addresses real-world constraints: a set of organizations, such as the departments in a large company, will probably not agree on a single set of extensions to C, since they typically work in slightly different areas. Also, a language that contains all relevant abstractions would become big and unwieldy. Modular language extension solves these problems.
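The reduction of a state machine to a switch/case-based implementation mentioned above might look roughly like the following hand-written sketch. The state names, the event names and the functions sm_init and sm_trigger are hypothetical; the code actually generated by mbeddr will differ in its details.

```c
/* Hypothetical states and events of a small flight-tracking state machine. */
typedef enum { STATE_GROUND, STATE_AIRBORNE, STATE_LANDED } State;
typedef enum { EVENT_TAKEOFF, EVENT_TOUCHDOWN, EVENT_RESET } Event;

typedef struct {
    State current;  /* the currently active state */
    int   flights;  /* a variable local to the state machine */
} StateMachine;

void sm_init(StateMachine *sm) {
    sm->current = STATE_GROUND;
    sm->flights = 0;
}

/* Feeds one event into the state machine. Events that have no transition
   in the current state are ignored. */
void sm_trigger(StateMachine *sm, Event e) {
    switch (sm->current) {
    case STATE_GROUND:
        if (e == EVENT_TAKEOFF) { sm->current = STATE_AIRBORNE; }
        break;
    case STATE_AIRBORNE:
        if (e == EVENT_TOUCHDOWN) {
            sm->flights++;               /* transition action */
            sm->current = STATE_LANDED;
        }
        break;
    case STATE_LANDED:
        if (e == EVENT_RESET) { sm->current = STATE_GROUND; }
        break;
    }
}
```

The abstraction is resolved entirely at generation time: what remains at runtime is an enum, a struct and one switch statement, which is the kind of low-overhead result that embedded targets require.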

Figure 20.2: Based on MPS, mbeddr comes with an implementation of the C programming language. On top of C, mbeddr defines a set of default extensions (white boxes) stacked on top of each other. Users can use them in their programs, but they don’t have to. Support for requirements traceability and product line variability is cross-cutting. Users build their own extensions on top of C or on top of the default extensions. (Component/state machine integration and state machine tests are not discussed in this chapter.)

20.3.1 Ways of Extending C

In this section we discuss in which particular ways C needs to be extensible to address the challenges discussed above¹⁵.

¹⁵ Section 20.3.2 shows examples for each of these.

W1: Top-Level Constructs New top-level constructs (on the level of functions or struct declarations) are necessary. This enables the integration of test cases or new programming paradigms relevant in particular domains, such as state machines, or interfaces and components.

W2: Statements New statements, such as assert or fail statements in test cases, must be supported¹⁶. Statements may have to be restricted to a specific context; for example, assert or fail statements must only be used in test cases and not in any other statement list.

¹⁶ If statements introduce new blocks, then variable visibility and shadowing must be handled correctly, just as in regular C.

W3: Expressions New kinds of expressions must be supported. An example is a decision table expression that represents a two-level decision tree as a two-dimensional table (Fig. 20.4).

W4: Types and Literals New types, e.g., for matrices, complex numbers or quantities with physical units, must be supported. This also requires the definition of new operators, and overriding the typing rules for existing ones. New literals may also be required: for example, physical units could be attached to number literals (as in 10kg).

W5: Transformation Alternative transformations for existing language concepts must be possible. For example, in a module marked as safe, the expression x + y may have to be translated into an invocation of addWithBoundsCheck(x, y), an inline function that performs bounds checking as well as the addition.

W6: Meta Data Decoration It should be possible to add meta data, such as trace links to requirements or product line variability constraints, to arbitrary program nodes, without changing the concept of the node.

W7: Restriction It should be possible to define contexts that restrict the use of specific language concepts¹⁷. For example, the use of pointer arithmetic should be prohibited in modules marked as safe, or the use of real numbers should be prohibited in state machines that are intended to be model checked (model checkers do not support real numbers).

¹⁷ Like any other extension, such contexts must be definable after the original language has been implemented, without invasive change.
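As an illustration of the alternative transformation described under W5, the following is a minimal sketch of what a function like addWithBoundsCheck could look like for 32-bit integers. The text does not say how a violation is handled, so the error flag bounds_error and the saturating result are assumptions made purely for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical error flag; a real system would choose its own reaction
   to a bounds violation. */
static bool bounds_error = false;

/* Adds two 32-bit integers. The bounds are checked before the addition,
   because signed overflow is undefined behavior in C. On a violation the
   flag is set and the result saturates (an assumption of this sketch). */
static inline int32_t addWithBoundsCheck(int32_t x, int32_t y) {
    if (y > 0 && x > INT32_MAX - y) {
        bounds_error = true;
        return INT32_MAX;
    }
    if (y < 0 && x < INT32_MIN - y) {
        bounds_error = true;
        return INT32_MIN;
    }
    return x + y;
}
```

In a module marked as safe, a generator would emit addWithBoundsCheck(x, y) wherever the programmer wrote x + y, so the source code stays readable while the generated code carries the check.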

20.3.2 Extensions Addressing the Challenges

In this section we present example extensions that illustrate how we address the challenges discussed in Section 20.2. We show at least one example for each challenge¹⁸. The table below shows an overview of the challenges, the examples in this section, and the ways of extension each example makes use of¹⁹.

¹⁸ How such extensions are built will be discussed in Section 20.4.

¹⁹ This is not the full set of extensions available in mbeddr. The mbeddr user guide contains the full description.

    Challenge                           Example Extensions
    C1 (Low-Overhead Abstraction)       State Machines (W1, W2)
                                        Components (W1)
                                        Decision Tables (W3)
    C2 (Safer C)                        Cleaned up C (W7)
                                        Safe Modules (W5, W7)
    C3 (Annotations)                    Physical Units (W4)
    C4 (Static Checks, Verification)    Unit Tests (W1, W2)
                                        State Machines (W1, W2)
                                        Safe Modules (W2, W5, W7)
    C5 (Process Support)                Requirements Traceability (W6)
                                        Product Line Variability (W6)

A Cleaned-Up C (addresses C2, uses W7) To make C extensible, we first had to implement C in MPS. This entails the definition of the language structure, syntax and type system²⁰. In the process we changed some aspects of C. Some of these changes are the first step in providing a safer C (challenge C2). Other changes were implemented because they are more convenient for the user, or because they simplified the implementation of the language in MPS. Out of eight changes in total, four are for reasons of improved robustness and analyzability, two are for end-user convenience and three are to simplify the implementation in MPS (one change falls into two categories). We discuss some of them below, and the table below shows a summary.

²⁰ A generator to C text is also required, so that the code can be fed into an existing compiler. However, since this generator merely renders the tree as text, with no structural differences, this generator is trivial: we do not discuss it any further.

    Difference                                        Reason
    No Preprocessor                                   Robustness
    Native Booleans (and a cast operator              Robustness
      for legacy interop)
    enums are not ints (special operators             Robustness
      for next/previous)
    C99 Integral Types Required                       Robustness
    Modules instead of Headers                        End-User Convenience
    hex<..>, oct<..>, bin<..> instead of              Simplified Implementation
      0x.. and 0..
    Type annotation on type (int[] a                  Simplified Implementation
      instead of int a[])
    Cleaned up syntax for function types              End-User Convenience,
      and function pointers                           Simplified Implementation

mbeddr C provides modules (Fig. 20.3). A module contains the top-level C constructs (such as structs, functions or global variables). These module contents can be exported. Modules can import other modules, in which case they can access the exported contents of the imported modules. While header files are generated, we do not expose them to the user: modules provide a more convenient means of modularizing programs and limiting which elements are visible globally.

Figure 20.3: Modules are the top-level container in mbeddr C. They can import other modules, whose exported contents they can then use. Exported contents are put into the header files generated from modules.

mbeddr C does not support the preprocessor. Empirical studies²¹ show that it is often used to emulate missing features of C in an ad hoc way, leading to problems regarding maintenance and analyzability. Instead, mbeddr C provides first-class support for the most important use cases of the preprocessor. Examples include the modules mentioned above (replacing #include), as well as the support for variability discussed below (replacing #ifdefs). Instead of defining macros, users can create first-class language extensions, including type checks and IDE support. Removing the preprocessor and providing specific support for its important use cases goes a long way in creating more maintainable and more analyzable programs. The same is true for introducing a separate boolean type and not interpreting integers as Booleans by default (an explicit cast operator is available).

²¹ M. D. Ernst, G. J. Badros, and D. Notkin. An empirical analysis of C preprocessor use. IEEE Trans. Softw. Eng., 28, December 2002

Type decorations, such as array brackets or the pointer asterisk, must be specified on the type, not on the identifier (int[] a; instead of int a[];). This has been done for reasons of consistency and to simplify the implementation in MPS: it is the property of a type to be an array type or a pointer type, not the property of an identifier. Identifiers are just names.

Decision Tables (addresses C1, uses W3) Decision tables are a new kind of expression, i.e. they can be evaluated. An example is shown in Fig. 20.4. A decision table represents nested if statements. It is evaluated to the value of the first cell whose column and row headers are true (the evaluation order is left to right, top to bottom). A default value (FAIL) is specified to handle the case in which none of the column/row header combinations is true. Since the compiler and IDE have to compute a type for expressions, the decision table specifies the type of its result values explicitly (int8).

Figure 20.4: A decision table evaluates to the value in the cell for which the row and column headers are true, a default value otherwise (FAIL in the example). By default, a decision table is translated to nested ifs in a separate function. The figure shows the translation for the common case where a decision table is used in a return. This case is optimized to not use the indirection of an extra function.
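The translation of a decision table into nested ifs can be sketched by hand as follows. The table contents, the function name score and the numeric value chosen for FAIL are all hypothetical; they are not the contents of Fig. 20.4.

```c
#include <stdint.h>

#define FAIL ((int8_t)-1)  /* hypothetical default value */

/* Hypothetical decision table over two inputs:
 *
 *                 | speed < 30 | speed > 60
 *   alt < 1000    |     0      |    10
 *   alt > 2000    |     5      |    20
 *
 * Rows are tested top to bottom and, within a row, columns left to right,
 * which matches the evaluation order described in the text.
 */
int8_t score(int16_t alt, int16_t speed) {
    if (alt < 1000) {
        if (speed < 30) { return 0; }
        if (speed > 60) { return 10; }
    }
    if (alt > 2000) {
        if (speed < 30) { return 5; }
        if (speed > 60) { return 20; }
    }
    return FAIL;  /* no row/column header combination was true */
}
```

Because the headers in this sketch are deliberately not exhaustive, some inputs (for example alt = 1500) fall through to the default, which is exactly the case the default value exists to handle.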

Unit Tests (addresses C4, uses W1, W2)
Unit tests are new top-level constructs (Fig. 20.5) introduced in a separate unittest language that extends the C core. They are like void functions without arguments. The unittest language also introduces assert and fail statements, which can only be used inside test cases. Testing embedded software can be a challenge, and the unittest extension is an initial step towards providing comprehensive support for testing.

Figure 20.5: The unittest language introduces test cases as well as assert and fail statements which can only be used inside of a test case. Test cases are transformed to functions, and the assert statements become if statements with a negated condition. The generated code also counts the number of failures so that it can be reported to the user via a binary’s exit value.

Components (addresses C1, uses W1)
Components are new top-level constructs that support modularization, encapsulation and the separation between specification and implementation (Fig. 20.6). In contrast to modules, a component uses interfaces and ports to declare the contract it obeys. Interfaces define operation signatures and optional pre- and post-conditions (not shown in the example). Provided ports declare the interfaces offered by a component; required ports specify the interfaces a component expects to use. Different components can implement the same interface differently. Components can be instantiated (also in contrast to modules), and each instance’s required ports have to be connected to compatible provided ports of other component instances. Polymorphic invocations (different components "behind" the same interface) are supported.
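The idea of several components sitting "behind" one interface can be sketched in plain C with a table of function pointers. This is a minimal illustration under our own assumptions, not the code mbeddr generates: the Counter interface, the two implementations and the Client with its required port are all hypothetical names.

```c
#include <stdint.h>

/* Hypothetical interface with a single operation. */
typedef struct Counter Counter;

typedef struct {
    int32_t (*next)(Counter *self);
} CounterOps;

/* A component instance: its implementation table plus instance data. */
struct Counter {
    const CounterOps *ops;
    int32_t state;
};

/* Two components that provide the same interface differently. */
static int32_t step_by_one(Counter *self)
{
    self->state += 1;
    return self->state;
}

static int32_t step_by_two(Counter *self)
{
    self->state += 2;
    return self->state;
}

static const CounterOps BY_ONE = { step_by_one };
static const CounterOps BY_TWO = { step_by_two };

/* A client component whose required port is wired to some Counter. */
typedef struct {
    Counter *counter;
} Client;

/* Polymorphic invocation: the client does not know which component
   is connected to its required port. */
static int32_t client_tick(Client *client)
{
    return client->counter->ops->next(client->counter);
}
```

Rewiring the required port (assigning a different Counter instance to client.counter) changes the behavior of client_tick without touching the client code.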

Figure 20.6: Two components providing the same interface. The arrow maps operations from provided ports to implementations. Indirection through function pointers allows different implementations for a single interface, enabling OO-like polymorphic invocations.

State Machines (addresses C1, C4, uses W1, W2)
State machines provide a new top-level construct (the state machine itself), as well as a trigger statement to send events into state machines (see Fig. 20.7). State machines are transformed into a switch/case-based implementation in the C program. Entry, exit and transition actions may only access variables defined locally in state machines and fire out events. Out events may optionally be mapped to functions in the surrounding C program, where arbitrary behavior can be implemented. In this way state machines are semantically isolated from the rest of the code, enabling them to be model checked: if a state machine is marked as verifiable, we also generate a representation of the state machine in the input language of the NuSMV model checker22, including a set of property specifications that are verified by default. Examples include dead state detection, dead transition detection, non-determinism and variable bounds checks. In addition, users can specify additional high-level properties based on the well-established catalog of temporal logic property patterns23. The state machines extension also supports hierarchical states as a further means of decomposing complex behavior.

22 nusmv.fbk.eu

23 M. B. Dwyer, G. S. Avrunin, and J. C. Corbett. Patterns in property specifications for finite-state verification. In ICSE, 1999
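A switch/case-based state machine of the kind described above might look roughly as follows in C. This is a hand-written sketch, not generator output: the states, events, the ticks variable and the on_stopped binding are hypothetical, and record_stop exists only so that the example can observe the out event.

```c
#include <stddef.h>
#include <stdint.h>

/* Enums for states and events (names are hypothetical). */
typedef enum { STATE_IDLE, STATE_RUNNING } State;
typedef enum { EVENT_START, EVENT_TICK, EVENT_STOP } Event;

/* One state machine instance: current state, a local variable and
   an optional binding of the "stopped" out event to a C function. */
typedef struct {
    State current;
    int32_t ticks;
    void (*on_stopped)(int32_t ticks);
} Machine;

static void machine_init(Machine *m, void (*on_stopped)(int32_t))
{
    m->current = STATE_IDLE;
    m->ticks = 0;
    m->on_stopped = on_stopped;
}

/* Counterpart of the trigger statement: inject an event. */
static void machine_trigger(Machine *m, Event event)
{
    switch (m->current) {
    case STATE_IDLE:
        switch (event) {
        case EVENT_START:
            m->ticks = 0;
            m->current = STATE_RUNNING;
            break;
        default:
            break;                  /* event ignored in this state */
        }
        break;
    case STATE_RUNNING:
        switch (event) {
        case EVENT_TICK:
            m->ticks++;
            break;
        case EVENT_STOP:
            if (m->on_stopped != NULL) {
                m->on_stopped(m->ticks);   /* fire the out event */
            }
            m->current = STATE_IDLE;
            break;
        default:
            break;
        }
        break;
    }
}

/* Demo binding that records the last value reported by the out event. */
static int32_t last_reported_ticks = -1;

static void record_stop(int32_t ticks)
{
    last_reported_ticks = ticks;
}
```

Note that the actions only touch the machine's own variables and communicate with the surrounding program solely through the bound function, which mirrors the isolation that makes model checking feasible.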

Figure 20.7: A state machine is embedded in a C module as a top-level construct. It declares in and out events, as well as local variables, states and transitions. Transitions react to in events, and out events can be fired in actions. Through bindings (e.g., tickHandler), state machines interact with C code. State machines can be instantiated. They are transformed to enums for states and events, and a function that executes the state machine using switch statements. The trigger statement injects events into a state machine instance by calling the state machine function.

Physical Units (addresses C3, uses W4)
Physical units are new types that specify a physical unit in addition to the data type (see Fig. 20.8). New literals support the specification of values for those types that include the physical unit. The typing rules for the existing operators (+, * or >) are overridden to perform the correct type checks for types with units. The type system also performs unit computations (as in speed = length/time).

Requirements Traces (addresses C5, uses W6)
Requirements traces are meta data annotations that link a program element to requirements, essentially elements in other models imported from requirements management tools24. Requirements traces can be attached to any program element without that element’s definition having to be aware of this (see green (gray in print) highlights in Fig. 20.9 and in Fig. 20.22).

24 We discussed traces in Chapter 17. We also briefly introduced mbeddr’s approach to traces there.

Figure 20.8: The units extension ships with the SI base units. Users can define derived units (such as the mps in the example), as well as convertible units that require a numeric conversion for mapping back to SI units. Type checks ensure that the values associated with unit literals use the correct unit and perform unit computations (as in speed equals length divided by time). Errors are reported if incompatible units are used together (e.g., if we were to add length and time). To support this feature, the typing rules for the existing operators (such as + or /) have to be overridden.

Presence Conditions (addresses C5, uses W6)
A presence condition determines whether the program element to which it is attached is part of a product in the product line25. A product is configured by specifying a set of configuration flags (expressed via feature models), and the presence condition specifies a Boolean expression over these configuration switches. Like requirements traces, presence conditions can be attached to any program element26. Upon transformation, program elements whose presence condition evaluates to false for the selected product configuration are simply removed from the program (and hence will not end up in the generated binary). This program customization can also be performed by the editor, effectively supporting variant-specific editing.

25 For details about DSLs and product lines, see Chapter 21.

26 For example, in Fig. 20.9, the resetted out event and the on start... transition in the second state have the resettable presence condition, where resettable is a reference to a configuration flag.
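The removal of program elements whose presence condition is false can be illustrated with a small filter over a list of elements. The sketch below is our own simplification in C, not how MPS transformations are written: Config, Element, the two conditions and select_elements are hypothetical; only the flag name resettable and the element name resetted are borrowed from the example in Fig. 20.9.

```c
#include <stdbool.h>
#include <stddef.h>

/* A product configuration: one Boolean per configuration flag. */
typedef struct {
    bool resettable;
    bool logging;      /* hypothetical second flag */
} Config;

/* A presence condition is a Boolean expression over the flags. */
typedef bool (*PresenceCondition)(const Config *config);

/* A program element with an optional presence condition;
   NULL means the element is part of every product. */
typedef struct {
    const char *name;
    PresenceCondition presence;
} Element;

static bool if_resettable(const Config *config)
{
    return config->resettable;
}

static bool if_resettable_and_logging(const Config *config)
{
    return config->resettable && config->logging;
}

/* Keeps only the elements present in the given product. Writes
   pointers to the kept elements into out and returns their count. */
static size_t select_elements(const Element *elements, size_t count,
                              const Config *config, const Element **out)
{
    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if (elements[i].presence == NULL || elements[i].presence(config)) {
            out[kept++] = &elements[i];
        }
    }
    return kept;
}
```

Running the same filter inside the editor instead of the generator is what yields variant-specific editing: the user sees only the elements that belong to the selected product.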

Safe Modules (addresses C2, uses W5, W7)
Safe modules help prevent writing risky code. For example, runtime range checking is performed for arithmetic expressions and assignments. To enable this, arithmetic expressions are replaced by function calls that perform range checking and report errors if an overflow is detected. As another example, safe modules also provide the safeheap statement, which automatically frees dynamic variables allocated inside its body (see Fig. 20.13).

20.3.3 Addressing the Tool Integration Challenge

By building all languages (C, its extensions or any other DSLs) on top of MPS, the tool integration challenge is completely solved. All languages get an MPS-style IDE, including syntax highlighting, code completion, static error checking and annotation, quick fixes and refactorings, as well as a debugger (for details see Section 15.2.5). Fig. 20.9 shows a screenshot of the tool, as we edit a module with a decision table, a state machine, requirements traces and presence conditions.

Figure 20.9: A somewhat overloaded example program in the mbeddr IDE (an instance of MPS). The module contains an enum, a decision table and a state machine. Requirements traces are attached to the table and the step in-event, and a presence condition is attached to an out-event and a transition.

20.4 Design and Implementation

This section discusses the implementation of mbeddr language extensions. We briefly discuss the structure of the C core language. The main part of this section discusses each of the ways Wm of extending C based on the extensions discussed in the previous section27.

27 We expect you to have read the MPS examples in Part III of the book.

20.4.1 The mbeddr Core Languages

C can be partitioned into expressions, statements, functions, etc. We have factored these parts into separate language modules to make each of them reusable without pulling in all of C. The expressions language28 is the most fundamental language. It depends on no other language and defines the primitive types, the corresponding literals and the basic operators. Support for pointers and user defined data types (enum, struct, union) is factored into the pointers and udt languages respectively. The statements language contains the procedural part of C, and the modules language covers modularization. Fig. 20.10 shows an overview of some of the languages and constructs.

28 The language is actually called com.mbeddr.core.expressions. We won’t repeat the prefix in this chapter.

Figure 20.10: Anatomy of the mbeddr language stack: the diagram shows some of the language concepts, their relationships and the languages that contain them.

20.4.2 Addressing W1 (Top-Level Constructs): Test Cases

In this section we illustrate the implementation of the test case construct, as well as of the assert and fail statements available inside test cases.

Structure
Modules own a collection of IModuleContents, an interface that defines the properties of everything that can reside directly in a module. All top-level constructs such as Functions implement IModuleContent. IModuleContent extends MPS’ IIdentifierNamedConcept interface, which provides a name property. IModuleContent also defines a Boolean property exported that determines whether the respective module content is visible to modules that import this module29. Since the IModuleContent interface can also be implemented by concepts in other languages, new top-level constructs such as the TestCase in the unittest language can implement this interface, as long as the respective language has a dependency on the modules language, which defines IModuleContent. The class diagram in Fig. 20.10 shows some of the relevant concepts and languages.

29 This property is queried by the scoping rules that determine which elements can be referenced.

Constraints
A test case contains a StatementList, so any C statement can be used in a test case. StatementList becomes available to the unittest language through its dependency on the statements language. unittest also defines new statements: assert and fail. They extend the abstract Statement concept defined in the statements language. This makes them valid in any statement list, for example in a function body. This is undesirable, since the transformation of asserts into C depends on them being used in a TestCase. To enforce this, a can be child constraint is defined (Fig. 20.11).

    concept constraints AssertStatement {
      can be child
        (context, scope, parentNode, link, childConcept)->boolean {
          parentNode.ancestor<concept = TestCase>.isNotNull;
        }
    }

Figure 20.11: This constraint restricts an AssertStatement to be used only inside a TestCase by checking that at least one of its ancestors is a TestCase.

Transformation
The new language concepts in unittest are reduced to C concepts: the TestCase is transformed to a void function without arguments, and the assert statement is transformed into a report statement defined in the logging language. The report statement, in turn, is transformed into a platform-specific way of reporting an error (console, serial line or error memory). Fig. 20.12 shows an example of this two-step process.
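The end result of this reduction, including the failure counting mentioned in the caption of Fig. 20.5, could look roughly like the following plain C. This is a sketch under our own assumptions: the global failure_count, the deliberately failing test_broken and the run_all_tests driver are hypothetical, and the source does not say how the generated code actually keeps its count.

```c
#include <stdio.h>

/* Hypothetical global failure counter shared by all test functions. */
static int failure_count = 0;

/* Function under test. */
static int add(int a, int b)
{
    return a + b;
}

/* A test case reduced to a void function without arguments.
   The assert has become an if with a negated condition. */
static void test_exTest(void)
{
    int x = add(2, 2);
    if (!(x == 4)) {
        printf("fail:0\n");
        failure_count++;
    }
}

/* A second, deliberately failing test to show the counter at work. */
static void test_broken(void)
{
    int x = add(2, 2);
    if (!(x == 5)) {
        printf("fail:1\n");
        failure_count++;
    }
}

/* Runs all tests and returns the number of failed assertions. */
static int run_all_tests(void)
{
    failure_count = 0;
    test_exTest();
    test_broken();
    return failure_count;
}
```

A main function that simply returns run_all_tests() would expose the number of failures as the exit value of the binary, so a build script can detect failing tests without parsing any output.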

    test case exTest {
      int x = add(2, 2);
      assert(0) x == 4;
    }

    void test_exTest {
      int x = add(2, 2);
      report test.FAIL(0) on !(x == 4);
    }

    void test_exTest {
      int x = add(2, 2);
      if (!(x == 4)) {
        printf("fail:0");
      }
    }

Figure 20.12: Two-stage transformation of TestCases. The TestCase is transformed into a C function using the logging framework to output error messages. The report statement is in turn transformed into a printf statement if we generate for the Windows/Mac environment. It would be transformed to something else if we generated for the actual target device (configured by the user in the build configuration).

20.4.3 Addressing W2 (Statements): Safeheap Statement

We have seen the basics of integrating new statements in the previous section where assert and fail extended the Statement concept inherited from the C core languages. In this section we focus on statements that need to handle local variable scopes and visibilities. We implement the safeheap statement mentioned earlier (see Fig. 20.13), which automatically frees dynamically allocated memory. The variables introduced by the safeheap statement must only be visible inside its body, and have to shadow variables of the same name declared in outer scopes (such as the a declared in the second line of the measure function in Fig. 20.13).

Structure
The safeheap statement extends Statement. It contains a StatementList as its body, as well as a list of SafeHeapVars. These extend LocalVarDecl, so they fit with the existing mechanism for handling variable shadowing (explained below).

Figure 20.13: A safeheap statement declares heap variables which can only be used inside the body of the statement. When the body is left, the memory is automatically freed. Notice also how we report an error if the variable tries to escape.

Behavior
LocalVarRefs are expressions that reference a LocalVarDecl. A scope constraint determines the set of visible variables for a given LocalVarRef. We implement this constraint by plugging into mbeddr’s generic local variable scoping mechanism using the following approach. The constraint ascends the containment tree until it finds a node which implements ILocalVarScopeProvider, and calls its getLocalVarScope method. A LocalVarScope has a reference to an outer scope, which is set by finding its ILocalVarScopeProvider ancestor, effectively building a hierarchy of LocalVarScopes. To get at the list of the visible variables, the LocalVarRef scope constraint calls the getVisibleLocalVars method on the innermost LocalVarScope object. This method returns a flat list of LocalVarDecls, taking into account that variables owned by a LocalVarScope that is lower in the hierarchy shadow variables of the same name from a higher level in the hierarchy. So, to plug the SafeHeapStatement into this mechanism, it has to implement ILocalVarScopeProvider and implement the two methods shown in Fig. 20.14.
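The shadowing behavior of a scope hierarchy can be sketched independently of MPS. The following C fragment is a simplified model under our own assumptions, not mbeddr's implementation: Scope, VarRef and visible_vars are hypothetical, variables are reduced to their names, and visible_vars plays the role that the text assigns to getVisibleLocalVars.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { MAX_SCOPE_VARS = 8 };

/* A scope owns some variable names and points to its outer scope. */
typedef struct Scope {
    const char *names[MAX_SCOPE_VARS];
    size_t count;
    const struct Scope *outer;      /* NULL for the outermost scope */
} Scope;

/* A visible variable together with the scope that declares it. */
typedef struct {
    const char *name;
    const Scope *owner;
} VarRef;

/* Flattens the scope hierarchy, starting at the innermost scope.
   A name found in an inner scope shadows the same name further out.
   Writes at most capacity entries and returns how many were written. */
static size_t visible_vars(const Scope *innermost, VarRef *out,
                           size_t capacity)
{
    size_t n = 0;
    for (const Scope *s = innermost; s != NULL; s = s->outer) {
        for (size_t i = 0; i < s->count; i++) {
            bool shadowed = false;
            for (size_t j = 0; j < n; j++) {
                if (strcmp(out[j].name, s->names[i]) == 0) {
                    shadowed = true;
                    break;
                }
            }
            if (!shadowed && n < capacity) {
                out[n].name = s->names[i];
                out[n].owner = s;
                n++;
            }
        }
    }
    return n;
}
```

With a function scope declaring a and b and a nested safeheap scope declaring its own a, the flat list contains the inner a and the outer b, which is the shadowing required for the safeheap statement.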

    public LocalVarScope getLocalVarScope(node<> ctx, int stmtIdx) {
      LocalVarScope scope = new LocalVarScope(getContainedLocalVariables());
      node<ILocalVarScopeProvider> outer = this.ancestor<ILocalVarScopeProvider>;
      if (outer != null) {
        scope.setOuterScope(outer.getLocalVarScope(this, this.index));
      }
      return scope;
    }
    public sequence<node<LocalVarDecl>> getContainedLocalVariables() {
      this.vars;
    }

Figure 20.14: A safeheap statement implements the two methods declared by the ILocalVarScopeProvider interface. getContainedLocalVariables returns the LocalVarDecls that are declared between the parentheses (see Fig. 20.13). getLocalVarScope constructs a scope that contains these variables and then builds the hierarchy of outer scopes by relying on its ancestors that also implement ILocalVarScopeProvider. The index of the statement that contains the reference is passed in to make sure that only variables declared before the reference site can be referenced.

Type System
To make the safeheap statement work correctly, we have to ensure that the variables declared and allocated in a safeheap statement do not escape from its scope. To prevent this, an error is reported if a reference to a safeheap variable is passed to a function. Fig. 20.15 shows the code.

    checking rule check_safeVarRef for concept = LocalVarRef as lvr {
      boolean isInSafeHeap =
          lvr.ancestor<concept = SafeHeapStatement>.isNotNull;
      boolean isInFunctionCall =
          lvr.ancestor<concept = FunctionCall>.isNotNull;
      boolean referencesSafeHeapVar =
          lvr.var.parent.isInstanceOf(SafeHeapStatement);
      if (isInSafeHeap && isInFunctionCall && referencesSafeHeapVar)
        error "cannot pass a safe heap var to a function" -> lvr;
    }

Figure 20.15: This type system rule reports an error if a reference to a local variable declared and allocated by the safeheap statement is used in a function call.

20.4.4 Addressing W3 (Expressions): Decision Tables

Fig. 20.4 showed the decision table expression. It is evaluated to the expression in a cell c if the column header of c and the row header of c are true30. If none of the condition pairs is true, then the default value, FAIL in the example, is used as the resulting value. A decision table also specifies the type of the value it will evaluate to, and all the expressions in content cells have to be compatible with that type. The type of the header cells has to be Boolean.

30 Strictly speaking, it is the first of the cells for which the headers are true. It is optionally possible to use static verification based on an SMT solver to ensure that only one of them will be true for any given set of input values.

Structure
The decision table extends the Expression concept defined in the expressions language. Decision tables contain a list of expressions for the column headers, one for the row headers and another for the result values. They also contain a child of type Type, to declare the type of the result expressions, as well as a default value expression. The concept defines an alias dectab to allow users to instantiate a decision table in the editor31.

31 Obviously, for non-textual notations such as the table, the alias will be different than the concrete syntax (in textual notations, the alias is typically made to be the same as the "leading keyword", e.g., assert).

Editor
Defining a tabular editor is straightforward: the editor definition contains a table cell, which delegates to a Java class that implements ITableModel. This is similar to the approach used by Java Swing. It provides methods such as getValueAt(int row, int col) or deleteRow(int row), which have to be implemented for any specific table-based editor. To embed another node in a table cell (such as the expression in the decision table), the implementation of getValueAt simply returns the node (whose editor is then embedded in the table’s editor).

Type System
MPS uses unification in the type system. Language concepts specify type equations that contain type literals (such as boolean) as well as type variables (such as typeof(dectab)). The unification engine then tries to assign values to the type variables such that all applicable type equations become true. New language concepts contribute additional type equations. Fig. 20.16 shows those for decision tables32.

32 New equations are solved along with those for existing concepts. For example, the typing rules for a ReturnStatement ensure that the type of the returned expression is the same or a subtype of the type of the surrounding function. If a ReturnStatement uses a decision table as the returned expression, the type calculated for the decision table must be compatible with the return type of the surrounding function.

    // the type of the whole decision table expression
    // is the type specified in the type field
    typeof(dectab) :==: typeof(dectab.type);
    // the type of each of the column header
    // expressions must be Boolean
    foreach expr in dectab.colHeaders {
      typeof(expr) :==: <boolean>;
    }
    // ... same for row headers
    foreach expr in dectab.rowHeaders {
      typeof(expr) :==: <boolean>;
    }
    // the type of each of the result values must
    // be the same or a subtype of the table itself
    foreach expr in dectab.resultValues {
      infer typeof(expr) :<=: typeof(dectab);
    }
    // ... same for the default
    typeof(dectab.def) :<=: typeof(dectab);

Figure 20.16: The type equations for the decision table (see the comments for details).

20.4.5 Addressing W4 (Types and Literals): Physical Units

We use physical units to illustrate the addition of new types and literals. We have already shown example code earlier in Fig. 20.8.

Structure
Derived and convertible UnitDeclarations are IModuleContents. Derived unit declarations specify a name (mps, kmh) and the corresponding SI base units (m, s), plus an exponent; a convertible unit declaration specifies a name and a conversion formula33. The backbone of the extension is the UnitType, which is a composite type that has another type (int, float) in its valueType slot, plus a unit (either an SI base unit or a reference to a UnitDeclaration). It is represented in programs as baseType/unit/. We also provide LiteralWithUnits, which are expressions that contain a valueLiteral and, like the UnitType, a unit (so we can write, for example, 100 kmh).

33 The unit extension does not automatically support prefixes like k, M or m. If you need km or mm you have to define this as a convertible unit with the respective conversion formulae. This is a conscious decision driven by limited value ranges in C data types and conversion overhead.

Scoping
LiteralWithUnits and UnitTypes refer to a UnitDeclaration, which is a module content. According to the visibility rules, valid targets for the reference are the UnitDeclarations in the same module, and the exported ones in all imported modules. This rule applies to any reference to any module content, and is implemented generically. Fig. 20.17 shows the code for the scope of the reference to the UnitDeclaration. We use an interface IVisibleNodeProvider (implemented by Modules) to find all instances of a given type. The implementation of visibleContentsOfType searches through the contents of the current and imported modules and collects instances of the specified concept. The result is used as the scope for the reference.

    link search scope {unit} :
      (model, refNode, enclosingNode, operationContext)->sequence<node<>> {
        enclosingNode.ancestor<concept = IVisibleNodeProvider>.
            visibleContentsOfType(concept/UnitDeclaration/);
      }

Figure 20.17: The visibleContentsOfType operation returns all instances of the concept argument in the current module, as well as all exported instances in modules imported by the current module.

Type System
We have seen how MPS uses equations and unification to specify type system rules. However, there is special support for binary operators that makes overloading for new types easy: overloaded operations containers essentially specify 3-tuples of (leftArgType, rightArgType, resultType), plus applicability conditions to match type patterns and decide on the resulting type. Typing rules for new (combinations of) types can be added by specifying additional 3-tuples. Fig. 20.18 shows the overloaded rules for C’s MultiExpression (the language concept that implements the multiplication operator *) when applied to two UnitTypes: the result type will be a UnitType as well, where the exponents of the SI units are added.

    operation concepts: MultiExpression
    left operand type: new node<UnitType>()
    right operand type: new node<UnitType>()
    is applicable:
      (op, leftOpType, rightOpType)->boolean {
        node<> resultingValueType = operation type(op,
            leftOpType.valueType, rightOpType.valueType);
        return resultingValueType != null;
      }
    operation type:
      (op, leftOpType, rightOpType)->node<> {
        node<> resultingValueType = operation type(op,
            leftOpType.valueType, rightOpType.valueType);
        UnitType.create(resultingValueType,
            leftOpType.unit.toSIBase().add(
                rightOpType.unit.toSIBase(), 1));
      }

Figure 20.18: This code overloads the MultiExpression to work for UnitTypes. In the is applicable section we check whether there is a typing rule for the two value types (e.g., int * float). This is achieved by trying to compute the resulting value type. If none is found, the types cannot be multiplied (and consequently, the two types with unit cannot be multiplied either). In the computation of the operation type we create a new UnitType that uses the resultingValueType as the value type and then computes the resulting unit by adding up the exponents of component SI units of the two operand types.

While any two units can legally be used with * and / (as long as we compute the resulting unit exponents correctly), this is not true for + and -. There, the two operand types must be the same in terms of their representation in SI base units. We express this by using the following expression in the is applicable section34:

    leftOpType.unit.isSameAs(rightOpType.unit)

34 isSameAs actually reduces each unit to its SI base units (with exponents), then compares the result for structural equality.

In the operation type section we then compute the resulting unit type by adding the exponents of the components of the two unit types.

The typing rule for the LocalVariableDeclaration requires that the type of the init expression must be the same or a subtype of the type of the variable. To make this work correctly, we have to define a type hierarchy for UnitTypes. We achieve this by defining the supertypes for each UnitType: the supertypes are those UnitTypes whose unit is the same, and whose valueType is a supertype of the current UnitType’s value type. Fig. 20.19 shows the rule.

    subtyping rule supertypeOf_UnitType for concept = UnitType as ut {
      nlist<> res = new nlist<>;
      foreach st in immediateSupertypes(ut.valueType) {
        res.add(UnitType.create(st, ut.unit.copy));
      }
      return res;
    }

Figure 20.19: This typing rule computes the direct supertypes of a UnitType. It iterates over all immediate supertypes of the current UnitType’s value type, wrapped into a UnitType with the same unit as the original one.
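The unit arithmetic behind these rules (add exponents for *, compare exponents for + and -) is easy to model outside of MPS. The C sketch below is our own illustration, not mbeddr code: the Unit struct, the ordering of the SI base units and the function names are hypothetical. We also assume that the second argument of add in Fig. 20.18 is a factor applied to the right operand's exponents, so that a division would use -1; the source does not confirm this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Seven SI base units; the index assignment is hypothetical. */
enum { SI_BASE_COUNT = 7 };
enum { SI_METER = 0, SI_KILOGRAM = 1, SI_SECOND = 2 };

/* A unit reduced to its SI base representation: one exponent per
   base unit, e.g. m/s has exponent 1 for m and -1 for s. */
typedef struct {
    int8_t exponent[SI_BASE_COUNT];
} Unit;

/* Result unit of a multiplication (factor 1) or division (factor -1):
   the right operand's exponents are scaled and added to the left's. */
static Unit unit_combine(Unit left, Unit right, int8_t factor)
{
    Unit result;
    for (int i = 0; i < SI_BASE_COUNT; i++) {
        result.exponent[i] =
            (int8_t)(left.exponent[i] + factor * right.exponent[i]);
    }
    return result;
}

/* Applicability check for + and -: both operands must have the same
   representation in SI base units. */
static bool unit_is_same(Unit left, Unit right)
{
    for (int i = 0; i < SI_BASE_COUNT; i++) {
        if (left.exponent[i] != right.exponent[i]) {
            return false;
        }
    }
    return true;
}
```

For example, combining a length with a time using factor -1 yields the exponents of a speed, and multiplying that speed by a time again gives back a unit that unit_is_same accepts as a length.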

20.4.6 Addressing W5 (Alternative Transformations): Range Checking

The safemodules language defines an annotation to mark Modules as safe (we will discuss annotations in the next subsection). If a module is safe, the binary operators such as + or * are replaced with calls to functions that, in addition to performing the addition or multiplication, perform a range check.

Transformation
The transformation that replaces the binary operators with function calls is triggered by the presence of this annotation on the Module which contains the operator. Fig. 20.20 shows the code. The @safeAnnotation != null checks for the presence of the annotation.

Figure 20.20: This reduction rule transforms instances of PlusExpression into a call to a library function addWithRangeChecks, passing in the left and right argument of the + using the two COPY_SRC macros. The condition ensures that the transformation is only executed if the containing Module has a safeAnnotation attached to it. A transformation priority defined in the properties of the transformation makes sure it runs before the C-to-text transformation.

MPS uses priorities to specify relative orderings of transformations, and MPS then calculates a global transformation order for any given model. We use a priority to express the fact that this transformation runs before the final transformation that maps the C tree to C text for compilation.
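What a range-checked addition might do at runtime can be sketched as follows. The actual signature and error reporting of mbeddr's addWithRangeChecks are not shown in this chapter, so the function add_with_range_check, the 32-bit operand type, the range_errors counter and the fallback result of 0 are all assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical error counter; a real library might log instead. */
static unsigned range_errors = 0;

/* Adds two 32-bit integers. The check is performed before the
   addition, so the signed overflow itself never happens. On a
   detected overflow the error is counted and 0 is returned. */
static int32_t add_with_range_check(int32_t left, int32_t right)
{
    bool overflow =
        (right > 0 && left > INT32_MAX - right) ||
        (right < 0 && left < INT32_MIN - right);
    if (overflow) {
        range_errors++;
        return 0;
    }
    return left + right;
}
```

In a safe module, an expression such as a + b would then be replaced by a call like add_with_range_check(a, b), while the same expression in an ordinary module keeps its plain C translation.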

20.4.7 Addressing W6 (Meta Data): Requirements Traces

Annotations are concepts whose instances can be added as children to a node N without this being specified in the definition of N’s concept35. While structurally the annotations are children of the annotated node, the editor is defined the other way round: the annotation editor delegates to the editor of the annotated element. This allows the annotation editor to add additional syntax around the annotated element36.

35 We discussed annotations in Section 16.2.7.

36 Optionally, it is possible to explicitly restrict the concepts to which a particular annotation can be attached. However, for the requirements traces discussed in this section, we do not want such a restriction: traces should be attachable to any program node.

We illustrate the annotation mechanism based on the requirements traces. As we discussed at the end of Section 20.3.2, a requirements trace establishes a link from a program element to a requirement. It is important that this annotation can be attached to any node, independent of the concept of which it is an instance. As a consequence of the projectional approach, the program can be shown with or without the annotations, controlled by a global switch. Fig. 17.5 showed an example.

Structure
Fig. 20.21 shows the structure. Notice how it extends the MPS-predefined concept NodeAnnotation. It also specifies a role, which is the name of the property that is used to store TraceAnnotations under the annotated node.

    concept TraceAnnotation extends NodeAnnotation
                            implements <none>
      children:
        TraceKind      tracekind  1
        TraceTargetRef refs       0..n
      concept properties:
        role = trace
      concept links:
        annotated = BaseConcept

Figure 20.21: Annotations have to extend the MPS-predefined concept NodeAttribute. They can have an arbitrary child structure (tracekind, refs), but they have to specify the role (the name of the property that holds the annotated child under its parent), as well as the attributed concept. The annotations can only be attached to instances of this concept (or subconcepts).

Editor
In the editor annotations look as if they surrounded their parent node (although they are in fact children). Fig. 20.22 shows the definition of the editor of the requirements trace annotation (an example is shown in Fig. 20.9): it puts the trace to the right of the annotated node. Since MPS is a projectional editor, there is no base-language grammar that needs to be made aware of the additional syntax in the program. This is key to enabling arbitrary annotations on arbitrary program nodes.

Figure 20.22: The editor definition for the ReqTrace annotation (an example trace annotation is shown in Fig. 20.9). It consists of a vertical list [/ .. /] with two lines. The first line contains the reference to the requirement. The second line uses the attributed node construct to embed the trace into the editor of the program node to which this annotation is attached. So the annotation is always rendered over whatever syntax the original node uses.

Annotations are typically attached to a program node via an intention. Intentions are an MPS editor mechanism: a user selects the target element, presses Alt-Enter and selects Add Trace from the popup menu. Fig. 20.23 shows the code for the intention that attaches a requirements trace.

    intention addTrace for BaseConcept {
      description(node)->string {
        "Add Trace";
      }
      isApplicable(node)->boolean {
        node.@trace == null;
      }
      execute(editorContext, node)->void {
        node.@trace = new node<TraceAnnotation>();
      }
    }

Figure 20.23: An intention definition consists of three parts. The description returns the string that is shown in the intentions popup menu. The isApplicable section determines under which conditions the intention is available in the menu; in our case, we can only add a trace if there is no trace on the target node already. Finally, the execute section performs the action associated with the intention. In our case we simply put an instance of TraceAnnotation into the @trace property of the target node.

20.4.8 Addressing W7 (Restriction): Preventing Use of Real Numbers

We have already seen in Section 20.4.2 how constraints can prevent the use of specific concepts in certain contexts. We use the same approach for preventing the use of real number types inside model-checkable state machines: a can be ancestor constraint in the state machine prevents instances of float in the state machine if the verifiable flag is set37.

37 It is possible to define such constraints in extensions, thereby restricting existing concepts after the fact, in a modular way.

20.5 Experiences

In Section 20.5.1 we provide a brief overview of our experiences in implementing mbeddr, including the size of the project and the efforts spent. Section 20.5.2 discusses to what degree this approach leads to improvements in embedded software development.

Element                  Count   LOC-Factor
Language Concepts          260   3
Property Declarations       47   1
Link Declarations          156   1
Editor Cells               841   0.25
Reference Constraints       21   2
Property Constraints        26   2
Behavior Methods           299   1
Type System Rules          148   1
Generation Rules            57   10
Statements               4,919   1.2
Intentions                  47   3
Text Generators            103   2
Total LOC                8,640

Figure 20.24: We count various language definition elements and then use a factor to translate them into lines of code. The reason why many factors are so low (e.g., reference constraints or behavior methods) is that the implementation of these elements is made up of statements, which are counted separately. In the case of editor cells, typically several of them are on the same line, hence the fraction. Finally, the MPS implementation language supports higher-order functions, so some statements are rather long and stretch over more than one line: this explains the 1.2 in the factor for statements.
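The counting scheme behind Fig. 20.24 amounts to a weighted sum. The following Python sketch only illustrates that arithmetic; the name loc_equivalents is made up for this example, and the counts and factors are copied from the table. Note that the direct sum over the listed rows comes out at roughly 8,554, somewhat below the 8,640 printed as the total; the text does not say where the difference comes from.

```python
# (element, count, LOC factor) as listed in Fig. 20.24
ELEMENTS = [
    ("Language Concepts", 260, 3),
    ("Property Declarations", 47, 1),
    ("Link Declarations", 156, 1),
    ("Editor Cells", 841, 0.25),
    ("Reference Constraints", 21, 2),
    ("Property Constraints", 26, 2),
    ("Behavior Methods", 299, 1),
    ("Type System Rules", 148, 1),
    ("Generation Rules", 57, 10),
    ("Statements", 4919, 1.2),
    ("Intentions", 47, 3),
    ("Text Generators", 103, 2),
]


def loc_equivalents(elements):
    """Sum of count * factor over all language definition elements."""
    return sum(count * factor for _name, count, factor in elements)


print(round(loc_equivalents(ELEMENTS)))
```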

20.5.1 Language Extension

Size Typically, lines of code are used to describe the size of a software system. In MPS, a "line" is not necessarily meaningful. Instead we count important elements of the implementation and then estimate a corresponding number of lines of code. Fig. 20.24 shows the respective numbers for the core, i.e. C itself plus unit test support, decision tables and build/make integration (the table also shows how many LOC equivalents we assume for each language definition element, and the caption explains to some extent the rationale for these factors). According to our metric the C core is implemented with fewer than 10,000 lines of code.

Let us look at an incremental extension of C. The components extension (interfaces, components, pre- and post-conditions, support for mock components in testing and a generator back to plain C) is circa 3,000 LOC equivalents. The state machines extension is circa 1,000. Considering the fact that these LOC equivalents represent the language definition (including type systems and generators) and the IDE (including code completion, syntax coloring, some quick fixes and refactorings), this clearly demonstrates the efficiency of MPS for language development and extension.

Effort In terms of effort, the core C implementation has been circa 4 person months divided between three people. This results in roughly 2,500 lines of code per person month. Extrapolated to a year, this would be 7,500 lines of code per developer. According to McConnell38, in a project up to 10,000 LOC, a developer can typically do between 2,000 and 25,000 LOC. The fact that we are at the low end of this range can be explained by the fact that MPS provides very expressive languages for DSL development: you don’t have to write a lot of code to express a lot about a DSL. Instead, MPS code is relatively dense and requires quite a bit of thought. Pair programming is very valuable in language development.

38 www.codinghorror.com/blog/2006/07/diseconomies-of-scale-and-lines-of-code.html

Once a developer has mastered the learning curve, language extension can be very productive. The state machines and components extensions have both been developed in about a month. The unit testing extension or the support for decision tables can be implemented in a few days.

Language Modularity, Reuse and Growth Modularity and composition are central to mbeddr. Building a language extension should not require changes to the base languages. This requires that the extended languages are built with extension in mind. Just as in object-oriented programming, where only complete methods can be overridden, only specific parts of a language definition can be extended or overwritten. The implementation of the default extensions served as a test case to confirm that the C core language is in fact extensible. We found a few problems, especially in the type system, and fixed them. None of these fixes were "hacks" to enable a specific extension: they were all genuine mistakes in the design of the C core. Due to the broad spectrum covered by our extensions, we are confident that the current core language provides a high degree of extensibility.

Independently developed extensions should not interact with each other in unexpected ways. While MPS provides no automated way of ensuring this, we have not seen such interactions so far. The following steps can be taken to minimize the risk of unexpected interactions. Generated names should be qualified to make sure that no symbol name clashes occur in the generated C code. An extension should never consume "scarce resources": for example, it is a bad idea for a new Statement to require a particular return type of the containing function, or change that return type during transformation. Two such badly designed statements cannot be used together, because they are likely to require different return types39.

39 Note that unintended syntactic integration problems between independently developed extensions (known from traditional parser-based systems) can never happen in MPS. This was one of the reasons to use MPS for mbeddr.

Modularity should also support reuse in contexts not anticipated during the design of a language module. Just as in the case of language extension (discussed above), the languages to be reused have to be written in a suitable way so that the right parts can be reused separately. We have shown this with the state machines language. State machines can be used as top-level concepts in modules (binding out-events to C functions), and also inside components (binding out-events to component methods). Parts of the transformation of a state machine have to be different in these two cases, and these differences were successfully isolated to make them exchangeable. Also, we reuse the C expression language inside the guard conditions in a state machine’s transitions. We use constraints to prevent the use of those C expressions that are not allowed inside transitions (for example, references to global variables). Finally, we have successfully used physical units in components and interfaces.

Summing up, these facilities allow different user groups to develop independent extensions, growing40 the mbeddr stack even closer towards their particular domain.

40 "Growing" in the sense of Growing a Language, Guy Steele’s great OOPSLA talk.

Who can create Extensions? mbeddr is built to be extended. The question is by whom. This question can be addressed in two ways: who is able to extend it from a skills perspective, and who should extend it?

Let us address the skills question first. We find that it takes about a month for a developer with solid object-oriented programming experience to become proficient with MPS and the structures of the mbeddr core languages41. Also, designing good languages, independent of their implementation, is a skill that requires practice and experience42. So, from this perspective we assume that in any given organization there should be a select group of language developers who build the extensions for the end users. Notice that such an organizational structure is common today for frameworks and other reusable artifacts.

41 This may be reduced by better documentation, but a steep learning curve will remain.

42 The first part of this book provides some guidance.

There is also the question of who should create extensions.

One could argue that, as language development becomes simpler, an uncontrolled growth in languages could occur, ultimately resulting in chaos. This concern should be addressed with governance structures that guide the development of languages. The bigger the organization is, the more important such governance becomes43.

43 The modular nature of the mbeddr language extensions makes this problem much easier to tackle. In a large organization we assume that a few language extensions will be strategic: aligned with the needs of the whole organization, well-designed, well tested and documented, implemented by a central group, and used by many developers. In addition, small teams may decide to develop their own, smaller extensions. Their focus is much more local, and the development requires much less coordination. These could be developed by the smaller units themselves.

20.5.2 Improvements in Embedded Development

In this section we discuss preliminary results of a real-world development project. The project develops the software for a smart meter system44. A smart meter is an electrical meter that continuously records the consumption of electric power in a home and sends the data back to the utility for monitoring and billing. The particular software we develop will run on a 2-chip board (TI MSP-430 for metrology, another TI processor (tbd.) for the remaining application logic). Instead of the gcc compiler used in mbeddr by default, this project uses an IAR compiler.

44 The smart meter system is developed by itemis France. Technical contacts are Bernd Kolb ([email protected]) and Stephan Eberle ([email protected]).

The software comprises circa 30,000 lines of mbeddr code, has several time-sensitive parts that require a low-overhead implementation, and will have to be certified by an independent body.
The software is derived from an existing example smart meter system written in traditional C, and reuses existing artifacts such as header files and libraries45.

45 While the project is still going on, we can already report some experiences and draw some conclusions.

Why mbeddr? mbeddr was chosen to implement the smart meter for the following reasons. The project has to work with an existing code base which had no production-level quality. The code quality needed to be improved incrementally. So starting with the existing C code and then refactoring towards better abstractions seemed like a good idea. Also, as we will see below, the existing C extensions provided by mbeddr are a good match for what is needed in the smart meter46. Finally, as the goal is to have the meter certified, testing the software is very important. By using the abstraction mechanisms provided by mbeddr, and by exploiting the ability to build custom extensions, testability can be improved significantly. In particular, hardware-specifics can be isolated, which enables testing without the actual target hardware. Also, mbeddr’s support for requirements traceability comes in handy for the upcoming certification.

46 In particular, the smart meter is not just a state-based or a data flow-based system, it contains multiple different behavioral paradigms. Using a state chart or data flow modeling tool was hence not an option.

Using the Existing Extensions The smart meter uses the following default extensions:

Components The smart meter uses components to improve the structure of the application, and to support different implementations of the same interface. This improves modularity and testability. Mock components are used extensively for testing.

State Machines The smart meter communicates with its environment via several different protocols. So far, one of these protocols has been refactored to use a state machine. This has proven to be much more readable than the original C code. Components and state machines are combined, which allows decoupling message assembly and parsing from the application logic in the server component.

Units A major part of the smart meter application logic performs computations on physical quantities (time [s], current [A] or voltage [V]). So mbeddr’s support for physical units comes in handy. The benefits of this extension are mostly in type checking; using types with units also improves the readability and comprehensibility of the code.

Requirements Tracing The smart meter also makes use of requirements traces. During the upcoming certification process, these will be extremely useful for tracking if and how the customer requirements have been implemented. Because of their orthogonal nature, the traces can be attached to the new language concepts specifically developed for the smart meter.

 Custom Extensions As part of the smart meter, so far mbeddr has been extended in the following ways:

Registers The smart meter software makes extensive use of registers (metrology: access the sensor values, UART: send and receive data). This cannot be abstracted away easily due to performance/overhead constraints. In addition, some registers are special-purpose registers: when a value is written to such a register, a hardware-implemented computation is automatically triggered based on the value supplied by the programmer. The result of this computation is then stored in the register. To run code that works with these registers on the PC for testing, developers face two problems: first, the header files that define the addresses of the registers are not valid for the PC’s processor. Second, there are no special-purpose registers on the PC, so no automatic computations or other hardware-triggered actions would be triggered. This problem was solved with a language extension that supports registers as first-class citizens and supports accessing them from mbeddr-C code (see code below).

exported register8 ADC10CTL0 compute as val * 1000

void calculateAndStore( int8 value ) {
  int8 result = // some calculation with value
  ADC10CTL0 = result; // actually stores result * 1000
}

The extension also supports specifying an expression that performs the computation. When the code is translated for the real device, the real headers are included, and access to the registers is replaced with access to the constants defined in the header. In testing, structs are generated to hold the register data. Each write access to a register is replaced with a write access to the struct, and the expression that simulates the special purpose register is included in that assignment.
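To make the test-time behaviour more tangible, here is a small Python sketch of the idea, not the code mbeddr actually generates. It assumes a class Registers standing in for the generated struct and a helper compute_ADC10CTL0 standing in for the "compute as val * 1000" expression; both names, and the placeholder calculation, are invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Registers:
    """Stands in for the struct generated to hold register data on the PC."""
    ADC10CTL0: int = 0


def compute_ADC10CTL0(val):
    """Simulates the hardware computation declared as 'compute as val * 1000'."""
    return val * 1000


regs = Registers()


def calculate_and_store(value):
    result = value + 1  # placeholder for "some calculation with value"
    # The register write becomes a struct write that includes the
    # simulation expression of the special-purpose register.
    regs.ADC10CTL0 = compute_ADC10CTL0(result)


calculate_and_store(2)
print(regs.ADC10CTL0)  # prints 3000
```

The application code only ever "writes to the register"; whether that ends up as a real register access or as the emulated assignment is decided by the transformation.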

Interrupts Many aspects of the smart meter system are driven by interrupts. To integrate the component-based architecture used in the smart meter with interrupts, it is necessary to trigger component runnables (methods) via an interrupt. To this end, we have implemented a language extension that allows us to declare interrupts. In addition, the extension provides runnable triggers that express the fact that a runnable is triggered by an interrupt47. The extension also provides a concept to assign an interrupt to a runnable during component instantiation. A check makes sure that each interrupt-triggered runnable has at least one interrupt assigned48.

47 By default, runnables are triggered by an "incoming" invocation, upon initialization or by a timer. This new trigger connects a runnable to an interrupt.

48 For testing purposes on the PC, special language constructs can be used to emulate the occurrence of an interrupt: the test driver simulates the triggering of interrupts based on a test-specified schedule and checks whether the system reacts correctly.

Data Encoding As part of the communication protocol, data has to be encoded into messages49. A language extension supports the definition of data structures, and generated code deals with constructing messages. In particular, the generated code deals with packed data (where data has sizes that are not multiples of 8 bits). Also, code that processes messages can be statically checked against the message structure definition, making this code much more robust (in particular if the message definitions are changed).

49 The message/data format is similar to ASN.1.
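What "packed data" means for the Data Encoding extension can be illustrated with a short Python sketch. This is not the generated code and not the smart meter's message format; the function names pack_fields and unpack_fields, the MSB-first bit order and the zero padding at the end are all assumptions made for the example.

```python
def pack_fields(fields):
    """Pack (value, bit_width) pairs MSB-first into bytes.

    Trailing bits of the last byte are zero-padded (an assumption
    of this sketch, not a property of any particular protocol).
    """
    acc = 0
    nbits = 0
    for value, width in fields:
        if value < 0 or value >= (1 << width):
            raise ValueError(f"{value} does not fit into {width} bits")
        acc = (acc << width) | value
        nbits += width
    pad = (-nbits) % 8
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")


def unpack_fields(data, widths):
    """Inverse of pack_fields for a known list of bit widths."""
    acc = int.from_bytes(data, "big")
    total = len(data) * 8
    values = []
    pos = 0
    for width in widths:
        shift = total - pos - width
        values.append((acc >> shift) & ((1 << width) - 1))
        pos += width
    return values


# A 3-bit, a 10-bit and a 1-bit field share two bytes.
message = pack_fields([(5, 3), (300, 10), (1, 1)])
print(message.hex())                       # prints a964
print(unpack_fields(message, [3, 10, 1]))  # prints [5, 300, 1]
```

In a hand-written version the list of widths has to be kept in sync manually between sender and receiver; deriving both sides from one message definition is what makes the static checking mentioned above possible.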

Conclusions The mbeddr default extensions have proven extremely useful in the development of the smart meter. The fact that the extensions are directly integrated into C (as opposed to the classical approach of using external DSLs or separate modeling tools) reduces the hurdle of using higher-level extensions and removes any potential mismatch between DSL code and C code.

Generating code from higher-level abstractions may introduce performance and resource consumption overhead. While we have not yet performed a systematic analysis of the overhead incurred by the mbeddr extensions, it is low enough to run the smart meter system on the hardware intended for it50.

50 Some extensions (registers, interrupts or physical units) have no runtime overhead at all, since they have no representation in the generated C code. Others, such as the components, incur a very small overhead as a consequence of indirections from function pointers (to support polymorphism).

Additional effort is required to integrate with existing legacy code. As a consequence of the projectional editor, we have to parse the C text (with an existing parser) and construct the MPS AST. mbeddr provides an importer for header files as a means of connecting to existing libraries. However, mostly as a consequence of C’s preprocessor, which allows all kinds of mischief to be done to otherwise well-structured C code, this importer is not trivial. For example, we currently cannot import all alternatives expressed by #ifdefs. Users have to specify a specific configuration to be imported (in the future, we will support importing of all options by mapping the #ifdefs to mbeddr’s product line variability mechanism). Also, header files often contain platform-specific keywords or macros. Since they are not supported by the mbeddr C implementation, these have to be removed before they can be imported.
The header importer provides a regular expression-based facility to remove these platform specifics before the import. The smart meter project, which is heavily based on an existing code base, also drove the need for a complete source code importer (including .c files, not just header files), which we are currently in the process of developing51.

51 The integration of legacy code described in this paragraph is clearly a disadvantage of projectional editing. However, because of the advantages of mbeddr and language extensibility discussed in this chapter, we feel that it is a good trade-off.

We have performed scalability tests and found that mbeddr scales to at least the equivalent of 100,000 lines of C code in the developed system. These tests were based on automatically generated sample code and measured editor responsiveness and transformation times. While there are certainly systems that are substantially larger, a significant share of embedded software is below this limit and can be addressed with mbeddr52.

52 The smart meter system consists of 30,000 lines of mbeddr code. Since there is a factor of about 1.5 between the mbeddr code and generated C, the smart meter system corresponds to circa 45,000 lines of C.

20.6 Discussion

Why MPS? Our choice of MPS is due to its support for all aspects of language development (structure, syntax, type systems, IDE, transformations), its support for flexible syntax as a consequence of projectional editing, and its support for advanced modularization and composition of languages. The ability to attach annotations to arbitrary program elements without a change to that element’s definition is another strong advantage of MPS (we use this for presence conditions and trace links, for example). While the learning curve for MPS is significant (a developer who wants to become proficient in MPS language development has to invest at least a month), we found that it scales extremely well for larger and more sophisticated languages53.

53 This is in sharp contrast to some of the other tools the authors have worked with, where implementing simple languages is quick and easy, and larger and more sophisticated languages are disproportionately more complex to build. This is illustrated by the very reasonable effort necessary for implementing mbeddr.

Projectional Editing Projectional editing is often considered a drawback, because the editors feel somewhat different and the programs are not stored as text, but as a tree (XML). We have already highlighted the fact that MPS does a good job regarding the editor experience, and we feel that the advantages of projectional editors regarding syntactic freedom far outweigh the drawback of requiring some initial familiarization. Our experience so far with about ten users (pilot users from industry, students) shows that after a short guided introduction of about 30 minutes, and an initial accommodation period (circa 1-2 days), users can work productively with the projectional editor. Regarding storage, the situation is not any worse than with current modeling tools that store models in a non-textual format, and MPS does provide good support for diff and merge using the projected syntax.
Feasibility of Language Extension Based on the experience with the smart meter, the effort for building extensions is reasonable. For example, the implementation of the language extensions for registers (and the simulation for testing) was done in half a day. The addition of interrupts, interrupt-triggered runnables and the way to "wire them up" was circa one day54.

54 In the context of a development project which, like the smart meter, is planned to run a few person years, these efforts can easily be absorbed. The benefits are well worth the effort in terms of the improved type safety and testability.

Building a language extension should not require changes to the base language. The extensions for the smart meter demonstrate this point. The registers extension discussed above has been built without changing the underlying C language55. Similarly, the interrupt-based runnable triggers have been hooked into the generic trigger facility that is part of the components language.

55 It requires new top-level module contents (the register definitions themselves), new expressions (for reading and writing into the registers), and embedding expressions into new contexts (the code that emulates the hardware computation when registers are written).

Once a language is designed in a reasonable way, the language (or parts of it) should be reusable in contexts that have not been specifically anticipated in advance. The smart meter system contains such examples: expressions have been embedded in the register definition concept for emulating the hardware behavior, and types with units have been used in decision tables. Again, no change to the existing languages has been necessary.

One criticism that has been used against language extension is that the language will grow large and that it is hard for users to learn all its constructs. In our experience, this is not a problem in mbeddr for the following three reasons: first, the extensions provide linguistic abstractions for concepts that are well known to the users: state-based behavior, interfaces and components or test cases. Second, the additional language features are easily discoverable because of the IDE support. Third, and most important, these extensions are modularized, and any particular end user will only use those extensions that are relevant to whatever their current program addresses. This avoids overwhelming the user with too much "stuff" at a time.

21 DSLs and Product Lines

This chapter discusses the role of DSLs in Product Line Engineering (PLE). We first briefly introduce PLE and feature models and discuss how feature models can be connected to programs expressed in DSLs. We then explain the difference in expressivity between feature models and DSLs and argue why sometimes feature models are not enough to express the variability in a product line, and how DSLs can help. The chapter concludes with a mapping of the concepts relevant to PLE and DSLs. This chapter is written mostly for people with a background in PLE who want to understand how DSLs fit into PLE.

21.1 Introduction

The goal of product line engineering (PLE) is to efficiently manage a range of products by factoring out commonalities such that definitions of products can be reduced to a specification of their variable aspects1. As a consequence of this approach, software quality can be improved and time-to-market of any single product in the product line can be reduced2. One way of achieving this is the expression of product configurations on a higher level of abstraction than the actual implementation3. An automated mapping transforms the configuration to the implementation.

1 PLE also involves a lot of product management, process and organizational aspects. We do not cover these in this book.

2 It can also help establish a common user experience among a range of products.

3 This higher level of abstraction is often called the problem space; the lower level is often called the solution space or implementation space.

Figure 21.1: An example feature diagram for a product line of Stack data structures. Filled circles represent mandatory features, empty circles represent optional features. Filled arcs represent n-of-m selection and empty arcs represent 1-of-m.

21.2 Feature Models

In PLE, this higher level of abstraction is typically realized with feature models4. Feature models express configuration options and the constraints among them. A graphical notation, called feature diagrams, is often used to represent feature models (Fig. 21.1 shows an example).

4 There are other approaches, such as Orthogonal Variability Models, but they are similar in essence to feature models.

Here is why we need constraints: if a product line’s variability was just expressed by a set of Boolean options, the configuration space would grow with 2^n, with n representing the number of options. With feature models, constraints are expressed regarding the combinations of features, limiting the set of valid configurations to a more manageable size. Constraints include5:

5 Additionally, non-hierarchical constraints can be expressed as well. For example, a feature can declare conflicts with and requires also constraints relative to arbitrary other features.

• Mandatory (filled circles): mandatory features have to be in each product. For example, in Fig. 21.1, each Stack has to have the feature ElementType.
• Optional (empty circles): optional features may or may not be in a product. Counter and Optimization are examples of optional features.
• Or (filled arc): a product may include zero, one or any number of the features in an or group. In the example, a product may include any number of features from ThreadSafety, BoundsCheck and TypeCheck.
• Xor (empty arc): a product must include exactly one of the features grouped into a xor group. The ElementType must either be int, float or String.
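How strongly such constraints shrink the 2^n space can be seen by brute force on a small example. The Python sketch below uses a simplified, hypothetical encoding of part of the Stack feature model (only ElementType, Counter and Optimization with their children); the data structures PARENT, MANDATORY and XOR_GROUPS and the function names are assumptions of this illustration, not a feature modeling tool's API.

```python
from itertools import combinations

# child feature -> parent feature ("Stack" is the root and always present)
PARENT = {
    "ElementType": "Stack",
    "int": "ElementType",
    "float": "ElementType",
    "String": "ElementType",
    "Counter": "Stack",
    "Optimization": "Stack",
    "Speed": "Optimization",
    "MemoryUsage": "Optimization",
}
MANDATORY = {"ElementType"}  # required whenever the parent is selected
XOR_GROUPS = {               # exactly one child if the parent is selected
    "ElementType": {"int", "float", "String"},
    "Optimization": {"Speed", "MemoryUsage"},
}


def is_valid(config):
    selected = set(config) | {"Stack"}
    # A feature can only be selected together with its parent.
    if any(PARENT[f] not in selected for f in config):
        return False
    if any(PARENT[f] in selected and f not in selected for f in MANDATORY):
        return False
    for parent, children in XOR_GROUPS.items():
        if parent in selected and len(children & selected) != 1:
            return False
    return True


def all_configurations(features):
    """Every subset of the features, i.e. the unconstrained 2^n space."""
    for size in range(len(features) + 1):
        yield from combinations(features, size)


FEATURES = sorted(PARENT)
candidates = list(all_configurations(FEATURES))
valid = [c for c in candidates if is_valid(c)]
print(len(candidates), len(valid))  # prints 256 18
```

With eight Boolean options there are 256 combinations, but only 18 of them satisfy the hierarchy, the mandatory feature and the two xor groups.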

A configuration represents a product in the product line. It comprises a set of feature selections from a feature model that comply with the constraints expressed in the feature model.

For example, the configuration {Optimization, Memory Use, Additional Features, Type Check, Element Type, int, Size, Dynamic} would be valid. A configuration that includes Speed and Memory Usage would be invalid, because it violates the xor constraint between those two features expressed in the feature model.

Note that a feature model does not yet describe the implementation of a product or the product line; the feature model has to be connected to implementation artifacts in a separate step6.

6 While this sounds like a drawback, it is one of the main advantages of the approach: feature models support reasoning about product configurations independent of their implementation.

21.3 Connecting Feature Models to Artifacts

By definition, feature models express product line variability at a level that is more abstract than the implementation. In many systems, the implementation of variability is scattered over (parts of) many implementation artifacts. However, to result in a correct system, several variation points (VP) may need to be configured in a consistent, mutually dependent way. If each VP has to be configured separately, the overall complexity grows quickly. By identifying logical variation points and factoring them into features in a feature model, and then tying the (potentially many) implementation variation points to these logical variation points, related implementation variations can be tied together and managed as one (Fig. 21.2).

Figure 21.2: The Artifact Level represents realization artifacts such as models, code or documentation. The Logical Level is the external description of variation points and the conceptual constraints among them, typically a feature model. One or more VPs in the implementation level are associated with variation points in the logical level (n:1, n:m).

If DSLs are used to implement a software system, then the artifacts configured from the feature model are typically DSL programs, and the variation points are program elements. By using DSL models instead of low-level implementation code, the number of variation points in the artifacts will be reduced, because you use the DSL-to-code transformation to expand all the details in a consistent way. The trade-off is that you have to define this high-level domain specific language, including a way to define variants of programs written in that language. You also need to define the transformation down to the actual implementation artifacts (Fig. 21.3).

Figure 21.3: A model describes domain abstractions in a formal and concise way. Transformations map the model to (typically more than one) implementation artifact. Variability is expressed with fewer VPs in the models compared to implementation artifacts.

The configuration of models (and other artifacts) can be done in several different ways: removal, injection and parameterization.

Removal (also known as negative variability) In this approach, the mapping from a feature model to implementation artifacts removes parts of a comprehensive whole (Fig. 21.4). This implies marking up the various optional parts of the comprehensive whole with Boolean expressions that determine when to remove the part. These expressions are called presence conditions. The biggest advantage of this approach is its apparent simplicity. However, the comprehensive whole has to contain the parts for all variants (maybe even parts for combinations of variants), making it potentially large and complex. Also, depending on the tool support, the comprehensive whole might not even be a valid instance of the underlying language or formalism7. In an IDE, the respective artifact might show errors, which makes this approach annoying at times.

Figure 21.4: Removal represents negative variability, in that it takes optional things (small squares) away from a comprehensive whole, based on the configuration expressed over the features a,b,c. The optional parts are annotated with presence conditions referring to the configuration features.

7 For example, a Java class may have to extend different base classes, depending on the variant, or it may contain fields with different types.

ifdefs in C and C++ are a well-known implementation of this strategy. A preprocessor removes all code regions whose ifdef condition evaluates to false. When calling the compiler/preprocessor, you have to provide a number of symbols that are evaluated as part of the conditions. Conditional compilation can also be found in other languages. Preprocessors that treat the source code simply as text are available for many languages and are part of many PLE tool suites. The AUTOSAR standard, as well as other modeling formalisms, support the annotation of model elements with presence conditions. The model element (and all its children) are removed from the model if the condition evaluates to false. The same approach is available in mbeddr.
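The mechanics of removal fit into a few lines. The following Python sketch is a generic illustration rather than mbeddr's or AUTOSAR's mechanism: the "comprehensive whole" is a list of parts, each optionally annotated with a presence condition, modeled here as a predicate over the set of selected features. The names WHOLE and derive_variant and the listed C declarations are invented for the example.

```python
# Each part carries a presence condition; None means "always present".
WHOLE = [
    ("void push(Stack* s, int v);", None),
    ("int size(Stack* s);", lambda f: "Counter" in f),
    ("void lock(Stack* s);", lambda f: "ThreadSafety" in f),
    ("void check_bounds(Stack* s);",
     lambda f: "BoundsCheck" in f and "Speed" not in f),
]


def derive_variant(whole, features):
    """Keep only the parts whose presence condition holds for the features."""
    return [part for part, condition in whole
            if condition is None or condition(features)]


for line in derive_variant(WHOLE, {"Counter", "BoundsCheck"}):
    print(line)
```

Note that WHOLE has to list the parts for every variant up front, which is exactly the size and complexity issue mentioned above.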

Injection (also known as positive variability) In this approach, additions are defined relative to a minimal core (Fig. 21.5). The core does not know about the variability: the additions point to the place where they need to be added. The clear advantage of this approach is that the core is typically small and contains only what is common for all products. The parts specific to a variant are kept external and added to the core only when necessary. To be able to do this, however, there must be a way to refer to the location in the minimal core at which to add a variable part. This either requires the explicit definition of named hooks in the minimal core, or some way of pointing into the core from an external source. Also, interactions between the additions for various features may be hard to manage.

Figure 21.5: A minimal base artifact made of various parts (the small rectangles) exists. There is also variant-specific code (the strange shapes), connected to features external to the actual artifact and pointing to the parts of the artifact to which they can be attached. Implementing a variant means that the variant-specific code associated with the selected features is injected into the base artifact, attached to the parts they designate.

Aspect-oriented programming is a way of implementing this strategy. Pointcuts are a way of selecting from a set of join points in the base asset. A join point is an addressable location in the core. Instead of explicitly defining hooks, all instances of a specific language construct are automatically addressable. Various preprocessors can also be used in this way. However, they typically require the explicit markup of hooks in the minimal core. For models, injection is especially simple, since in most formalisms model elements are addressable by default and/or the language can be extended to be able to mark up hooks.
This makes it possible to point to a model element, and add additional model elements to it, as long as the result is still a valid instance of the meta model.
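The injection approach with explicitly named hooks can be illustrated with a minimal sketch. It is not taken from mbeddr or any PLE tool; the step names, the hook: prefix, the ADDITIONS table and build_variant are all hypothetical and only serve to show that the core stays unaware of the features while the additions point at the hooks.

```python
# Minimal core: ordinary steps plus named hooks. The core knows nothing about features.
CORE = ["init", "hook:after_init", "run", "hook:before_exit", "exit"]

# Variant-specific additions are kept outside the core.
# Each addition names the hook it attaches to (hypothetical feature and step names).
ADDITIONS = {
    "logging": [("after_init", "open_log"), ("before_exit", "close_log")],
    "encryption": [("after_init", "load_keys")],
}


def build_variant(core_steps, selected_features):
    """Inject the additions of the selected features at the hooks they point to."""
    result = []
    for step in core_steps:
        if not step.startswith("hook:"):
            result.append(step)
            continue
        hook = step[len("hook:"):]
        # Hooks themselves disappear; only the code injected at them remains.
        for feature in selected_features:
            for target, code in ADDITIONS.get(feature, []):
                if target == hook:
                    result.append(code)
    return result


if __name__ == "__main__":
    print(build_variant(CORE, ["logging", "encryption"]))
```

With no features selected the variant is just the core steps; the order in which several features inject at the same hook is decided here by the order of selection, which is one simple way of handling the feature interactions mentioned above.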

 Parameterization The variable artifact defines parameters. A variant is constructed by providing values for those parameters (Fig. 21.6). The parameters are usually typed to restrict the range of valid values. In most cases, the values for the parameters are relatively simple, such as strings, integers, Booleans or regular expressions. However, in principle, they can be arbitrarily complex8. The parameterized artifact needs to explicitly define the parameters, as well as a way to specify values. The artifact has to query the values of those parameters explicitly and use them for whatever it does. The approach requires the core to be explicitly aware of the variability9.

8 In the extreme case, these parameters can be complete programs in some DSL.

9 This is not necessarily the case with the other two approaches.

A configuration file that is read by an application is a form of parameterization. The names of the parameters are predefined by the application, and when defining a variant, a set of values is supplied. The Strategy pattern is a form of parameterization, especially in combination with a factory. A variant is created by supplying an implementation of an interface defined by the configurable application. All kinds of other small, simple or domain-specific languages can be used as a form of parameterization. A macro language in an application is a form of parameterization, where the type of parameter is "valid program written in language X"10.

10 There is an obvious connection to DSLs – a DSL can be used here. This approach is useful when the primary product definition can be expressed with a feature model. The DSL-typed attributes can be used for those variation points for which selection is not expressive enough.

Figure 21.6: An artifact defines a number of (typed) parameters. A variant provides values for the parameters.
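As a rough illustration of parameterization, the following sketch declares typed parameters, checks a variant's values against them, and lets the artifact query the values explicitly. The second argument of render plays the role of a Strategy supplied by the variant. The parameter names and the functions bind and render are assumptions made for this example only.

```python
from typing import Callable

# Parameters declared by the (hypothetical) artifact: name -> required type.
PARAMETERS = {"queue_size": int, "encrypted": bool, "greeting": str}


def bind(values: dict) -> dict:
    """Check the values supplied by a variant against the typed parameters."""
    unknown = set(values) - set(PARAMETERS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    for name, expected in PARAMETERS.items():
        if name not in values:
            raise KeyError(f"missing value for parameter {name!r}")
        # Exact type match, so that True is not accepted where an int is required.
        if type(values[name]) is not expected:
            raise TypeError(f"{name!r} must be of type {expected.__name__}")
    return dict(values)


def render(config: dict, transform: Callable[[str], str]) -> str:
    """The artifact queries its parameters explicitly; `transform` is a supplied strategy."""
    text = config["greeting"]
    if config["encrypted"]:
        text = transform(text)  # stand-in for a real cipher
    return f"{text} (queue={config['queue_size']})"


if __name__ == "__main__":
    variant = bind({"queue_size": 8, "encrypted": True, "greeting": "hello"})
    print(render(variant, str.upper))
```

Note that, unlike with injection, the artifact itself has to know every parameter by name, which is the explicit awareness of variability described above.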

21.3.1 Example Implementation in mbeddr

In mbeddr we use a textual notation for feature models. Fig. 21.7 shows this notation for the stack feature model shown graphically above. Note how the constraint affects all children of a feature, so we had to introduce the intermediate feature options to separate mandatory features from optional features. Features can also have configuration attributes (of any type).

A configuration is a named set of selections from the features in a feature model. The selection has to be valid regarding the constraints defined in the feature model. Fig. 21.8 shows two example configurations. If an invalid configuration is created, errors will be shown in the configuration model.

Figure 21.7: An example feature model in mbeddr. Until MPS provides support for graphical notations (planned for 2013), we use a textual notation.

Figure 21.8: Two valid configurations of the feature model.

Presence conditions can be attached to any program element expressed in any language11, without this language having to know about it, thanks to MPS’ annotations (discussed in Section 16.2.7). For example the two report statements and the message list in the left program in Fig. 21.9 are only part of a product if the logging feature is selected in the product configuration. The background color of an annotated node is computed from the expression: annotated nodes using the same expression have the same color (an idea borrowed from Christian Kaestner’s CIDE12).

11 Note that the fact that you can make arbitrarily detailed program elements depend on features is not meant to imply that no further structuring of the product line is necessary, and all variability should be expressed via fine-grained presence conditions. Instead, presence conditions should be used to configure more coarse-grained entities such as the instantiation and wiring of components.

12 C. Kaestner. CIDE: Decomposing legacy applications into features. In Software Product Lines, 11th International Conference, SPLC 2007, Kyoto, Japan, September 10-14, 2007, Proceedings. Second Volume (Workshops), pages 149–150. Kindai Kagaku Sha Co. Ltd., Tokyo, Japan, 2007

It is possible to edit the program as a product line (with the annotations), undecorated (without annotations), as well as a specific product. Fig. 21.9 shows an example. During transformation, those parts of programs that are not in the product are removed from the model.
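The removal of annotated elements during transformation can be pictured with a small tree projection. This is only a sketch of the general idea, not mbeddr's or MPS' implementation; the Node class, the project function and the example program are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """A program element; `presence` is an optional condition over the selected features."""
    text: str
    children: List["Node"] = field(default_factory=list)
    presence: Optional[Callable[[set], bool]] = None


def project(node: Node, selected: set) -> Optional[Node]:
    """Return a copy of the tree without elements whose presence condition is false."""
    if node.presence is not None and not node.presence(selected):
        return None  # the element and all of its children are dropped
    kept = [c for c in (project(child, selected) for child in node.children) if c is not None]
    return Node(node.text, kept, node.presence)


def texts(node: Node) -> List[str]:
    """Flatten the tree to a list of element texts, depth first."""
    result = [node.text]
    for child in node.children:
        result.extend(texts(child))
    return result


def logging(features: set) -> bool:
    """Presence condition: the logging feature is selected."""
    return "logging" in features


# Hypothetical program, loosely modeled on the description of Fig. 21.9.
program = Node("main", [
    Node("report(start)", presence=logging),
    Node("compute()"),
    Node("message list", [Node("message start")], presence=logging),
])

if __name__ == "__main__":
    print(texts(project(program, set())))
    print(texts(project(program, {"logging"})))
```

Because the conditions are attached from the outside, the element classes themselves need no knowledge of features, which mirrors the role of annotations described above.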

Figure 21.9: Left: A C program with product line annotations. Right: The program rendered in the Production variant. Note how the program is "cut down" to include only those parts that are part of the variant, even in the editor.

21.3.2 Feature Models on Language Elements

Instead of using feature models to vary programs expressed with DSLs, the opposite approach is also possible. In this case, the primary product definition is done with DSLs. However, some language concepts have a feature model associated with them for detailed configuration. When the particular language concept is instantiated, a new ("empty") feature configuration is created, and can be configured by the application engineer.

21.3.3 Variations in the Transformation or Execution

When working with DSLs, the execution of models – by transformation, code generation or interpretation – is under the control of the domain engineer. The transformations or the interpreter can also be varied based on a feature model.

 Negative Variability via Removal The transformations or the interpreter can be annotated with presence conditions; the configuration happens before the transformations or the interpreter are executed.

 Branching The interpreter or the transformations can query over a feature configuration and then branch accordingly at runtime.

 Positive Variability via Superimposition Transformations or interpreters can be composed via superimposition before execution. For transformations, this is especially feasible if the transformation language is declarative, which means that the order in which the transformations are specified is irrelevant. Interpreters are usually procedural, object-oriented or functional programs, so declarativeness is hard to achieve in those.

 Positive Variability via Aspects If the transformation language or the interpreter implementation language supports aspect-oriented programming, then this can be used to configure the execution environment. For example, the Xpand code generation engine13 supports AOP for code generation templates.

13 wiki.eclipse.org/Xpand

Creating transformations with the help of other transformations, or by any of the above variability mechanisms, is also referred to as higher-order transformations14. Note that if a bootstrapped environment is used, the transformations are themselves models created with a transformation DSL. This case then reduces to just variation over models, as described in the previous subsection.

14 J. Oldevik and O. Haugen. Higher-order transformations for product lines. In SPLC, pages 243–254, 2007
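Two of the mechanisms from this subsection, branching and superimposition, can be sketched for a toy generator. The rule sets, the concept names and the generated C-like text are invented for illustration; the sketch assumes declarative rules keyed by concept, so that the order in which rule sets are combined does not matter.

```python
def gen_entity(name, features):
    """Transformation rule that branches on the feature configuration."""
    fields = ["int value;"]
    if "logging" in features:  # branching: query the configuration while generating
        fields.append("char *log;")
    return f"struct {name} {{ " + " ".join(fields) + " };"


def gen_queue(name, features):
    """Rule contributed by a separate, variant-specific rule set."""
    return f"struct {name}_queue {{ int size; }};"


BASE_RULES = {"entity": gen_entity}   # hypothetical base transformation
QUEUE_RULES = {"queue": gen_queue}    # hypothetical addition for a queueing feature


def superimpose(*rule_sets):
    """Compose rule sets; with one rule per concept the order is irrelevant."""
    merged = {}
    for rules in rule_sets:
        for concept, rule in rules.items():
            if concept in merged:
                raise ValueError(f"conflicting rules for concept {concept!r}")
            merged[concept] = rule
    return merged


def transform(model, rules, features):
    """Apply the rule registered for each (concept, name) pair in the model."""
    return [rules[concept](name, features) for concept, name in model]


if __name__ == "__main__":
    rules = superimpose(BASE_RULES, QUEUE_RULES)
    for line in transform([("entity", "Msg"), ("queue", "Msg")], rules, {"logging"}):
        print(line)
```

Rejecting conflicting rules keeps the composition order-independent; a procedural interpreter would not offer such a simple merge point, which is the difficulty noted above.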

21.4 From Feature Models to DSLs

A feature model is a compact representation of the features of the products in a product line, as well as the constraints imposed on combinations of these features in products. Feature models are an efficient formalism for configuration, i.e. for selecting a valid combination of features from the feature model. The set of products that can be defined by feature selection is fixed and finite: each valid combination of selected features constitutes a product. This means that all valid products have to be "designed into" the feature model, encoded in the features and the constraints among them. Some typical examples of things that can be modeled with feature models are the following:

• Does the communication system support encryption?

• Should the in-car entertainment system support MP3s?

• Should the system be optimized for performance or memory footprint?

• Should messages be queued? What is the queue size?

Because of the "select from set of options" metaphor, feature model-based configuration is simple to use – product definition is basically a decision tree. This makes product configuration efficient, and potentially accessible for stakeholders other than software developers. Also, as described by Batory15 and Czarnecki16, one advantage of feature models is that a mapping to logic exists. Using SAT solvers, it is possible to check, for example, whether a feature model has valid configurations at all. The technique can also be used to automatically complete partial configurations.

15 D. S. Batory. Feature models, grammars, and propositional formulas. In SPLC, pages 7–20, 2005

16 K. Czarnecki and A. Wasowski. Feature diagrams and logics: There and back again. In SPLC, pages 23–34, 2007

In the rest of this section we will discuss the limitations of feature models, in particular, that they are not suited for open-ended construction of product variants. Rather than giving up on models completely and using low-level programming, we should use DSLs. This avoids losing the differentiation between problem space and solution space, while still supporting more expressivity in the definition of a product.

As an example, we use a product line of water fountains, as found in recreational parks17. Fountains can have several basins, pumps and nozzles. Software is used to program the behavior of the pumps and valves to make the sprinkling waters aesthetically pleasing. The feature model in Fig. 21.10 represents valid hardware combinations for a simple water fountain product line. Each feature corresponds to the presence of a hardware component in a particular fountain installation.

17 This is an anonymized version of an actual project the author has been working on. The real domain was different, but the example languages presented in this chapter have been developed and used for that other domain.

Figure 21.10: Feature model for the simple fountains product line used as the example. Fountains have basins, with one or two nozzles, and an optional full sensor. In addition, fountains have a pump.

The real selling point of water fountains is their behavior. A fountain’s behavior determines how much water each pump should pump, at which time, with what power, or how a pump reacts when a certain condition is met, e.g., a basin is full. Expressing the full range of such behaviors is not possible with feature models. Feature models can be used to select among a fixed number of predefined behaviors, but approximating all possible behaviors would lead to unwieldy feature models.
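The mapping of feature models to logic mentioned earlier in this section can be made concrete for the fountain model. The sketch below encodes the constraints of Fig. 21.10, as described in its caption, as a Boolean predicate and uses brute-force enumeration as a stand-in for a SAT solver. The feature identifiers and the functions is_valid and completions are assumptions for this example.

```python
from itertools import product

# Features of the simple fountain model (identifiers are hypothetical).
FEATURES = ["basin", "pump", "full_sensor", "one_nozzle", "two_nozzles"]


def is_valid(cfg: dict) -> bool:
    """Propositional encoding of the feature model's constraints."""
    mandatory = cfg["basin"] and cfg["pump"]
    nozzle_xor = cfg["one_nozzle"] != cfg["two_nozzles"]  # exactly one of the two
    # full_sensor is optional, so it is left unconstrained.
    return mandatory and nozzle_xor


def completions(partial: dict):
    """Yield every valid full configuration that extends a partial one."""
    free = [f for f in FEATURES if f not in partial]
    for values in product([False, True], repeat=len(free)):
        cfg = {**partial, **dict(zip(free, values))}
        if is_valid(cfg):
            yield cfg


if __name__ == "__main__":
    print(len(list(completions({}))), "valid configurations")
    print(list(completions({"full_sensor": True, "two_nozzles": True})))
```

Enumeration is exponential in the number of features, so it only works for toy models; a SAT solver answers the same questions (is there any valid configuration, how can a partial selection be completed) for realistic model sizes.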

21.4.1 Feature Models as Grammars

To understand the limitations of feature models, we consider their relation to grammars. Feature models essentially correspond to context-free grammars without recursion18. For example, the feature model in Fig. 21.10 is equivalent to the following grammar19:

    Fountain -> Basin PUMP
    Basin -> ISFULLSENSOR? (ONENOZZLE | TWONOZZLES)

18 K. Czarnecki, S. Helsen, and U. W. Eisenecker. Formalizing cardinality-based feature models and their specialization. SOPR, 10(1):7–29, 2005

19 We use all caps to represent terminals, and camel-case identifiers as non-terminals.

This grammar represents a finite number of sentences: there are exactly four possible configurations, which correspond to the finite number of products in the product line. However, this formalism does not make sense for modeling behavior, for which there is typically an infinite range of variability. To accommodate unbounded variability, the formalism needs to be extended. Allowing recursive grammar productions is sufficient to model unbounded configuration spaces, but for convenience, we also consider attributes and references.

Attributes express properties of features. For example, the PUMP could have an integer attribute rpm, representing the power setting of the pump20.

    Fountain -> Basin PUMP(rpm:int)
    Basin -> ISFULLSENSOR? (ONENOZZLE | TWONOZZLES)

20 Some feature modeling tools support attributes on features. An example is pure::variants.
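The statement that the attribute-free fountain grammar has exactly four sentences can be checked by expanding it mechanically. The sketch below is one way to do that; the dictionary encoding, the helper non-terminal Nozzles (introduced to spell out the alternative) and the function sentences are assumptions of this example. The expansion terminates only because the grammar has no recursion.

```python
from itertools import product

# Each non-terminal maps to its alternatives; each alternative is a sequence of symbols.
# ISFULLSENSOR? is written out as two alternatives, with and without the sensor.
GRAMMAR = {
    "Fountain": [["Basin", "PUMP"]],
    "Basin": [["ISFULLSENSOR", "Nozzles"], ["Nozzles"]],
    "Nozzles": [["ONENOZZLE"], ["TWONOZZLES"]],  # helper non-terminal (assumption)
}


def sentences(symbol):
    """Return all terminal sequences derivable from `symbol` (non-recursive grammars only)."""
    if symbol not in GRAMMAR:  # a terminal derives only itself
        return [[symbol]]
    result = []
    for alternative in GRAMMAR[symbol]:
        parts = [sentences(s) for s in alternative]
        for combination in product(*parts):
            result.append([token for part in combination for token in part])
    return result


if __name__ == "__main__":
    for sentence in sentences("Fountain"):
        print(" ".join(sentence))
```

As soon as a production refers back to itself, directly or indirectly, this enumeration no longer terminates, which is exactly the step from a finite configuration space to an unbounded one.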

Recursive grammars can be used to model repetition21 and nesting. Nesting is necessary to model tree structures such as those occurring in expressions. The following grammar extends the fountain feature model with a Behavior, which consists of a number of Rules. The Basin can now have any number of Nozzles.

    Fountain -> Basin PUMP(rpm:int) Behavior
    Basin -> ISFULLSENSOR? NOZZLE*
    Behavior -> Rule*
    Rule -> CONDITION CONSEQUENCE

21 Repetition is also supported by cardinality-based feature models, as described in K. Czarnecki, S. Helsen, and U. W. Eisenecker. Formalizing cardinality-based feature models and their specialization. SOPR, 10(1):7–29, 2005

Figure 21.11: An extended feature modeling formalism is used to represent the example feature model with attributes, recursion and references (the dotted boxes).

References allow the creation of context-sensitive relations between parts of programs described by the grammar. For example, by further extending our fountain grammar we can describe a rule whose condition refers to the full attribute of the ISFULLSENSOR and whose consequence sets a PUMP’s rpm to zero.

    Fountain -> Basin id:PUMP(rpm:int)? Behavior
    Basin -> id:ISFULLSENSOR(full:boolean)? id:NOZZLE*
    Behavior -> Rule*

    Rule -> Condition Consequence
    Condition -> Expression
    Expression -> ATTRREFEXPRESSION | AndExpression | GreaterThanExpression | INTLITERAL;

    AndExpression -> Expression Expression
    GreaterThanExpression -> Expression Expression

    Consequence -> ATTRREFEXPRESSION Expression

Fig. 21.11 shows a possible rendering of the grammar with an enhanced feature modeling notation. We use cardinalities, as well as references to existing features; the latter are shown as dotted boxes. A valid configuration could be the one shown in Fig. 21.12. It shows a fountain with one basin, two nozzles named n1 and n2, one sensor s and a pump p. It contains a rule that expresses the condition that if the full attribute of s is set, and the rpm of pump p is greater than zero, then the rpm should be set to zero.

Figure 21.12: Example configuration using a tree notation. Referenceable identities are rendered as labels to the left of each box. The dotted lines represent references to variables.
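The configuration described for Fig. 21.12 can also be written down as a small object tree, which makes the role of references explicit. The following is only a sketch: the class names follow the grammar's concept names, but the dictionary used to resolve references by identity and the functions evaluate and apply_rule are assumptions, not part of the text.

```python
from dataclasses import dataclass


@dataclass
class AttrRef:
    """Reference to an attribute of a named element, e.g. the full attribute of s."""
    target: str
    attribute: str


@dataclass
class IntLiteral:
    value: int


@dataclass
class And:
    left: object
    right: object


@dataclass
class GreaterThan:
    left: object
    right: object


@dataclass
class Rule:
    condition: object
    target: AttrRef   # attribute assigned by the consequence
    value: object     # expression providing the new value


def evaluate(expr, state):
    """Evaluate an expression tree against the attribute values in `state`."""
    if isinstance(expr, IntLiteral):
        return expr.value
    if isinstance(expr, AttrRef):
        return state[expr.target][expr.attribute]  # resolve the reference by identity
    if isinstance(expr, And):
        return bool(evaluate(expr.left, state)) and bool(evaluate(expr.right, state))
    if isinstance(expr, GreaterThan):
        return evaluate(expr.left, state) > evaluate(expr.right, state)
    raise TypeError(f"unknown expression: {expr!r}")


def apply_rule(rule, state):
    """Execute the consequence if the condition holds."""
    if evaluate(rule.condition, state):
        state[rule.target.target][rule.target.attribute] = evaluate(rule.value, state)


# "If s is full and the rpm of p is greater than zero, set the rpm of p to zero."
rule = Rule(
    condition=And(AttrRef("s", "full"), GreaterThan(AttrRef("p", "rpm"), IntLiteral(0))),
    target=AttrRef("p", "rpm"),
    value=IntLiteral(0),
)

if __name__ == "__main__":
    state = {"s": {"full": True}, "p": {"rpm": 1200}}
    apply_rule(rule, state)
    print(state)
```

Building such a tree by hand is as cumbersome as drawing it, which is the practical problem a textual concrete syntax addresses.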

21.4.2 Domain-Specific Languages

While the extended grammar formalism discussed above enables us to cover the full range of behavior variability, the use of a graphical tree notation to instantiate these grammars is not practical. Another interpretation of these grammars is as the definition of a DSL – the tree in Fig. 21.12 looks like an abstract syntax tree (AST). To make the language readable we need to add concrete syntax definitions (keywords), as in the following extension of the fountain grammar:

    Fountain -> "fountain" Basin Pump Behavior
    Basin -> "basin" IsFullSensor Nozzle*
    Behavior -> Rule*

    Rule -> "if" Condition "then" Consequence
    Condition -> Expression
    Expression -> AttrRefExpression | AndExpression | GreaterThanExpression | IntLiteral;

    AndExpression -> Expression "&&" Expression
    GreaterThanExpression -> Expression ">" Expression
    AttrRefExpression ->
    IntLiteral -> (0..9)*

    Consequence -> AttrRefExpression "=" Expression

    IsFullSensor -> "sensor" ID (full:boolean)?
    Nozzle -> "nozzle" ID
    Pump -> "pump" ID (rpm:int)?

We can now write a program that uses a convenient textual notation, which is especially useful for the expressions in the rules. We have created a DSL for configuring the composition and behavior of fountains22.

    basin sensor s nozzle n1 nozzle n2
    pump p
    if s.full && p.rpm > 0 then p.rpm = 0

22 As we have discussed at length in this book, a complete language definition would also include typing rules and other constraints. However, to understand the difference in expressibility between DSLs and feature models, a fountain grammar is sufficient.
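To show that the textual rule maps back to the tree structure discussed before, here is a minimal hand-written parser for the rule part of the program. It is a sketch under several assumptions: attribute references are taken to have the form ID "." ID (inferred from the example program), ">" is assumed to bind more tightly than "&&" (the grammar as given does not fix precedence), and the tuple-based tree as well as the names tokenize, Parser and parse_rule are invented for this example.

```python
import re

# One token per match: operators, the dot, integers, identifiers and keywords.
TOKEN = re.compile(r"\s*(&&|>|=|\.|\d+|[A-Za-z_]\w*)")


def tokenize(text):
    """Split a rule into tokens, rejecting any character the sketch does not know."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.index = 0

    def peek(self):
        return self.tokens[self.index] if self.index < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        self.index += 1
        return token

    def rule(self):
        # Rule -> "if" Condition "then" Consequence
        self.take("if")
        condition = self.and_expr()
        self.take("then")
        target = self.attr_ref()
        self.take("=")
        return ("rule", condition, target, self.and_expr())

    def and_expr(self):
        left = self.comparison()
        while self.peek() == "&&":
            self.take()
            left = ("and", left, self.comparison())
        return left

    def comparison(self):
        left = self.atom()
        if self.peek() == ">":
            self.take()
            left = ("gt", left, self.atom())
        return left

    def atom(self):
        token = self.peek()
        if token is not None and token.isdigit():
            return ("int", int(self.take()))
        return self.attr_ref()

    def attr_ref(self):
        owner = self.take()
        self.take(".")
        return ("attr", owner, self.take())


def parse_rule(text):
    return Parser(tokenize(text)).rule()


if __name__ == "__main__":
    print(parse_rule("if s.full && p.rpm > 0 then p.rpm = 0"))
```

In a language workbench this parser would be derived from the grammar rather than written by hand; the sketch only makes visible that the concrete syntax adds keywords and operators around the same abstract structure.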

DSLs fill the gap between feature models and programming languages. They can be more expressive than feature models, but they are not as unrestricted and low-level as programming languages. Like programming languages, DSLs support construction, allowing the composition of an unlimited number of programs. Construction happens by instantiating language concepts, establishing relationships, and defining values for attributes. We do not know a priori all possible valid programs23. In contrast to programming languages, DSLs keep the distinction between problem space and solution space intact, since they consist of concepts and notations relevant to the problem domain. Non-programmers can continue to contribute directly to the product development process, without being exposed to implementation details.

23 This is in contrast to configuration, where users select from a limited set of options. Feature models support configuration.

21.4.3 Making Feature Models More Expressive

We described the limitations of the feature modeling approach above, and proposed DSLs as an alternative. However, the feature modeling community is working on alleviating some of these limitations.

For example, cardinality-based feature models24 support the multiple instantiation of feature subtrees. References between features could be established by using feature attributes typed with another feature – the value range would be the set of instances of this feature. Name references are an approximation of this approach.

24 K. Czarnecki, S. Helsen, and U. W. Eisenecker. Formalizing cardinality-based feature models and their specialization. SOPR, 10(1):7–29, 2005

Clafer25 combines meta modeling and feature modeling. In addition to providing a unified syntax and a semantics based on sets, Clafer also provides a mapping to SAT solvers to support validation of models. The following is an example Clafer26:

    abstract Person
      name : String
      firstname : String
      or Gender
        Male
        Female
      xor MaritalStatus
        Single
        Married
        Divorced
      Address
        Street : String
        City : String
        Country : String
        PostalCode : String
        State : String ?

    abstract WaitingLine
      participants -> Person *

25 K. Bak, K. Czarnecki, and A. Wasowski. Feature and meta-models in Clafer: Mixed, specialized, and coupled. In 3rd International Conference on Software Language Engineering, 2010

26 adapted from Michal Antkiewicz’ Concept Modeling Using Clafer tutorial at gsd.uwaterloo.ca/node/310

The code describes a concept Person with the following characteristics:

• A name and a first name of type String (similar to attributes).

• A gender, which is Male or Female, or both27 (similar to or-groups in feature models).

• A marital status, which is either single, married or divorced (similar to xor-groups in feature models).

• An Address (similar to composition in language definitions).

• An optional State attribute on the address (similar to optional features in feature modeling).

27 Don’t ask me why it’s an or and not a xor constraint, so you can be two genders at once. It is in the original example :-)

The code also shows a reference: a WaitingLine refers to any number of Persons.

Note however that an important ingredient for making DSLs work in practice is the domain-specific concrete syntax. None of the approaches mentioned in this section provide customizable syntax. However, approaches like Clafer are a very interesting backend for DSLs, to support analysis, validation and automatic creation of valid programs from partial configurations.
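The difference between the or-group, the xor-group and the optional attribute in the Person example can be spelled out as a small validity check. This is a plain Python sketch of the constraints as described in the list above, not Clafer's actual semantics or tooling; the dictionary layout and the function check_person are assumptions, and Address is assumed to be mandatory because it carries no "?".

```python
GENDERS = {"Male", "Female"}
STATUSES = {"Single", "Married", "Divorced"}
ADDRESS_REQUIRED = {"Street", "City", "Country", "PostalCode"}  # State is optional


def check_person(person: dict) -> list:
    """Return a list of constraint violations for a Person instance (empty if valid)."""
    errors = []
    for attribute in ("name", "firstname"):
        if not isinstance(person.get(attribute), str):
            errors.append(f"{attribute} must be a String")

    genders = set(person.get("gender", ()))
    if len(genders) < 1 or not genders <= GENDERS:      # or-group: one or more
        errors.append("Gender requires at least one of Male, Female")

    status = set(person.get("marital_status", ()))
    if len(status) != 1 or not status <= STATUSES:      # xor-group: exactly one
        errors.append("MaritalStatus requires exactly one of Single, Married, Divorced")

    address = person.get("address")
    if address is None:
        errors.append("Address is missing")
    else:
        missing = ADDRESS_REQUIRED - set(address)
        if missing:
            errors.append(f"Address lacks {sorted(missing)}")
    return errors


if __name__ == "__main__":
    person = {
        "name": "Doe", "firstname": "Jo",
        "gender": ["Male", "Female"],                   # allowed by the or-group
        "marital_status": ["Single"],
        "address": {"Street": "Main St", "City": "X", "Country": "Y", "PostalCode": "1"},
    }
    print(check_person(person))
```

A tool like Clafer derives such checks from the model and hands them to a solver, instead of requiring them to be written by hand as done here.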

21.5 Conceptual Mapping from PLE to DSLs

This section looks at the bigger picture of the relationship between PLE and DSLs. It contains a systematic mapping from the core concepts of PLE to the technical space of DSLs. First we briefly recap the core PLE concepts.

Core Assets designate reusable artifacts that are used in more than one product. As a consequence of their strategic relevance, they are usually high quality and maintained over time. Some of the core assets might have variation points.

A Variation Point is a well-defined location in a core asset where products differ from one another.

Kind of Variability classifies the degrees of freedom one has when binding the variation point. This ranges from setting a simple Boolean flag, through specifying a database URL or a DSL program, to a Java class hooked into a platform framework.

Binding Time denotes the point in time when the decision is made as to which alternative should be used for a variation point. Typical binding times include source time (changes to the source code are required), load time (bound when the system starts up) and runtime (the decision is made while the program is running).

The Platform is those core assets that actually form a part of the running system. Examples include libraries, frameworks or middleware.

Production Tools are core assets that are not part of the platform, but which are used during the (possibly) automated development of products.

Domain Engineering refers to activities in which the core assets are created. An important part of domain engineering is domain analysis, during which a fundamental understanding of the domain, its commonalities and variability is established.

Application Engineering is the phase in which the domain engineering artifacts are used to create products. Unless variation points use runtime binding, they are bound during this phase.

The Problem Space refers to the application domain in which the product line resides. The concepts found in the problem space are typically meaningful to non-programmers as well.

The Solution Space refers to the technical space that is used to implement the products. In the case of software product line engineering28, this space is software development. The platform lives in the solution space. The production tools create or adapt artifacts in the solution space based on a specification of a product in the problem space.

28 Product Lines often also contain non-software artifacts, often electronic hardware.

In the following sections we elaborate on how these concepts are realized when DSLs are used.

21.5.1 Variation Points and Kinds of Variability

This represents the core of the chapter and has been discussed extensively above: DSLs provide more expressivity than feature models, while not being as unrestricted as programming languages.

21.5.2 Domain Engineering and Application Engineering

As we develop an understanding of the domain, we classify the variability. If the variability at a particular variation point is suitable for DSLs (i.e. it cannot be expressed sensibly by pure configuration), we develop the actual languages together with the IDEs during domain engineering. The abstract syntax of the DSL constitutes a formal model of the variability found at the particular variation point29. The combination of several DSLs is often necessary. Different variation points may have different DSLs that must be used together to describe a complete product30.

29 This is similar to analysis models, with the advantage that DSLs are executable. Users can immediately express example domain structures or behavior and thereby validate the DSL. This should be exploited: language definition should proceed incrementally and iteratively, with user validation after each iteration. The example models created in this way should be kept around; they constitute unit tests for the language.

30 We discussed language composition in Section 4.6 and Chapter 16.

Application engineering involves using the DSLs to bind the respective variation points. The language definition, the constraints and the IDE guide the user along the degrees of freedom supported by the DSL.

21.5.3 Problem Space and Solution Space

DSLs can represent any domain. They can be technical, inspired by a library, framework or middleware, expected to be used by programmers and architects. DSLs can also cover application domains, inspired by the application logic for which the application is built. In this case they are expected to be used by application domain experts. In the case of application DSLs, the DSL resides in the problem space. For execution they are mapped to the solution space by the production tools. Technical DSLs can, however, also be part of the solution space. In this case, DSL programs may be created by the mapping of an application domain DSL to the solution space. It is also possible for technical DSLs to be used by developers as an annotation for the application domain DSLs, controlling the mapping to the solution space, or configuring some technical aspect of the solution directly.

21.5.4 Binding Time

DSL programs can either be transformed to executable code or interpreted. This maps to the binding times introduced above in the following way:

• If we generate source code that has to be compiled, packaged and deployed, the binding time is source. We speak of static variability, or static binding.

• If the DSL programs are interpreted, and the DSL programs can be changed as the system runs, this constitutes runtime binding, and we speak of dynamic variability.

• If we transform the DSL program into another formalism that is then interpreted by the running system, we are in the middle ground. Whether the variability is load-time or runtime depends on the details of how and when the result of the transformation is (re-)loaded into the running system.

21.5.5 Core Assets, Platform and the Production Tools

DSLs constitute core assets; they are used for many, and often all, of the products in the product line. It is however not easy to answer the question of whether they are part of the platform or the production tools:

• If the DSL programs are transformed, the transformation code is a production tool; it is used in the production of the products. The DSL or the models are not part of the running system.

• In the case of interpretation, the interpreter is part of the platform. Since it directly works with the DSL program, the language definition becomes a part of the platform as well.

• If we can change the DSL programs as the system runs, even the IDE for the DSL is part of the platform.

• If the DSL programs are transformed into another formalism that is in turn interpreted by the platform, then the transformations constitute production tools, and the interpreter of the target formalism is a part of the platform.

22 DSLs for Business Users

This chapter has been written by Intentional’s Mats Helander. You can reach him via [email protected].

In this chapter we will examine using DSLs for business professionals. The example is a system in the healthcare domain – essentially a system for defining questionnaires and the business rules to process them. A secondary purpose of this chapter is to provide an impression of Intentional Software’s technology for defining DSLs: the example system is built with the Intentional Domain Workbench.

22.1 Intentional Software

Intentional Software was one of the first companies to create a language workbench1, and their focus has been on business professionals and less on programmers as users for the DSLs2. Business professionals are often the source of domain knowledge. Today this knowledge has to be captured and explained to software engineers for it to be actionable. Agile principles help bridge this gap, but this communication gap remains the biggest obstacle in software development today. DSLs for business professionals have the potential to bridge this gap.

1 Intentional Software was started by Charles Simonyi after he left Microsoft. There he had been one of the early employees and served as the chief architect for Excel and Word, introducing the principle of WYSIWYG. During his later years he ran the Intentional Programming research project in Microsoft Research (which is described very well in Czarnecki and Eisenecker’s Generative Programming book: K. Czarnecki and U. Eisenecker. Generative Programming: Methods, Techniques and Applications. Addison-Wesley, 1999). His company, Intentional Software, continues to work on the ideas pioneered by Intentional Programming.

2 This has always been a focus of DSLs. However, as we will see in this chapter, focussing on non-programmers leads to different tradeoffs in the design of the languages and the tools.

22.2 The Project Challenge

This case study describes an application in which domain knowledge is captured and maintained directly by the domain experts using DSLs, validated at the domain level, and used for code generation to create an executable application3. The domain is tele-health, where patients with chronic conditions or diseases like diabetes, hypertension or obesity stay at home, and are provided with daily recommendations based on observed values of various daily measurements of the patient. A medical professional has defined which values to observe for each particular patient, and the rules for the daily individual recommendations based on those values4. The input from the patient at home is provided through sensors, medical devices and patient interactions with the system through mobile devices, set-top boxes or web interfaces. The system needs to be flexible enough to address the requirements of multiple health care providers that will have different sets of criteria for different patients.

3 The DSL is complete: no manual coding of "business logic" is required. If that were necessary, the premise of a DSL for business professionals would be infeasible.

4 This is not an expert system. All decisions are made originally by medical doctors.

The system described in this chapter replaces a legacy system developed using a traditional approach in which domain knowledge was captured in big Excel documents that encoded the physician’s rules. A typical rule looked like this:

    if WHtR < 46 and (LDL < 100 and No LDL Meds) and (SBP < 125 and No BP Meds) and (HgbA1c >= 6.5 and No Glucose Meds)

This Excel text should be interpreted as: if the patient has a waist-to-height ratio of less than 46 and a cholesterol LDL level below 100 and does not take LDL medications and the systolic blood pressure level is less than 125 and does not take blood pressure medication and the hemoglobin A1c test is greater than or equal to 6.5 and does not take glucose medication then .
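Read as a program, the condition of this rule is a plain Boolean predicate over the patient's observations. The sketch below expresses only the condition, since the consequence is not given here; the Observations class, its field names and the function rule_matches are hypothetical, and the units are simply those implied by the thresholds in the rule.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Observed values for one patient (field names are assumptions for this sketch)."""
    whtr: float            # WHtR, in the unit used by the rule (threshold 46)
    ldl: float             # LDL cholesterol level
    sbp: float             # systolic blood pressure
    hgba1c: float          # hemoglobin A1c test result
    on_ldl_meds: bool
    on_bp_meds: bool
    on_glucose_meds: bool


def rule_matches(o: Observations) -> bool:
    """Condition of the example rule, clause by clause as in the spreadsheet text."""
    return (
        o.whtr < 46
        and (o.ldl < 100 and not o.on_ldl_meds)
        and (o.sbp < 125 and not o.on_bp_meds)
        and (o.hgba1c >= 6.5 and not o.on_glucose_meds)
    )


if __name__ == "__main__":
    patient = Observations(whtr=44, ldl=90, sbp=120, hgba1c=6.8,
                           on_ldl_meds=False, on_bp_meds=False, on_glucose_meds=False)
    print(rule_matches(patient))
```

Each clause pairs a threshold with a medication flag, so a rule set that enumerates every combination of such clauses explicitly grows quickly with the number of observed attributes.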

The Excel spreadsheet had hundreds of rules like this. The repetition resulting from the lack of abstractions available to the rules programmer meant that for each new observable attribute the number of rules doubled5. Each rule was then transformed by a programmer into rules for a Drools rules engine. The patient data had a similar workflow, in which information for the patient-recorded data was also captured in Excel sheets. Once this information was confirmed with the doctor, XML documents were created for this data to feed a custom web application to be used by the patient to fill in the data.

5 The lack of type checks, testing, refactorings, and all the other amenities we are used to from a real IDE also hampered productivity and maintainability.

The medical professional was overwhelmed with the complexity. It was clear that the doctors knew exactly what intentions they wanted to express, but the complexity of expressing them became a big bottleneck. Furthermore, when the doctor wanted to add or make any changes to the application, the change had to go through a convoluted process, with limited traceability, to update XML documents, Drools rules, database schemas and other application-dependent logic.

22.3 The DSL-Based Solution

22.3.1 Intentional Domain Workbench

Intentional Software provides a knowledge processing platform to allow business professionals to turn their specialized expertise into software. The development environment, the Intentional Domain Workbench (IDW), is a language workbench for building DSL-oriented applications for business users. These applications can be run stand-alone, and can optionally also generate applications using various languages and runtimes (such as XML and Drools in this example).

The Intentional platform provides a number of key technologies that make the DSLs especially suited for business users. In particular, this includes a projectional editor that allows languages to be edited in multiple syntactical forms, and with multiple semantic interpretations. It can use and mix textual, tabular and graphical notations to approximate the needs of a business domain as closely as possible6. The projections of a language can potentially be ambiguous, but that does not cause a problem, because they are just projections of an underlying consistent representation, and a user can always switch to another projection to resolve any ambiguity. The platform also allows for combination and interaction across languages. A single projection can integrate knowledge represented in multiple disparate languages.

6 As we will see, it also supports Word document-like headings, and more generally, looks a bit like Microsoft Office. This also helps acceptance with business users.

22.3.2 Overview of the Solution

The purpose of the custom language workbench application examined in this case study is to let business experts edit questionnaire definitions that are used as input to a web application that in turn allows end users to fill out their answers. Fig. 22.1 shows an example definition of a questionnaire.

Figure 22.1: An example questionnaire, as seen and edited by the medical professional. Questionnaires are essentially trees of questions with the possible answers, as well as dependencies between questions (when answer is...).

In addition to defining the questions, the medical professional can also define business rules that should be applied to the questionnaires, as well as tests to ensure that the business rules are working correctly. Fig. 22.2 shows an example of such rules; we will get back to testing later.

22.3.3 Implementation

To implement this, we have used IDW to define a set of domain schemas7 along with logic for validation, transformations, evaluation, code generation and projectional editors. All of these concerns are implemented with a custom language supported by IDW that extends C# with additional operators and keywords that are useful for working with tree structures8. The language also contains several dedicated DSLs for defining domain schemas, validators or projections9. The result of compiling the language definition is a custom workbench: a standalone Windows application that lets the business experts edit the defined domains in a projectional editor where all the rules for validation, projection layout and such are applied. Fig. 22.2 shows the editor for business rules with definition expressions, assessment tables, choice lists and results.

As its output the workbench in this case study generates files that are fed into a web application that executes the questionnaires and applies the business rules10. The web application itself is developed separately and consists of web pages with JavaScript that consumes the XML files generated by the workbench. The JavaScript then uses these XML files to produce a dynamic user interface11. The workbench also generates business rule files in a format that the Drools business rule engine can consume, and the web application can in turn call the Drools engine to access the running rules.

7 A schema is the structure definition of a domain. It roughly corresponds to the abstract syntax or meta model, even though the meta meta model of the IDW is substantially different from EMF or MPS' structure definition.

8 This approach is similar to MPS: MPS' BaseLanguage extends Java with constructs useful for working on the meta level of languages.

9 This is similar to MPS and Spoofax, which also come with a set of DSLs specific to various aspects of language definition.

10 The patient interacts with this web application in addition to the sensors mentioned earlier.

11 The web application acts as an interpreter for an intermediate language whose serialization format is XML.

Figure 22.2: This screenshot shows how the medical professional sees and edits business rules for questionnaire questions. In this example, the body mass index is calculated.

Domain Structure

The IDW is very suitable for modularizing, reusing and composing languages ("domains" in the terminology of Intentional Software). Consequently, the application consists of several domains, some of them specific to the application discussed here, others more general12.

We use two domains that are motivated by the underlying technology: to generate the XML, we employ a reusable XHTML domain that comes with IDW. To generate the Drools rules, we have created a Drools domain (which may be reused for other applications in the future).

Similarly, the domains that are closer to the business domain are also modularized. The medical professionals in this case study have a particular subject they want to create questionnaires about, but the questionnaire domain itself is general and has high potential for reuse. The business rules are also general enough to be reused on their own, independent of the questionnaires. This results in two main domains: the questionnaire domain and the business rule domain. These are in turn divided into subdomains to allow selection of features to reuse. We then complement this with an adapter domain that includes the reusable questionnaire and business rule domains, and defines how they should work together. Finally, we have an overarching domain for the application that we call Intentional Health Workbench (IHW), which adapts the combined questionnaire and business rule domains to the particular customer requirements.

12 Some are generalized because future reuse is anticipated, others are existing languages reused in this application.

In total we end up with ten domains (Fig. 22.3 shows an overview of the relationships between them):

FitBase: The generic questionnaire domain13. Contains abstractions such as interviews, questions and answers.
FitRunner: In-workbench execution of the generic questionnaire domain FitBase, allowing the business expert editing the questionnaires to experiment with filling out answers inside the workbench.
FitSimple: A simplification of the generic questionnaire domain FitBase to a subset suitable for combination with the business rules domain and intuitive editing.
RulesEngine: The generic business rule domain, with table-style editing and in-workbench evaluation of business rules.
RulesChecking: Consistency validation of the rules in the generic business rule domain RulesEngine.
RulesCompiler: Generates the business rules from RulesEngine to files that the Drools business rule engine can use.
FitSimpleWithRules: Combines the simplified subset of the questionnaire domain FitSimple with the generic business rule domain RulesEngine.
Drools: Provides abstractions from the Drools business rules engine domain. Supports generation to the Drools file format.
XHTML: Provides abstractions from the XML and HTML domains. Supports generation of XHTML and XML files.
IHW: The workbench that ties all the other domains together. When compiled, this results in the workbench application that lets business users edit questionnaires and business rules, test them and generate output for the web application and the Drools business rule engine.

13 The prefix "Fit" stands for Forms, Interview, Tables.

Figure 22.3: Dependencies between the ten domains that make up the system described in this chapter. The arrows represent an includes relationship. Like the extends relationship in MPS, includes is generic in the sense that it may be an actual include in terms of language concepts, or it represents a generic dependency. An example is the RulesCompiler. Its relationship with the Drools domain captures the fact that it generates Drools rules.
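The includes relationships between the domains form a directed graph, and it can help to see that graph as data. The following Python sketch is not IDW code. The edge list is an assumption inferred from the domain descriptions above (the figure itself is not reproduced here), and the helper names build_order and transitive_includes are hypothetical.

```python
from graphlib import TopologicalSorter

# Assumed edge list: each domain maps to the domains it includes.
# Inferred from the textual descriptions, not copied from Fig. 22.3.
INCLUDES = {
    "FitBase": set(),
    "RulesEngine": set(),
    "Drools": set(),
    "XHTML": set(),
    "FitRunner": {"FitBase"},
    "FitSimple": {"FitBase"},
    "RulesChecking": {"RulesEngine"},
    "RulesCompiler": {"RulesEngine", "Drools"},
    "FitSimpleWithRules": {"FitSimple", "RulesEngine"},
    "IHW": {"FitRunner", "FitSimpleWithRules", "RulesChecking",
            "RulesCompiler", "XHTML"},
}


def build_order(includes):
    """Order the domains so that each one comes after everything it includes."""
    return list(TopologicalSorter(includes).static_order())


def transitive_includes(domain, includes):
    """Collect every domain reachable from `domain` via includes edges."""
    seen = set()
    stack = list(includes[domain])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(includes[current])
    return seen
```

Under these assumed edges, IHW transitively includes the other nine domains, which matches its role as the domain that ties everything together.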

Defining a Domain

The schema for each language is defined using a DSL for schema definition. Because no parser is involved, we only have to define the data structure of the tree that the user will edit. IDW provides a default projection for all domains until you create custom projections, so you can start editing and experimenting with your structures inside the editor as soon as you have defined them14.

Defining a schema for a domain is all about deciding what types of nodes there may be in the tree structure and what types of child nodes to expect under them. To define the tree structure schema for a domain, we use the keywords domaindef, def and fielddef. A domaindef is used for defining a new domain, def defines a new type of node that can be used in the domain15 and fielddef defines a field under a def where new child nodes can be added.

While defs and fielddefs are similar to EClasses and EFeatures in EMF (and consequently also quite similar to MPS' structure definition), there are a few differences. For example, a fielddef can be assigned more than one type. In EMF, accepting a range of types in a field would require the creation of a supertype that the field would use as its type. A fielddef will take a list of types that are all considered acceptable. If the same list of types is used in several places, we can package them in a reusable way using the typedef keyword. We can also reuse field definitions in multiple defs with the includefield keyword, potentially overriding (limiting, extending) their type16.

14 This is a very useful feature, because it allows the incremental definition of concrete syntax. Also, if a concrete syntax definition is broken (in the sense that it has a bug) or is ambiguous, the program tree can always be unambiguously rendered in this tree-view-like default notation (even if that notation is not necessarily as elegant as a custom projection).

15 It is essentially what's called a language concept in this book.

16 In all the other tools described in this book, fields are always owned by a language concept. In IDW they can stand on their own and can be included in defs. This provides an additional level of modularization and reuse. This design also allows associating additional specifications with a field, such as constraints or projections. These can be reused along with the field.

Figure 22.4: A domaindef defines a new domain/language. It can include other domains, making their contents available to the domain. A def defines a new language concept. defs contain fields. New fields are defined with fielddef, existing fields are included via includefield. Each field defines a shape (list, set, single element) and a type (one or more other defs). When including a field, these can be overridden. The main field is reused by many defs and has special editing support. virtualdefs use a match pattern to select existing nodes. Based on this virtualdef, projection rules or other aspects can be defined.

As we are working with tree structures, the default relationship between a node and its child node under a field is containment. The Question def, for example, has an answer fielddef with the Answer def as its type. Just using a def directly implies containment. Let us now look at references.

The Category def reuses the main field (a commonly reused fielddef that comes with IDW) and overrides its type; it expects those types listed in the QuestionHierarchy typedef. When we look for the definitions of the types in that list we discover that two of them are not defs, but use the virtualdef keyword. A virtualdef can use an arbitrary match pattern (not just ref). This allows a new virtual def to be assigned to any node that matches the match clause. You can then define projections or constraints for this new virtualdef. They will apply to all nodes that match the match clause in the virtualdef. In this case DvRefQuestion defines a type for references to Question nodes, and DvRefCategory defines a type for references to Category nodes, allowing questions and categories to be reused in multiple places.
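To make the schema concepts more tangible, here is a minimal Python sketch of multi-typed fields, typedef-style type lists, includefield-style overriding and virtualdef matching. It is not IDW's schema DSL. The class names, the contents of the QuestionHierarchy list beyond the two virtualdefs, the "Any" placeholder type and the way a reference is represented (a node with a ref attribute) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    isa: str                           # name of the def this node instantiates
    ref: Optional["Node"] = None       # assumed encoding of a reference node


@dataclass
class FieldDef:
    name: str
    types: List[str]                   # several acceptable types, no supertype needed

    def include(self, types=None):
        """Mimics includefield: reuse the field, optionally overriding its types."""
        return FieldDef(self.name, list(self.types if types is None else types))


@dataclass
class Def:
    name: str
    fields: Dict[str, FieldDef] = field(default_factory=dict)


@dataclass
class VirtualDef:
    name: str
    match: Callable[[Node], bool]      # arbitrary match pattern


def type_names(node, virtualdefs):
    """The node's def name plus every virtualdef whose match clause accepts it."""
    return {node.isa} | {v.name for v in virtualdefs if v.match(node)}


def accepts(fielddef, node, virtualdefs):
    """True if the node may be placed under the given field."""
    return bool(type_names(node, virtualdefs) & set(fielddef.types))


# A reusable field and a typedef-like list of types (list contents assumed).
MAIN = FieldDef("main", ["Any"])
QUESTION_HIERARCHY = ["Question", "Category", "DvRefQuestion", "DvRefCategory"]

VIRTUALDEFS = [
    VirtualDef("DvRefQuestion", lambda n: n.ref is not None and n.ref.isa == "Question"),
    VirtualDef("DvRefCategory", lambda n: n.ref is not None and n.ref.isa == "Category"),
]

# Category includes the main field and overrides its type.
CATEGORY = Def("Category", {"main": MAIN.include(QUESTION_HIERARCHY)})

question = Node("Question")
ref_to_question = Node("Ref", ref=question)
answer = Node("Answer")
```

In this sketch, a Question node and a reference to a Question are both acceptable under the main field of a Category, while an Answer node is not. The original main field keeps its own type list, because including a field creates an overridden copy.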

Constraints and Behavior

Defining the basic shape of the tree that users should edit will usually be done in a declarative fashion using the schema DSL. However, in many cases, additional constraints are required. These are typically implemented in validators. Validators can enforce scoping rules (that the referenced variable is available in the scope), scan for illegal names or naming collisions, and ensure that any number of domain-specific business rules for the DSL are adhered to by users when they edit their DSL code. Fig. 22.5 shows a simple validator that ensures that the length of a name does not exceed a specified maximum.

Figure 22.5: This constraint checks for the maximum length of all kinds of names. Notice how the error message itself is represented as a def: it is of category Message. cat represents structural subtyping, in the sense that all the fields of Message are also added to Category.

Here are some more examples of constraints: "categories should not be nested more than five levels deep", "questions may not be modified once they are published" or "negative answer options should be displayed in red".

The first constraint, about level nesting, could be implemented using the DSL for writing validators that comes with IDW17. A code snippet showing how such a validator could be implemented is shown below.

    validator implfor: FitSimpleWithRulesets {
      def Category too deeply nested cat: Message { }

      validatenode : sequence(Book) procedure Vnode(var Dmx dmx, var Node node) {
        code:
          var level = 0;
          var parent = node.Parent;
          while (parent != null) {
            if (parent.Isa == ‘(Category)) { level++; }
            parent = parent.Parent;
          }
          assert level < 5 else Category too deeply nested;
      }
    }

17 Validators run over the tree structure as it is edited and assert conditions that go beyond what is reasonable to define in the declarative schema language, such as a recursive check to determine nesting levels. When a validator fails, an error message is produced, which is shown together with any error messages that the system generates if the user breaks the schema constraints defined in the schema DSL.
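For readers who do not have IDW at hand, the same nesting check can be tried out in plain Python. This is only a sketch of the logic, not of the validator DSL: the Node class, the helper names and the nest function used to build sample trees are hypothetical.

```python
class Node:
    """A minimal tree node: a type name and a link to the parent node."""

    def __init__(self, isa, parent=None):
        self.isa = isa
        self.parent = parent


MAX_CATEGORY_ANCESTORS = 5  # mirrors "assert level < 5" in the snippet above


def category_ancestors(node):
    """Count the Category nodes on the path from the node's parent to the root."""
    level = 0
    parent = node.parent
    while parent is not None:
        if parent.isa == "Category":
            level += 1
        parent = parent.parent
    return level


def validate_node(node):
    """Return a list of error messages; the list is empty when the node is valid."""
    if category_ancestors(node) < MAX_CATEGORY_ANCESTORS:
        return []
    return ["Category too deeply nested"]


def nest(depth):
    """Build a chain of `depth` nested categories and return the innermost one."""
    node = Node("Category")
    for _ in range(depth - 1):
        node = Node("Category", parent=node)
    return node
```

With this counting scheme, the innermost category of a five-level chain has four Category ancestors and passes, while the innermost category of a six-level chain has five and is rejected.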

The second constraint, about preventing modification to published questions, could be implemented using the DSL for behaviors. Behaviors are a bit like database triggers, in that they contain code that is triggered to run on events that signal changes to the tree structure, such as when nodes are inserted, modified or deleted. In this case we could use behaviors to associate code that should be run on modify and delete events for Question instances. The code would check if the question has been published and if so, prevent the modification or delete operation from executing. The following code snippet shows a possible implementation18.

    def behavior implfor: Question {
      overproc: Execres procedure CanEdit() {
        if (rh->published) {
          return Error("May not modify published question!");
        }
        return Success();
      }
    }

18 overproc overrides an existing, predefined procedure. Execres is a type that signifies whether something is successful or not. Error and Success are subtypes. rh is essentially the this pointer.
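The trigger-like nature of behaviors can be illustrated with a small Python sketch in which every modify or delete operation first asks the affected node whether it may be edited. The Questionnaire container and its modify and delete methods are hypothetical stand-ins for the editor events; only the published check and the error text are taken from the snippet above.

```python
class Question:
    def __init__(self, text, published=False):
        self.text = text
        self.published = published

    def can_edit(self):
        """Counterpart of CanEdit(): returns (ok, message)."""
        if self.published:
            return False, "May not modify published question!"
        return True, ""


class Questionnaire:
    """Hypothetical container that consults can_edit() before changing the tree."""

    def __init__(self):
        self.questions = []

    def modify(self, question, new_text):
        ok, message = question.can_edit()
        if ok:
            question.text = new_text
        return ok, message

    def delete(self, question):
        ok, message = question.can_edit()
        if ok:
            self.questions.remove(question)
        return ok, message
```

The point of the design is that the check lives with the Question type rather than with each editing operation, so every operation that changes a question is guarded in the same way.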

The third constraint, about showing negative answer options in red, could be implemented in the presentation for the AnswerOption nodes using IDW's DSL for defining projections. The code responsible for showing the AnswerOption node on screen would simply use the C# if statement to check whether the option is negative (such as the No answer option to a Yes/No question) and if so, present the node on screen using a red color. We will see examples of code using conditionals in a projection later on.

While we would use three different DSLs to define the three constraints described above, we would also mix that DSL code with standard C# code. The DSLs that come with IDW extend C#, so in addition to the DSL keywords for defining schemas, validators, behaviors and projections, it is also possible to write standard C# classes, and even to mix C# code into declarative DSLs, such as the projection DSL. Some DSLs, such as the validator and behavior DSLs, expect to be implemented using C# and have no declarative way to be implemented. The projection for C# code uses a textual notation that looks basically like standard C#, but because it is a tree projection, albeit one that looks like text, there are a few differences from what the same code would look like in a text editor. Consider the following example code:

    program RulesChecking using: [RulesEngine, FitBase, FitSimple, Validation, Gen, DL, Core, mscorlib, System] {
      region RulesTesting {

        [SerializableAttribute()]
        public class DcsUnique : Dcs {
          static var int counter = 0;
          var int index = 0;

          public constructor DcsUnique() {
            this.index = counter++;
          }

          public override bool procedure Equals( var Object obj ) {
            var DcsUnique that = obj as DcsUnique;
            return that != null && Equals(this.index, that.index);
          }

          public override int procedure GetHashCode() {
            return this.index;
          }

          public Kselection property kselection { get { return Crown; } }
          public Kend property kend { get { return Nil; } }
        }
      }
    }

In contrast to C#, there is the procedure keyword. This is shown in the projection simply to give the user something to click on if they want to select the whole procedure (or method, as it is more commonly called in C#)19. Clicking on the public keyword lets the user change that keyword to, for example, private, and lets the user enter additional modifiers such as static. Clicking on the name lets the user change the name. But if the user wants to delete the whole method, they just click on the procedure keyword to select the whole method and hit the Delete key. In the tree structure, the name, the modifiers and the whole method body are child nodes contained by the procedure node, so deleting that node will delete all the contained child nodes as well. The constructor keyword is there for the same reason – something to click on to select the whole thing – as is the var keyword in the field definitions. When generated to C# source code for compilation, these additional keywords are not included in the output.

19 In IDW, the procedure keyword would be known as the crown of a subtree.

Another use case for validators is to verify the types in expressions edited by users. Depending on the DSL, the expression 1 + True may or may not be illegal, but many languages would prevent the addition of a Boolean value to an integer. IDW includes a DSL for defining the rules for the type calculus in a mix of declarative and C# code, and uses recursive evaluation to determine the resulting type from an expression. The validator will then call the recursive IDW type calculator, and if a problem is discovered an appropriate error shows up in the error pane. In this customer case the workbench has a lot of expressions in the business rules and they are all validated for type consistency.
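The idea of a recursive type calculation that a validator calls can be sketched in a few lines of Python. This is not IDW's type calculus DSL. The expression encoding (nested tuples), the three operators and the error wording are assumptions chosen to keep the example small.

```python
class TypeCheckError(Exception):
    """Raised when an expression has no valid type."""


def type_of(expr):
    """Recursively compute the type of an expression.

    An expression is an int or bool literal, or a tuple (op, left, right).
    """
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(expr, bool):
        return "bool"
    if isinstance(expr, int):
        return "int"

    op, left, right = expr
    left_type, right_type = type_of(left), type_of(right)

    if op == "+":
        if left_type == right_type == "int":
            return "int"
    elif op == "<":
        if left_type == right_type == "int":
            return "bool"
    elif op == "and":
        if left_type == right_type == "bool":
            return "bool"
    else:
        raise TypeCheckError(f"unknown operator {op!r}")

    raise TypeCheckError(
        f"operator {op!r} cannot be applied to {left_type} and {right_type}"
    )


def validate(expr):
    """Validator entry point: returns the error messages for an expression."""
    try:
        type_of(expr)
        return []
    except TypeCheckError as error:
        return [str(error)]
```

In this sketch the expression 1 + True is rejected, because the rule for + only accepts two integers. A language that wanted to allow it would change the rule, not the validator.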

Projection

The ability to write C# is not only useful when writing utility classes; several of the DSLs included with IDW support the ability to mix C# code into the DSL code. The projections are one example, where some projections are written in an entirely declarative manner using just the keywords from the projection DSL, while others make use of mixed-in C# to produce dynamic behaviors. Before looking at examples of such mixed code we will examine a couple of purely declarative projections first.

Each def20 (Category, Question, or Answer) comes with its own projection rules. The projection of the overall tree is then a composition of the projections of all involved nodes. The projection for each type is defined in a declarative fashion, where a template is specified that defines how nodes of that type should be presented to the user (Fig. 22.6). The parts with gray background in Fig. 22.6 constitute the template, whereas the parts with white background are either references to fields that should be projected in place or full blocks of imperative code.

Figure 22.6: The projection rules for Category and Question. The one for Category creates an AChapter (which is rendered with large, bold font text) whose text is made up of the name of the category (cf. the names field under the name), and then a colon, with no space separating the two. In the chapter's own main field we put the main field of the Category (which contains all the questions in that category).

Projection works by mapping domain defs to concepts from the Abstract Projection Language (whose concepts all have names beginning with A to make them easily identifiable). These concepts are then transformed further, until, at least conceptually, we arrive at the level of pixels on the screen21. Some of the A constructs are quite primitive, such as AVert, which only specifies that its contents should be displayed as a vertical list, or ASeq, which specifies that the contents should be presented in a sequence – horizontal or vertical is up to the presentation engine and depends on available screen real estate. Others are more high-level, such as AChapter, which presents its contents in the form of a word processor-style chapter (bold text, optional chapter numbering and indentation, etc). To project something as a graph, we just have to use the AGraph, AGraphNode and AGraphEdge constructs. To project something as a table, we use ATable, ARow and ACell. AImage displays a bitmap image. AButton and AHyperLink make it possible to add buttons and links to the projections that execute C# code or move focus to a different place in the projection when clicked, providing an alternative to having the user type everything in with the keyboard22.

20 Projections can also be defined for virtualdefs. This allows nodes in a specific context to be projected differently.

21 As a language developer, you don't have to care about anything below the A level. You just map your domain concepts to concepts from the A language.

22 Of course developers can define new, higher level projection concepts that are in turn transformed to A concepts. It is also possible to define new concepts on the abstraction level of A and then project them manually to the next finer level.

Each A Language construct has a number of fields where values can be entered in the template. Sometimes this will be literal information that should be displayed, such as the string literals "Question:" and "answer:" in the projection for the Question def. Other literals control the behavior of the projection, such as the True value under the Indent_Chapters field in the AChapter projection for the Category def. To make the child nodes of the projected type show up in the projection, we just put a reference to the relevant fielddefs in the appropriate places in the projection definition23.

Templates are a good fit for a declarative DSL, because projections can often be defined in an entirely declarative way. When there is demand for dynamic behavior, we can break out of the declarative context of the template using the BackQuote() function: standard C# can be entered inside it. The C# code should end by returning a new "piece of tree" that is inserted into the hosting template in the place of the BackQuote. A new piece of tree can be created with the BookQuote()24, inside of which declarative structures can be created.

There are many cases in which dynamic behaviors in projections are useful. Common examples include changing the color depending on the displayed values, showing or hiding some of the values depending on editing modes or access rights, and even displaying the results from dynamic evaluation of expressions and values that the end users type in25.

23 These references get a white background in the definition where the rest is gray.

24 A Book is basically a subtree literal that can be inserted into another tree.

25 This can be used nicely for hooking interpreters that execute tests based on the data entered by the user.

Dynamic Schemas

Another case is when the DSLs that end users edit influence each other dynamically, such as when one DSL is the schema language for a second DSL. Consider for example an Entities DSL in which users can define entity types with attributes. A second DSL allows users to define instances of the entities, specifying values for the attributes. The schema language allows this by letting us hook in C# code to dynamically determine the fields that the schema should consider under a def or a virtualdef.

Let us look at an example. When creating a program expressed in the second DSL, a user may want to create an instance of the Person entity defined with the first DSL. The Person entity in turn contains firstName and lastName attributes. The editor should then read the definition for the Person entity and go on to present two fields under the new instance, label them firstName and lastName, and let the user enter the names for their new Person instance. This works by hooking in code into the Instances DSL that returns fielddefs for each attribute under the entity referenced in the type field26 of the instance, and potentially from any supertypes of that entity. The IDW default projection would detect this and present firstName and lastName fields ready to be edited under a Person instance. In a custom projection dynamic code would be used to iterate over the appropriate fields and create projections for them.

26 Here we assume that each Instance in the instances DSL references the Entity it instantiates in a field called type.

In the case of the workbench in this case study we have a Rule def, which has one fielddef called outcome that is declaratively defined in the standard schema DSL; the rest of its fields are determined dynamically, as described above. In the projection we want to display each rule as a row in a table and each dynamic field under a rule as its own cell. The outcome field should also get its own cell, which is defined in the declarative way in the template, but for the dynamic fields we have to break out from the declarative template context and write some C# code. Fig. 22.7 shows the respective code.
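The Entities and Instances example can be mimicked in Python to show what "dynamically determined fields" means in practice. This is a sketch under assumptions, not IDW code: the class names, the dynamic_fields and default_projection helpers, and the Employee entity used to demonstrate supertypes are hypothetical.

```python
class Entity:
    """An entity type from the first DSL: a name, attributes, optional supertype."""

    def __init__(self, name, attributes, supertype=None):
        self.name = name
        self.attributes = list(attributes)
        self.supertype = supertype


class Instance:
    """An instance from the second DSL; `type` references the instantiated Entity."""

    def __init__(self, entity):
        self.type = entity
        self.values = {}


def dynamic_fields(instance):
    """Compute the fields expected under an instance from its entity definition.

    Attributes inherited from supertypes come first, then the entity's own.
    """
    chain = []
    entity = instance.type
    while entity is not None:
        chain.append(entity)
        entity = entity.supertype

    fields = []
    for entity in reversed(chain):
        fields.extend(entity.attributes)
    return fields


def default_projection(instance):
    """Render one labelled line per dynamically determined field."""
    return [f"{name}: {instance.values.get(name, '')}"
            for name in dynamic_fields(instance)]


person = Entity("Person", ["firstName", "lastName"])
employee = Entity("Employee", ["salary"], supertype=person)  # hypothetical subtype
```

Because the field list is computed from the entity each time it is needed, adding an attribute to Person immediately changes what the editor would offer under every Person instance.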

Figure 22.7: The projection rule for a Rule. The gray parts are declarative templates that are inserted "as is" into the projected screen representation of a Rule. The white parts, and in particular, the BackQuotes, use procedural C# code to dynamically project a part of the tree. Inside this procedural code, BookQuotes are used to escape to literal mode, where, again in gray, we construct little "pieces of tree" (called Books in IDW) which are in turn inserted into the resulting projection.

The schema of the projected node can be accessed with the ex- pression ehi.GetDmx().Rgdf(), where ehi is the input node to the projection, GetDmx() retrieves domain context informa- tion about it and Rgdf() returns the fields that are expected under the node. Normally Rgdf() will only return the fields we have declared in the schema DSL, but in this case it has been overridden for the Rule def to return a set of fields that 554 dslbook.org

are determined dynamically by other input to the Rule DSL. The C# code in the projection definition for the Rule (shown in Fig. 22.7) iterates over the fields that should go under the Rule def according to the schema and our overriding code, then uses the BookQuote() function to create a piece of tree with an 27 But taking care to avoid doing so for 27 ACell in each one . A simple C# expression (ehi.Index + 1) the outcome fielddef, which is already is also used to display the row index in a leading cell for each projected in the declarative part of the projection rule. row. The ability to mix C# into projections opens up the possi- 28 In fact, projections, transformations bility of creating very powerful dynamic projections including and code generation work essentially the same way in IDW. As we will see DSL evaluation, and even running of test cases for a DSL di- later, code generators in IDW are im- rectly in the editor for that DSL. Projections can also be com- plemented as transformations between a source domain and a target domain. bined with transformations, such that the tree structure edited Projections are in turn just transforma- by the user undergoes a series of transformations before being tions that have the Abstract Projection projected onto the screen28. These transformations are two- Language as their target domain. There are two differences. First, projections way, so the projections built including such a transformation are evaluated lazily, such that only the continue to be fully editable. They work in a similar way to parts of the projection required for showing what is on screen will actu- projections, in that they let the developer create templates in ally be executed, whereas for a code their target language (rather than the A language) declaratively, generator the whole transformation but with the option of breaking out into C# code. By moving will always be executed at once. 
The second difference is that projections are calls to things like test evaluation into a transformation that automatically triggered as the program precedes the projection, code with different types of respon- tree changes, whereas code generators are executed only on demand (e.g., sibilities is separated by concern and kept simple and to the by the user pressing a button in the point. Workbench.) A testing framework was created for the case study dis- cussed in this chapter, so that business rules can be evaluated with test data and the results verified against expected values, all directly in the workbench. The tests are run continuously, so that whenever the user modifies the business rules, the tests go red or green as the rules break or correspond to expectations. Fig. 22.8 shows an example of such a test case. The evaluation of the business rules is implemented as an in- terpreter that works by evaluating each node in the tree struc- ture according to its type in a recursive fashion. The code for this is packaged in a helper class called by a transformation that passes the test inputs to the evaluation method and dec- orates the transformation’s output tree with the results. The projection then takes the decorated tree, presents the editable input values, expected values and calculated test results (not editable), and compares the test results with the expected val- ues to show green, red or orange bullets as appropriate. In this case study we see another interesting example of pro- dsl engineering 555

Figure 22.8: The testing aspect of the Health workbench lets users create test data and evaluate the business rules directly in the IDE. As mentioned earlier in this book, in-IDE-testing is an important ingredient to making a DSL accessible to non-programmers: in some sense, the tests make the DSL "come alive" and help build understanding about the semantics of the language.

jecting dynamically derived information about the user input. The projection for a questionnaire calls out to C# code that per- forms consistency analysis on the combined business rules and questionnaire domains. The analysis ensures that when busi- ness rules are applied to a particular questionnaire, the rules do not refer to questions absent from that questionnaire. The tests and consistency analysis are implemented and pre- sented in a way specific to the application, but it is also possi- ble to use the IDW validation framework to ensure validity of user inputs. The developer then writes validators that run in the background against the tree structure as it is being edited by the user. When a rule in a validator is broken, it yields an error message, which is shown in the IDW error pane, a central place for collecting custom error messages from validators and system error messages from built-in generic validators alike. 29 For this application this means that the structure is correct, all validators  Transformation and Generation Once the input is known to are happy, all tests are green and be consistent29, the time has come to do something with the consistency analysis is satisfied. 556 dslbook.org

information the medical professional has provided. In this Figure 22.9: The consistency analysis ensures that when a set of business case study this means invoking code generation to produce the rules is used together with a particular XML files that the JavaScript in the web application will con- questionnaire, all questions evaluated by the rules are actually present in the sume, and the files with business rule definitions for the Drools selected questionnaire. engine. IDW includes a DSL for defining how to create output fold- ers and files, and, together with the DSL for transformations, it constitutes how code generators are defined. While it is pos- sible to take the tree that the user has edited and generate raw text files in the target format directly, it is often a better ap- 30 As we discussed in Part II, this proach to use a transformation to create a tree structure in the allows us to reuse parts of the overall transformation. For example, the Drools domain of the target format from the tree structure that the and XML domains are likely to be user edited30. Such transformations that result in information reusable in other applications. being generated to files rather than being presented to the user on screen do not have to be two-way, as there is no requirement for the information to stay editable. The workbench in this case study uses a transformation that takes the questionnaire domain as input and outputs a tree structure in the XHTML domain that is included with IDW. The resulting XHTML tree is then passed on to a second trans- formation that knows how to transform such trees to text. The result of this transformation is finally passed to a file gener- ator defined with the DSL for creating files and folders, with the result that the text is saved to files on disk. 
To generate the Drools files a similar transformation chain is executed, but with the difference that both the Drools domain and its transformation to text had to be developed for the project.

Fig. 22.10 shows the transformation to the Drools domain. Again, parts with a gray background are declarative, whereas the white background signifies imperative code. We can see the use of the FIn() function which determines if a given item is in a list of items. We also see the use of BookNew(), which creates a single node that can be inserted into a larger tree.

Figure 22.10: This is part of the transformation from the business rules domain into the Drools domain. The template-based approach is similar to projections.
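To make the roles of the two helper functions concrete, here is a rough Python analogy of a template-style transformation from a business rule to a tree in a Drools-like domain. The functions f_in and book_new only mimic the behavior the text attributes to FIn() and BookNew(); they are not the IDW functions, and the node kinds, the BusinessRule fields and the operator list are assumptions made for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Generic tree node standing in for a node of the Drools domain."""
    kind: str
    props: dict[str, str] = field(default_factory=dict)
    children: list[Node] = field(default_factory=list)


def f_in(item: str, items: list[str]) -> bool:
    """Stand-in for FIn(): is the given item in the list of items?"""
    return item in items


def book_new(kind: str, **props: str) -> Node:
    """Stand-in for BookNew(): create one unattached node that the
    caller can insert into a larger tree."""
    return Node(kind, dict(props))


@dataclass
class BusinessRule:
    """Hypothetical, heavily simplified business rule."""
    name: str
    question_id: str
    operator: str
    value: str
    advice: str


# Assumed operator set; the real rule language is not shown in the text.
SUPPORTED_OPERATORS = ["==", "!=", "<", ">"]


def rule_to_drools(rule: BusinessRule) -> Node:
    # Imperative part: checks and computed nodes.
    if not f_in(rule.operator, SUPPORTED_OPERATORS):
        raise ValueError(f"unsupported operator: {rule.operator}")
    condition = book_new("condition", fact=rule.question_id,
                         op=rule.operator, value=rule.value)
    action = book_new("action", advice=rule.advice)

    # Declarative part: a fixed tree template with holes filled in.
    return Node("rule", {"name": rule.name}, [
        Node("when", {}, [condition]),
        Node("then", {}, [action]),
    ])
```

The resulting tree would still need its own tree-to-text transformation to become a rule file, which is the part the text says had to be developed for the project.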

22.4 Wrapping Up

With the code generators in place the whole workbench is ready for use by the medical professionals. They can now define interviews and rules that they can run and validate against tests directly in the workbench. When they are satisfied with their work, they hit the button to generate the XML and Drools rules for the web application, which can then be used immediately by end users. All the time the workbench guides them in their work with helpful validation messages, auto-completion and running tests, allowing for consistently high quality in the generated applications.

To implement the workbench we used several DSLs for schema definitions, projections, transformations and more in concert. The final product also combines several domains, with the two most prominent domains for interviews and for business rules split up into individually reusable subdomains. The projectional approach is well suited for such complex language composition and provides flexible notation, which makes it a powerful technology in scenarios that target business professionals. The ability to mix DSLs with GPLs such as C# ensures that each DSL can remain short and to the point without taking on all the burdens of a GPL as requirements grow in complexity.