An Identifier-Based Intermediate Representation To

Information and Media Technologies 4(2): 282-303 (2009) reprinted from: Computer Software 25(3): 113-134 (2008) © Japan Society for Software Science and Technology 推薦論文● PPL2007 IIR: an Identifier-based Intermediate Representation To Support Declarative Compiler Specifications Eiichiro Chishiro For compiler developers, one big issue is how to describe a specification of its intermediate representation (IR), which consists of various entities like symbol tables, syntax trees, analysis information and so on. As IR is a central data structure of a compiler, its precise specification is always strongly desired. However, the formalization of an actual IR is not an easy task since it tends to be large, has complex interdependency between its entities, and depends on a specific implementation language. In this paper, as a first step to solve this problem, we propose a new data model for IR, called IIR. The goal of IIR is to describe a specification of IR declaratively without depending on its concrete implementation detail. The main idea is to model all entities of IR as relations with explicit identifiers. By this, we can develop an IR model transliterally from an actual IR, and describe its specification by using the full expressiveness of conventional logic languages. The specification is inherently executable and can be used to check the validity of IR in compile time. As a practical case study, we formalized an IR of our production compiler in IIR, and developed a type system for it in Prolog. Experimental results about size and performance are shown. a module for dataflow analysis analyzes the IR and 1 Introduction adds dependence information to it. A module for Developing a production compiler is a hard work. dead code elimination traverses the IR and removes The target architecture continues to evolve. New unnecessary statements from it. Thus the IR works language features continues to be added. To keep as a common interface between compiler modules, up with these changes and generate efficient com- and its specification judges the correctness of each piled codes, we should continue to implement new module. features for analyses, optimizations, code genera- Unfortunately, the complete specification of the tions and more. As a result, an actual implemen- IR rarely exists in practice. The IR continually tation of a production compiler tends to be quite changes in the development process, and it is often large — sometimes over a million of lines. the case that given some IR, developers disagree A compiler consists of a series of modules to re- with its precise meaning, or worse, whether it is alize its features. Most modules work on the inter- legal (well-formed). mediate representation (IR), which is a source pro- This severely undermines the quality of a com- gram representation in the compiler. For instance, piler. You are forced to develop your module based on an obscure specification of the IR. For instance, if you want to know all the patterns coming into 宣言的なコンパイラ仕様記述を支援するための識別子にもとづく中間表現. your module, you need to dive into massive source 千代英一郎, 日立製作所システム開発研究所,Hitachi codes and examine all possibilities. This makes the Systems Development Laboratory. development process slower and less reliable. 本論文は，第 9 回プログラミングおよびプログラミング In research on the semantics of programming lan- 言語ワークショップ (PPL2007) の発表論文をもとに発 guages, there exist lots of systematic ways to define 展させたものである． the specification of a programming language [41]. コンピュータソフトウェア, Vol.25, No.3 (2008), pp.113–134. Generally, we first define the syntax of program- [研究論文]2007年 4 月 30 日受付. 282 Information and Media Technologies 4(2): 282-303 (2009) reprinted from: Computer Software 25(3): 113-134 (2008) © Japan Society for Software Science and Technology ming languages in BNF, and then static/dynamic Goal Our goal is to provide a way to fill these semantics based on the syntactic structure of pro- gaps between the existing IR in practice and declar- grams. If we could regard the IR as a programming ative methods in research. What we want is an IR language, we would follow the same approach. model which This approach fits very well with methods based 1. can faithfully represent the structures of an on declarative programming languages (here we actual IR, and call these ‘declarative methods’). The specification 2. can also be a basis on which declarative meth- written in a declarative language can serve as both ods are developed directly. human-readable document and machine-executable Such models enable us to write the specification of checker. When writing the specification of the IR, IR in declarative languages, which has many bene- the latter property is very useful for relating the fits. The most notable one is executability as said specification to its implementation. above, which is invaluable and indispensable when Besides specifications, there exist many declara- developing a large formal specification. We can be tive methods for compiler features such as program confident in the correctness of a specification by analyses [11] [18] [38], optimizations [4] [12] [26] [39] executing it on IR models of test programs. If a and code generations [19]. These declarative meth- model is a faithful representation of IR, the distance ods are promising approaches to reducing the bur- between an actual implementation and its model dens of compiler development. is sufficiently small, and the results obtained from Problem However, these methods cannot be amodelaresuretoalsoholdonanimplementa- applied to the existing IR in a straightforward way. tion. In addition, existing declarative methods can The main difficulty lies in the difference between be built directly on an IR model, which releases the IR of an actual compiler and the program rep- us from the tiresome work of translation for each resentations used in these methods. method. The IR has several aspects which are difficult to In a sense, our motivation is similar to that of the be treated in declarative methods. First, the IR researchers and developers in the area of databases. forms a graph-structured data where entities refer They also aim to create a universal data model to each other (e.g., control flow graphs), or share upon which any application software can be devel- a common entity (e.g., symbol tables). Next, IR oped. One of the most successful achievements has entities of a production compiler tend to have a been the relational data model (RDM) [9]. While large number (sometimes more than a hundred) using RDM as an IR model seems an attractive ap- of attributes to carry miscellaneous information proach, it has some shortcomings in our situation, such as source line numbers, analysis results, com- particularly in the interaction with declarative lan- piler directives and architecture-specific extensions. guages. We will discuss this issue in Section 5. Finally, the situation becomes more complicated Proposal To achieve our goal, we have devel- when the implementation types of IR entities are oped a new IR model, which we call an identifier- optimized intricately to get better performance of based intermediate representation (IIR). Our key compilation. insight in developing an IR model is that major dif- In contrast, in most declarative methods, pro- ficulties described above stem from the anonymity grams are modeled as tree-structured terms suit- of IR entities. While each entity has its identity in able for treatment in declarative languages. Many an actual implementation, there is no counterpart attributes of programs are abstracted to focus on in conventional program models used in declara- the essence. Thus, to apply such method to the tive methods. This discrepancy is an impediment IR, we first need to extract a model of the IR by to faithfully modeling an actual IR structure. translation. As program models usually differ be- The essence of IIR is as follows: tween methods, we have to do this for each method. 1. all entities have an explicit identifier. Developing such a transformation is tedious and 2. all entities are represented as relations. error-prone work, and sometimes impossible due to The first condition means that we give each entity a the substantial difference between the IR and the unique identifier and always use this when referring assumed program representation of the method. 283 Information and Media Technologies 4(2): 282-303 (2009) reprinted from: Computer Software 25(3): 113-134 (2008) © Japan Society for Software Science and Technology † to the entity 1. complete. The statistical results of the specification Note that the meaning of the term identifier here clearly showed that the scale of the full specification is not limited to names of variables or functions in is quite large, and without such testing it would be programming languages. For instance, in an ex- virtually impossible to write a correct specification pression x+x, each operand of + has same vari- conforming to an actual implementation. able identifier x, but they are not the same entity Next, we used the specification as an IR checker instance and each of them has its own identity. The (lint). We ran it on some IRs that were incorrect identifiers in IIR provide an explicit and abstract because of bugs in an older version. We found that way of representing the identity of each entity. This twobugsreportedonourwebsite [28] [29] could solves the sharing problem discussed above as we be detected as type errors, although the causes of can always distinguished the entity itself form ref- these bugs had nothing to do with types. While erences to it by identifiers. An example for this is the effectiveness of an IR checker is well known in shown in Section 2. research [21], our result can be used as a practical The second condition means that all entities in example to support this claim in industry. IIR are relations in the mathematical sense. As re- As another case study, we developed a reaching lations can be easily treated as predicates in logic, definition analysis in a declarative language on an we can write a strict well-formedness condition for IIR model of MIR (Section 4).

An Identifier-Based Intermediate Representation To

Programming the Capabilities of the PC Have Changed Greatly Since the Introduction of Electronic Computers

Certification of a Tool Chain for Deductive Program Verification Paolo Herms

Presentation on Ocaml Internals

Comparative Studies of Programming Languages; Course Lecture Notes

“Scrap Your Boilerplate” Reloaded

Csci 555: Functional Programming Functional Programming in Scala Functional Data Structures

Open Pattern Matching for C++

Codata in Action

Generalized Abstract GHC.Generics

Refining Algebraic Data Types

Simple Algebraic Data Types for C

Wobbly Types: Type Inference for Generalised Algebraic Data Types