Programs that use databases are central to our information infrastructure. Such systems are increasingly being developed using procedural object-oriented languages and relational databases. But procedural languages and query languages are based on different semantic foundations and optimization strategies. No approach has yet been developed to bridge this semantic gap and enable integration of databases and procedural language programs into efficient total systems. As a result, applications that access databases are awkward to design and develop. Careful optimizations are often needed to attain good performance, resulting in programs that are difficult to maintain and evolve. In addition, databases and programming languages are taught separately, so developers trained in one domain may not be expert in the other. Finally, there has not been sufficient evaluation and comparison of the performance of different integration approaches. There is accordingly very high leverage in improving the development of these programs, with respect to both programmer productivity and application performance.

The goals of this project are to create methods, languages and tools that enable application programmers to readily develop efficient and well-structured programs that integrate database processing into procedural programs, to develop a framework for evaluating solutions to this problem, and to develop and offer courses that integrate database and procedural language program development.

The intellectual merit of this project is based on the development of fundamental solutions to the problem of integrating programming languages and databases. The vision is to extend programming languages to specify modular and statically-typed database operations, while still leveraging the full power of database query optimization and concurrency control.
Two approaches will be explored:

Safe query objects use statically-typed classes to define compositional and modular queries. This simple and novel approach may have immediate impact by enabling static typing of queries in widely-used technologies like EJB.

Query extraction involves extracting database queries from traditional procedural code. This project will go beyond previous work by extracting large queries from nested sub-expressions and exploring interprocedural analysis to extract queries from modular programs.

These new approaches will be evaluated relative to existing approaches for performance and usability. In addition to traditional performance benchmarking, large open-source applications will be used to test the applicability and usability of proposed solutions and measure their performance in more realistic settings.

This project will have broad impact on both industry and academia. A fundamental improvement in the interface between programming languages and databases could significantly reduce the cost and improve the quality of software, from e-commerce servers to desktop and mobile applications. The project includes a plan to publish in both programming language and database conferences to facilitate and encourage coordinated research. These results are also a concrete first step in a larger research effort to better support the development of distributed systems from large-scale components, using technologies like web services.

The educational innovation in this project is based on developing an integrated approach to teaching programming languages and databases. Although incremental improvements to existing courses will be implemented, the project will also develop undergraduate and graduate courses that take an integrated view of programming with persistent data as their central theme.

Project Description

1 Introduction

Programs that use databases are a critical part of our information infrastructure.
These systems generally use programming languages for general-purpose computation and databases to control concurrent access to data, search large amounts of data, and/or update data reliably and securely. Such systems are increasingly being developed using procedural object-oriented languages and relational databases. For scalability and reliability, multiple application servers typically communicate with a shared, replicated database server.

Procedural languages and database query languages are based on different semantic foundations and optimization strategies. These differences are known informally as “impedance mismatch” [44]: imperative programs versus declarative queries, compiler optimization versus query optimization, algorithms and data structures versus relations and automatic indexes, null pointers versus nulls for missing data, and different approaches to modularity and information hiding. Because databases and programming languages can perform many of the same tasks, developers must make difficult architectural decisions about how to organize and partition system functionality. Distributed execution also requires efficient structuring and management of specialized communication patterns.

As a result, applications that access databases are awkward to design and develop. Programming languages do not facilitate effective use of databases, and attaining good performance usually requires careful optimization based on expert knowledge, which can make programs difficult to maintain and evolve.

In addition, databases and programming languages are taught separately, so developers trained in one domain may not be expert in the other. Finally, there has not been sufficient evaluation of the end-to-end performance of design alternatives.
There is accordingly very high leverage in methods, languages and tools that enable more effective development of these programs, with respect to both programmer productivity and application performance.

1.1 Research Overview

The goals of this project are to create methods, languages and tools that enable application programmers to readily develop efficient and well-structured programs that integrate database processing into procedural programs, to develop a framework for evaluating solutions to this problem, and to develop and offer courses that integrate database and procedural language program development.

The vision is to extend programming languages to specify modular and statically-typed database operations, while still leveraging the full power of database query optimization and concurrency control. Two approaches will be explored:

Safe query objects use statically-typed classes to define compositional and modular queries. This approach is effective because it leverages existing language constructs to enable static typing of queries in widely-used technologies like EJB.

Query extraction involves extracting database queries from traditional procedural code. This approach is a natural combination of compiler and query optimization within a unified language framework. This project will go beyond previous work by extracting large queries from nested sub-expressions and exploring interprocedural analysis to extract queries from modular programs.

These new approaches will be evaluated relative to existing approaches for performance and usability. In addition to traditional performance benchmarking, large open-source applications will be used to test the applicability and usability of proposed solutions and measure their performance in more realistic settings.

class Employee {
    String name;
    float salary;
    Department department;
}

class Department {
    String name;
    Collection employees;
    Employee manager;
}

Figure 1: Example database schema defined via classes

1.2 Expected Results

The results from this project will include fundamental theories on how to interface programming languages to databases to achieve both programmer productivity and high performance, practical implementations of this theory in a language design and implementation, and measurement of this solution relative to other approaches.

The project will produce several PhD theses, one of which is now in progress, and several master's theses. Theoretical results will be published in both programming language and database conferences. The practical applications of the project make it possible to communicate the results in professional and trade publications. All software developed in the project will be freely available on the web as a basis for training and experimentation.

The project will explore an integrated approach to teaching programming language and database concepts. New courses will be developed at the undergraduate and graduate level that take integration as a central theme. The project will also make incremental improvements to existing courses to address integration issues.

The results from this project will help to improve the performance and reliability of software, while reducing development costs. The solutions developed in this project will be defined at a fundamental, theoretical level, so that they will apply to a wide range of languages and applications. The empirical validation of the results will provide evidence that the techniques are worth adopting.

These results are also a concrete first step in a larger research effort to better support the development of distributed systems from large-scale components, using technologies like web services. Databases are an important concrete example of a large-scale component with complex integration needs.
The solutions developed in this project should contribute to a longer-term effort to address integration problems in general.

2 The Problem

One of the contributions of this project will be a detailed analysis of the problems involved in integrating programming languages and databases. This section presents an overview of the key issues as I understand them now. During the course of the project this understanding of the problem will certainly evolve.

As a starting point for discussion of the problem of interfacing a programming language with a database, consider the simple employee/department object model defined in Figure 1 as a set of classes. Using orthogonal persistence [55, 4, 43, 46] or object-relational mapping [35, 65, 60], these classes can represent persistent data. The example in Figure 2 finds employees whose name begins with a given prefix and prints their name, salary, and the department they work in.

void printInfo(String prefix) {
    for (Employee e in db.allEmployees())
        if (e.name.startsWith(prefix)) {
            print(e.name);
            print(e.salary);
            print(e.department.name);
        }
}

Figure 2: Printing employee information

Optimization. The first problem concerns the ability to leverage query optimization techniques on this code. A query optimizer can produce more efficient code to evaluate this SQL query, which returns the values that must be printed:

SELECT e.name, e.salary, d.name FROM Employee e INNER JOIN Department d ON d.ID = e.Department WHERE e.name LIKE ?

The “?” represents a parameter which must be bound when the query is executed. The database manages indexes automatically and performs query optimization based on detailed knowledge of the structure, content, and location of the data [28]. A particular plan for executing the query can be described in a form similar to the original code:

void printInfo(String prefix) {
    // Scan only the index entries whose name matches the prefix
    for (IndexItem l in employeeNameIndex.match(prefix)) {
        Employee e = employeeID_Index.lookup(l.ID);
        print(e.name);
        print(e.salary);
        Department d = departmentID_Index.lookup(e.deptID);
        print(d.name);
    }
}
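The plan above can be approximated with ordinary in-memory collections. The following is a minimal sketch, not the database's actual implementation: it assumes a sorted map stands in for the name index and hash maps stand in for the ID indexes. The class IndexPlanSketch, its nested types, and the info method are hypothetical names introduced only for illustration. It also assumes names do not contain the character U+FFFF, which is used as the upper bound of the prefix range.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the index-based plan; all names are illustrative.
class IndexPlanSketch {
    static class Department {
        final int id;
        final String name;
        Department(int id, String name) { this.id = id; this.name = name; }
    }

    static class Employee {
        final int id;
        final String name;
        final double salary;
        final int deptId;
        Employee(int id, String name, double salary, int deptId) {
            this.id = id; this.name = name; this.salary = salary; this.deptId = deptId;
        }
    }

    // Sorted index from employee name to record IDs (several employees may share a name).
    final TreeMap<String, List<Integer>> employeeNameIndex = new TreeMap<>();
    // Indexes from record ID to record value.
    final Map<Integer, Employee> employeeIdIndex = new HashMap<>();
    final Map<Integer, Department> departmentIdIndex = new HashMap<>();

    void add(Department d) { departmentIdIndex.put(d.id, d); }

    void add(Employee e) {
        employeeIdIndex.put(e.id, e);
        employeeNameIndex.computeIfAbsent(e.name, k -> new ArrayList<>()).add(e.id);
    }

    // Returns one row per matching employee instead of printing, so the result can be inspected.
    List<String> info(String prefix) {
        List<String> rows = new ArrayList<>();
        // Every key in [prefix, prefix + U+FFFF) starts with the prefix, so a range
        // scan over the sorted index replaces the linear search over all employees.
        String upperBound = prefix + Character.MAX_VALUE;
        for (List<Integer> ids : employeeNameIndex.subMap(prefix, true, upperBound, false).values()) {
            for (Integer id : ids) {
                Employee e = employeeIdIndex.get(id);
                Department d = departmentIdIndex.get(e.deptId);
                rows.add(e.name + " " + e.salary + " " + d.name);
            }
        }
        return rows;
    }
}
```

The range scan touches only the matching index entries, which is why this form beats the linear loop when few names share the prefix. It also shows the maintenance cost: adding a second condition to the original if statement would change which index, if any, is the right one to scan.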

The prefix test is implemented by searching an index: the match method returns an iterator over the index matches. An efficient index from record IDs to record values is then used to find the employee data. This code will be more efficient than the linear search as long as most names do not start with the prefix.

Programmer productivity is significantly reduced if such optimizations must be coded by hand, because even a slight change to the original unoptimized code, e.g. adding another condition to the if statement, may require a complete rewrite of the optimized form. If any data can be persistent, programmers cannot identify where indexes or specialized algorithms are needed. Languages that lack a clear performance model are frequently unsuccessful [29].

Given that database management systems already perform these optimizations, a pragmatic solution is to use a call level interface [63, 54] like JDBC [33] to execute the SQL statement, as shown below:

// Uses java.sql.Connection, DriverManager, PreparedStatement, ResultSet
String empQuery = "SELECT e.name, e.salary, d.name AS deptName "
    + "FROM Employee e INNER JOIN Department d ON d.ID = e.Department "
    + "WHERE e.name LIKE ?";

Connection conn = DriverManager.getConnection(...);
PreparedStatement stmt = conn.prepareStatement(empQuery);
stmt.setString(1, prefix + "%");   // LIKE needs a trailing wildcard for a prefix match
ResultSet rs = stmt.executeQuery();
while (rs.next()) {
    print(rs.getString("name"));
    print(rs.getBigDecimal("salary"));
    print(rs.getString("deptName"));
}

There are many problems with this approach. The syntax and types of database programs are not checked statically, so errors are not detected until runtime. Query parameters are awkward to specify and invoke via an API, and query results are represented as dynamically typed objects that are accessed by string names. Constructing and reusing queries at runtime requires complex and error-prone string manipulation. Despite these problems, many commercial software development projects use call level interfaces to leverage database query optimizers and reduce communication latency.

In addition to algorithmic optimizations, optimizing latency is also an important issue. The original code in Figure 2 traverses the department relationship to print the name of each person’s department. In a persistent object system this traversal will load the appropriate Department object, if it is not already loaded. When persistence is connected to a relational database, the result will be multiple queries to the database, causing high latency and reduced performance [16].

The basic problem is the forced choice between a clean programming model that does not support important optimizations and an extremely poor programming model that can leverage the full extent of database optimization.

Data Manipulation. Although data manipulation operations are frequently applied to individual objects, leaving little room for the kinds of operations described in the previous section, there are typically a few cases in any application where bulk data manipulation is required. Data manipulation operations include inserts, deletes, and updates to data. A simplified example is the following:

for (Employee e in db.allEmployees())
    if (e.department.name.equals("Sales"))
        e.salary = e.salary * 1.2;

The optimizations described in the previous section also apply in this case. No existing data access library allows data manipulation to be expressed in a natural form as shown above, yet executed as a single bulk update operation in SQL.

Modularity. Modularity in programming languages can interfere with the kinds of optimizations described above. The problem is that the behaviors needed to build the query may be spread across multiple procedures or methods. Unless interprocedural analysis can reconstruct the full query, there will be a significant performance penalty for writing modular, reusable programs like this one:

void printInfo(String prefix) {
    for (Employee e in db.allEmployees())
        if (e.name.startsWith(prefix))
            printEmployee(e);
}

void printEmployee(Employee e) {
    print(e.name);
    print(e.salary);
    printDepartment(e.department);
}

void printDepartment(Department d) {
    print(d.name);
}

3 Prior Research Accomplishments

This work builds upon my solid research background in formal semantics, type theory, compilation, and language design. It is also motivated and informed by my ten years of experience in commercial software development, where I designed and implemented programming languages and database-driven enterprise software.

Results from Prior NSF Support. I have no prior NSF support.

Semantics and type theory. My early work included a typed semantics of inheritance [18, 22] and formal study of mixins [8]. One result of this work was a demonstration of the inconsistency of Eiffel’s type system [19]. My work on object interfaces [11, 21] and F-Bounded polymorphism [10] has significantly influenced the design of Java.

Distributed languages and compilation. At Apple I led the design of AppleScript [37], a language for distributed application integration on the Macintosh platform. To achieve efficiency in the face of high communication latency, I developed an alternative to traditional remote procedure calls: AppleScript messages contain declarative specifications of transactions to be performed on the remote server. I also participated in the development of compilation techniques to generate and execute the remote messages. This design experience is relevant to this proposal, since AppleScript messages are a generalization of database queries.

Educational software development. I was software architect for the Writer’s Solution, Prentice Hall’s interactive curriculum for teaching writing skills to middle school students. The system allowed curriculum authors to populate a database of interaction elements, which were then compiled into highly interactive multimedia CDs, each with nearly a thousand pages.

Database-driven software development. At Allegis I led the development of an enterprise application platform and Partner Relationship Management (PRM) product.
The platform supports extensive configuration and customization of applications while enabling rapid upgrades, high performance, and scalability. The key research problems and solutions identified in this work are the basis for this proposal.

4 Background and Related Work

In 1987, Atkinson & Buneman [3] reviewed early work on integrating programming languages and databases; their focus on creating a clean, uniform programming model for persistent data provided a framework for later research. They did not focus attention on concurrency or performance issues. In another early paper, David Maier [44] stated a key requirement for solving impedance mismatch: “Whatever the database programming model, it must allow complex, data-intensive operations to be picked out of programs for execution by the storage manager, rather than forcing a record-at-a-time interface.” The approach taken in this project follows this recommendation. Ten years later, Carey & DeWitt predicted the demise of persistent programming languages and object-oriented databases, and the ultimate success of object-relational databases. They also identified the integration of databases and programming languages, which they call client integration, as one of five key research challenges.

4.1 Data Access Libraries

A call level interface (CLI) allows a programming language to access a database engine through a standardized API [63, 39, 33, 54]. The key characteristic of a CLI is the ability to execute database queries and commands, which are passed as strings or other runtime data structures [17, 9, 64]. The problems with this approach were mentioned in Section 2.

Gould, Su and Devanbu [32] apply static analysis to check programs that use call level interfaces. Their analysis does not currently cover query parameters or result types and can produce incorrect results when components are compiled separately. It does not reduce the complexity of using call level interfaces.

Query languages may also be embedded into a programming language [38, 2]. Dynamic construction of modular queries is not supported. The result in all these approaches is an explicit hybrid of a host language and SQL, rather than a unified programming model.

HaskellDB [41] is a SQL library for Haskell. Queries can be built from reusable components involving selections, joins and filtering, but not existential quantification or sorting. HaskellDB relies heavily on Haskell’s sophisticated type system and monadic operators.
4.2 Orthogonal Persistence

Persistent programming languages [55, 4, 43, 46] are based on the principle of orthogonal persistence [3]: any data can be persistent and programs operate the same on persistent and non-persistent data. Persistent programming languages do not fully overcome impedance mismatch. With sharing and concurrency, persistent data behaves differently from non-persistent data [7]. Persistent programming languages typically do not support query languages or query optimization, since objects are located by traversing pointers. Techniques for improving performance using prefetching have been explored [6].

Object-oriented databases (OODB) are designed to manage concurrent access and updates to complex object structures [24, 45]. They typically provide bindings to languages to implement orthogonal persistence [15, 5]. Some object-oriented databases include call level interfaces to execute OQL queries [15].

4.3 Generative and Reflective Approaches

Linguistic reflection, or generating code at runtime, has been used to generate data access code. It can be used to generate code for relational joins [40], which cannot be typed in typical polymorphic type systems, although the result is more an illustration of the power of reflection than a usable programming model. Linguistic reflection can also be used to compile query strings passed to a call level interface into procedural code [1]. This is the opposite of the approach proposed here, and provides neither static type checking nor the ability to leverage external databases.

Object-relational mapping tools generate code to connect an object-oriented programming language and a relational database [35, 65, 60]. They are effective at generating load/store operations to support navigational access, but use call level interfaces to execute queries [27, 17, 56, 47].

4.4 Integrated Approaches

The DBPL and Tycoon projects have explored many aspects of integration of databases and programming languages. The DBPL language is an extension of Modula-2 to support orthogonal persistence [59, 58]. Modula-2 was also extended to include iterators with associated conditions; access expressions for selection, existential quantification, and projection; and operators for insertion, deletion, and update. DBPL has a built-in database, but also has gateways to relational databases [49]. No published results indicate whether queries from different iteration and access constructs were combined, or measure the performance of this approach.

The Tycoon language [52, 50] is a follow-on to DBPL with a higher-order polymorphic type system. Persistent data is based on programming language type models, which need complex user-defined bulk types [51], rather than simpler object-oriented or entity-relational data models. Tycoon proposed integrating compiler optimization and database query optimization, but no final results were published [31, 30]. Queries that cross modular boundaries were optimized at runtime by dynamic compilation [57], and query functions could be passed to remote Tycoon servers as code objects [48]. No performance evaluations have been published for DBPL or Tycoon; the only published metrics cover lines of code in the implementation. Unlike Tycoon, this project will investigate static extraction and compilation of queries, with metrics to measure the completeness of the resulting queries.

The Xen language is a more recent approach to integrating relations, XML, and streams.
It is similar to HaskellDB but based on an object-oriented language, C# [53]. Xen is still under development, and its capabilities for accessing relational databases have not yet been published or evaluated.

4.5 Benchmarks

Many benchmarks have been created to evaluate performance of systems that use databases [14]. The OO7 benchmark [13, 12] was designed to simulate the typical workload on an object-oriented application. Dynamic frameworks [34] and generic benchmarking tools [25] have also been created to help overcome some of the limitations of OO7 [62]. Synthetic benchmarks can differ significantly from operations taken from real applications [36].

Existing studies focus on comparing different implementations of a particular approach, like persistent objects, rather than comparing different approaches like persistent objects and object-relational mapping. For example, the OO7 benchmark was designed to test object-oriented databases and has not been generally applied to testing other database styles. While OO7 was designed to simulate complex engineering applications like CAD, for which object-oriented databases were designed, there is no reason why other approaches cannot be measured against this type of workload. Even e-commerce applications are becoming more complex, so the distinction between engineering applications and business applications is less significant.

5 Initial Results

Safe query objects are a statically-typed, compositional approach to definition of queries and transactions. This section presents initial results on the development of safe query objects [23]. The current design of safe query objects is tied closely to JDO [56], an object-relational mapping library. JDO was selected because it provides support for orthogonal persistence and has a high-level query language, JDOQL, a variant of OQL [15]. As in other call level interfaces, JDOQL queries are manipulated as runtime strings whose syntax and types are not checked until runtime. The JDO Query interface given below makes extensive use of meta-programming, or passing programs as data:

interface javax.jdo.Query {
    void setFilter(String filter);            // filter expression
    void setOrdering(String ordering);        // sort order expression
    void declareImports(String imports);      // package imports
    void declareParameters(String params);    // parameter declarations
    void declareVariables(String vars);       // existential variables
    Object execute();                         // execute query
    Object executeWithMap(Map map);           // execute with parameters
}

The prototype of safe query objects overcomes all of the insecurities in the JDO interface, by providing a statically-typed representation of JDO query behavior while still leveraging the underlying object-relational support in JDO. The elements of safe queries correspond to the methods in the JDO Query interface: filter and sort expressions are defined as normal Java methods; import, parameter and variable declarations use normal Java declarations. Below is a simple example of a safe query:

class SalaryLimit extends BaseQuery<Employee> {
    boolean filter(Employee e) {
        return e.salary > 50000;
    }
    Sort order(Employee e) {
        return new Sort(e.name, Sort.ASCENDING);
    }
}
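Read as ordinary Java, a safe query has a direct in-memory meaning: keep the candidates for which filter returns true, then sort them. The sketch below illustrates only that reference semantics and does not translate anything into a database query. The outer class SafeQuerySketch, the executeInMemory method, and the use of java.util.Comparator in place of the Sort class are assumptions made for illustration; they are not the prototype's API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory reading of a safe query; names are illustrative only.
class SafeQuerySketch {
    static class Employee {
        final String name;
        final double salary;
        Employee(String name, double salary) { this.name = name; this.salary = salary; }
    }

    // T is the candidate type: the type of object the query returns.
    static abstract class BaseQuery<T> {
        // Must be free of side effects: it only decides membership in the result.
        abstract boolean filter(T candidate);

        // Stand-in for the Sort result of the order method.
        abstract Comparator<T> order();

        // Evaluates the query over an in-memory collection, leaving the input untouched.
        List<T> executeInMemory(List<T> candidates) {
            List<T> result = new ArrayList<>();
            for (T c : candidates) {
                if (filter(c)) {
                    result.add(c);
                }
            }
            result.sort(order());
            return result;
        }
    }

    static class SalaryLimit extends BaseQuery<Employee> {
        boolean filter(Employee e) { return e.salary > 50000; }
        Comparator<Employee> order() { return Comparator.comparing((Employee e) -> e.name); }
    }
}
```

Because filter and order are plain typed methods, a misspelled field or a comparison between incompatible types is rejected by the Java compiler. A translator is then free to replace executeInMemory with an equivalent database query, provided the methods stay free of side effects.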

The generic type parameter Employee is the candidate type of the query, which specifies the type of objects that the query returns. The filter method takes an object of the candidate type and returns a boolean indicating if that object should be included in the query results. The order method returns a linked list of objects containing a comparison value and a sort direction: the first value in the list defines the primary order and direction; when the first values are equal, subsequent values are considered. All the methods must be free of side-effects: they must not modify any state or call any methods that modify state.

The key to safe query objects is the mechanism for translating them into database commands for efficient execution. To do so, the behavior specified in the filter method is translated into an equivalent relational query. The prototype uses OpenJava [61], an extension of Java with compile-time meta-programming, to translate safe queries for execution via JDO.

Figure 3(a) is a more complex example that illustrates the capabilities of the current design of safe query objects. Figure 3(b) is the automatically generated method to perform the query using the JDO Query interface. Query parameters, like deptPrefix, are represented by member variables which are initialized when the query object is created. Existential quantification is supported by an exists method in the base query class. Its arguments are a collection of objects and a subquery; the result is true if any objects in the collection satisfy the subquery filter. This method is similar to the detect:ifNone: method in the Smalltalk collection classes [20].

class DeptQuery instantiates RemoteQueryJDO extends BaseQuery<Department> {
    // member variables & constructor arguments are query parameters
    Float min;
    String deptPrefix;
    DeptQuery(Float min, String deptPrefix) {
        this.min = min;
        this.deptPrefix = deptPrefix;
    }
    // filter function determines membership in result set
    boolean filter(Department dept) {
        return (deptPrefix == null || dept.name.startsWith(deptPrefix))
            && (min == null || exists(dept.employees, new SalaryLimit()));
    }
    // multiple order expressions can be used to sort results
    Sort order(Department dept) {
        return new Sort(dept.name, Sort.ASCENDING);
    }
}

(a) Safe query with dynamic filter, existential quantification, and subquery

Collection execute(String namePrefix, Double minSalary) {
    javax.jdo.Query q = makeQuery(Department.class);
    String filter = null;
    String paramDecl = "";
    HashMap paramMap = new HashMap();

    if (namePrefix != null) {
        q.declareParameters("String namePrefix");
        paramMap.put("namePrefix", namePrefix);
        filter = and(filter, "(name.startsWith(namePrefix))");
    }

    if (minSalary != null) {
        q.declareVariables("Employee e;");
        filter = and(filter, "employees.contains(e) && exists(e.salary > 5000)");
    }

    q.setFilter(filter);
    q.setOrdering("name ascending");
    return (Collection) q.executeWithMap(paramMap);
}

(b) Automatically generated query code to invoke query via JDO

Figure 3: A complex safe query and its translation to JDO

Dynamic queries involve filters, parameters, or ordering whose structure varies at runtime. For example, a user interface may allow specification of name or age criteria, resulting in queries of the form (name = "Alice") in one case, and (age > 30) in another, or a combination of the two. Dynamic queries are different from parameterized queries, which support variation in values but not structure. Safe query objects implement dynamic filters through the use of conditionals or short-circuit evaluation: parts of the overall filter are only evaluated if certain conditions are met. This query expression cannot be encoded directly in JDOQL or SQL, because query languages do not support null parameters, nor do they require short-circuit evaluation of boolean connectives. Instead, the safe query object translator uses abstract partial evaluation to generate the different cases for the structure of the final query. Currently the analysis only supports dynamic queries based on tests for null.

6 Research Plan

In this section I present specific research plans to meet the challenges identified above. The solutions are centered around two approaches to the problem, safe query objects and query extraction, which will be developed and evaluated. Safe query objects are a significant improvement to existing database libraries, while query extraction is a more ambitious and comprehensive solution to the problem. The next two sections describe the two research approaches; the following sections cover general challenges that will be addressed in each approach.

6.1 Safe Query Objects

Safe query objects have potential for significant near-term impact on database interfaces. I hypothesize that safe query objects are a general mechanism that can be applied to a variety of languages and database libraries to provide static typing and high performance. Both the generality and limitations of safe query objects will be investigated.

My initial research on safe query objects for JDO, and its prototype implementation based on OpenJava, will be completed and evaluated for performance and usability.
The effectiveness of the solution will be measured by converting existing sample code and a JDO-based application to use safe queries. One specific issue to be evaluated is whether the kinds of dynamic queries used in applications fall within the scope of queries supported in safe query objects.

Although safe query objects provide a type-safe interface to all of the functionality supported by JDO, the initial work also revealed fundamental limitations in the JDO model itself. For example, JDO cannot specify query results from multiple tables at the same time. Other limitations include the inability to specify aggregation operations, and the inability to specify data manipulation operations. This project will resolve these issues by developing a model of safe query objects that is independent of any particular database library. This model will then be related to OQL, the standard object-oriented query language, which is more expressive than JDOQL. This will enable a broader investigation of safe query objects, and avoid the limitations of JDO. A version will also be developed for EJB to validate the generality of the technique and enable evaluation on a wider range of sample programs.

I will also investigate the hypothesis that safe query objects can be implemented using a variety of meta-programming mechanisms. Meta-programs are programs that manipulate programs: in the case of safe queries, the input is a type-checked class definition and the output is a database query. The prototype uses OpenJava, an extension to Java for compile-time meta-programming. I will investigate implementing safe queries using run-time reflection and code generation as supported by standard Java. I will evaluate the performance of run-time versus compile-time meta-programming. Finally, the generality of the technique will be tested by developing a version of the data access library for OCaml [42], a dialect of ML with support for objects.
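To make the scope question for dynamic queries concrete, the sketch below shows how null tests on parameters select the structure of a filter. With two optional parameters there are four possible query shapes, one per combination of null and non-null values. This is a minimal illustration under stated assumptions, not the translator's actual output: the class DynamicFilterSketch, the Plan holder, the build method, and the JDOQL-style filter fragments are all hypothetical, and the and helper is one plausible definition of the helper that Figure 3(b) calls but the text does not define.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of null-test-driven dynamic filters; names are illustrative only.
class DynamicFilterSketch {
    // Conjoins two filter fragments; a null left side means "no filter so far".
    static String and(String filter, String clause) {
        return filter == null ? clause : filter + " && " + clause;
    }

    // The query structure chosen for one combination of null / non-null parameters.
    static class Plan {
        String filter;                                   // null means "no restriction"
        final Map<String, Object> params = new LinkedHashMap<>();
    }

    static Plan build(String namePrefix, Double minSalary) {
        Plan p = new Plan();
        if (namePrefix != null) {                        // clause exists only when the parameter is supplied
            p.filter = and(p.filter, "name.startsWith(namePrefix)");
            p.params.put("namePrefix", namePrefix);
        }
        if (minSalary != null) {
            p.filter = and(p.filter, "salary > minSalary");
            p.params.put("minSalary", minSalary);
        }
        return p;
    }
}
```

Each null test removes a clause from the query text instead of passing a null into it, which is why the resulting strings never rely on null parameters or short-circuit evaluation inside the query language.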

6.2 Extracting Queries

Building on safe query objects, I propose to investigate tighter integration of programming languages and databases. Safe query objects force a separation between queries and the code that processes the results returned by the query. It is often more natural to intermix the two, as in Figure 2. The query part of this computation finds employees whose name begins with a prefix; the processing part prints employee attributes and related information. These parts can then be executed in phases: first the query, then the processing. This phase distinction will be investigated using the mechanism of staged computation [26].

Some of the specific challenges in this work include developing an appropriate conservative analysis to identify the data that may be needed in the processing stage. In addition, the binding time of variables used in the query stage and the processing stage must be controlled: a query cannot depend upon a variable bound in the later processing stage. Sorting and aggregation are also very different in programming languages and databases [23], so language extensions may be needed to handle these cases.

The following sections discuss some specific research areas that apply to both safe query objects and query extraction.

6.3 Query Result Type Inference

One of the problems discussed in Section 2 was identifying the exact subset of the database needed to perform an operation. I propose to use type inference to identify precise types for the data that must be loaded to complete an operation. As an example, consider the program in Figure 2. The operation does not need all the attributes and relationships of the types defined in Figure 1. The types that describe the subset of attributes needed by the operation are supertypes of the actual types in the database definition, as shown below:

    interface EmployeeP {
        String name;
        float salary;
        DepartmentP department;
    }

    interface DepartmentP {
        String name;
    }

Only fields that are printed are included. The type EmployeeP is a supertype of Employee, and DepartmentP is a supertype of Department.

Safe query objects can be extended to create complex result types by adding a method to construct the result explicitly. However, this places an additional burden on the programmer, and also introduces the complexity of many synthetic types. It may be possible to simplify this process using meta-programming, as in OpenJava, but it is more likely that a modified type system will be required. The complexity of adding result types to safe query objects is one of the primary motivations for query extraction, which should provide a better foundation for manipulation of query result types.

6.4 Modularity

I will investigate solutions to the problem of modularity described in Section 2. In the case of safe query objects, optimization over modular queries is achieved by method interfaces through which modular query objects coordinate to build a single database query, which is then passed to the database query optimizer.
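As a rough illustration of how independently written query modules might coordinate through a method interface, the sketch below lets each module contribute its condition to one shared builder, so that a single statement reaches the database optimizer. The names QueryPart, QueryBuilder, NameStartsWith, SalaryAbove, and Queries are hypothetical and are not taken from the prototype; the emitted SQL is deliberately simplistic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface through which query modules add to one shared query.
interface QueryPart {
    void contribute(QueryBuilder q);
}

// Accumulates conditions and parameter values for a single SQL statement.
class QueryBuilder {
    private final String table;
    private final List<String> conditions = new ArrayList<>();
    private final List<Object> parameters = new ArrayList<>();

    QueryBuilder(String table) {
        this.table = table;
    }

    void where(String condition, Object value) {
        conditions.add(condition);
        parameters.add(value);
    }

    String toSql() {
        String sql = "SELECT * FROM " + table;
        if (!conditions.isEmpty()) {
            sql += " WHERE " + String.join(" AND ", conditions);
        }
        return sql;
    }

    List<Object> parameters() {
        return parameters;
    }
}

// Two modules written without knowledge of each other.
class NameStartsWith implements QueryPart {
    private final String prefix;

    NameStartsWith(String prefix) {
        this.prefix = prefix;
    }

    public void contribute(QueryBuilder q) {
        q.where("name LIKE ?", prefix + "%");
    }
}

class SalaryAbove implements QueryPart {
    private final float minimum;

    SalaryAbove(float minimum) {
        this.minimum = minimum;
    }

    public void contribute(QueryBuilder q) {
        q.where("salary > ?", minimum);
    }
}

class Queries {
    // Combines the modules into one query instead of running one query per module.
    static QueryBuilder combine(String table, QueryPart... parts) {
        QueryBuilder q = new QueryBuilder(table);
        for (QueryPart part : parts) {
            part.contribute(q);
        }
        return q;
    }
}
```

Values are kept as parameters rather than spliced into the SQL text, so the combined statement can be prepared once and handed to the optimizer as a whole.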

Query extraction is significantly more difficult in the presence of modularity. My current plan for investigation is to automatically generate attributes or runtime interfaces for the construction of queries, similar to those used by safe query objects. Attributes can describe the query behavior of a procedure. For example, the declared type of the argument to the printEmployee function in Section 2 is Employee, but the type based on its data access behavior is EmployeeP, defined in the previous section. If necessary, each procedure declaration could generate two entry points: one for constructing a query that describes the data requirements of the procedure, and a second that actually performs the procedure. I experimented with a variant of this technique in the framework for the Allegis products, and it was quite effective. I propose to do a more rigorous design and evaluation of this technique.

6.5 Data Manipulation

This project will investigate optimization of data manipulation. Safe query objects can be extended to data manipulation by adding methods for update and insert:

    Employee update(Employee e) {
        e.salary = e.salary * 1.2;
        return e;
    }
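To make the intended meaning of such an update method concrete, the following sketch applies it in memory to every object selected by a filter. It only gives reference semantics: the classes Employee and BulkUpdate and the method names are assumptions made for this illustration, and the SQL in the comment shows the kind of single statement a translator would aim for, not output produced by this code.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical persistent class with only the fields needed here.
class Employee {
    final String name;
    float salary;

    Employee(String name, float salary) {
        this.name = name;
        this.salary = salary;
    }
}

class BulkUpdate {
    // Applies the update to every object the filter selects and returns
    // the number of objects changed. A translator would aim to replace
    // this loop with one statement such as
    //   UPDATE Employee SET salary = salary * 1.2 WHERE name LIKE 'A%'
    // (illustrative SQL only).
    static int apply(List<Employee> all,
                     Predicate<Employee> filter,
                     UnaryOperator<Employee> update) {
        int changed = 0;
        for (Employee e : all) {
            if (filter.test(e)) {
                update.apply(e);
                changed++;
            }
        }
        return changed;
    }

    // An update method in the style of the text; the float literal is an
    // assumption that salary is declared as float, as in EmployeeP.
    static Employee raise(Employee e) {
        e.salary = e.salary * 1.2f;
        return e;
    }
}
```

A call such as BulkUpdate.apply(staff, e -> e.name.startsWith("A"), BulkUpdate::raise) pairs a query-like filter with the update method, which is the pairing that a bulk SQL command would express directly.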

The return type for an insert method can be different from the argument type. Bulk deletion only requires a query to specify which objects to delete. I will also investigate extracting bulk update operations as SQL commands from procedural code. Some syntactic support to delineate the boundary of a transaction will be needed in this case.

6.6 Improving the PL/DB Interface

The previous sections have outlined ways that programming languages can better access databases. This project will also investigate opportunities to improve the interface that databases present to programming languages. The opportunity here is analogous to the one exploited by RISC architectures, which fundamentally restructured the interface between compilers and computer architecture. Like pre-RISC assembly language, SQL was designed to be read and written by humans, not software. I propose to analyze the kinds of queries generated by programs and identify opportunities for improving database operations based on these results. For example, initial results indicate that queries would be easier to define if they returned a small database (a set of related tables) rather than a single table. The key point is that SQL is essentially the target language for translation of data-oriented operations, just as assembly language is the target for general-purpose operations.

6.7 Evaluation

This project will evaluate existing and new techniques for interfacing programming languages and databases along two dimensions: 1) system performance and 2) ease of use of the persistence model from within the programming language. For system performance, the key metric is end-to-end time to perform a single transaction (latency), not number of transactions per second (throughput). The focus is to evaluate the quality of the interface between the programming language and the database, not the performance of database systems themselves. It does not matter how fast a database is unless the programming language can effectively utilize the performance. The following is an initial list of experiments to be performed:

Cross-approach evaluation. There are several very different families of solutions to the problem: comparisons are frequently made within a given family, but rarely between families to evaluate their overall effectiveness. This project will design and conduct experiments to compare different approaches to interfacing programming languages and databases, rather than comparing alternatives within an approach as in previous studies. Approaches that support static typing and modular composition of programs will be compared and evaluated against a hand-optimized solution, with no requirements for typing or modularity, to set a baseline for performance. Target approaches include 1) persistent object systems, 2) object-relational mapping tools, 3) safe query objects, and 4) queries extracted from navigational code. Cross-approach evaluation is important because software developers must choose between different approaches, not simply select a particular implementation of a given approach.

Latency, caching and query optimization. There are likely to be many subtle relationships between the latency and performance of distributed caches when accessed via navigation versus queries. One important hypothesis is that latency will distinguish query-based from navigational approaches. Application developers are primarily concerned with overall latency, which is especially an issue when the cache hit rate is low. A second hypothesis is that sufficiently precise queries, which specify all the data needed to perform an operation, may eliminate the need for application-level caches. The cost of query extraction and optimization may also be a significant factor. There may be a threshold on query complexity or data size below which explicit queries are not effective.

Expressiveness of extracted queries.
I will evaluate the range of queries that can be extracted from navigational code using the techniques described above. The language of queries that can be extracted may be less expressive than a full, general-purpose query language. I hypothesize that the degree of content-dependence in navigation affects query extraction and the resulting performance. In designing a test with random navigation, the performance characteristics are very different depending on whether the random navigation path is known before navigation starts or depends upon the data being returned. This makes a difference because it affects the ability to plan a query in advance.

Benchmark design. This project will explore the use of open source applications to generate realistic benchmark workloads. Open source applications comparable to the business management applications from Oracle and SAP are an ideal testbed for the usability and performance of data access techniques. One significant challenge to this approach is creating realistic data sets.

Database server benchmarks typically specify requirements on redundancy, concurrency, recovery, and transactions. However, the effect of these factors on the interface between languages and databases has not been systematically evaluated. For example, the original OO7 benchmark does not model concurrency, and the multi-user version of OO7 has not been widely evaluated. In order to compare the effect of distributed caching versus a cacheless system with query optimization, it is essential to evaluate the effect of redundancy.

7 Educational Innovation

I believe that the traditional separation of database, programming, and language instruction is no longer effective. At the undergraduate level, database management courses typically introduce database concepts but do not systematically address how they are used within a larger system design. Presenting a language like SQL in isolation does not enable students to appreciate its value or understand how it relates to the broader field. Although advanced database implementation courses are typically more focused on the internals of databases, there are also significant implementation issues related to database integration. On the other hand, persistent data is rarely addressed in either the sequence of programming courses or the undergraduate programming languages course. At the graduate level the same pattern applies. Graduate students can also benefit from a systematic review of the theory and research opportunities related to integration. The compartmentalization of instruction perpetuates the segregation of database and programming language expertise in academia and industry.

This project will develop and evaluate an integrated approach to teaching programming language and database concepts. I will approach the problem at two levels: incremental changes to existing courses, and creation of new courses focused on integration. Changes to existing courses can have a significant impact on understanding of the interfaces between topics. I will modify the database course to explain how databases can be integrated into larger systems, and identify the issues that affect design choices for integration. I will also modify the undergraduate programming language course to discuss techniques for access to persistent data. I will teach both of these courses during this project, so I will be well-placed to achieve these goals. The goal is to give undergraduates a foundation for designing and building systems that integrate languages and databases.

At the graduate level I will also integrate the curriculum covering programming languages and databases.
The graduate Database Management course currently includes projects that require the use of data access libraries, but there is no systematic discussion of different data access approaches or of underlying principles. The graduate programming language course, which is focused on semantics and type theory, makes no mention of persistence or of the connection between the theory of programming languages and database theory. I will develop material to cover these topics.

Rather than develop a new undergraduate course specifically on database integration, I will design a new course that addresses integration issues in general. Database integration will be a key element of the course. However, modern systems are frequently built by combining multiple different large-scale components, including databases, workflow engines, and presentation systems. The goal of the course will be to teach students how large-scale systems can be built from existing components, and also to let them experience first-hand the difficulty of effectively integrating these parts into a complete system. This understanding is likely to be increasingly important as more and more software is built by integrating large components.

7.1 Educational Achievements

One of my goals in joining academia after ten years of commercial software development was to bring my practical experience into the classroom and to help bridge the gap between academia and industry.

Teaching. In the fall of 2003 I taught a reading course on Integrating Programming Languages and Databases. Ten graduate students participated and several of the project reports have been submitted for publication. The broad exposure to current research on object-oriented databases, query optimization, type theory, and language semantics was challenging to the students, but they appreciated the integration of theoretical and pragmatic material. I summarized the results from this course into a lecture that I then delivered to the undergraduate Database Management course and to the required graduate course on Database System Implementation at UT Austin. I taught the undergraduate Programming Language course in the spring of 2004 and will teach the graduate Programming Language course in the fall of 2004.

Even as a practicing software engineer I have sought out opportunities to teach. While at Apple I taught a graduate Programming Language course at Santa Clara University. While managing large groups of software engineers I prepared and delivered lectures on application development, interviewing skills and data mining.

Advising. In my first year as an assistant professor I initiated a research program on languages and databases. I am advising an advanced PhD student in preparing a thesis proposal on integrating programming languages and databases. During the course of this project I will look for a second PhD student to complete the work. I am on the doctoral committee of a student who is developing a formal semantics for an approach to poly-lingual programming. Two undergraduates completed research projects with me in the last year, on extensibility of data types and design of a database-oriented cell-phone application.

I enjoy working with students to help them identify important problems and sharpen their skills to create, evaluate, and publish solutions to these problems. My combination of theoretical and industrial experience is attractive to students: practical experience is particularly useful in identifying what problems are worth solving, while theoretical background provides the basis for finding fundamental solutions. I hope to set a good example in connecting pure research to broader impact.

8 Broader Impact

The results from this project will help to improve the performance and reliability of software, while reducing development costs.
Performance will be increased by enabling programming languages to better leverage the power of databases for query optimization. Reliability will be increased by supporting static typing of queries. Cost of software development can be reduced by modular definition and reuse of queries. Currently many applications that could benefit from using a database do not because the task of integration is too difficult. If programming languages can more easily integrate database functionality, more systems will be designed to take advantage of the reliability and scalability provided by databases.

The research plan will result in two integration approaches. The first, safe query objects, is designed to be integrated into existing data access technologies like EJB and JDO. Since these technologies are widely used, there is high leverage to these improvements. However, safe query objects are also subject to the limitations of these technologies. The second solution, query extraction, is a more fundamental solution to the problem, but will require much longer to be ready for broad adoption.

The solutions developed in this project will be defined at a fundamental, theoretical level, so that they will apply to a wide range of languages and applications. The empirical validation of the results will provide evidence that the techniques are worth adopting. The practical applications of the project make it possible to communicate the results through tutorials and professional/trade publications. All software developed in the project will be freely available on the web as a basis for training and experimentation.

References

[1] S. Alagic and J. Solorzano. Java and OQL: A reflective solution for the impedance mismatch. L’OBJET, 6(2), 2000. Also in Akmal B. Chaudhri (Ed.), Java and Databases, Hermes Penton Science, ISBN 1903996155, 136 pages, March 2002.

[2] ANSI/INCITS. Database languages - SQLJ - part 1: SQL routines using the Java programming language. Technical Report 331.1-1999, ANSI/INCITS, 1999.

[3] M. P. Atkinson and O. P. Buneman. Types and persistence in database programming languages. ACM Comput. Surv., 19(2):105–170, 1987.

[4] M. P. Atkinson, L. Daynes, M. J. Jordan, T. Printezis, and S. Spence. An orthogonally persistent Java. SIGMOD Record, 25(4):68–75, 1996.

[5] D. Barry and T. Stanienda. Solving the Java object storage problem. Computer, 31(11):33–40, 1998.

[6] P. A. Bernstein, S. Pal, and D. Shutt. Context-based prefetch for implementing objects on relations. In The VLDB Journal, pages 327–338, 1999.

[7] S. Blackburn and J. N. Zigman. Concurrency — the fly in the ointment? In POS/PJW, pages 250–258, 1998.

[8] G. Bracha and W. Cook. Mixin-based inheritance. In Proc. of ACM Conf. on Object-Oriented Programming, Systems, Languages and Applications, pages 303–311, 1990.

[9] J. Brant and J. W. Yoder. Creating reports with query objects. In B. F. Neil Harrison and H. Rohnert, editors, Pattern Languages of Programs Design 4. Addison Wesley, 2000.

[10] P. Canning, W. Cook, W. Hill, J. Mitchell, and W. Olthoff. F-bounded polymorphism for object-oriented programming. In Proc. of Conf. on Functional Programming Languages and Computer Architecture, pages 273–280, 1989.

[11] P. Canning, W. Cook, W. Hill, and W. Olthoff. Interfaces for strongly-typed object-oriented programming. In Proc. of ACM Conf. on Object-Oriented Programming, Systems, Languages and Applications, pages 457–467, 1989.

[12] M. J. Carey, D. J. DeWitt, C. Kant, and J. F. Naughton. A status report on the OO7 OODBMS benchmarking effort. In Proc. of ACM Conf. on Object-Oriented Programming, Systems, Languages and Applications, pages 414–426. ACM Press, 1994.

[13] M. J. Carey, D. J. DeWitt, and J. F. Naughton. The OO7 benchmark. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 12–21. ACM Press, 1993.

[14] M. J. Carey, D. J. DeWitt, J. F. Naughton, M. Asgarian, P. Brown, J. E. Gehrke, and D. N. Shah. The BUCKY object-relational benchmark. In Proceedings of the 1997 ACM SIGMOD international conference on Management of data, pages 135–146. ACM Press, 1997.

[15] R. G. G. Cattell, D. K. Barry, M. Berler, J. Eastman, D. Jordan, C. Russell, O. Schadow, T. Stanienda, and F. Velez, editors. The Object Data Standard ODMG 3.0. Morgan Kaufmann, January 2000.

[16] E. Cecchet, J. Marguerite, and W. Zwaenepoel. Performance and scalability of EJB applications. In Proceedings of the 17th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, pages 246–261. ACM Press, 2002.

[17] D. Cengija. Hibernate your data. onJava.com, 2004.

[18] W. Cook. A Denotational Semantics of Inheritance. PhD thesis, Brown University, 1989.

[19] W. Cook. A proposal for making Eiffel type-safe. In Proc. European Conf. on Object-Oriented Programming, pages 57–70. British Computing Society Workshop Series, 1989. Also in The Computer Journal, 32(4):305–311, 1989.

[20] W. Cook. Interfaces and specifications for the Smalltalk collection classes. In OOPSLA, 1992.

[21] W. Cook, W. Hill, and P. Canning. Inheritance is not subtyping. In Proc. of the ACM Symp. on Principles of Programming Languages, pages 125–135, 1990.

[22] W. Cook and J. Palsberg. A denotational semantics of inheritance and its correctness. In Proc. of ACM Conf. on Object-Oriented Programming, Systems, Languages and Applications, pages 433–444, 1989.

[23] W. Cook and S. Rai. Safe query objects: Statically-typed objects as remotely-executable queries. Technical Report TR-04-17, UT Austin Computer Science, May 2004.

[24] G. Copeland and D. Maier. Making Smalltalk a database system. In Proceedings of the 1984 ACM SIGMOD international conference on Management of data, pages 316–325. ACM Press, 1984.

[25] J. Darmont, B. Petit, and M. Schneider. OCB: A generic benchmark to evaluate the performances of object-oriented database systems. In 6th International Conference on Extending Database Technology (EDBT 98), Valencia, Spain, volume 1377 of LNCS, pages 326–340, Heidelberg, Germany, March 1998. Springer Verlag.

[26] R. Davies and F. Pfenning. A modal analysis of staged computation. In Conf. Record 23rd ACM SIGPLAN/SIGACT Symp. on Principles of Programming Languages, POPL’96, St. Petersburg Beach, FL, USA, 21–24 Jan 1996, pages 258–270. ACM Press, New York, 1996.

[27] J.-A. Dub, R. Sapir, and P. Purich. Oracle Application Server TopLink application developers guide, 10g (9.0.4). Oracle Corporation, 2003.

[28] M. J. Franklin, B. T. Jónsson, and D. Kossmann. Performance tradeoffs for client-server query processing. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 149–160. ACM Press, 1996.

[29] R. P. Gabriel. Patterns of software: tales from the software community. Oxford University Press, Inc., 1996.

[30] A. Gawecki and F. Matthes. The Tycoon machine language TML: An optimizable persistent program representation. Technical Report FIDE/94/100, FIDE Project Coordinator, Department of Computing Sciences, University of Glasgow, Glasgow G128QQ, Aug. 1994.

[31] A. Gawecki and F. Matthes. Integrating query and program optimization using persistent CPS representations. In M. P. Atkinson and R. Welland, editors, Fully Integrated Data Environments, ESPRIT Basic Research Series, pages 496–501. Springer Verlag, 2000.

[32] C. Gould, Z. Su, and P. Devanbu. Static checking of dynamically generated queries in database applications. In Proceedings, 26th International Conference on Software Engineering (ICSE). IEEE Press, 2004.

[33] G. Hamilton and R. Cattell. JDBC™: A Java SQL API. Sun Microsystems, 1997.

[34] Z. He and J. Darmont. Evaluating the dynamic behavior of database applications. Journal of Database Management, 2004.

[35] U. Hohenstein. Bridging the Gap Between C++ and Relational Databases. In European Conference on Object-Oriented Programming, 1996.

[36] U. Hohenstein, V. Plesser, and R. Heller. Evaluating the performance of object-oriented database systems by means of a concrete application. Theor. Pract. Object Syst., 5(4):249–261, 1999.

[37] Apple Computer, Inc. AppleScript Language Guide. Addison-Wesley, 1993.

[38] INCITS/ISO/IEC. Information technology - database languages - SQL - part 5: Host language bindings (SQL/Bindings). Technical Report 9075-5-1999, INCITS/ISO/IEC, 1999.

[39] ISO/IEC. Information technology - database languages - SQL - part 3: Call-level interface (SQL/CLI). Technical Report 9075-3:2003, ISO/IEC, 2003.

[40] G. Kirby, R. Morrison, and D. Stemple. Linguistic reflection in Java. Software–Practice and Experience, 28(10):1045–1077, 1998.

[41] D. Leijen and E. Meijer. Domain specific embedded compilers. In Proceedings of the 2nd conference on Domain-specific languages, pages 109–122. ACM Press, 1999.

[42] X. Leroy. The Objective Caml System. INRIA, Rocquencourt, France, 1997.

[43] B. Liskov, A. Adya, M. Castro, S. Ghemawat, R. Gruber, U. Maheshwari, A. C. Myers, M. Day, and L. Shrira. Safe and efficient sharing of persistent objects in Thor. In Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pages 318–329. ACM Press, 1996.

[44] D. Maier. Representing database programs as objects. In F. Bancilhon and P. Buneman, editors, Advances in Database Programming Languages, Papers from DBPL-1, pages 377–386. ACM Press / Addison-Wesley, 1987.

[45] D. Maier, J. Stein, A. Otis, and A. Purdy. Development of an object-oriented DBMS. In Proc. of ACM Conf. on Object-Oriented Programming, Systems, Languages and Applications, pages 472–482, 1986.

[46] A. Marquez, S. Blackburn, G. Mercer, and J. N. Zigman. Implementing orthogonally persistent Java. In Proceedings of the Workshop on Persistent Object Systems (POS), 2000.

[47] V. Matena and M. Hapner. Enterprise Java Beans Specification 1.0. Sun Microsystems, 1998.

[48] B. Mathiske, F. Matthes, and J. Schmidt. Scaling database languages to higher-order distributed programming. In Proceedings of the Fifth International Workshop on Database Programming Languages, Gubbio, Italy, Sept. 1995.

[49] F. Matthes, A. Rudloff, J. Schmidt, and K. Subieta. A gateway from DBPL to Ingres. In W. Litwin and T. Risch, editors, Applications of Databases, First International Conference, ADB-94, volume 819, pages 365–380, Vadstena, Sweden, 1994. Springer-Verlag.

[50] F. Matthes and J. Schmidt. System construction in the Tycoon environment: Architectures, interfaces and gateways. In P. P. Spies, editor, Proceedings Euro-ARCH, pages 301–317. Springer-Verlag, 1993.

[51] F. Matthes and J. W. Schmidt. Bulk types: built-in or add-on? In Proceedings of the third international workshop on Database programming languages : bulk types & persistent data, pages 33–54. Morgan Kaufmann Publishers Inc., 1992.

[52] F. Matthes, G. Schroder, and J. Schmidt. Tycoon: A scalable and interoperable persistent system environment. In M. Atkinson, editor, Fully Integrated Data Environments. Springer-Verlag, 1995.

[53] E. Meijer and W. Schulte. Programming with rectangles, triangles, and circles. In Proceedings of Conference on XML, 2003.

[54] Microsoft Corporation. OLE DB/ADO: Making universal data access a reality, 1998.

[55] R. Morrison, R. Connor, G. Kirby, D. Munro, M. Atkinson, Q. Cutts, A. Brown, and A. Dearle. The Napier88 persistent programming language and environment. In Fully Integrated Data Environments, pages 98–154. Springer, 1999.

[56] C. Russell. Java Data Objects (JDO) Specification JSR-12. Sun Microsystems, 2003.

[57] J. Schmidt, F. Matthes, and P. Valduriez. Building persistent application systems in fully integrated data environments: Modularization, abstraction and interoperability. In Proceedings of Euro-Arch’93 Congress. Springer Verlag, Oct. 1993.

[58] J. W. Schmidt and F. Matthes. The rationale behind DBPL. In Proc. of the 3rd Symposium on Mathematical Fundamentals of Database and Knowledge Base Systems (MFDBS-91), pages 389–395, Rostock, Germany, 1991.

[59] J. W. Schmidt and F. Matthes. The DBPL project: advances in modular database programming. Inf. Syst., 19(2):121–140, 1994.

[60] Siemens. Proceedings of the 1997 European Pattern Languages of Programming Conference, number 120/SW1/FB in Siemens Technical Report, Irrsee, Germany, 1997. Siemens.

[61] M. Tatsubori, S. Chiba, K. Itano, and M.-O. Killijian. OpenJava: A class-based macro system for Java. In OORaSE’99 Workshop on Object-Oriented Reflection and Software Engineering, pages 117–133, Denver, Colorado, USA, Nov. 1999. ACM.

[62] A. Tiwary, V. R. Narasayya, and H. M. Levy. Evaluation of OO7 as a system and an application benchmark. In OOPSLA Workshop on Object Database Behavior, Benchmarks and Performance, Oct. 1995.

[63] M. Venkatrao and M. Pizzo. SQL/CLI – a new binding style for SQL. SIGMOD Record, 24(4):72–77, 1995.

[64] N. Welsh, F. Solsona, and I. Glover. SchemeUnit and SchemeQL: Two little languages. In Third Workshop on Scheme and Functional Programming, 2002.

[65] J. Yoder, R. Johnson, and Q. Wilson. Connecting business objects to relational databases, 1998.
