SPENDVIEW: A platform for democratizing access to government budget and expenditure data

Manuel Aristaran

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences at the Massachusetts Institute of Technology

June 2016

Massachusetts Institute of Technology, 2016. All rights reserved.

Author: Manuel Aristaran, Program in Media Arts and Sciences, May 5, 2016

Certified by: César A. Hidalgo, Associate Professor of Media Arts and Sciences, Program in Media Arts and Sciences

Accepted by: Pattie Maes, Academic Head, Program in Media Arts and Sciences


Abstract

Budgeting and expenditure data is the clearest expression of a government's priorities. Despite its importance, making it available to the public poses hard challenges that not every administration is ready to undertake. The lack of IT capabilities and the constrained resources of government agencies, regardless of size or budget, make it difficult for them to respond to the demands for information from their constituencies, transparency advocates, the press and central governments. Moreover, these administrations don't usually have access to analysis tools that help them see how resources are being used and detect potential risks of misspending, a critical need for elected officials, who are under everyday scrutiny. Responding to these needs, I propose the design and implementation of a software platform for the analysis and visualization of government budget and expenditure data, together with studies of its impact and usability.

Thesis Supervisor: Dr. César A. Hidalgo, Associate Professor of Media Arts and Sciences, Program in Media Arts and Sciences


Thesis Reader: Ethan Zuckerman, Principal Research Scientist, Center for Civic Media, Massachusetts Institute of Technology


Thesis Reader: Dr. Iyad Rahwan, Associate Professor of Media Arts and Sciences, Massachusetts Institute of Technology

Acknowledgments

This thesis is dedicated to my wife Luisina Pozzo Ardizzi, partner in life and accomplice in countless crazy adventures. None of this would have been possible without your love and support.

When I first got connected to the internet, the coolest and craziest things were hosted in the mit.edu domain. It was through listserv archives coming from the mythical 18.0.0.0/8 class-A IP network that I was first exposed to the hacker culture. Sneakily typing commands on a VT-100 terminal in the computer lab of the university of my hometown in southern Argentina, my 14-year-old self fantasized about this mysterious and faraway university in the USA.

Twenty years later, I met César Hidalgo during a fortuitous visit to the MIT Media Laboratory, and I asked him what I needed to do to study in his group.

"Application period opens in September, huevón," he answered.

I didn't think that I had a chance to study here: it was MIT, home to Minsky, Papert, Shannon and Chomsky. The word hacker was coined here! Besides, I was 33. But my wife talked me into trying (she can talk me into doing anything). I took the English sufficiency test, painstakingly crafted the admission essay, put the paperwork together and submitted my application. Two years later, I'm about to graduate from this magical place, but still find it hard to believe that César trusted me and accepted me as a student of Macro Connections. I will always be grateful for his confidence in me, and for always encouraging all his students to think big, work hard and go farther.

Macro Connections gang, you will always be in my heart. Cristian Jara Figueroa, Kevin Hu, Dominik Hartmann, Flavio Pinheiro: thank you for the laughs, your friendship, your support, your always insightful feedback and kind criticism, and the infinite number of ping-pong matches. I'm grateful for the friendship of the rest of my academic siblings: Mia Petkova, Amy Yu, Mary Kaltenberg, Elisa Castañer, Ambika Krishnamachar, Miguel Guevara and Jermain Kaminski.

Thanks to my fellow Media Lab students, faculty and staff. You have truly made me feel at home away from home. Special acknowledgments to my thesis readers, Ethan Zuckerman and Dr. Iyad Rahwan, for their insightful feedback and encouragement during the process of writing this thesis.

I would also like to acknowledge the government of Bahia Blanca, my hometown, which through its Innovation and Open Government Agency allowed me to test the ideas, concepts and tools outlined in this thesis using the city's data. I also thank Uruguay's Budget and Planning Office and the Latin American Initiative for Open Data for hosting me and facilitating my research in beautiful Montevideo.

Finally, I would like to thank my family: my father Roberto, my mother Maria Ines, my brother Agustin and my wife Luisina. Gracias for always reminding me that love is the only thing that matters.

Table of Contents

Introduction
    Motivation
    The nature of fiscal information
    Contribution

Background
    Prior Work
    The Relevance of Dimensional Modeling
    Multidimensional Modeling of Fiscal Information

Implementation
    Design Rationale
        Design principles
        Narratives
        Hypermedia
        Internationalization
        Openness and reusability
    Mondrian-REST: a Web-friendly API for OLAP
        Multidimensional eXpressions (MDX)
        A REST endpoint for OLAP
        Implementation Details
    User experience and user interface
        Overview of a Budget
        Drilling down on a budget classification
        Exploring the different perspectives of the budget
        Budget dimensions as questions
        Data verbalization as a complement of visualizations
        Multiple graphical representations
        Exploring the data
        Sharing
    Implementation of the user interface
        Isomorphic Web applications
        Visibility to non-rich internet clients

Impact
    Bahia Blanca: from confrontation to collaboration
    Working with Uruguay's budget office
        Workshop
    User evaluation of SpendView
    Public Deployment
        Usage data

Conclusion
    Future Work
    Concluding Remarks

List of Figures
Index of Tables
Bibliography

Introduction

Motivation

Governments, and the public sphere in general, have taken part in the data explosion of the last decade. In the last 10 years, numerous public entities around the world have joined the Open Data movement, which seeks to publish government-generated, machine-readable data under licenses that allow for unrestricted use, redistribution, modification and propagation (Open Knowledge Foundation, 2015).

However, the increased availability of information resources hasn't always been accompanied by interactive tools that present this data in actionable formats and allow end users to make informed decisions. The traditional design paradigm for the existing tools is the dashboard, which, just like those in vehicles, allows an analyst or operator to obtain information at a glance from graphical representations of the information generated by the system (Figure 1).

Figure 1: A traditional dashboard for government finances

Dashboards predate the Web, so they don't usually incorporate some of the interaction patterns and conventions that became predominant in the last 20 years. Although they allow users to filter the information being displayed, it is usually cumbersome or impossible to share a particular state (view) of the charts and data tables. Also, because dashboards are designed to provide information at a glance, they tend to underutilize the scroll interaction, now prevalent in mobile interfaces. Lastly, besides chart captions, typical dashboards don't incorporate text as a way of describing data to their users.

Lately, visualization engines have emerged that improve on the classic dashboard design logic by adopting elements of linear storytelling and now-conventional patterns of user interaction. The profile page, for instance, is one such convention. In the context of a Web page, it contains information about an entity of the system (a user in a social network, an article of an online encyclopedia) and presents its information hierarchically.

Notable examples of these ideas include the latest version of the Web site The Observatory of Economic Complexity (Simoes Gaspar, 2012), whose main entities are profile pages for countries and products, and Data USA (Datawheel, Macro Connections / MIT Media Lab, & Deloitte, 2016), which uses the same technique to offer information about geographical entities, industries and academic majors.

Fiscal information, defined as detailed records about budget, spending and revenue, is often recorded in multidimensional datasets, a structure that has traditionally been visualized with dashboards. This work will attempt to show that incorporating design elements such as profile pages for the entities defined in a budget, linear narratives instead of configurable visualizations, automatically generated descriptive text along with charts, and indexable and shareable pages can improve the comprehension, accessibility and diffusion of fiscal information.

The nature of fiscal information

The data generated during the process of formulation and execution of a government's budget is one of the information resources most relevant to the public interest. An administration's policies, encoded in its fiscal measures, influence the lives of its constituents, affecting services such as health, education and utilities. Budget information can be interpreted as what is known in the economic literature as the revealed preferences (Samuelson, 1948) of a consumer, in this case a government administration. It can be argued, then, that public expenditure and budgetary records, when analyzed and visualized with effective techniques and devices, can inform the people, researchers, and public servants about the priorities of their government.

However, fiscal information is inherently complex. The budget design process is a multi-stakeholder procedure, in which disparate actors have different and often competing interests. The politician sees it as an event conducted in the political arena for political advantage, the economist as a question of resource allocation, the accountant focuses on the auditing perspective, while the public servant sees the budget as a blueprint for the implementation of policies (Lynch & others, 1979). This multiplicity of perspectives is directly translated into the structure of the information generated by the budgeting and expenditure processes: the data is recorded in mutually independent budget classifications (Jacobs, Hélis, & Bouley, 2009), each of them serving the needs of different users of the information.

The complexity of fiscal information lies in its multi-dimensional nature. This presents an opportunity to the designer of a tool for presenting it effectively: it is through the combination of these different perspectives (i.e., dimensions) that useful knowledge emerges.

Contribution

In this thesis I design and implement SpendView, a general tool to visualize and analyze government spending data.

To create SpendView, I developed two components: Mondrian-REST, a server-side software component that allows the creation of HTTP APIs for accessing a data cube by specifying the logical structure of the information, and a Web-based platform that allows exploration of a public budget through interactive visualizations.

Finally, I examine public deployments of the system within the government of Uruguay, and the city of Bahia Blanca (Argentina).

Background

Prior Work

To make decisions, governments often need to gather and analyze information. However, public availability of government information is a relatively recent phenomenon: the first public records legislation was introduced by Sweden in the late 18th century (Mustonen, 2006). The introduction and subsequent democratization of digital information processing techniques have naturally prompted governments and civil society to extend the principles of access to public information to the "raw" or unprocessed data records generated by the state, instead of only aggregate statistics and prepared reports.

In 2009 the United States Government also began pushing open data when President Barack Obama introduced the Open Government Directive (Orszag, 2009), which mandated "executive departments and agencies to take specific actions to implement the principles of transparency, participation, and collaboration". This mandate institutionalized the demand for digitized public information, and nurtured the emergence of several movements which can be collectively described as civic technology.

In this context of readily available machine-readable datasets generated by governments, and a community of eager journalists, civic hackers and companies, the last few years saw an explosion of platforms that attempt to make often complex information more accessible to non-specialist audiences. Fiscal (budget, expenditure and revenue) information is a natural candidate for this kind of tool, since management of public money has been historically associated with transparency and oversight of the activities of the government.

Although there are precedents of governments publishing contracting and expenditure information online as early as the year 2000 [1], those initial attempts were simple "data dumps" from the administrations' financial management information systems (FMIS), which often failed at making the fiscal information accessible to the general public and public servants. One of the earliest and most referenced examples of a platform that made use of interactive data visualizations is Where does my money go? (Gray & Chambers, 2012), originally launched in 2009 [2] by the UK non-profit Open Knowledge Foundation. This project evolved into Open Spending, an openly accessible data visualization platform with features tailored to datasets pertaining to fiscal information.

[1] The city of Bahia Blanca (Argentina) implemented the Active Control system in September 2000: http://www.bahiablanca.gov.ar/digesto/Ordenanzal.html?ord=11162

Figure 2: Where does my money go? One of the first examples of data visualization applied to fiscal data

[2] "Web to watch government spending", http://news.bbc.co.uk/2/hi/technology/8407779.stm (accessed Mar 28, 2016).

Commercial vendors also offer tools for visualization of budget and spending data, prompted by the Open Government and Open Data movements that originated in 2007 and were institutionalized by the aforementioned presidential memorandum. The Seattle-based company Socrata sells Open Budget, a platform adopted by several districts in the US. Another commercial offering is OpenGov, which also provides governments with a dashboard for the display of fiscal information.

Figure 3: Socrata Open Budget showing City of Cambridge's education expenditure

Besides the tools originated from the non-profit and commercial sectors, some governments have chosen to develop these tools in-house. A common issue with government-built platforms is that they do little to make fiscal information more accessible. Frequently, the tools that governments make available to the public are trimmed-down versions of the tools that are used internally. A comprehensive survey of government fiscal information portals can be found in Dener & Min (2013).

News organizations have also developed tools to assist in the interpretation of budgetary data. The New York Times' Four Ways to Slice Obama's 2013 Budget Proposal highlights multiple perspectives on the multi-dimensional nature of budget datasets.

Figure 4: New York Times visualization of the 2013 US Budget proposal

My own past experiences with information tools for fiscal information also influenced the design and implementation of SpendView. Gasto Público Bahiense (Bahia Blanca Public Spending) (Aristaran, 2010) is a site that displays procurement information from the government of Bahia Blanca, my hometown. Presupuesto Abierto (Aristaran, 2015), a project made as part of the initial explorations for the research presented in this thesis, is a system for graphical display of fiscal information, developed in collaboration with the government of the City of Bahia Blanca (Argentina).

Figure 5: PresupuestoAbierto.org displaying fiscal information for the cities of Boston (USA) and Bahia Blanca (Argentina)

These examples show that dashboard systems are a frequent choice of paradigm for visualizing fiscal information. We define a dashboard as a set of graphical representations of different perspectives of the data and interface elements that, similarly to the dashboard of a car or the cockpit of an airplane, allow the user to obtain knowledge at a glance about the state of the system, described by the measurements contained in the data.

I believe that the dashboard paradigm is not always well suited to the interaction conventions of the World Wide Web. Namely, dashboards don't adapt well to low-resolution displays (e.g., mobile devices), don't provide hyperlinks to individual configurations of filters, don't take advantage of design elements of linear narrative, and are usually "invisible" to search engines.

The Relevance of Dimensional Modeling

The dimensional modeling technique (Kimball & Ross, 2002) allows users to model knowledge about a system using the concepts of orthogonal dimensions and measures.

Dimensions are the structural attributes of the system of interest, which belong to a similar category in the user's perception of data. Measures are the quantitative observations about the system. For example, for a business that sells products in multiple markets and is interested in tracking its performance over time, the dimensions of interest will be product, market and time, while the only measure in the model will be the price at which the products are sold. Additionally, dimensions can be structured hierarchically: in the previous example, products can be grouped in categories and markets according to their geographical location, while the chronological dimension can be analyzed by year, quarter and month.

Dimensions and measures are then linked together in a data cube, which becomes the central entity of a system. The technologies that enable the implementation of these systems are known as Online Analytical Processing (OLAP). In contrast to Online Transaction Processing (OLTP) systems, which support operational tasks such as recording individual transactions in an organization, OLAP systems are designed and optimized for executing complex, aggregated queries.

As an example, each record in UN COMTRADE (United Nations, 2014), a database of international trade between countries that has collected data since 1962, contains information about the monetary value of imports and exports between a country of origin and a country of destination, recorded by year and by product. Products are classified according to a predefined hierarchy.

The entities (time, countries, products, import/export amount) of this database can be mapped directly to concepts in the dimensional modeling paradigm as follows:

Dimensions

* Chronological
    o Year
* Origin Country (Geographical)
    o Continent
    o Country
* Destination Country (Geographical)
    o Continent
    o Country
* Products
    o Classification level 1
    o Classification level 2
    o Classification level 3

Measures

* Imports amount
* Exports amount

Users of this database may never need to know that in 2015 Argentina exported US$13.8 million worth of oranges to Greece (an individual fact in the database); they are more likely to ask what the trade balance between those two countries was over the last 15 years.

With the information arranged in the aforementioned cube in an OLAP system, the user can apply operations to the data structure to obtain the desired information:

* Slice (filter) the cube along the Origin Country dimension, choosing the value "Argentina"
* Slice (filter) the cube along the Destination Country dimension, choosing the value "Greece"
* Drill down to the "Year" level in the Time dimension
* Roll up the Imports and Exports measures, applying the summarization rule: Balance = Exports - Imports
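The four operations listed above can be sketched in a few lines of Python. This is only an illustration over an in-memory list of records, not how an OLAP engine works internally; the fact values and the function name `trade_balance_by_year` are invented for the example.

```python
from collections import defaultdict

# Toy fact table: (year, origin, destination, exports, imports).
# All values are invented for illustration.
FACTS = [
    (2014, "Argentina", "Greece", 120.0, 30.0),
    (2014, "Argentina", "Brazil", 900.0, 700.0),
    (2015, "Argentina", "Greece", 150.0, 45.0),
    (2015, "Chile", "Greece", 80.0, 20.0),
    (2015, "Argentina", "Greece", 10.0, 5.0),
]


def trade_balance_by_year(facts, origin, destination):
    """Slice on origin and destination, drill down to year, roll up the balance."""
    totals = defaultdict(float)
    for year, fact_origin, fact_destination, exports, imports in facts:
        # The two slice operations: keep only the selected members.
        if fact_origin == origin and fact_destination == destination:
            # Roll-up with the rule Balance = Exports - Imports,
            # accumulated at the Year level (the drill-down).
            totals[year] += exports - imports
    return dict(sorted(totals.items()))


balance = trade_balance_by_year(FACTS, "Argentina", "Greece")
print(balance)
```

The individual facts never reach the user; only the aggregated balance per year does, which is the kind of answer the example question asks for.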

In terms of implementation, multidimensional models are often materialized in relational databases, but their table structures differ substantially from the conventional third normal form (3NF) (Codd, 1972) found in OLTP systems. 3NF databases are usually structured following the Entity-Relationship Model paradigm (Chen, 1976), which seeks to reduce redundancy in the representation of entities and the relationships among them, with the goal of optimizing the system for a high volume of individual transactions, as each operation will involve the minimum possible number of modifications. The structure of a database schema suitable for an OLAP system allows for redundancy, as it does not need to support a high volume of individual modifications. In contrast to 3NF structures, these schemas are known as "star schemas", examples of which are shown in Figure 6 and Figure 7.

Multidimensional Modeling of Fiscal Information

Fiscal information, both revenue and expenditure, is recorded under different classifications that are independent of each other. Each line of a public budget is usually attributed to at least four different hierarchical classifications, namely administrative, economic, functional, and chronological (Jacobs et al., 2009). The administrative classification identifies the area of government that is responsible for administering the funds, while the economic one records the type of expenditure incurred, for example, salaries, goods and services, interest or capital spending. The functional classification codifies the purpose and objectives of the spending, such as education, public safety or community development. The latter has been standardized by the United Nations in the Classification of the Functions of Government (COFOG) manual, to facilitate comparisons between government administrations (Department of Economic and Social Affairs, 2000), while the economic one was codified in the Government Finance Statistics Manual by the International Monetary Fund (International Monetary Fund, 2014).

Some government administrations use additional classification criteria. Among them are the geographic classification, which identifies the region in which the revenue was collected or the expenditure was incurred, and the funding source, which classifies the origin of the monetary resources received or spent.

Each line of the fiscal dataset, which contains references to entities in the classifications in use, also carries monetary amounts, which correspond to the phases of the budget execution process. Typically, these are: proposed (budget proposal by the executive or legislative branches of government), approved (budget as approved by the legislative branch), adjusted (subsequent modifications during the fiscal year) and executed (revenues actually received or expenses actually incurred). Other administrations may choose to use additional phases such as committed or paid.

These mutually independent classifications facilitate multiple perspectives of analysis. An economist doing long-term planning could be interested in the yearly amounts broken down by the first level of the functional classification, while an auditor might want to obtain a detailed report of spending by week of a particular month, for a certain department.

Classification / phases          Level                  Value

Chronological Classification     Year                   2015
                                 Month                  April

Administrative Classification    Title                  Ministry of Agriculture
                                 Section                Secretariat of Community Health
                                 Chapter                Undersecretariat of Primary Health Care

Economic Classification          Section                Ordinary Expenses
                                 Article                Compensation of employees
                                 Paragraph              Wages and Salaries in Cash

Functional Classification        Main Function          Social Services
                                 Function               Health
                                 Secondary function     Primary Health Care

Additional Classifications       Source of financing    National Treasury
                                 Geographic             Santa Fe Province

Expenditure Phases               Proposed               $ 176,455,000 ARS
                                 Adopted                $ 167,990,000 ARS
                                 Committed              $ 101,524,100 ARS
                                 Paid                   $ 91,723,100 ARS

Table 1: Example classification and phases for a line of an expenditure budget

Table 1 illustrates how each classification has features that differ from and are independent of the others. In well-designed budget structures, each transaction is attributed unambiguously to each of the defined classifications.

This normative framework for public finances, now adopted by most governments, translates almost directly to the principles of dimensional modeling. As an example, the budget structure to which the line depicted in Table 1 belongs would be represented in an OLAP system as follows, in terms of dimensions, hierarchies, levels and measures:

Dimensions

* Chronological
    o Year
    o Month
* Administrative
    o Title
    o Section
    o Chapter
* Economic
    o Section
    o Article
    o Paragraph
* Functional
    o Main Function
    o Function
    o Secondary Function
* Source of Financing
    o Source of Financing
* Geographic
    o Geography

Measures

* Proposed
* Adopted
* Committed
* Paid

A database structure suitable for storage of a typical budget dataset is depicted in Figure 6.

facts_budget: id_time, id_administrative, id_functional, id_economic, id_geographic, id_financing_source, proposed, adopted, committed, paid
dim_time: id, year, month
dim_administrative: id, id_title, title, id_section, section, id_chapter, chapter
dim_economic: id, id_section, section, id_article, article, id_paragraph, paragraph
dim_functional: id, id_main_function, main_function, id_function, function, id_sub_function, sub_function
dim_geographic: id, region
dim_financing_source: id, financing_source

Figure 6: A "star schema" for a budget
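To make the star layout more concrete, the following sketch builds a reduced version of the schema in Figure 6 with Python's built-in sqlite3 module and runs one aggregated query against it. It is an illustration under stated assumptions: only two of the dimension tables are included, the column `function_name` is a simplification introduced here, and all amounts are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced star schema: one fact table that references two dimension tables.
# Table names follow Figure 6; the column set is simplified for illustration.
conn.executescript("""
CREATE TABLE dim_time (id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_functional (
    id INTEGER PRIMARY KEY, main_function TEXT, function_name TEXT
);
CREATE TABLE facts_budget (
    id_time INTEGER REFERENCES dim_time(id),
    id_functional INTEGER REFERENCES dim_functional(id),
    proposed REAL, adopted REAL, committed REAL, paid REAL
);
INSERT INTO dim_time VALUES (1, 2015, 4), (2, 2015, 5);
INSERT INTO dim_functional VALUES
    (1, 'Social Services', 'Health'),
    (2, 'Social Services', 'Education');
INSERT INTO facts_budget VALUES
    (1, 1, 100.0, 90.0, 60.0, 50.0),
    (2, 1, 100.0, 95.0, 70.0, 65.0),
    (1, 2, 200.0, 180.0, 120.0, 110.0);
""")

# Roll up the adopted and paid measures by year and function.
rows = conn.execute("""
    SELECT t.year, f.function_name, SUM(b.adopted), SUM(b.paid)
    FROM facts_budget AS b
    JOIN dim_time AS t ON t.id = b.id_time
    JOIN dim_functional AS f ON f.id = b.id_functional
    GROUP BY t.year, f.function_name
    ORDER BY f.function_name
""").fetchall()

for row in rows:
    print(row)
```

Every analytical question becomes a join between the fact table and the dimensions of interest, followed by a GROUP BY at the desired level.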

There is an increasing number of governments publishing detailed budget information that preserves the multi-dimensionality needed for proper analysis. However, the lack of a data standard that includes information (metadata) about the classifications, taxonomies and expenditure phases in use in a particular administration is a hurdle for the adoption of effective visualization and analysis tools for public finances. The Fiscal Data Package (Björgvinsson, Pollock, & Walsh, 2016), a recent standardization effort led by the Open Knowledge Foundation, strives to provide users with such a standard. This format also recognizes the usefulness of dimensional modeling concepts for representing, storing and transmitting government budget information.

Implementation

Design Rationale

This section outlines the design principles behind SpendView, provides an overview of its implementation, and describes current known limitations of the system.

Design principles

The design of SpendView was informed by a number of design principles that attempt to improve on the dashboard paradigm as a device for visualizing multidimensional datasets, with public budgets as the focus of this work.

Narratives

Stories are perhaps the primary means of transmitting knowledge among humans. However, they are rarely used as a device for extracting knowledge from a multidimensional dataset. By structuring information in the form of the journalistic inverted pyramid, translating particular slices of the data cube into questions and their answers, and combining automatically generated graphical and linguistic summaries, SpendView aims to ease the comprehension of complex information.

Hypermedia

Although most current data exploration and visualization systems are distributed as Web applications because it is technically convenient, they don't use the language of the Web. These dashboard-type systems originated in workstations, and they retain pre-Web user interface conventions. SpendView aims to be not just an application that runs in a Web browser, but a platform that embraces the idioms of the Web, encouraging linking, embedding, indexing, archival and sharing of its content.

Internationalization

SpendView is conceived globally. Because automatically generated text is one of its main interaction devices, its user interface has been designed from the start with internationalization in mind. In its present state, it supports English and Spanish; supporting additional languages only requires creating a substitution file that contains translated strings.
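The idea of pairing generated text with per-language substitution strings can be sketched as follows. This is not SpendView's actual implementation: the `TRANSLATIONS` table, its keys, the sentence wording and the function `describe_largest` are all hypothetical, chosen only to show how a new language reduces to a new set of translated templates.

```python
# Hypothetical substitution tables; the keys and wording are illustrative only.
TRANSLATIONS = {
    "en": {"largest_item": "{name} is the largest item, with {share:.1f}% of the total."},
    "es": {"largest_item": "{name} es la partida más grande, con {share:.1f}% del total."},
}


def describe_largest(amounts, lang="en"):
    """Return a one-sentence summary of the largest entry in `amounts`."""
    name, value = max(amounts.items(), key=lambda item: item[1])
    share = 100.0 * value / sum(amounts.values())
    # The sentence structure lives in the substitution table, not in the code.
    return TRANSLATIONS[lang]["largest_item"].format(name=name, share=share)


budget = {"Health": 50.0, "Education": 30.0, "Public Safety": 20.0}
print(describe_largest(budget, "en"))
print(describe_largest(budget, "es"))
```

In this sketch only the template is translated; translating the names of the budget entities themselves would require the dataset to carry labels in each language.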

Openness and reusability

SpendView and Mondrian-REST, the two components that comprise the system, are licensed under the MIT license, and are freely available on the code-sharing platform GitHub.

Mondrian-REST: a Web-friendly API for OLAP

This section briefly describes the Multidimensional eXpressions (MDX) language, and introduces a novel querying interface for OLAP databases that support MDX.

Projects that implement tools for visualization of multidimensional data often build ad hoc interfaces for querying it. These application programming interfaces (APIs) contain knowledge about the particular dataset that they expose, frequently encoded in imperative programming languages, and the resulting loss of generality prevents reuse of those software components. OLAP databases offer a powerful declarative mechanism for specifying the structure of the data, and the query language they provide is expressive enough to encode every possible aggregation of the data.

Multidimensional eXpressions (MDX)

The MDX language was introduced by Microsoft in 1997, as part of the OLE DB for OLAP specification. It provides a specialized syntax for defining and querying multidimensional objects and data, such as the data stored in OLAP cubes. While similar to SQL, it is not limited to bi-dimensional tables (rows and columns); MDX allows the user to query structures of up to 127 dimensions, which are referred to as axes.

Taking the UN COMTRADE cube described in a previous section, a user would issue the following query to get the yearly total amount of exports from Argentina to each of the African countries:

SELECT {[Measures].[Exports]} ON Axis(0),
       [Time].[Year].Members ON Axis(1),
       [Destination Country].[Africa].Children ON Axis(2)
FROM [Trade Flow]
WHERE ([Origin Country].[South America].[Argentina])

This query selects the Exports value, which is part of the special Measures dimension, on the first axis; by default it is aggregated using the sum operator. The second axis is populated with the contents of the Time dimension, at the Year level. The third and last axis contains the immediate descendants (Children) of the Africa member of the Destination Country dimension. The data that takes part in the aggregation consists of those records (tuples) for which the Origin Country dimension is equal to Argentina, as specified by the WHERE clause. Table 2 shows an excerpt of the result of the query.

Year  Country Code  Country Name       Exports
1995  gab           Gabon              2326825
1995  gha           Ghana              2556622
1995  gin           Guinea             436794.32
1995  gmb           Gambia             11249
1995  gnq           Equatorial Guinea  144528
1995  ken           Kenya              51807120
1995  lbr           Liberia            2904068
1995  lby           Libya              6113561
1995  mar           Morocco            37701284.51

Table 2: Result from MDX query (excerpt)
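To make the semantics of the query concrete, the following Python sketch performs the same kind of aggregation over a handful of in-memory records. The record layout and all values are made up for illustration (they are not COMTRADE data), and an OLAP engine does not evaluate MDX this way; the sketch only shows how the WHERE clause and the three axes relate to filtering, grouping and summing.

```python
from collections import defaultdict

# Toy fact records: (origin, destination continent, destination, year, exports).
# The values are made up for illustration; they are not COMTRADE data.
FACTS = [
    ("Argentina", "Africa", "Gabon", 1995, 1000.0),
    ("Argentina", "Africa", "Gabon", 1995, 500.0),
    ("Argentina", "Africa", "Ghana", 1995, 200.0),
    ("Argentina", "Africa", "Ghana", 1996, 300.0),
    ("Argentina", "Europe", "Spain", 1995, 900.0),
    ("Brazil", "Africa", "Ghana", 1995, 700.0),
]


def exports_by_year_and_country(facts, origin, continent):
    """Sum exports per (year, destination) for one origin and one continent."""
    totals = defaultdict(float)
    for fact_origin, fact_continent, destination, year, exports in facts:
        # WHERE clause: keep only facts for the requested origin country.
        if fact_origin != origin:
            continue
        # Axis(2): only the children of the requested continent.
        if fact_continent != continent:
            continue
        # One cell per (Axis(1) year, Axis(2) country); Axis(0) is the summed measure.
        totals[(year, destination)] += exports
    return dict(totals)


result = exports_by_year_and_country(FACTS, "Argentina", "Africa")
for (year, country), total in sorted(result.items()):
    print(year, country, total)
```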

A REST endpoint for OLAP databases

Despite the flexibility and expressive power of MDX and multidimensional databases, there hasn't been significant adoption of these technologies outside the enterprise.

Analytical databases that support MDX integrate with other systems through protocols like the Microsoft Windows-only OLE DB for OLAP or XML for Analysis (Microsoft, 2001). However, the lack of a protocol that conforms to current conventions and patterns for Web applications might have contributed to the low levels of adoption of multidimensional databases. This section introduces Mondrian-REST, an implementation of a URL-encoded query language that is able to express the most common types of MDX queries.

The design of the proposed query language is centered around three operations:

* Drilldown: allows the user to move from one level of detail to the next. Starting from the root of the UN COMTRADE data cube, drilling down on the Origin Country dimension means getting information for every Continent. Multiple drilldowns can be specified in a single query.

* Cut: allows the user to specify a filter, which effectively restricts the cube in which the aggregations are performed. Cutting the cube along the [South America].[Argentina] member in the Origin Country dimension will only consider cells for which Argentina is the origin country in the trade data. Multiple cuts can be specified in a single query.

* Measures: allows the user to select the measures (scalar variables associated with a particular fact in the data cube). Multiple measures can be selected in a single query.

These three operations are sufficient to express the most common types of queries needed by sites like SpendView.

Additionally, Mondrian-REST provides access to the structure of the cube, allowing the user to obtain metadata such as the list of members of a particular dimension, or detailed information about a member.

The following examples illustrate how queries are constructed, by restricting the search space with the drilldown, cut, and measures parameters.

The single entry point for aggregation operations in Mondrian-REST is /aggregate, which accepts the described parameters. In order to aggregate (sum) the Imports measure over the entire cube (every country), a user would issue the query:

/aggregate?measures[]=Imports

To obtain the time series of the total imports of every country, a drilldown on the Time dimension must be performed:

/aggregate?drilldown[]=Year
    &measures[]=Imports

To add another dimension, an additional drilldown can be specified. When combined with the cut parameter, this query adds every African country to the dataset, resulting in a yearly time series of total imports by each African country:

/aggregate?drilldown[]=Year
    &drilldown[]=Destination Country
    &cut[]=Destination Country.Africa
    &measures[]=Imports

Finally, this query will only consider imports of live animals by African countries, originating in South American countries:

/aggregate?drilldown[]=Year.Year
    &drilldown[]=Destination Country
    &cut[]=Origin Country.South America
    &cut[]=Destination Country.Africa
    &cut[]=Product.Live Animals
    &measures[]=Imports
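The URL patterns above can be assembled programmatically. The following Python sketch uses only the standard library; the function name `build_aggregate_url` is hypothetical, and the sketch assumes that the server accepts percent-encoded brackets and "+"-encoded spaces, as is conventional for Web frameworks.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_aggregate_url(drilldowns=(), cuts=(), measures=()):
    """Build an /aggregate query string from the three operations."""
    params = (
        [("drilldown[]", d) for d in drilldowns]
        + [("cut[]", c) for c in cuts]
        + [("measures[]", m) for m in measures]
    )
    # urlencode percent-encodes the brackets and encodes spaces as "+".
    return "/aggregate?" + urlencode(params)


url = build_aggregate_url(
    drilldowns=["Year", "Destination Country"],
    cuts=["Destination Country.Africa"],
    measures=["Imports"],
)
print(url)
# Decoding the query string recovers the original parameters.
print(parse_qs(urlsplit(url).query))
```

Repeating a parameter name with the `[]` suffix is what allows several drilldowns, cuts or measures to travel in a single request.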

The default representation of the query result is a JSON-encoded dictionary that contains information about the members of each dimension along with the associated measures.

Additionally, following the principles of Representational State Transfer (REST) (Fielding & Taylor, 2000), now adopted as the standard in modern Web application development, the user can also request that the response be encoded as comma-separated values or as an Excel spreadsheet. These tabular (two-dimensional) representations of a multidimensional structure follow the "tidy data" recommendations (Wickham, 2014) that allow for easy processing and manipulation with standard data analysis tools.
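As an illustration of the "tidy" layout, the following Python sketch flattens a small nested result into one row per observation and serializes it as CSV. The nested structure and its values are hypothetical; this is not the actual JSON layout returned by Mondrian-REST, only an example of going from a multidimensional shape to a tidy table.

```python
import csv
import io

# Hypothetical nested result: year -> country -> imports. This is only an
# illustrative in-memory shape, not the actual Mondrian-REST JSON layout.
nested = {
    1995: {"Gabon": 10.0, "Ghana": 20.0},
    1996: {"Gabon": 15.0, "Ghana": 25.0},
}

FIELDS = ["Year", "Destination Country", "Imports"]


def to_tidy_rows(result):
    """Flatten the nested mapping into one row per (year, country) observation."""
    return [
        {"Year": year, "Destination Country": country, "Imports": value}
        for year, by_country in sorted(result.items())
        for country, value in sorted(by_country.items())
    ]


def to_csv(rows):
    """Serialize tidy rows as comma-separated values with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = to_tidy_rows(nested)
print(to_csv(rows))
```

Each variable (year, country, measure) becomes a column and each cell of the cube becomes a row, which is the shape that most data analysis tools expect.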

Within Mondrian-REST, the mapping between logical entities (dimensions and measures) and the underlying "physical" representation of the information in the database is achieved through a configuration file in XML format (manually created by the developer) that encodes the principles of dimensional modeling. A fragment of such a file for the UN COMTRADE dataset, whose database schema appears in Figure 7, is shown in Figure 8.

[Figure 7 diagram: tables of the star schema and their columns]

facts_trade:  id, id_time, id_origin_country, id_destination_country, id_product, value_imports, value_exports
dim_time:     id, year
dim_country:  id, id_continent, continent, id_country, country
dim_products: id, id_category_level_1, category_level_1, id_category_level_2, category_level_2, id_category_level_3, category_level_3

Figure 7: Database "star schema" for UN COMTRADE

<Schema name="atlas">
  <Dimension name="Year" type="TimeDimension">
    <Hierarchy hasAll="true" primaryKey="id">
      ...
    </Hierarchy>
  </Dimension>
  <Dimension name="Countries">
    <Hierarchy ...>
      ...
    </Hierarchy>
  </Dimension>
  ...
  <Cube name="Trade Flow">
    <Table name="facts_trade"/>
    <DimensionUsage name="Year" source="Year" foreignKey="id_time"/>
    <DimensionUsage name="Origin Country" source="Countries" foreignKey="id_origin_country"/>
    <DimensionUsage name="Destination Country" source="Countries" foreignKey="id_destination_country"/>
    <DimensionUsage name="Product" source="..." foreignKey="id_product"/>
    <Measure name="Exports" column="value_exports" aggregator="sum"/>
    <Measure name="Imports" column="value_imports" aggregator="sum"/>
  </Cube>
</Schema>

Figure 8: Cube definition for the UN COMTRADE database
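To make the mapping between the logical cube and the physical tables concrete, the following Python sketch creates part of the star schema of Figure 7 in an in-memory SQLite database and runs a query of the kind a relational OLAP engine might generate for the Argentina-to-Africa example. The column types, the toy rows and the SQL text are assumptions for illustration; this is not the SQL that Mondrian actually emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_country (
    id INTEGER PRIMARY KEY, id_continent TEXT, continent TEXT,
    id_country TEXT, country TEXT);
CREATE TABLE facts_trade (
    id INTEGER PRIMARY KEY, id_time INTEGER, id_origin_country INTEGER,
    id_destination_country INTEGER, id_product INTEGER,
    value_imports REAL, value_exports REAL);

-- Toy rows, made up for illustration (dim_products is omitted for brevity).
INSERT INTO dim_time VALUES (1, 1995), (2, 1996);
INSERT INTO dim_country VALUES
    (1, 'sa', 'South America', 'arg', 'Argentina'),
    (2, 'af', 'Africa', 'gab', 'Gabon'),
    (3, 'af', 'Africa', 'gha', 'Ghana');
INSERT INTO facts_trade VALUES
    (1, 1, 1, 2, 10, 0, 100.0),
    (2, 1, 1, 2, 11, 0, 50.0),
    (3, 1, 1, 3, 10, 0, 30.0),
    (4, 2, 1, 3, 10, 0, 40.0);
""")

# The shared Countries dimension is joined twice, once per role
# (origin and destination), mirroring the two DimensionUsage entries.
SQL = """
SELECT t.year, dest.id_country, dest.country, SUM(f.value_exports) AS exports
FROM facts_trade AS f
JOIN dim_time AS t ON t.id = f.id_time
JOIN dim_country AS origin ON origin.id = f.id_origin_country
JOIN dim_country AS dest ON dest.id = f.id_destination_country
WHERE origin.country = 'Argentina' AND dest.continent = 'Africa'
GROUP BY t.year, dest.id_country, dest.country
ORDER BY t.year, dest.id_country
"""
rows = conn.execute(SQL).fetchall()
for row in rows:
    print(row)
```

The result has the same columns as Table 2: year, country code, country name and summed exports.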

The preceding examples use the UN COMTRADE dataset to demonstrate the expressiveness of the multidimensional modeling paradigm and the proposed query language. These techniques were applied to fiscal information during the implementation of the systems presented in this thesis. Using the model for a budget dataset listed on page 30 and the budget line of Table 1 as examples, the following query returns a yearly time series of all the expenditure phases, broken down by the first level of the functional classification, for the Ministry of Agriculture:

/aggregate?drilldown[]=Year
    &drilldown[]=Functional
    &cut[]=Administrative.Ministry of Agriculture
    &measures[]=Proposed
    &measures[]=Adopted
    &measures[]=Committed
    &measures[]=Paid

Implementation Details

Mondrian-REST is an HTTP service that translates queries encoded in the previously described language into MDX, which is in turn interpreted by the Mondrian Relational OLAP (ROLAP) engine [3]. This software component translates MDX into SQL queries, which are executed by a relational database. Mondrian ROLAP supports most relational databases: the system built for this thesis used the lightweight relational database system SQLite [4] during the development process, and the column-oriented database MonetDB [5] for live deployment, with no change in the underlying code. The system has also been successfully deployed using the PostgreSQL [6] and MySQL [7] database systems.

Mondrian-REST is implemented in JRuby [8], a version of the Ruby language that runs on the Java virtual machine.

[3] http://community.pentaho.com/projects/mondrian/
[4] https://sqlite.org/
[5] https://monetdb.org/
[6] http://postgresql.org/
[7] https://mysql.com/
[8] http://jruby.org/
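The translation step can be sketched as follows. This Python function is a simplified illustration and not Mondrian-REST's actual translation logic (which is written in JRuby): the names `member` and `to_mdx` are hypothetical, the drilldown is written as `Time.Year` so that the output reproduces the MDX query shown earlier, and the sketch assumes at most one cut per dimension and names without dots.

```python
def member(path):
    """Turn 'Origin Country.South America' into '[Origin Country].[South America]'."""
    # Naive split: assumes that names never contain a dot.
    return ".".join(f"[{part}]" for part in path.split("."))


def to_mdx(cube, measures, drilldowns, cuts):
    """Translate measures, drilldowns and cuts into a single MDX SELECT."""
    # Axis 0 always holds the requested measures.
    axes = ["{" + ", ".join(f"[Measures].[{m}]" for m in measures) + "}"]
    # Index cuts by dimension name (assumes at most one cut per dimension).
    cut_by_dimension = {cut.split(".")[0]: cut for cut in cuts}
    for drilldown in drilldowns:
        dimension = drilldown.split(".")[0]
        if dimension in cut_by_dimension:
            # A cut on a drilled dimension restricts that axis to the
            # children of the cut member.
            axes.append(member(cut_by_dimension.pop(dimension)) + ".Children")
        else:
            axes.append(member(drilldown) + ".Members")
    select = ", ".join(f"{axis} ON Axis({i})" for i, axis in enumerate(axes))
    mdx = f"SELECT {select} FROM [{cube}]"
    # Cuts on dimensions that are not drilled become the slicer (WHERE clause).
    if cut_by_dimension:
        slicer = ", ".join(member(cut) for cut in cut_by_dimension.values())
        mdx += f" WHERE ({slicer})"
    return mdx


print(to_mdx(
    cube="Trade Flow",
    measures=["Exports"],
    drilldowns=["Time.Year", "Destination Country"],
    cuts=["Destination Country.Africa", "Origin Country.South America.Argentina"],
))
```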

User experience and user interface

SpendView structures the overview of a fiscal information dataset linearly, inspired by the well-known design concept of a profile page and the inverted pyramid writing technique of journalistic articles. This writing style is characterized by "the most important information summarized in the so-called 'lead sentence' that, according to standard practice, has to answer four or five 'w-questions' (who? when? where? what? and perhaps why?). After the lead sentence comes the rest of the story [...]" (Pöttker, 2003).

In addition to interactive visualizations, SpendView relies on automatically generated textual narrative that describes and highlights basic statistics of the datasets.

This section provides an overview of the UI and describes in detail how the user interacts with the system to explore a budget.

Overview of a Budget

The initial view of a district's budget is shown in Figure 11. Following a modular structure, the page contains two types of elements.

The header (Figure 9) contains the name of the district and its logotype or emblem, and a search input field that allows the user to search for budget entries. It also contains a brief textual summary of some aspects of the budget, which includes the total amount for the latest available year, its compound annual growth rate and the main areas of expenditure for all the available budget classifications.

Following the header, the page contains one aggregation view module (Figure 10) for each classification of the budget. By default, it displays the first two levels of a dimension in a treemap, a visualization technique suitable for hierarchical data structures (Johnson & Shneiderman, 1991).
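The compound annual growth rate mentioned in the header summary is the constant yearly rate that would turn the first amount into the last one over the given number of years. A minimal Python sketch follows; the function name and the amounts are illustrative and not taken from SpendView's code or data.

```python
def compound_annual_growth_rate(first_amount, last_amount, years):
    """Return the constant yearly growth rate that turns first_amount into last_amount."""
    if first_amount <= 0 or years <= 0:
        raise ValueError("first_amount and years must be positive")
    return (last_amount / first_amount) ** (1 / years) - 1


# Illustrative amounts: a budget growing from 100 to 121 over two years.
rate = compound_annual_growth_rate(100.0, 121.0, 2)
print(f"{rate:.1%}")
```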

Figure 9: Detail view of the header, showing the generated textual description of the data

Figure 10: An "aggregation view" showing the first 2 levels of the administrative classification

Figure 11: SpendView showing the initial overview of the budget for the City of Cambridge

Drilling down on a budget classification

The drill down operation allows the user to move from one level of detail to the next. SpendView maintains the design of the main page for displaying information about a member of a particular budget classification. Figure 12 shows the profile for the "personal permanente" (permanent staff) classification, a second-level classification in the administrative dimension, within the "gastos en personal" (salaries and wages) item in a municipal budget. In addition to the elements of the main budget view described on page 45, the heading contains a trail that indicates the depth of this member within the overall budget, along with the names of the dimension and hierarchy level in which the member is located.

Figure 12: Partial view of a member of a budget

Exploring the different perspectives of the budget

The container for most interactive elements in SpendView is the "aggregation view", shown in Figure 10. Both the budget view (Figure 11) and drilldown pages (Figure 12) include one of these modules for each available dimension of a budget.

Budget dimensions as questions

As mentioned on page 18, budgets are classified according to multiple, independent dimensions that offer different perspectives of analysis. The usual nomenclature for these classifications can be translated into more accessible language, to facilitate comprehension of the budgetary information (Table 3).

Question                     Classification Name  Example members
Who spends?                  Administrative       Police department, Ministry of Education
What is it spent on?         Expense Type         Salaries and Wages, Goods, Services
What is the funding source?  Financing source     General Fund, Revenue Tax, External Financing
What is the purpose?         Functional           Community Development
Where is it spent?           Geographic           Jamaica Plains, Buenos Aires Province

Table 3: Budget classifications as translated to questions

Data verbalization as a complement of visualizations

SpendView makes use of automatically generated linguistic summaries of the data that is used to construct budget and dimension member profiles. Each aggregation view contains a portion of text that describes the members of the displayed dimension that account for the largest portions of expenditure, according to the displayed classification.

For example, given the budget of the City of Cambridge for the fiscal year 2016, the system would generate the following summary for the functional ("What is the purpose?") classification:

In 2016, Education (164 million USD, 30%) and Public Safety (120 million USD, 22%) were the two main areas of spending in the functional classification.

On a dimension member profile, the text is expanded to indicate that the summary is limited to that particular cell. The Education profile for the budget of Cambridge contains this text in its economic ("What is it spent on?") classification section:

In 2016, Salaries and Wages (137 million USD, 83%) and Other Ordinary Maintenance (25.4 million USD, 16%) were the two main areas of spending in the economic classification within Education.

Additionally, the names of the dimension members that are included in the linguistic summary are hyperlinked to their respective profiles.
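A sentence of this kind can be produced from a template and the aggregated amounts. The following Python sketch is an illustration under stated assumptions, not SpendView's actual generator: the function names are hypothetical, the total of 546 million is taken from the header shown in Figure 9, and the "Debt Service" amount is included only to show that the two largest members are selected.

```python
def format_amount(amount, currency):
    """Format an amount in millions, e.g. 25400000 -> '25.4 million USD'."""
    millions = f"{amount / 1e6:.1f}".rstrip("0").rstrip(".")
    return f"{millions} million {currency}"


def summarize(year, classification, members, total, currency="USD", within=None):
    """Describe the two largest members of a classification in one sentence."""
    # Requires at least two members; the two largest amounts come first.
    top = sorted(members.items(), key=lambda item: item[1], reverse=True)[:2]
    parts = [
        f"{name} ({format_amount(amount, currency)}, {round(100 * amount / total)}%)"
        for name, amount in top
    ]
    scope = f" within {within}" if within else ""
    return (
        f"In {year}, {parts[0]} and {parts[1]} were the two main areas of "
        f"spending in the {classification} classification{scope}."
    )


# Illustrative amounts for the functional classification.
functional = {"Public Safety": 120e6, "Debt Service": 54.7e6, "Education": 164e6}
sentence = summarize(2016, "functional", functional, total=546e6)
print(sentence)
```

Passing `within="Education"` appends the scope used on dimension member profiles.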

These linguistic summaries are crucial for capturing traffic that originates in search engines. Charts present data in graphical formats, but they are effectively "invisible" to search engines. Automatically generated text, on the other hand, is quickly detected and indexed by search engines (Figure 18).

Multiple graphical representations

The default graphical representation of the data contained in an aggregation view module is the treemap. Depending on the dimension being displayed, the treemap represents up to two levels in the hierarchy; the first level is encoded by a color in a categorical scale, and the second is encoded by the position of the rectangle that represents a member, as calculated by the Squarified algorithm (Bruls, Huizing, & van Wijk, 2000). Treemaps are effective for representing the relative share or importance of items in a total amount. In SpendView, the treemap items (rectangles) are budget categories in a classification, and their size is proportional to their contribution to the overall spending in a particular year. The user can select a year of interest with a "slider" control situated below the treemap.

However, the treemap is not effective for visualizing temporal trends, a frequent use case in budget analysis. To accommodate this, SpendView offers line and stacked bar charts, which represent the temporal dimension on their horizontal axis (Figure 13).

Exploring the data

Every chart type in SpendView is interactive, allowing the user to obtain more information about the displayed budget categories. When the mouse pointer is over a treemap rectangle, a line in a line chart, or a bar in a bar chart, the chart presents a tooltip that contains detailed information about the budget classification. Figure 14 shows a tooltip displaying the amounts of the execution phases of a budget in the selected year. Chart legends also react to the mouse pointer: their tooltip adds controls for hiding or isolating that particular series from the chart (Figure 15).

Figure 13: Treemap, line and stacked bar chart applied to the same dataset.

Figure 14: Tooltip over a chart

Figure 15: Tooltip over a chart legend

In addition to the "mouse over" interaction gesture, charts and legends also respond to clicks. This interaction is used to obtain more information about a budget item. When the user clicks either on a chart element or on an item of a chart's legend, SpendView displays a modal window constrained to the aggregation view module that contains the chart. This modal window introduces an additional visualization method, the grouped bar chart, that facilitates comparisons of multiple expenditure phases (e.g., approved, adjusted, executed). It also uses a tooltip, consistent with the main aggregation view, to provide further details.

Figure 16: Modal window showing spending time series for a budget item

Sharing

Ease of sharing is one of SpendView's design principles. In addition to every budget and item page having a unique URL ("permalink"), each aggregation view module offers multiple ways to share and download the displayed information in several formats. An "options" button displays a modal window (Figure 17), similar to the one described in the preceding section, that contains the following options:

- Share: buttons to share the selected view on social media platforms.

- Download: buttons to download the data corresponding to the current chart in comma-separated (CSV) and Microsoft Excel (XLS) formats. The user can also download a rendering of the chart in Portable Network Graphics (PNG) and Scalable Vector Graphics (SVG).

- Data: tabular representation of the data.

- API: displays the address of the application programming interface (API) for consuming the information through the HTTP protocol.

- Embed: HTML code suitable for embedding the selected visualization into a Web page.

Figure 17: The Share modal window displaying the "Data" section

Implementation of the user interface

SpendView is currently running at spendview.media.mit.edu as a public Website accessible from the open Web. The following sections discuss how the interface was implemented.

Isomorphic Web applications

Web applications, being client-server systems, comprise two parts: the server application, which serves content (markup, media, executable code) through the network; and the client application, which runs in a Web browser on the user's device.

Before the advent of rich interfaces, these applications were simple in scope. The HTTP server sent HTML files and media content to a Web browser, which rendered them and only provided basic interactions such as navigating through hyperlinks, or submitting the contents of a form to a server.

In 1995, Netscape introduced the interpreted language JavaScript to its Navigator product, which enabled the creation of richer interactions in the Web browser by manipulating the contents of the HTML that was downloaded from the server, in response to the actions of the user. These new possibilities led to an increase in the complexity of Web applications, which now consist of two separate codebases (server and client), usually implemented in different programming languages. Furthermore, less capable HTTP clients, such as the crawlers employed by search engines to gather and index the contents of Web pages, are unable to execute JavaScript code. As a consequence, information that is displayed on a Web browser screen by manipulating HTML with JavaScript is in effect invisible to search engines.

In response to this, and with the recent feasibility of JavaScript as a server-side language (Tilkov & Vinoski, 2010), the Web development landscape saw the emergence of a new development practice, called Isomorphic or Universal Web applications (Robbins, 2011). This technique allows the developer to employ a single, coherent technology stack for the server and client applications. SpendView is developed with such a paradigm: the entire codebase for the Web application is written in JavaScript, using the React [9] application framework, released by Facebook in 2013.

Visibility to non-rich internet clients

One of the main benefits of rendering HTML content on the server, as mentioned above, is the increased "visibility" to search engines (Figure 18) and to social media platforms. This is particularly relevant in a context where social sharing and organic search results amount to a combined 80 percent share of the sources of traffic acquisition for websites.

Social media platforms have created mechanisms to increase the visibility of shared content, which require the inclusion of metadata tags that are interpreted by the platform's crawlers. SpendView uses these devices to provide a textual summary of the information contained in the link that was shared.
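As an illustration of such metadata, the following Python sketch renders Open Graph and Twitter Card tags for a page. It is not SpendView's implementation (which is written in JavaScript); the function name is hypothetical, and the sample title, description and URL are taken from the search result shown in Figure 18.

```python
from html import escape


def social_meta_tags(title, description, url):
    """Render Open Graph and Twitter Card tags for a shared page."""
    # Open Graph tags use the "property" attribute; Twitter Cards use "name".
    tags = [
        ("property", "og:title", title),
        ("property", "og:description", description),
        ("property", "og:url", url),
        ("name", "twitter:card", "summary"),
        ("name", "twitter:title", title),
        ("name", "twitter:description", description),
    ]
    return "\n".join(
        f'<meta {attr}="{key}" content="{escape(value, quote=True)}">'
        for attr, key, value in tags
    )


print(social_meta_tags(
    title="República Argentina - SpendView",
    description=(
        "In 2016, República Argentina's budget was of 1.56 trillion ARS. "
        "It has increased at an annualized rate of 28.89% since 2008."
    ),
    url="https://spendview.media.mit.edu/cube/argentina",
))
```

Because these tags are part of the HTML rendered on the server, crawlers that do not execute JavaScript can still read them.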

Figure 18: Organic search results in Google for the query "spendview argentina"

[9] https://facebook.github.io/react/

Figure 19: Sharing on Twitter

Figure 20: Sharing on Facebook

Impact

Bahia Blanca: from confrontation to collaboration

Bahia Blanca, a city with a population of 300,000 in the south of the Buenos Aires province in Argentina, was among the first in the world to use the Internet to publish detailed fiscal information: in the year 2000, its government implemented a Web site that let citizens consult the city's checkbook. The system, although primitive by today's standards, allowed users to obtain information on most aspects of the city's revenues, expenditure and procurement. The successive administrations trimmed down the system considerably; by mid-2010 it only remained as a simple list of purchase orders issued by the government.

In July 2010, motivated by some corruption scandals that were reported by the local media, I built GastoPublicoBahiense.org [10] (GPB), a system that obtained those procurement records through screen scraping techniques and produced a Web site that republished that information using interactive visualizations (Figure 21). This initiative gave relevance to a dataset that had been available for more than 10 years, and had immediate impact on the city's political and media landscape. It is also frequently cited as a pioneering example of civic hacking in Latin America (Jefatura de Gabinete de Ministros, 2014; Morozov, 2014).

The then-current administration took measures against the project, such as introducing a CAPTCHA (a challenge-response mechanism intended to block screen scrapers) in the section of the city's Website where the procurement information was published. This incident brought more notoriety [11] to the GPB project. It also prompted the next administration to create an office within government dedicated to transparency and technological innovation.

[10] http://gastopublicobahiense.org
[11] http://www.lanacion.com.ar/1387132-impiden-al-acceso-a-informacion-publica-en-bahia-blanca

Figure 21: The homepage of GastoPublicoBahiense.org in July 2010

In this new political context, the city's government removed the previously established obstacles to accessing public information, and we established a collaborative relationship.

In June 2015, in the context of this research, I started collaborating with Bahia Blanca's government to develop a budget information portal for the city. The city has had an integrated financial management information system (IFMIS) (United States Agency for International Development, 2008) in operation since 2008, but didn't have a public-facing system for providing budgetary information to its citizens.

The outcome of this collaboration was twofold: I reverse engineered the IFMIS's database and wrote a set of software components that allow the extraction of detailed budgetary and procurement reports in machine-readable formats, suitable for use in data analysis tools. I also developed Presupuesto Abierto (Figure 5), an initial prototype for a budget information portal, that was publicly deployed in November 2015 and is running at http://bahia.presupuestoabierto.org.

The usage data and feedback collected from the public deployment of Presupuesto Abierto informed the development of SpendView, and drove the decision to explore alternatives to the dashboard paradigm. Additionally, the software that was developed to extract the financial records from the city's IFMIS (a software system provided by the provincial government, which runs in more than 140 districts across Argentina) was made available by Bahia Blanca's government to other municipalities, which are now using it to overcome this system's limited reporting capabilities by processing this information with more powerful tools.

In April 2016, Bahia Blanca's fiscal records from 2008 to 2016 were made available in SpendView. In contrast with the other budget datasets that are published in the platform, Bahia Blanca's are kept updated thanks to the integration with the city's IFMIS that was developed for this research.

Given the standardization and architectural homogeneity of IFMIS (United States Agency for International Development, 2008), it is feasible to develop integrations of SpendView with other systems.

Figure 22: SpendView showing the profile for the 'Salaries and Wages' budget item (Bahia Blanca)

Working with Uruguay's budget office

In March 2016, I spent a week working at the Uruguayan presidency's Office for Budget and Planning (OPP)¹² through a collaboration with the Latin American Initiative for Open Data (ILDA)¹³. The Uruguayan national government has one of the most active Open Data programs in Latin America¹⁴, and has maintained a citizen-oriented budget information portal since 2011 (Figure 23).

12 http://www.opp.gub.uy
13 http://idatosabiertos.org
14 Currently occupying the 19th place in the world in the Open Data Barometer, out of 92 countries.

Figure 23: Official budget information portal (Presidency of Uruguay)

OPP is currently undertaking the renovation of this online resource, and has expressed interest in adopting SpendView, as it presents some advantages over their current portal. Namely, it allows for easier visualization of temporal trends, and enables queries across all the classification criteria (dimensions) of the budget. OPP also valued the linguistic summaries that are presented alongside charts.

Workshop

I conducted a workshop titled "Exploring technology and open budget data", in collaboration with members of ILDA and OPP. It was attended by members of several ministries (Labor, Interior, Housing), consultants, journalists, government planners and researchers from local universities. We structured the activity as a focused conversation: the participants were separated into groups and tasked with collaboratively producing answers to a series of questions posed by the organizers of the workshop, to be later shared and discussed. The following are some of the questions that we posed to the participants, and the answers that were produced:

What are the challenges that you face when working with budgetary information?

For the most part, participants pointed out deficiencies in the existing information resources: lack of performance-oriented indicators that can be associated with expenditure data, and the need for transaction-level information (checkbook entries) that would allow analysts to construct their own budget classifications. They also mentioned the lack of detail in the formulation of the country's budget.

Are the fiscal information resources accessible and easy to consult?

Participants from the government noted that it is often difficult to obtain budget information from government systems. Notably, some of them were unaware of the official budget information portal, and mentioned that it was not possible to reach it from a search engine query.

How do you use budget information? For which purpose?

Among others, participants from government and news organizations mentioned using budget data to create reports for non-specialized audiences and to perform comparative analyses of spending across government programs. Researchers and consultants referred to their use of budget records to study the effectiveness and cost of public policies.

User evaluation of SpendView

I conducted an informal testing session of SpendView with the participants of the workshop. The tool was well-received, and I obtained valuable feedback and observations:

* Users valued that the system "can be read and navigated like a Web page", in contrast to the business intelligence tools that they are used to.

* Although the linguistic summaries were perceived as useful, some users noticed the uniformity in their structure. Verbalizing other statistical properties of the information, besides top spenders, is intended to be developed in future versions of SpendView.

* Some users were confused by the visualization of two levels of a hierarchy on a single treemap. Future versions of the platform will display a single level by default, and provide a "depth" control that allows users to choose how many levels are displayed by the treemap.
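The "depth" control proposed in the last item can be illustrated with a short sketch. It is not SpendView's implementation: the `BudgetNode` structure, the `truncate` function and the figures are hypothetical, and the only assumption is that a budget classification can be represented as a tree whose leaves carry amounts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BudgetNode:
    """A node in a budget hierarchy (hypothetical structure, for illustration)."""
    name: str
    amount: float = 0.0  # only meaningful for leaves
    children: List["BudgetNode"] = field(default_factory=list)

    def total(self) -> float:
        """Amount of a leaf, or the sum of all descendants' amounts."""
        if not self.children:
            return self.amount
        return sum(child.total() for child in self.children)


def truncate(node: BudgetNode, depth: int) -> BudgetNode:
    """Return a copy of `node` showing at most `depth` levels below it.

    Nodes at the cut-off level become leaves carrying their aggregated total,
    so the treemap areas stay proportional to spending.
    """
    if depth <= 0 or not node.children:
        return BudgetNode(node.name, node.total())
    return BudgetNode(
        node.name,
        children=[truncate(child, depth - 1) for child in node.children],
    )


# Illustrative figures only; not taken from any real budget.
budget = BudgetNode("Budget", children=[
    BudgetNode("Education", children=[
        BudgetNode("Schools", 70.0),
        BudgetNode("Libraries", 30.0),
    ]),
    BudgetNode("Health", children=[
        BudgetNode("Hospitals", 50.0),
    ]),
])

one_level = truncate(budget, depth=1)   # default view: a single level
two_levels = truncate(budget, depth=2)  # what a "depth" control could expose
```

Because the cut-off nodes keep their aggregated totals, the single-level view and the two-level view describe the same overall amount; only the granularity changes.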

Public Deployment

SpendView was deployed to the open Web on April 22, 2016, with 7 initial budget datasets (Table 4) loaded into the platform, totaling 1.6 million individual records.

District        Type      Period      Source
Uruguay         Country   2011-2015   Uruguay Open Data Portal¹⁵
Bahia Blanca    City      2008-2016   Bahia Blanca IFMIS
Cambridge, MA   City      2011-2016   City of Cambridge Open Data Portal¹⁶
Buenos Aires    City      2013-2015   Buenos Aires Open Data Portal¹⁷
Argentina       Country   2008-2016   Argentina's "Citizen's Portal"¹⁸
Boston, MA      City      2013-2016   City of Boston Open Data Portal¹⁹
Poland          Country   2010-2014   World Bank's BOOST project²⁰

Table 4: Budgets loaded into SpendView as of April 2016

15 https://catalogodatos.gub.uy/dataset/credito-presupuestal-asignado-y-ejecutado-con-aperturas

This initial set of budgets, which originate from heterogeneous sources, served to test one of the main design principles of the system: that it should be possible to map the logical organization of a budget (hierarchical classifications and expense phases) to its physical representation (i.e., the data as stored in the database).

Table 5 shows the columns present in Boston's budget dataset, as available for download in the city's open data portal, and their mapping to budget classifications.

Column             Classification   Level
Fiscal Year        Chronological    1
Cabinet            Administrative   1
Department         Administrative   2
Program            Administrative   3
Expense Type       Economic         1
Expense Category   Economic         2

Table 5: Columns and their mapping to budget dimensions

16 http://budget.data.cambridgema.gov/
17 http://data.buenosaires.gob.ar/dataset/presupuesto-ejecutado
18 http://sitiodelciudadano.mecon.gov.ar/
19 http://budget.data.cityofboston.gov/
20 http://wbi.worldbank.org/boost/country/poland

This mapping is implemented with an XML file, similar to that shown in Figure 8, that specifies the columns that correspond to each classification of a budget, and the expenditure phases. At present, the maintainer of an instance of SpendView needs to manually edit this XML mapping. Implementing a graphical interface for this task is planned for a future version of the platform. Nevertheless, the structural similarity of budgetary information, the flexibility of Mondrian-REST and the dimensional modeling technique proved to be sufficient for the purpose of implementing a general-purpose API and visualization engine.
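As an illustration of how the rows of Table 5 could be turned into such a mapping, the following sketch generates an XML fragment loosely modeled on the Mondrian schema format (`Cube`, `Dimension`, `Hierarchy`, `Level`, `Measure`). It is not the file used by SpendView: the cube name, the fact table name `boston_budget`, the measure column `amount`, and the use of the column titles as database column names are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# (column, classification, level) rows, as listed in Table 5.
TABLE_5 = [
    ("Fiscal Year", "Chronological", 1),
    ("Cabinet", "Administrative", 1),
    ("Department", "Administrative", 2),
    ("Program", "Administrative", 3),
    ("Expense Type", "Economic", 1),
    ("Expense Category", "Economic", 2),
]


def build_cube(rows, cube_name, fact_table, measure_column):
    """Build a Mondrian-style <Cube> element from column/classification rows."""
    # Group columns by classification, preserving the order of first appearance.
    dimensions = {}
    for column, classification, level in rows:
        dimensions.setdefault(classification, []).append((level, column))

    cube = ET.Element("Cube", name=cube_name)
    ET.SubElement(cube, "Table", name=fact_table)
    for classification, levels in dimensions.items():
        dim = ET.SubElement(cube, "Dimension", name=classification)
        hierarchy = ET.SubElement(dim, "Hierarchy", hasAll="true")
        # Levels go from the coarsest (1) to the finest.
        for _, column in sorted(levels):
            ET.SubElement(hierarchy, "Level", name=column, column=column)
    # A single expenditure phase, exposed as a summed measure (assumed name).
    ET.SubElement(cube, "Measure", name="Amount",
                  column=measure_column, aggregator="sum")
    return cube


# "boston_budget" and "amount" are assumed names, not taken from the dataset.
cube = build_cube(TABLE_5, "Boston Budget", "boston_budget", "amount")
xml_text = ET.tostring(cube, encoding="unicode")
```

Each classification becomes one dimension with a single hierarchy, and each column becomes a level of that hierarchy; a budget with several expenditure phases would add one measure per phase.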

Usage data

Since its public deployment on April 22, 2016, SpendView has had 998 unique users, 1,288 sessions and 2,244 page views, according to data collected with the Google Analytics platform. Social networks and organic search were the sources that originated the most traffic, accounting for 41.03% and 41.62% respectively. Although the traffic volume is still insufficient to make a comprehensive analysis, we recorded interesting queries by users that arrived at SpendView through a search engine:

* "ministerio de educacion argentina presupuesto" ("ministry of education argentina budget"), which led the user directly to the corresponding profile page in SpendView.

* "boston budget", leading to the City of Boston's budget profile page.

In addition to tracking individual page views, user interface actions such as clicking on a chart element, changing the chart type, or opening a modal were also recorded in Google Analytics. While the sample size is small, this information highlighted potential usability issues. For instance, a considerable portion of the sessions ended after the user clicked on a chart element that causes a modal window to be opened (Figure 16). This window is meant to provide detailed information about a budget item, and to provide a link to the profile of that item. Assuming that the users who clicked on the chart element corresponding to a budget item intended to get more information about it, the observation that they are likely to end their session at that point suggests a deficiency in the design of that interaction.
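The kind of observation described above can be expressed as a simple measure: among the sessions in which a given event occurs, the share in which it is the last recorded action. The sketch below is only an illustration of that reasoning. The event names and the sessions are made up, since the actual event labels and the raw Google Analytics data are not reproduced in this document.

```python
from collections import Counter

# Hypothetical event name; the labels actually tracked are not listed here.
MODAL_OPEN = "open_modal"

# Made-up sessions, each an ordered list of recorded interface events.
sessions = [
    ["page_view", "click_chart_element", MODAL_OPEN],
    ["page_view", "change_chart_type", "page_view"],
    ["page_view", "click_chart_element", MODAL_OPEN, "page_view"],
    ["page_view", "click_chart_element", MODAL_OPEN],
]


def exit_rate(sessions, event):
    """Share of sessions containing `event` whose last action is `event`."""
    containing = [s for s in sessions if event in s]
    if not containing:
        return 0.0
    ended_there = sum(1 for s in containing if s[-1] == event)
    return ended_there / len(containing)


def last_events(sessions):
    """Count how often each event is the final one in a session."""
    return Counter(s[-1] for s in sessions if s)


modal_exit_rate = exit_rate(sessions, MODAL_OPEN)
```

A high value for the modal event, compared with the values for other events, would point to the same interaction problem discussed above.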

Conclusion

Future Work

Based on the preliminary analysis of the usage data, the feedback received, and the public deployments, I identify possible opportunities for future improvements. A tool that eases the process of importing a new budget into SpendView could promote wider adoption. Specifically, the Fiscal Data Package standardization effort (Björgvinsson, Pollock, & Walsh, 2016), led by the Open Knowledge Foundation, provides a format that contains the data and metadata needed to construct a SpendView instance. Additionally, the reverse engineering of the IFMIS running in Bahia Blanca, mentioned on page 60, presents the opportunity to deploy SpendView to more than 140 districts in Argentina. While the initial results and observations are encouraging, more usability studies should be conducted to improve the user experience.

Concluding Remarks

SpendView is a platform for visualizing public budget information that makes use of linguistic summaries and interactive visualizations, and that emphasizes linear narratives and the sharing of content. Although the design and implementation principles outlined in this thesis were applied to fiscal information datasets, they could also be applied to other domains, taking advantage of the flexibility of the dimensional modeling paradigm.

List of Figures

Figure 1: A traditional dashboard for government finances
Figure 2: Where does my money go? One of the first examples of data visualization applied to fiscal data
Figure 3: Socrata Open Budget showing City of Cambridge's education expenditure
Figure 4: New York Times visualization of the 2013 US Budget proposal
Figure 5: PresupuestoAbierto.org displaying fiscal information for the cities of Boston (USA) and Bahia Blanca (Argentina)
Figure 6: A "star schema" for a budget
Figure 7: Database "star schema" for UN COMTRADE
Figure 8: Cube definition for the UN COMTRADE database
Figure 9: Detail view of the header, showing the generated textual description of the data
Figure 10: An "aggregation view" showing the first 2 levels of the administrative classification
Figure 11: SpendView showing the initial overview of the budget for the City of Cambridge
Figure 12: Partial view of a member of a budget
Figure 13: Treemap, line and stacked bar chart applied to the same dataset
Figure 14: Tooltip over a chart
Figure 15: Tooltip over a chart legend
Figure 16: Modal window showing spending time series for a budget item
Figure 17: The Share modal window displaying the "Data" section
Figure 18: Organic search results in Google for the query "spendview argentina"
Figure 19: Sharing on Twitter
Figure 20: Sharing on Facebook
Figure 21: The homepage of GastoPublicoBahiense.org in July 2010
Figure 22: SpendView showing the profile for the 'Salaries and Wages' budget item (Bahia Blanca)
Figure 23: Official budget information portal (Presidency of Uruguay)

Index of Tables

Table 1: Example classification and phases for a line of an expenditure budget
Table 2: Result from MDX query (excerpt)
Table 3: Budget classifications as translated to questions
Table 4: Budgets loaded into SpendView as of April 2016
Table 5: Columns and their mapping to budget dimensions

Bibliography

Aristaran, M. (2010). Gasto Público Bahiense. Retrieved from http://gastopublicobahiense.org/
Aristaran, M. (2015). Presupuesto Abierto. Retrieved from http://www.presupuestoabierto.org/
Björgvinsson, T., Pollock, R., & Walsh, P. (2016). Fiscal Data Package Specification 0.3.0. Retrieved from http://fiscal.dataprotocols.org/spec/
Bruls, M., Huizing, K., & van Wijk, J. J. (2000). Squarified treemaps. Proceedings of the Joint Eurographics and IEEE TCVG Symposium on Visualization, 33-42.
Chen, P. P.-S. (1976). The Entity-relationship Model - Toward a Unified View of Data. ACM Trans. Database Syst., 1(1), 9-36. http://doi.org/10.1145/320434.320440
Codd, E. F. (1972). Further normalization of the data base. Data Base Systems, 33-64.
Datawheel, Macro Connections / MIT Media Lab, & Deloitte. (2016). Data USA.
Dener, C., & Min, S. Y. (2013). Financial Management Information Systems and Open Budget Data. The World Bank. http://doi.org/10.1596/978-1-4648-0083-2
Department of Economic and Social Affairs. (2000). Statistical Papers Series M No. 84. New York: United Nations.
Gray, J., & Chambers, L. (2012). Covering the Public Purse with OpenSpending.org. Sebastopol, CA: O'Reilly Media. Retrieved from http://datajournalismhandbook.org/1.0/en/case_studies_3.html
International Monetary Fund. (2014). Government Finance Statistics Manual 2014. Washington, DC: International Monetary Fund. Retrieved from http://www.imf.org/external/Pubs/FT/GFS/Manual/2014/gfsfinal.pdf
Jacobs, D., Helis, J.-L., & Bouley, D. (2009). Budget classification. IMF Technical Notes and Manuals, 09(06), 1-18.
Jefatura de Gabinete de Ministros. (2014). Políticas y experiencias de Gobierno Abierto en Argentina. Retrieved from www.gobiernoabierto.gob.ar
Johnson, B., & Shneiderman, B. (1991). Tree-maps: A space-filling approach to the visualization of hierarchical information structures. In Visualization, 1991. Visualization'91, Proceedings., IEEE Conference on (pp. 284-291).
Kimball, R., & Ross, M. (2002). The data warehouse toolkit: the complete guide to dimensional modelling. Wiley.
Lynch, T. D., & others. (1979). Public budgeting in America. Prentice-Hall.
Microsoft. (2001). XML for Analysis Specification. Retrieved from https://msdn.microsoft.com/en-us/library/ms977626.aspx
Morozov, E. (2014). To save everything, click here: The folly of technological solutionism. PublicAffairs.
Mustonen, J. (2006). The world's first freedom of information act: Anders Chydenius' legacy today. Anders Chydenius Foundation.
Open Knowledge Foundation. (2015). Open Definition 2.1. Retrieved from http://opendefinition.org/od/2.0/en/
Orszag, P. R. (2009). Memorandum for the Heads of Executive Departments and Agencies. Retrieved from https://www.whitehouse.gov/sites/default/files/omb/assets/memoranda_2010/m10-06.pdf
Pöttker, H. (2003). News and its communicative quality: the inverted pyramid-when and why did it appear? Journalism Studies, 4(4), 501-511. http://doi.org/10.1080/1461670032000136596
Robbins, C. (2011). Scaling isomorphic javascript code. Nodejitsu.
Samuelson, P. A. (1948). Consumption theory in terms of revealed preference. Economica, 15(60), 243-253.
Simoes Gaspar, A. J. (2012). The Observatory: Designing Data-Driven Decision making tools. MIT Media Lab, 1-59. Retrieved from http://macroconnections.media.mit.edu/papers/SimoesMasterThesis.pdf
Tilkov, S., & Vinoski, S. (2010). Node.js: Using javascript to build high-performance network programs. IEEE Internet Computing, 14(6), 80.
United Nations. (2014). UN Comtrade Database. Retrieved from http://comtrade.un.org/db/
United States Agency for International Development. (2008). Integrated Financial Management Information Systems: A Practical Guide. Retrieved from http://pdf.usaid.gov/pdfdocs/Pnadk595.pdf
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(1), 1-23.