Learn to Create a Sankey Diagram in R with Data from the European Commission (2018)


© 2021 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction

This guide explores creating a Sankey diagram of EU expenditures and revenues. Sankey diagrams visualize hierarchical or value flow data between a series of points. They are most often composed of nodes connected by lines whose width varies with the value of the flow. Line width, along with relative location and color, encodes the data.

The visualization in this tutorial uses data from the European Commission about EU expenditure and revenue. The connecting lines represent the flow of revenue into different expenditure categories (Figure 1).

All data are approximate. The expenditure under security and citizenship is used for the following purposes, in decreasing order of proportion:

• Union action in the f…
• Justice Programme
• Union Civil Protectio…
• Europe for Citizens
• Consumer Programme
• IT Systems
• Pilot projects and pr…
• Other actions and pro…

Text above the diagram reads, “Source: European Commission, 2018.”

Figure 1. A Sankey Diagram of Security and Citizenship Spending and Revenue

What Is a Sankey Diagram?

Sankey diagrams visualize either hierarchical or value flow data between a series of points, where the nodes have no particular geographical or spatial location. They also present the distribution of subparts visually, separating the various streams or branches into different end points. Nodes are usually organized into a tree-like hierarchy, though the structure can also be nonhierarchical. As in weighted network diagrams, line thickness indicates the magnitude or value of a relationship or exchange between two points. Apart from the lack of geographical location, Sankey diagrams closely resemble flow maps (see Variations and Alternatives).
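In data terms, the structure described above reduces to a table of links, each with a source node, a target node, and a value. The following R sketch uses made-up node names and values (they are not figures from the European Commission dataset) to show how the share carried by each flow, which a Sankey diagram draws as relative line width, could be derived.

```r
# Hypothetical links table: each row is one flow between two nodes.
# Node names and values are illustrative placeholders only.
links <- data.frame(
  source = c("Revenue", "Revenue", "Category A", "Category A"),
  target = c("Category A", "Category B", "Programme 1", "Programme 2"),
  value  = c(60, 40, 45, 15),
  stringsAsFactors = FALSE
)

# Total flow leaving each source node.
outflow <- tapply(links$value, links$source, sum)

# Share of its source node's outflow carried by each link; a Sankey
# diagram encodes this proportion as the relative width of the line.
links$share <- links$value / as.numeric(outflow[links$source])

print(links)
```

Keeping the links in one flat table like this makes it straightforward to filter or sort the flows before drawing anything.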

Why Use a Sankey Diagram

A Sankey diagram is particularly well suited for showing the movement and fluid nature of certain phenomena between predefined points. It is an especially good alternative to a flow map when the dataset nodes have no particular geographical locations but the flows between nodes are still composed of quantifiable values. Due to its tree-like structure, the Sankey diagram is also particularly adept at bringing out hierarchical aspects of these types of datasets.

Considerations and Cautions

Like flow maps and network diagrams, Sankey diagrams are prone to becoming cluttered to the point of illegibility if the visualized network is complex, with many crossing edges, and they often require extensive manual editing. With this in mind, the visualization should be limited in scope (or, in the case of interactive displays, offer filtering options for the user). The flows in Sankey diagrams are also usually restricted to one direction, often the same as the direction of written text, for example, showing how a total company budget is allocated to different projects. Individual lines going against the general flow are sometimes marked with arrowheads to distinguish them from the norm, but they add complexity that can confuse readers and should therefore be avoided or kept to a minimum.

Classification

Since the line weights used in the Sankey diagram are based on the various amounts or magnitudes of movement, it can sometimes be helpful to establish a finite number of possible line weights to help the reader visually rank and evaluate the various flow sizes. Classifying, or binning, the individual values can be done in a variety of ways, for instance, classifying all data points into arbitrary round number categories, or into categories delineated by breaks at equal intervals, quantiles, natural breaks (Jenks) in the data, or standard deviation breaks around the mean or median value. Categorizing into quantiles, that is, dividing the dataset into classes with an equal number of items, is a good option for most datasets, though this will always depend on the particular topic and data at hand.
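Two of the classification schemes mentioned above, equal intervals and quantiles, can be sketched in base R with `cut()` and `quantile()`. The flow values, the number of classes, and the line weights below are all assumptions chosen for illustration, not values from the dataset.

```r
# Hypothetical flow values in millions of euros (placeholders only).
flows <- c(2, 5, 9, 14, 22, 37, 60, 95, 150, 240, 410, 820)

n_classes <- 4

# Equal-interval breaks: classes span equal ranges of the value scale.
equal_breaks <- seq(min(flows), max(flows), length.out = n_classes + 1)
equal_class  <- cut(flows, breaks = equal_breaks, include.lowest = TRUE,
                    labels = FALSE)

# Quantile breaks: classes hold an equal number of items.
quant_breaks <- quantile(flows, probs = seq(0, 1, length.out = n_classes + 1))
quant_class  <- cut(flows, breaks = quant_breaks, include.lowest = TRUE,
                    labels = FALSE)

# Map each quantile class to one of a small, fixed set of line weights.
line_weights <- c(1, 3, 6, 10)
result <- data.frame(flows, equal_class, quant_class,
                     weight = line_weights[quant_class])
print(result)

# Compare how many items land in each class under the two schemes.
print(table(equal_class))
print(table(quant_class))
```

With skewed values like these, equal intervals put most flows into the lowest class, while quantiles spread them evenly, which is one reason quantiles often suit budget-style data.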

Labeling

Because visually evaluating volume and area is inherently difficult for readers, exact values are hard to read accurately from a Sankey diagram without direct labels. Labels can be used in Sankey diagrams to identify elements such as individual nodes, amounts, or magnitudes, as well as significant branches or substreams of the flow itself. Any labels should be positioned as close as possible to the elements they refer to. Whether or not line weights are classified, it is also useful to include some form of legend for interpreting the data values that underlie the line sizes. As always, it is important to clearly note the unit of measure in the legend or title.
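As a small illustration of direct labeling, the R sketch below builds label strings that carry the node name, the exact value, and the unit of measure. The node names and values are hypothetical placeholders.

```r
# Hypothetical nodes with values in millions of euros (placeholders only).
nodes <- data.frame(
  name  = c("Programme 1", "Programme 2", "Programme 3"),
  value = c(1250, 310, 42),
  stringsAsFactors = FALSE
)

unit <- "million EUR"

# Build a direct label that carries the name, the exact value, and the unit.
nodes$label <- paste0(
  nodes$name, ": ",
  formatC(nodes$value, format = "d", big.mark = ","),
  " ", unit
)

print(nodes$label)
```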

Color Choices

Color scales can be either qualitative or quantitative, though in Sankey diagrams they are most often qualitative. On a qualitative color scale, colors usually represent objects that belong to different groups or categories. The goal with a qualitative color scale is usually to create a palette in which the colors are easily distinguishable, which means relying heavily on major differences in hue.

Quantitative color scales, which do not usually apply to Sankey diagrams, often use variation in the lightness of a color to show variation in value. Generally, as the value of a variable increases, so does the contrast between the color and its background. A sequential single-hue color scale, where value differences are marked only by differences in lightness (e.g., white to red), can, however, hinder differentiation of adjoining areas. A better choice in these cases is a sequential multi-hue scale, where changes on the value scale are accompanied by changes in hue (e.g., yellow to red). As a general rule of thumb, the human eye can reliably distinguish only six to seven degrees of lightness in any given hue due to a phenomenon called simultaneous contrast. Consequently, the number of classes visualized should ideally be limited to seven or fewer, depending on your choice of color palette. Another option for a quantitative color scale is the diverging color scale, often used to depict value scales containing both negative and positive values, where the color scale ends in disparate hues joined by a neutral third hue in the middle (e.g., blue, white, red). These latter two color palettes also allow a few more classes than the recommended maximum, if necessary, as the effects of simultaneous contrast diminish with increased variation in hue (Figure 2).

The color scales shown in the image are listed as follows:

• Qualitative scale: This scale has boxes of markedly different colors.
• Quantitative scale
  • Single-hue scale: This scale has boxes of the same color but of increasing brightness.
  • Multi-hue scale: This scale has boxes of varying hue.
  • Diverging scale: This scale has boxes of different colors at each end, with a series of neutral-colored boxes between them.

Figure 2. Color Palette Examples: Qualitative, and Quantitative Single-Hue, Multi-Hue, and Diverging
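The four palette types in Figure 2 can be approximated with functions from base R's grDevices package. This is only a sketch: the specific hues, the chroma and luminance settings, and the choice of seven classes are assumptions that follow the examples in the text (white to red, yellow to red, and blue, white, red), not the colors used in the figure.

```r
n <- 7  # the text recommends roughly seven classes or fewer

# Qualitative: evenly spaced hues at constant chroma and luminance,
# so that colors differ mainly in hue.
qualitative <- hcl(h = seq(0, 360, length.out = n + 1)[-(n + 1)],
                   c = 80, l = 60)

# Sequential single-hue: lightness changes, hue does not (white to red).
single_hue <- colorRampPalette(c("white", "red"))(n)

# Sequential multi-hue: hue changes along with value (yellow to red).
multi_hue <- colorRampPalette(c("yellow", "red"))(n)

# Diverging: two disparate hues joined by a neutral midpoint.
diverging <- colorRampPalette(c("blue", "white", "red"))(n)

palettes <- list(qualitative = qualitative, single_hue = single_hue,
                 multi_hue = multi_hue, diverging = diverging)
print(palettes)
```

In an interactive session, a call such as `barplot(rep(1, n), col = diverging)` gives a quick visual preview of any one palette.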

Establishing good color contrast is good practice overall, especially for readers with differences in color vision. Whenever possible, it is recommended to check chosen color palettes through some form of simulated preview to see what the result looks like for readers with deuteranopia, protanopia, or other differences in color vision (e.g., within Adobe image and vector editing software with different Proof Setups or with online resources such as the Coblis simulator).

Variations and Alternatives

Some common alternatives to the Sankey diagram are flow maps and network diagrams.

Flow maps visualize phenomena in which the movement, transferral, or migration of variables is quantifiable on a geographical plane, usually over a specific period of time (Figure 3). The basis of a flow map is a network with edges representing connections between nodes. Each node in a network that can be visualized as a flow map should have at least a roughly defined geographical location. Each connection or edge should have a weight and be directed, thus indicating the magnitude or value of a variable moving from one node to another and the direction in which it moves. The weight can be indicated by line thickness and color, and the direction of flow is usually marked by arrowheads. The data can be either classified or unclassified, with line thickness corresponding either directly to the magnitude of transference or to a certain class interval.

Text next to the map reads, “Total refugees and people in refugee-like situations, countries with over 600 persons in 2016: The category of people in refugee-like situations is descriptive in nature and includes groups of persons who are outside their country or territory of origin and who face protection risks similar to those of refugees, but for who refugee status has, for practical or other reasons, not been ascertained. Source: UNHCR Statistical Database; United Nations High Commissioner for Refugees 2016.”

The numbers of refugees from Sudan to other countries are as follows. Sweden: 702. Syria: 742. Israel: 4,647. Egypt: 13,848. Lebanon: 695. Jordan: 2,222. Italy: 2,003. Netherlands: 759. France: 7,049. Germany: 693. Libya: 626. United States of America: 1,383. Chad: 312,468. Central African Republic: 2,114. South Sudan: 241,510. Ethiopia: 39,896. Kenya: 87,141. Uganda: 2,545.

The numbers of refugees from other countries to Sudan are as follows. Eritrea: 103,176. Israel: 6,997. Central African Republic: 1,579. Chad: 8,502. Ethiopia: 3,663.

The numbers of refugees from South Sudan to other countries are as follows. Sudan: 297,168. Ethiopia: 338,774. Kenya: 87,141. Uganda: 639,007. Democratic Republic of Congo: 66,672. Central African Republic: 4,912. Egypt: 2,532.

The numbers of refugees from other countries to South Sudan are as follows. Democratic Republic of Congo: 14,476. Central African Republic: 1,853. Ethiopia: 4,691.

Arrows point from the origin to the destination. The thickness of each arrow increases with the number of refugees.

Figure 3. A Flow Map
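The weighted, directed edges behind a flow map such as Figure 3 can be held in a simple edge list. The R sketch below uses a handful of the refugee counts given in the figure description and derives an unclassified line width for each edge. The maximum width of 12 is an assumed styling choice, and only a subset of the listed flows is included.

```r
# A few directed, weighted edges taken from the Figure 3 description
# (refugee counts, 2016). Only a subset of the listed flows is used.
edges <- data.frame(
  from   = c("Sudan", "Sudan", "South Sudan", "South Sudan", "Eritrea"),
  to     = c("Chad", "Egypt", "Uganda", "Ethiopia", "Sudan"),
  weight = c(312468, 13848, 639007, 338774, 103176),
  stringsAsFactors = FALSE
)

# Unclassified encoding: line width is directly proportional to weight,
# rescaled so the largest flow gets the maximum width.
max_width <- 12  # assumed maximum line width (a styling choice)
edges$width <- max_width * edges$weight / max(edges$weight)

# Order edges so that the largest flows are listed first.
edges <- edges[order(-edges$weight), ]
print(edges)
```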

Network diagrams, on the other hand, are visualizations that display the relationships and connections between objects or concepts without geographical data (Figure 4). The network consists of point nodes or vertices connected by lines, called edges or arcs, that indicate their shared relationships. Usually, connected elements are placed relatively close to each other with the help of a force-directed algorithm. When an algorithm is used for placement, the proximity of nodes also encodes some additional information about the relationship, such as shared characteristics. Network diagrams can be either weighted or unweighted. In their unweighted form, all nodes and edges are of the same strength and uniformly sized. If the graph is weighted, the nodes or the edges connecting them can be sized based on some additional variable, such as the number or strength of relationships. Furthermore, network diagrams can be directed, in that edges have an associated direction. In these cases, the nodes are connected by arrows or directed edges.

Similar arcs with arrows but of lesser thickness point from South Sudan to Sudan and vice versa; from Sudan to Chad; and from South Sudan to Ethiopia. Similar arcs with arrows but of still lesser thickness point from South Sudan to Democratic Republic of Congo; from Sudan to Eritrea; and from South Sudan to Kenya. Further thin arcs, some ending in arrows and some not, diverge in different directions from South Sudan and Sudan to the following countries: NLD, ISR, JOR, SYR, GBR, EGY, USA, LBN, DEU, FRA, Kenya, Chad, SWE, CAF, Democratic Republic of Congo, ITA, and LBY.

Figure 4. A Network Diagram

Illustrative Example: EU Security and Citizenship Expenditure and Revenue 2018

Figure 5 shows a Sankey diagram of EU expenditures and revenues in 2018, using data from the European Commission’s EU expenditures and revenue dataset. Lines of expenditure are sorted based on quantity. The Sankey diagram gives an easy at-a-glance impression of general budgetary outlays in the year 2018. A default qualitative color scheme was used to distinguish the various flow streams. The headline clarifies the main visual content.

All data are approximate. The expenditure under security and citizenship is used for the following purposes, in decreasing order of proportion:

• Decentralised agencies
• Asylum, Migration and…
• Internal Security Fund
• Food and feed
• Instrument for Emerge…
• Creative Europe Progr…
• Actions financed unde…
• Rights, Equality and…
• Union action in the f…
• Justice Programme
• Union Civil Protectio…
• Europe for Citizens
• Consumer Programme
• IT Systems
• Pilot projects and pr…
• Other actions and pro…

Text above the diagram reads, “Source: European Commission, 2018.”

Figure 5. A Sankey Diagram of Security and Citizenship Spending and Revenue
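A diagram of this kind could be drawn in R in several ways. The sketch below assumes the networkD3 package and its `sankeyNetwork()` function, which is one common option and not necessarily the approach taken in the accompanying walkthrough. All node names other than "Security and citizenship" are hypothetical, and the values are placeholders rather than the Commission's figures.

```r
# Placeholder nodes and links; values are NOT the Commission's figures.
nodes <- data.frame(
  name = c("Total expenditure", "Security and citizenship",
           "Other expenditure", "Programme 1", "Programme 2"),
  stringsAsFactors = FALSE
)
links <- data.frame(
  source = c("Total expenditure", "Total expenditure",
             "Security and citizenship", "Security and citizenship"),
  target = c("Security and citizenship", "Other expenditure",
             "Programme 1", "Programme 2"),
  value  = c(10, 90, 7, 3),
  stringsAsFactors = FALSE
)

# Sort lines of expenditure by quantity, largest first.
links <- links[order(-links$value), ]

# networkD3 expects zero-based integer indices into the nodes table.
links$source_id <- match(links$source, nodes$name) - 1L
links$target_id <- match(links$target, nodes$name) - 1L

# Build the chart only if the (assumed) networkD3 package is installed.
if (requireNamespace("networkD3", quietly = TRUE)) {
  chart <- networkD3::sankeyNetwork(
    Links = links, Nodes = nodes,
    Source = "source_id", Target = "target_id",
    Value = "value", NodeID = "name",
    units = "million EUR", fontSize = 12
  )
}
```

Printing `chart` in an interactive R session displays the diagram. The package's default color scale is qualitative, which matches the kind of scheme described for Figure 5.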

The Data

The dataset used in this demonstration is a subset of the EU expenditure and revenue 2014–2020 dataset from the European Commission. The full dataset includes expenditures and revenues divided by program, year, and country. For this demonstration, the scope has been limited to the year 2018 and to total European expenditures and revenues. The numbers reported in the data table are in millions of euros.
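Limiting the scope in this way amounts to a row filter. The R sketch below shows the idea on a small invented table. The column names (`programme`, `year`, `country`, `amount`), the "EU total" label, and all values are assumptions and do not reproduce the layout or contents of the Commission's file.

```r
# Hypothetical long-format table; column names and values are assumptions
# and do not reproduce the Commission's file layout or figures.
budget <- data.frame(
  programme = c("Programme 1", "Programme 1", "Programme 2", "Programme 2"),
  year      = c(2017, 2018, 2018, 2018),
  country   = c("EU total", "EU total", "EU total", "Country X"),
  amount    = c(100, 120, 80, 5),   # millions of euros
  stringsAsFactors = FALSE
)

# Limit the scope to 2018 and to the EU-wide totals.
budget_2018 <- subset(budget, year == 2018 & country == "EU total")

# Sort programmes by amount, largest first.
budget_2018 <- budget_2018[order(-budget_2018$amount), ]
print(budget_2018)
```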

Interpreting the Chart

The Sankey diagram created through this demonstration shows a brief overview of some EU expenditures and revenues in the year 2018, highlighting in particular security and citizenship-related expenses. The data show that security and citizenship expenditure is only a minuscule portion of total expenses and that this small portion is further divided among a multitude of programs ranging from migration to IT security to culture and creativity. The three largest of these smaller programs are Asylum and Migration, Internal Security Fund, and Decentralised Agencies, all of which are easy to distinguish in the diagram due to the simple encoding of value in line length and width. Likewise, identifying the smallest outlays and understanding the scheme of overall value distribution are relatively simple visual tasks with this type of diagram.

The Sankey diagram is generally a good choice for data like this, where we want to succinctly visualize the breakdown of a whole into a subset of varying parts and examine how these values flow between different points. This particular chart is mostly useful for visualizing the quantity and magnitude of flow streams within the budget and particularly for putting these streams into a larger context within the entire scheme of revenues and expenses.

Review

This dataset example has demonstrated the Sankey diagram, how it can be used, and how it compares to other visualization types for use with similar data. A subset of the European Commission’s EU expenditure and revenue dataset was visualized.

You should know:

• What are Sankey diagrams?
• What kind of data can a Sankey diagram encode?
• When is a Sankey diagram an appropriate visualization choice?
• What are the best practices for composing Sankey diagrams?
• What are the main weaknesses and limitations of this visualization method?

Your Turn

You may now proceed to download the sample dataset and walkthrough guide on how to carry out the visualization with the R statistical software. The sample dataset includes a number of interesting variables that can be used to augment the example pictured above. You may, for example, experiment with visualizing a different subcategory of expenditure or revenue or applying different stylistic features to better bring out different facets of the dataset.
