GODAN ACTION LEARNING PAPER

Land and nutrition data in the Land Portal and the Global Nutrition Report: a gap exploration report

Valeria Pesce, Global Forum on Agricultural Research (GFAR)

Lisette Mey, Land Portal

Pauline L’Hénaff, Open Data Institute (ODI)

Carlos Tejo-Alonso, Land Portal

30 June 2018

Executive summary

GODAN Action supports data users, producers and intermediaries to effectively engage with open data and maximise its potential for impact in the agriculture and nutrition sectors. In particular, we work to strengthen capacity, to promote common standards and best practice, and to improve how we measure impact.

This gap analysis report is the third in a series which has examined gaps in data standards. The first version of the report examined gaps in agriculture and food data (Pesce, Kayumbi, Tennison, Mey, and Zervas: 2016).

A second version (Pesce, Tennison, Dodds and Zervas: 2017), in line with the 2017 project focus on weather data and related use cases, examined the situation in the area of data standards for weather data (and closely related geospatial data), and particularly focused on weather data for use in farm management services.

This third version focuses on data standardisation gaps in specific use cases of aggregation of land data and nutrition data around indicators: the Land Portal and the Global Nutrition Report.

The report starts with a review of the relevant types of data for these use cases, then illustrates similarities between the two projects and similar standardisation gaps, and then moves to more specific challenges for the two individual projects.

The main conclusions drawn from the report are that:

• The two use cases present many similarities. They both aggregate data from secondary sources, already partly normalised by global agencies; they both aggregate data around specific indicators; they aggregate from datasets with a similar structure (indicator, country, year, value).

• The identified gaps in data standardisation are very similar. The names of countries and regions in data sources are not standardised or they are standardised according to different conventions; the names of the variables do not follow any convention; indicators are represented by strings and may change over the years (both their names and the measurement methods).

With reference to our data standard assessment criteria, the few standards used by the data sources (country naming conventions, value ranges, units of measurement) in both cases are not open and not very usable. The situation is different when it comes to the way the two projects re-publish the data: the GNR normalises values around some conventions, while the LP re-publishes everything according to principles, and uses published vocabularies.

The Land Portal (LP) gathers information from a broad range of land-related data and information providers. It is organised and visualised in ways that are intuitive and usable for researchers, private sector actors and policy makers at global and local levels. The information provided can strengthen research, advocacy, and policy-making efforts by enabling a better understanding of land governance issues affecting various countries and regions.

The Global Nutrition Report (GNR) is a comprehensive narrative on global and country-level nutrition. GNR produces the Report annually and aggregates a wealth of nutrition and nutrition-related data from a wide range of sources. This data underpins the report itself as well as being used to produce a range of supplementary materials, including country, regional, and sub-regional profiles and data visualisation tools.


Contents

1 Introduction

1.1 Relevant types of data

2 Similarities between the two use cases: types of data sources

2.1 Common standardisation issues

3 Land data use case: the Land Portal

3.1 Land Portal specific standardisation gaps

3.1.1 Data sources

3.1.2 Re-published data

4 Nutrition data use case: the Global Nutrition Report (GNR)

4.1 GNR specific standardisation gaps

4.1.1 Data sources

4.1.2 Re-published data

5 Experts interviewed

6 Conclusions

References


1 Introduction

This third version of the data standards gap analysis focuses on specific use cases of aggregation of land data and nutrition data: the Land Portal and the Global Nutrition Report.

The Land Portal [1] gathers information from a broad range of land-related data and information providers and is organised and visualised in ways that are intuitive and usable for researchers, private sector actors and policy makers at global and local levels. The information provided can strengthen research, advocacy, and policy-making efforts by enabling a better understanding of land governance issues affecting various countries and regions.

The Global Nutrition Report [2] is a comprehensive narrative on global and country-level nutrition. GNR produces the Report annually and aggregates a wealth of nutrition and nutrition-related data from a wide range of sources. This data underpins the report itself as well as being used to produce a range of supplementary materials, including country, regional, and sub-regional profiles and data visualisation tools.

Before writing this report, we conducted a survey of data standards for land data and nutrition data. This report will show that very few of the standards surveyed are relevant to the Land Portal or the Global Nutrition Report. This is because our survey considered primary land data and nutrition data, which normally comes in typical statistical formats (from tabular to SDMX) and uses the data dictionaries, code lists and the classification schemes agreed upon by authoritative agencies in that field. All data standards relevant for this type of data are analysed in our survey report (forthcoming).

In this report, on the other hand, we are focusing on the two use cases, which are data aggregators aiming to visualise and narrate the current status of relevant socio-economic indicators in the world. They therefore collect data from secondary sources built by global agencies responsible for those indicators, where the primary national/regional data have already been normalised and consolidated around specific indicators.

This report illustrates the similarities between the two use cases in terms of structure of the data sources and related data standardisation gaps.

1.1 Definitions

In this document, we use the same terminology as in our previous reports, which is explained in more detail in ‘A Map of Agri-food Data Standards’ (Pesce, Tennison, Mey, Jonquet, Toulet, Aubin, and Zervas: 2016).

We often use the terms “data standards” and “vocabularies” interchangeably, to indicate any specification (from models to templates/schemas to data dictionaries to code lists to thesauri) that normalises the way an entity is described or categorised. This corresponds to the general definitions of the W3C [4] for vocabularies: “vocabularies define the concepts and relationships used to describe and represent an area of concern”. [3]

In other cases, when it is important to clarify the type of a specific data standard/vocabulary or which type of standard would be appropriate to improve the interoperability of some datasets, we indicate the more specific type of standard, either referring to a broader group of standards if any type in that group applies, or referring to the specific type of standard if we want to be more specific. In these cases we refer to the groupings defined by the W3C and to the specific types defined by the list of KOS: [5]

• Metadata element sets or element sets (or “description vocabularies”) “define classes and attributes used to describe entities of interest”. Specific types of description vocabularies are: metadata schemas (more specifically, XML schemas, JSON schemas, RDF schemas), models (including UML models), templates, ontologies and more.

• Value vocabularies “define resources (such as instances of topics, art styles, or authors) that are used as values for elements in metadata records. [...] A value vocabulary thus represents a controlled list of allowed values for an element. Examples include: thesauri, code lists, term lists, classification schemes, subject heading lists, taxonomies, authority files, digital gazetteers, concept schemes, and other types of knowledge organisation systems”.

The second type of vocabulary is very relevant for the use cases analysed in this document. Therefore, the term value vocabulary or more specifically terms like code lists or classifications are often used in the document, according to the definitions above and to the more detailed definitions in ‘A Map of Agri-food Data Standards’ and in the published Map of agri-food data standards. [6]

[1] http://landportal.org
[2] http://globalnutritionreport.org/
[3] https://www.w3.org/standards/semanticweb/ontology
[4] https://www.w3.org/2005/Incubator/lld/XGR-lld-vocabdataset-20111025/#Introduction:_Scope_and_Definitions
[5] http://wiki.dublincore.org/index.php/NKOS_Vocabularies#KOS_Types_Vocabulary
[6] See definitions at http://vest.agrisemantics.org/about/structure

2 Similarities between the two use cases: types of data sources

The thematic topics on which the project focuses in the second year are land data and (mal)nutrition data.

Land tenure, land use and malnutrition are socio-economic dimensions, mainly measured through statistics and surveys and elaborated through projections.

Therefore, as we explain in more detail in our survey report, the data formats adopted are normally the typical statistical formats (from tabular to SDMX). On the other hand, the highly topic-specific data standards are the data dictionaries, code lists and the classification schemes used. These are, of course, specific to the type of data and agreed upon by authoritative agencies in that field.

In addition, all these types of datasets have some geospatial dimension, although in most cases mainly geopolitical and limited to area codes or country codes. So (except for land cover data, which is not the core of the Land Portal use case), pure geospatial standards are not very relevant, while conventional area and country code lists are very much used.

This is the situation we described in our survey report regarding land data and nutrition data in general. However, neither of the two use cases on which the project is focusing, the Land Portal and the Global Nutrition Report, uses raw statistical data directly. They are both aggregators of secondary data sources.

To understand these similarities, let’s first summarise the aims of the two projects that constitute the use cases for the topics land data and nutrition data.

The Land Portal

The Land Portal (LP) gathers land-related data and information from a broad range of information providers and organises and visualises the data in ways that are intuitive and usable for researchers, private sector actors and policy makers at global and local levels. The information provided can strengthen research, advocacy, and policy-making efforts by enabling a better understanding of land governance issues affecting various countries and regions.

In particular, the Land Portal aggregates country-level data relevant to land-specific indicators from several global data sources (FAO, World Bank, UNDP, UNEP, OECD, IFPRI and others) and a couple of national ones (India, Laos) that provide already aggregated data from primary sources around such indicators. The Land Portal monitors 623 indicators that provide aggregated information about land tenure at country level, as well as contextual information about land governance at country level or with regard to a specific issue. The scope of the analysis is limited to the assessment of these specific indicators on the Land Portal.

The Global Nutrition Report

The Global Nutrition Report (GNR) is a comprehensive narrative on global and country-level nutrition. GNR produces the Report annually and aggregates a wealth of nutrition and nutrition-related data from a wide range of sources. This data underpins the report itself as well as being used to produce a range of supplementary materials, including country, regional, and sub-regional profiles and data visualisations.

The data are compiled from secondary sources including the United Nations Children’s Fund (UNICEF), the World Health Organisation (WHO), and the World Bank (WB) among many others. The dataset broadly contains information on adult and child nutrition, economic demography, nutrition intervention coverage, and policy legislation in the nutrition sector. Overall, the finally aggregated data are organised around over 70 indicators.

So, in both cases the data sources are secondary sources: data already aggregated from original statistics and surveys into simple and harmonised tabular datasets that provide the values of the indicators for each country (or in some cases for geopolitical areas).

This means that the structure of the data sources is quite similar in both cases: the name/code of the indicator, the country/area code to which the value refers, the value and some textual notes.

Below we have included some examples:


Figure 1 Extract of a dataset of country-level values for land indicators on “Expropriation, Compensation, and Resettlement” [7]

Figure 2 Extract of a dataset of country-level values for nutrition indicators on child wasting (NUTRITION_WH_2 = “Children aged <5 years wasted”) [8]

Therefore, many of the standardisation practices described in our survey report (guidelines and code lists for national statistics, survey methodologies) are not very relevant in this secondary aggregation, as primary data from national statistics and surveys have already been combined to calculate the values of the indicators and consolidated in simple tabular datasets with one line each for indicator name/code, country, year and indicator value.

This report focuses on the two use cases as aggregators, as the major standardisation issues come from the aggregation of data. However, in the chapters dedicated to the individual use cases, we will also assess the standardisation of the data the two projects re-publish and the data standards they use.

The similarities between the types of datasets aggregated by the two projects allow us to do a preliminary general analysis of standardisation issues for the data sources of both.

2.1 Common standardisation issues

As we have demonstrated, the structure of the source dataset is very similar for both projects: the name/code of the indicator, the country/area code to which the value refers, the value of the indicator and some textual notes. Sometimes the disaggregation goes a little beyond the country level, adding the dimensions of sex or income.

Beyond the numeric value and the textual value, which do not lend themselves to much standardisation (with some exceptions, included in the paragraph on indicators’ values below), the two elements that can present some form of standardisation are the country names and the indicators.

Data formats

The format of the data sources, both for the LP and the GNR, is almost always tabular, CSV or Excel.

In some cases, the data is retrieved from APIs that accept a format parameter and can send data as CSV, XML or JSON, like the World Bank API, the WHO Global Health Observatory API [9] or the data API of the Center of Disease Control.

Below are examples of the same observation from the WHO Global Health Observatory dataset for indicator “NUTRITION_WH_2” (Children aged <5 years wasted) for the year 1997, once in CSV > Excel and once in XML.

[7] Tagliarino, N.K. 2018. Voluntary Guidelines Section 16 Indicators on Expropriation, Compensation, and Resettlement. University of Groningen: Groningen, Netherlands. Available at: https://landportal.org/book/dataset/nkt-vggt16
[8] WHO Global Health Observatory API: http://apps.who.int/gho/athena/api/GHO/NUTRITION_WH_2?format=csv
[9] http://apps.who.int/gho/athena/api/

Figure 3 WHO GHO observation in CSV (http://apps.who.int/gho/athena/api/GHO/NUTRITION_WH_2?format=csv)

Figure 4 WHO GHO observation in XML (http://apps.who.int/gho/athena/api/GHO/NUTRITION_WH_2)
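The format parameter described under “Data formats” can be illustrated with a minimal Python sketch. It only builds the request URL shown in the figure captions and parses CSV text into one record per observation; it makes no network call. The function names are hypothetical, and the sample column names and values are placeholders, not a copy of a real API response.

```python
import csv
import io
from urllib.parse import urlencode

# Base URL taken from the figure captions above.
GHO_BASE = "http://apps.who.int/gho/athena/api/GHO"


def build_gho_url(indicator: str, fmt: str = "csv") -> str:
    """Return the request URL for one indicator in the requested format."""
    return f"{GHO_BASE}/{indicator}?{urlencode({'format': fmt})}"


def parse_observations(csv_text: str) -> list:
    """Parse CSV text into one dict per observation (row)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Hypothetical sample: column names and values are illustrative only.
SAMPLE = "INDICATOR,COUNTRY,YEAR,VALUE\nNUTRITION_WH_2,XXX,1997,9.9\n"

rows = parse_observations(SAMPLE)
print(build_gho_url("NUTRITION_WH_2"))
print(rows[0]["INDICATOR"], rows[0]["YEAR"], rows[0]["VALUE"])
```

An aggregator would still need a separate parser for the XML variant of the same observation, which is one reason a single agreed format per source reduces work.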

Country/area codes

The practices for country and area names are quite similar in the types of datasets aggregated by the two projects: in some cases there are just names of countries or areas, referring to some conventional list of names used by the agency that publishes the data, but in many cases names are also associated with a standard code list.

The most used standard for these simple country and area codes is the Standard Country or Area Codes for Statistical Use (M49) of the United Nations Statistics Division (UNSD), linked to the ISO 3166 alpha codes.

Where country names are used, besides the country names in the UN Standard Country or Area Codes for Statistical Use, other standardised country names used are the ones from the ISO 3166, from the OECD list, the World Bank, the FAO geopolitical ontology or from the World Integrated Trade Solution (WITS).

The fact that country names or codes are not uniformly standardised in datasets could be one of the common standardisation gaps to consider.

Besides, when data are aggregated at the regional level, not all the data sources use the same regional system (continents, economic regions, climatic regions). The issue of different regional systems or country groupings has also been addressed by the Joined-up Data Standards Navigator mentioned below, although the mapping is limited to specific classifications.

To solve this issue, in both projects developers have manually compiled an ad-hoc file mapping country names under different conventions, country codes and associated regions in different regional systems. Examples can be found in chapters 3 and 4 on the individual use cases.

Indicators

The two projects of course use different sets of indicators, but the way they are used is very similar. Normally the rows for each country or area contain the indicator name or code and the value.

Lists of indicators have been developed by many agencies for goals and objectives in their mandate, like the FAO food security indicators or the OECD indicators or the Global Health Observatory indicators.

Many of these lists of indicators were developed in the context of the Sustainable Development Goals (SDGs). The official list of SDG indicators includes a number of land-related and nutrition-related indicators, for example:

• 1.4.2 Proportion of total adult population with secure tenure rights to land

• 2.4.1 Proportion of agricultural area under productive and sustainable agriculture

• 5.a.1 (a) Proportion of total agricultural population with ownership or secure rights over agricultural land

• 2.1.1 Prevalence of undernourishment

• 2.1.2 Prevalence of moderate or severe food insecurity

• 2.2.2 Prevalence of malnutrition.

Besides these indicators, which have been agreed upon by the UN and have been formally assigned to “custodian agencies” according to their scope of application, other indicators have been set and are monitored independently by agencies that work on specific socio-economic issues.

The LP and the GNR monitor several indicators (623 and 80 respectively), a few of which are part of the SDG indicators while many others are monitored by dedicated organisations.

“Indicators are data or combination of data collected and processed for a clearly defined analytical or policy purpose. That purpose should be explicitly specified and taken into account when interpreting the value of an indicator.” [10] This means that the value of an indicator cannot be interpreted without a description of which data it is derived from, how it is calculated and for what purpose. Therefore the tabular datasets which only contain the indicator as a variable name or a variable value do not provide enough information to interpret the indicator. In some cases, information is provided in a metadata section of the dataset, with a textual description of how the value is calculated; in many other cases this information is published as an external resource that has to be consulted before reading the dataset.

Normally indicators have an identifier or short name, a longer name and a full description of what the indicator measures. Therefore, from a standardisation point of view, they would lend themselves to be formalised in a vocabulary.

From this point of view, an interesting experiment regarding the SDG goals and indicators is the Sustainable Development Goals Interface Ontology (SDGIO) [11], which “provides a semantic bridge between 1) the Sustainable Development Goals, their targets, and indicators and 2) the large array of entities they refer to” and provides the full list of indicators linked to the objectives in RDF [12] and in CSV. [13]

For the indicators used by the LP and the GNR, no agreed standard is used, but we will see that some agencies have at least published the list of indicators in machine-readable format with a correspondence between the indicator and the variable name in the dataset.

One issue related to standardisation is the fact that indicators may change (both names and actual calculation) or disappear over the years, and the new series are not necessarily comparable with the previous ones. Besides, in some cases two agencies are using the same indicator but the two indicators are not formally linked. What helps with the comparability is the metadata about the indicator, but it is only human-readable. Having all indicators formalised as a vocabulary (similar to the SDGIO mentioned above) with relationships between indicators could help in disambiguating indicators and merging data for the same indicator correctly.

An interesting experiment in mapping indicators (and other variables) across different agencies is the Joined-up Data Standards Navigator [14], which maps concepts like indicators, sectors and country groupings across the following systems, using SKOS concept schemes with relations as the backbone:

• UN Classifications of the Functions of Government (COFOG)

• Organisation for Economic Cooperation and Development Creditor Reporting System (CRS)

• National Center for Charitable Statistics National Taxonomy of Exempt Entities (NTEE)

• UN International Standard Industrial Classification of All Economic Activities (ISIC)

• World Bank Themes

• World Bank Sectors

• Millennium Development Goals (MDGs)

• Sustainable Development Goals (SDGs)

• World Bank World Development Indicators (WDI)

• Demographic and Health Surveys (DHS) V11

• UNICEF Multiple Indicator Cluster Survey (MICS5)

More details about specific indicators used in the two use cases are provided in chapters 3 and 4 on the individual use cases, but in general the lack of a common reference framework for indicators, including identifiers, definitions and possibly relations, is another common area of possible improvement.

[10] Sabatella, E.; Franquesa, R. Manual of fisheries sampling surveys: methodologies for estimations of socio-economic indicators in the Mediterranean Sea. Studies and Reviews. General Fisheries Commission for the Mediterranean. No. 73. Rome, FAO. 2003. 37p. http://www.fao.org/docrep/006/y5228e/y5228e03.htm#bm03.1
[11] https://github.com/SDG-InterfaceOntology/sdgio
[12] https://raw.githubusercontent.com/SDG-InterfaceOntology/sdgio/master/imports/sdgio_indicator_values_import.ttl
[13] https://github.com/SDG-InterfaceOntology/sdgio/blob/master/docs/SDG_ind.csv
[14] http://joinedupdata.org/en/translate/

Indicators’ values

Indicators’ values present the normal standardisation issues of statistical values. In most cases they are numbers, in fewer cases named value ranges out of a conventional list.

Even when values are numbers they are not necessarily immediately interpretable without additional information: absolute numbers often have to be interpreted against a specific unit of measurement, other numbers are percentages. Good practices for representing the units of measure or calculation method of a value exist: in traditional datasets, this is done in the dataset metadata section (whether it is a header in the same file or a separate file). In semantic datasets, these types of semantics have been standardised in some vocabularies and can be linked to the data and be read by machines.

Of the data sources considered in this report, many are accompanied by some form of metadata with measurement information, while none use semantic technologies.

Units of measurement are managed in different ways in different datasets (sometimes they are in the metadata, sometimes they are hinted at in the variable/column name).

When values are ranges represented by codes or strings, they refer to some code list or convention. These conventions may be indicated in the metadata, but are more often just documented in guidelines. None of the indicator value ranges used in the considered data sources seems to have been formalised in a data standard.

All these aspects would lend themselves to improvements in terms of data standardisation.

Variables’ names and dataset structure

A basic statistical dataset is “a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question”. [15]

Typical variables in the datasets considered in this study are year, country/region, indicator and value. An interesting aspect of data standardisation would be the standardisation of the names of such variables. In other fields (e.g. weather and climate data) practitioners have felt the need for common names of variables and have created Conventions (like the Climate and Forecast Conventions). [16]

In the datasets used by the Land Portal and the GNR there is no such standardisation of variable names. The year is sometimes “year”, sometimes “time”, sometimes “yr”, and above all the indicator, when it’s used as a variable, doesn’t have a standard name. However, this is not perceived as an issue because the variants are very few (two or three maximum) and normally there is only one dataset providing data for one indicator, so there is no need to integrate data on one indicator from several datasets.

One note on the structure of these indicator datasets: we said that the variables are more or less the same, but, besides the names of the variables, there can be differences in the treatment of indicators as variables or as values of the “indicator” variable. The difference is normally just that the indicators are “transposed” in the table.

Normally, the variables in a tabular dataset would be the names of the columns, while in the columns you find the values for each variable for each member of the dataset. In some indicator-based datasets, the indicator is a variable, i.e. a column name:

Figure 5 Indicator as a variable in dataset

Country          Year   Distribution of agricultural holders by sex (female - total n)
Algeria          2001   41793
American Samoa   2008   1133
Argentina        2002   32768

[15] https://en.wikipedia.org/wiki/Data_set
[16] http://cfconventions.org/

In many other datasets, especially if the dataset covers more indicators, the indicator is a value of the “indicator” variable, which causes sort of a transposition of the table (which will require a different parser and therefore more work on the part of the aggregator):

Figure 6 Indicator as a value in dataset ....
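The extra parsing work caused by the two layouts can be sketched in Python. The example below is only an illustration, not the method used by either project: it normalises the variable names (using the “year”, “time”, “yr” variants mentioned above) and converts the “indicator as a variable” layout of Figure 5 into the “indicator as a value” layout. The country aliases and all function names are assumptions; the sample rows are the three rows shown in Figure 5.

```python
# Header variants for the year column, as listed in the text above.
YEAR_ALIASES = {"year", "time", "yr"}
# Assumed variants for the country/area column (hypothetical list).
COUNTRY_ALIASES = {"country", "area", "region"}


def normalise_header(name: str) -> str:
    """Map known header variants to a common name; keep indicator names as-is."""
    key = name.strip().lower()
    if key in YEAR_ALIASES:
        return "year"
    if key in COUNTRY_ALIASES:
        return "country"
    return name.strip()  # anything else is treated as an indicator column


def wide_to_long(rows: list) -> list:
    """Turn rows with one column per indicator into one row per indicator value."""
    long_rows = []
    for row in rows:
        clean = {normalise_header(k): v for k, v in row.items()}
        country = clean.pop("country")
        year = int(clean.pop("year"))
        for indicator, value in clean.items():
            long_rows.append({
                "indicator": indicator,
                "country": country,
                "year": year,
                "value": float(value),
            })
    return long_rows


INDICATOR = "Distribution of agricultural holders by sex (female - total n)"
WIDE = [
    {"Country": "Algeria", "Year": "2001", INDICATOR: "41793"},
    {"Country": "American Samoa", "Year": "2008", INDICATOR: "1133"},
    {"Country": "Argentina", "Year": "2002", INDICATOR: "32768"},
]

long_rows = wide_to_long(WIDE)
for r in long_rows:
    print(r["indicator"], r["country"], r["year"], r["value"])
```

With agreed variable names and a single layout, the alias sets and the transposition step would not be needed.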

Gaps in primary sources affecting the quality of secondary sources

As we have seen, the secondary data sources used are sufficiently standardised, at least in terms of structure (dimensions and variables) and granularity.

However, secondary sources don’t always provide all the data that they could provide at the desirable level of granularity (e.g. sub-national), because the original data sources were either not standardised enough or not collected at a granular enough level.

This gap analysis should consider that some gaps may have to be addressed at the level of the standardisation and methodology adopted in the primary national statistics and surveys. More examples of these types of gaps can be found in chapters 3 and 4, dedicated to the individual projects.

Openness and usability of the standards

Besides the assessment of the level of standardisation of the data, our gap analysis also aims to assess the data standards themselves.

As we explained in the first version of this gap analysis, our assessment of data standards is based on two existing assessment practices: the assessment process used by the UK Government’s Open Standards Board and the Open Data Certificates. These assessment criteria were organised in four categories: (a) fitness for purpose, (b) adoption, (c) usability and (d) openness.

The data sources of the two use cases use very few data standards, such as the UN M.49 for countries and regions, and a few agency-specific or inter-agency lists of indicators and ranges of indicator values.

An assessment of the level of fitness for purpose and adoption for these standards can easily be done, relying on the fact that the global agencies that create the datasets are authoritative bodies and set the standards. So these country naming conventions and indicators/indicator values are broadly recognised and adopted.

However, a more technical analysis of the level of openness and usability of these standards reveals that they could be much improved.

Let’s take as an example the UN M.49 standard for area codes, which is very often the convention used in these datasets for country or region names.

UN M.49 is a standard for area codes [17] used by the United Nations for statistical purposes, developed and maintained by the United Nations Statistics Division. Each area code is a 3-digit number which can refer to a wide variety of geographical, political, or economic regions, like a continent, a country, or a specific group of developed or developing countries.

UN M.49 is clearly very authoritative and broadly adopted, but it has only been published as HTML and PDF. Therefore, it is neither open nor interoperable and the level of usability is low.

[17] https://unstats.un.org/unsd/methodology/m49/

Figure 7 Assessment of the UN M.49 standard in the Map of agri-food data standards [18]
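Because the code list is not distributed in a machine-readable form, aggregators end up hand-compiling mappings like the ad-hoc files described under “Country/area codes”. The Python sketch below shows what such a mapping could look like for the three areas of Figure 5. It is a hypothetical illustration, not the mapping file used by either project: the dictionary layout, the alias entries and the function names are assumptions, while the M49 and ISO 3166 alpha-3 codes shown are the published ones for these three areas.

```python
from typing import Optional

# Hand-compiled excerpt: normalised country name -> UN M49 area code.
M49_BY_NAME = {
    "algeria": "012",
    "american samoa": "016",
    "argentina": "032",
}

# Name variants as they might appear under other naming conventions
# (hypothetical alias list, for illustration only).
NAME_ALIASES = {
    "people's democratic republic of algeria": "algeria",
    "argentine republic": "argentina",
}

# M49 code -> ISO 3166 alpha-3 code, for sources keyed by alpha codes.
ISO3_BY_M49 = {"012": "DZA", "016": "ASM", "032": "ARG"}


def _key(name: str) -> str:
    """Collapse whitespace and ignore case so trivial variants match."""
    return " ".join(name.split()).casefold()


def to_m49(name: str) -> Optional[str]:
    """Return the M49 code for a country name, or None if it is not mapped."""
    key = _key(name)
    key = NAME_ALIASES.get(key, key)
    return M49_BY_NAME.get(key)


for raw in ["  ALGERIA ", "Argentine Republic", "American  Samoa", "Atlantis"]:
    print(repr(raw), to_m49(raw))
```

Returning None for unmapped names lets the aggregator flag them for manual review instead of silently dropping rows.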

The UN M.49 is the only real standard used. However, as previously stated, the indicators and the ranges of values used for some indicators could also be considered controlled vocabularies. If we want to assess the level of usability and openness of these lists, we again have to say that the level is very low: they have only been published in PDF guidelines or HTML pages and are therefore not machine-readable, with the exception of:

1. The WHO indicators, which are available via an API at http://apps.who.int/gho/athena/api/GHO

2. The World Bank indicators, available through an API at https://api.worldbank.org/v2/indicators

3. The SDGs indicators, published as an ontology with the full list of indicators linked to the objectives in RDF [19] and in CSV [20].

Figure 8 XML record of a WHO indicator from the API at http://apps.who.int/gho/athena/api/GHO

18 http://vest.agrisemantics.org/content/un-m49-area-codes-countries-and-regions
19 https://raw.githubusercontent.com/SDG-InterfaceOntology/sdgio/master/imports/sdgio_indicator_values_import.ttl
20 https://github.com/SDG-InterfaceOntology/sdgio/blob/master/docs/SDG_ind.csv

Figure 9 XML record of the WB indicator “Land under cereal production (hectares)” (https://api.worldbank.org/v2/indicators/AG.LND.CREL.HA)

Figure 10 RDF representation of the SDG indicator for malnutrition in children under 5

However, these machine-readable versions of these two sets of indicators are not leveraged in the source datasets of the LP and GNR nor in the aggregation process.

Other data standards used in the two projects, for the re-purposing of the data, will be analysed in chapters 3 and 4.

Now let’s see in more detail how these common standardisation issues present themselves in the two use cases, highlighting particular features that are specific to each.

3 Land data use case: the Land Portal

In our survey report we give a detailed overview of all the types of data relevant for land issues (land tenure, land use, land cover). Here we focus on specific examples of types of data used by the Land Portal. It is important to keep this in mind and be aware that the standardisation challenges described in this chapter should not be regarded as exhaustive or indicative for all types of land data as described in our survey report.

As briefly noted in the previous chapter, the types of data aggregated in the Land Portal are quite homogeneous: they are secondary datasets compiled from primary statistical data around agreed indicators.

The main data sources for this data are: FAO (Agricultural Census, Food Security, Land and Gender, Land Use), the World Bank (Demographic Indicators, Health and Nutrition Indicators, LGAF Scorecards, Rural Development Indicators, Socio-Economic Indicators, World Governance Indicators), UNDP (Human Development Index), IFPRI (Global Hunger Index), OECD (Social Institutions and Gender Index Database), the Land Matrix (Large-Scale Land Acquisitions), among others.

The Land Portal monitors 623 indicators21 grouped under 9 issues:

• Forest tenure
• Indigenous & Community Land Rights
• Land conflicts
• Land & corruption
• Land & food security
• Land & gender
• Land & investments
• Rangelands, Drylands & Pastoralism
• Urban tenure

Some examples of indicators that we will use in the following chapter:

• Indicator “Area operated as owner”22 monitored by the FAO Agricultural Census

• Indicators “Agricultural holders by sex (female - total n)”23 (and male - total n)24, “Distribution of agricultural holders by sex (female - share %)” (and male - share %), monitored by FAO in the Gender and Land Rights Database25

• Indicators “A right to alternative land and/or housing for all displaced persons”26, “A right to negotiate compensation levels”27 and related, covered by the UN Voluntary Guidelines on the Responsible Governance of Tenure of Land Fisheries and Forests in the Context of National Food Security (VGGTs)28

• Indicators LMAF-LAC-1.1.A (Number of households - Total), LMAF-LAC-1.1.B (Number of farm households)29 and related for Lao, monitored by the Lao Ministry of Agriculture and Forestry.

3.1 Land Portal specific standardisation gaps

3.1.1 Data sources

The general standardisation gaps described in chapter 2 for both projects apply: names/codes of countries, names/codes of indicators, gaps in the primary sources.

Let’s look at some examples in particular.

Country names

In the data sources of the Land Portal, country names are represented according to different conventions: country names according to the UN M.49 convention or to the FAO Geopolitical Ontology, ISO codes, UN M.49 codes, and also some customised ones.

Figure 11 UN M.49 country names in FAO Agricultural Census dataset on indicator “Land operated by tenure type: area”

21 https://landportal.org/book/indicators
22 https://landportal.org/book/indicator/fao-aci-6
23 https://landportal.org/book/indicator/fao-lg1fa
24 https://landportal.org/book/indicator/fao-lg1fb
25 http://www.fao.org/gender-landrights-database/data-map/statistics/en/
26 https://landportal.org/book/indicator/nkt-vggt16-9c
27 https://landportal.org/book/indicator/nkt-vggt16-3e
28 http://www.fao.org/docrep/016/i2801e/i2801e.pdf
29 https://landportal.org/book/indicator/lmaf-lac-11b

Figure 12 Country ISO codes in dataset of the Lao Census of Agriculture

Figure 13 FAO English short country names in dataset on VGGTs30 indicators

The solution adopted by the Land Portal team to aggregate data by country is to convert all country representations to the country ISO code using an ad-hoc “resolution table” (that they call “country reconciler”) plus a country alias file that is updated with new aliases encountered in the sources.

Figure 14 The LP internal “country reconciler” table mapping between different country naming conventions and codes
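The reconciliation approach described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Land Portal’s actual code: the class `CountryReconciler`, the three small tables and the normalisation rules are all hypothetical, and the tables contain only a few sample entries.

```python
import unicodedata

# Hypothetical sample tables; a real resolution table would cover all countries.
RESOLUTION_TABLE = {  # official name -> ISO 3166-1 alpha-3 code
    "Lao People's Democratic Republic": "LAO",
    "United Republic of Tanzania": "TZA",
    "Bolivia (Plurinational State of)": "BOL",
}
ALIASES = {"Laos": "LAO", "Lao PDR": "LAO", "Tanzania": "TZA", "Bolivia": "BOL"}
M49_TO_ISO3 = {"418": "LAO", "834": "TZA", "068": "BOL"}


def _key(text):
    """Lower-case, strip accents and collapse whitespace for name matching."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.casefold().split())


class CountryReconciler:
    def __init__(self, names, aliases, m49):
        self._by_name = {_key(k): v for k, v in names.items()}
        self._by_name.update({_key(k): v for k, v in aliases.items()})
        self._by_m49 = dict(m49)
        self._iso = set(names.values()) | set(aliases.values()) | set(m49.values())
        self.unresolved = []  # values to review and add to the alias file

    def resolve(self, value):
        """Return the ISO code for a name, alias, ISO code or M.49 code."""
        text = str(value).strip()
        if text.upper() in self._iso:
            return text.upper()
        if text.isdigit() and text.zfill(3) in self._by_m49:
            return self._by_m49[text.zfill(3)]
        code = self._by_name.get(_key(text))
        if code is None:
            self.unresolved.append(text)
        return code

    def add_alias(self, alias, iso3):
        """Record a new alias encountered in a source."""
        self._by_name[_key(alias)] = iso3


reconciler = CountryReconciler(RESOLUTION_TABLE, ALIASES, M49_TO_ISO3)
```

Values that cannot be resolved are collected in `unresolved`, which mirrors the idea of an alias file that grows as new spellings are found in the sources.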

In addition, when the data is re-published as RDF (see chapter 3.1.2 below) the country is represented by its URI in the FAO Geopolitical Ontology.31 The Geopolitical Ontology can be looked up by ISO code, so the mapping is quite straightforward.

Indicators

As we said above, data on indicators are imported from the datasets produced mainly by the agencies that monitor those specific indicators and publish datasets where the values of the indicators have already been calculated for each country.

In terms of standardisation of indicator names, labels in the Land Portal data sources are normally indicator codes/IDs and are used as values of the variable “indicator” or “indicatorID” or “indicator-internal-id”. There is no resolution file or vocabulary at the source level that maps IDs to names and descriptions.

An interesting aspect of the LP is the publication of all metadata about the datasets, including information related to each indicator, in machine-readable format: for instance, from the page on the FAO Agricultural Census dataset32, a CSV or JSON file with the metadata of all covered indicators is available. This information is retrieved using a SPARQL query to the LP SPARQL endpoint.

Figure 15 Extract of the LP indicator metadata CSV dump for the FAO Agricultural Census

30 UN Voluntary Guidelines on the Responsible Governance of Tenure of Land Fisheries and Forests in the Context of National Food Security (VGGTs)
31 http://cfconventions.org/

Some land-related indicators are also in the list of indicators for the Sustainable Development Goals (SDGs), but the official process to collect data for these indicators is currently still ongoing. Therefore there is no official data for the moment to be aggregated in the LP. This fact can be interesting anyway in terms of standardisation in view of a possible mapping between the local LP indicators and the SDG indicators which, as we saw earlier, have already been published as RDF.

SDG indicators relevant for land:

• Indicator 1.4.2. - Proportion of total adult population with secure tenure rights to land, with legally recognised documentation and who perceive their rights to land as secure, by sex and by type of tenure;

• Indicator 2.4.1. - Proportion of agricultural area under productive and sustainable agriculture;

• Indicator 5.a.1. - (a) Proportion of total agricultural population with ownership or secure rights over agricultural land, by sex; (b) share of women among owners or rights-bearers of agricultural land, by type of tenure.

As we said in chapter 2, the issue of indicators not having standardised names is not perceived as a big difficulty because data is aggregated around indicators and there is one dataset for each indicator, so there is no risk of having to integrate data for the same indicator from different sources (in which case using standardised names or codes would help).

However, since the LP re-publishes all data in RDF including indicators, all indicators used by the LP have a URI and an RDF record that includes a code and a description. The LP could therefore, in the absence of any standardisation of the indicators by the originating agencies, act as the standardisation authority for them, and its URIs and codes could be promoted as standard identifiers for these indicators.

Figure 16 JSON output of RDF records for indicators (URI and label) in the LP

As we said in chapter 2, indicators may change or disappear between years. For instance, for the LP, the Global Hunger Index produces new series every year and an indicator may disappear or change and the new series is not necessarily comparable with the previous ones. Or the FAO food security set of indicators changes this year (sometimes the internal ID of the indicator remains the same, sometimes it changes completely). What helps with the decision to merge or not data for the same indicator produced in different years is the metadata that describes the indicator.

Besides, even though normally there is one dataset for each indicator, there are sometimes more agencies publishing data for the same indicator. For instance, some FAO indicators are the same as the World Bank’s and IFPRI has very similar ones but sometimes calculated with their own methodology, building on the indicators of other agencies. Given the importance of these relationships to understand if and when data are comparable, it would be interesting to formalise them in a vocabulary. For instance, the existing RDF version of SDG indicators could include special relationships between indicators (“same as”, “derived from”, “similar”).

Indicators’ values

As mentioned in chapter 2, indicator values can be numbers or literal values from a controlled list of possible values.

Considering what we said about indicators being measurements and often the result of calculation between various original statistical values, numbers can be quantities, percentages, combinations of values. For instance the IFPRI Global Hunger Index (GHI) is “calculated as the average of three indicators, the proportion of the population that is undernourished (Undernourishment), the proportion of underweight children under five years old (Child Stunting and Child Wasting), and the proportion of children dying before age five (Child Mortality). The index ranks countries on a 100 point scale, with 0 being the best and 100 the worst.”

When indicator values are literals, they come from a controlled list of values created for that indicator, normally found in the legend of the dataset or in separate documentation.

32 https://landportal.org/book/dataset/fao-aci
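The two kinds of indicator value discussed in this section, calculated numbers and literals from a controlled list, can be sketched in Python as follows. This is an illustration only, not the method used by IFPRI or the Land Portal: `simple_average_index` follows just the quoted description of an average of three component percentages, and `HYPOTHETICAL_CODED_VALUES` is an invented placeholder for the values that a dataset legend would define.

```python
from statistics import mean


def simple_average_index(undernourishment, child_underweight, child_mortality):
    """Average three component percentages, as in the quoted description."""
    components = (undernourishment, child_underweight, child_mortality)
    for value in components:
        if not 0 <= value <= 100:
            raise ValueError(f"component outside the 0-100 scale: {value}")
    return mean(components)


# Invented indicator ID and codes, standing in for a dataset legend.
HYPOTHETICAL_CODED_VALUES = {"EXAMPLE-INDICATOR": {"A", "B", "C", "D"}}


def is_valid_literal(indicator_id, value, coded_values=HYPOTHETICAL_CODED_VALUES):
    """Check a literal value against the controlled list for its indicator."""
    allowed = coded_values.get(indicator_id)
    return allowed is not None and value in allowed
```

Keeping the controlled list next to the indicator ID, rather than in a separate legend or PDF, is what allows an aggregator to reject an unexpected literal automatically.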

Figure 17 Literal values for VGGT indicator NKT-VGGT16-1a: “Clear Conceptualization of Public Purpose”

Figure 18 Legend of indicators and possible indicator values for the VGGT dataset

The data sources only include this information in legends or external documentation, while the LP publishes them as part of their RDF data: in particular, for the Indicator class (in the Computex vocabulary)33, the LP uses a specific property to represent the relationship between the indicator and the list of possible values: “has Coded Value” (which could potentially link the indicator to a skos:ConceptScheme containing the controlled terms, although the LP hasn’t published these schemes yet).34

Also in this respect, the indicator data re-published by the LP could become the reference indicator vocabulary for land, at least until the custodian agencies decide to publish their indicators as linked data.

Gaps in primary sources

Several aspects of the standardisation of primary data could improve the way data are aggregated by the secondary sources and subsequently by the LP. These issues include standard classification schemes, data models and formats, licensing and more. In our survey report, we illustrated some examples of data standards that have been created, for example, for classifying types of land use or types of land tenure, or for adopting common data models for cadaster data. Clearly, the lack of adoption of such standards in the primary data is a problem for data standardisation that deserves to be mentioned in a gap analysis, and for examples of these standards we refer to our survey report.


However, while many of the general standardisation improvements related to statistical standards and the use of common classifications and data models would definitely make the work of primary data aggregators easier, there are some issues that deserve to be especially highlighted for the specific case of the LP, because certain changes in the way primary data is collected and exposed would not just make the primary data aggregation work easier but would also allow for additional types of analyses and visualisations in the LP itself. So we decided to focus on improvements in the primary data sources that would actually improve the LP end users’ experience.

For instance, an issue that the LP coordinators have chosen to highlight as a potential area for improvement in the primary data is a more granular geographic/geopolitical aggregation of data, for instance at sub-national level, using different types of administrative units depending on the type of data.

For example, a governance quality indicator could be measured at first- or second-level administrative scale, while land use data could be usefully measured at the farm plot level.

This would entail some work in identifying suitable existing sub-national subdivisions, for example starting from global resources like GeoNames35 or the UN Second Administrative Level Boundaries (SALB)36 or the Global Administrative Areas (GADM)37 data or the IATI Administrative Area (First-level)38 and Administrative Area (Second-level) under construction, together with agreed types of administrative divisions (e.g. the GeoNames classes like “first-order administrative division” or the IATI location types39).

In addition, for types of data where the farm plot level or pure geospatial definition is more meaningful, GIS/spatial/point in space geolocation would be preferable, in the case of farm plots matching also existing databases of farm boundaries or cadasters.

However, the main problem here would probably be the availability of data aggregated at that level: it is unlikely that there are many statistics or surveys at that level of granularity.

3.1.2 Re-published data

The Land Portal re-publishes all statistical data as Linked Data, reusing many existing vocabularies, especially a few specialised vocabularies for statistical data and for indicators:

• RDF Schema (RDFS) for properties common to most resources

• Dublin Core for properties common to most resources

• RDF Data Cube40 as a means to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts using RDF

• Computex41 (Computing Statistical Indexes), an extension of the RDF Data Cube vocabulary to handle statistical indexes

• SDMX42 (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organisations

• The OWL-Time ontology43, an ontology of temporal concepts, for describing the temporal properties of resources in the world

• The Schema.org vocabulary44 for properties of all relevant entities (creative works, persons, organisations, events, places)

• The SKOS vocabulary45 for all related concepts

Besides these description vocabularies used to describe and link all entities (datasets, indicators, observations), the LP also uses value vocabularies to use standardised values for certain properties:

• The FAO Geopolitical Ontology for countries and geopolitical entities

• The LandVoc46 (the Linked Land Governance Thesaurus) for land-related concepts

The LandVoc is mainly derived from FAO’s Agrovoc47, the standard Agriculture vocabulary, but extends it with many other terms from vocabularies designed and/or used by land governance stakeholders on both a global and local level. In the LP, datasets and indicators are all linked to terms from the LandVoc.

Currently, the ranges of values allowed for certain indicators are not serialised as RDF value vocabularies, although this seems something the Land Portal team could consider implementing.

In general, the standardisation level of the Land Portal data publication layer is very high: the LP follows all Linked Data best practices and it reuses all relevant vocabularies for the types of data it manages. The vocabularies themselves are open, linked and endorsed by the relevant communities.

33 http://purl.org/weso/computex/ontology#
34 More info on the Land Portal RDF model and vocabularies: https://landportal.org/book/reuse
35 http://www.geonames.org/
36 https://www.unsalb.org/methodology
37 https://gadm.org/data.html
38 http://iatistandard.org/101/codelists/administrative_area_1/index.html
39 http://iatistandard.org/101/codelists/location_type/index.html
40 https://www.w3.org/TR/vocab-data-cube/
41 http://purl.org/weso/ontology/computex#

Figure 19 Graph of the RDF representation of a dataset and its indicators in the Land Portal triple store

Figure 20 RDF representation of a dataset and its indicators in the Land Portal triple store.

@prefix qb: .
@prefix schema: .
@prefix dc: .
@prefix rdfs: .
@prefix skos: .
@prefix ns0: .

a qb:DataSet ;
  schema:image ;
  schema:logo ;
  dc:subject , , , ;
  dc:publisher ;
  dc:identifier "FAO-FS.2011" ;
  dc:title "FAO - Food Security (2011 suite)" ;
  dc:description "This dataset contains a subset of the suite of indicators (suite 2011) on the level of food security in several countries as provided by FAO." ;
  rdfs:seeAlso .

a ;
  dc:identifier "FAO-21018-6125" ;
  skos:notation "FAO-21018-6125" ;
  dc:source ;
  dc:subject , , ;
  ns0:unitMeasure "Index" ;
  dc:title "Domestic food price level index" ;
  dc:description "Domestic food price level index is an important indicator for global monitoring of food security because it compares the relative price of food across countries and over time." ;
  rdfs:seeAlso

42 https://raw.githubusercontent.com/UKGovLD/publishing-statistical-data/master/specs/src/main/vocab/sdmx-attribute.ttl#
43 https://www.w3.org/TR/owl-time/
44 http://schema.org/
45 http://www.w3.org/TR/swbp-skos-core-spec
46 https://landportal.org/voc/landvoc
47 http://aims.fao.org/vest-registry/vocabularies/agrovoc

4 Nutrition data use case: the Global Nutrition Report

In our survey report we already narrowed down the domain of nutrition to the more specific types of data used in the Global Nutrition Report (GNR) and we clarified that the focus is clearly on food security and malnutrition.

The key types of data covered by the GNR are:

• Food security surveys (access to food, measured also through some of the types of data below)

• Demographic data, census data (population, age, gender, income)

• Population health data (childbirth, breastfeeding, anthropometry, nutritional disorders)

• Food consumption/dietary habits surveys

• Government expenditure (on nutrition interventions)

• Government legislation and policies

The main data sources are: United Nations, Department of Economic and Social Affairs, Population Division for population statistics (modelled estimates), World Bank for household surveys and GDP per capita data from OECD, UNICEF/WHO/World Bank Group Joint Child Malnutrition Estimates based on population surveys, UNICEF for national birth registrations, birth and breastfeeding data, household surveys and routine reporting systems, WHO health surveys, modelled estimates on nutritional disorders and deficiencies from the Global Health Observatory Data Repository, FAO food balance sheets and surveys from different organisations (IFPRI SPEED, ILO, IDS, SUN) on government expenditure and implementation of rights.

As noted, a lot of harmonisation work is done by these global agencies that aggregate the data from primary sources (normally national statistics and surveys) and make them available again in a more standardised way.


These data are used to assess the status and monitor progress against around 80 indicators, grouped under 8 categories:

• Demographics
• Anthropometry
• Micronutrient status outcomes
• Diet related risk factors for NCDs
• Determinants
• Intervention coverage
• Financial Resources
• Institutional/Legislative/Policy

Some examples of indicators that we will use in the following chapter:

• “Children aged <5 years wasted” monitored by the WHO Global Health Observatory under “Child malnutrition country survey results”48, from UNICEF-WHO-WB joint child malnutrition estimates.

• The World Bank GINI index, a measure of statistical dispersion intended to represent the income or wealth distribution of a nation’s residents, the most commonly used measurement of inequality. A Gini index of 0 represents perfect equality, while an index of 100 implies perfect inequality.

4.1 GNR specific standardisation gaps

4.1.1 Data sources

The general standardisation gaps described in chapter 2 apply: names/codes of countries, names/codes of indicators, gaps in the primary sources.

Country names and regions

Similarly to what we said for the Land Portal, country name representations can be different in different sources. Normally, conventions are the same in datasets produced by the same agency: global agencies that aggregate country level data frequently face the challenge of normalising country level primary data to meet globally determined data standards.

One consideration regarding the facilitation of data standardisation could be that the agencies that compile the secondary datasets could also benefit from web services supporting crosswalks between different naming conventions. In some cases, the agencies need to curate that data extensively in order to normalise values and apply a consistent naming convention.

Compared to the LP, the GNR has a few additional data integration issues while aggregating by region. There are at least three orders of problems regarding regional systems:

a) different sources may use different region names for the same region;

b) the same region name may not always identify the same actual region, as a region can be physical (a continent) or geopolitical (so dependent on the geopolitical view of the world) or even socio-economic or sliced in any way that is relevant for an agency;

c) the same region name may not identify the same actual region because the countries belonging to a region may not be exactly the same in all systems (either for geopolitical reasons or because regions are defined by different bodies, e.g. different economic observatories define different economic regions).

Even the slightest difference in a region definition (e.g. one country belonging to it or not) may make data not comparable and therefore impossible to aggregate around a common indicator at regional level.

In the examples below, the FAOSTAT dataset seems to use the FAO Geopolitical Ontology regions, while the final aggregation in the GNR is made around the UN M.49 regions.

48 http://apps.who.int/gho/data/node.main.CHILDMALNUTRITION?lang=en

Figure 21 Example of regional aggregation from FAOSTAT around adequate diet (used by the GNR for the undernourishment indicator)

Figure 22 Example of how final values for the indicator undernourishment (calculated also based on the dietary energy supply adequacy above) in the GNR are aggregated around continents as defined in the UN M.49.

Figure 23 Another example of region representation in a WHO Global Health Observatory dataset with indicator “NUTRITION_WH_2” (Children aged <5 years wasted)

The GNR team uses a similar mapping table to the one used by the LP, with the addition of the mapping to different regional systems.

Probably, the LP and the GNR teams could spare their individual efforts in mapping countries and countries/regions if a reliable global geopolitical reference vocabulary included these mappings. Two existing projects cover part of such mappings:

• the FAO Geopolitical Ontology49, which maps countries to FAO regions and at the level of regions at the moment only maps the FAOSTAT region code to the UN M.49 region code;

• the JoinedUp Data Standards Navigator, which does some work in mapping countries to different “supranational regions and groupings”50, although the correspondences or overlaps between regions are not clear.
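The problem of regions that share a name but not a membership can be made concrete with a small Python sketch. The two regional systems below are hypothetical and their country lists are illustrative rather than official; the point is only that a mapping of countries to regions in machine-readable form makes the differences between systems easy to detect before aggregating.

```python
# Hypothetical regional systems keyed by region name; members are ISO alpha-3
# codes chosen for illustration, not official membership lists.
REGION_SYSTEMS = {
    "system_a": {"Northern Africa": {"DZA", "EGY", "LBY", "MAR", "TUN"}},
    "system_b": {"Northern Africa": {"DZA", "EGY", "LBY", "MAR", "SDN", "TUN"}},
}


def membership_diff(region, system_1, system_2, systems=REGION_SYSTEMS):
    """Return the countries found in only one of the two region definitions."""
    members_1 = systems[system_1][region]
    members_2 = systems[system_2][region]
    return sorted(members_1 - members_2), sorted(members_2 - members_1)


def is_comparable(region, system_1, system_2, systems=REGION_SYSTEMS):
    """Regional values are comparable only if both definitions match exactly."""
    only_1, only_2 = membership_diff(region, system_1, system_2, systems)
    return not only_1 and not only_2
```

Here `membership_diff("Northern Africa", "system_a", "system_b")` reports that one country appears only in the second definition, so regional totals from the two systems should not be merged without recalculation from country-level data.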


Figure 24 Example of different regional system in the JoinedUp Navigator browser.

Figure 25 Example of mapping between region and country in the JoinedUp Navigator RDF backbone.

Names of variables

The fact that the same variable can have different names in different data sources is normally not a problem when aggregating by indicator, because as we said in most cases there is one source dataset for one indicator. However, there are some variables beyond the usual country/year series that are common across indicators and could be used to produce different aggregations (e.g. age ranges, indicated and/or measured in different ways in different sources). In cases where it would be useful to aggregate data around those variables, some standardisation could become useful.

An obvious such variable is gender, which however normally only has three values, even if codified in different ways: female, male and both.

49 http://www.fao.org/countryprofiles/geoinfo/en/
50 http://joinedupdata.org/sup/en/page/?uri=http://joinedupdata.org/Supranational

Another obvious variable around which it could be interesting to aggregate data from different indicators is age, or better age groups. Data sources use different age groups, either by group name sometimes referring to different age ranges (Infant, child, adolescent, adult; or pre-school, school age) or by numeric age ranges, sometimes overlapping (0-59 months, 12-15 months, 12-23 months, under 5, 15 years and older…). These are sometimes not consistent even from the same data provider.

The examples below show different gender codes and different age ranges/groups. Clearly aggregating around these variables, especially age ranges that overlap, is not easy.

Figure 26 Example of sex and age range variables and related values in a WHO dataset for the underweight population

Figure 27 Example of sex and age range variables and related values in a WHO dataset for the mass index indicator by age group and sex

Figure 28 Example of the sex variable in a WHO Global Health Observatory dataset with indicator “NUTRITION_WH_2” (Children aged <5 years wasted): in this case the age range is part of the indicator itself.
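The difficulty with overlapping age ranges can be sketched in Python by expressing each label as an interval in months and listing the pairs that overlap. The labels are those quoted in the text; their conversion to months is an assumption made for this illustration (for example, “under 5” is read as 0-59 months), and the names `AGE_RANGES_MONTHS`, `overlaps` and `overlapping_pairs` are hypothetical.

```python
import math
from itertools import combinations

# Assumed conversion of the quoted labels to inclusive ranges in months;
# None marks an open upper bound.
AGE_RANGES_MONTHS = {
    "0-59 months": (0, 59),
    "12-15 months": (12, 15),
    "12-23 months": (12, 23),
    "under 5": (0, 59),
    "15 years and older": (180, None),
}


def overlaps(range_a, range_b):
    """Return True if two inclusive (low, high) ranges share any age."""
    low_a, high_a = range_a
    low_b, high_b = range_b
    high_a = math.inf if high_a is None else high_a
    high_b = math.inf if high_b is None else high_b
    return low_a <= high_b and low_b <= high_a


def overlapping_pairs(ranges=AGE_RANGES_MONTHS):
    """List every pair of labels whose age ranges overlap."""
    return [
        (label_a, label_b)
        for (label_a, range_a), (label_b, range_b) in combinations(ranges.items(), 2)
        if overlaps(range_a, range_b)
    ]
```

Values reported for overlapping groups cannot simply be summed, because part of the population would be counted more than once; only groups that do not overlap can be combined safely.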

Indicators

For most indicators, the GNR behaves like the LP: it collects data already aggregated by secondary sources around that indicator.

And as for the LP, there is some correspondence with the SDG indicators.

In general, beyond a precise matching of the indicators with the SDG indicators, the GNR obviously contributes to tracking progress towards the SDG targets, using the voluntary global nutrition targets adopted by member states of the World Health Organisation (WHO). The Global Nutrition Report has been tracking these global nutrition targets over the last four years.


These targets comprise: the Maternal Infant and Young Child Nutrition (MIYCN) targets (six global targets on MIYCN adopted at the World Health Assembly in 2012 to be attained by 2025); diet-related NCD targets (three of nine NCD targets; adopted at the World Health Assembly in 2013 to be attained by 2025). These ‘MIYCN targets’ and ‘diet-related NCD targets’ overlap significantly with SDG targets 2.2 and 3.4, highlighting the synergies between the SDGs and current tracking efforts to tackle malnutrition.

So the relevant SDG indicators with which some of the GNR indicators overlap would be:

• 2.2.1 Prevalence of stunting among children under 5 years of age

• 2.2.2 Prevalence of wasting and overweight among children under 5 years of age

• 3.4.1 Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease

However, no exact matching is specified and the relations with these indicators are not formalised.

In addition to these indicators already monitored by other agencies, there is a subset of indicators monitored in the GNR that is different from the others: indicators of nutrition spending. This is the only type of indicator for which the GNR does not aggregate from secondary sources that have already consolidated the data around the indicator, but rather aggregates directly from primary sources (like donors’ spending and official government spending data). Being GNR-specific indicators, primary data are elaborated in a custom way to calculate the indicator, so the lack of homogeneity of the data sources is a bigger issue in this case.

An example of these standardisation issues is reported in a GNR report entitled “Nourishing the SDGs”51 when speaking of Official Development Assistance (ODA): ODA spending data reported to the Organisation for Economic Co-operation and Development (OECD) Development Assistance Committee (DAC) is used by the Creditor Reporting System (CRS) database to track nutrition investments by key donors, but “there is no systematic and standardised way for the CRS to capture crosscutting nutrition-sensitive investments across sectors. One recommendation to improve the way nutrition is captured in the CRS is by redefining the basic nutrition code to better align to the concept of nutrition-specific interventions. A redefined code would also capture any investment that supports the scale up of those interventions, including research, governance and policy support. Another recommendation is introducing a nutrition policy marker to identify nutrition investments across sectors, which would allow for tracking in a more integrated way. A nutrition policy marker would identify projects across health, emergency response, agriculture, education and any other nutrition sensitive sector that has nutrition goals, targets and activities. It would cover investments aimed at preventing overweight and obesity and diet-related NCDs”.

An agreed vocabulary could help to solve these issues (a redefined code, a nutrition policy marker).

Issues with the primary data

There are also significant gaps in the data, and issues with the quality of the data that inhibit how effectively the GNR authors can use the data available. Similarly to the LP, the GNR team particularly highlighted the absence of good quality data at the subnational level within countries and an inability to disaggregate data sufficiently (for example by gender, age, ethnicity etc.) as problems that inhibit their analysis.

Subnational level data is deemed very useful because it can help target nutrition efforts where they are most needed. In many countries there is a wide variation in stunting between regions/districts, with many subnational regions having stunting rates three times higher than the region with the lowest stunting rate (GNR, 2016). However, good quality, disaggregated, timely subnational level data is often not available. Either it is not collected, or if collected the process for releasing it is slow and the data is out of date by the time it is available for use. In other cases the data is collected by different agencies who may not want to share the data for fear of losing funding to a competitor, or the data collected uses different indicators.

Related to the issues highlighted in the previous chapter about indicators of nutrition spending, another issue for the GNR was their inability to access and integrate investment/financial data (aid and government spending) into their analysis.

These issues with the source data certainly present challenges for the team responsible for aggregating the data and may affect the ability of GNR stakeholders to engage with the data effectively for advocacy purposes. These issues however are only minimally due to data standardisation gaps, but rather to the availability or quality of the data itself.


51 Global Nutrition Report. Nourishing the SDGs. 2017 http://165.227.233.32/wp-content/uploads/2017/11/Report_2017-2.pdf

The only aspects where data standardisation could help are:

• For subnational data: as we said for the LP, the identification of geospatial standards that treat subnational divisions at the level needed by the GNR (see above, standards like GeoNames or the UN SALB).
• For both subnational level data and government/aid expenditure data, if the problem is sensitivity: a better standardisation of statistical techniques like anonymisation.
• For aid expenditure data, some standardisation of the categories of spending could help. A starting point could be the IATI code lists [52], perhaps to be extended.

Other challenges highlighted by experts

The GNR data managers have highlighted some issues related to the openness and usability of the source data. For example, some agencies consider the datasets that they collect to be freely available. However, this data is not searchable online in any format and may be presented in a format that is not easily usable (e.g. as scans or PDFs).

It seems that more work on data publishing and related standardisation is needed to better meet the needs of the GNR data aggregation and analysis.

4.1.2 Re-published data

The GNR re-publishes the aggregated data with some level of normalisation.

Data are re-published in CSV format. The normalisation work concerns:

• Country and region names, normalised to the UN M.49 convention
• Indicator names: normally the same names and definitions as in the sources, sometimes shortened
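The country-name normalisation described above can be illustrated with a minimal Python sketch. It is not the GNR team's actual procedure: the lookup table `M49`, the function `normalise` and the sample rows are hypothetical, and the values in the sample are made up purely for illustration. Only the M.49 codes and names for the three countries shown are taken from the UN convention.

```python
import csv
import io

# Hypothetical lookup from source spellings to (UN M.49 code, M.49 name).
# Only a few entries are shown; a real table would cover all countries and regions.
M49 = {
    "tanzania": ("834", "United Republic of Tanzania"),
    "united republic of tanzania": ("834", "United Republic of Tanzania"),
    "south africa": ("710", "South Africa"),
    "bolivia": ("068", "Bolivia (Plurinational State of)"),
}


def normalise(rows):
    """Return (normalised rows, list of country names that could not be matched)."""
    normalised, unmatched = [], []
    for row in rows:
        key = row["country"].strip().lower()
        if key in M49:
            code, name = M49[key]
            # Keep the original columns, overwrite the name, add the numeric code.
            normalised.append({**row, "country": name, "m49": code})
        else:
            unmatched.append(row["country"])
    return normalised, unmatched


# Made-up sample in the (indicator, country, year, value) structure.
SAMPLE = """indicator,country,year,value
Stunting prevalence,Tanzania,2015,1.0
Stunting prevalence, South Africa ,2016,2.0
Stunting prevalence,Atlantis,2016,3.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
normalised, unmatched = normalise(rows)
```

Unmatched names are collected rather than silently dropped, so that spellings missing from the lookup can be reviewed by hand.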

Figure 29 GNR output data in tabular format

The data are not re-published as Linked Data, which somewhat limits the potential for reusability and linkability. Even though there are probably not many applications ready to exploit statistical linked data, it could be worthwhile to follow the example of the Land Portal and re-publish everything with the same approach, copying the methodology that has been successfully applied to the Land Portal, i.e. the same data model and the same vocabularies.
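As a rough idea of what re-publishing a tabular row as statistical linked data could involve, the sketch below serialises one (indicator, country, year, value) record as N-Triples. It assumes an RDF Data Cube style model using the W3C `qb:` and SDMX vocabularies; this section does not spell out the Land Portal's actual data model, so that choice is an assumption. The base URI `http://example.org/gnr/`, the URI patterns and the function `observation_triples` are hypothetical.

```python
# Vocabulary namespaces (W3C RDF Data Cube and the SDMX-RDF dimension/measure sets).
QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical base URI; a real deployment would use its own domain.
BASE = "http://example.org/gnr/"


def observation_triples(indicator_id, m49_code, year, value):
    """Return N-Triples lines describing one statistical observation."""
    obs = f"<{BASE}observation/{indicator_id}/{m49_code}/{year}>"
    return [
        f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
        # One dataset per indicator (an assumption made for this sketch).
        f"{obs} <{QB}dataSet> <{BASE}dataset/{indicator_id}> .",
        # The area is identified by a URI built from the UN M.49 code.
        f"{obs} <{SDMX_DIM}refArea> <{BASE}area/{m49_code}> .",
        f'{obs} <{SDMX_DIM}refPeriod> "{year}"^^<{XSD}gYear> .',
        f'{obs} <{SDMX_MEASURE}obsValue> "{value}"^^<{XSD}decimal> .',
    ]


# Made-up value, for illustration only.
triples = observation_triples("stunting", "834", 2015, 1.0)
```

Giving indicators and areas URIs instead of free-text strings is what would let observations from different sources be linked, which is the reusability gain the paragraph above refers to.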

52 http://reference.iatistandard.org/203/codelists/

5 Experts interviewed

Below is the list of experts who have helped us identify gaps in data standardisation for the two thematic topics.

Land Portal

• Marcello De Maria (Land Portal)
• Carlos Tejo (Land Portal)

Global Nutrition Report

• Alan Stanley (Institute of Development Studies)
• Jordan Beecher (Development Initiatives, Bristol, UK), GNR data analyst

6 Conclusions

Malnutrition, land tenure and land use are socio-economic dimensions, mainly measured through statistics and surveys and elaborated through projections.

However, the two use cases are aggregators of secondary sources which in turn have consolidated primary statistical data around indicators. The two projects considered as use cases therefore do not use statistics and surveys directly, and their data standardisation issues are not related to the typical statistical standards or topic-specific vocabularies we described in our survey report.

Summarising the findings of this report, we can say that:

• The two use cases present many similarities. They both aggregate data from secondary sources, already partly normalised by global agencies; they both aggregate data around specific indicators; they aggregate from datasets with a similar structure (indicator, country, year, value).

• The identified gaps in data standardisation are very similar. The names of countries and regions in data sources are not standardised, or are standardised according to different conventions; the names of the variables and dimensions do not follow any convention; indicators are represented by strings and may change (both names and measurement methods).

• With reference to our data standard assessment criteria, the few standards used by the data sources (country naming conventions, value ranges, units of measurement) in both cases are not open and not very usable. The situation is different when it comes to the way the two projects re-publish the data: the GNR normalises values around some conventions but only reuses the basic conventions used in its major data sources (so basically only the UN M.49), while the LP re-publishes everything according to Linked Data principles and uses highly open and reusable published vocabularies.

In terms of data standardisation gaps, it seems that two areas lend themselves to interesting developments for the project, when it comes to use cases and pilots:

1. Support to the data aggregation process, by improving the standardisation of primary sources, focusing on one country. For the Land Portal, the country could be South Africa, using the South Africa Land Observatory. For the Global Nutrition Report, the country could be Tanzania, according to feedback from the GNR team. Support to standardisation could cover one of the areas identified in this document (indicators standardisation, vocabularies of variables, vocabularies of named value ranges etc.).

2. Support to one of the needs expressed by end users and decision makers: more data at the sub-national level. While this is mainly an issue of data availability, work on standards can play a role in identifying the relevant sub-national subdivisions and potential existing vocabularies that do or could accommodate such sub-divisions, as well as in developing web services for lookup and geospatial resolution.

This second use case would be more easily linked to the capacity development activities planned within the GODAN Action project, which are targeted at decision makers.


References

Pesce V, Kayumbi GW, Tennison J et al. Agri-food Data Standards: a Gap Exploration Report [version 1; not peer reviewed]. F1000Research 2018, 7:176 (document) (doi: 10.7490/f1000research.1115261.1)

Pesce, V.; Tennison, J.; Dodds, L. and Zervas, P. (2017) Weather data standards: a gap exploration report GODAN Action Learning Paper (available soon on F1000 GODAN Gateway https://F1000research.com/gateways/godan)

Pesce V, Tennison J, Mey L et al. A Map of Agri-food Data Standards [version 1; not peer reviewed]. F1000Research 2018, 7:177 (document) (doi: 10.7490/f1000research.1115260.1)

GODAN Action is supported by the UK Department for International Development (DFID), led by Wageningen Environmental Research with international partners AgroKnow, AidData, CTA, FAO, GFAR, IDS, Land Portal, and the ODI.

For more information visit the GODAN website www.godan.info/godan-action

Follow GODAN on Twitter: @godanSec

ORCID Identifiers:

Valeria Pesce https://orcid.org/0000-0003-3860-4304

Lisette Mey - not available

Pauline L’Hénaff - not available

Carlos Tejo-Alonso - not available

This GODAN Action publication is licensed under a Creative Commons Attribution 3.0 Unported License.