Deconstructing Google Dataset Search May 2019

Introduction

A little ripple went through the data services airwaves in the fall of 2018 when Google announced the beta version of Google Dataset Search. Google Scholar is increasingly successful as a research product, and now the tech giant has stepped forward again to bring its signature style to data. This could be a great boon to data discovery and could increase the overall sharing and use of research. But with that, Google brings its now familiar persuasive model to bear on another piece of digital research. This tool, especially if it follows the Google Scholar trajectory, will certainly make finding datasets easier. But will a corresponding increase in data provenance, sharing, and interoperability follow? Or, as with Google Scholar and publisher paywalls, will this tool’s impact be to highlight the painful gaps in data reuse, interoperability, and documentation?

How did Google Dataset Search start?

The beta of Google Dataset Search is launching in a rich landscape. It is rich both in content, as use of web-based data repositories grows, and in opportunity, as standards, guidelines, and even publishing requirements dictate ever-changing data sharing norms. As we in libraries have experienced, this shift in the data landscape is concurrent with trends in publishing, reproducible and open science, and the exciting opportunities new technology affords us year after year (Ware & Mabe, 2015). In 2017, Natasha Noy and Dan Brickley posted to Google’s AI Blog announcing the new project to make public datasets searchable through Google Search (Noy & Brickley, 2017). Citing the growing need for research data to be made publicly available, the Google team indicated that a search through their famous engine could be possible, thanks to the increasing popularity of structured and standardized descriptions of datasets. The challenge Google’s team set out to answer, per their own blog post, is that these datasets exist in multiple places across the internet. Not only that, but once a dataset is found, the “veracity or provenance of that information” is not always clear (Noy & Brickley, 2017).

How does Google Dataset Search work?

This search tool, like other Google products, uses website providers’ embedded descriptions and metadata. The more structured this information is, the better Google’s tools can represent the website in search results. To achieve this, Google’s guidelines are based on Schema.org (Follow the structured, 2019). The Schema.org vocabulary was founded by Google, Yahoo, Microsoft, and Yandex. It is now developed through an open, transparent community process facilitated by the World Wide Web Consortium (W3C), which actively develops and communicates about the growth of the web, and GitHub, a collaboration platform for web and software developers (Schema.org, n.d.). Schema.org provides commonly used, open, and transparent description standards for website structure and metadata. It is notable that Noy and Brickley, in their blog post, make the connection that Schema.org’s work on datasets closely follows W3C’s work. They “expect [it] will be a foundation for future elaborations and improvements to dataset description” (Noy & Brickley, 2017).

This did not automatically provide a good way to make datasets machine-findable across the internet. So, as with many other content types, Google published guidelines (Follow the structured, 2019). Alongside markup for job postings, recipes, and local businesses, there are structured data guidelines for datasets in the reference for developers. The markup guidelines provide definitions for required and recommended properties as well as best practices for sitemaps, source, and provenance. Google also uses this guide to recommend the Structured Data Testing Tool for validating your code. This tool examines your website, or any other provided URL, and detects the use of specific data types and the tags that describe the information. This gives site builders automatic feedback about how the Google search engine ‘sees’ the website, and whether there are any errors or warnings that can be corrected to improve the search.
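To give a rough sense of what ‘seeing’ structured data in a page involves, the following Python sketch pulls JSON-LD blocks out of an HTML page and lists the types they declare. It is only an illustration and not Google’s testing tool: the class and function names, the sample page, and the dataset it describes are all invented here, and a real validator checks far more than this.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def detect_types(html):
    """Return the types declared in a page's JSON-LD blocks."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    found = []
    for raw in parser.blocks:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # A site builder would want this reported as an error.
            found.append("(unparseable JSON-LD block)")
            continue
        items = parsed if isinstance(parsed, list) else [parsed]
        for item in items:
            if isinstance(item, dict):
                found.append(item.get("@type", "(no @type)"))
    return found


# Hypothetical landing page for an invented dataset.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example graduation rates",
 "description": "Invented example record used only for illustration."}
</script>
</head><body><h1>Example graduation rates</h1></body></html>
"""

print(detect_types(SAMPLE_PAGE))  # ['Dataset']
```

A page with no embedded markup returns an empty list, which is roughly the situation of a dataset that a metadata-driven search cannot recognize.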

What is Google Dataset Search doing?

In this tool, Google’s machine is targeting structured data indicating that a dataset is linked or displayed on the webpage. To do so, Google uses the web developer’s markup on the webpages, as well as the sitemap. It also looks for landing pages for the dataset, as seen in dataset repositories. By finding these specific metadata structures in the webpage’s code, Dataset Search knows whether the page contains datasets or links to datasets, whether it should be displayed in the search results, and whether any description is available.

When scanning a website for markup, Google Dataset Search uses the vocabulary of metadata types and properties provided by Schema.org to understand the webpage contents. This vocabulary defines the type ‘Dataset’ as “a body of structured information describing some topic(s) of interest,” a type of thing within the broader category ‘CreativeWork’ (Dataset, n.d.). According to the developers’ guidance, this type then requires two properties. The first is “description,” defined as “a short summary describing a dataset.” This must be entered as text in the code. The second required property is “name,” another text property defined as “a descriptive name of the dataset.” The guidance provides an additional fifteen properties that can be included in the website code. These are optional but provide deeper information about the dataset, including a citation, location, geospatial coverage, and temporal coverage.

On the sitemap, Google is looking for practices that identify the relationships between links or files on the webpage. For example, it will seek named relationships between pages on the website using the ‘sameAs’ or ‘isBasedOn’ properties. These properties indicate whether the description of the data is repeated elsewhere on a site and whether there are any derivatives or later iterations of the dataset.
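As an illustration of the markup described above, here is a minimal Python sketch that builds a Schema.org ‘Dataset’ description as JSON-LD and checks it for the two required properties. The record, its values, and the example.org URLs are invented placeholders, and the helper function is a hypothetical convenience rather than part of any Google or Schema.org tooling.

```python
import json

# The two properties the guidance treats as required for a Dataset.
REQUIRED = ("name", "description")


def missing_required(record):
    """List required properties that are absent or empty in a record."""
    return [prop for prop in REQUIRED if not str(record.get(prop, "")).strip()]


# Invented record: every value and URL below is a placeholder.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example graduation rates, 2010-2018",
    "description": "Invented example: annual graduation rates by school.",
    # Optional properties that add deeper information.
    "temporalCoverage": "2010/2018",
    "spatialCoverage": "New York City",
    # Named relationships: a copy of the same dataset published elsewhere,
    # and the source this dataset was derived from.
    "sameAs": "https://example.org/datasets/graduation-rates",
    "isBasedOn": "https://example.org/datasets/graduation-rates-raw",
}

# The JSON text below is what would be embedded in the landing page.
print(json.dumps(dataset, indent=2))

print(missing_required(dataset))  # []
print(missing_required({"@type": "Dataset", "name": "Untitled"}))  # ['description']
```

In a real page, the printed JSON would sit inside a script element of type application/ld+json in the dataset’s landing page.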
The guidance documentation provided by Google explains the best use of these named relationships. Metadata implementation for websites is not always a perfect process, which is why Google maintains the guidance document and forums with FAQs for developers hoping to make their websites discoverable (Follow the structured, 2019). Taken together, the guidance Google provides for website developers aims to supply not only description but also access, through links and citation information, for the datasets that appear in its search results, relying on the structured metadata embedded in web pages. Not only is this search tool now available for use, but the work of the search algorithms in finding the metadata is also being recorded by Google. This metadata will be used to compose what the Google team is calling “an index of enriched metadata” (Burgess & Noy, 2018). This index is what will provide more power and speed to Google Dataset Search as it grows.
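The sitemap mentioned in this section can also be sketched briefly. The Python example below writes a bare-bones sitemap listing dataset landing pages, using the namespace defined by the public sitemaps.org protocol. The function name and the example.org URLs are hypothetical, and real sitemaps often carry additional fields that are omitted here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(landing_pages):
    """Return a minimal sitemap XML string listing the given URLs."""
    # Serialize with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in landing_pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page
    return ET.tostring(urlset, encoding="unicode")


# Invented landing pages for two hypothetical datasets.
pages = [
    "https://example.org/datasets/graduation-rates",
    "https://example.org/datasets/graduation-rates-raw",
]
print(build_sitemap(pages))
```

Listing each dataset’s landing page in one file gives a crawler a single place to learn which pages on a repository are worth visiting for dataset markup.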

How is it different from other Google Data tools?

Google Dataset Search provides a more in-depth search result for a distinct type of resource than other Google tools do. It will find a wide variety of materials, from proprietary formats to spreadsheets and even organized tabular information displayed through HTML. These materials are raw datasets, usually with the format available for download indicated. This is distinct from Google Public Data Explorer or the Google Knowledge Graph. Google Public Data Explorer also leverages metadata to create an interface for searching and exploring data. The main difference is that Google Public Data Explorer requires two specific files in specific formats, namely XML and CSV. The goal of this interface is to create data visualizations, rather than to give access to raw data with citations (“DSPL Developer Guide,” 2015).

The Google Knowledge Graph powers the highlighted infobox that appears in search results, drawing on Google’s retrieved results and metadata indexing to provide answer-like search results. It is also mentioned as a tool for fine-tuning the Google Dataset Search results (Burgess & Noy, 2018). These products are not necessarily similar, but the Google team expects them to interact. This interaction should allow the dataset results to improve rapidly, since the Knowledge Graph has already demonstrated a method for improving the response from the algorithms.

The Google team also draws a connection between Google Scholar and Google Dataset Search. Dataset metadata can be ‘augmented’ with references and citations from Google Scholar to provide both author data and a “signal about the importance and prominence of a dataset” (Burgess & Noy, 2018). The team alludes to stronger connections between the two in the future, although details are not yet available.

An example search

A simple example search for ‘graduation rates’ demonstrates the main features of the user interface. Unlike Google Public Data Explorer or Google Scholar, the beta Dataset Search has no advanced option for starting the query, only a keyword field. It does accept and encourage advanced Google search qualifiers such as ‘site:’. The search bar displays suggested search terms.

Once the results have returned, the number of results is displayed in the top left. On a computer browser screen the results are displayed in two columns (on tablet or mobile, they are displayed in one column). The biggest change from the interface of Google Scholar or the Public Data Explorer is the lack of facets for narrowing a search. Instead, results are displayed with logo images (if available), the URL, and data, if available, for each result. In this example, the top results are from the New York City Open Data initiative, .com, and a U.S. dataset hosted on data.gov.

The interface is also different because it has a more prominent ‘share’ icon on the right-hand side of the dataset information pane. There is also a feedback option, which allows for open-ended feedback and automatically includes a screenshot.

What does all of this mean for me?

This new beta product from Google is not unexpected; it is another example of the changing players in the research landscape. Google alludes to “a better open data ecosystem” (Burgess & Noy, 2018). The team lists several admirable tenets for this, including a culture of citing data and widespread adoption of strong, open metadata standards. There is also a specific reference to ‘serendipitous discovery’ for data users, listing scientists and journalists as target audiences.

The Pros

An easier method for searching research data is an admirable goal. It is important to remember that, as with any of Google’s other products, Google Dataset Search could put pressure on information structures. For dataset searchability, this pressure was sorely needed in some scholarly circles and has been called for on several occasions (Wilkinson, 2010; Piwowar & Vision, 2013). Google cites some of these arguments (Burgess & Noy, 2018). Open metadata standards will not gain momentum without major internet stakeholders like Google on board, and citing datasets as a scholarly product will benefit from the same momentum. The Google team described the metadata in use as a mechanism for positive reinforcement of dataset citation (Noy & Brickley, 2017). This could provide an incentive in the scholarly system to connect citations and publish datasets with individual DOIs. Shifts like these in the use of data have similarly been called for in the changing research landscape (Fenner et al., 2017; Wilkinson, 2010; Piwowar & Vision, 2013).

This tool provides the user experience that Google has a reputation for. A simple, inviting search screen, easily skimmed results, and intuitive links have all contributed to a very popular interface. It has also left a mark on other discovery tools, as we can observe in Elsevier DataSearch. Elsevier DataSearch, also in beta, is another tool aiming to provide a single platform for searching across domains and disciplines for raw datasets (Frequently Asked Questions, n.d.). Elsevier has taken the approach of indexing repositories, but the user experience still mimics the ‘one stop shop’ discovery layer that has become the norm in information seeking behavior.

The Cons

Though the standards in use are openly created by the community, Google Dataset Search is yet another piece of information infrastructure that will be owned by a mega-corporation. This could be innocent, like many other technological infrastructures we have come to rely on in research and scholarship. However, since Google currently collects data about its users and shares it through third-party transactions, it raises the question of what could be done with data obtained from dataset searches and use behavior. As with Google Scholar, there may be real benefits to improved search and discovery for research products. It is also possible that this tool stamps a shape into the information seeking environment just as Google Scholar did (Jacsó, 2005). The search results will be subject to the definitions and expectations of the guidelines, which, although released and open for feedback from developers, are controlled not by a group of peer scholars but by a group of researchers working for a for-profit company. The structures built, open and transparent though they may be, are still susceptible to the same bias and algorithmic challenges demonstrated in the Google Search engine (Noble, 2018).

Conclusion

The future of Google Dataset Search is still unwritten. It may gain a huge following or, like some other Google products, it may not make it past the beta stage. It is exciting to have a major stakeholder like Google promoting dataset citation and easy dataset discovery. This could lead to real innovation in searching, but also to the shaping of information discovery by a corporation. Dataset searching is becoming a more visible piece of the research process, and the benefits of a major player pushing for standards are real. There is a demonstrated need for a single platform that searches across the internet for relevant raw datasets, and such a platform supports datasets as a scholarly output. This tool is definitely a sign of the growing awareness of reproducible, transparent data use. I believe it should also be a signal for the research community to increase education and awareness of datasets as research products and research inputs, and to consciously develop the sharing and discovery of them in a way that complements big platform tools and the interests of the scientific community.

Bibliography:

Burgess, M. & Noy, N. (2018, September 26). Building Google Dataset Search and fostering an open data ecosystem [Web log message]. Retrieved from Google AI Blog, https://ai.googleblog.com/2018/09/building-google-dataset-search-and.html

Dataset. (n.d.). Retrieved February 12, 2019, from https://schema.org/Dataset

DSPL Developer Guide. (2015). Retrieved February 12, 2019, from https://developers.google.com/public-data/docs/developer_guide

Follow the structured data guidelines. (2019). Retrieved February 12, 2019, from https://developers.google.com/search/docs/guides/sd-policies

Frequently Asked Questions. (n.d.). Retrieved February 12, 2019, from https://datasearch.elsevier.com/faq#/

Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. New York: New York University Press.

Noy, N. (2018, September 5). Making it easier to discover datasets [Web log message]. Retrieved from Google AI Blog, https://www.blog.google/products/search/making-it-easier-discover-datasets/

Noy, N. & Brickley, D. (2017, January 24). Facilitating the discovery of public datasets [Web log message]. Retrieved from Google AI Blog, https://ai.googleblog.com/2017/01/facilitating-discovery-of-public.html

Jacsó, P. (2005). Google Scholar: The pros and the cons. Online Information Review, 29(2), 208-214. doi:10.1108/14684520510598066

Piwowar, H. A., & Vision, T. J. (2013). Data reuse and the open data citation advantage. PeerJ, 1, e175. doi:10.7717/peerj.175

Schema.org. (n.d.). Retrieved from https://schema.org/

Ware, M., & Mabe, M. (2015). The STM Report: An overview of scientific and scholarly journal publishing, 181.

Wilkinson, M. (2010). DataCite: The international data citation initiative. Datasets programme. Working Paper Series des Rates für Sozial- und Wirtschaftsdaten, 163.
