Deconstructing Google Dataset Search

May 2019
1 Introduction

A little ripple went through the data services airwaves in fall 2018 when Google announced the beta version of Google Dataset Search. Google Scholar is increasingly successful as a research product, and now the tech giant has stepped forward again to bring its signature style to data. This could be a great boon to data discovery and increase the overall sharing and use of research data. But with that, Google brings its now familiar persuasive model to bear on another piece of digital research. This tool, especially if it follows the Google Scholar trajectory, will certainly make finding datasets easier. But will a corresponding increase in data provenance, sharing, and interoperability follow? Or, as with Google Scholar and publisher paywalls, will this tool's impact be to highlight the painful gaps in data reuse, interoperability, and documentation?

How did Google Dataset Search start?

The beta Google Dataset Search is launching in a rich landscape. It is rich both in content, as use of web-based data repositories grows, and in opportunity, as standards, guidelines, and even publishing requirements dictate ever-changing data sharing norms. As we in libraries have experienced, this shift in the data landscape is concurrent with trends in publishing, reproducible and open science, and the exciting opportunities new technology affords us year after year (Ware & Mabe, 2015). In 2017, Natasha Noy and Dan Brickley posted to Google's AI Blog announcing a new project to make public datasets searchable through Google Search (Noy & Brickley, 2017). Citing the growing need for research data to be made publicly available, the Google team indicated that a search through their famous engine was becoming possible, thanks to the increasing popularity of structured and standardized descriptions of datasets.
The challenge Google's team is setting out to address, per their own blog post, is that these datasets exist in multiple places across the internet. Not only that, but once a dataset is found, the "veracity or provenance of that information" is not always clear (Noy & Brickley, 2017).

How does Google Dataset Search work?

This search tool, like other Google products, uses website providers' embedded descriptions and metadata. The more structured this information is, the better Google's tools can represent the website in search results. To achieve this, Google's guidelines are based on Schema.org (Follow the structured, 2019). The Schema.org vocabulary was initially founded by Google, Yahoo, Microsoft, and Yandex. It is now developed through an open, transparent, community process facilitated by the World Wide Web Consortium (W3C), which actively develops and communicates about the growth of the web, and GitHub, a collaboration platform for web and software developers (Schema.org, n.d.). Schema.org provides commonly used, open, and transparent description standards for website structure and metadata. It is notable that Noy and Brickley, in their blog post, make the connection that Schema.org's work on datasets closely follows W3C's work; they "expect [it] will be a foundation for future elaborations and improvements to dataset description" (Noy & Brickley, 2017).

The Schema.org vocabulary alone did not automatically provide a good way to make datasets machine-findable across the internet, so, as with many other content types, Google published guidelines (Follow the structured, 2019). Alongside markups for job postings, recipes, and local businesses, there are structured data guidelines for datasets in Google's reference for developers. The markup guidelines provide definitions for required and recommended properties, as well as best practices for sitemap, source, and provenance.
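Concretely, this embedded markup is machine-readable JSON-LD (or Microdata/RDFa) sitting in a page's HTML. As a rough sketch of what a crawler extracts (this is an illustration only, not Google's actual pipeline; the sample page below is invented), a small Python parser can pull Schema.org JSON-LD blocks out of a page:

```python
import json
from html.parser import HTMLParser

# Toy extractor for Schema.org JSON-LD blocks embedded in a page's HTML.
class JSONLDExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []  # parsed JSON-LD objects found on the page

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            text = "".join(self._buf).strip()
            if text:
                self.blocks.append(json.loads(text))
            self._buf = []
            self._in_jsonld = False

# An invented landing page that embeds a dataset description.
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example dataset", "description": "A made-up dataset."}
</script>
</head><body>A landing page describing the dataset.</body></html>"""

extractor = JSONLDExtractor()
extractor.feed(page)
datasets = [b for b in extractor.blocks if b.get("@type") == "Dataset"]
print(datasets[0]["name"])  # -> Example dataset
```

The more consistently a repository emits blocks like this, the less guesswork a crawler has to do, which is exactly the point of the guidelines.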
In this guide, Google also recommends using the Structured Data Testing tool to validate your code. This tool examines your website, or any other provided URL, and detects the use of specific data types and the tags that describe the information. This gives site builders automatic feedback about how the Google search engine 'sees' the website, and whether there are any errors or warnings that can be corrected to improve the search.

What is Google Dataset Search doing?

In this tool, Google's machine targets structured data indicating that a dataset is linked or displayed on the webpage. To do so, Google uses the web developer's markup on the webpages, as well as the sitemap files. It also looks for landing pages for the dataset, as seen in dataset repositories. By finding these specific structures of metadata in the webpage's code, Dataset Search knows whether the page contains datasets or links to datasets, whether it should be displayed in the search results, and whether there is any description available.

When scrutinizing the markup on a website, Google Dataset Search uses the vocabulary of metadata properties provided by Schema.org to understand the webpage contents. This vocabulary defines 'Dataset' as "a body of structured information describing some topic(s) of interest," a type within the broader category 'CreativeWork' (Dataset, n.d.). According to the developer guidance, this type then requires two additional properties. The first is "description," defined as "a short summary describing a dataset"; this must be provided as text in the code. The second required property is "name," another text property defined as "a descriptive name of the dataset." The guidance provides an additional fifteen properties that can be included in the website code.
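As a hedged sketch, such a description might be assembled like this. All names and URLs below are invented for illustration; a real page would embed the resulting JSON-LD in a script tag of type application/ld+json:

```python
import json

# Hypothetical Schema.org 'Dataset' description showing the two required
# properties ("name" and "description") plus a few optional ones,
# including the 'sameAs' and 'isBasedOn' relationship properties.
dataset_markup = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    # Required: a descriptive name and a short text summary.
    "name": "City graduation rates, 2010-2018",
    "description": "Annual high-school graduation rates by district.",
    # Optional: deeper information about the dataset.
    "url": "https://example.org/data/graduation-rates",
    "temporalCoverage": "2010/2018",
    # Optional named relationships: the same description repeated
    # elsewhere on the site, and the source dataset this one derives from.
    "sameAs": "https://example.org/mirror/graduation-rates",
    "isBasedOn": "https://example.org/data/graduation-rates-raw",
}

# The JSON-LD string a page would embed for crawlers to find.
json_ld = json.dumps(dataset_markup, indent=2)
print(json_ld)
```

A block missing "name" or "description" would be expected to draw a warning from the validation tooling described above.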
Those additional properties are optional but provide deeper information about the dataset, including a citation, location, geospatial coverage, and temporal coverage.

On the sitemap, Google looks for practices that identify the relationships between links or files on the webpage. For example, it will seek named relationships between pages on the website using the 'sameAs' or 'isBasedOn' properties. These properties indicate whether the description of the data is repeated elsewhere on a site, and whether there are any derivatives or later iterations of the dataset. The guidance documentation provided by Google explains the best use of these named relationships. Metadata implementation for websites is not always a perfect process, which is why Google maintains the guidance document and forums with FAQs for developers hoping to make their websites discoverable (Follow the structured, 2019). Taken together, the guidance Google provides for website developers aims to deliver not only description but also access, through links and citation information for the datasets that appear in search results, relying on the structured metadata embedded in web pages.

Not only is this search tool now available for use, but the metadata the search algorithms find is also being recorded by Google. This metadata will be used to compose what the Google team calls "an index of enriched metadata" (Burgess & Noy, 2018). This index is what will give Google Dataset Search more power and speed as it grows.

How is it different from other Google Data tools?

Compared to other tools, Google Dataset Search provides a more in-depth search result for a distinct type of resource. It finds a wide variety of materials, from proprietary formats, to spreadsheets, even to organized tabular information displayed through HTML. These materials are raw datasets, usually with the format type available for download.
This is distinct from Google Public Data Explorer or the Google Knowledge Graph. Google Public Data Explorer also leverages metadata to create an interface for searching and exploring data. The main difference is that Google Public Data Explorer requires two specific files in specific formats, namely XML and CSV. The goal of that interface is to create data visualizations, rather than to provide access to raw data with citations ("DSPL Developer Guide," 2015).

The Google Knowledge Graph is the highlighted infobox that appears in search results; it runs off Google's retrieved results and metadata indexing to provide answer-like search results. It is also mentioned as a tool for fine-tuning Google Dataset Search results (Burgess & Noy, 2018). These products are not necessarily similar, but the Google team expects interaction between them. This interaction should allow the dataset results to improve rapidly, since the Knowledge Graph has already demonstrated a method for improving the response from the algorithms.

The Google team also draws a connection between Google Scholar and Google Dataset Search. Dataset metadata can be 'augmented' through Google Scholar references and citations to provide both author data and a "signal about the importance and prominence of a dataset" (Burgess & Noy, 2018). The team alludes to stronger connections between the two in the future, although details are not yet available.

An example search

A simple example search for 'graduation rates' demonstrates the main features of the user interface. Unlike Google Public Data Explorer or Google Scholar, the beta Dataset Search has no advanced option for starting the query, only a keyword field. It does accept, and encourage, advanced Google search qualifiers such as 'site:'. The search bar displays suggested search terms. Once the results have returned, the number of results is displayed in the top left. On a computer browser screen the results are displayed in two columns (on a tablet or mobile device, in one column).
The biggest change from the interface of Google Scholar or the Public Data Explorer is the lack of facets for narrowing the search. Instead, results are displayed with a logo image (if available), URL, and date (if available). In this example, the top results are from the New York City Open Data initiative, Kaggle.com, and a U.S. dataset hosted on data.gov. The interface also differs in having a more prominent 'share' icon on the right-hand side of the dataset information pane.