University of Southampton Research Repository ePrints Soton

Copyright © and Moral Rights for this thesis are retained by the author and/or other copyright owners. A copy can be downloaded for personal non-commercial research or study, without prior permission or charge. This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the copyright holder/s. The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the copyright holders.

When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given e.g.

Koesten, L. (2019) "A User Centred Perspective on Structured Data Discovery", University of Southampton, Faculty of Engineering and Physical Sciences, PhD Thesis, [pagination].

http://eprints.soton.ac.uk

UNIVERSITY OF SOUTHAMPTON

A User Centred Perspective on Structured Data Discovery

by Laura Koesten

A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy

in the Faculty of Engineering, Science and Mathematics
Electronics and Computer Science

November 2019

UNIVERSITY OF SOUTHAMPTON

ABSTRACT

FACULTY OF ENGINEERING, SCIENCE AND MATHEMATICS
SCHOOL OF ELECTRONICS AND COMPUTER SCIENCE

Doctor of Philosophy

by Laura Koesten

Structured data is becoming critical in every domain and its availability on the web is increasing rapidly. Despite its abundance and variety of applications, we know very little about how people find data, understand it, and put it to use.

This work aims to inform the design of data discovery tools and technologies from a user centred perspective, seeking to better understand how we can support people in finding and selecting data that is useful for their tasks. We approached this by advancing our understanding of user behaviour in structured-data discovery through a mixed-methods study looking at the workflow of data practitioners when searching for data.

From this we present a framework for structured data interaction describing data-centric tasks and search strategies, as well as an in-depth characterisation of selection criteria in data search. We identified textual summaries as a key element supporting the decision making process in information seeking activities for data.

Based on these results we conducted a mixed-methods study to identify attributes that people consider important when describing a dataset. This enabled us to better define criteria for textual summaries of datasets for human consumption. We designed a set of template questions to help guide the summary writing process and conducted an online study to validate the applicability of dataset summaries in a dataset selection scenario.

The findings of this work revealed unique interaction characteristics in information seeking for structured data. Our contributions can inform the design of data discovery tools, support the assessment of datasets and help make the exploration of structured data easier for a wide range of users.

Contents

Acknowledgements xiii

1 Introduction 1
  1.1 Motivation ...... 2
  1.2 Approach ...... 5
    1.2.1 Study 1: Working with structured data ...... 6
      1.2.1.1 Methods ...... 6
      1.2.1.2 Outcomes ...... 6
    1.2.2 Study 2 + 3: Dataset summaries ...... 7
      1.2.2.1 Methods ...... 7
      1.2.2.2 Outcomes ...... 7
    1.2.3 Study 4 ...... 8
      1.2.3.1 Methods ...... 8
      1.2.3.2 Outcomes ...... 8
  1.3 Contributions ...... 8
    1.3.1 Publications ...... 10
  1.4 Thesis structure ...... 11
  1.5 Terminology ...... 11

2 Background 13
  2.1 Data search currently ...... 13
    2.1.1 Example of a data search process ...... 14
    2.1.2 Data search in contrast to web search ...... 16
    2.1.3 Dataset search versus database search ...... 19
  2.2 Information seeking ...... 20
    2.2.1 Information seeking models and interactive information retrieval ...... 20
  2.3 The user in data search ...... 23
    2.3.1 Search types: simple versus exploratory search tasks ...... 23
    2.3.2 Data-centric information seeking tasks ...... 25
    2.3.3 Sensemaking with structured data ...... 26
    2.3.4 Context ...... 27
    2.3.5 Selection criteria ...... 29
  2.4 The system and the publisher in data search ...... 31
    2.4.1 User interfaces for data search and exploration ...... 31
      2.4.1.1 Specialised data interfaces ...... 31
    2.4.2 Overviews of datasets ...... 35
      2.4.2.1 Search Results display ...... 35
      2.4.2.2 Textual dataset summaries ...... 36
    2.4.3 Metadata ...... 36

3 Working with structured data 39
  3.1 Motivation ...... 39
  3.2 Methodology ...... 40
    3.2.1 In-depth interviews ...... 41
      3.2.1.1 Recruitment ...... 41
      3.2.1.2 Description of participants ...... 41
      3.2.1.3 Data collection and analysis ...... 43
    3.2.2 Search logs ...... 43
    3.2.3 Ethics ...... 45
  3.3 Findings ...... 45
    3.3.1 Data-centric tasks ...... 45
      3.3.1.1 Taxonomy of data-centric tasks ...... 45
      3.3.1.2 Complex tasks ...... 47
      3.3.1.3 The impact of data quality on task outcomes ...... 47
    3.3.2 Search for structured data ...... 48
      3.3.2.1 Searching on the web ...... 49
      3.3.2.2 Searching on portals ...... 49
      3.3.2.3 Human recommendation and FOI requests ...... 51
    3.3.3 Evaluation and exploration ...... 52
  3.4 Discussion of findings ...... 56
    3.4.1 Framework for Human Structured-Data Interaction ...... 56
    3.4.2 Design recommendations ...... 58
  3.5 Limitations ...... 60
  3.6 Summary ...... 60

4 Dataset summaries 63
  4.1 Motivation ...... 63
  4.2 Related Work ...... 66
    4.2.1 Human-generated summaries of datasets ...... 66
    4.2.2 Automatic summary generation ...... 67
  4.3 Methodology ...... 68
    4.3.1 Data-search diaries ...... 70
    4.3.2 Dataset summaries ...... 71
      4.3.2.1 Datasets: the Set 5 and Set 20 corpora ...... 71
      4.3.2.2 Dataset summaries: Lab-based experiment ...... 72
      4.3.2.3 Dataset summaries: Crowdsourcing experiment ...... 75
  4.4 Findings: Data-search diaries ...... 77
    4.4.1 Relevance ...... 78
    4.4.2 Usability ...... 79
    4.4.3 Quality ...... 80
  4.5 Findings: Dataset summaries ...... 80
    4.5.1 Form and length ...... 80
    4.5.2 Information types ...... 81
    4.5.3 Summary attributes ...... 83
      4.5.3.1 Summary attributes in detail ...... 89
  4.6 Discussion of findings ...... 95
    4.6.1 Summary attributes ...... 95
    4.6.2 Comparison to metadata standards and data search diaries ...... 97
    4.6.3 Making better summaries ...... 99
  4.7 Dataset summary template ...... 100
    4.7.1 From summaries to metadata ...... 102
    4.7.2 Summary tool ...... 103
  4.8 Limitations ...... 105
  4.9 Summary ...... 107

5 Summary evaluation 109
  5.1 Motivation ...... 109
  5.2 Related work ...... 110
    5.2.1 Snippets ...... 110
    5.2.2 SERP design ...... 111
  5.3 Methodology ...... 112
    5.3.1 Tasks ...... 114
    5.3.2 Process ...... 114
    5.3.3 Corpus ...... 117
    5.3.4 Metrics ...... 118
    5.3.5 Ethics ...... 119
  5.4 Findings ...... 119
    5.4.1 Participants ...... 119
    5.4.2 Time-based results ...... 120
    5.4.3 Interaction and performance based results ...... 121
    5.4.4 Interaction based results ...... 124
    5.4.5 Experience based results ...... 126
    5.4.6 Qualitative findings ...... 129
  5.5 Discussion of findings ...... 132
  5.6 Limitations ...... 135
  5.7 Summary ...... 136

6 Discussion 137
  6.1 Contributions ...... 138
    6.1.1 Framework for Human Structured Data Interaction ...... 138
      6.1.1.1 Taxonomy of data-centric work tasks ...... 139
      6.1.1.2 Better understanding of search strategies in data search ...... 140
      6.1.1.3 In-depth knowledge on selection criteria in data search ...... 140
      6.1.1.4 Analysis of exploration activities for datasets ...... 141
    6.1.2 Composition of dataset summaries created by different types of participants (data literate lab participants versus crowd workers) ...... 142
    6.1.3 Insights into SERP design and interactive IR experimentation for data search ...... 142
  6.2 Future work ...... 143
    6.2.1 Data-centric tasks ...... 144
    6.2.2 Summary creation ...... 145
    6.2.3 Interfaces for dataset search ...... 146
    6.2.4 Benchmarks ...... 150
    6.2.5 Conversational data search and future data interaction paradigms ...... 150

7 Conclusions 153
  7.1 Summary of contributions ...... 153

A Appendix Chapter 3 155
  A.1 Risk assessment form ...... 155
  A.2 Participant information sheet ...... 157
  A.3 Scoping survey ...... 158
  A.4 Interview schedule ...... 160

B Appendix Chapter 4 163
  B.1 Risk assessment form - lab experiment ...... 163
  B.2 Risk assessment form - crowdsourcing experiment ...... 164
  B.3 Participant information sheet - lab experiment ...... 166
  B.4 Datasets in Set 5 ...... 168
  B.5 Crowdsourcing Task ...... 169
  B.6 Datasets in Set 20 ...... 171
  B.7 Dataset summary tool ...... 173
    B.7.1 Requirements summary tool ...... 176

C Appendix Chapter 5 177
  C.1 Risk assessment form ...... 177
  C.2 Recruitment email ...... 177
  C.3 Participant information and consent ...... 178
  C.4 Interface conditions ...... 179
  C.5 Post-task questions ...... 181
  C.6 Participant demographics per country ...... 182
  C.7 Searching for information ...... 182
  C.8 Searching for data ...... 183

Bibliography 185

List of Figures

1.1 Simplified linear representation of the five pillars of the workflow with data. 6

2.1 Data search interface on a data portal. Users can type keywords in the search field or browse the data collection by domain and other attributes. ...... 14
2.2 Example of a data search result list on the UK governmental open data portal. Below each search result is some metadata and a snippet describing the underlying dataset. ...... 15
2.3 Example of a data search result list on Google. Underneath each search result is a snippet describing the underlying website which may or may not contain data. ...... 16
2.4 Simplified information seeking process for data and influencing factors. ...... 23
2.5 Search process on the UK governmental data portal ...... 32

3.1 Framework for interacting with structured data ...... 57

4.1 Example of a data search result list on the UK governmental open data portal. Next to title, publisher, domain and format, we see a textual description of the dataset. ...... 65
4.2 A dataset preview page on the UK governmental open data portal. ...... 65
4.3 Overview of research methods and outcomes ...... 69
4.4 Information types (1) and emerging summary attributes (2) from the thematic analysis of the lab summaries, reflecting our coding process ...... 75
4.5 CrowdFlower task instructions in the crowdsourcing experiment ...... 77
4.6 Example of an annotated dataset summary ...... 96
4.7 Dataset summary tool: Landing page ...... 104
4.8 Dataset summary tool: Questionnaire (1/2) ...... 104
4.9 Dataset summary tool: Questionnaire (2/2) ...... 105

5.1 Viewtype 0 - text ...... 113
5.2 Viewtype 1 - table ...... 113
5.3 Searchbox ...... 115
5.4 Viewtype 1 (table) of a relevant dataset for Task 1 ...... 116
5.5 Confidence rating ...... 117
5.6 Total time grouped per viewtype ...... 121
5.7 Total number of datasets selected per viewtype ...... 122
5.8 True positives selected per viewtype ...... 123
5.9 False positives selected per viewtype ...... 123
5.10 Average confidence ratings per viewtype across tasks ...... 125
5.11 Median confidence ratings per viewtype across tasks ...... 125
5.12 Average confidence ratings per viewtype (task 1) ...... 125
5.13 Median confidence ratings per viewtype (task 1) ...... 125
5.14 Average confidence ratings per viewtype (task 2) ...... 126
5.15 Median confidence ratings per viewtype (task 2) ...... 126
5.16 Average confidence ratings per viewtype (task 3) ...... 126
5.17 Median confidence ratings per viewtype (task 3) ...... 126
5.18 Perceived difficulty, captured on a 5-point Likert scale ...... 128
5.19 Amount of information, captured on a 5-point Likert scale ...... 128

6.1 Framework for interacting with structured data ...... 139
6.2 Google dataset search: SERP ...... 147
6.3 Elsevier DataSearch: SERP + expanded preview for one dataset ...... 148

B.1 Refugee movements ...... 168
B.2 Marvel comic characters ...... 168
B.3 Swineflu deaths ...... 168
B.4 Police killings ...... 168
B.5 Earthquakes ...... 168
B.6 Task 1 (1/2) ...... 169
B.7 Task 1 (2/2) ...... 170
B.8 Title ...... 173
B.9 About ...... 173
B.10 Format ...... 173
B.11 Header ...... 173
B.12 Provenance ...... 174
B.13 Time ...... 174
B.14 Location ...... 174
B.15 Quality ...... 174
B.16 Analysis ...... 175
B.17 Requirements for summary tool based on the template in Chapter 4 ...... 176

C.1 Viewtype 0 - text ...... 179
C.2 Viewtype 1 - table ...... 179
C.3 Viewtype 2 - title ...... 180
C.4 Viewtype 3 - preview ...... 180
C.5 Post-task questions ...... 181
C.6 Frequency of information search by participants ...... 182
C.7 Frequency of data search by participants ...... 183

List of Tables

3.1 Description of participants (P) with gender (G), their profession (Role) and sector they are working in (Sector) ...... 42
3.2 Data-centric tasks as reported by the participants in this study ...... 46
3.3 Information needs when selecting datasets ...... 52
3.4 First activities when engaging with a new dataset, as reported by participants ...... 54

4.1 Datasets in Set 5 ...... 72
4.2 Findings on selection criteria for datasets, based on thematic analysis of the data-search diaries. Prevalence can be seen as indicative, but needs further validation. ...... 78
4.3 Percentages of information types per dataset in Set 5, based on 150 lab summaries ...... 83
4.4 Most frequent summary attributes, based on 360 summaries of datasets from Set 5 and Set 20. ...... 83
4.5 Percentages of summaries created in the lab (L) and via crowdsourcing (C) that mention summary attributes. Darker fields have higher percentages. Numbers in brackets (N= ) refer to the number of summaries analysed in each category. (IT = higher level Information Types, as presented in Section 4.5.2). ...... 85
4.6 Percentage of lab summaries containing respective attributes, per dataset from Set 5 (N=150). Darker fields have higher percentages. ...... 86
4.7 Percentage of crowdsourced summaries from Set 5 (N=120) containing respective attributes, per dataset. Darker fields have higher percentages. ...... 86
4.8 Comparison of summary attributes to data-search diary and metadata standards (as per 5/2019). Summary = results from this study; Diary = Analysis of selection criteria in a data-search diary; (S) = Schema.org, (D) = DCAT (Data Catalog Vocabulary) – Attributes ‘description’ excluded ...... 98
4.9 Dataset summary template ...... 101

5.1 Interface conditions ...... 112
5.2 Information seeking task for data ...... 114
5.3 Total time across all tasks per viewtype in seconds ...... 120
5.4 Selection of search results (mean). T1 = Task 1 ‘gender’, T2 = Task 2 ‘crime’, T3 = Task 3 ‘obesity’ ...... 122
5.5 True positives - selected relevant results ...... 122
5.6 False positives - selected, but not relevant results ...... 124
5.7 Confidence ratings across tasks based on a 5-point Likert scale ...... 126


5.8 Perceived difficulty based on a 5-point Likert scale ...... 127
5.9 Amount of information across the four conditions based on a 5-point Likert scale ...... 129

C.1 Participants' countries of residence, self reported (countries abbreviated using ISO 3166-1, Alpha-2 code) ...... 182

Research Thesis: Declaration of Authorship

Print name: Laura Koesten

Title of thesis: A User Centred Perspective on Structured Data Discovery

I declare that this thesis and the work presented in it is my own and has been generated by me as the result of my own original research.

I confirm that:

1. This work was done wholly or mainly while in candidature for a research degree at this University;
2. Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated;
3. Where I have consulted the published work of others, this is always clearly attributed;
4. Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work;
5. I have acknowledged all main sources of help;
6. Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself;
7. Parts of this work have been published as:

• The Trials and Tribulations of Working with Structured Data – a Study on Information Seeking Behaviour. Laura Koesten, Emilia Kacprzak, Jeni Tennison, Elena Simperl. Proceedings of ACM CHI Conference on Human Factors in Computing Systems, CHI 2017.
• Everything You Always Wanted to Know about a Dataset: Studies in Data Summarisation. Laura Koesten, Elena Simperl, Emilia Kacprzak, Tom Blount, Jeni Tennison. International Journal of Human-Computer Studies, 2019.
• A User Centred Perspective on Structured Data Discovery. Laura Koesten. The Web Conference 2018, PhD Symposium.
• Searching Data Portals – More Complex Than We Thought? Laura M Koesten, Jaspreet Singh. Proceedings of ACM SIGIR Conference on Human Information Interaction and Retrieval, CHIIR 2017. Workshop: Supporting Complex Search Tasks.
• Characterising dataset search – An analysis of search logs and data requests. Emilia Kacprzak, Laura Koesten, Luis-Daniel Ibáñez, Tom Blount, Jeni Tennison, Elena Simperl. Journal of Web Semantics.
• A Query Log Analysis of Dataset Search. Emilia Kacprzak, Laura Koesten, Luis-Daniel Ibáñez, Elena Simperl, Jeni Tennison. Proceedings of the International Conference on Web Engineering, ICWE 2017.
• Learning when searching for web data. Laura Koesten, Emilia Kacprzak, Jeni Tennison. Proceedings of ACM SIGIR Conference on Human Information Interaction and Retrieval, SIGIR 2016. Workshop: Search as Learning (SAL).

Signature:

Date:

Acknowledgements

I want to thank all the amazing people that I was lucky to meet during this PhD. They were kind, they believed in me and they gave me chances I am very grateful for.

To name a few:

Tom Heath, who convinced me to do a PhD and did not fire me when I repeatedly asked him to.

Jeni Tennison, for her wisdom and patience and for being there when I needed her. She welcomed me to the ODI with open arms, showed me that you can be yourself and successful at the same time, and that all lists need to correspond to Fibonacci numbers.

Of course, Elena Simperl, without whom none of this would have been possible. She became my supervisor and role model, and over time also a colleague, friend and boss. But more importantly, she became somebody I truly admire.

My family for their support and acceptance of not just changing career, but also leaving the country and doing things that might not have always made sense from a distance.

My dear friends (at home, as well as those I met during this journey), without whom I could not have finished this. They were patient, looked out for me and reminded me of what’s really important when I needed it.

Also a massive thanks to the participants of the experiments and for the reviews we received over time.


Chapter 1

Introduction

With the rise of data science, millions of datasets have been published on the web, in institutional repositories, online marketplaces, on social networks or by individual publishers; in sectors from science and finance to marketing and government, amongst others (Noy et al., 2019; Verhulst and Young, 2016).

For instance, by 2013 one million datasets were already made available on governmental data portals worldwide, many with an open license (Cattaneo et al., 2015). In Europe there are more than 940,000 datasets published by regional and national authorities in over 35 EU countries, indexed in September 2019 by the European Data Portal1. Datasets are increasingly available in scientific repositories (Kindling et al., 2017) or in specialised repositories for particular types of data. For instance, the site Data Planet2 lists to date no less than 6.2 billion statistical datasets, many of which are public. Geoportals are increasingly used to support national spatial data infrastructure (Maguire and Longley, 2005). These numbers are indicative of a general trend that can be observed in many domains. At the same time, the demand for and spending on financial, economic, and marketing data provided by vendors such as Bloomberg or Thomson Reuters continues to increase (Burton-Taylor, 2015), and, last but not least, there are large numbers of datasets shared within organisations (e.g. Halevy et al., 2016).

Despite its abundance and applications, there is a gap between available datasets, the datasets users need, and the datasets a user can find, understand and is able to use (Gregory et al., 2019a,b; Koesten et al., 2017; Swamiraj and Freund, 2015). When narrowing down the requirements for using web data from a user perspective, the first barrier is identifying whether relevant data exists, and if so, where to find it. This work is therefore concerned with how people discover and understand data, be that entire datasets or individual data points within a dataset. While the focus is on the discovery process, we aim to consider the context in which data search takes place, which can determine the success of a search. For instance, once an adequate dataset is found, it needs to be accessed in order to be used. Access itself can be a question of the provision of certain formats, of an understanding of the connected licences and of the skill set of the person trying to access it. Only once a dataset is found and accessible can people start to think about how to find meaning in the data itself. The usefulness of the dataset depends on the person's specific information need, alongside other factors that we discuss in this work.

1 https://www.europeandataportal.eu/data/en/dataset
2 https://www.data-planet.com/

Given the variety of applications, we still know very little about how people find data, understand it, and put it to use. Research on dataset search is a relatively new area and efforts to this day have focused primarily on technical processes, with the development of standards (e.g. CSV on the Web3), community guidelines (e.g. Share-PSI4) and domain-specific data search engines (e.g. in scientific data repositories such as Elsevier's5). Google has recently released dataset search in beta6, using its schema.org markup language to index datasets alongside text documents, images and products7. This is a step towards overcoming the siloed nature of domain-specific data portals and making datasets on the web easier to discover, which underlines the timeliness of this work.

3 https://www.w3.org/TR/tabular-data-primer/
4 https://www.w3.org/2013/share-psi
5 https://datasearch.elsevier.com/
6 https://toolbox.google.com/datasetsearch
7 http://schema.org/Dataset
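To make this concrete, the sketch below builds a minimal schema.org/Dataset description serialised as JSON-LD, of the kind publishers embed in a dataset's landing page so that dataset-aware crawlers can index it. The property names follow schema.org; the dataset, values and URLs are hypothetical.

```python
import json

# Minimal, illustrative schema.org/Dataset markup (JSON-LD). The concrete
# dataset, values and URLs are made up; only the property names follow the
# schema.org vocabulary.
dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Road traffic counts, London boroughs, 2018",
    "description": "Annual average daily traffic counts per borough, "
                   "published by a (hypothetical) local authority.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "temporalCoverage": "2018-01-01/2018-12-31",
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/data/traffic-counts-2018.csv",
        }
    ],
}

# Embedded in a <script type="application/ld+json"> element on the dataset's
# landing page, this is what a dataset-aware crawler would pick up.
print(json.dumps(dataset_markup, indent=2))
```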

1.1 Motivation

As the use of data-driven technology grows, there has been a huge surge in available data that can potentially be reused by others (Cattaneo et al., 2015; Manyika et al., 2013). Structured and semi-structured data in particular, which refers to data that is organised explicitly, for example as spreadsheets, web tables, databases and maps, has become critical in most domains (Manyika et al., 2013). We use such data to improve services, design public policies, generate business value, advance science, and make more informed decisions (Lavalle et al., 2011; Verhulst and Young, 2016).

However, no matter how impressive these volumes might be, they tell only half of the story. The other half is that having data available does not always mean that the data is easy to discover or that it can be used purposefully (Erete et al., 2016).

To illustrate this we present an example of the complexity of information seeking tasks for data, informed by tasks as described by participants of the study presented in Chapter 3: Imagine a data journalist writing an article about the runway expansion at London's main airports in the UK. As part of her research, the journalist will look for factual evidence to substantiate her story, in the form of reports, news on similar topics, as well as data about the economic, social, and environmental ramifications of the project, arguing for or against expansion plans. A large share of the relevant data is already available online, published by governmental agencies, researchers, and other journalists. However, finding it is not always straightforward. The journalist could use regular search engines, fact checking services8, and other channels in the same way she does when looking for less structured kinds of information. She might also know of the existence of a particular data catalogue, which offers access to collections of data resources released by one or several organisations, or she might try Google's dataset search. She might even query the API (Application Programming Interface) of a trusted data provider, or crawl the web looking for bespoke data snippets. Once she has identified several matches, her next step would be to explore the most promising among them and rank them according to how relevant they are to the narrative of her article. Depending on the tools used to discover the data, this step might involve downloading data files; working with different formats (e.g., CSV, XML, HTML, RDF, relational tables); choosing between several versions of a dataset; alongside sensemaking tasks such as establishing what exactly a dataset covers (its ‘attributes’ or ‘schema’), and how accurate, complete, or up-to-date the data is. This example illustrates the unique characteristics of searching for structured data as opposed to searching for textual documents from a user perspective.
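As a sketch of the exploration step described above, the snippet below shows a typical first pass over a downloaded file with pandas: checking the schema, the share of missing values and how up to date the data is. The file name and column name are hypothetical; the checks mirror the sensemaking tasks listed in the example.

```python
import pandas as pd

# First-pass inspection of a downloaded candidate dataset (file and column
# names are hypothetical placeholders, not a real published dataset).
df = pd.read_csv("airport_expansion_noise_levels.csv")

print(df.columns.tolist())        # what the dataset covers: its 'attributes' or 'schema'
print(df.shape)                   # number of records and fields
print(df.isna().mean().round(3))  # completeness: share of missing values per column

# Recency: if there is a date column, check the period the data covers.
if "measurement_date" in df.columns:
    dates = pd.to_datetime(df["measurement_date"], errors="coerce")
    print("covers", dates.min(), "to", dates.max())
```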

We aim to better understand scenarios like this one in order to inform the design of data tools and technologies that offer the same level of support and quality of user experience that we are used to from the web. Other examples of stories including data-centric tasks can be found, for instance, on the Data Blog of the Guardian newspaper9.

While structured data is becoming critical across domains and many professional roles (Mayer-Schönberger and Cukier, 2013), the number of people engaging with data professionally is growing as well. In 2014, more than six million data professionals were spread across all industries in the EU and this number is increasing (Cattaneo et al., 2015). These are people who collect, store, manage and analyse data as their primary activity in their job. They come from various backgrounds and use a mix of skills for data-based decision making and to build services on top of data. We use the term data professional in a slightly broader fashion – to refer to any person who uses data in their job, even if it is not their primary activity. As more data becomes available, searching for data becomes more common, including among audiences that are less data literate.

As data is used by people with different skill sets, across domains and in different roles, this results in diverse needs for the tools that support them in working with data (Choi and Tausczik, 2017; Erete et al., 2016; Muller et al., 2019). We use structured data in various ways – from consulting official statistics to running scientific experiments, finding travel routes, creating maps, predicting elections, or designing better products. There is a large spectrum of data-centric tasks, and the more data becomes available, the more likely it is that tasks and users will become more varied.

8 http://factcheckeu.org or https://fullfact.org
9 https://www.theguardian.com/data

Understanding user needs for data discovery and exploration tools is the basis for designing them. We know little about which information to display for which task and how to best support the discovery and selection processes. Making sense of data is dependent on the ability to put pieces of information into context and to understand connections between them (Albers, 2015).

This work aims to understand how the discovery of new data sources can be improved by supporting users in assessing their fitness for use when selecting data sources from a pool of search results. This can be on the web, or equally in an enterprise setting. The findings can be used to inform the design of data discovery tools to make searching for structured data easier for a wider range of users. We focus on structured data, which is typically grouped into datasets, many of which are published on the web (for instance in data catalogues available on, or referenced from, data portals).

Searching for datasets can be difficult as the structure of data catalogues varies. Moreover, search is dependent on inconsistent metadata, and searching within the data itself is rarely possible. The concept of a dataset has slightly different definitions depending on domains and communities (Pasquetto et al., 2017). In this work, a ‘dataset’ refers to structured or semi-structured information collected by an individual or organisation, which is distributed in a standard format, for instance as CSV files. In the context of search, it refers to the information object (a dataset) returned by a search algorithm in response to a user's search query.

Selecting data from a pool of results after a search is dependent on the information exposed to the user and influenced by the design of the interface. Some information about the data is explicitly available as metadata; some is implicitly available through the context the data is situated in, for instance the publishing institution or what the data has been used for in the past. These are examples of information that already exists and could be used to aid result selection. We want to understand what information we should present together with search results and how that will support discovery and selection, as well as understanding the content of datasets. We approach this by learning from related areas of research such as general web search and Interactive Information Retrieval, as well as Information Seeking theories and Sensemaking. We also take a look at existing approaches for data search and exploration interfaces, although the majority of these focus on supporting exploration and collaboration with data rather than on a retrieval context.

The term Human Data Interaction (HDI) has been recently introduced by Crabtree and Mortier (2015) to refer to how people engage specifically with personal data and tackle privacy issues. Elmqvist (2011) proposed a broader definition that includes the manipulation, analysis, and sensemaking of data. In our work, we consider the whole interaction process and its context. This interaction can cover information needs for data in which people are looking for an answer to a question (which can be a single data point) and those in which they are interested in an entire dataset or multiple datasets. In both cases we assume the user intends to find a dataset. For example, someone could be trying to find out the number of schools in a given post-code and would need to extract that piece of information from a larger dataset that contains all entries for all schools in a country in, say, 2017. Alternatively, one could be studying how the number of schools across different areas has changed over time, which would involve processing and aggregating several versions of the same dataset, published year after year. Both types of tasks have an element of data search, evaluation, and exploration of data sources, taking into account different data properties; our work considers them equally. The findings of this work can advance the field of Human Data Interaction by identifying areas for research and further improvement, in particular around a more rigorous evaluation of the user experience in data search.
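A minimal pandas sketch of these two information needs is given below, assuming hypothetical yearly extracts of a national schools register; the file names, column names and post-code district are illustrative only.

```python
import pandas as pd

# Hypothetical yearly extracts of a schools register, one CSV per year.
schools_2017 = pd.read_csv("schools_2017.csv")

# 1) A single data point: how many schools are in one post-code district in 2017?
in_district = schools_2017["postcode"].str.startswith("SO17")
print("Schools in SO17 (2017):", int(in_district.sum()))

# 2) A trend: aggregate several versions of the same dataset, published year after year.
counts_per_year = {
    year: len(pd.read_csv(f"schools_{year}.csv"))
    for year in range(2014, 2018)
}
print(pd.Series(counts_per_year, name="number_of_schools"))
```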

1.2 Approach

This work aims to increase our understanding of the following high-level question:

How can we help people select data that is useful for their task?

We define usefulness in Section 2.3.5 of Chapter 2. The literature describes user centric models for information seeking in web search, which are discussed in more detail in Chapter 2. However, less research can be found which focuses on data as the information source and on the differences in the information seeking process for data as opposed to web or document search. Pfister and Blitzstein (2015) present the data science process, which is centred around activities to collect relevant data, explore it to make sense of it, and finally build an analysis model to draw conclusions from it. Based on these models we discuss the process of working with data from a user perspective with five pillars: tasks, search, evaluate, explore and use.

In pillar one (tasks) a user defines the task according to a goal. In pillar two (search) the search process is started; in pillar three (evaluate) the sources of data returned as results are evaluated and selected; and in pillar four (explore) data sources are downloaded and opened, and a user tries to understand the data and its context and do some exploratory data analysis, in order to then be able to perform the task intended in the beginning (pillar five, use). This model was used as the basis for the initial study reported in Chapter 3. The assumption is that this process is not a linear one and that it involves multiple iterations and backwards movements between pillars for many data-centric tasks.

Figure 1.1: Simplified linear representation of the five pillars of the workflow with data.

1.2.1 Study 1: Working with structured data

We defined a number of sub-questions in our initial study, which is described in detail in Chapter 3:

• How do people currently search for data?
• What are the characteristics of information seeking tasks for data?
• How do people evaluate data that they find?
• What types of work tasks do people do with data?
• How do people explore data that they have found?

1.2.1.1 Methods

After an extensive literature review we conducted a mixed-methods study, informed by Bryman (2006), for the purpose of this work. This initial study aimed to shed light on how data practitioners look for data online, with a focus on a qualitative component using in-depth interviews with twenty data professionals from various backgrounds. To supplement the in-depth interviews, we analysed a unique dataset of search logs of a large open government data portal. This gave us a less obtrusive way to learn about the behaviour of data search users (Jansen, 2006), of which our interviewees were a subset (17 out of 20 participants mentioned they used this portal to search for datasets).

1.2.1.2 Outcomes

Key findings from the initial study showed that search is a major issue for data professionals in their work with data and that searching for data is more often than not exploratory and complex. We found that finding and selecting a dataset can be difficult for people, as they generally do not think they have enough information about the content of a dataset to make an informed decision. We further found that selection criteria in dataset search have unique characteristics; for instance, the underlying methodology of data collection is of high importance when selecting a dataset. We also found that concepts such as data quality or usability are important, but that their definition is complex and their role in selecting a dataset is inherently task dependent. We discuss the complexity of data-centric search tasks, how they differ from search tasks for documents, and how commonly used interfaces on governmental data portals support users in the selection process of data in Chapter 2. Our findings furthermore showed that the majority of textual summaries of data are perceived to be of low quality and limited usefulness.

1.2.2 Study 2 + 3: Dataset summaries

Based on the findings from Study 1 we conducted two studies which examine textual summaries of datasets, which are a commonly used element in data search scenarios and help people to understand the content of a dataset. This mixed-methods approach is described in detail in Chapter 4.

The main research question addressed in this study was:

• What data attributes do people choose to mention when summarising a dataset to others?

1.2.2.1 Methods

For Study 2 we conducted a task-based lab experiment in which 30 participants described and summarised datasets in a writing task. Subsequently, for Study 3, we conducted a crowdsourcing study in which we replicated the lab experiment with a larger variety of datasets, and asked crowdworkers to rate the dataset summaries according to perceived quality. This allowed us to get a better understanding of the influence of the underlying dataset, and of differences in participants and settings, on the resulting summary. We collected 150 long and 150 short summaries from the lab experiment and 250 crowdsourced summaries, and analysed these qualitatively and quantitatively.

1.2.2.2 Outcomes

The findings of this study have shown that textual summaries were laid out according to common structures; they contain four main information types and cover a set of dataset features. By identifying attributes that people consider important when they are describing a dataset we get insights into what a meaningful summary should contain. The most prevalent attributes across all summaries were: a high-level, one-phrase summary, describing the topic of the dataset (subtitle), explicit references to dataset headers, and the geographical scope of the data at different levels of detail.

Resulting from this study and from an additional analysis of dataset search diaries that gave insights into selection criteria in dataset search, we created a template for the creation of dataset summaries that can be used as guidance in the summary writing process.

1.2.3 Study 4

We then tested the effect of dataset summaries in a search scenario to better understand how to support people in a dataset selection process (described in Chapter 5).

The research questions addressed in this study were:

• Does the dataset presentation mode (text summary, table summary, preview, title) affect search time, performance, user behaviour and experience?

• Does the presentation mode of a dataset summary (as text or as a table) affect search time, performance, user behaviour and experience?

1.2.3.1 Methods

To validate the summaries in an actual data search scenario we conducted a follow-up online study with three mock-up tasks in which participants were asked to search for data. In a between-subjects design we compared four versions of search engine results pages (SERPs), two of which contained a summary based on the template from the prior study, across a set of performance, interaction, and experience based metrics.

1.2.3.2 Outcomes

We found that merely displaying the title together with the publisher and format of the data is not perceived as enough information to make an informed selection of a dataset out of a pool of search results. Our results further suggest that dataset summaries presented as a structured table could be perceived as most useful for the dataset search tasks we tested; however, these results were not statistically significant and would need to be validated in further research. We discuss that summary tables (potentially together with a preview of the data) could be an interesting direction to explore further.

1.3 Contributions

The main contributions of this work lie in a better understanding of user behaviour in data discovery. This includes a categorisation of data-centric work tasks, a better understanding of search strategies in data search, in-depth knowledge on selection criteria in data search, insights into exploration activities for datasets, an analysis of the composition of dataset summaries created by different types of participants, as well as insights into interactive information retrieval experimentation for data search (discussed in detail in Chapter 6).

This allows us to address the main high-level question of this work on how to help people select data that is relevant and useful for their task.

Framework for Human Structured Data Interaction.

We synthesise the findings on information seeking for datasets in a framework for Human Structured Data Interaction, initially based on the findings from Chapter 3 and refined throughout this work. This presents a novel perspective on the conceptualisation of data-centric tasks, search and exploration strategies in Chapter 3, as well as an in-depth analysis of selection criteria in data search in both Chapters 3 and 4. This framework aims to help system designers and publishers of data understand what activities people do when searching for and engaging with datasets.

Template for dataset summaries.

We consolidate the findings from Studies 2 and 3 in a template to guide the dataset summary writing process. We propose guidelines for the creation of textual summaries of datasets for human consumption. These can, if validated in further research, be a potential solution to support users in the selection process of a dataset out of a pool of search results. In Chapter 4 we present an initial prototype of how to semi-automatically guide users through the summary writing process based on the findings of this work.

We translate these contributions into actionable insights into how people could be supported in selecting datasets from a pool of search results and how to further advance Human Data Interaction research for dataset search:

• A framework for Human Structured Data Interaction that can be used as guidance for data publishers, data portals and designers of data discovery tools
• A proposed template to create user centred dataset summaries
• Initial suggestions on how to present these summaries in a data search scenario
• Specifications for a data summarisation tool, as well as an initial prototype
• A detailed discussion of potential directions for future research in data search

1.3.1 Publications

This work has led to a number of peer-reviewed publications.

• The Trials and Tribulations of Working with Structured Data – a Study on Information Seeking Behaviour. Laura Koesten, Emilia Kacprzak, Jeni Tennison, Elena Simperl. Proceedings of ACM CHI Conference on Human Factors in Computing Systems, CHI 2017.
  Personal contribution: lead author of this publication. Included study design, set-up and execution of the qualitative study described in this publication and in Chapter 3. Further includes qualitative analysis and write-up of the mixed-methods results. The other authors contributed in terms of discussions and planning of the study design, to the analysis of the search logs and feedback on the paper content.

• Everything You Always Wanted to Know about a Dataset: Studies in Data Summarisation. Laura Koesten, Elena Simperl, Emilia Kacprzak, Tom Blount, Jeni Tennison. International Journal of Human-Computer Studies, 2019.
  Personal contribution: lead author of this publication. Included study design, set-up and execution of the experiments described in this publication and in Chapter 4. Further includes analysis and write-up. The other authors helped in terms of discussions and planning of the study design, as well as editing the paper content.

• A User Centred Perspective on Structured Data Discovery. Laura Koesten. The Web Conference 2018, PhD Symposium.

• Searching Data Portals – More Complex Than We Thought? Laura M Koesten, Jaspreet Singh. Proceedings of ACM SIGIR Conference on Human Information Interaction and Retrieval, CHIIR 2017. Workshop: Supporting Complex Search Tasks.
  Personal contribution: lead author of this publication.

• Characterising dataset search – An analysis of search logs and data requests. Emilia Kacprzak, Laura Koesten, Luis-Daniel Ibáñez, Tom Blount, Jeni Tennison, Elena Simperl. Journal of Web Semantics.
  Personal contribution: analysis of data requests together with the lead author, collaborative write-up of this analysis, feedback and editing of the paper content.

• Characterising Dataset Search Queries. Emilia Kacprzak, Laura Koesten, Jeni Tennison, Elena Simperl. Companion of the The Web Conference 2018, Workshop: International Workshop on Profiling and Searching Data on the Web.
  Personal contribution: planning of study design, set-up of the experiment and write-up of the paper together with the lead author.

• A Query Log Analysis of Dataset Search. Emilia Kacprzak, Laura Koesten, Luis-Daniel Ibáñez, Elena Simperl, Jeni Tennison. Proceedings of the International Conference on Web Engineering, ICWE 2017.
  Personal contribution: contributed to data analysis and write-up.

• Learning when searching for web data. Laura Koesten, Emilia Kacprzak, Jeni Tennison. Proceedings of ACM SIGIR Conference on Human Information Interaction and Retrieval, SIGIR 2016. Workshop: Search as Learning (SAL).
  Personal contribution: lead author of this publication.

1.4 Thesis structure

This thesis is structured as follows: related work is summarised in Chapter 2. Chapter 3 presents our initial study on information seeking for structured data. We present two more in-depth studies on textual summaries of datasets, which are based on findings of the previous studies, in Chapter 4. Chapter 5 describes an online experiment to test the applicability of text based dataset summaries in a dataset search scenario. In Chapter 6 we discuss the findings and implications of this work and lay out several potential future directions for research in HDI and user centred dataset search. Chapter 7 summarises the contributions of this work.

1.5 Terminology

Structured data
Data that is explicitly organised, for example in spreadsheets, web tables or relational databases.

Dataset
A dataset refers to structured information collected by an individual or organisation, distributed in a standard format (for instance CSV files). This is similar to the definition by Noy et al. (2019); however, in our scenario, we focus on alphanumeric data and exclude images due to their specific interaction affordances (Datta et al., 2008).

Data portal
Data portals are repositories of datasets which provide a central point of discovery of these datasets. Data portals often have a search functionality and represent one of the ways to search for data on the web.

Data publisher
Lawrence et al. (2011) define data publishing as the act of ‘making data as permanently available as possible’ on the web. In this context a data publisher is the person or institution who publishes data, usually online, for a group or the public to access.

Metadata
Metadata is data about a dataset. Metadata describes properties of the dataset, such as its title and description, provider, etc. Examples of metadata vocabularies are the Data Catalogue Vocabulary (DCAT10) or Schema.org11.

Data user
The person using the data. This work focuses on users with data-centric work tasks which come with an information need for structured data that requires locating and selecting datasets in the context of a task in which the data would be used.

Data search and data discovery
This refers to searching for data on the web, or on dedicated data portals. From the perspective of a data user, data search can refer to searching for datasets, as well as for a data point within a dataset. For the purpose of this work we consider search contexts where the results are datasets rather than individual data records published in a dataset. We do not include search within database structures in this definition (e.g. using SQL or SPARQL). While ‘dataset search’ and ‘data discovery’ are often used interchangeably, in the context of this work ‘data discovery’ includes all data search strategies (including human recommendation), as well as serendipitous discovery of datasets.

Snippet
The short summarising text that is returned by search engines. This helps users to make a decision about the relevance of a search result (Hearst, 2009).

Interactive Information Retrieval
Interactive Information Retrieval (IIR) is the field dedicated to studying and evaluating users' interactions with IR systems and information (Kelly, 2009). Also known as Human-Computer Information Retrieval (Marchionini, 2006b).

‘User centred’
User centred design is a multidisciplinary approach to interactive system development that focuses specifically on making systems usable (ISO 9241-210:2010)12. This work presents a user centred perspective, which means the focus of interest is on the interaction process and the user experience in data search.

10 https://www.w3.org/TR/vocab-dcat/
11 http://schema.org/Dataset
12 https://www.iso.org/obp/ui/#iso:std:iso:9241:-210:ed-1:v1:en

Chapter 2

Background

This chapter presents background literature in relation to this work. We describe how we search for data currently and where we see the differences between web search and data search. We discuss key concepts used in this work from the Information Seeking and Interactive Information Retrieval literature; and where we see our work situated in these areas of research.

We then describe factors influencing user experience in data search. We present these from the perspective of the user, such as different search types, the task context, sensemaking and selection criteria for datasets. Furthermore, we discuss factors from the perspective of the data publisher, such as features of the data search system, including the interface, SERPs, and the way overviews of datasets are presented.

2.1 Data search currently

As an increasing amount of data becomes available on the web, searching for it becomes an increasingly important topic. The web hosts a whole range of data species, published in structured and semi-structured formats - from web markup using schema.org and web tables, to open government data portals, knowledge bases (e.g. Wikidata1) and scientific data repositories (Cattaneo et al., 2015; Lehmberg et al., 2016; Kindling et al., 2017). Datasets are published in data markets (Balazinska et al., 2013; Grubenmann et al., 2018), open data portals (Hendler et al., 2012; Kassen, 2013; DataGovUk, 2018), scientific repositories (Elsevier, 2018; Altman et al., 2015; GESIS, 2018), or as part of general-purpose resources, such as Wikidata or the Linked Open Data Cloud2 (Assaf et al., 2015).

1 https://www.wikidata.org/
2 https://www.lod-cloud.net/


Data search and discovery has emerged as a topic of research in a range of complementary disciplines. And yet, despite advances in information retrieval, the semantic web and data management, data search is by far not as advanced, both technologically (Cafarella et al., 2011) and from a user experience point of view (Gregory et al., 2017; Koesten et al., 2017), as related areas, such as document search.

Searching for structured data online is commonly done via general purpose web search engines or on domain specific data portals. A specific type of data portal that we discuss in more detail in this work is the open data portal. Open data portals represent a point of free access to governmental and institutional data, cataloguing and archiving datasets and allowing users to browse and search through the repositories of these institutions.

Recently, Google has introduced dataset search3 in beta, an initiative to use the schema.org4 markup language to index datasets and make them discoverable through a general purpose search engine (Noy et al., 2019). There are a number of other ways to access data online, such as dedicated web crawlers and APIs, which require specific skill sets. The focus of this work is on information seeking scenarios in which datasets are discovered through a traditional type of search functionality, such as on Google, or on many data portals. We aim to discuss both professional data users as well as less data literate users who might search for and use data, but would not consider themselves data professionals.

2.1.1 Example of a data search process

We consider scenarios in which users have an information need that requires data as the information source. Imagine you want to analyse trends in street crime rates in London over the past year. You are trying to find data that is relevant for this information need/task. You might enter a search query such as ‘hate crime statistics 2018’ in the search box of a UK data portal (see Figure 2.1).

Figure 2.1: Data search interface on a data portal. Users can type keywords in the search field or browse the data collection by domain and other attributes.

3 https://toolbox.google.com/datasetsearch
4 https://schema.org/Dataset
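The same keyword search can usually be issued programmatically. The sketch below runs the example query against a CKAN-style portal API (data.gov.uk exposes one); the endpoint and response fields shown are assumptions that may differ between portals, and the ranking they return is computed over metadata rather than over the data files themselves.

```python
import requests

# Keyword search against a CKAN-style data portal API. The endpoint and the
# response structure are assumptions based on the CKAN action API and may
# differ for other portals.
BASE_URL = "https://data.gov.uk/api/3/action/package_search"

response = requests.get(BASE_URL, params={"q": "hate crime statistics 2018", "rows": 5})
response.raise_for_status()
result = response.json()["result"]

print("Matching datasets:", result["count"])
for dataset in result["results"]:
    publisher = dataset.get("organization") or {}
    formats = {res.get("format") for res in dataset.get("resources", [])}
    # Title, publisher and formats are the metadata fields typically shown
    # alongside each result on the portal's result page.
    print(dataset["title"], "|", publisher.get("title"), "|", formats)
```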

The search results may look like those in Figure 2.2. On the left hand side, you can find a classification of the results based on metadata attributes such as publisher, topic, format and license. You can use these facets to explore the collection of datasets or filter the results. On the right hand side, you can choose from a ranked list of datasets. Each dataset is presented via its metadata with title, publisher, last update and available file formats.

Figure 2.2: Example of a data search result list on the UK governmental open data portal. Below each search result is some metadata and a snippet describing the underlying dataset.

In most cases, the dataset is accompanied by a short text summary, discussed in detail in Chapter 4. You select one of the results to explore further, based on what else is on the list and on the information displayed in the summary. This commonly takes you to a new page, where you can also download the dataset to examine it in detail. Google dataset search5 provides a similar experience to data portals. All search results are pointers to datasets, and the dataset preview page in their case is displayed together with each search result.

For other types of users, data search might start on a general purpose web search engine. Data search results on general-purpose web search engines have a similar look and feel (see Figure 2.3), although the hits are a mix of datasets and other types of sources. In this case, you might be able to tell from the result snippets which links refer to datasets, click on the results, and look for a download link, a table, or an API.

5 https://toolbox.google.com/datasetsearch

Figure 2.3: Example of a data search result list on Google. Underneath each search result is a snippet describing the underlying website, which may or may not contain data.

2.1.2 Data search in contrast to web search

Web search has evolved over time; before web search engines were developed, documents were recommended to people by other people. Manually created catalogues were used, which collected together content on the web that people recommended. First attempts at search found matching terms within these indexes; then techniques were developed that could rank results according to their relevance, and from that more complex search algorithms evolved. Currently web search can already provide sophisticated contextual and personalised results by combining explicit queries with other, more implicit information (White et al., 2009; Sieg et al., 2007; Marchionini, 2008). In contrast, techniques to support searching for data are far less advanced and do not currently provide the same user experience as we are used to from web search.

Data sources can be categorised according to their level of structure (unstructured, semi-structured and structured data). While the boundaries between these categories are not always strictly defined, we typically broadly distinguish between unstructured data, such as regular text, and structured data, which is explicitly organised, such as in relational databases, web tables, lists, and spreadsheets. This work discusses structured data in general, independent of format or data type. However, for our experimental studies we focused on tabular data (CSV files and spreadsheets), as they come with a low technical barrier and are thus accessible to a wider range of users than more complex data types. Structured data is available on dedicated websites, including data catalogues hosting multiple datasets (e.g., scientific repositories, open government data platforms); back-end deep web databases (e.g., books data records on amazon.com) (Madhavan et al., 2006); bespoke marketplaces (such as Microsoft Azure's6); and specialised data vendors (such as Bloomberg7) for financial market data. Data search has evolved heavily based on what we know about web search; however, we believe that ideas and tools from web search cannot be directly applied. Searching for sources of structured data presents many challenges, and datasets have unique characteristics which come with certain requirements or constraints. While this work focuses primarily on the user experience in data search, we give a short overview of some of the underlying technical differences to demonstrate the constraints within which data search can currently be approached.

6 https://azure.microsoft.com
7 https://www.bloomberg.com/

Web search engines are based on algorithms which are designed to rank web pages (Page et al., 1998) and do not equally support the indexing of structured content. Some aspects of data discovery, such as data mining techniques, are very advanced, but others, like techniques to integrate heterogeneous data published on the web, are not yet fully usable. Putting aside the question of links between resources, which is at the core of algorithms such as PageRank, established Information Retrieval technologies are primarily designed to work on (unstructured) documents and less to fulfil the specific needs that people have when looking for data (Wilson et al., 2010; Cafarella et al., 2008). We currently know little about how people search for data, and therefore little about how datasets should ideally be ranked for which tasks, and which signals are of particular importance in dataset search (Noy et al., 2019).

One characteristic of data search on the web or on data portals is that we tend to use keyword search over metadata that is published together with the dataset (described in more detail in Section 2.4.2). Current approaches hardly ever search through the datasets themselves to assess the relevance of those datasets to a query. Instead, we rely on metadata and descriptions of data in order to find datasets, which are inconsistent in quality and availability (Atz, 2014; Noy et al., 2019); that is one reason why search functions within data portals lack accuracy. We know that keyword-based matching works less well on structured and semi-structured data (Lopez-Veyna et al., 2012). For instance, Thomas et al. (2015) show that dataset repositories have poor search over and inside tables and demonstrate that the naïve approach of full-text search is not appropriate. The majority of datasets do not have links between themselves or to web pages, which is another reason keyword search on such data has limitations (Lopez-Veyna et al., 2012).
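To make this limitation concrete, the short sketch below (our own illustration in Python, not the implementation of any actual portal; the dataset records and field names are invented) ranks two toy dataset descriptions by keyword overlap with their metadata only. The dataset whose relevant terms appear only inside its rows is never matched, mirroring the metadata-only retrieval described above.

    # Minimal sketch (not any portal's actual implementation): keyword matching
    # over dataset metadata only, illustrating why content that is not described
    # in the metadata stays invisible to the retrieval system.

    def tokenise(text):
        return set(text.lower().split())

    datasets = [
        {"title": "Road safety statistics 2018",
         "description": "Reported road accidents by local authority.",
         "rows": [["Southampton", 312], ["Winchester", 98]]},
        {"title": "Annual transport figures",
         "description": "Miscellaneous transport indicators.",
         "rows": [["cycling accidents", 45], ["bus passengers", 120000]]},
    ]

    def metadata_score(dataset, query):
        terms = tokenise(query)
        fields = tokenise(dataset["title"]) | tokenise(dataset["description"])
        return len(terms & fields)

    query = "cycling accidents"
    ranked = sorted(datasets, key=lambda d: metadata_score(d, query), reverse=True)
    for d in ranked:
        print(metadata_score(d, query), d["title"])
    # The second dataset contains 'cycling accidents' in its rows, but scores 0
    # because the match is computed on metadata alone.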

While there is a significant volume of studies from the semantic web community (e.g. entity-centric search (Balog et al., 2010)), as well as from the database and data management communities, there is to this day little research targeting general purpose dataset search on the web or on data portals. Cafarella et al. (2011) have discussed the complexity of surfacing structured data on the web. The majority of deep-web data is not crawlable, or requires special techniques (Cafarella et al., 2008). Initiatives such as Open Data Portal Watch8 or the European Data Portal9, which build meta-portals that crawl the web to offer integrated access to multiple data repositories, require manually curated mappings between the attributes and schemas used by different sites to describe their datasets and make them discoverable.

6 https://azure.microsoft.com
7 https://www.bloomberg.com/

While we focus on selection criteria, another interesting aspect of data search is specialised querying interfaces that try to take advantage of the structure inherent to the data to allow users to access it via different routes. A number of studies have discussed specific aspects of data search, such as querying interfaces for SPARQL (a semantic query language for Linked Data), SQL (Structured Query Language) or CQL (Contextual Query Language); or querying support for temporal and geospatial queries (Neumaier and Polleres, 2018). However, the specifics of such interfaces are very dependent on the data format and are outside the scope of this work. We discuss search result presentation in more detail in Section 2.4.1.

In addition to these technological limitations, there is little research examining data search from a user perspective. We know that different information sources result in people searching and choosing results differently, as relevance depends on context. This has been shown in research on search verticals (which focus on a specific type of content), for instance for scientific publications (Li et al., 2008; Yu et al., 2005), people (Weerkamp et al., 2011) or products (Gysel et al., 2016; Rowley, 2000).

Wynholds et al. (2011) show that, in the context of digital libraries, seeking and using documents and data for research purposes comes with specific information needs, processes and required levels of support. Kern and Mathiak (2015) conducted a user study with empirical social scientists specifically contrasting data search and literature search. They found that the quantity and quality of metadata is more critical in dataset search and that participants were willing to put more effort into the retrieval of research data than into literature retrieval, where convenience prevails.

Again, most of the things we know from the literature have been designed with other requirements in mind and the extent to which they apply to our scenario is yet to be fully investigated. In web search, a common denominator is the differentiation between simple (known-item) search and exploratory search (addressing more complex information needs) (Diriye et al., 2010; Albers, 2015; Marchionini, 2006a; Broder, 2002; Rose and Levinson, 2004), which we discuss in more detail in Section 2.3.1. Other models distinguish between question answering (Voorhees, 1999) and document retrieval (Gupta et al., 2016) to emphasise the specific challenges associated with identifying not just the document in which a specific piece of information could be found, but the actual answer to a user's query. We can find some parallels to this in relation to structured data. More specifically, people looking to find answers with data will sometimes require not just to be pointed to the right database, spreadsheet, or list that might contain the information they need, but to the record that answers their question. This has implications for how data search is implemented, as algorithms that focus on metadata, which are the de facto standard in existing portals, do not have this capability.

8 https://data.wu.ac.at/portalwatch/about
9 https://www.europeandataportal.eu/

We believe searching for data should be an area of attention in its own right, rather than extrapolating results from traditional document search. We focus this work on the interaction of people with data, so this refers to the user’s experience and ability to make sense of data that they find.

2.1.3 Dataset search versus database search

A somewhat related area of research is search and usability in databases. We describe where we see differences from a user perspective; we do not aim to cover the technicalities of different types of databases and connected data formats as this is outside the scope of this work.

One of the key differences in how users interact with databases is that it is usually possible to search at the data or row (tuple) level; whereas in dataset search, as we explore it in this work, this is usually not possible and we rely heavily on information about the data (metadata and textual descriptions of datasets). The result set returned is dependent on the query and the capabilities and specifications of the database; whereas we look at search for entire datasets, which are not likely to conform to the same schema. While we also consider information needs for individual data points, we focus on those scenarios that require finding a dataset that contains the desired data point, as this represents data search on the web currently. When we look at a dataset published on the web, there is no predefined schema that search functionalities could take advantage of and, as mentioned in the previous section, no way for the retrieval system to access the dataset's content.

Different interfaces, both for database querying and results presentation, have been explored by a number of authors (e.g. (Catarci, 2000)). For instance, Wen et al. (2018) looked at a summarisation approach to aggregate query answers to show larger numbers of potentially relevant results that users can explore interactively. Liu and Jagadish (2009b) discuss a spreadsheet-like interface potentially suitable for non-technical users of databases. 'DataLens' by Liu and Jagadish (2009a) uses a zooming-in analogy by presenting clusters of result units in a database, always showing one representative result for each cluster. The idea is to learn what is in the data without actually seeing all the results. The examples mentioned are illustrative and do not aim to present a comprehensive overview of the vast amount of database-related research.

Jagadish et al. (2007) claim that users' expectations for interacting with databases are fundamentally different from expectations for the web: they expect to query the database in a more sophisticated way, such as with semantically complex queries, taking advantage of the database schema, which is different from what keyword based interfaces afford. They further assume that users have significantly higher expectations on precision and recall, amongst other aspects.

Visual and form-based querying interfaces have been proposed to assist users in building queries, to avoid putting the burden of complex querying syntaxes (e.g. SQL/XQuery) on users (Jagadish et al., 2007). The benefits of keyword based querying interfaces have also been explored in database systems (e.g. (Agrawal et al., 2002)). However, such interfaces can go deeper into the structure of the data than is possible in data search on the web.

One interesting approach in this context is the work of Saint-Paul et al. (2005), where the authors try to summarise the content of an entire database by producing rows (tuples) with higher-level content that represent clusters of content. This would allow users to access the content at different levels of granularity. An overview of work on database summarisation was provided for instance by Roddick et al. (1999). We discuss summarisation approaches specifically for datasets in Section 4.2 in Chapter 4.

We believe that, to better understand result presentation for dataset search, there is value in looking at the database literature in more detail. However, this is not the main focus of this work. In addition, many approaches that might be interesting from an interface point of view are rather far from what is technically possible for a general purpose data search engine that returns datasets published on the web or searches through data repositories.

2.2 Information seeking

In this section we specifically discuss information seeking models and interactive information retrieval, as this work is situated in these areas of research. While the high-level steps in information seeking likely apply to all information sources, we focus on those aspects where we see potential unique characteristics of structured data that influence this process.

2.2.1 Information seeking models and interactive information retrieval

Information Seeking, as a discipline rooted in Library Sciences, looks primarily at how people seek information (Blandford and Attfield, 2010), placing the people and the finding activity at the focus. Information Retrieval (IR), more rooted in Computer Science, is concerned with the technologies that support finding and presenting information (ibid.) and was traditionally focused on, e.g., developing algorithms that improve precision and recall in search processes. A rich body of information retrieval literature explores how people select documents and determine their relevance to a given task or information need (Barry, 1994; Park, 1993; Schamber et al., 1990). Interactive Information Retrieval (IIR) studies users interacting with systems and information (Kelly, 2009). The focus here is whether people can use a system to retrieve information, whereas classic Information Retrieval is focused on whether a system can retrieve relevant documents (Kelly, 2009).

This work is based in IIR and information seeking, aiming to understand the specific characteristics of how people retrieve structured data as the source of information as opposed to other sources, such as textual documents or web pages.

The information seeking process is described by Marchionini and White (2007) as a set of actions done by users in a progressive and iterative manner. It starts with recognising an information need, involving cognitive activities such as formulating the problem sufficiently to take action, and expressing this information need in a search system. This expression is constrained by the search system. When the system delivers a response the user engages in the examination of results, which is the most time intensive activity (ibid.). The user either decides to reformulate the query, or to stop the search activity and use the found information. This process mostly mirrors the workflow with data described in Chapter 1 and the higher level steps described in many information seeking models. Within this information seeking process for data we primarily focus on the examination of datasets as search results by the user. Information seeking that is specifically centred around (structured) data is not often discussed in the literature (Kern and Mathiak, 2015). There are a variety of information seeking models; standard ones (Sutcliffe and Ennis, 1998) consider the information need to be static (Hearst, 2009). Other, more recent models consider the dynamic nature of information seeking and take into account that the information need gets refined or changed based on the retrieved information, as suggested for example by Bates (1989). Fidel (2012) proposes an ecological approach to information seeking in which the emphasis is on the environment and context in which the search takes place. Similarly, we believe that the design of effective information systems needs to be aware of the complexities of the information space the search takes place in. This includes, for instance, the types and formats of data sources, how reliable they are, how often they change, and whether more datasets need to be brought together to complete a task.

A well-known model based in Information Seeking is the Information Search Process (ISP) introduced by Kuhlthau (2004), which describes different stages in the search process. Kuhlthau's ISP understands information seeking from a user's perspective as a means to accomplish a goal. The process is described in six stages: task initiation; topic selection; pre-focus exploration; focus formulation; information collection; and search closure. When applying these stages to the process of searching for data, we have initiation when the user realises the need for information, and selection as the point where the user decides to use data as the source of information. The stages of exploration and focus formulation can be interpreted as the learning process which occurs while searching. The search gets refined, selection criteria become more obvious and potentially search strategies get changed based on what was learned in the search process. In information collection and search closure a user has not only found data, but has also performed an initial exploration of the datasets, to determine whether they are adequate for the task. As mentioned by Kuhlthau (2004), the user has a sense of the availability of information at that stage and confidence increases with the feeling of performing a comprehensive search.

We believe that the collection phase is more complex when searching for data as opposed to documents. Factors such as format, licence or access to the data, as well as the availability of meaningful documentation, can complicate this process. In contrast to Kuhlthau (2004), we are using the term 'exploration' as a means to understand whether a search result, in our case a dataset, is relevant for the searcher's purposes. This interpretation is much more aligned with data science frameworks such as the one introduced by Pfister and Blitzstein (2015), which is centred around activities to collect relevant data, explore it to make sense of it, and finally build an analysis model to draw conclusions from it.

Resulting from a number of studies on information seeking (e.g. Adams and Blandford (2005)), Blandford and Attfield (2010) propose the information journey framework. It similarly describes phases of information gathering: from recognising an information need to acquiring, interpreting and validating that information, to subsequently using this interpretation. Similar to the workflow with data that we proposed in Chapter 1, these phases are not necessarily seen as sequential (Blandford and Attfield, 2010). Marchionini and White (2007) describe a very similar set of activities, focusing a bit more explicitly on the examination of results. They mention that reviewing primary content is the activity in the information seeking process that takes the most time.

Based on these models and informed by the data science process described by Pfister and Blitzstein (2015), we discuss the information seeking process for data in the high-level categories 'task, search, evaluate, explore and use / refine'. The main focus of this work is on search and evaluation activities; however, to get a thorough understanding of user experience in data search we consider the entire information seeking activity. We assume that when applying information seeking models to an information need for data, the high level phases of information seeking are similar, as displayed in Figure 2.4. Nonetheless, the processes a user needs to work through, as well as the potential barriers, the required skills and the connected tasks differ, and therefore user needs for tools and interfaces are not the same as when searching for textual information. As for all linear abstractions of the information seeking process, this is usually an iterative process and can involve returning to previous activities at any point. The user learns during this process and refines her activities in the context of the task (Rieh et al., 2016).

Figure 2.4: Simplified information seeking process for data and influencing factors.

Figure 2.4 lays out the structure of the remainder of this Chapter. We discuss the factors influencing user experience in data search from the perspective of the user and their context, in Section 2.3; as well as from the perspective of the data publisher and features of the system that is used for the data search activity in Section 2.4.

2.3 The user in data search

2.3.1 Search types: simple versus exploratory search tasks

Effective information retrieval systems must be based on an understanding of users and their tasks (Ingwersen, 1992). Search tasks range from simple (known-item) to more complex tasks, and the interaction with a search interface depends on the type of task and the information need connected to the task (Baeza-Yates and Ribeiro-Neto, 2011). Marchionini (2006a) differentiates between lookup tasks and exploratory search. Lookup tasks involve fact retrieval and question answering and are commonly satisfied by displaying short, discrete pieces of information (Baeza-Yates and Ribeiro-Neto, 2011). In our scenario this could be a value within a dataset that answers a question.

We can find some parallels to this in relation to structured data: people searching for data will sometimes need not just to be pointed to the right database, spreadsheet, or list, but to the record that answers their question. This has implications for how data search is implemented, as algorithms that focus on metadata (which are the de facto standard in current data search on the web) cannot support this type of query.

Searching data on the web is currently rarely a straightforward lookup task, as described in Section 2.1. Finding data often involves open-ended and multifaceted search scenarios involving comparing, synthesising and evaluating data, which is defined as exploratory search by Marchionini (2006a).

People engaging in exploratory search are often unfamiliar with the domain, unsure how to achieve their goal, or even unsure of their goal (White and Roth, 2009). For example, a person might not know exactly what to search for because they are not familiar enough with the topic and the available resources. Sensemaking and learning are often inherent to this type of task (Diriye et al., 2010; Marchionini, 2006a), which are discussed in more detail in Section 2.3.3. Exploratory search often requires the consultation of additional information from external sources and results in unsystematic steps through the information space (White and Roth, 2009; Simon, 1973). Given the breadth of possible information needs and the potential vagueness of the user's goal, typical measures of success such as time, performance, or result accuracy cannot be directly applied in exploratory search (Wilson et al., 2009).

White and Roth (2009) draw parallels to exploratory data analysis (EDA) (Tukey, 1970), describing similar systematic learning mechanisms such as hypothesis formulation and testing. In EDA the data gets explored in different ways, usually without having a prior understanding of what the data contains, until a 'story' emerges (e.g. through patterns, outliers, or comparisons to other data) (Marchionini, 2006a; Baker et al., 2009).

Considering the types of context needed to use data meaningfully, as described in Section 2.3.4, we hypothesise that search for data displays characteristics of exploratory search tasks. The user's goal and information need might differ, as the type of information available differs when one is able to search for data directly (Wilson et al., 2010). We argue that this applies also when searching for datasets. The process of retrieving data as the information source is likely to have unique characteristics. This affects every part of the retrieval process, both on the system's and the user's side.

Keyword search has become the standard information retrieval and web search strategy (Hearst, 2009). However, the literature suggests that offering only keyword search options might not cater to all users' information needs (Wilson et al., 2009). For instance, exploratory search might require systems to support more diverse search strategies than simple keyword search (White and Roth, 2009), such as comparing, synthesising and evaluating (Marchionini, 2006a), especially when users have imprecise goals, complex questions, little pre-search knowledge, or use systems with poor indexing (Wilson et al., 2010). Exploratory search often includes both focused searching as well as exploratory browsing activities (Marchionini, 2006a). Future research can show if these characteristics might have to be supported in specialised dataset search systems and interfaces or if different interaction patterns might be more useful.

2.3.2 Data-centric information seeking tasks

Data-centric search activities are likely to be dependent on the task people intend to use the data for (Ingwersen, 1992). We focus on work tasks, which are particular types of tasks with an assigned activity or unit of work that needs to be completed to meet a pre-defined goal, as defined by Freund (2013). The work task provides the context and potentially the environment in which the search is conducted (Byström and Hansen, 2005) and thus co-defines the evaluation criteria for when a task is considered successful, including relevance and usefulness assessments (Borlund, 2003). In that sense, work tasks motivate and frame information seeking tasks (Freund, 2013), which can be seen as subtasks, as they are connected to the primary purpose or goals of a work task (Byström and Hansen, 2005).

While there are a variety of classifications of information seeking tasks in the literature, to date there is no comprehensive taxonomy for data-centric work tasks and connected information seeking activities. This might reflect the rapidly changing work practices with data in the last decade. As mentioned in Chapter 1, Pfister and Blitzstein (2015) describe tasks connected to data science activities in their model of the data science process. Convertino and Echenique (2017) report subtasks in data analysis conducted by business users. Similarly, Boukhelifa et al. (2017) describe the subtasks 'dataworkers' engage in when encountering uncertainty in data analysis. None of the mentioned authors presented a deeper analysis of data-centric work tasks. Looking at information seeking tasks more generally, Li and Belkin (2008) propose a faceted classification of such tasks, in which one of the facets is the 'product of the task', which can be 'factual information' such as data, facts or similar. Gregory et al. (2017) describe tasks as 'data uses', drawing from a qualitative study with researchers from five different disciplinary areas. The types of task and the connected goals are known to influence information behaviour (Freund, 2013). This can have implications for the design of the querying interface, for search result presentation, as well as for ranking or recommendation algorithms. We believe that a classification of data-centric tasks could be useful as a preliminary step to improve the user experience in dataset search, for instance by allowing us to model the interaction with the IR system and tailor functionalities to specific task types. As described by Freund (2013), a user's perception of usefulness is dependent on the task, the information type and on the relationship between these two factors. The study described in Chapter 3 is a step towards describing data-centric work tasks of data professionals. However, the focus of this work is mostly on selection criteria for data search, looking at tasks only as the context for such a selection.

2.3.3 Sensemaking with structured data

Information seeking can be seen as part of a larger process referred to as sensemaking (Baeza-Yates and Ribeiro-Neto, 2011). Sensemaking is defined as the process of constructing meaning from information (Blandford and Attfield, 2010; Naumer et al., 2008). It has been studied by different disciplines, such as psychology, information seeking, HCI, and naturalistic decision making. Imagine the journalist mentioned in the introduction. She has identified a dataset she wants to use, but which she has never seen before. As a next step she will try to understand what that data means on various levels, e.g., as a whole dataset, the columns, headers, or other structural features, and its relation to the world and the context of her story. In this first moment, when looking at numbers she has never seen before, she will see 'data' in its original sense - numbers or words, which only turn into information when context is added (Albers, 2015). In Albers (2015)'s hierarchy of levels of information, this 'information' is called knowledge only when it produces further understanding of a situation. We also refer to this process as data exploration.

In other words, sensemaking is an iterative process that involves linking of different pieces of information into one conceptual representation that constitutes an interpretation (Hearst, 2009; Russell, 2003). This process is heavily dependent on the individual person involved in it. The individuality of users is well recognised in the literature in terms of their cognitive makeup, their prior knowledge about the topic and about searching, their motivation, as well as other factors (Kelly, 2009; Marchionini, 1995). Ingwersen and Järvelin (2005) describe the user as a constructive actor, with cognitive conceptions, rather than as a passive receiver of information (White et al., 2009). Each user has had different physical, cognitive and affective experiences and behavioural dispositions (Kelly, 2009) as well as different levels of data literacy. This could mean tools that support search and understanding of data need to have a certain amount of flexibility. While this applies to all information seeking activities, we believe that there are unique characteristics when the source of information is structured data.

In Information Seeking, sensemaking often refers to understanding documents or web pages, which can be more or less structured. Making sense of structured data has mostly been studied in connection to information visualisation, which can help to see patterns in data (e.g. (Furnas and Russell, 2005; ah Kang and Stasko, 2012; Marchionini et al., 2005)). For instance, when discussing how a certain type of analyst grasps large bodies of information, Stasko et al. (2008) presented a visual analytics system tailored for the sensemaking tasks of particular groups of data analysts. As part of their work on accessing government statistics, Marchionini et al. (2005) allowed users to explore relevant information from different perspectives and understand relationships within the data via agile display mechanisms. While there are other examples like these in the literature, they are not very well integrated into more mainstream productive environments and data work practice. Furthermore, visualisations are often not available; especially when searching data on the web we mostly rely on metadata.

For example, in their work on accessing statistical information, Marchionini et al. (2005) emphasise that people need context as well as means to reveal the story behind numbers, which should vary with the level of expertise of the user. Roth et al. (1992) describe scenarios in which some specific data can suddenly become critical in a particular situation, referring to accident scenarios in which a usually unimportant piece of data becomes the catalyst of events. But also on a more basic level, it can be hard to predict the importance of particular data, as new events can always change the context (Woods et al., 2002). Some domains have developed specific data exploration systems which support their needs; we discuss these in more detail in Section 2.4.1.

In order to make sense of the data we find, we need to apply cognitive work to make it meaningful (Rasmussen, 1985). We do this based on the data we find and the information provided with it, the metadata. There is a large body of research concerned with the cognitive processes that take place during search, specifically on cognitive IR models (Belkin, 1984; Daniels, 1986), that can help to better understand sensemaking in information seeking. As mentioned before, data search differs from traditional document search as finding, accessing, understanding and using data requires specific, potentially additional, skills. The user's prior knowledge and experience with the domain or topic determine their ability to understand the data. Skills such as accessing, interpreting and critically assessing data are part of a user's data literacy (Calzada Prado and Marzal, 2013). Data usually requires additional context to be interpreted, as described in more detail in Section 2.3.4. Hence, potentially more complex search interfaces are required that offer different viewpoints to facilitate learning during the search process. Further research is needed to understand how people make use of data resources and progress from finding to understanding (Marchionini, 2006a).

The better we understand the sensemaking process from data through information to knowledge, the better the systems we can create to facilitate learning from and through data search. As discussed earlier, our research, as with much of the related work in data search and sensemaking with data, is based on the assumption that, in order to offer the best user experience, we cannot simply reuse or re-purpose principles, models, and tools that have been proposed for less structured sources of information; the results of our initial study, described in Chapter 3, confirm this.

2.3.4 Context

Web pages often offer textual information and so provide curated and processed data, or information, that comes with context. Furthermore, web search engines are very advanced in providing additional context: they can provide contextual and personalised results by combining explicit queries with implicit feedback, such as integrating the user's browsing behaviour into a ranking system (Sieg et al., 2007; White et al., 2009). Reusing data beyond its original purpose is known to be challenging (Wilkinson et al., 2016). Context is a necessary source of meaning (Dervin, 1997), and there is added complexity of context within data search due to the additional information required to create meaning from data as opposed to from text documents: data can have caveats attached to it that influence what it can be used for. The process of data collection, processing and cleaning that takes place before a dataset gets published can be very complex and is often not reflected in the dataset itself, nor in the documentation attached to it (Davies and Frank, 2013). The negotiation of potential biases, the representativeness of a dataset, or the meaning of missing values often requires the user to access additional information and remains challenging (Baeza-Yates, 2018).

The ability to transform data into information, and so make sense of it, is dependent on the context provided by the system as well as on the data literacy skills of the individual. This additional information can partly be provided through information about the data, i.e. metadata. Providing reference points with the data, or in the presentation of data, is necessary to enable the user to build relationships between different pieces of information, which is needed to understand complex information (Albers, 2015; Marchionini, 2006a). Similar to Hearst (2009) and Russell (2003), Rieh et al. (2016) describe the sensemaking process as creating knowledge structures between the data or information that has been acquired through the information seeking task. This is arguably facilitated by the context provided during the data search as well as its presentation. For instance, understanding geographical data can be easier when it is displayed on a map, and meaning can be attached to numbers if a range or a graph is presented that supports relating those numbers to reference points. The presentation of data influences sensemaking (Wilson et al., 2010), which has also been shown in work on uncertainty visualisations (e.g. Greis et al. (2018)).

Decisions about the amount of context provided with the data are made by data publishers or by those designing the system, with limited guidance on user needs in that space. At the same time, interface design plays a key role in representing the context (Greenberg, 2001) and making it accessible to a wide range of users. Interfaces should enable the discovery of connections between different data points and represent data in a network, to enable a user to understand its meaning within the context of other data. Publishing structured data as Linked Data can be seen as a partial realisation of this idea, as it provides a basis for interlinking data and so adding context (Bizer et al., 2009); however, the majority of data on the web is not published as Linked Data.

Web data is heterogeneous and comes from different sources, with different meanings attached to it. This presents interaction challenges that require thinking about the user, as well as about the underlying system and the design of the interface. An overview of search results can enhance orientation and understanding of the information provided (Rieh et al., 2016). For data search, learning can be supported by allowing users to zoom in and out of levels of data, and by allowing filtering and cross-filtering (ibid.), rather than displaying one piece of content at a time, as is done with a list of documents. Navigational structures can support the cognitive representation of information (ibid.) and there is a large space to explore interfaces that allow more sophisticated interaction with a dataset's context. We discuss current data interfaces in Section 2.4.1 to give an overview, but the aim of this work is to contribute to a better understanding of what type of context is useful in a dataset search scenario.

2.3.5 Selection criteria

As the focus of this work is to understand the selection process of data in a search scenario, we discuss related literature on evaluation of search results, as well as selection criteria from a user perspective. We discuss data quality in a bit more detail as this is one aspect of selection criteria that has been studied directly in relation to structured data. Most other research in this area has focused on the retrieval of textual documents. To better understand how we can support users in selecting datasets that fit their information needs, we need to understand the different dimensions which influence their decision.

While the concept of relevance was explored in a number of fields, including Communication, Philosophy and Psychology (Saracevic, 1996), we are focusing on the concept of relevance as applied in Information Retrieval and Information Seeking research. Within these areas relevance generally refers to the matching of a query to an information source (traditionally textual documents) (Belkin and Croft, 1992).

The applicability of using relevance as a criterion to evaluate single search results, such as in the Cranfield and TREC paradigm (Robertson, 2008), has been questioned for Interactive Information Retrieval contexts (Saracevic, 1996). It does not consider context related to the user and the use of the resource, as it is traditionally considered to be the property of the system. A number of authors have therefore defined relevance in a more situational context in which the social context and time dependence, amongst other factors, influence the nature of relevance. This definition acknowledges the importance of the user and so the subjective and fluid aspect of the concept of relevance (Schamber et al., 1990; Mizzaro, 1997). Both these relevance definitions are criticised, either for not taking the situation or the user into account, or not taking the system into account (Saracevic, 1996).

The concept of relevance changes across studies, authors and time, and a large amount of work has been dedicated to understanding how people assess the relevance of documents (Barry, 1994; Park, 1993; Bales and Wang, 2005). Saracevic (1997) describes the complexities of using relevance as a criterion to assess the effectiveness of an information retrieval activity and discusses the notion of a 'system of relevances'. Bales and Wang (2005) compiled 133 user-based relevance criteria in a review of 16 empirical relevance studies. We see relevance as an equally broad concept, which we will discuss in more detail in Chapter 3, but believe that the specific criteria people apply to assess relevance (and its sub-concepts) are likely to have unique characteristics when searching for structured data.

Freund (2013) emphasises the context-dependency of selection criteria: relevance is dependent on the type of task that an information source is selected for. Different types of work tasks prompt different information behaviours and in most cases simply finding data does not resolve the information need of the user (ibid.). This might be different for fact-finding tasks, but only when access to the data is provided directly, such as in Question Answering scenarios. In most other cases the minimum activity that is required for a successful task is downloading and interpreting the data.

In the context of the user evaluation in Chapter 5 we use usefulness as a concept, rather than relevance, because it has been shown that relevance judgements do not necessarily correlate with perceived usefulness and user satisfaction (Mao et al., 2016). This is partially because external assessors of relevance might not fully understand the information need, as it is not their own 'real' task, and partially because of the simplified environment in which assessments of relevance usually take place (Mao et al., 2016). We use 'usefulness' as a metric, based on Cole et al. (2009)'s and Belkin et al. (2009)'s proposal of the concept for evaluation in Interactive Information Retrieval. Usefulness is a more general concept than relevance, which is the basis for measuring information retrieval systems (Cole et al., 2009; Belkin et al., 2009). The literature has shown that people are able to give usefulness judgements as intuitive assessments without having to understand the term technically (Cole et al., 2009). Within this work we see usefulness as the high-level assessment of whether a dataset contributes to accomplishing the overall information task. Taking into account the iterative and incremental nature of data-centric work tasks, as described in Chapter 3, we believe that it is necessary to extend evaluation criteria for this type of information retrieval to accommodate the specific characteristics of working with data.

A concept related to relevance and usefulness judgements is 'data quality'. A dataset is not considered useful if it does not fulfil the quality requirements attached to the task. Data on the web contains inconsistencies, misrepresented as well as incomplete information. Several authors have discussed quality dimensions for data, pointing out that quality for data could be more accurately defined than that for documents. Data quality is commonly described as 'fitness for use' for a certain application or use case (Knight and Burn, 2005; Wang and Strong, 1996). Data quality may depend on various factors (dimensions or characteristics) such as accuracy, timeliness, completeness, relevancy, objectivity, believability, understandability, consistency, conciseness, availability and verifiability (Wang and Strong, 1996). Zaveri et al. (2016) analysed quality dimensions with a focus on Linked Data and defined four core dimensions: accuracy, completeness, consistency and timeliness. A number of quality metrics for structured data have been proposed in the literature, such as metrics for correctness of facts, adequacy of semantic representation and the degree of coverage (Zaveri et al., 2016; Batini et al., 2009). Umbrich et al. (2015) discuss quality assessments of open data portals, focusing on available metadata. They point out that low metadata quality (or missing metadata) affects both the discovery and the consumption of the datasets. In that sense, the quality of metadata can be seen as one aspect of data quality, but it includes in itself a number of different concepts (such as completeness, accuracy or openness, amongst others suggested by Umbrich et al. (2015)).
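To illustrate how a few of these dimensions could be operationalised, the sketch below computes naive indicators for completeness, timeliness and type consistency over a small invented table. It is an illustrative example under our own simplifying assumptions, not one of the metrics proposed in the cited literature.

    # Illustrative sketch only: simple heuristics for a few of the quality
    # dimensions discussed above (completeness, timeliness, consistency of types),
    # computed over a small tabular dataset held as a list of rows.
    from datetime import date

    rows = [
        {"area": "Southampton", "value": 312, "updated": date(2019, 3, 1)},
        {"area": "Winchester", "value": None, "updated": date(2019, 3, 1)},
        {"area": "Portsmouth", "value": "n/a", "updated": date(2017, 6, 1)},
    ]

    def completeness(rows):
        cells = [v for row in rows for v in row.values()]
        return sum(v is not None for v in cells) / len(cells)

    def timeliness(rows, today=date(2019, 11, 1), max_age_days=365):
        newest = max(row["updated"] for row in rows)
        return max(0.0, 1 - (today - newest).days / max_age_days)

    def type_consistency(rows, column):
        types = {type(row[column]) for row in rows if row[column] is not None}
        return 1.0 if len(types) <= 1 else 1.0 / len(types)

    print(f"completeness:     {completeness(rows):.2f}")
    print(f"timeliness:       {timeliness(rows):.2f}")
    print(f"type consistency: {type_consistency(rows, 'value'):.2f}")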

Data on the web will always contain inconsistencies and incomplete information. In the context of this work we are interested in how to represent information about a dataset to users so they can make an informed decision when selecting a dataset from search results. Therefore, the specific importance of different dimensions of data quality, and how these influence perceived usefulness in different tasks, are of interest in this context.

2.4 The system and the publisher in data search

2.4.1 User interfaces for data search and exploration

The conceptual models that are provided through the user interface determine the cognitive representation of a system's features for a user (White and Roth, 2009). For our scenario, that means that the interfaces of data search or exploration tools impact what data users can discover and understand (Churchill, 2012). We discuss how search results for data are currently presented on data portals, and present work on specialised data interfaces for exploration, including a number of examples from the literature.

2.4.1.1 Specialised data interfaces

One of the primary ways to search for data on the web is through data portals, which are repositories of datasets. In Section 2.1.1 we discuss how search results for data are presented currently, by looking at the interface of one of the largest open data portals10 on an exemplary basis.

Data search often starts on a data portal with an interface as depicted in Figure 2.5, in step 1. Upon entering their query, users are presented with a compact representation of the results (step 2).

10 https://data.gov.uk/data/search

Figure 2.5: Search process on the UK governmental data portal.

The results are displayed similarly to web search, with a title and a short snippet of text, following a traditional ten blue links paradigm. (This paradigm presents 10 search results in a list together with textual snippets of keywords from the document and the corresponding URL of the associated search result (Hearst, 2009).) The interface consists of a standard query bar and a series of facets to further filter results. Additionally, this includes for each dataset its metadata (title, publisher, publication date, format etc.). After selecting one of the results to explore further, the user is commonly taken to a new page (see step 3), a dataset preview page with additional information, where the dataset can also be downloaded. This page shows some metadata: format, publishing organisation and date, licence, an openness rating, topic tags on the portal, and the harvest URL and date. This can, in some cases, also include a data preview or a visualisation. Figure 2.5 shows this process for the UK governmental open data portal. Many data portals on the web offer a similar user interface. Data search results on general-purpose web search engines have a similar look and feel, although the hits are a mix of datasets and other types of sources. Interface design plays a key role in representing context (Greenberg, 2001), which we discuss in more detail in the following section.
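The portal workflow described above can also be reproduced programmatically. Many governmental portals, including data.gov.uk, run on the CKAN platform, which exposes a package_search action over the catalogue metadata. The sketch below queries such an endpoint and prints roughly the fields a portal surfaces in its result list; the base URL and the exact metadata fields are assumptions about a standard CKAN installation and may differ between portals.

    # Minimal sketch of querying a CKAN-style data portal through its standard
    # package_search action. The base URL and the metadata fields used below
    # are assumptions about a typical CKAN installation and may differ per portal.
    import json
    import urllib.parse
    import urllib.request

    BASE = "https://data.gov.uk/api/3/action/package_search"  # assumed CKAN endpoint

    def search_datasets(query, rows=5):
        url = BASE + "?" + urllib.parse.urlencode({"q": query, "rows": rows})
        with urllib.request.urlopen(url) as response:
            payload = json.load(response)
        return payload["result"]["results"]

    for dataset in search_datasets("road accidents"):
        formats = {res.get("format", "?") for res in dataset.get("resources", [])}
        # Title, modification date and available formats: roughly the
        # information a portal surfaces in its result list (step 2 in Figure 2.5).
        print(dataset["title"], "|", dataset.get("metadata_modified"), "|", formats)

Note that, in line with the earlier discussion, the query is matched against catalogue metadata rather than the content of the datasets themselves.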

In visualisations, data characteristics are assigned to visual cues such as position, size, shape or colour, which provide users with a means to make sense of data (Card et al., 1999). To place data in context it needs to be situated in a conceptual space that illustrates relationships, events and contrasts that are meaningful in a field of practice (Woods et al., 2002). For instance, Shneiderman (1996) proposes seven task types and seven data types as a type-by-task taxonomy of information visualisations. The tasks are: overview; zoom; filter; details-on-demand; relate; history; and extract. Data is categorised into: 1-, 2- and 3-dimensional data; temporal and multi-dimensional data; tree and network data. The data visualisation literature offers a variety of propositions and guidelines to facilitate sensemaking in visual data exploration tasks (e.g. Baker et al. (2009); Bertin (2010); Russell (2003)).

There can easily be too much information displayed on the screen, which can make the 'correct' results inaccessible (Woods et al., 2002). Yet reducing the amount of data displayed also potentially leads to problems: it is much harder to define what is useful to a user in data search than for traditional documents. There are simply more options, and what might be relevant for a user depends on the task as well as the context (Woods, 1991). As one example of the discussion in the literature, Woods et al. (2002) propose the creation of a coherent virtual perceptual field that aims to resemble objects and their relationships in a way similar to a natural scene.

The problem of too many possible answers to a query has been discussed in the context of database search. Without knowing the data and having an understanding of the whole possible result set, users struggle to formulate targeted queries (Liu and Jagadish, 2009a). Exploratory search interfaces have tried to tackle this issue by allowing users to interact with the result space as a whole, for instance Klouche et al. (2017), using a visual re-ranking approach on a relevance map. However, these approaches usually rely on document-centred ranking strategies, which are not directly applicable to structured data.

Visual analytic tools have been explored by various authors; they support flexible visualisations to enable exploration by allowing the user to filter, sort and view different aspects of data, amongst other interactive options (Heer and Shneiderman, 2012; Keim et al., 2008; Zhang et al., 2012). Collaborative visualisation tools provide for instance a shared network diagram (Balakrishnan et al., 2008) or a timeline (Ganoe et al., 2003) and facilitate additional communication through chats or comment threads (Heer et al., 2007), attentionally ambient visualisations (Hajizadeh et al., 2013), and annotations (Ellis and Groth, 2004) that support users in building a shared mental model of the resource (Goyal and Fussell, 2016). In their paper describing Google Fusion Tables, a cloud-based data management and integration service, the authors report that map visualisations were found to be especially popular to display geospatial datasets (Gonzalez et al., 2010b).

Extensive work has been done on the design of search user interfaces for web pages or documents as the source of information (e.g. (Wilson, 2011)). Furthermore, tools such as Tableau11, Fusion Tables (Gonzalez et al., 2010a) and Many Eyes (Viégas et al., 2007) are popular examples of services which support data exploration and analysis through visualisation (Morton et al., 2014). A large number of other, less popular, but more specialised interfaces to support the visual exploration of data have been developed.

Within this work we focus on understanding which content could facilitate an informed selection of a dataset. The visualisation of data can be helpful (amongst other types of presentation); however, we still know little about how to present an overview of a dataset to users in a search scenario. Exploring the optimum representation of such content for different data-centric tasks is outside the scope of this work, but would be an interesting direction of future research.

11 https://www.tableau.com/

Examples of specialised interfaces and exploration systems for data. In this section we discuss a number of approaches aiming to support sensemaking with data in an exploration context. We discuss those as they support users in understanding the content of a dataset or corpus of datasets, which is relevant in a search scenario for data. One such approach was proposed by Marchionini et al. (2005) in their work on relation browsers for statistical information. These are interfaces using facets, containing sets of categories which are displayed graphically, so the user can view the information from different perspectives. Users can choose how the data is represented (e.g., a visualisation, spreadsheet, samples), which means that a wider set of people can potentially use the data. Relation browsers have been shown to work on some forms of structured data (e.g. for geospatial data), but have a limited number of possible facets that can be displayed (ibid.). Further research is needed to determine the applicability of the concept for different data-centric tasks.

One example of interactive visualisation research which aims to support sensemaking and information discovery is Jigsaw, a visual analytics system developed by Stasko et al. (2008). It uses entity recognition within texts and presents connections between entities across documents to the user. Although specifically developed for the exploration of documents, it would be interesting to explore these principles for the exploration of datasets.

Another example is a prototype interface developed by Faisal et al. (2007) for the exploration of academic literature, Literature Knowledge Domain Visualizations (LKDViz). This was designed with the goal of providing multiple interactive features to explore and manipulate the visualisation according to personal preferences.

Spatial hypertext systems serve as another example of systems designed to support sensemaking. They include functions for finding and organising documents. The Garnet interface, developed for digital libraries, is an example which provides options to cluster documents in a spatial layout according to individual preferences and can thus switch between supporting information seeking and structuring (Blandford and Attfield, 2010; Buchanan et al., 2004).

The above mentioned interfaces or systems are mostly designed for textual information, such as web pages, or are designed to facilitate exploration of data in a specific environment. One technique specifically designed for visualising and making sense of data was presented by Pirolli and Rao, called Table Lens, which supports the interactive exploration of large tables (Pirolli and Rao, 1996). They take advantage of the inter-relatedness of table cells to present more content from a table in a single display, supporting the human eye in easily detecting patterns and correlations (Rao and Card, 1995). There are domain-specific systems and tools which support the exploration of data. For example, in the financial domain the Bloomberg terminal12 allows monitoring and analysing real-time market data, among other functions, within a single system. However, this example of a relatively frictionless interaction is not yet realised for most other domains.

Morton et al. (2014) promote the development of data exploration tools that integrate three high-level capabilities, visual and interactive data exploration, data enrichment through recommendation systems, and data cleaning functionalities, to support the analysis of data in a so-called visual analysis sensemaking cycle.

The interfaces and approaches mentioned in this section are not an exhaustive list; rather, they are an exemplary overview of existing approaches and research in this area, illustrating the focus on textual documents and our limited knowledge of how to best support users through the design of the user interface in a search scenario.

2.4.2 Overviews of datasets

2.4.2.1 Search Results display

When searching on the web, we are used to being presented with a snippet, which is the short summarising text component that is returned by a search engine for each hit. This helps us make a decision about the relevance of the documents returned (Bando et al., 2010). Snippets adjust their content based on the user query to make selection more effective (Bando et al., 2010; Tombros et al., 1998). There are initial efforts that aim to do the same for dataset search (Au et al., 2016), but we are still very far from being able to provide the same user experience as in web search.

In information retrieval systems the user is usually presented with some kind of representation of the content of the search results. The page on which the results of a search activity are displayed is called the SERP (Search Engine Results Page). In traditional IR the document snippet refers to the information that summarises the document and is a key part of the search interface (Baeza-Yates and Ribeiro-Neto, 2011). The information is usually a summary or surrogate of the information source and its quality affects the perceived relevance of the SERP and of the individual search results. In document search we are used to the display of a short piece of text, called the snippet, that is extracted from the document. These snippets can be influenced by the query as well as the personal search history of the user and commonly highlight the user's query terms in the display. In specific types of search, such as over information collections (as in product search, search over scientific publications or music search), metadata are often displayed together with the search results, such as the publishing date or the author of the specific information piece (Baeza-Yates and Ribeiro-Neto, 2011).
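As a rough illustration of the query-biased snippets described above, the sketch below (our own simplification; production systems use far more elaborate extraction and ranking models) selects the sentence of a document description containing the most query terms and marks those terms up for display.

    # Illustrative sketch of query-biased snippet generation: choose the sentence
    # containing the most query terms and mark them up, roughly the mechanism
    # described above (real search engines use far more elaborate models).
    import re

    def snippet(document, query, max_len=160):
        terms = {t.lower() for t in query.split()}
        sentences = re.split(r"(?<=[.!?])\s+", document)

        def score(sentence):
            words = {w.lower() for w in re.findall(r"\w+", sentence)}
            return len(terms & words)

        best = max(sentences, key=score)
        # Mark query terms, mimicking the highlighted terms in a result snippet.
        for term in terms:
            best = re.sub(rf"(?i)\b({re.escape(term)})\b", r"**\1**", best)
        return best[:max_len]

    doc = ("This dataset lists reported road accidents by local authority. "
           "It is published yearly by the Department for Transport. "
           "Figures include cycling accidents and casualties by severity.")
    print(snippet(doc, "cycling accidents"))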

12 https://www.bloomberg.com/

Marchionini and White (2007) distinguish overviews, or ‘surrogates’, from metadata in that overviews are designed to support people to make sense of the information object before fully engaging with the object itself. Metadata can potentially serve a similar purpose, but is more often designed to support retrieval by the system and less targeted at human consumption, which is described in more detail in Section 2.4.3. As data search relies heavily on metadata (Noy et al., 2019) the displayed information either prompts the user to download the data or to search further – and is crucial in the interaction between people and data, potentially determining further engagement.

2.4.2.2 Textual dataset summaries

From a user’s perspective, having a textual summary of the data can be very useful to get an overview of a dataset. Text is sometimes easier to digest than raw data or graphs (Law et al., 2005; van der Meulen et al., 2010), and richer in context than metadata. It can help people assess the relevance, usability and quality of a dataset for their own needs (Au et al., 2016; Bargmeyer and Gillman, 2000; Lehmberg et al., 2016; Tam et al., 2015). It can also improve data discovery, as search algorithms can match the text against keyword queries (Thomas et al., 2015).
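As a minimal illustration of how such a summary could be derived from a table itself, the sketch below generates a one-sentence overview of a small invented CSV file using its headers, size and simple column statistics. The template wording is invented for this sketch and is not the guidance developed later in this thesis (Chapter 4).

    # Minimal illustration (the template wording is invented for this sketch):
    # deriving a short textual overview of a tabular dataset from its headers,
    # size and simple column statistics.
    import csv
    import io

    raw = """area,year,casualties
    Southampton,2018,312
    Winchester,2018,98
    Portsmouth,2018,154
    """

    rows = list(csv.DictReader(io.StringIO(raw.replace("    ", ""))))
    headers = rows[0].keys()
    casualties = [int(r["casualties"]) for r in rows]

    summary = (
        f"The dataset has {len(rows)} rows and {len(headers)} columns "
        f"({', '.join(headers)}). It covers {len({r['area'] for r in rows})} areas "
        f"for the year {rows[0]['year']}; reported casualties range from "
        f"{min(casualties)} to {max(casualties)}."
    )
    print(summary)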

In a review of related literature, Balatsoukas et al. (2009) concluded that textual metadata surrogates, if designed in a user-centred way, can help people identify relevant documents and increase accuracy and/or satisfaction with their relevance judgements. Several authors have shown that textual summaries perform better in decision making than graphs. For instance, Gatt et al. (2009) found in an evaluation of a system that summarises patient data that all users (doctors and nurses) perform better in decision making tasks after viewing a text summary with manually generated text versus a graph. These findings are confirmed by Law et al. (2005) and van der Meulen et al. (2010) in studies comparing textual and graphical descriptions of physiological data displayed to medical staff. Sultanum et al. (2018) emphasise the need to integrate textual summaries to get an overview of clinical documentation instead of relying on graphical representations. While we do not claim that text-based surface representations are superior to graphs, we believe they are, at a minimum, complementary to visualisations and accessible to a broad range of audiences, including less experienced users (Gatt et al., 2009; Sultanum et al., 2018).

We describe automatic and human summary creation in detail in Chapter 4, Section 4.2.

2.4.3 Metadata

When searching for data on the web we often rely on metadata. Most dataset search is built over metadata (as mentioned in Section 2.1.2), but, more relevant from a user point of view, metadata is often the only representation of a dataset a user gets to see in order to select a dataset.

Metadata often uses domain-specific vocabularies and technical formats. Its creation and understanding are challenging (Mayernik, 2011; Edwards et al., 2011). High-level efforts for standardising data sharing practices, such as the FAIR data principles (Wilkinson et al., 2016) or the five stars of linked open data13, promote efforts to increase findability and interoperability, amongst other aspects, to encourage the reuse of data.

Search engines can take advantage of standardised metadata about resources on the web; ideally, metadata content would also reflect what users consider selection criteria for datasets. Several efforts have been undertaken to standardise metadata about datasets on the web. One of them is DCAT14, an RDF vocabulary used to describe datasets on the web and to increase interoperability. It is, for instance, used by the CKAN platform, an open source data management system used by many governmental open data portals, and it is also indexed by Google dataset search (Noy et al., 2019). Google dataset search currently primarily indexes Schema.org, an open vocabulary for structured data markup on the web (Guha et al., 2016). This means that a set of attributes is suggested to describe resources on the web, amongst other things also for datasets15. However, to date we still know little about what metadata to capture to support a user's understanding of a dataset.

A related issue is the lack of concrete guidance for data publishers on how to produce meaningful descriptions, or summaries, of datasets. Currently, dataset summaries are created by people, often the data publishers, who might take metadata standards and community guidelines as a point of reference. Existing community guidelines for data sharing (which we discuss in more detail in Section 2.4.3), such as the W3C's Data on the Web Best Practices16 or SharePSI, focus on the machine readability of data. Textual descriptions are part of the standards, but guidelines for what they should contain are sparse.

This can be seen, for instance, in the W3C's Data on the Web Best Practices, which is based on DCAT, or, in a slightly different context, in the documentation of schema.org, a set of schemas for structured data markup on web pages. Instructions are formulated as follows:

(DCAT) description = free-text account of the dataset (rdfs:Literal)
(schema.org) description = a description of the item (text)
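To illustrate how little guidance these definitions give, the snippet below sketches what a minimal schema.org Dataset record might look like when serialised as JSON-LD; all concrete values (the title, licence and URLs) are hypothetical, and the snippet is an informal illustration rather than a normative example from either standard.

```python
import json

# A minimal, illustrative schema.org Dataset record serialised as JSON-LD.
# All concrete values (title, licence URL, download URL, dates) are hypothetical.
dataset_record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Road traffic counts (example)",           # hypothetical title
    "description": (
        "Free-text account of the dataset: what it contains, "
        "how and when it was collected, and known limitations."
    ),
    "license": "https://example.org/licence",           # hypothetical licence URL
    "temporalCoverage": "2015-05-01/2016-04-30",
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/traffic.csv"  # hypothetical download URL
    }],
}

print(json.dumps(dataset_record, indent=2))
```

The description property is the only place where the publisher explains the dataset in prose, yet neither standard says what such a description should cover.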

Existing metadata schemas might not be extensive or descriptive enough to support interaction patterns in dataset search. By investigating selection criteria in dataset search, this work aims to increase our understanding of the importance of specific metadata attributes or to discover missing ones.

13 https://www.w3.org/DesignIssues/LinkedData.html
14 https://www.w3.org/TR/vocab-dcat/
15 http://schema.org/Dataset
16 https://www.w3.org/TR/dwbp/

Dataset profiles.

Another related area of research aiming to provide overviews of datasets is data profiling. This refers to a wide range of methods to describe datasets, with a focus on their numerical or structural properties. Profiles can be merely descriptive or include analysis elements of a dataset (Naumann, 2014). Some approaches connect the dataset to other resources to add more context or to generate richer profiles, for example spatial or topical profiles (Fetahu et al., 2014; Shekhar et al., 2010). Most studies in this space work on datasets from a specific domain or on particular types of data such as graphs or databases. The result is not necessarily a human-readable text summary, but a reduced, higher-level version of the original dataset (Saint-Paul et al., 2005). Rich dataset profiles can potentially support the dataset retrieval process.

Chapter 3

Working with structured data

This chapter presents our scoping study, investigating information seeking behaviour for structured data. The results of this study can be used by data portal providers and designers of data discovery and exploration tools to help them understand what system capabilities their users need and appreciate, and how to design novel methods to describe, present and assess structured data.

3.1 Motivation

The objective of this study was to better understand people's interaction with structured data as part of their work. We examine the information seeking behaviour of people looking for new sources of structured data online, including the task context in which the data is going to be used, dataset search, and the identification of useful datasets from a range of possible candidates.

To study the requirements of our scenario outlined in Chapter 1, we followed a mixed-methods approach, combining semi-structured in-depth interviews, which we analysed using thematic analysis, with a search log analysis from a large open governmental data portal.

As discussed in Chapter 1, this work is primarily motivated by the ubiquity of structured data on the web and its various applications, in particular in the context of initiatives such as Open Government Data (Ubaldi, 2013), which has already led to the publication of millions of structured datasets available on the web for everyone to access and reuse (Manyika et al., 2013).

The interviews focused on the information seeking activities of people who work with data as part of their daily jobs, such as scientists, data analysts, financial traders, IT developers, managers, and digital artists. We talked to 20 data practitioners in such roles across a range of different domains. Based on self-reports, our participants were defined as people who worked with data in their jobs. However, that was not necessarily their only, or main, work activity. The responses showed that people across different skill sets and professional backgrounds follow common workflows when engaging with structured data.

We envision this study to be useful in several ways: to the best of our knowledge, it is the first study of its kind that tries to shed light on how data practitioners look for data online, including a qualitative component with in-depth interviews and a quantitative analysis of a unique dataset that has not been explored so far. The resulting framework and guidelines will be used in our own work, but can also be used by open data portal providers, helping them understand what system capabilities their users need and appreciate, and how to design novel methods to automatically describe, present, assess, and rank structured data. At the same time, we believe that the findings of this study advance the field of Human Data Interaction by identifying areas for research and further improvement, in particular around a more rigorous evaluation of the user experience in data portals and in data search. In order to achieve that we can learn from related areas such as web search engines and user experience testing; improve the ranking and presentation of datasets matching a user's information need; and improve the automatic generation of rich (metadata) summaries for large datasets to help users inspect and understand them.

3.2 Methodology

This section outlines the methodology behind our study, addressing the following research questions:

• How do people currently search for data?
• What are the characteristics of information seeking tasks for data?
• How do people evaluate data that they find?
• What types of work tasks do people do with data?
• How do people explore data that they have found?

We used a mixed-methods approach, with a strong focus on the qualitative aspect, to improve our understanding of how people work with data. To expand (Bryman, 2006) our findings around the specific question of dataset search, we also analysed the search logs of a large open data portal quantitatively. We believe that using the logs alongside the qualitative data was meaningful for several reasons:

(1) The search logs were relevant to the study, as 17 out of 20 participants cited this specific portal as a tool they had used to search for datasets.
(2) The analysis of the search logs gave us a less obtrusive way to learn about the behaviour of data search users (Jansen, 2006), of whom our interviewees are a subset.
(3) We used the quantitative insights only as a way to add more breadth to our enquiry, without making any claims about direct links between the two samples.

3.2.1 In-depth interviews

User-system interactions are influenced by factors that are not easily observable or measurable (Kelly, 2009). For this reason, and given the exploratory nature of this research, we focused the study on its qualitative element. We believe the rich data about interaction processes and workflows we were looking for was best provided by in-depth interviews.

3.2.1.1 Recruitment

Our sampling strategy was purposive, aiming to include a spread of sectors and a wide range of skill sets and roles. Participants were recruited via targeted emails and social media and asked to fill out an online scoping survey. Emails were sent to preselected relevant participants, namely members of the UK's governmental Open Data User Group (around 15 people). For social media recruitment we used the ODIHQ1 account (at the time of the study over 31k followers, 5.3k impressions, 90 interactions, 20 retweets).

1 https://theodi.org/

The survey helped us select the participants in a more targeted way. The criteria for inclusion were: people who are using data in their day-to-day work; and the diversity of domains, skills, and professional backgrounds. The scoping survey covered questions about the tools participants used, the type of tasks they performed with data, how often they interacted with new sources of data, and whether they searched for data. The survey can be seen in Appendix A.3. We tried to cover various domains and professional backgrounds to gain a broad overview and avoid unintended biases. The respondents identified as relevant at this stage were contacted via email to arrange a Skype or face-to-face interview of circa 45 minutes. We recruited 20 participants, who classified themselves as working with data. They are described below and listed in Table 3.1.

3.2.1.2 Description of participants

The sample consisted of n = 20 data professionals (17 male and 3 female). The majority were based in the UK (n = 16), with others working in Germany (n = 1), the USA (n = 1), France (n = 1), or globally (n = 1). Their roles, as reported by the participants, are shown in Table 3.1.

P | G | Role | Sector
1 | F | Crime and disorder data analyst | Public administration
2 | M | Trainer for data journalists | Media & Entertainment
3 | M | Data editor & journalist | Media & Entertainment
4 | M | PhD researcher, social media analyst | Education
5 | M | Senior research scientist | Technology & Telecoms
6 | M | Data scientist | Technology & Telecoms
7 | M | Lead technologist in data | Technology & Telecoms
8 | M | Data consultant and publisher | Technology & Telecoms
9 | M | Senior GIS analyst and course director | Geospatial/Mapping
10 | M | Research and innovation manager | Public administration
11 | M | Researcher | Transport & Logistics
12 | M | Semantic Web PhD researcher | Science & Research
13 | F | Project manager | Environment & Weather
14 | M | Quantitative trader | Finance & Insurance
15 | M | Data manager | Public administration
16 | M | Head of data partnerships | Business Services
17 | M | Lecturer in quant. human geography & computational geographer | Science & Research
18 | F | Data artist | Arts, Culture & Heritage
19 | M | Associate professor | Health care
20 | M | Business intelligence manager | Public administration

Table 3.1: Description of participants (P) with gender (G), their profession (Role) and sector they are working in (Sector)

They used both public (open) and proprietary data in different areas: environmental research, criminal intelligence, retail, social media, transport, education, geospatial information, smart cities, as well as biological, financial, and sensor data. They reported working with a large variety of common data formats.

Most interviewees stated that their tasks with data vary greatly and that the number of datasets they interact with – reportedly between two and 50 each week – fluctuates with the nature of their projects (mean of 13, median of 7). They reported acquiring the skills to work with data on the job (n = 9), from peers or self-taught (n = 15), by doing it or experimenting with data (P5), or through formal education (n = 12) (in particular on core pre-requisites such as the fundamentals of statistics) and professional training. All participants had searched for data for their work before, some on a daily basis (n = 3), others weekly (n = 8) and others less frequently. Half of the participants also reported using data that is new to them on a weekly basis (n = 10).

3.2.1.3 Data collection and analysis

We conducted semi-structured, in-depth interviews of circa 45 minutes, which were audio-recorded and subsequently transcribed. They were carried out via Skype or face-to-face over a period of six weeks in summer 2016.

The interviews were organised around the participants' data-centric tasks; the search for new data; and the evaluation and exploration of potentially relevant data sources, corresponding to the research questions. The interview schedule can be seen in Appendix A.4. The interviews were analysed using thematic analysis (Robson and McCartan, 2016) in NVivo, a qualitative data analysis package for coding. We applied two layers of coding to be able to look into the data at different levels of generality and from different viewpoints. As a primary layer, we used deductive categories mapping to stages of the data interaction process:

• Tasks: what people use data for
• Search: how and where to search
• Evaluate: what information they need about the data to decide what to select and whether the search results were useful
• Explore: how they explore and understand datasets

For each of these themes we applied a second layer of coding, in which we used an inductive approach (Campbell et al., 2013) to draw out further details. The resulting themes were then used to further categorise tasks and user needs, as can be seen in the findings. The coding was done by the author of this work, but to enhance reliability two senior researchers checked the analysis for a sample of the data.

3.2.2 Search logs

Search engines are routinely evaluated from a user's perspective, for example by studying click behaviour and eye tracking, as well as from the system perspective by using metrics such as precision and recall (Cleverdon et al., 1966). As pointed out earlier, most literature is centred around documents (news articles, web pages, text, etc.). State-of-the-art search evaluation metrics incorporate behavioural characteristics such as time spent on a page, clickthrough streams, and user activities such as tags, printing, and purchasing (Fox et al., 2005).

However, these metrics are not straightforward to apply when the results returned by the query are datasets. Some work has been done in understanding structured queries against online databases such as public endpoints of the Linked Open Data Cloud (Beek et al., 2014; Morsey et al., 2011). However, their aim was to understand which parts of a dataset are more popular with users or how a relatively complex query language such as SPARQL is used, and not the interaction between users and structured data.

In this study, we take first steps in this direction. To supplement the in-depth interviews, we analyse the logs of a large open government data portal. This offers us a non-intrusive way to learn about the behaviour of data search users (Jansen, 2006).

We had access to the search logs generated via Google Analytics (GA) between 1st May 2015 and 30th April 2016, with a total of 100,970 queries, of which 52,824 were unique queries. Unique queries were identified by the fingerprint method provided by Open Refine (Pressac, 2016), as GA clusters only identical queries; the fingerprint method is more advanced. It first formats all query strings – changing all characters to lowercase, removing spaces and similar, and ordering query terms in alphabetical order – and then uses this structure to decide which queries are considered identical (Pressac, 2016).
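The following is a minimal sketch of the fingerprint idea as described above (lowercasing, stripping punctuation, and ordering the remaining tokens alphabetically), not OpenRefine's exact implementation; queries that map to the same key are counted as one unique query.

```python
import re
from collections import defaultdict

def fingerprint(query: str) -> str:
    """Normalise a query string into a clustering key: lowercase,
    drop punctuation, split into tokens, deduplicate and sort them."""
    cleaned = re.sub(r"[^\w\s]", " ", query.lower())
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster_queries(queries):
    """Group raw queries by their fingerprint key."""
    clusters = defaultdict(list)
    for q in queries:
        clusters[fingerprint(q)].append(q)
    return dict(clusters)

# Example: the three variants below collapse into a single unique query.
if __name__ == "__main__":
    sample = ["Air Quality data", "data air-quality", "air quality DATA"]
    print(cluster_queries(sample))
    # {'air data quality': ['Air Quality data', 'data air-quality', 'air quality DATA']}
```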

Over 80% of all portal users were identified by GA as being from the UK, with just over 26% of these located in London (GA collects this type of information). These users were responsible for 577,310 dataset downloads in the given time frame. 9.26% of users searched on the site from a mobile device (including tablets) and the rest from a desktop environment (90.74%). By comparison, in web search more than half of all searches are made from a mobile environment (Sterling, 2015).

Google Chrome was the dominant desktop browser used to access the portal (50% of users), while 27% used Internet Explorer, 11% Firefox and 10% Safari. Again, these results differ from web search, where Chrome is used in more than 70% of cases (W3Schools.com, 2016). Governmental portals like data.gov.uk may be more popular in certain professions, for example with civil servants, who often have restrictions on the usage of new technology; this could be the reason for the high percentage of Internet Explorer users. However, the other differences we observed hint at areas that providers of data tools and technologies might want to explore further.

In our analysis we considered the following data, which was readily available in the search log sample:

(1) Landing pages – pages through which visitors entered the site
(2) Sessions – the period of time a user is actively engaged with the page
(3) Exits – how often users leave a page after viewing it
(4) Search refinements – the total number of times a query is refined within a session
(5) Time after search – the amount of time visitors spent on a site after getting the results of their query
(6) Search depth – the number of pages visitors viewed

3.2.3 Ethics

The interview study was approved by the University of Southampton Ethical Advisory Committee (ERGO number 21909). Informed written consent was given by the participants. Access to Google Analytics to obtain data used for the search log analysis was kindly given to us by the UK governmental open data portal for research purposes. No personal information was used for the search log analysis.

3.3 Findings

In this section we present the findings of our study, which are structured around the themes used in the interviews: tasks, search, evaluation and exploration.

3.3.1 Data-centric tasks

3.3.1.1 Taxonomy of data-centric tasks

When asked about their activities with data, participants reported a wide range of tasks2, spanning from statistical analysis to using data as a material to create something – be that a service, a tool, or an artwork.

We categorised these activities into two broad categories:

(1) Process-oriented tasks – people think of these tasks as doing something transformative with data

(2) Goal-oriented tasks – people think of data as a means to an end.

Process-oriented tasks include: building tools with data; integrating data in a database; defining predictive statistical models; using data in machine learning processes; producing data; publishing data; and visualising it. Goal-oriented ones include: seeking the answer to a question; comparing datasets or data points; finding patterns; and allocating or managing resources in a data-informed way. Table 3.2 presents examples of both task types. Most participants reported having engaged in both process- and goal-oriented tasks at some point.

While the boundaries between the two categories are somewhat fluid, the primary difference between them lies in the ‘user information needs’ – that is, the details people need to know about the data in order to interact with it effectively. For process-oriented tasks, aspects such as timeliness, licences, updates, quality, methods of data collection and provenance were reported to have a high priority. For goal-oriented tasks, intrinsic qualities of data such as coverage and granularity were mentioned to play a bigger role.

2 For the purpose of anonymity we do not include direct quotes describing participants' tasks; instead we present descriptions of the tasks in this section.

Goal-oriented tasks | Process-oriented tasks
looking for an answer | publishing
identifying trends | linking
detecting events | visualising
comparing | mapping
establishing relations | input in statistical models
credibility analysis | building a tool
decision making | playing
evaluating interventions | producing
understanding change | integrating in e.g. a database
finding a story | processing

Table 3.2: Data-centric tasks as reported by the participants in this study

Another way to categorise tasks is based on the specific activities that involve data. Based on feedback from the participants, we identified five activities: (1) linking; (2) analysing; (3) summarising; (4) presenting; and (5) exporting.

These categories are not exclusive or exhaustive and the participants reported undertaking one or several of them, both in a process- and goal-oriented task context. However, as we will see later in the discussion, we believe they lead to different interaction requirements on a system such as a data catalogue or search engine.

Linking (n = 14) is about finding commonalities and differences between two or more datasets. From an interaction point of view, this requires capabilities to view datasets next to each other and be able to spot meaningful relationships. The other classes of activities usually concern only one dataset or datasets that have already been linked. In analysis tasks our participants mostly reported time series analysis (n = 10). In these tasks data is ordered by time; the aim is to identify trends, or detect and predict events. For instance, Chatfield (2016) looks at ways to carry out such tasks effectively. Summarising (n = 11) involves creating a more compact, meaningful representation of the data. This could be used to inspect a dataset or as a means to tell a story with data. As elaborated in the next section, data summaries raise questions related to the types of information that are useful in this context, the best approaches to obtain or generate them, and the related user experience. Presenting (n = 9) includes activities that transform data into human-friendly formats, such as visualising them or producing textual descriptions of the data. Finally, exporting (n = 11) refers to all aspects around producing and publishing a dataset in a given format, including metadata.

3.3.1.2 Complex tasks

A common scenario reported by all interviewees was trying to answer high-level questions with data to explain change, establish causalities or understand behaviour. This is characteristic of the class of goal-oriented tasks introduced earlier. Gaining a better understanding of the problem area and adding different sources together gets people closer to answering their question. This often means breaking it down into more manageable sub-questions, which only become clear after using a particular dataset.

The range of activities can be very broad, including all five types discussed in our taxonomy. Examples are understanding why crime in a specific area has risen over the last 10 years or comparing user experience on the internet for people of differing socioeconomic status by considering multiple factors. Participants talked about their journey when searching for data in such cases, which often leads them from one dataset to the next. The following example illustrates this process: one participant discussed a project that aimed to create a map showing the locations of defibrillators in a geographic area, to identify where new defibrillators might be needed. He considered demographic data, as certain age groups are more prone to heart attacks. Furthermore, he tried to consider not only where that population lived, but also where they spend their time during the day. A meaningful interaction with a data catalogue such as data.gov.uk would allow data publishers and users to create links between such related datasets to facilitate browsing and exploration. This could include visualisation of links between datasets to aid discovery, or recommendations of datasets based on content or structure of the dataset.

3.3.1.3 The impact of data quality on task outcomes

Another aspect that emerged from the interviews was the use of data to increase accountability in decision-making. However, concerns were raised about the role of data quality in the process. It was clear from the interviews that people are often aware that they are not working with the ideal data for a particular task.

(P15) So although the data is a good quality, it’s not really designed for my purpose so actually, for my purposes, there’s quite a lot of uncertainty and quite a lot of risk in that, it’s still the best data we have but it’s that knowledge of how it was, why it was created, against how I want to use it.

While this is a well-known challenge, the majority of the participants agreed that, as long as people are aware of the limitations of their data, and these limitations are clearly communicated (e.g., through documentation), it can be factored into decisions being made based on that data. The following comment is indicative of the views of the participants on the matter:

(P15) In the relationship between the data that’s available and the decision that’s made, we’re saying this is the data we’ve used to make this decision, we’re aware that it’s not the best data but it’s still the best we have so there is uncertainty there but are you happy with that level of uncertainty?

For data providers and the publishing platforms they use, this emphasises the importance of data governance and documentation, but also the need to consider different approaches to data quality and its context-specific nature. This can extend to assessing a given quality dimension automatically, as well as the presentation of reviews and other annotations together with a dataset. We describe the different aspects considered relevant to assess the quality of data in Section 3.3.3 (Table 3.3).

Reusing data in new contexts.

One participant discussed the idea of data as a material that can simply be shown in different forms and so reveal its content, or one that could also be used to create new things. Data in that context was used to ‘expose invisible rhythms and invisible systems that are beyond the human sensory capacity (P18)’. For some people, data is interesting in and of itself and not necessarily used as a means to an end. This could be seen as a process-oriented task which can focus on presenting as well as summarising.

At the same time, participants reported using data in a variety of different contexts, which has implications for the most effective ways to publish it to support wide reuse:

(P18) If data is a material - what can you do with it? Implicit in that is the idea that data can be used for anything, cause its really flexible [..] The things that can come out of the same set of data can be really different.

Participants reported refining their data tasks during the information seeking activity, based on the specifics of the datasets they find, a characteristic of exploratory search (White and Roth, 2009).

3.3.2 Search for structured data

In this section we discuss how people search for datasets. Many participants (n = 17) had to regularly search for data in their work. All of these reported experiencing difficulties finding data in the past. Others (n = 3) were provided with either external or internally produced datasets or were mostly involved in collecting or publishing data rather than using someone else's. However, all participants had previously tried to search for data online when they did not know whether it existed or not. They reported using: web search engines (e.g. Google); bespoke websites (e.g., portals, catalogues); recommendations from other people; and Freedom of Information (FOI) requests. More than half (n = 12) of the interviewed practitioners said that they regularly struggle when trying to find the data they are looking for.

3.3.2.1 Searching on the web

A majority of participants (n = 18) reported often using Google to find data online, especially when they do not know which institution holds the specific dataset they are interested in. Some reported always using Google as their first search strategy (n = 7). When searching for datasets, people most commonly employ a keyword search that is slightly adapted towards data search. Most participants (n = 17) mentioned terms such as ‘data’, ‘statistics’, ‘dataset’, alongside keywords describing other aspects of the data, in particular its domain. Some of the respondents (n = 7) reported preferring to use fewer keywords to make sure that the search engine returns a broad range of (perhaps less accurate) results, which they then filter manually.

3.3.2.2 Searching on portals

This section includes insights gained from the interviews and the analysis of the search logs. While 17 of the 20 people interviewed use that particular portal, the two samples are not directly comparable. However, there is a clear overlap between our qualitative findings and the user behaviour represented in the log data. The numbers presented here are based on the search log analysis, as explained in Section 3.2.

Queries.

Only 2,462 queries out of a total of 100,970 contained terms such as ‘data’ or ‘dataset’, which were mentioned by interviewees. This could point to the importance of specialised data search tools, which allow practitioners to use the same search strategies they are used to from the web. Furthermore, the search box on the portal was labelled ‘Search for data’. However, the format of the queries had other notable features, which hint at the qualities and characteristics of data that count for open data portal users.

555 (0.55%) queries mentioned a data format - we looked for all structured data formats supported by data.gov.uk: HTML, CSV, WMS, WCS, XLS, JSON, WFS, Esri REST.

5,384 queries (5.33%) used a category offered on the portal – environment, town, cities, city, mapping, government, society, health, governmental spendings, education, business, economy, transport – while a slightly larger share of 8,389 (8.31%) queries contained a location. The latter was determined through keyword matching with lists of towns, counties, regions and countries, which we extracted manually from a subset of those queries issued at least 15 times. Such strategies were mentioned by some of the participants (n = 8) in our interview sample:

(P7) It was somewhat hierarchical [..], like ‘transport’ would be the sector and then I’d put in a specific thing that I’m looking for which could be in the summary or metadata, so ‘driving licences’ and then in this case I included a particular feature that I wanted in the dataset as well, so it was ‘gender’.

The average query length in the logs was 2.44 words. This number refers to 99.9% of all queries – we excluded some outliers because of their length (more than 15 words) and popularity (query frequency less than 10). This query length roughly matches the situation in web search in 1998 – as search technologies improved, people started writing more detailed queries (Taghavi et al., 2012).
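As a hedged illustration of the kind of query-log analysis described above, the sketch below flags queries that mention a data format, a portal category or a location, and computes the mean query length in words; the keyword lists are abbreviated, illustrative stand-ins for the full lists used in the study.

```python
# Illustrative sketch of the query analysis described above; the keyword
# lists are abbreviated examples, not the full lists used in the study.
FORMATS = {"html", "csv", "wms", "wcs", "xls", "json", "wfs"}
CATEGORIES = {"environment", "mapping", "government", "society", "health",
              "education", "business", "economy", "transport"}
LOCATIONS = {"london", "manchester", "wales", "scotland"}  # abbreviated list

def analyse_queries(queries):
    """Count queries mentioning a format, category or location,
    and compute the mean query length in words."""
    counts = {"format": 0, "category": 0, "location": 0}
    total_words = 0
    for q in queries:
        tokens = set(q.lower().split())
        total_words += len(q.split())
        if tokens & FORMATS:
            counts["format"] += 1
        if tokens & CATEGORIES:
            counts["category"] += 1
        if tokens & LOCATIONS:
            counts["location"] += 1
    mean_length = total_words / len(queries) if queries else 0.0
    return counts, mean_length

# Example usage with made-up queries:
if __name__ == "__main__":
    demo = ["transport csv", "air quality london", "school performance data"]
    print(analyse_queries(demo))
    # ({'format': 1, 'category': 1, 'location': 1}, 2.6666666666666665)
```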

The participants in the interviews noted using a similar approach for queries via Google. Furthermore, we found 22.09% of queries issued on the portal were subsequently refined, while 80% of sessions with search had only a single query. The majority of users do not refine their queries - either they found what they were looking for, or they are not confident the system can give them the desired result.

Sessions.

More than half of the search sessions (52.08%) were done by people who landed on the site through Google. Just under a quarter (24.26%) visited the site directly. This backs up some of the comments we received about people preferring to start their search on Google. We believe this may be due to people not knowing that the portal exists before they search; people having the link to a particular dataset preview page already; or a perceived poor search capability of the portal.

As noted earlier, the majority of interviewees (n = 17) mentioned having experienced difficulties when searching for data on the web, either because the data was indeed unavailable or because it was not easy to find with the existing tool support:

(P17) I often find on most of the websites that provide data [..] often the search function is pretty rubbish, so that's why I often find myself falling back to Google [..] The difference [from searching for web pages] is that it just takes a hell of a lot longer in my experience

In addition, participants (n = 16) described finding data as a complex, iterative process:

(P17) I would get some things that looked really promising but weren't and then finally, through some kind of mysterious combination of search terms, I suddenly came across the dataset I'd been looking for the entire time

More than half of the search sessions were done by new users. The data contained 58.88% (122,822) new users and 41.12% (85,774) returning ones. Returning users showed a longer session duration (around 10 minutes versus six for newcomers) – we assume those who came back were more familiar with the site content and engaged more with the portal. The time spent on site after search was on average 3.03 minutes and the number of pages viewed after getting results for a query was around 2.39. Just under 23% of users issued a query and left the site immediately after retrieving their search results. From the interviews, we have learned of the complexities of the subsequent steps in evaluating, exploring and eventually deciding to use the data. We assume that the users exiting the site just after landing on the search results page were likely to do so because they could not find what they were looking for:

(P5) I think I would fall back to like a which is more broad and has a lot better coverage, that someone else would have, I would find a page where someone would have discussed it, mentioned the corresponding dataset. (P10) I have to do an awful lot of either filtering or sorting through pages to find what I’m looking for.

3.3.2.3 Human recommendation and FOI requests

The other two approaches reported by participants included asking colleagues or directly asking for data from institutions which are likely to hold it, for instance via Freedom of Information (FOI) requests (n = 3): requests to a public sector organisation to disclose a particular type of data, by virtue of the Freedom of Information Act (GOV.UK, 2014).

Obtaining recommendations from colleagues or people working in the respective field who are likely to know about a dataset was very common (n = 14). The majority (n = 15) reported that they ask other people for data, either in their immediate environment or in public institutions:

(P13) I’m phoning people and asking them if they could give us their data. (P15) So generally I tend to go through contacts rather than searches.

Another factor that sometimes affects the user experience (n = 3), especially in professions which work with official statistics, is the inability to know whether the data they are searching for actually exists or is simply difficult to find. FOI requests were deemed useful to handle such uncertainties.

(P3) [..] a lot of our articles are FOI based, that's where we get the data from [..] If we actively put out an FOI knowing what we want, we know the data's going to be good because we know what to ask for, we know how to get the information we want.

3.3.3 Evaluation and exploration

Once some data of interest is found, participants (n = 9) reported spending a considerable amount of effort deciding whether to use it. They reported on a process consisting of two broad stages, each using different methods and data features: (1) a first look at the data to get an initial impression (n = 19 of participants); and (2) a subsequent, more thorough exploration. In each of the two stages people reported using slightly different types of information about the data in question.

Data properties.

Participants (n = 16) reported using different aspects of a dataset in order to decide whether it is useful for them. We grouped them into three categories, as listed in Table 3.3: (1) relevance to the task; (2) usability; and (3) quality.

Assess | Information needed about
Relevance | context, coverage, original purpose, granularity, summary, time frame
Usability | labelling, documentation of variables, license, access, machine readability, language used, format, schema, ability to share
Quality | collection methods, provenance, consistency of formatting / labelling, completeness, what has been excluded

Table 3.3: Information needs when selecting datasets

A number of participants (n = 9) raised issues around finding a relevant and usable dataset, e.g., the data might be relevant, but aggregated to an extent that it hides levels of specificity that they were hoping to obtain from it. For example, averaging deprivation of an area might not show pockets of deprivation within that area.

Another common theme (n = 15) was the struggle with inconsistencies of labelling and a lack of documentation, often even within the same institutions. An example was having to use multiple datasets because, owing to the organisational structure of the local authorities in a country, each authority publishes the same data for its respective region differently. An interviewee noted:

(P6) Documentation is most frustrating, there’s often data without documentation and fishing for this information is the hardest bit.

Information about collection methods that would lead to a better understanding of the data was repeatedly mentioned (n = 11). Knowing more about the choices that were made in the collection and initial processing was said to enable data practitioners to make a judgment of the impact these core data science activities had on their own analysis of the data and thus help mitigate the risk and uncertainty associated with reusing someone else's data:

(P10) I spend an awful lot of my time trying to match data with other people’s data and in doing that, you may be spend more of the time researching the data than actually using the data

One participant talked about difficulties in finding fully representative data. He explained that depending on what the data is about, the accuracy of the data can vary. One example was given about data relating to the epidemiology of snake bites – snake bites are much more likely to be recorded in urban areas than in rural areas. Therefore there is a paucity of data in rural areas and, while there is data for urban areas, it is not necessarily representative of reality:

(P19) In Brasil, where there is data there are no snakes and where there are snakes there is no data.

The first look.

We compiled a list of actions participants reported in Table 3.4. People make judgments about whether the data is what they expected. They want to evaluate quality, establish trustworthiness and ‘understand’ the data. Many described trying to get to know the data (n = 16) by doing a basic visual screening, and a resulting sense of relationship with the data they are using:

(P1) [..] look at the column headers and maybe literally read the first two rows of data just to get an idea of what's actually in it. (P10) [..] looking at the nuts and bolts of the very basic concepts of what that data is.

First steps of data exploration:

scrolling through the data / basic visual scan
looking at headers
looking for obvious errors
looking at summarising statistics
filtering relevant data (pivot tables)
visualising to see trends / peaks
looking at the schema / format
understanding the semantics of the data
looking at documentation / metadata
checking the publication date and provenance
trying to understand the purpose of the original dataset

Table 3.4: First activities when engaging with a new dataset, as reported by participants

At this initial stage, people also reported looking for obvious errors such as missing data, inconsistency, out-of-range values or duplicated unique identifiers:

(P8) I’m looking at [..] the coverage of the data, so does it cover the geographic area I’m interested in? Or the time period that I’m interested in? And does it do that to the level of detail I need? (P16) Once I’ve had a quick scroll across, I would then probably look at a couple of records right the way across to see if I can make sense of it

The majority of participants (n = 18) had experience of working with datasets they have little information about and were aware of the challenges this brings:

(P17) There are a lot of people who radically underestimate the amount of work that needs to go into understanding the data in the first place, before you can even start doing research

Exploration.

It emerged that people build a notion of quality and trust of the data during exploration, while being aware of the limitations of the methods they use to assess both:

(P14) Scrolling through the bottom right corner of Excel and going ‘yeah, that's a lot of data’, you haven't done anything. It doesn't actually tell you anything but I would definitely do that. (P6) It's very difficult first when you download new data, to have a quick idea of what the data represents, a quick summary of the data.

Quality was reported to be associated with clean data (n = 6). However, several interviewees also mentioned that cleaning data can lead to deleting information that would be relevant for a different task. This is illustrated by one participant discussing a dataset showing the occupancy of local car parks. Due to malfunctioning sensors the car park can appear to be over-full. This can be easily checked (and corrected) by making sure the capacity of the car park is never more than 100%. However, if somebody wanted to use the data to understand the reliability of the sensors, these corrections would remove all relevant data. As discussed earlier, a preferred solution for such challenges is to have access to information about how data was collected and its initial purpose.

(P17) Helping people to understand what’s in the data is incredibly useful and also what has been excluded from the data because obviously, a lot of the time, data gets cleaned before it gets put online for other people to work with so what were the cleaning choices? What constitutes a valid record? How did you filter out bad records?

Another critical dimension discussed by the participants was exploration tool support. After finding what looks like a promising dataset, people (n = 12) said they often have to download it to establish whether it is actually the data that they were looking for, due to poor description of the dataset:

(P 17) Even when I’ve found what looks like the right dataset, I still have to download it and look at it because [..] they often give you a preview of the first few rows and that’s like a nice starting point, although it doesn’t deal very well with if you’ve got 24 tabs.

When discussing barriers in searching for data, one of the main issues mentioned by most participants was the inability to judge whether a dataset is useful before downloading and exploring it. Participants discussed the concept of dataset descriptions (or summaries) that would help them assess a dataset's potential quickly.

(P 7) I would usually expect some sort of summary description or meta data that goes with it so hopefully some of that and I would read the content in that human legible format.

(P 5) Some high level description of the data [..] like a Read Me file that I would expect to accompany the data and then I try to understand the format because sometimes, it seems to be simple but it's more complicated than what it looks like, there may be missing columns or there may be some issues and I try to understand what's the meaning of the fields, its structure..

In the financial sector, for instance, data is stored in binary form, but can be easily queried and displayed in human-readable formats. Similar tools that allow users to more easily filter relevant parts of a dataset would make working with data more efficient:

(P 14) Honestly, if I need data, if I need a data dump in my current job, Bloomberg has a plug-in for Excel and I can get almost everything I want from there. (P 5) I think it’s important to have tools that you let you explore the data in a very automatic way because that’s a repetitive task, it will come up again and again.

3.4 Discussion of findings

In this section we elaborate on the implications of this study. We propose a framework that describes the interaction with structured data alongside design recommendations for data publishers and designers of data platforms such as catalogues and marketplaces.

3.4.1 Framework for Human Structured-Data Interaction

Based on the information seeking models and the data science process (e.g. Kuhlthau, 2004; Pfister and Blitzstein, 2015) described in the background, and on insights gained during this study, we understand the process of working with structured data as a five-pillar model: (1) tasks; (2) search; (3) evaluation; (4) exploration; and (5) use. The process is iterative by design and can involve returning to previous activities at any point – for example, the results of the exploratory data analysis (pillar 4) may lead the ‘data workers’ (Cattaneo et al., 2015) to consider evaluating other sources of data (pillar 3), or start a new search attempt (pillar 2). We characterised each pillar based on the descriptions of the participants by defining common categories for data-centric tasks (pillar 1) and identifying which data properties (both intrinsic, such as data attributes, granularity and errors; and extrinsic, such as metadata, e.g., release date and licenses) people consider relevant for the subsequent three pillars (search, evaluation and exploration). A detailed description of pillar 3, ‘evaluation’, can be seen in Table 3.3 in Section 3.3.3.

Similar to Yi et al. (2007), who introduced a taxonomy of tasks in information visualisation, we believe it is critical for system designers in data search to identify user task types in detail, and tailor functionalities accordingly. For this reason we proposed earlier a taxonomy of data-centric tasks with two dimensions – one categorises tasks as either process- or goal-oriented, the other differentiates between five core types of activities: linking; analysing; summarising; presenting; and exporting. This taxonomy, and the broader framework it is part of, are aimed at helping system designers and publishers of data understand what people do when searching for and engaging with datasets and inform the decisions they make.

Figure 3.1: Framework for interacting with structured data

For an overview of the workflow through a search for data, the framework in Figure 3.1 can be read from left to right, taking into account that this is usually an iterative process. This can be of interest for researchers, e.g. in the area of information seeking, interested in the specificities of data search. It can be used as guidance to structure training for people learning how to find and use data, defining how they should be led through the process. Further, individual pillars can be prioritised, and the framework can be used to identify the area of interest within the workflow. For data portals the focus is likely to be Pillars 2 and 3, as they might concentrate on the data discovery aspect. Ranking will take into account those metadata attributes that people perceive as relevant for their decisions to use or ignore a dataset. For data publishers Pillar 3 can be of particular importance – for example to refine their metadata vocabulary to cover those bits of information that people need in their assessment. For someone designing data exploration tools, Pillars 1 and 4 can inform their conceptualisation of user needs. A data catalogue designer will prioritise features such as data summaries and interlinking of datasets higher, as these activities are both common and important for data practitioners (Pillar 4), and will consider implementing social sharing and recommendation features (Pillar 2).

From the interviews and the search log analysis we learned that people experience difficulties finding datasets and that the information they need to evaluate their fitness for use is not always available or easy to interpret out of context. We found that participants expressed a need for more information about datasets before downloading them, such as high-level dataset summaries. Looking for data on the web emerged as more often than not an exploratory search task, involving iterations and complex cognitive processing. In her work on the information seeking process, Kuhlthau concludes that uncertainty in the exploration stage indicates a space for system designers and intermediaries to intervene (Kuhlthau, 2004).

In the following section, we propose design recommendations for data discovery and exploration tools, as well as for data publishers and providers. We do not claim these to be exhaustive, or equally applicable to all types of data across our taxonomy of tasks and appreciate that in some instances they confirm insights from existing literature. However, this initial study strongly suggests that this space is by no means standardised and the user experience when interacting with the web of data leaves room for major improvements.

3.4.2 Design recommendations

Data portals and data publishers.

We established that in search users need to be supported in evaluating datasets according to their relevance, usability and quality. This could be achieved by providing visual or textual indicators of these aspects on the interface, backed up by automatically computed metrics or user-generated reviews and annotations. An interesting direction of future work would be to understand how all this additional information should be spread out across the SERP, the dataset preview page and the dataset itself, to avoid overload.

As observed in our findings, data catalogues and similar platforms should allow additional filtering for the following types of information: location, provenance, format, licence, time frame and date, publishing date, location of publication and data schema. These filters would allow the user to direct the search process towards more desirable results. In addition, the search capability should be equipped to recognise specific types of keywords, such as dates, to optimise the accuracy of results. Providing this information together with search results would also be helpful for time series analysis tasks.

To support relevance assessments, we recommend also displaying information about the granularity of the data. One approach would be to display headers, summarising statistics or textual summaries as well as previews of the data, all of which could be provided alongside the search results. Furthermore, having more details on how the original data was collected emerged as critical. While such reproducibility concerns are more common in some data-intensive fields (e.g., science, official statistics), it seems they always help users develop a notion of trust in the data. While we appreciate that some dimensions of quality cannot be easily calculated automatically, hence creating an overhead for the data publisher with regard to documentation, there is still room for some easy fixes such as detecting empty fields and headers in CSV files, as supported by tools such as Good Tables (International, 2016). This would improve the search experience for users across all types of tasks, in particular for data interlinking and exports.
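As a hedged illustration of such easy fixes (a minimal sketch, not Good Tables itself), the code below flags empty or duplicated headers and rows with empty fields in a CSV file; the file name in the usage comment is hypothetical.

```python
import csv

def basic_csv_checks(path):
    """Minimal, illustrative checks: report empty or duplicated headers
    and count rows containing empty fields. Not a replacement for a
    validation tool such as Good Tables."""
    issues = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        if not header or any(h.strip() == "" for h in header):
            issues.append("missing or empty column header(s)")
        if len(set(h.strip().lower() for h in header)) < len(header):
            issues.append("duplicated column header(s)")
        rows_with_gaps = sum(
            1 for row in reader if any(cell.strip() == "" for cell in row)
        )
        if rows_with_gaps:
            issues.append(f"{rows_with_gaps} row(s) contain empty fields")
    return issues

# Hypothetical usage:
# print(basic_csv_checks("car_park_occupancy.csv"))
```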

We noted in our findings that many tasks people perform with data are complex and often span multiple datasets. Being able to identify and explore the links between these datasets would help the user get a better understanding of availability and context. Approaches such as Linked Data, alongside tools that would discover relationships between datasets (perhaps available in different formats), would be a very useful addition to the services already offered by data platforms.

When designing tools for data exploration we recommend taking the different types of data-centric tasks into account. Our findings suggest that these result in specific requirements on functionality which could be explored further. For instance, linking requires tools which allow the comparison of two or more datasets. These should be able to highlight common attributes or visualise datasets side by side to facilitate understanding. Tools which support displaying data in different forms, such as creating interactive visualisations, textual descriptions of the data, or highlighting patterns, would be beneficial for presenting and summarising tasks. For the latter, summarising statistics and representative samples as a preview of the data would be beneficial in allowing a bird's-eye view (P 5).

One approach for data sensemaking was proposed by Marchionini et al. (2005) in their work on relation browsers for statistical information. These are interfaces using facets, containing sets of categories which are displayed graphically, so the user can view the information from different perspectives. Users can choose how the data is represented (e.g., a visualisation, spreadsheet, samples), which means that a wider set of people can potentially use the data. Relation browsers have been shown to work on some forms of structured data, but have a limited number of possible facets. Further research is needed to determine the applicability of the concept for different data-centric tasks.

We learned that people are very interested in the history of the data. That means information about provenance, the processes of data collection, as well as subsequent choices made regarding, for instance, normalisation and cleaning. This is information that could be provided as documentation, capturing the legacy of the data, but also its reuse footprint, which could be displayed in a versioning system. We believe this could enable users who might struggle to understand an uncleaned version of the dataset to use a version which somebody else has already worked on. This would contribute towards the inclusiveness of data interaction tools.

Organisations publishing data should aim to support users better in evaluating the data according to its relevance, usability and quality in the context of a particular task. As mentioned earlier in this section, we recommend specific types of information to be made available and stored with the data. Collecting and managing this data should become a core part of the data governance process; the related overhead could be reduced by allowing manual annotations and additions from users, by auto-generating metadata, dataset summaries or specific indicators where possible, and through cross-validation.

For example, if a dataset is updated regularly, these updates could be matched with previous versions to provide consistent and machine-readable metadata.

3.5 Limitations

Every study, no matter how well it might have been conducted, has limitations. For the interviews, the majority of participants were male (n = 17) and working in the UK (n = 16). We interviewed a particular type of professional, though working in a wide range of sectors and roles. As ours was meant as an initial study into engaging with structured data online, having a large number of participants per sector was less of a priority and some sectors and roles were not covered (Cattaneo et al., 2015). However, we were able to find common themes easily in the responses, which supports our belief that our sample size reached a saturation which allowed us to get a rich understanding of people's interactions with data (Green and Thorogood, 2013).

Regarding our findings concerning search, we expanded the breadth of our study by including a second source of data based on a much larger sample. While we did not make any assumption about direct links between the interview and search log samples, the quantitative analysis is in itself novel and representative of a large category of tools people use to find datasets. However, search log analysis also comes with natural limitations: a biased sample of users from a single platform, albeit an important one, with a focus on public sector data, using keyword search (and not, for instance, browsing) to find the data that is right for them.

Finally, we recognise the problem with only using one author to code and interpret the data; the reliability and diversity of themes might have been different with more researchers. While we could not run any inter-rater reliability tests, we had two senior researchers overseeing the creation of themes in a sample of the data.

3.6 Summary

In this chapter we described a study that investigated how people engage with data in their daily work. By conducting in-depth interviews with data practitioners, we were able to obtain a better understanding of data-centric tasks, as well as of their search, evaluation and exploration strategies and the data qualities that influence these activities. Finding relevant data emerged as particularly challenging, mostly due to poor tool support and uncertainties around the availability of a given dataset for public use. To gain a better understanding of the requirements for better data search, we looked at search logs from one of the largest open government data portals. Our findings and the design framework and recommendations that followed them can inform the development of methods, interfaces and interaction models for core activities such as data search, evaluation and exploration. They can encourage the usage of data by people with different skill sets, for a variety of data-centric activities. Thus this study can be seen as a step towards understanding what usability of data interaction tools means.

We see a large space for future studies extending the current understanding of how people interact with data, which we discuss in detail in Chapter 6.

Chapter 4

Dataset summaries

In this chapter we report two complementary studies: an analysis of data-search diaries from 69 students, which offers insight into the information needs of people searching for data; and a larger summarisation study, with a lab and a crowdsourcing component involving 80 participants overall, who produced summaries for 25 datasets. In each study we carried out a qualitative analysis to identify key themes and commonly mentioned dataset attributes which people consider when searching and making sense of data.

4.1 Motivation

Our previous study, described in Chapter 3, has shown that, despite increased availability, data cannot be easily reused, as people still experience many difficulties in finding, accessing and assessing it. We discussed three major aspects that matter to data practitioners when selecting a dataset to work with: relevance, usability and quality (Koesten et al., 2017). For each of these aspects, people have to make sense of the content and context of a dataset to make an informed decision about whether to use it for their task.

Metadata is often limited and might not provide enough content to decide whether a dataset is useful for a task (Noy et al., 2019). One way to provide people with information about the dataset is in the form of a written dataset summary. Current dataset summaries vary greatly in terms of content, language and level of detail, and are often not fit for purpose (Neumaier et al., 2016; Koesten et al., 2017). From a user's perspective, having a textual summary of the data is therefore paramount: text is usually richer in context than metadata, and can be easier to digest than raw data or graphs (depending on the context and the quality of the representation) (Law et al., 2005; van der Meulen et al., 2010).

A review of related literature (Balatsoukas et al., 2009) concluded that textual metadata surrogates, if designed in a user-centred way, can help people identify relevant

documents and increase accuracy and/or satisfaction with their relevance judgements. Several authors have shown that textual summaries perform better than graphs in certain decision-making tasks. For instance, Gatt et al. (2009) found, in an evaluation of a system that summarises patient data, that all users (doctors and nurses) performed better in decision-making tasks after viewing a manually generated text summary rather than a graph. These findings are confirmed by Law et al. (2005) and van der Meulen et al. (2010) in studies comparing textual and graphical descriptions of physiological data displayed to medical staff. Sultanum et al. (2018) emphasise the need to integrate textual summaries to get an overview of clinical documentation instead of relying on graphical representations.

A textual summary can help people assess aspects of relevance, usability and quality of a dataset for their own needs (Au et al., 2016; Bargmeyer and Gillman, 2000; Lehmberg et al., 2016; Nguyen et al., 2015). It can potentially also improve data discovery, as search algorithms can match the text against keyword queries (Thomas et al., 2015).

In Chapter 2, we elaborated on existing practices and techniques to create text about structured data. In general, a good summary must be able to represent the core idea and effectively convey the meaning of the source (Zhuge, 2016). In this study, we aim to understand what this means in a data context: what a dataset summary must capture in order to help data practitioners select the data to work with more confidently. This is currently a gap in the human data interaction literature. The data engineering community, on the other hand, has created some standards and best practices for publishing and sharing data, including DCAT,1 schema.org,2 and Share-PSI.3

However, none of these initiatives offer any guidance on what to include in a dataset summary. Sometimes text summaries are generated automatically using so-called natural language generation (NLG) methods. These methods are commonly bootstrapped via parallel corpora of data and text snippets, but an extensive exploration of the qualities of these training corpora is missing (Wiseman et al., 2017). Overall, this leads to summaries that vary greatly in terms of content, language and level of detail, which are often not fit for purpose (Neumaier et al., 2016).

As described in Chapter 2, data search often starts on a data portal. Upon entering their query, users are presented with a compact representation of the results, which includes for each dataset its metadata (title, publisher, publication date, format etc.), a short snippet of text, and, in some cases, a data preview or a visualisation. Figure 4.1 shows an example. From a user’s perspective, having a textual summary of the data is paramount: text is easier to digest than raw data or graphs (Law et al., 2005; van der Meulen et al., 2010), and is richer in context than metadata. It helps people assess the relevance, usability and quality of a dataset for their own needs (Au et al., 2016;

1 https://www.w3.org/TR/vocab-dcat/
2 http://schema.org/Dataset
3 https://www.w3.org/2013/share-psi/bp/

Bargmeyer and Gillman, 2000; Lehmberg et al., 2016; Nguyen et al., 2015). It could also improve data discovery, as search algorithms can match the text against keyword queries (Thomas et al., 2015). No matter where the search journey starts, a textual description is often key to determining whether a dataset is fit for purpose, or whether the search needs to continue. Our two studies aim to understand the characteristics of this crucial element in the interaction between people and data.

Figure 4.1: Example of a data search result list on the UK governmental open data portal. Next to title, publisher, domain and format, we see a textual description of the dataset.


Figure 4.2: A dataset preview page on the UK governmental open data portal.

4.2 Related Work

4.2.1 Human-generated summaries of datasets

Summarising text is a complex and well-studied area of research in domains such as education, linguistics and psychology, amongst others (Yu, 2009). The cognitive processes triggered by this task, as studied in psychology, are described as involving three distinct activities: (i) selection (selecting which aspects of the source should be included in the summary); (ii) condensation (substitution of source material through higher-level ideas, or more specific lower-level concepts); and (iii) transformation (integrating and combining ideas from the source) (Bando et al., 2010).

Johnson defines a summary as a brief statement that represents the condensation of information accessible to a subject and reflects the central ideas or essence of the discourse (Hidi and Anderson, 1986). Describing or summarising something is a language activity grounded in culture: the concepts, definitions and understandings developed in a community. Differences in cultural contexts can lead to misinterpretation of dataset content or to difficulties in developing a common understanding of a dataset summary. Meaning from information – in our case the dataset and the accompanying summary – is always constructed by the reader, and is influenced by a variety of confounding factors.

Literature on text summarisation differentiates between writer-based summaries, which are summaries written for the writer herself, and reader-based summaries, which are written for an audience and usually require some planning (Hidi and Anderson, 1986). In this chapter we consider the latter.

Summarising structured or semi-structured data is inherently different to summarising free text. The complexity of constructing meaning from structured data (in contrast to text) has been discussed in the literature (Marchionini et al., 2005; Pirolli and Rao, 1996). Understanding data requires different cognitive work to contextualise it in relation to other information, which distinguishes it from summarising natural language text (Gkatzia, 2016).

In their review of summary generation from text, Gambhir and Gupta (2017) point out the subjectivity of the task and the lack of objective criteria for what is important in a summary. Summary quality is suggested to depend on its purpose, focus and the particular requirements of the task (Owczarzak and Dang, 2009). We focus on general-purpose summaries due to the variety of data-centric tasks we found in the study described in Chapter 3, and to maximise the potential of a summary to be relevant for tasks of a different nature. However, it would be interesting to explore task- or domain-specific dataset summaries in future work.

4.2.2 Automatic summary generation

Automatically summarising text to accurately and concisely capture key content is an established area of research, with a wide range of techniques, most recently neural networks, employing language models of varying degrees of sophistication (Boydell and Smyth, 2007; Gambhir and Gupta, 2017; Reiter and Dale, 1997; Wiseman et al., 2017). Two broad approaches to summarisation are reported in the literature: (i) extraction (or intrinsic summarisation); and (ii) abstraction (or extrinsic summarisation).

Extractive approaches aim to create a summary by selecting content from the source they are summarising. Abstractive approaches aim to paraphrase the original source to provide a higher-level content representation (Boydell and Smyth, 2007). Research has focused more on extractive methods as abstractive methods are rather complex (Gambhir and Gupta, 2017). Based on existing studies in human data interaction, we believe that meaningful dataset summaries likely require abstractive elements including quality statements, descriptive statistics or topical coverage of a dataset (Gregory et al., 2017; Kern and Mathiak, 2015; Boukhelifa et al., 2017).

Automatic generation of summaries for data is a comparatively newer field, although there have been significant advances in this area (Wiseman et al., 2017). Current approaches tend to work mostly in closed domains, and the complexity of performing these tasks is acknowledged in the literature (Mei et al., 2015). Data-to-text generation has been explored in several areas, such as health informatics (Gatt et al., 2009; Scott et al., 2013), weather forecasts (Gkatzia et al., 2016; Sripada et al., 2004), finance (Kukich, 1983) and sports reporting (Wiseman et al., 2017); as well as for different data formats, such as graphs, databases and trend series (Bechchi et al., 2007; Cormode, 2015; Liu et al., 2014; Roddick et al., 1999; Sripada et al., 2003; Yu et al., 2007). Recognised subtasks in this space include: content selection (selecting what data gets used in the summary) and surface realisation (how to generate natural language text about the selected content) (Gkatzia, 2016).
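As a concrete illustration of these two subtasks, the following minimal Python sketch applies a hand-written selection rule and a fixed template to a single hypothetical weather record; real data-to-text systems learn both steps from data, so this is only a toy example of the pipeline structure:

    # A toy data-to-text pipeline illustrating content selection and surface
    # realisation. The record and its field names are hypothetical.
    record = {"city": "Southampton", "date": "2015-03-01",
              "max_temp_c": 9, "rain_mm": 12.4}

    def select_content(rec):
        # Content selection: keep only the fields judged worth reporting.
        selected = {"city": rec["city"], "date": rec["date"],
                    "max_temp_c": rec["max_temp_c"]}
        if rec["rain_mm"] > 10:  # a simple rule standing in for a learned selector
            selected["rain_mm"] = rec["rain_mm"]
        return selected

    def realise(selected):
        # Surface realisation: turn the selected fields into a sentence.
        text = ("In %s on %s the temperature reached %d degrees Celsius"
                % (selected["city"], selected["date"], selected["max_temp_c"]))
        if "rain_mm" in selected:
            text += ", with %.1f mm of rain" % selected["rain_mm"]
        return text + "."

    print(realise(select_content(record)))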

Summaries produced with data-to-text generation methods are at the moment usually extractive rather than abstractive and tend to be merely textual representations of the dataset content, almost like a textual 'visualisation' (e.g. Wiseman et al. (2017)):

Extract taken from an automatically generated summary from Wiseman et al., 2017:

The Atlanta Hawks defeated the Miami Heat , 103 - 95 , at Philips Arena on Wednesday. Atlanta was in desperate need of a win and they were able to take care of a shorthanded Miami team here. Defense was key for the Hawks , as they held the Heat to 42 percent shooting and forced them to commit 16 turnovers. Atlanta also dominated in the paint, winning the rebounding battle, 47 - 34, and outscoring them in the paint 58 - 26. The Hawks shot 49 percent from the field and assisted on 27 of their 43 made baskets.

Much of the work in automatic summary generation requires gold standards for evaluation (Bando et al., 2010). These corpora are typically created manually, but their quality is uncertain and guidelines and best practices are largely missing (Gambhir and Gupta, 2017). Summary evaluation covers metrics computed automatically (e.g. BLEU, ROUGE, etc.), human judgement, or a combination of the two (Gambhir and Gupta, 2017; Owczarzak and Dang, 2009). A deep understanding of the best ways to run human evaluations, which criteria to use, the biases they create and so on is not yet available: most studies use criteria such as accuracy, readability and coverage, but they are small-scale and not analysed in great detail. We believe this is partially due to a limited appreciation of what a meaningful summary should contain. Evidence for best-practice dataset summaries could lead to more meaningful evaluation methodologies in this space, by informing the design of evaluation benchmarks.
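For orientation, the following from-scratch Python sketch computes a unigram-overlap score in the spirit of ROUGE-1; official evaluations use the standard toolkits, with stemming and multiple references, and the example candidate and reference summaries below are invented:

    from collections import Counter

    def rouge1(candidate, reference):
        # Unigram precision, recall and F1 between a candidate and a reference
        # summary; a simplified, from-scratch analogue of ROUGE-1.
        cand = Counter(candidate.lower().split())
        ref = Counter(reference.lower().split())
        overlap = sum((cand & ref).values())
        precision = overlap / max(sum(cand.values()), 1)
        recall = overlap / max(sum(ref.values()), 1)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Hypothetical example:
    p, r, f = rouge1(
        "a dataset of people killed by us police in 2015 with location and cause of death",
        "a list of people killed by us police forces in 2015 including location state and cause of death",
    )
    print("precision=%.2f recall=%.2f f1=%.2f" % (p, r, f))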

4.3 Methodology

We have undertaken two complementary studies in dataset selection and summarisation.

Based on the general lack of guidance, we focus this chapter on understanding the composition of meaningful summaries. There are very few studies that empirically evaluate any of the existing metadata standards in user studies: most efforts so far have concentrated on providing guidance for those who add information to a dataset, in many cases the data publishers. For the purpose of consistency, we refer to textual descriptions of datasets as summaries.

We explored the following research questions:

RQ1 What data attributes do people consider when determining the relevance, usability and quality of a dataset?

RQ2 What data attributes do people choose to mention when summarising a dataset to others?

As a starting point for the studies presented in this chapter, we make two assumptions: (i) textual summaries for datasets can be written by people without in-depth knowledge of data analysis and visualisation techniques; and (ii) summaries help data practitioners decide with more confidence whether to use a dataset or not. While we do not claim that text-based surface representations are superior to graphs, we believe they are, at a minimum, complementary to visualisations and accessible to a broad range of audiences, including less experienced users (Gatt et al., 2009; Koesten et al., 2017; Sultanum et al., 2018). Our findings support our assumptions. The crowdsourcing experiments showed that summaries can be created by people with basic data literacy skills who are not familiar with the dataset.

Building on insights from Chapter 3, the first study takes the next step towards understanding selection criteria in dataset search: we analysed the data-search diaries of 69 students who were asked to document in detail how they go about finding and selecting datasets. The students wrote 269 diaries, which we analysed thematically starting from the framework from Chapter 3. This resulted in a list of dataset selection attributes.

In the second and larger study, we carried out a mixed-methods approach (Bryman, 2006) in which participants summarised datasets in a writing task. We conducted a task-based lab experiment and crowdsourcing experiments with 80 data-literate participants overall, who created a total of 360 summaries for 25 datasets. We analysed the summaries thematically to derive common structures in their composition, which led to a list of dataset summary attributes. We grouped these attributes into four main types of information: (i) basic metadata such as format and descriptive statistics; (ii) dataset content, including major topic categories, as well as geospatial and temporal aspects; (iii) quality statements, including uncertainty; and (iv) analyses and usage ideas, such as trends observed in the data.

We found a core set of attributes that were consistently prevalent in the two studies, across different datasets and participants. We used them to define a template to design more meaningful textual representations of data, which resonate with what people consider relevant when describing a dataset to others, and when trying to make sense of a dataset they have not used before.

Across the studies, we were able to identify core attributes that were prevalent for different datasets and participants. We compared them to existing metadata standards for data publication and sharing to understand existing gaps and design a summary template. Figure 4.3 gives an overview:

Figure 4.3: Overview of research methods and outcomes

4.3.1 Data-search diaries

In the first study we analysed data-search diaries (a user-created record of their data search process) to get an in-depth understanding of the criteria that influence people's decisions to choose a dataset to work with. This also gave us insight into the kinds of information that need to be captured in dataset summaries to make them more useful for dataset sensemaking and selection.

Process.

We conducted a thematic analysis of 269 data-search diaries that were completed by 69 students4 for a data science project within a university course.

Their task was to produce an online, magazine-style article (a data story), using at least two datasets to produce a minimum of three data visualisations that followed a narrative structure (such as, for example, the Economist's Graphic Detail5). The participants were actively searching for datasets to work with and were instructed to write a diary entry for each data search task for two weeks. They were asked to find two to five datasets for their coursework. They were free to choose the topic of their project – there was hence no domain restriction to the datasets they could use or to the way they searched for the data.

The students were encouraged to document their data seeking behaviour directly after each search session and to reflect upon their data selection choices. The overall aim was to make them aware of the range of factors that come into play when looking for data, and of the importance of data sourcing for data science work.

We provided an online form with open-ended diary questions. The students self-selected when and what to report. For the purpose of our study, we focus on a subset of diary questions that concerned selection criteria for datasets:

• What do you need to know about a dataset before you select it for your task?
• What is most important for you when selecting a dataset for this task?
• What tells you that the data is useful and relevant for your task?
• What tells you that the data is good quality for your task?

Example data search tasks as described by the students include:

Example 1: I was looking for some datasets about tourism trends in Italy. I would like to find a few datasets which show how the tourism has changed in the last years.
Example 2: Details on the molecular composition of Titan's atmosphere

4 MSc Data Science (n=49), MSc Computer Science (n=10), MSc Operational Research and Finance (n=6), MEng Computer Science (n=3), MSc Operational Research & Statistics (n=1)
5 https://www.economist.com/graphic-detail/

Example 3: Finance data in UK across decades and years, categorised by gender, industry or region.

Analysis.

The free-text answers to these questions were analysed using thematic analysis (Robson and McCartan, 2016). Two researchers deductively coded the answers based on the framework for human structured-data interaction from Koesten et al. (2017), which defines relevance, usability and quality as general themes in dataset selection. As a second layer of coding we open-coded attributes emerging in each of these areas. This was done to obtain insight into how these high-level categories are operationalised by data searchers in practice. In this step, the coding was done by one researcher (the author of this work), but to enhance reliability two senior researchers checked the analysis for a sample of the data. The analysis resulted in a list of data selection attributes. As noted earlier, they helped us understand what kinds of information good summaries need to contain to help data practitioners choose datasets with more confidence.

Ethics.

Responses were part of university coursework at the University of Southampton. Participants consented to the data being used for research when joining the course. No personal data was analysed or reported.

4.3.2 Dataset summaries

4.3.2.1 Datasets: the Set 5 and Set 20 corpora

We used openly published datasets available as CSV files from three different news sources: FiveThirtyEight6, The Guardian7, and Buzzfeed8. We selected 'mainstream' datasets, understandable in terms of topic and content, excluding datasets with very domain-specific language or abbreviations. The datasets had to contain at least 10 columns and English strings as headers. The datasets varied across several dimensions: value types (strings, integers); topics; geospatial and temporal coverage; formatting of dates; ambiguity of headers, for example abbreviations; blank fields; formatting errors; size; and mentions of personal data.

The sample contained 25 datasets. We divided them into five groups, each containing two datasets from FiveThirtyEight, two from The Guardian and one from Buzzfeed. One of these groups was our first corpus, Set 5. Set 5 was made of datasets D1 to D5, which are described in more detail in Table 4.1. We used Set 5 in both experiments (see below). The remaining four groups of datasets (5 datasets per group, 20 in total) formed our second corpus, Set 20. Set 20 consisted of datasets E1 to E20 and was used only in the crowdsourcing experiment.
6 http://fivethirtyeight.com/
7 https://www.theguardian.com/
8 https://www.buzzfeed.com/news

Working with Set 5 in both experiments allowed us to compare summaries generated by two different participant groups. Set 20 enabled us to apply our findings across a greater range of datasets (characteristics of the Set 20 datasets can be seen in Appendix B.6). All datasets are available on GitHub9.

Dataset | Topic | Example characteristics
D1 | Earthquakes | >10k rows, 10 columns, dates inconsistently formatted, ambiguous headers, granular geospatial information
D2 | Marvel comic characters | >16,000 rows, 13 columns, no geospatial information, many string values, limited value ranges, missing values, yearly and monthly values
D3 | Police killings | >450 rows, 32 columns, contains numbers and text, geospatial information (long/lat as well as country, city and exact addresses), personal data, dates as year/month/day in separate columns, headers not all self-explanatory, some domain-specific language
D4 | Refugees | 192 rows, 17 columns, mostly text values, formatting inconsistencies, ambiguous headers, identifiers, geospatial information (continent/region/country), no temporal information
D5 | Swine flu | 218 rows, 12 columns, formatting inconsistencies, geospatial information (countries as well as long/lat), links to external sources, identifiers (ISO codes), some headers not straightforward to understand

Table 4.1: Datasets in Set 5

4.3.2.2 Dataset summaries: Lab-based experiment

The objective of this experiment was to generate summaries of datasets written by data-literate people who were unfamiliar with the datasets they were describing. Our assumption was that by asking people to summarise datasets unknown to them, they would create summaries that are relatable to a broad range of data users, and would be less biased in their descriptions than people who had worked with that data in the past, or had created it themselves. While this needs to be validated in future work, we further aimed to bring the initial exploration and scanning of datasets into this summary creation task through our task design. This was reported to be a crucial step in the assessment of a dataset's appropriateness for a given task by the participants in

9 https://github.com/describemydataset/DatasetSummaryData2018

Chapter 3. Each participant was asked to summarise the datasets from Set 5. Having multiple summaries for the same datasets allowed for more robust conclusions.

Pilot.

We first conducted a pilot study with one dataset, six participants and different task designs. The aim was to get an understanding of the core task parameters, such as the time allocated to complete the task, and basic instructions about the length and format of the summaries. These parameters were important to constrain, as we wanted people to report only the most important features of datasets, rather than try to document everything they could see. The pilot dataset was on the topic of police searches in the UK; it had 15 columns and 225 rows, and contained missing values, geospatial information, temporal information (dates and age ranges), some inconsistencies in formatting, and domain-specific language. We experimented with several task durations and varying restrictions on the number of words of the summaries. Before starting the study, we conducted an additional two pilots using the same five datasets as used in the study, to better understand feasibility and timings. We did not impose any restrictions on the participants' writing style (e.g., full text, short notes, bulleted lists). Based on the pilot, we decided to ask participants to write summaries of up to 100 words, with no time limit.

Recruitment.

We recruited participants who would be the primary target audience for textual data summaries, called 'data practitioners' for the purpose of this work. This was defined in the call to participation as 'people who have been engaged in projects involving data'. While some of the subjects were very experienced in data handling, we chose not to restrict participation to formally trained data scientists, as the majority of people working with data are domain experts (Boukhelifa et al., 2017; Kern and Mathiak, 2015).

Our participants declared that they had either received some training in using data or worked with data as part of their daily jobs. Previous research has shown that this group depends on summaries to select datasets with confidence (Gregory et al., 2018; Koesten et al., 2017). By contrast, most data scientists and engineers can more easily resort to a range of specialist techniques, such as exploratory data analysis, to make sense of new datasets.

We recruited participants through a call on social media via the author's institution. The call was published on the institution's website and a link to the call was posted on Twitter. The Twitter account had, at the time of the study, over 38.9k followers, with 23.025k impressions, 435 interactions and 91 retweets10. Our sample consisted of n=30 participants (19 male and 11 female), all based in the UK at the time of the study.

Two thirds of them were UK nationals (n=20) and most had a Bachelor's or Master's level education (n=26). All sessions were carried out between July and August 2016.
10 https://support.twitter.com/articles/20171990

Process.

Respondents to the study call were contacted via email to receive an information sheet (as can be seen in Appendix B.3). We arranged a time for the experiment at the author's institution with those who volunteered to take part in the study. The task was formulated as follows: We ask you to describe the datasets in a way that other people, who cannot see the data, can understand what it is about.

Participants could open the CSV files with software of their choice; we suggested MS Excel or . We asked them to describe all five datasets in up to 100 words each, one dataset at a time, in a text document. The order of the datasets was rotated to prevent potential order effects, due to fatigue or learning, that could influence the dataset summaries.
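The rotation of dataset order can be sketched as a simple cyclic counterbalancing scheme, shown below in Python; this is only an illustration of the idea, and the exact rotation used in the study may have differed:

    def rotated_orders(items, n_participants):
        # Cyclically rotate the presentation order across participants so that
        # each dataset appears in different positions (a simple counterbalancing
        # scheme; a full Latin square is another common choice).
        orders = []
        for p in range(n_participants):
            shift = p % len(items)
            orders.append(items[shift:] + items[:shift])
        return orders

    for order in rotated_orders(["D1", "D2", "D3", "D4", "D5"], 5):
        print(order)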

Analysis.

We collected 150 summaries, 30 per dataset. In our analysis we focused on the following aspects: (i) form (e.g. full sentences, bullet points, etc.) and length of the summaries; (ii) high-level information types people consider relevant for data sensemaking; and (iii) specific summary attributes (as shown in Figure 4.4).

To get a sense of the surface form of the summaries, we counted how many of them used full sentences, bullet points, tables or a mixture of the three. For their length we counted the number of words using the word-count feature in a text editor. (These descriptive findings are reported in Section 4.5.1.) To derive information types and their attributes, two of the authors independently analysed the data inductively to allow themes to emerge (Thomas, 2006). We ascribed open codes in an initial data analysis and explored the relationships between the codes in a further iteration (axial coding). We identified higher-level categories (information types) by examining properties that were shared across the codes. We adopted this approach because of the open nature of the research questions. As shown in Figure 4.4, the specific summary attributes that we present are codes that were drawn together within these general information themes. We aimed to identify the composition of summaries produced by our participants and understand the relative importance of particular attributes. Therefore we present our findings from two viewpoints: the higher-level information types and the more granular summary attributes in Section 4.3.2. We used NVivo, a qualitative data analysis package, for coding. In each of the two iterations, we cross-checked the resulting codes, refined them through discussions with two senior researchers, and captured the results in a codebook. We documented each code with a description and two example quotes. Two senior researchers reviewed the conflict-prone codes based on a sample of the data. The unit of analysis was a summary (n=150) for the same dataset.

Figure 4.4: Information types (1) and emerging summary attributes (2) from the thematic analysis of the lab summaries, reflecting our coding process

Ethics.

The lab experiment was approved by the University of Southampton's Ethical Advisory Committee under ERGO Number 28636. Informed written consent was given by the participants prior to the experiment.

4.3.2.3 Dataset summaries: Crowdsourcing experiment

Following the lab experiment, we undertook a data summaries experiment on the crowdsourcing platform CrowdFlower (now Figure Eight)11. We used both dataset corpora, Set 5 and Set 20, and asked crowd workers to produce summaries of 50 to 100 words. Through the crowdsourcing experiment we were able to reach out to a much larger number of participants to create summaries for more datasets. Using the five datasets from Set 5 in both experiments allowed us to compare the characteristics of summaries produced by data practitioners and the crowd.

Existing research suggests crowdsourcing platforms are a feasible alternative to the lab for our purposes. Previous studies have considered related tasks such as text writing (Bernstein et al., 2015); text summarisation (Borromeo et al., 2017; Marcu, 2000); and data analysis (Lin et al., 2013; Willett et al., 2013).

Recruitment.

Participants were crowd workers registered on CrowdFlower. We limited the experiment to Level 2 crowd workers from native English-speaking countries12.

Process.

11 https://www.figure-eight.com/
12 Level 2 workers are workers who have reached a verified level of performance in their previous work.

Crowd workers had to describe five datasets (either Set 5 or one of the four groups from Set 20) in 50 to 100 words. The length of the summaries was informed by the lab experiment. The order of the datasets was rotated to prevent potential order effects due to fatigue or learning. We also included 12 short qualification questions prior to the task, assessing basic reading, reasoning and data literacy skills to make sure workers had the capabilities to complete the task.

We used the same basic task description as in the lab: We ask you to describe the datasets in a way that other people, who cannot see the data, can understand what it is about, but included some additional information. Paid microtask crowdsourcing works well when the crowd is provided with a detailed description of the context of the task they are asked to complete. For this reason, we also showed participants step-by-step instructions, a picture of a dataset, and examples of corresponding summaries which were based on the outputs from the lab experiment.

Just like in the lab experiment, participants were free to structure their summaries as they saw fit. However, they were shown three examples, presented as text; a list; and a combination of list and text (the three summary representations seen in the results of the lab experiment). The minimum time allowed to summarise five datasets was 15 minutes; the maximum time was 60 minutes. Both settings were informed by the lab experiment.

The outputs were, for each worker, five textual summaries for five datasets. To minimise spam, we prevented copy-pasting of content and validated a random selection of ten words from each answer against an English language dictionary, requiring a 60% matching threshold to be accepted.
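A minimal Python sketch of the dictionary-based validation rule described above (the word list and tokenisation are deliberately simplistic and purely illustrative; the check actually run on the platform may have differed in detail):

    import random

    def passes_dictionary_check(summary, dictionary, sample_size=10, threshold=0.6):
        # Sample up to ten words from a summary and require at least 60% of the
        # sample to appear in an English word list.
        words = [w.strip(".,;:!?()\"'").lower() for w in summary.split()]
        words = [w for w in words if w]
        if not words:
            return False
        sample = random.sample(words, min(sample_size, len(words)))
        matches = sum(1 for w in sample if w in dictionary)
        return matches / len(sample) >= threshold

    # Hypothetical usage with a tiny stand-in word list:
    english = {"the", "dataset", "lists", "people", "killed", "by", "police",
               "in", "with", "location", "and", "cause", "of", "death"}
    print(passes_dictionary_check(
        "The dataset lists people killed by police in 2015 with location and cause of death",
        english))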

We recruited 30 crowd workers for Set 5 and 20 crowd workers for Set 20 (five workers for each group of five datasets from Set 20). Workers were allowed to do only one task, i.e. to summarise five datasets. They were paid $3.00 per task. From the lab we learned that the task duration is likely to be around 25 to 35 minutes, which was confirmed in an early pilot on CrowdFlower.

A screenshot of the CrowdFlower task is included in Appendix B.5.

Analysis.

We collected a total of 250 crowdsourced summaries and manually excluded those which were obvious spam or off-topic. This resulted in 120 summaries for the five datasets in Set 5 (on average 24 summaries per dataset) and 90 summaries for the 20 datasets in Set 20 (between four and five summaries per dataset). We analysed: (i) the form and length of the summaries; and (ii) the summary attributes, grouped according to the information types identified in the lab experiment. On both accounts we used the same methods as in the lab experiment (see Section 4.3.2.2). We also looked at differences between the two participant groups for the summaries of Set 5 and across datasets for all 25 datasets from the two corpora.

Figure 4.5: CrowdFlower task instructions in the crowdsourcing experiment

Ethics.

This experiment was approved by the University of Southampton's Ethical Advisory Committee under ERGO Number 29966. Consent was given by crowd workers prior to carrying out the task.

4.4 Findings: Data-search diaries

The analysis of the data-search diaries was performed to complement the results of the summary analysis (Bryman, 2006). In their diaries, the students explicitly answered questions about their thought processes and their rationales when selecting data to work with.

The data attributes emerging from this analysis are listed below, and analysed in more detail in the discussion when we compare the results of our studies with existing metadata standards. We grouped them according to the three high-level themes identified in

Chapter 3: relevance, usability and quality, and describe the topics that emerged within these. Relevance refers to whether a dataset's content is considered applicable to a particular task, e.g. whether it is on the correct topic. Usability refers to how suitable the dataset is considered, meaning practical implications of, e.g., format or license. Quality refers to anything that participants use to judge a dataset's condition or standard for a task, such as completeness. Some of the attributes were mentioned by participants in the context of several themes, which emphasises their importance.

We present these as consolidated lists in Table 4.2. As mentioned earlier, there is limited research investigating dataset-specific selection criteria. We report on the prevalence of the individual attributes below; however, we believe that these can only be seen as indicative, due to the limited number of participants and the specifics of the task. While we do not attempt to rank the attributes based on prevalence, they confirm and extend the findings from Chapter 3, which strengthens their validity.

Theme | Attribute | %
Relevance | Scope (topical, geographical, temporal) | 36
Relevance | Granularity (e.g., number of traffic incidents per hour, day, week) | 19
Relevance | Comparability | 14
Relevance | Context (e.g. original purpose of the data) | 11
Relevance | Documentation (e.g. understandability of variables, samples) | 6
Usability | Format (e.g., data type, structure, encodings, etc.) | 44
Usability | Documentation (e.g. understandability of variables, samples) | 11
Usability | Comparability (e.g., identifiers, units of measurement) | 11
Usability | References to connected sources | 6
Usability | Access (e.g. license, API) | 6
Usability | Size | 4
Usability | Language (e.g. used in headers or for string values) | 3
Quality | Provenance (e.g. authoritativeness, context and original purpose) | 28
Quality | Accuracy (i.e. correctness of data) | 13
Quality | Completeness (e.g. missing values) | 13
Quality | Cleanliness (e.g. well-formatted, no spelling mistakes, error-free) | 9
Quality | Methodology (e.g. how was the data collected, sample) | 9
Quality | Timeliness (e.g. how often is it updated) | 6

Table 4.2: Findings on selection criteria for datasets, based on thematic analysis of the data-search diaries. Prevalence can be seen as indicative, but needs further validation.

4.4.1 Relevance

Two prevalent attributes were the scope of the data (in terms of what it contains) and its granularity. They were mentioned in 36% and 19% of responses, respectively. We hypothesise that students, by default, considered the content of the dataset to be an important factor (due to the nature of their task), and therefore only a relatively low percentage of them mentioned this explicitly. The scope sometimes referred to the geographical area covered by the dataset, while the granularity described the level of detail of the information (e.g. street level, city level, etc.). Some participants mentioned basic statistics such as counts, averages and value ranges as a useful instrument to assess scope.

Interestingly, 14% of the diaries noted the relative nature of relevance (echoing discussions in the literature (Mizzaro, 1997)) and the need to consider multiple datasets at the same time to determine it. To a certain extent, this could be due to the nature of the task – students were free to choose the topic of the datasets and hence might have had a broader notion of relevance, which allowed them to achieve their goals by interchanging one dataset for another or through a combination of datasets. However, the relation to other sources was mentioned in other categories as well, which reinforces the need for tools that make it easy for data users to explore more than one dataset at the same time and to make comparative judgements. This is also in line with experience reports about data science projects in organisations – making complex decisions often involves working with several datasets (Koesten et al., 2017; Erete et al., 2016). Further attributes from the diaries suggest that a thorough assessment of relevance needs to include easily understandable variables, data samples for fast exploration, as well as insight into the context and purpose of the data.

4.4.2 Usability

To determine how usable a dataset is for their task, participants mentioned a range of practical issues which, if all available in the desired way, would make working with a dataset frictionless: format, size, the language used in the headers or for text values, units of measurement and so on.

Format was the most prevalent attribute (44%), though documentation and the ability to understand the variables were perceived to impact usability as well (both at 11%).

The size of the dataset was mentioned primarily in the context of usability rather than as a basic descriptive statistic. This is probably due to the fact that students were mindful of the additional effort required to process large datasets.

The participants understood the importance of being able to integrate with other sources, for example through identifiers – 11% of the diaries mentioned this aspect explicitly. In their coursework, the students were asked to use at least two datasets and hence valued data integration highly. At the same time, using multiple datasets is not uncommon in most professional roles (Convertino and Echenique, 2017; Koesten et al., 2017). Access to the data was also mentioned in reference to APIs or licences, though only around 6% of the time. This low value is a function of our study: students were not looking to source data to solve a fixed problem. Their search for data, documented in the diaries, happened while they were deciding on the topic of their project. If they could not find data for one purpose, they could adjust the project scope rather than having to tackle licensing or access fees.

4.4.3 Quality

Participants mentioned unique attributes such as provenance – in a broad sense of the term – that would allow judgements around the authoritativeness of the publisher and context of the data. This included information about the original purpose of the data, as well as questions of sponsorship of the research or data collection, and about other potential sources of bias. At 28% this attribute was ranked much higher than other quality dimensions such as accuracy, completeness, timeliness and cleanliness, which are in the focus of many quality repair approaches (Wand and Wang, 1996). The importance of provenance resonates with previous work in data quality (Ceolin et al., 2016; Malaverri et al., 2013); there is also a large body of literature proposing frameworks and tools to capture and use provenance, though their use in practice is not widespread (for example Stamatogiannakis et al. (2014); Simmhan et al. (2008)).

Some participants reported to be interested in details of the methodology to create and clean the data, including aspects such as the control group, whether a study had been done using randomised trials, confidence intervals, sample size and composition, etc. This is in line with Chapter 3, where we pointed out that awareness of methodological choices plays an important role in judging the quality of a dataset with confidence.

In the discussion in Section 4.6 we relate these different data selection attributes to the attributes extracted from the summaries, and compare them to existing guidelines for data publishing and sharing. We identify overlaps between the information needs of people searching for data, who are potential consumers of data summaries, and the information people choose when summarising an unfamiliar dataset.

4.5 Findings: Dataset summaries

We report on the main findings from the lab and crowdsourcing experiments, covering the three areas mentioned in the methodology in Section 4.3.2: (i) summary form and length; (ii) information types; and (iii) detailed summary attributes.

4.5.1 Form and length

In the lab experiment participants were not given concrete suggestions or examples for surface representation, yet most resulting summaries were presented as text using full

English sentences (64%). However, some were structured as a list (17.3%) or presented as a combination of text and lists (18.6%). In the crowdsourcing experiment participants were provided with examples of summaries using these three representations. Their summaries were structured as follows: most (79%) used text, a few (7%) were structured as a list, and some (14%) presented a combination of the two.

The average lab summary was 98 words long (median of 103). By comparison, the crowd needed on average 63 words for the same datasets in Set 5. The average crowdsourced summary from Set 20 was 64 words long (median of 58). It would be interesting to explore how the length of the summaries impacts their perceived usefulness for readers or their potential information gain (Maxwell et al., 2017), particularly in the context of the summary template we propose in Section 4.6.

4.5.2 Information types

We identified four high-level types of information in the lab summaries, which we subsequently used to analyse summary attributes for all 360 summaries created in the two experiments. Our findings suggest these general categories to be dataset independent.

1. Descriptive attributes, e.g. format, counts, sorting, structure, file-related information and personal data:

(P1) The dataset has 468 rows (each representing one person who has been killed). (P7) No free text entries and character entries have a structured format. It contains no personal data. (P8) The header pageID appears to be a unique identifier. (P21) CSV in UTF encoding. Header and 110172 data rows. 10 columns.

2. Scope of the data, which refers to the actual content of the dataset, through column descriptions such as headers or groupings of headers, or references to the geographic and temporal scope:

(P1) For example, this includes details on the share of ethnicities, in each city, the poverty rate, the average county income, etc. (P14) Figures include number of confirmed deaths and proportion of cases per million people. (P2) Some columns have no particular meaning to a non-expert, e.g. columns named 'pop', 'pov', 'country-bucket', 'nat-bucket'. (P12) Each instance has specific details on the time, geographic location, earthquake's magnitude.

3. Quality, which included dimensions such as errors, completeness, missing values and assumptions about accuracy, but also expressions of uncertainty and critique:

(P1) The precision of the description varies wildly. (P14) A link (in some cases two) to the source of the data is provided for each country. (P7) It has column headers all in caps (apart from 'pageID'), which are mostly self-explanatory. (P8) Combination of personal data about person killed and demographic data, unclear if this is for area of killing. (P17) The data seems to be consistent and there aren't any empty cells.

4. Analysis or ideas for analysis and usage, such as simple data analysis, basic statistics, highlights of particular values or trends within the data:

(P5) The data does provide the method of how each individual has been killed which can provide an argument for police not using firearms in the line of duty. (P7) There is a significant amount of missing data in the ‘state’ column, but this information should be possible to infer from the ‘longitude’ and ‘latitude’ columns. (P18) The dataset shows that the greatest number of refugees originate from the Syrian Arab Republic. (P30) Killings took place all around America. The people who were killed mostly carried firearms.

Table 4.3 shows the percentage of summaries that contained each information type, split by dataset. The four types are not meant to define an exhaustive list: we consider them merely a reflection of the 150 lab summaries analysed, and in Section 4.8 we discuss this limitation of the study. The types are also not exclusive: more than half of the summaries included all four types of information. Analysis and usage was the least frequent information type overall, though some attributes in this category were more popular than others. For example, as we will note later in this section, basic statistics were mentioned more frequently in the crowd-generated summaries than in the lab, while trends and ideas for further use were rather low overall, with the exception of some Set 20 datasets described by the crowd. We believe the main reason for this is the design of the task. In the lab experiment, the task description might have implied a focus on the raw data and on surface characteristics that could be observed through a quick exploration of the data rather than an extensive analysis. Crowdsourcing requires a higher level of detail in instructions, which included examples of summaries with, among other things, basic statistics.

We present all individual attributes associated with each of the four information types in the remainder of this section.

Information type | Total | D1 | D2 | D3 | D4 | D5
Descriptive attributes | 81 | 67 | 87 | 80 | 83 | 90
Scope of the data | 99 | 97 | 100 | 100 | 100 | 100
Data quality | 79 | 80 | 67 | 90 | 77 | 80
Analysis and usage | 64 | 57 | 63 | 73 | 67 | 60

Table 4.3: Percentages of information types per dataset in Set 5, based on 150 lab summaries

4.5.3 Summary attributes

The summary attributes presented in this section represent a more granular analysis of the four high-level information types.

Across all summaries the most prevalent attributes were: a subtitle, the dataset's headers and information about the geographical scope of the dataset. In the following sections we present the identified attributes in detail without ordering them, as we believe that their actual importance based on prevalence would need to be validated in future work with a larger number of summaries and datasets.

Across the 360 summaries created in the two experiments we have identified the following attributes:

Summary attributes
Subtitle: A high-level one-phrase summary describing the topic of the dataset
Format: File format, data type, information about the structure of the dataset
Provenance: Where the dataset comes from, such as publisher, publishing institution, publishing date, last update
Headers: Explicit references to dataset headers
Groupings: Selection, groupings or abstraction of the headers into meaningful categories, key columns
Geographical: Geospatial scope of the data at different levels of granularity
Temporal: Temporal scope of the data at different levels of granularity
Quality: Data quality dimensions such as inconsistencies in formatting, completeness, etc.
Uncertainty: For example ambiguous or unintelligible headers or values, or unclear provenance
Basic statistics: For example, counts of headers and rows, size of the dataset, possible value ranges or data types in a column
Patterns/Trends: Simple analyses to identify highlights, trends, patterns, etc.
Usage: Suggestions or ideas of what the dataset could be used for

Table 4.4: Most frequent summary attributes, based on 360 summaries of datasets from Set 5 and Set 20.
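Read as a lightweight schema, the attributes in Table 4.4 could be represented programmatically, for instance as the following Python dataclass; the field names are our own illustrative shorthand and not part of any metadata standard:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class DatasetSummary:
        # A structured rendering of the summary attributes in Table 4.4;
        # field names are illustrative only.
        subtitle: str                        # high-level one-phrase topic description
        file_format: Optional[str] = None    # file format, data type, structure
        provenance: Optional[str] = None     # publisher, publication date, last update
        headers: List[str] = field(default_factory=list)
        groupings: List[str] = field(default_factory=list)  # meaningful header groupings
        geographical: Optional[str] = None   # geospatial scope
        temporal: Optional[str] = None       # temporal scope
        quality: Optional[str] = None        # completeness, formatting issues, etc.
        uncertainty: Optional[str] = None    # ambiguous headers, unclear provenance
        basic_statistics: Optional[str] = None
        patterns_trends: Optional[str] = None
        usage: Optional[str] = None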

Across the two experiments, a summary was commonly structured as follows: (i) a high-level subtitle describing the topic of the dataset; (ii) references to dataset headers (either the names of the headers or an abstraction of the headers such as a meaningful grouping); (iii) a count or other descriptive attribute such as possible values in a column; and (iv) geographic and temporal scope. Other popular attributes were: quality statements; provenance; and, less frequently, ways to analyse or use the data.

Here is a summary that exemplifies this:

(P6) A list of people killed by US police forces in 2015. Data included is location of incident, police department, state, cause of death and whether the victim was wielding a weapon. Detailed and specific data with 34 columns. Useful for drawing parallels between criminal profiling and locations.

Attributes that were mentioned in less than 10% of the lab summaries are not represented in this table. These include: mentions of personal data, license, methodology, funding organisation, and others.

Some summaries described the data by talking about the header row as an example:

(P16) Each row describes one of those 'earthquakes': lat, long, magnitude and location name.

Percentages of these attributes over all summaries can be seen in Table 4.5, split by experiment and dataset corpus (Set 5 and Set 20). Here we describe the most prevalent attributes across all summaries:

Attributes such as subtitle, geographical and temporal scope, and headers were present in a majority of summaries. Format was mentioned in more than half of the Set 5 summaries and in 27% of the Set 20 summaries. Basic statistics were mentioned fairly often as well, in more than half of the Set 5 summaries and in 48% of the Set 20 summaries.

Attribute | L Set 5 (N=150) | C Set 5 (N=120) | C Set 20 (N=95)
Subtitle | 89 | 86 | 95
Format (IT-1) | 61 | 52 | 27
Provenance (IT-1) | 45 | 24 | 23
Headers (IT-2) | 70 | 82 | 80
Groupings (IT-2) | 49 | 70 | 69
Geographical (IT-2) | 73 | 71 | 60
Temporal (IT-2) | 58 | 55 | 56
Quality (IT-3) | 55 | 23 | 21
Uncertainty (IT-3) | 69 | 16 | 8
Basic statistics (IT-4) | 56 | 74 | 48
Patterns/Trends (IT-4) | 27 | 25 | 52
Usage (IT-4) | 15 | 5 | 1

Table 4.5: Percentages of summaries created in the lab (L) and via crowdsourcing (C) that mention summary attributes. Darker fields have higher percentages. Numbers in brackets (N= ) refer to the number of summaries analysed in each category. (IT= higher level Information Types, as presented in Section 4.5.2).

Table 4.6 elaborates on the distribution of summary attributes in the lab summaries (150 summaries in total, 30 per dataset). Across the five datasets analysed, subtitle, format and headers were mentioned consistently in more than 55% of the cases. Basic statistics and quality achieve slightly lower scores (47% and higher). We discuss differences in scores between datasets as well as the attributes that showed greater variation later in this section.

                  D1   D2   D3   D4   D5
Format            60   63   57   60   67
Provenance        23   47   10   53   90
Subtitle          83   87   87   83   90
Headers           60   87   80   63   60
Groupings         13   40   70   53   30
Geographical      90    0   90   90   93
Temporal          87   37   87   33   47
Quality           57   50   47   60   60
Uncertainty       73   60   80   67   67
Basic Stats       53   73   63   47   43
Patterns/Trends   23   20   33   27   30
Usage             17   20   17    7   17

Table 4.6: Percentage of lab summaries containing respective attributes, per dataset from Set 5 (N=150). Darker fields have higher percentages.

                  D1   D2   D3   D4   D5
Format            55   50   57   48   52
Provenance        20   23    0   17   62
Subtitle          75   95   91   70  100
Headers           75   86   83   78   86
Groupings         65   73   78   70   62
Geographical      95    0   91   78   90
Temporal          90   68   61    0   62
Quality           45   23   22   17   10
Uncertainty       35    9   17    4   14
Basic Stats       75   77   83   78   57
Patterns/Trends   30   32   17   26   14
Usage              0    9    4    4    5

Table 4.7: Percentage of crowdsourced summaries from Set 5 (N=120) containing respective attributes, per dataset. Darker fields have higher percentages.

Table 4.7 illustrates the distribution of attributes in the 120 summaries by the crowd for the datasets in Set 5 (24 summaries per dataset on average). The most prevalent attributes are slightly different from those observed in the lab: subtitle, format and headers remain important, but basic statistics are more consistently mentioned than in the other experiment. At the same time, the crowd also focused on groupings of headers, much more so than the data practitioners who participated in the lab experiment – overall 70% of the crowd-generated summaries of Set 5 mentioned this attribute, compared to only 49% of the lab summaries (see Table 4.5); the scores for the individual datasets were between 62% and 78% on CrowdFlower and 13% and 70% in the lab.

The Set 20 summaries created by the crowd reinforce some of these trends (see Table 4.5). When looking at the 20 datasets from Set 20, subtitle, geographical and temporal scope and headers are mentioned in the majority of summaries, just like in the summaries of Set 5. Groupings seem to be popular among crowd workers across all datasets – 69% of the 95 Set 20 summaries mention them, compared to 70% of the 120 Set 5 summaries, but only 49% of the lab summaries. By contrast, Set 20 summaries showed greater variation in: format (27% vs 52% of the crowd-produced Set 5 summaries vs 61% for the lab-produced Set 5 summaries) and basic statistics (48% vs 74% of the crowd-produced Set 5 summaries vs 56% for the lab-produced Set 5 summaries). Aside from the popularity of groupings, a second surprising result was the popularity of patterns/trends aspects – this attribute was mentioned in only 27% of the lab summaries and 25% of the crowd summaries for the same Set 5 datasets (see Table 4.5), but in 52% of the Set 20 summaries. This goes against our expectations, as the task instructions suggested a focus on raw data and surface characteristics. Later in this section, we will examine the summaries that referred to patterns/trends to understand how this difference came about.

Differences between lab and crowd summaries.

The five datasets from Set 5 were used in both experiments. As noted earlier, for the lab experiment, our sample consisted of N = 150 summaries from 30 participants, while for the crowdsourcing experiment we used a sample of N = 120 summaries from 30 participants (spam answers were manually removed).

We compare the distributions of attributes in the two experiments shown in Table 4.5. For the five datasets in Set 5, provenance appears more frequently in the summaries created in the lab (45% vs 24%). We believe this to be due to the fact that participants were more data savvy and so placed greater importance on where a dataset originates from. A similar trend was observed in the data-search diary, where participants were MSc students reading data science, computer science or statistics. We assume the same applies for quality statements (55% vs 23%) and ideas for usage (15% vs 5%), whose appreciation may equally require a certain level of experience with data work which was not given in the crowdsourcing setting. At the same time, the crowd appreciated attributes such as groupings of headers (21 point difference) and basic statistics (18 points) more. This demonstrates that the crowd had a fair level of data literacy and did not focus only on features that can be easily observed, such as subtitle, format and headers. As noted earlier, when looking at summaries for 20 other datasets, groupings remained popular, but basic statistics dropped to a lower level than in the lab (48%). We believe this calls for additional research to understand the relationship between the capabilities of summary authors and the aspects they consider important in describing datasets to others.

Looking at the distribution of summary attributes over Set 5 (Table 4.6), geospatial attributes, as well as provenance, appear to have the highest dependency on the dataset. D5 differed from the other four datasets in the corpus by including an entire column titled 'Sources', displaying links to the source from which the values were taken - this is likely the reason why 90% of the 30 data practitioners and 17 crowd workers mentioned it in their summaries. D2 similarly included a header called 'Page id' pointing to the source of the data - this was less easy to spot by the crowd workers, who talked about provenance only 17% of the time.

We believe that geospatial attributes might in reality be more consistent for most datasets - four out of five datasets achieved consistently high scores in this category. D2, however, was set in a fictional universe and may therefore have prompted participants to discard any geospatial considerations.

Differences between Set 5 and Set 20 summaries.

The crowdsourcing experiment used two corpora: Set 5 with the same five datasets used in the lab (D) and Set 20 with 20 datasets (E). The reason for including a second corpus, albeit with fewer summaries per dataset (95 summaries in total, four to five per dataset), was to explore how the main themes that emerged from the 270 Set 5 summaries generalise across datasets. To recapitulate, the total number of summaries in the crowdsourcing experiment from Set 20 is N=95, and from Set 5 is N=120, as described in Section 4.2.1. Compared to the Set 5 crowd-generated summaries, Set 20 shows a higher prevalence of subtitles (95% vs 86%) and patterns/trends (52% vs 25%) and lower scores for format, geographical scope and basic statistics (see Table 4.5).

We looked at each of the 20 datasets from Set 20 to understand where these differences might come from. Set 20 contained a higher number of datasets with clearly identifiable subtitles, which explains the higher score. The datasets overall had fewer attributes representing format and basic statistics. Many Set 20 datasets either did not contain any geographical information or were clearly associated with a country or region that is not mentioned explicitly – for instance, E10 is about the UK's House of Commons, but there are no geospatial values in the dataset. The popularity of patterns/trends in Set 20 points to another dependency of summary content on the dataset – both in the lab and on CrowdFlower, the summaries of the Set 5 datasets were consistent along this dimension. For instance, E11 explicitly mentions statistical content such as 'the median' as a header; other datasets with a high percentage of patterns/trends attributes in their summaries tend to display clear trends or rankings and therefore afford quick judgements, for instance 'the country with the highest human development index'. The same applies to data points that stand out and get highlighted in a summary, for instance in dataset E10, which contains salaries and expense claims from members of the British House of Commons and shows claims for a lawn mower, amongst other items.

Just like the other summaries produced by crowd workers, usage, provenance and quality were not mentioned very often, which we believe is due to the level of data literacy in the experiment. In addition, we noted that in the Set 20 summaries provenance was often not recorded when the context or origin of the dataset was very opaque – for example, E4 had mainly numerical values describing the elderly population worldwide – or in connection to uncertainty about the provenance – e.g. E12 was about US weather data, but did not make any reference to the source of the data.

4.5.3.1 Summary attributes in detail

In the previous section we presented a series of high-level findings across the two experiments and differences across datasets and participant groups. In this section, we discuss summary attributes individually and give additional details and example summaries. Summary quotes used throughout this section refer to Set 5. The total number of summaries from Set 5 in the lab study is N=150, and from Set 5 in the crowdsourcing experiment is N=120, which is what the respective percentages reported in this section refer to.

Format and file related information.

Format. The file format and references to the structure of the dataset were explic- itly mentioned in more than 60% of all lab summaries and in about half of all Set 5 crowdsourced summaries. The mentions of file format or data type drop for Set 20 to 27%.

File related information. The summaries contained other attributes that described the file beyond its actual content, which refers to descriptive attributes as mentioned in the overview section on information types represented in the summaries. These included attributes such as: the type of values in a column; statements about the size of the file; mentions of licence (3% of the lab summaries and none of the crowdsourcing summaries); sorting of values; redundancies in the data; formatting; and unique identifiers. The value type of a column was mentioned in 18% of all lab summaries, and in 23% of the Set 5 and 15% of the Set 20 crowdsourced summaries. There were also mentions of personal data in this category, as they describe a characteristic of the data rather than the data itself. Personal data was mentioned by 20% of the participants, mostly in connection to D3, which contained names of people involved in police incidents. We assume this is due to the fact that, in the context of our task and the type of data we used (aside from D3), personal data was not a category that our participants were prompted to think of.

High-level subtitle.

Close to 90% of all summaries started with a high-level subtitle which gave the reader a quick first impression of what the dataset was about. In some cases the subtitle referred to a key column (6.7%) or, more often, to the geospatial scope (48% of the lab summaries and 35% of the crowdsourced summaries), or to the temporal scope of the dataset (33% of the lab summaries, 17% of Set 5 crowdsourced summaries and 19% of Set 20).

(P1) This dataset, in csv format, describes police killings in what appears to be the USA in 2015.

(P11) Dataset of characteristics of Marvel comic book characters from the earliest published comics to around 2013.

(P13) This dataset describes the time, geographical location and magnitude of earth- quakes in the United States.

Column descriptions.

A majority of summaries explicitly mentioned the headers of the dataset (70%). This was consistent across all summaries written by the same participant, which suggests that this feature does not depend on the underlying dataset. Out of those who mentioned headers explicitly (N = 23), the majority were consistent for all of their summaries (90%). About half of all summaries show some type of grouping or abstraction of the headers. Participants typically mention a selection of headers and group them according to meaningful categories, as can be seen below:

(P14) 34 variables, which comprehend personal information about the victim, place (inc. police department) of the incident, details about the incident, socio-demographic of the place.

(P15) Fields: Demographic data (name, age, gender, race), date (month, year, day), incident details (cause of death, individual armed status - categorical), county details (population, ethnicity), law enforcement agency, general reference data.

Similarly, a common strategy is the identification of a key column, which is the focus of the dataset:

(P11) For the victims, the metadata records their age, gender, ethnicity, address. The place and time of their death, as well as the cause of death and police force responsible are also recorded.

(P23) We are given useful information about each earthquake, specifically: latitude, longitude of the event, magnitude of the earthquake, a unique identifier for each earthquake called ‘id’, when the data was last updated, the general area the earthquake took place, the type of event it was, the geometrical data and if it took place in the US we are given the state it occurred in.

Some participants use the actual header name, others use a more descriptive version of the header. Many list the headers, together with qualifying information about them and/or possible values and ranges in a column.

(P30) It lists more than 15000 characters with their fictitious name and the real name in the comic. The data set records whether they are alive or dead characters, their gender, their characteristics (like: hair and eye colour). The data set records if the character has a secret identity [..] (and) whether the particular character has a negative or positive role.

Geographical information.

Geospatial aspects were very common in summaries across datasets and participant groups. In Set 5, the exception was D2, which described characters in a fictional world. They referred to different types of locations, including provenance (where the data comes from), coverage of the data itself (e.g. data from a particular region), and format, at varying levels of granularity. Summary authors often used higher-level descriptions of the relevant values, for example 'for most countries in the world' or 'across the world' to describe key columns with a wide range of country names.

(P17) The data goes down to country and includes country codes, the area and region.

(P1) Location, provided by latitude and longitude measurements.

(P2) Location, in latitude and longitude, but also in descriptive text about location relative to a city.

(P7) Each observation refers to a unique country, using country codes.

Temporal information.

Temporal aspects were mentioned in connection to: time mentioned in the data, the publishing date, the last update and the time the data was collected, all at different levels of granularity. The numbers reported here include only temporal attributes that refer to the temporal scope of the data itself and not to the publishing date or last updates, which were included in provenance.

Often summaries refer to both ‘date’ and ‘time’, meaning the time of the day and the day that a particular event in the data occurred.

We found differences depending on the datasets: in the Set 5 lab summaries, for example, time was most often mentioned in relation to D1 and D3 (87%) and less often in connection to D2 and D4 (< 40%). D3 had three date columns separating day, month and year from each other, which might have prompted participants to include this information in the summaries. D1 had high inconsistency in the formatting of dates and included two types of temporal information: when the earthquake took place and when the specific row was updated. D1 displayed a relatively high overlap between time and uncertainty (30% of all mentions of time were connected to uncertainty). This points to inconsistencies in the formatting of dates in D1 and to potentially confusing headers called 'time' and 'updated', which show a mixture of dates and times. We assume this contributes to the varying prevalence of time in the summaries, which can be seen in Table 4.6. D4, on the other hand, did not contain temporal information explicitly, which explains the significantly lower percentage. This was reflected in the crowdsourced summaries for D4. D2 did contain temporal information (year and month); however, it describes fictional comic characters, which may have led participants to place less importance on the temporal information represented in the data.

Temporal provenance. We further saw mentions of updates of the data, which we define as temporal provenance. This was present in 20% of all lab summaries, in 6% of the Set 5 crowdsourced summaries and in 12% of the Set 20 crowdsourced summaries. It describes mentions of time that can be used to determine the relevance or quality of the data, such as:

(P30) The data set for confirmed cases of flu was last updated on 20/01/2010.

(P1) It is unclear whether this data is up to date, as there are no details on when this is from.

Quality statements and uncertainty.

Statements about uncertainty and quality were present in 70% of the lab summaries. Among the most popular words in this category were 'unclear' and 'missing'. The emerging themes connected to quality were features such as inconsistencies in formatting (e.g. dates), completeness, statements about a lack of understandability (such as ambiguous or unintelligible headers or cells), as well as unclear provenance and authoritativeness of the source.

We further grouped uncertainty statements into six categories related to: completeness, precision, definitions, relations between columns, temporal and geospatial attributes, and methodology.

Completeness included statements about the representativeness, comprehensiveness and scope of the data, in addition to general statements about missing values:

(P4) Unclear how representative this list is of total population/whether this list is total population.

(P13) The dataset appears to be missing data from some of the countries.

Accuracy referred to inconsistencies in the data, for instance in units of measurement, or variations in the granularity of cell values.

(P13) The precision of the description varies wildly (eg. 23 km NE of Trona versus Costa Rica).

Definitions were a common theme within uncertainty, such as unclear meaning of headers or identifiers, acronyms or abbreviations or other naming conventions. This seemed especially important for numerical values as there is often no further context given to a cell value or no information provided on what missing values mean:

Uncertainty about what missing values mean was also noted: (P24) This dataset is clear and is very dense although it is possible that the zero values in the set denote that the data could not be obtained.

(P27) It’s not clear how the ‘magnitude’ is measured, presumably it’s the Richter scale but that isn’t specified.

Relations between columns, or dependencies between columns, were also mentioned within uncertainty.

(P1) It is unclear whether these are civilians who have been killed by police, or policemen who have been killed by, though I assume it is the former.

Temporal and geospatial attributes within uncertainty referred to unclear levels of aggregation or granularity of these attributes and potential ranges of values within a column. Furthermore, it often seemed to be unclear whether the data was up-to-date, and whether events in the data represent the time these were recorded or the time these happened. 19% of all mentions of uncertainty are connected to time and 28% to location:

(P14) All the data is related to 2015, although I do not know whether all the data about this year is contained in this dataset.

(P1) It is unclear to me whether these details are from the city, county, or state level.

Methodology: Uncertainty statements also presented questions related to the methodology of data collection and creation. These covered aspects such as: how were these numbers calculated, are they rounded, how was the data collected, what was the purpose of the data? Some of these aspects refer to the provenance of the data; the importance of awareness of methodological choices during data creation was also found to be an explicit selection criterion in the results of the diary study.

Basic statistics.

Basic statistics about the dataset were one of the most prevalent features in the analysis and usage category (mentioned by 77% of all participants, with no significant differences in the occurrence per dataset). This included the number of rows, columns, or instances (such as the number of countries in the data). For instance: 'Size: 468 rows by 32 columns (incl. headers)' or 'information on 101,171 earthquakes'. Additionally, some summaries include the number of possible values which can be expected in a specific column, such as in this example for the header 'hair': 'HAIR - TEXT - 23 hair colours plus bald and no hair'.

Possible values in a column were mentioned explicitly by 56.6% of all participants, most often in connection to D2. We assume that is because this dataset has a number of columns in which the range of values is limited. For instance, headers referring to eye or hair colour, or gender, which have a limited number of possible entries:

(P20) The dataset also characterises whether the characters are good, bad, or neutral.

When there was a greater number of possible values, these were presented through ranges or examples, or by defining data types or other constraints for a column.

(P21) ID: Identity is secret/public/etc. ALIGN: Good/bad/neutral/-etc. EYE: Char- acter’s eye colour HAIR: Character’s hair colour.

It is likely that the number of explicit mentions of possible values under-represents the importance of this category: as the participants were describing the dataset for someone else, in natural language, we would assume that if the summary specifies a header such as 'age', there is no need to further explain that this column contains numerical values, as this would automatically be inferred, just as in a conversation between people.
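To make the basic statistics attributes discussed above concrete, the following is a minimal sketch (not part of the study) of how row and column counts, value types and possible values could be derived from a CSV file using pandas; the function name, the threshold of 25 categories and the output structure are our own illustrative choices.

```python
# Illustrative sketch: derive "basic statistics" attributes from a CSV file.
import pandas as pd

def basic_statistics(csv_path: str, max_categories: int = 25) -> dict:
    """Return row/column counts, value types and value ranges per column."""
    df = pd.read_csv(csv_path)
    stats = {
        "n_rows": len(df),
        "n_columns": len(df.columns),
        "headers": list(df.columns),
        "columns": {},
    }
    for col in df.columns:
        series = df[col]
        info = {
            "dtype": str(series.dtype),
            "missing": int(series.isna().sum()),
        }
        if pd.api.types.is_numeric_dtype(series):
            info["range"] = [series.min(), series.max()]
        elif series.nunique(dropna=True) <= max_categories:
            # Small value sets (e.g. eye colour, alignment) are listed explicitly.
            info["possible_values"] = sorted(map(str, series.dropna().unique()))
        stats["columns"][col] = info
    return stats
```

Such an extraction would only cover the easily observable part of a summary; the groupings, quality statements and usage ideas discussed in this section would still require human input.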

4.6 Discussion of findings

We discuss the identified summary attributes, the results of the diary study and how these insights can inform the design of automatic summary creation. We compare our findings to existing metadata guidelines and detail the implications our results have on defining user centred dataset summaries. We conclude by discussing where we see the role of textual summaries, together with metadata, in the data discovery process.

4.6.1 Summary attributes

We identified features that people consider important when trying to select a dataset (RQ1), and when trying to convey a dataset to others (RQ2), as can be seen in Table 4.8. Our findings address a gap in the literature, relevant in the context of data publishing, search and sharing. We were able to see common structures and isolate the different attributes that the summaries were made of (RQ2), as can be seen in Figure 4.6. Summaries for the same dataset, created by different participants, shared common attributes. We found that a number of attributes tend to be less dependent on the underlying datasets, such as subtitle, format, headers and quality; whereas others tend to vary more depending on the data. Our findings allowed us to determine the composition and feasibility of general purpose dataset summaries, written solely based on the content of the dataset, without any further context.

Our findings suggest a range of dataset characteristics which people consider important when engaging with unfamiliar datasets. This analysis allows us to devise a template for the creation of text representations of datasets, which is detailed in Section 4.7. Some of the attributes could be generated automatically, while others would still require manual input, for example from the dataset creator or from other users. We saw that all dataset summaries, as expected, explicitly describe the scope of the content in the dataset. Extracting content features directly from the dataset, and representing them as text, is still a subject of research, in particular in the context of extractive dataset summarisation

Figure 4.6: Example of an annotated dataset summary

(Ferreira et al., 2013) or semantic labeling of numerical data (Pham et al., 2016). Our findings can inform the design of these methods by suggesting parts of a dataset that matter in human data engagement.

At the same time, our analysis shows that most summaries also cover information that goes beyond content-related aspects, including groupings of headers into meaningful categories, the identification of key columns, and in some cases also the relationship between these and other columns in the dataset. These areas should be taken into account by data publishers when organising and documenting their data, and by designers of data exploration tools. For example, tools could highlight key columns and their relationships, or display structure overlaps that group headings in a relevant way. Furthermore, our summaries contained quality statements, some of which are complex as they refer to the potential context or use cases of the dataset; or an expression of uncertainty. We therefore conclude that purely extractive approaches are unlikely to produce useful text summaries of datasets that meet people's information needs.

While abstractive approaches to automatically generate summaries exist, we believe that the levels of abstraction and grouping needed for the creation of meaningful textual representations of data are not yet being realised. To be truly useful, a summary needs to be a combination of extractable features, combined with contextual information, human judgement, and creativity. This applies to selecting the right content to consider, as well as to representing this content in a meaningful way.

Comparing summaries created in a lab setting to those created in a crowdsourcing experiment gave us an understanding of the level of expertise, or the closeness to the data, that is needed to write a meaningful summary. It further gives insights into the feasibility of crowdsourcing as a potential method for dataset summary generation. We found that dataset summaries can be produced using crowdsourcing; however, to fully reproduce summaries as they were created in the lab experiment, crowd workers could benefit from additional guidance, such as a template to support the summary writing process. We believe such a template would equally help data publishers to write a comprehensive and meaningful summary, and is necessary for the development of automated dataset summarisation approaches.

Without this research, researchers and developers creating summaries would focus on obvious items such as column headers. This work demonstrates the importance of other aspects such as the grouping of headers, value types and ranges, information about data quality or usage suggestions – all attributes not commonly included in metadata. This highlights the difficult areas in fully automated approaches to summary creation. Understanding which attributes are considered important when selecting and describing datasets can focus future research efforts to deliver value to users. It can also be used to inform benchmark design for automated summary creation research.

4.6.2 Comparison to metadata standards and data search diaries

Table 4.8 shows a comparison between the results of the summary creation study, the outcomes of the analysis of data search diaries and current metadata standards. We can see that the attributes basic stats, quality statements, patterns/trends and usage are currently not represented in either of the two metadata schemas we discuss. Further differences include the grouping of column headers into meaningful semantic categories, the identification of a key column, and the importance of value types for the main columns.

We saw that many summaries, as well as the diary data, suggest the usefulness of basic statistics about the dataset, such as the number of rows and columns, but also information on the possible values or ranges of important columns. These are potentially easy to extract from a dataset (e.g. Gatt et al. (2009)) but are not usually captured in standard metadata. In terms of geospatial and temporal attributes the main difference concerns the granularity of the information. Quality statements, initial analysis of the dataset content (patterns and trends) and ideas for usage are those attributes which are potentially complex to create but can be of great value in the selection process of datasets (Koesten et al., 2017). We believe that both provenance and methodology are under-represented in the summaries due to the nature of the task and experiment design. Our work focuses on attributes people find important when selecting and describing datasets. However, whether the attributes should be represented in textual summaries or as structured metadata would be an interesting direction for future research.

Category: Format and file related info
  Summaries: format, size of the file, personal data, license, unique identifiers
  Diary: file format, API, access, last updated, unique identifiers, language, size
  Schema.org (S) / DCAT (D): S: file format, license, identifier, url; D: size (bytes), format, identifier, language

Category: Provenance
  Summaries: provenance (publisher, publishing organisation), temporal provenance (publishing date, last update, time of data collection), geospatial provenance
  Diary: publishing org (authoritative, reliable source), funding organisation (bias, independent source), original purpose (context)
  Schema.org (S) / DCAT (D): S: author, contributor, producer, publisher, creator, editor, provider, source organisation; D: contact point, publisher, landing page, sponsor, funder

Category: Subtitle
  Summaries: high-level one-phrase summary
  Diary: title
  Schema.org (S) / DCAT (D): S: main entity, about, headline; D: theme, concept, keyword, title

Category: Headers and Groupings
  Summaries: headers, selection and grouping of headers (+ explanation), key columns
  Diary: headers, attributes/values and their meaning, value types (documentation)
  Schema.org (S) / DCAT (D): S: variables measured

Category: Geographical
  Summaries: geospatial scope (+ level of granularity)
  Diary: location of publishing organisation, geospatial coverage (level of granularity)
  Schema.org (S) / DCAT (D): S: location created, spatial coverage, content location; D: spatial coverage, (spatial resolution - pending evidence)

Category: Temporal
  Summaries: temporal coverage (+ level of granularity)
  Diary: temporal scope, level of granularity, time of data collection (including time of the year), temporal provenance (time of publishing, up-to-date, maintained)
  Schema.org (S) / DCAT (D): S: temporal coverage, content reference time, date created, date modified, date published; D: temporal, temporal coverage, release date, update date, frequency of publishing, (temporal resolution - pending evidence)

Category: Quality
  Summaries: quality dimensions: completeness, consistency in formatting, understandability (headers, acronyms, abbreviations), representativeness, coverage
  Diary: quality dimensions: completeness, accuracy, consistency in formatting, cleanliness, understandability, clear provenance and authoritativeness of source
  Schema.org (S) / DCAT (D): -

Category: Basic statistics
  Summaries: ranges per column (possible values per column), counts of rows and columns, size, possible value ranges and data types
  Diary: units of measurement, upper/lower bounds to estimates, unique values for a column, comprehensiveness, range and variation, number of rows and columns
  Schema.org (S) / DCAT (D): -

Category: Patterns and Trends
  Summaries: analysis of the dataset content (patterns, trends, highlights)
  Diary: -
  Schema.org (S) / DCAT (D): -

Category: Usage
  Summaries: ideas for usage
  Diary: reasons not to use the dataset
  Schema.org (S) / DCAT (D): -

Category: Methodology
  Summaries: -
  Diary: methods, control group, randomised trial, number of contributors, confidence intervals, sample and consideration of influencing factors, bias, sample time
  Schema.org (S) / DCAT (D): S: measurement technique, variables measured

Table 4.8: Comparison of summary attributes to data-search diary and metadata standards (as of 5/2019). Summaries = results from this study; Diary = analysis of selection criteria in a data-search diary; (S) = Schema.org, (D) = DCAT (Data Catalog Vocabulary). The attribute 'description' is excluded.

4.6.3 Making better summaries

In Chapter 3 we identified dataset relevance, usability and quality as critical to dataset search (Koesten et al., 2017). Relevance can be determined by having insights into what the dataset contains, and by analysing the data. Usability can be judged from the descriptive information in the summaries (such as format, basic stats, license, etc.). The quality and uncertainty statements expressed in the summaries deliver an assessment of dataset quality.

Individual attributes of the summaries could be generated using existing approaches, for instance from database summarisation methods, some of which generalise column content into higher-level categories, ideally describing the content in the column (Saint-Paul et al., 2005). Other approaches have tried to automatically identify the key column of a dataset (Ermilov and Ngomo, 2016; Venetis et al., 2011).
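Purely as an illustration, a naive heuristic for guessing a key column is sketched below; it scores columns by completeness and uniqueness. The cited approaches are considerably more sophisticated, and the function name and scoring are our own assumptions.

```python
# Illustrative heuristic only: pick the column that is most complete and most unique.
import pandas as pd

def guess_key_column(df: pd.DataFrame) -> str:
    """Score columns by completeness * uniqueness and return the best candidate."""
    def score(col: pd.Series) -> float:
        completeness = col.notna().mean()
        uniqueness = col.nunique(dropna=True) / max(len(col), 1)
        return completeness * uniqueness
    return max(df.columns, key=lambda name: score(df[name]))
```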

Granular temporal and location descriptions.

Among the results that confirmed existing best practices and standards was the prevalence of time and location in characterising datasets. These are commonly covered by existing metadata formats15. Our study has revealed a multitude of granularities in connection to these features, which are less well supported. The level of granularity of the temporal or geospatial features of a dataset is crucial to understanding its usefulness for a particular task. This is reflected in the number of mentions of these attributes in the summaries. Based on the results of this study we believe summaries should support users in determining whether a dataset has appropriate levels of aggregation for a given task.
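One way a summary could surface this, sketched here as an illustrative heuristic rather than a method from this thesis, is to infer the granularity of a temporal column automatically, assuming pandas and a column of parseable dates:

```python
# Illustrative heuristic (not from the thesis) for reporting temporal granularity,
# so a summary can state e.g. "daily data for 2015".
import pandas as pd

def temporal_granularity(values: pd.Series) -> str:
    parsed = pd.to_datetime(values, errors="coerce").dropna()
    if parsed.empty:
        return "no parseable dates"
    if (parsed.dt.time.astype(str) != "00:00:00").any():
        return "time of day"
    if (parsed.dt.day != 1).any():
        return "daily"
    if (parsed.dt.month != 1).any():
        return "monthly"
    return "yearly"
```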

Standard representations of quality and uncertainty.

Quality statements in the summaries included judgements on completeness, as well as assumed comprehensiveness of the data, errors and precision. Uncertainty statements referred to the meaning of concepts or values in the dataset (commonly including abbreviations and specialised terms) – which confirms findings in Koesten et al. (2017) – as well as unclear temporal or geographical scope of the data. Such statements illustrate the potential impact that good textual summaries and documentation can have for data users. W3C guidelines include completeness and availability as quality-related measures16. Our study shows that, especially in the more in-depth lab summaries, statements expressing uncertainty or assessing the quality of a dataset are very common. There is a body of research discussing how to best communicate uncertainty in visual representations of data (for instance Boukhelifa et al. (2017); Kay et al. (2016); Simianu et al. (2016)). Understanding how to communicate uncertainty in textual representations of data, and furthermore, how this type of information impacts on the decisions of subsequent data users and on the ways they process the data, is comparatively less explored. Furthermore, previous research with data professionals has suggested that assessing data quality plays a role in selecting a dataset out of a pool of search results; studies such as Gregory et al. (2017); Wang and Strong (1996) have discussed the task-dependent and complex nature of quality. We assume that creating a more standardised way of representing uncertainty around datasets would be beneficial from a user perspective; related literature indicates that communicating uncertainty improves decision making and increases trust in everyday contexts (Joslyn and LeClerc, 2013; Kay et al., 2013).

15http://schema.org/Dataset
16https://www.w3.org/TR/dwbp/

Summary length.

One open question in the context of summary creation is the optimal length of a general purpose dataset summary. Regarding the effect of summary length, our study showed that the longer summaries produced in the lab experiment contained more qualitative statements which not only describe the data but judge the dataset for further reuse. This is not to say that, in all cases, the longer a summary, the better its quality. It is important to consider the likelihood that there is an optimal summary length, and that surpassing this causes quality to decrease as the key elements of the summary become less accessible – which is an interesting area for future work. Determining snippet length in web search has been the subject of numerous studies (for instance Cutrell and Guan (2007); He et al. (2012); Maxwell et al. (2017)), which generally suggest that summary length influences relevance judgements by users. However, in this work we focus on summary content and not on summary length.

4.7 Dataset summary template

We present a template for user centred dataset summaries which can be incorporated into data portals, used by data publishers, and inform the development of automatic summarisation approaches.

Studies on text summarisation have found that people create better summaries when they are given an outline or a narrative structure that serves as a template, as opposed to having to create text from scratch (Borromeo et al., 2017; Kim and Monroy-Hernandez, 2016). Based on our findings, we propose such a template for text-centric data summaries. If used, it could improve current practices for manually written summaries, and potentially inform automatic data-to-text approaches as well.

Below we present the 9 questions that serve as the dataset summary template:

TEMPLATE QUESTION / EXPLANATION

REQUIRED

1. How would you describe the dataset in one sentence?
   What is the dataset about?

2. What does the dataset look like?
   File format, data type, information about the structure of the dataset

3. What are the headers?
   Can you group them in a sensible way? Is there a key column?

4. What are the value types and value ranges for the most important headers?
   Words/numbers/dates and their possible ranges

5. Where is the data from?
   When was the data collected/published/updated? Where was the data published and by whom? (required if not mentioned in metadata)

OPTIONAL

6. In what way does the dataset mention time?
   What timeframes are covered by the data, what do they refer to and what is the level of detail they are reported in? (E.g. years/day/time/hours, etc.)

7. In what way does the dataset mention location?
   What geographical areas does the data refer to? To what level of detail is the area or location reported? (E.g. latitude/longitude, street name, city, county, country, etc.)

8. Is there anything unclear about the data, or do you have reason to doubt the quality?
   How complete is the data (are there missing values)? Are all column names self-explanatory? What do missing values mean?

9. Is there anything that you would like to point out or analyse in more detail?
   Particular trends or patterns in the data?

Table 4.9: Dataset summary template

These template items can be used as a checklist in the summary writing process. Our findings showed a dependency of attributes on the dataset content, mostly for temporal information, meaningful groupings of headers, provenance, basic stats and geospatial information (which may be an exception, as explained in the findings). Hence we suggest that template questions 1-4 be required, as they are generic attributes describing datasets. Question 5, a dataset's provenance, is usually provided in standard metadata. Template questions 6-9 are considered optional in the summary, as they are not necessarily applicable to all datasets. However, when applicable to a specific dataset, questions 5-9 should be included in the dataset's summary. The template focuses on attributes that can be inferred from the dataset itself, or on information that is commonly available in metadata, such as provenance. We do not include uncertainty about the dataset as a template question, as the summaries have shown that uncertainty statements can refer to any of the categories of the template and are inherently dependent on the user.

We believe this template reflects the needs and expectations of data consumers, and can be adapted into current manual summarisation practices as a set of 'best-practice' guidelines, or by incorporating it directly into metadata standards. Initially, each question could be translated into a semi-automatic questionnaire that extracts summary attributes, such as headers or basic statistics, and guides the data publisher interactively through the summary writing process.
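A sketch of what a machine-readable version of the template could look like follows. The questions are those of Table 4.9, while the field names ("id", "question", "required", "auto_fill") and the auto-fill flags are our own illustrative assumptions and are not part of the thesis prototype.

```python
# Illustrative encoding of the dataset summary template as a checklist.
# "auto_fill" marks questions whose answers could be partly pre-computed
# (headers, format, basic statistics), as discussed above.
SUMMARY_TEMPLATE = [
    {"id": 1, "question": "How would you describe the dataset in one sentence?",
     "required": True, "auto_fill": False},
    {"id": 2, "question": "What does the dataset look like?",
     "required": True, "auto_fill": True},    # format, structure
    {"id": 3, "question": "What are the headers?",
     "required": True, "auto_fill": True},    # header extraction
    {"id": 4, "question": "What are the value types and value ranges for the most important headers?",
     "required": True, "auto_fill": True},    # basic statistics
    {"id": 5, "question": "Where is the data from?",
     "required": False, "auto_fill": False},  # required only if absent from metadata
    {"id": 6, "question": "In what way does the dataset mention time?",
     "required": False, "auto_fill": False},
    {"id": 7, "question": "In what way does the dataset mention location?",
     "required": False, "auto_fill": False},
    {"id": 8, "question": "Is there anything unclear about the data, or do you have reason to doubt the quality?",
     "required": False, "auto_fill": False},
    {"id": 9, "question": "Is there anything that you would like to point out or analyse in more detail?",
     "required": False, "auto_fill": False},
]
```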

Use of this template could improve current practices for manually written summaries: the direct advantage is decreasing the burden for the publisher by reducing cognitive effort, and contributing to standardising textual summaries of datasets for human consumption.

This template also has the potential to inform the development of automatic data-to-text approaches. The amount of support available to users could be increased through the use of machine learning techniques in data-to-text generation that are increasingly able to produce higher quality summarisation sentences, which could then be edited by the publisher.

4.7.1 From summaries to metadata

While our focus was on text summaries, the themes we have identified can inform the design of more structured representations of datasets, in particular metadata schemas as a primary form for automatically discovering, harvesting, and integrating datasets. Like any other descriptor, metadata is goal-driven: it is shaped by the type of data represented, but also by its intended use (Greenberg, 2010). Text summaries of data can be seen as metadata for consumption by people. They are meant to help people judge the usefulness of a dataset in a given context. Structured metadata, commonly in the form of attribute value pairs, is potentially useful in this process as well; in fact, in the absence of textual summaries, people use whatever metadata they can find to decide whether to consider a dataset further. However, metadata records are primarily for machine consumption; they define a set of allowed attributes, use controlled vocabularies to express values of attributes, and are constrained in their expression by the need to be processable by different types of algorithms. This contrast is what makes text summaries of datasets so relevant for HCI – these are often the first 'point of interaction' between a user and a dataset (Koesten et al., 2017). Beyond that, we nevertheless believe that some of their most common content and structural patterns can inform the design of automatic metadata extraction methods, which in turn could improve dataset search, ranking, and exploration. For instance, knowing that the number of rows and headers in a dataset helps users to determine its relevance means that these comparably easily extractable attributes could be included in automatic metadata extraction methods. Our results point to a number of attributes that could easily be extracted, but for which there is no standard form of reporting in general-purpose metadata schemas. These include descriptive attributes such as the mentioned numbers of rows and headers, possible value types and ranges, as well as different levels of granularity of temporal or geospatial information. A one-sentence summary, which has also been found to be useful by Yu et al. (2007) in a study on expert summarisation of time series data, or meaningful semantic groups of headers, are more complex to create. Further complex features include the variety of elements which describe quality judgements and uncertainty connected to the data; and the identification of a key column.
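As a purely illustrative sketch of how such easily extractable attributes could sit alongside standard metadata, the record below uses placeholder values only; the top-level field names are loosely inspired by schema.org and DCAT rather than exact property names, and everything under "extension" is hypothetical and not part of any standard.

```python
# Illustrative record: standard-style metadata plus hypothetical extension fields.
dataset_record = {
    "title": "Example dataset",          # placeholder, loosely schema.org/DCAT style
    "format": "text/csv",
    "temporal_coverage": "2015",
    "spatial_coverage": "United States",
    "extension": {                       # hypothetical, non-standard fields
        "number_of_rows": 468,           # placeholder value
        "number_of_columns": 34,         # placeholder value
        "temporal_granularity": "daily",
        "spatial_granularity": "city",
        "column_value_ranges": {"age": [16, 87]},  # placeholder range
    },
}
```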

4.7.2 Summary tool

The dataset summary template is an initial solution for guiding a data publisher through the summary writing process. Based on our findings we propose an initial prototype of a dataset summarisation tool. The tool takes a CSV file as an input, performs simple pre-processing tasks and subsequently guides the data publisher with a questionnaire through the summary writing process. It then uses a rule-based template to produce a summary, both in a textual as well as a table representation (as used in the experiment in Chapter 5). The resulting summary can be edited by the writer and subsequently downloaded as either a text file, a table, a CSV file, or a JSON file. (Being able to edit and control the summary output is recommended in the literature on text generation solutions, for instance Sripada et al. (2004).) The design, requirements and interaction flow of the tool can be seen as part of the contributions of this thesis; the technical development, however, was done primarily by a developer. The requirements include the template questions with corresponding sub-questions aimed at guiding and supporting the summary creation process. They specify the expected input mode for each answer (e.g. text input) and whether or not the answer is required.
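The snippet below is a minimal, hypothetical sketch of such a rule-based filling step: questionnaire answers keyed by template question number are turned into a short text and a JSON export. It is not the code of the prototype in the GitHub repository referenced below; the sentence patterns and function names are our own.

```python
# Illustrative rule-based rendering of questionnaire answers into a summary.
import json

def render_summary(answers: dict) -> str:
    """answers maps template question numbers (1-9) to free-text responses."""
    parts = []
    if answers.get(1):
        parts.append(answers[1].rstrip(".") + ".")
    if answers.get(2):
        parts.append("The file is " + answers[2] + ".")
    if answers.get(3):
        parts.append("Its columns cover " + answers[3] + ".")
    if answers.get(4):
        parts.append("Typical values are " + answers[4] + ".")
    labels = {5: "Source", 6: "Time", 7: "Location",
              8: "Notes on quality", 9: "Points of interest"}
    for number, label in labels.items():
        if answers.get(number):
            parts.append(label + ": " + answers[number] + ".")
    return " ".join(parts)

def export_summary(answers: dict, path: str) -> None:
    """Write the rendered text and the raw answers to a JSON file."""
    with open(path, "w") as f:
        json.dump({"summary": render_summary(answers), "answers": answers}, f, indent=2)
```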

Figures 4.7 to 4.9 provide an overview of the interactive questionnaire; a detailed description of the requirements, as well as of each step of the summary writing process, can be seen in Appendix B.7 as well as on a GitHub repository17. Questions marked with a red asterisk are required, others are optional.

17https://data-stories.github.io/data-summary/

Figure 4.7: Dataset summary tool: Landing page

Figure 4.8: Dataset summary tool: Questionnaire (1/2)

Figure 4.9: Dataset summary tool: Questionnaire (2/2)

4.8 Limitations

The dataset searching diaries consisted of set questions that asked the person writing the response to think about the partially subconscious selection process for datasets in an abstract way, and required them to articulate their information needs. Although that is a potentially complex task, our findings suggest that participants expressed real information needs and the results generally overlapped with those in Chapter 3. However, observational studies could be done in future work to confirm or complement these findings through different methods. This could include a controlled lab study, with several means to log user behaviour such as search queries and refinements, session length, eye-tracking and voice-recording. The diaries described in this study were conducted over a period of two weeks, with a diary record submitted only after a data search session. As the results are self-reported, we cannot verify whether the diaries contained all search sessions students conducted during that time. However, we believe the diary entries contained valuable insights into dataset selection criteria in this context and have the benefit of being conducted in a more natural setting than a lab experiment, without being intrusive.

There are several confounding factors in the task of summary generation, due to the complexity of the task, which was also discussed in previous research on textual summary creation (Bernstein et al., 2015). The overarching aim of this study was to gain an understanding of people's conceptualisation of data, within the boundaries of this task (instructions, environment, time constraints). We did not specify the desired output in our lab experiment in terms of structure, style, choice of features and type of language, as we wanted to see what type of summaries people produce without guidance.

The study was carried out using files in CSV format; while we assume that elements of the summaries, particularly the high-level information extraction, would likely remain the same for all structured or semi-structured data, the description of the structure and representation (such as the number of rows, headings, etc.) of a dataset using a different format might vary. We found that the particular datasets influenced the composition of the summaries in some instances, such as quality statements, geospatial attributes and provenance. However, despite these differences, we believe that there were sufficient commonalities in the summaries, both between datasets and the methods used, to derive recommendations and identify directions for further improvement.

Our datasets were relatively small, all in the same format, openly available and (as they came from different news sources) represent topics of potential public interest. While we do not believe that the resulting summaries are exhaustive for all data search needs, we believe they are applicable to the majority of open governmental and research datasets published on the web, in that they can give an initial insight into the dataset and have the potential to significantly improve the data search experience. We acknowledge that in any attempt to develop a more standardised way of documentation in a domain as open as data search, guidelines will not fit every scenario to the same extent. This is why we chose a variety of dimensions in the dataset sample, aiming for a template that covers different types of data and could potentially be extended for more specific requirements. Due to the explorative nature of our research question, we believe there is a large space for further research investigating the applicability and comprehensiveness of these summaries for other types of data, and for them to be tailored to domain-specific contexts.

Participants in the lab experiment were data literate and used data in their work, but did not necessarily classify themselves as data professionals. As a result, they may not have been aware of additional needs of data professionals, such as information on licensing or formatting, that might have been mentioned had they had more specialised knowledge. Furthermore, we suspect personal data would in reality play a bigger role in a different study, for relevant datasets. More research would be needed to understand how summaries would change when sensitive information is present.

We used publicly available datasets that are not known to be popular, though we cannot be certain that none of our participants were familiar with the datasets. However, literature on text summarisation found that prior knowledge did not have significant effects on written summarisation performance (Yu, 2009). While we believe that there is intrinsic value in textual summaries of datasets - as they can not only be used to inform selection by users, but could also be useful in search - we do not test the best representation of summary content in this work. Further studies are needed to determine optimal presentation modes of summary content for user interaction in a dataset selection activity.

4.9 Summary

This work contributes to the emerging field of human structured data interaction by presenting, to the best of our knowledge, the first in-depth characterisation of human-generated dataset summaries. The two studies helped us identify, on the one hand, dataset attributes which people find useful to make sense of a dataset, and, on the other hand, attributes they choose to describe a dataset to others. Both informed the design of the summary template. Our aim was to create practical, user-centric guidelines for data summarisation that reflect the needs and expectations of data consumers rather than what data publishers consider important.

With the overabundance of structured data, techniques to represent it in a condensed form are becoming increasingly important. Text summaries serve this function and they have the potential to make data on the web more user friendly and accessible. The studies described in this chapter contribute to a better understanding of human-data interaction, identifying attributes that people consider important when they are describing a dataset. We have shown that text summaries in our study are laid out according to common structures; contain four main information types; and cover a set of dataset features (RQ1). This enables us to better define evaluation criteria for textual summaries of datasets; gives insights into selection criteria in dataset search; and can potentially inform metadata standards.

We conclude that our results are consistent enough between different participants and between different types of datasets to assume their generalisability for our scenario (RQ2). We found general overlap between the information needs expressed in the data-search diaries (RQ1) and in the summaries created as a result of this study. Based on a subset of attributes, we found that summaries of data practitioners have a higher prevalence of provenance, quality statements and usage ideas, as well as slightly more geospatial information. We also found that a number of attributes depend more on the dataset than others, which could influence the application of the dataset summary template.

The summary template is primarily meant as a tool for data publishers, but also for data scientists and engineers. It could be integrated into data publication forms alongside common metadata fields. It could also help build data-to-text algorithms that do a better job at reflecting the information needs and expectations of summary readers, and improve dataset indexing strategies, which currently rely on metadata (Reiche and Höfig, 2013; Marienfeld et al., 2013). The findings of the two studies also suggest much needed extensions to existing metadata standards in order to cover aspects such as the numbers of rows and columns in a dataset; the levels of granularity of temporal and geospatial information; quality assessments; and meaningful groupings of headers.

Our results further suggest that crowdsourcing could be applied for large-scale dataset summarisation; however, the validity would need to be studied in more depth. This study gives first insights into the feasibility of such an approach. Furthermore, when indexing dataset content to support search, we need to make a selection of important attributes based on what people search for and choose to summarise about a dataset. These attributes might vary in domain-specific contexts, or might require extension to be more conclusive in specific data search scenarios. In that context, it would be interesting to investigate summaries created, for instance, by researchers from different fields as well as by statisticians or professional data scientists, and to examine commonalities and differences.

Chapter 5

Summary evaluation

While we discussed the content of dataset summaries in detail in Chapter 4, in this chapter we focus on a study to evaluate such summaries in a data search scenario.

5.1 Motivation

As discussed in Chapter 2, in general web search we are used to being presented with a snippet, which is the short summarising text that is returned by search engines. This helps users to make a decision about the relevance of a search result when searching for documents (Bando et al., 2010). The snippets are automatically generated, usually based on content from the web pages and displayed on a Search Engines Result Page (SERP) (Marchionini and White, 2007). Search results on data portals are displayed in a similar way to web search, with a title and short snippet following the ten blue links paradigm. Clicking on a result commonly takes the user to a page that contains metadata and a description of the dataset.

Little research has to date looked specifically at dataset surrogates for SERPs and how to support dataset-specific selection scenarios. As discussed in Chapter 4, we currently have no established way of representing datasets as search results. While there is research on automated snippet generation from text (Au et al., 2016), it exists mainly in research settings and does not provide anything close to the same user experience that we know from web search. One of the open questions in this space is how to present datasets as search results to users, both from a content as well as a presentation point of view.

The study in this chapter is, to the best of our knowledge, the first to investigate the effect of dataset summaries on interaction with the SERP. We examine how presenting dataset summaries (as described in Chapter 4) as 'snippets' affects user interaction with the SERP and their ability to select relevant search results. We are interested in how the content of result summaries for datasets affects the ability to select relevant over

non-relevant items. We investigate search behaviour and performance by varying the SERP between four conditions.

We created four SERPs that vary according to their content (and to some extent their layout). All four presentation modes of the SERPs follow a traditional ten blue links paradigm, but they vary in snippet content and layout. Two conditions include the dataset summaries from Chapter 4 as snippets: one in textual form, the other as a structured table. The two control conditions include a preview of the first few rows of the dataset, or only the title of the dataset. All conditions further display the file format and the publisher of the dataset and are described in more detail in Section 5.3 below.

We conducted a between-subjects online experiment with the following research questions:

RQ1: How does the dataset presentation mode (text summary, table summary, preview, title) affect search time, performance, user behaviour and experience?

RQ2: How does the presentation mode of a dataset summary (as text or as a table) affect search time, performance, user behaviour and experience?

We hypothesise that SERPs displaying dataset summaries will result in better performance (i.e. selection of more relevant results), more confidence in the selected results and a better search experience. We measure several metrics for performance, interaction and experience. The findings contribute to a better understanding of user behaviour in dataset selection scenarios, and inform the design of data-search SERPs and the development of automatic snippet generation.

5.2 Related work

5.2.1 Snippets

The search engine results page is core to a user's success and experience in interactive information retrieval (Maxwell et al., 2017). People spend most of their search time examining results returned by the search system (Marchionini and White, 2007). SERPs for general web search have been studied from numerous angles (layout, number of results, snippet length, content and features, ranking of results, etc. (Maxwell et al., 2017; Kelly and Azzopardi, 2015; Kammerer and Gerjets, 2010; Marcos et al., 2015)). However, we still know little about what SERPs for structured data search should look like. We see a large design space to explore how SERPs for datasets should be presented to users. Within this space, this work focuses on the usefulness of dataset summaries as a component of SERPs.

Data requires context to create meaning and to make sense of it (Dervin, 1997). Thomas et al. (2015) show that dataset repositories have poor search over and inside datasets. It is difficult for a user to tell from a repository whether a useful dataset is available, and this problem is only likely to get worse the more datasets are available. The display of a search result should provide the user with sufficient information to judge the relevance, quality and usability of the results, and often fails to do so (Koesten et al., 2017). In user studies with social scientists, Kern and Mathiak (2015) found that the quantity and quality of metadata are far more critical in dataset search than in literature search, where convenience prevails.

Currently, web search engines construct snippets from two or three sentences extracted from the document which have a close relationship with query terms (Bando et al., 2010). The ranking position has been shown to be the most influential factor in determining relevance for web search (Craswell et al., 2008); however, studies on snippets show that user performance, especially for informational tasks, can increase significantly with informative snippets (Cutrell and Guan, 2007).

Several studies focus on automatically generating text summaries for data; however, the focus is on the technicalities of the algorithm, rather than on how useful the summaries are in a search process. Most proposals are domain specific, for instance for medical data (Scott et al., 2013), sports data (Wiseman et al., 2017) or financial data (Kukich, 1983). None of these have been evaluated from a user experience point of view in a search context. Other work has looked at creating summaries of databases, or query-based subsets of databases, which try to generalise content to higher-level categories to produce a more condensed version of a column, or a table (Saint-Paul et al., 2005). Other approaches focus on summarising particular data types, such as time series (Yu et al., 2007; Sripada et al., 2003) or graphs (Liu et al., 2014). Au et al. (2016) describe a method to produce query-biased summaries from tabular data which are presented as a graph or as a table in a data search scenario. In an initial small-scale user study, these tabular summaries performed slightly better than textual summaries. However, the automatically-created textual summaries used in this study are not coherent or easy to read, and might therefore not represent the potential benefits of textual dataset summaries in a data search scenario.

5.2.2 SERP design

Interface design plays a key role in representing context (Greenberg, 2001). There is a large body of research exploring how SERP design and composition influences user behaviour. For instance, studies such as by Kelly and Azzopardi (2015) have explored how the number of results per page influences search behaviour and user experience. Resnick et al. (2001) looked at how a conventional list SERP compared against a tabular one. The columns of the table represented the different elements of the search results (title, excerpt, URL, metadata). The results suggest that the tabular interface supported a wider variety of search strategies than the conventional list. While this is not comparable to our scenario, it gives insight into the context of comparing our two summary versions, which are natural language text and a tabular format (however, in our case on the level of an individual search result). Several studies compare list and grid interfaces. For instance, Kammerer and Gerjets (2010) noted differences in user behaviour and perceived trustworthiness of search results depending on their ranked position in a list; these effects were less pronounced when results were presented in a grid interface. For other search verticals, such as for images (Xie et al., 2017), videos (Schoeffmann et al., 2015), or in e-commerce (Rowley, 2000), it is common to present results differently to best support users in their specific information needs for the source. Some work has explored snippets for aggregated search results (Marcos et al., 2015; Arguello and Capra, 2014). We see a parallel with our scenario in the aim of condensing larger amounts of information into one snippet, similar to when trying to summarise large datasets.

5.3 Methodology

We conducted a between-subjects experiment in which participants were randomly assigned to one of four interface conditions, which are described in Table 5.1 and in Appendix C.1 to C.7. Two conditions (type 2 and 3) acted as baselines. The other two conditions were designed based on prior work in Chapter 4 to create general-purpose dataset summaries to aid search.

Viewtype      Description
0 (text)      dataset summary as text snippet
1 (table)     dataset summary as table
2 (title)     low baseline
3 (preview)   higher baseline

Table 5.1: Interface conditions

As shown in Figure 5.3, all tasks had a search box resembling general web search and a search button. All SERPs presented in the task followed a ten blue links paradigm with 10 results per page. Each participant completed three search tasks and was presented with one interface condition, which remained unchanged across the three tasks.

The two conditions including a dataset summary can be seen in Figures 5.1 and 5.2, and each presentation mode can be seen in Appendix C.4.

Figure 5.1: Viewtype 0 - text

Figure 5.2: Viewtype 1 - table

The independent variable is the presentation mode of the SERP, with four conditions:

• Textual summary of the dataset according to the template presented in Chapter 4, plus metadata (file format, publishing organisation, publishing date).

• Tabular summary of the dataset according to the template presented in Chapter 4, plus metadata (file format, publishing organisation, publishing date).

• Title of the dataset plus metadata (file format, publishing organisation, publishing date).

• Preview of the first few rows of the dataset plus metadata (file format, publishing organisation, publishing date).

Recruitment

The 174 participants were recruited using social media and mailing lists. These included lists targeted at university students, researchers and open data users. The recruitment email and a description of participant demographics can be seen in the Appendix (C.2 and C.6). Participants were incentivised with the option of taking part in a prize draw to win one of two £50 Amazon vouchers.

5.3.1 Tasks

The survey consisted of a consent form, a pre-task questionnaire, a set of tasks (described below) and three post-task questions. The tasks were chosen and constructed based on tasks described in the information seeking study from Chapter 3, and informed by Borlund (2003)'s notion of simulated work-tasks. These are particular types of tasks in which a pre-defined goal can be reached by an assigned unit of work. The goal is situated in an 'embedding task' which describes the cognitive environment in which the search is conducted and establishes the criteria for evaluating the search results. In our study we focused on the problem that needs to be solved, as it makes the participant understand the objective of the search (Kelly, 2009). All tasks can be understood as requiring informational search activities in the sense of Broder (2002), who distinguishes informational from navigational and transactional search activities. The tasks were rotated in the experiment to prevent learning effects and fatigue (Kelly, 2009).
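As an illustration of such a rotation, the sketch below generates a cyclically rotated task order from a participant index, so that each task appears equally often in each position. The task labels and the assignment logic are assumptions for illustration, not the exact procedure used in the study.

from itertools import cycle, islice

# Hypothetical task identifiers corresponding to Table 5.2.
TASKS = ["pay_gap", "crime", "obesity"]

def rotated_order(participant_index: int) -> list:
    """Return a task order rotated by the participant index,
    so that every task appears equally often in each position."""
    start = participant_index % len(TASKS)
    return list(islice(cycle(TASKS), start, start + len(TASKS)))

if __name__ == "__main__":
    for p in range(6):
        print(p, rotated_order(p))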

Table 5.2 shows the tasks used in the study. All three tasks were structured in the same way, including a temporal and a geospatial component.

Information seeking tasks for data

1 Pay Gap Imagine you want to write a report on the gender pay gap in the UK and how it has changed over the last five years across industries. You need to find data on people’s salaries, split by gender and industry, over the last five years so you can understand in which industry the gender pay gap has decreased the most.

2 Crime Imagine you want to write an article about crime rates in London over the last ten years. You need to find data on crime rates, split by area, over the last ten years so you can understand in which part of London crime rates have changed the most.

3 Obesity Imagine you want to write an article about obesity trends over the last thirty years in the UK. You need to find data on obesity, split by age, over the last thirty years so you can understand which age ranges have the highest growth in obesity.

Table 5.2: Information seeking tasks for data

5.3.2 Process

The survey was built using SurveyJs (https://surveyjs.io/Overview/Library/) and hosted on Heroku. No identifying data was collected about participants, with the exception of email addresses if participants wanted to take part in the prize draw we used for incentivisation; in this case the data was anonymised for analysis.

Participants were approached via a call on social media or email, in which the study goal and process were described with a link to the survey. The first page included information about the study, data protection and an estimate of the time it would take to participate. This page served as the informed consent form, which participants needed to agree to by clicking. They completed a pre-task survey, including demographic questions on age range, gender, country of residence and job role. They were further asked three questions, each captured on a Likert scale: (i) frequency of search for information online; (ii) frequency of search for data online; (iii) whether they use data for their work and, if yes, what for. Data was defined as 'information (can be numbers or words) in tables, spreadsheets or databases'.

Participants were then presented with a paragraph describing a task with an information need for which they needed data as the information source, as can be seen in Table 5.2. The interface displayed a search bar resembling web search, highlighting that the search is for data (Figure 5.3). They typed a query and were subsequently presented with ten search results, all of which were datasets.

To make the search activity as realistic as possible the interface displayed a search box and subsequently search results. Participants could always scroll up to the task, which was indicated by a scroll bar. They were allowed to freely formulate search queries in an interface that resembles traditional web search (Figure 5.3).

Figure 5.3: Searchbox

All participants were presented with the same search results, in different order, displayed in different ways according to the experimental condition. An example of a search result for viewtype 1 (table) can be seen in Figure 5.4.

Figure 5.4: Viewtype 1 (table) of a relevant dataset for Task 1

Participants were asked to select those datasets that they perceived to be relevant for the described information need. There was no facility to download the data, as we were only interested in the decision processes based on the SERP. Subsequently they were asked to answer three post-task questions:

• How difficult did you find the task?
• Do you think you had enough information to judge whether a dataset was useful for the task?
• Was there anything missing that would have helped you to decide if the dataset was useful for the task?

After selecting search results (by clicking on them) participants were asked to review their selection of datasets and rate their confidence in whether each selected dataset was useful for the given task, as shown in Figure 5.5.

Figure 5.5: Confidence rating

5.3.3 Corpus

Datasets were collected by performing Google searches and searches on two UK data portals for the given tasks. All datasets were among the top-20 search results. Each SERP also contained one control dataset, which was clearly off-topic and was therefore not within the top-20. Usefulness judgements for each dataset against the respective tasks were performed by three independent experts, all of whom work with data in their jobs. The judgements were made on a 5-point Likert scale (not useful to very useful). The results were aggregated to be binary: all ratings above 3 were treated as relevant and the final relevance judgement was established using a majority vote. The ratings reached moderate agreement amongst the three raters (Fleiss' kappa, κ = 0.52) (Landis and Koch, 1977). In total, each SERP contained 10 results.
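As a sketch of how such judgements could be aggregated, the snippet below binarises 5-point ratings, takes a majority vote per dataset and computes Fleiss' kappa using the statsmodels implementation; the ratings shown are invented for illustration and are not the study data.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical 5-point usefulness ratings: one row per dataset, one column per rater.
ratings = np.array([
    [5, 4, 4],
    [2, 3, 2],
    [4, 5, 3],
    [1, 1, 2],
])

# Binarise: ratings above 3 count as relevant (1), otherwise not relevant (0).
binary = (ratings > 3).astype(int)

# Majority vote across the three raters gives the final relevance label.
majority = (binary.sum(axis=1) >= 2).astype(int)

# Fleiss' kappa on the binarised ratings measures inter-rater agreement.
counts, _ = aggregate_raters(binary)  # rows: datasets, columns: category counts
kappa = fleiss_kappa(counts)

print("relevance labels:", majority)
print("Fleiss' kappa:", round(kappa, 2))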

To avoid position bias (Craswell et al., 2008) the result list was randomly reordered for each participant and each task. For task 1 (pay gap) there were four relevant results in the result list, for task 2 (crime) three and for task 3 (obesity) there were five. Screenshots of the presentation mode can be seen in Appendix C.4.

Summaries

For conditions 0 (text) and 1 (table), summaries of the corresponding datasets were presented as a textual snippet or as a table. These were manually created based on the template provided in Chapter 4. The content of the textual summary and the table is the same (using the attributes of the template); the difference is in the presentation, and the table contained less natural language text.
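To illustrate how the same template content can feed both presentation modes, the sketch below renders a small set of summary attributes once as a sentence-style snippet and once as attribute-value table rows. The attribute names and values are invented examples loosely inspired by the template discussed in Chapter 4, not the exact fields or datasets used in the study.

# Hypothetical summary attributes, loosely based on the Chapter 4 template.
summary = {
    "title": "Example salary dataset",
    "rows": 412,
    "columns": 8,
    "headers": "industry, gender, median weekly pay, year",
    "temporal coverage": "2013-2018, yearly",
    "geospatial coverage": "United Kingdom, by region",
}

def as_text_snippet(s):
    """Viewtype 0: a short natural-language paragraph."""
    return (f"{s['title']} contains {s['rows']} rows and {s['columns']} columns "
            f"({s['headers']}). It covers {s['temporal coverage']} "
            f"for {s['geospatial coverage']}.")

def as_table_rows(s):
    """Viewtype 1: one (attribute, value) row per template attribute."""
    return [(key, str(value)) for key, value in s.items()]

print(as_text_snippet(summary))
for attribute, value in as_table_rows(summary):
    print(f"{attribute:22} {value}")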

5.3.4 Metrics

In this section we describe the choices in our experimental design. In our setup we have four conditions, two of which display the dataset summaries in different ways, and two types of baselines to compare with: a preview as it is commonly used on data portals (e.g. https://data.gov.uk/) and, as an alternative, merely the title of the dataset. This design is in line with the literature: in IIR evaluations baselines are often introduced as one of the experimental conditions, instead of being measured in advance, as described by Kelly (2009) in her extensive work on evaluating IIR systems with users. We chose a between-subjects design to not expose subjects to all conditions of the experiment and so avoid contamination (Kelly, 2009). This allows us to focus on the effect of the interface rather than requiring a conscious choice from participants between interfaces. The primary objective of the design is to compare the four presentation modes and not the tasks. The tasks function as control variables and to discuss potential learning effects; we expect minimal differences between them. The order of tasks is controlled through rotation (Kelly, 2009).

We collect self-reported user characteristics in the form of a pre-task questionnaire to describe our sample and potentially explain differences in performance (Boyce et al., 1994).

Time-based measures describe the time participants interact with the search results, which is a common measure in IIR to understand the e↵ectiveness of a SERP as well as potentially the depth of engagement with the search results (Kelly, 2009).

We recorded the time from issuing a query (through pressing enter or the 'search data' button) until selection of all datasets considered useful and pressing 'next' to leave the page. We then measured differences between viewtypes in the time it took participants to select datasets, using a Mann-Whitney U test, as the data was not normally distributed.

Performance and interaction-based measures are related to the outcome of the interaction, such as the number of relevant documents selected and the mean average precision (the mean of the average precision scores for each query). These are commonly used in IIR experiments and typically computed from log data (Fenichel, 1981). We report on: the total number of selected search results, the number of relevant (true positives) and non-relevant (false positives) selected search results, and participants' self-reported confidence that each selected result is in fact relevant for the task.
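A minimal sketch of how the selection-based measures could be derived from one participant's selections and the expert relevance labels; the dataset identifiers and selections below are invented for illustration.

# Hypothetical ground truth: dataset ids judged relevant for one task.
relevant = {"d02", "d05", "d07", "d09"}

# Hypothetical selection made by one participant for that task.
selected = {"d02", "d03", "d07", "d08", "d09"}

true_positives = selected & relevant    # relevant and selected
false_positives = selected - relevant   # selected but not relevant
precision = len(true_positives) / len(selected) if selected else 0.0

print("selected:", len(selected))
print("true positives:", len(true_positives))
print("false positives:", len(false_positives))
print("precision:", round(precision, 2))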


As self-reported experience measures we chose 'usefulness' and 'perceived amount of information', measured on a 5-point Likert scale, as can be seen in Appendix C.5. We elaborate in Chapter 2 on the difference between relevance and usefulness in the context of IIR evaluations. We test differences between viewtypes using a Kruskal-Wallis test, as we compare four groups and the data is considered to be ordinal.

We further present a qualitative analysis of the free-text comments participants could add after completing the tasks, to complement the quantitative results and add context by giving insights into their experience during the task (Bryman, 2006).

5.3.5 Ethics

The study was approved by the University of Southampton's Ethical Advisory Committee (ERGO number 41384). Consent was given by participants through a tick box to initiate the online survey. The consent form can be seen in Appendix C.3. It was possible to take part in the study anonymously.

5.4 Findings

We report the results of our study according to time-based, performance-based and interaction-based metrics, looking at differences between the four viewtypes.

5.4.1 Participants

The majority of our 174 participants were from the UK (n = 119), with others from the US (n = 9), Austria (n = 7), Germany (n = 6), Italy (n = 4) and France (n = 4). A detailed breakdown of countries can be seen in Appendix C.6. Most participants were aged 18-25 (n = 61) or 26-35 (n = 62); the rest of the participants were older. 61% of participants were female (n = 107) and 35% male (n = 62); the rest either classified themselves as other or preferred not to say. While this might not be a representative distribution of gender amongst data users, we do not assume that gender influences the results and report this number merely to describe our sample in more detail. Most participants (n = 162) reported that they very often search for information online. 81% of all participants stated that they use data for their work (n = 142), so we assume that the sample can be described as having at least basic data literacy. When asked how often they specifically search for data online, 34% of the participants reported searching for data online very often and 35% at least once a week. 28% rarely search for data online, but only 1% of the participants stated that they never search for data online. We report results by viewtype; the number of participants was n = 42 for viewtype 0 (text), n = 37 for viewtype 1 (table), n = 42 for viewtype 2 (title) and n = 53 for viewtype 3 (preview).

5.4.2 Time-based results

For time-based results we recorded the time from issuing a query until selection of all datasets considered useful and pressing 'next' to leave the page. The dataset selection task took participants on average between 2.25 and 5.5 minutes to complete, as can be seen in Table 5.3.

Viewtype               mean    med     std     min    max
summary as text (0)    257.3   199.6   208     8.7    766.6
summary as table (1)   333.2   273.8   305     24.6   1633.9
title (2)              135     123.9   72.8    24.2   396.7
preview (3)            215.4   179.7   159.2   39.6   967.1

Table 5.3: Total time across all tasks per viewtype in seconds

We tested the data for normal distribution using the Shapiro-Wilk test and by plotting the data, and found that it was not normally distributed.

Our hypothesis was as follows:

H0 = the distribution of total task time between viewtypes is equal
H1 = the distribution of total task time between viewtypes is not equal

A two-sided Mann-Whitney U test showed that viewtype 2 (title) was statistically significantly different from all other viewtypes (p < 0.0001), using a Bonferroni-corrected significance level ᾱ = 0.0083, where ᾱ is the significance level α = 0.05 divided by the total number of comparisons (n = 6). We know that for viewtype 2 (title) participants had less information on the SERP to read, as can be seen in Appendix C.4, so it is unsurprising that they took less time to select datasets.
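The analysis above could be reproduced along the lines of the following sketch, assuming per-participant total task times grouped by viewtype; the times listed are placeholders rather than the study data.

from itertools import combinations
from scipy.stats import shapiro, mannwhitneyu

# Placeholder total task times in seconds, grouped by viewtype.
times = {
    "text":    [180, 220, 260, 310, 400, 150],
    "table":   [250, 300, 340, 420, 510, 280],
    "title":   [90, 110, 130, 160, 200, 120],
    "preview": [160, 200, 230, 260, 330, 190],
}

# Check normality per group (Shapiro-Wilk); non-normal data motivates rank-based tests.
for name, values in times.items():
    stat, p = shapiro(values)
    print(f"Shapiro-Wilk {name}: W={stat:.2f}, p={p:.3f}")

# Pairwise two-sided Mann-Whitney U tests with a Bonferroni-corrected alpha.
pairs = list(combinations(times, 2))
alpha_corrected = 0.05 / len(pairs)  # 0.05 / 6 comparisons = 0.0083
for a, b in pairs:
    u, p = mannwhitneyu(times[a], times[b], alternative="two-sided")
    print(f"{a} vs {b}: U={u:.1f}, p={p:.4f}, significant={p < alpha_corrected}")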

Non-significant observations

However, we further observed that on average participants took longest for viewtype 1 (table) (mean = 333 s, std = 305 s), followed by viewtype 0 (text) (mean = 257 s, std = 208 s), and were slightly quicker for viewtype 3 (preview) (mean = 215 s, std = 159 s). The mean and standard deviation for the total time per viewtype can be seen in Figure 5.6. We hypothesise that the longer overall task time with viewtype 1 (table) could point to a more in-depth engagement with the table representation. The structured layout of the table potentially prompts a more thorough investigation by participants, which takes more time.

Figure 5.6: Total time grouped per viewtype

5.4.3 Interaction and performance-based results

Total number of selected results

Each participant could choose a maximum of 10 results for each of the three tasks. The number of relevant results was three for task 'crime', four for task 'gender' and five for task 'obesity'. Hence we expected a total of 12 correct results over all tasks. We observed that the average number of selected results over all tasks was lower (mean of 10.8 for viewtype 0, 10.1 for viewtype 1, 10.6 for viewtype 2 and 9.6 for viewtype 3, compared to the total of 12 correct results across all tasks). A two-sided Mann-Whitney U test showed no statistically significant differences between viewtypes in the total number of selected results.

Non-significant observations

However, we observed differences in the average number of selected results per viewtype. For viewtype 1 (table) and viewtype 3 (preview) participants selected slightly fewer results overall. This could point to a more targeted selection process, potentially because the table and the preview condition displayed content that was easily accessible and matched the participants' selection criteria. However, this assumption would need to be validated in future work.

Figure 5.7: Total number of datasets selected per viewtype

Viewtype      T1    T2    T3    Av. selected   Av. relevant selected (out of 12)   min   max
text (0)      3.6   3.5   3.7   10.9           3.9                                 3     24
table (1)     3.2   3.1   3.8   10.1           3.6                                 3     24
title (2)     3.5   3.4   3.8   10.7           3.6                                 3     23
preview (3)   2.8   3.2   3.5   9.6            3.45                                3     23

Table 5.4: Selection of search results (mean). T1 = Task 1 'gender', T2 = Task 2 'crime', T3 = Task 3 'obesity'

True positives

We describe the number of correctly selected relevant results across the three tasks. A two-sided Mann-Whitney U test showed no statistically significant differences in the number of selected and relevant results between viewtypes.

Non-significant observations

However, we observed that across all tasks viewtype 0 (text) and viewtype 2 (title) resulted in a higher median of relevant and selected results. Viewtype 2 (title) showed a larger range in comparison to viewtype 0 (text). Viewtype 1 (table) and viewtype 3 (preview) performed similarly to each other.

Viewtype      mean   med   std    min   max
text (0)      3.88   4.0   2.03   1     8
table (1)     3.65   3.0   2.06   1     9
title (2)     3.62   4.0   1.75   0     8
preview (3)   3.49   3.0   1.74   1     7

Table 5.5: True positives - selected relevant results

The average number of correct choices over all tasks was four. The table and the preview viewtype both led to fewer wrongly selected results in total. This mirrors the tendency to select fewer search results for viewtype 1 (table) and viewtype 3 (preview), as can be seen in Figure 5.8.

Figure 5.8: True positives selected per viewtype

False positives

We describe the number of non-relevant selected results across the three tasks. A two-sided Mann-Whitney U test showed no statistically significant differences in the number of selected and non-relevant results between viewtypes.

Non-significant observations

The maximum number of non-relevant results a user could choose was n = 18. We observed that across all tasks the average number of selected non-relevant results was lower for viewtype 1 (table) (mean = 6.46, median = 5, std = 4.17) and viewtype 3 (preview) (mean = 6.13, median = 5, std = 3.69) and higher for viewtype 2 (title) (mean = 7.08, median = 6.5, std = 3.85).

Figure 5.9: False positives selected per viewtype

Viewtype      mean   med   std    min   max
text (0)      6.98   6.0   3.80   1     16
table (1)     6.46   5.0   4.17   1     17
title (2)     7.08   6.5   3.85   1     17
preview (3)   6.13   5.0   3.70   1     16

Table 5.6: False positives - selected, but not relevant results

5.4.4 Interaction-based results

We collected interaction-based measures on a 5-point Likert scale. To understand differences between groups we used a Mann-Whitney U test, as Likert scale responses are considered ordinal (Kuzon et al., 1996).

Confidence

For each selected search result participants were asked to rate their confidence that the dataset is useful for the task on a 5-point Likert scale (1 - low confidence to 5 - high confidence).

Our hypothesis was as follows:

H0 = Confidence in selecting a useful dataset for the task is rated equally between all viewtypes
H1 = Confidence in selecting a useful dataset is not rated equally between all viewtypes

A Kruskal-Wallis test showed significant differences in confidence scores between viewtypes (H(2) = 8.34, p = 0.039). For pairwise comparison, a two-sided Mann-Whitney U test showed statistically significantly different mean confidence scores between viewtype 0 (text) and viewtype 1 (table) (p < 0.01), as well as between viewtype 1 (table) and viewtype 3 (preview) (p < 0.05). However, follow-up Bonferroni tests with a corrected significance level of ᾱ = 0.0083 showed that the significant difference existed only between viewtype 0 (text) and viewtype 1 (table); ᾱ is the significance level α = 0.05 divided by the total number of comparisons (n = 6).
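A sketch of this analysis pipeline (Kruskal-Wallis omnibus test followed by Bonferroni-corrected pairwise Mann-Whitney U tests) is shown below, with invented per-participant mean confidence ratings rather than the study data.

from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

# Invented per-participant mean confidence ratings (1-5 Likert scale), by viewtype.
confidence = {
    "text":    [3.0, 3.4, 3.7, 3.2, 3.9],
    "table":   [3.8, 4.1, 4.0, 3.6, 4.3],
    "title":   [3.5, 3.6, 3.9, 3.3, 3.7],
    "preview": [3.4, 3.7, 3.8, 3.5, 3.6],
}

# Omnibus test across the four viewtypes.
h, p = kruskal(*confidence.values())
print(f"Kruskal-Wallis: H={h:.2f}, p={p:.3f}")

# Post-hoc pairwise comparisons at a Bonferroni-corrected significance level.
pairs = list(combinations(confidence, 2))
alpha_corrected = 0.05 / len(pairs)
for a, b in pairs:
    u, p = mannwhitneyu(confidence[a], confidence[b], alternative="two-sided")
    print(f"{a} vs {b}: U={u:.1f}, p={p:.3f}, significant={p < alpha_corrected}")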

As can be seen in Table 5.7, the lowest average confidence score was 3.5, which, considering the 5-point Likert scale, is more confident than neutral. We observed that the confidence scores across all tasks were highest for viewtype 1 (table) (mean = 3.85, median = 4, std = 0.72). We hypothesise that the structured layout of the table enabled participants to easily target those attributes that matter for dataset selection in the context of the task.

Non-significant observations

There were no significant differences between tasks; however, task 3 showed a higher variability of confidence scores for viewtype 1 (table) (mean = 3.91, median = 3.95, std = 0.91).

Figures 5.10 and 5.11 show the average confidence rating per viewtype across all tasks. Figures 5.12 to 5.17 show the same information split by task to illustrate the slight variations in confidence for each of the tasks. The tasks varied in the number of relevant results that were represented in the result list, as well as in their topic. We see a slightly higher variation for viewtype 1 (table) in task 2, and to some extent in task 3. Task 1 had the lowest number of results selected as relevant, which might have an influence on this result. Overall, as expected, the differences between tasks were not significant.

Figure 5.10: Average confidence ratings per viewtype across tasks
Figure 5.11: Median confidence ratings per viewtype across tasks

Figure 5.12: Average confidence ratings per viewtype (task 1)
Figure 5.13: Median confidence ratings per viewtype (task 1)

Figure 5.14: Average confidence ratings per viewtype (task 2)
Figure 5.15: Median confidence ratings per viewtype (task 2)

Figure 5.16: Average confidence ratings per viewtype (task 3)
Figure 5.17: Median confidence ratings per viewtype (task 3)

Viewtype      mean   med    std    min   max
text (0)      3.5    3.67   0.56   1.9   4.7
table (1)     3.85   4.00   0.72   2.1   5.0
title (2)     3.62   3.67   0.62   2.0   4.7
preview (3)   3.60   3.67   0.57   2.4   4.7

Table 5.7: Confidence ratings across tasks based on a 5-point Likert scale

5.4.5 Experience-based results

We collected experience-based measures on a 5-point Likert scale.

To capture user experiences, we asked subjects to complete post-task questions, as described in Section 5.3.2. These questions asked about the perceived difficulty of the tasks (1 - not difficult to 5 - very difficult) and about whether the participants felt they had enough information to judge whether the datasets were useful for the tasks (1 - not useful to 5 - very useful).

We used a Kruskal-Wallis test as a non-parametric analysis of variance, which compares more than two groups of ordinal or continuous variables based on ranks.

Perceived difficulty

Perceived difficulty was captured in the post-task questionnaire on a 5-point Likert scale.

Our hypothesis was as follows:
H0 = The perceived difficulty is rated equally between all viewtypes
H1 = The perceived difficulty is not rated equally between all viewtypes

A Kruskal-Wallis test showed no significant differences between viewtypes for perceived difficulty.

Non-significant observations

However, Figure 5.18 shows a larger variability for viewtype 2 (title), with a tendency to rate perceived difficulty higher for this viewtype (mean = 3.6, std = 1.17, median = 4). Viewtype 0 (text) and viewtype 3 (preview) were perceived to be slightly less difficult (viewtype 0: mean = 3.3, std = 0.98; viewtype 3: mean = 3.25, std = 1.11).

Viewtype      mean   med   std    min   max
text (0)      3.3    3.0   0.1    1     5
table (1)     3.57   4.0   1.07   1     5
title (2)     3.56   4.0   1.17   1     5
preview (3)   3.25   4.0   1.12   1     5

Table 5.8: Perceived difficulty based on a 5-point Likert scale

Perceived amount of information

Perceived amount of information was captured in the post-task questionnaire on a 5-point Likert scale.

Our hypothesis was as follows:
H0 = The perceived amount of information is rated equally between all viewtypes
H1 = The perceived amount of information is not rated equally between all viewtypes

A Kruskal-Wallis test showed significant differences in the perceived amount of information per viewtype (H(2) = 14.80, p < 0.001), with a Bonferroni-corrected significance level ᾱ = 0.0083, where ᾱ is the significance level α = 0.05 divided by the total number of comparisons (n = 6). A two-sided, equally Bonferroni-corrected Mann-Whitney U test showed that the differences exist between viewtype 1 and viewtype 2 (p < 0.001), as well as between viewtype 1 and viewtype 3 (p < 0.01). Without the correction for multiple comparisons there was also a difference between viewtype 0 and viewtype 2 (p < 0.05).

Figure 5.19 shows that for viewtype 2 (title) the participants rated the amount of information shown to decide whether the dataset is useful for the task as less satisfactory than for the other viewtypes (viewtype 2: mean = 2.36, std = 1.12, median = 2).

Figure 5.18: Perceived difficulty, captured on a 5-point Likert scale

Figure 5.19: Amount of information, captured on a 5-point Likert scale

Viewtype      mean   med   std    min   max
text (0)      2.85   3.0   1.04   1     4
table (1)     3.38   4.0   1.32   1     5
title (2)     2.36   2.0   1.12   1     5
preview (3)   2.57   2.0   1.23   1     5

Table 5.9: Amount of information across the four conditions based on a 5-point Likert scale

5.4.6 Qualitative findings

After completion of all three tasks participants were asked whether they believed anything was missing in the presentation of the search results. In total we collected 102 comments. Qualitative results can add context to quantitative results and indicate where to focus research questions for further research (Bryman, 2006). The comments varied according to viewtype and are presented on an exemplary basis. Viewtype 0 (text) had 24 comments, viewtype 1 (table) 15 comments, viewtype 2 (title) 31 comments and viewtype 3 (preview) 32 comments. Note that, due to the between-subjects design, participants' comments do not reflect a comparison of different viewtypes, as they were only presented with one viewtype each. Hence, when they suggest different presentation types they were not prompted to do so and were not influenced by other viewtypes in the study, which arguably makes such suggestions stronger statements.

Comments for viewtype 0 (text)

Comments for viewtype 0 (text) mentioned that it would be useful to be presented with an excerpt or preview of the data, mostly to get an understanding of the column headers:

(P) Data set column headings
(P) Small sample from dataset to quickly evaluate included fields
(P) I would like to see a sample of the data (i.e. a couple of rows). More structured information (e.g. in bullet points or a table) would have been better, as it would have made the datasets easier to compare.

Some participants also mentioned preferring a more structured presentation than the text snippet and suggested a table presentation of the summary:

(P) Quick view of the data. Make it easier to scan the descriptions – table rather than paragraphs.

When presented with only a textual summary some participants expressed the need to get a better idea of the data itself, and mentioned the usefulness of being able to quickly see the included variables or columns.

(P) Easier way to filter by years included in data.

Comments for viewtype 1 (table)

On the other hand, when presented with the table summary, a number of comments also mentioned the desire to see a preview of the data:

(P) Example of the first 2 rows of the dataset.
(P) Seeing the actual data set.
(P) A preview of the dataset.

One participant mentioned that summaries need to be trusted as they present curated content:

(P) No everything is there... just need to believe it.

Comments for viewtype 2 (title)

For viewtype 2 (title) participants expressed the need for additional information about the dataset:

(P) For a lot of the items that I marked as less confident, I would like to know what the dataset covered.
(P) A short summary of the 'document'.
(P) The data content to decide whether it can use or not.

Some participants described particular attributes they would like to know about, such as temporal information, provenance or collection methods:

(P) Additional metadata e.g. date ranges.
(P) It would be nice to know how the data was derived.
(P) Maybe knowing at a glance the different variables included – like whether analysis is by age, or time etc. Also time spans covered rather than just year of publication.

Some explicitly mentioned previews of the data, to better understand what the dataset contains:

(P) A preview of eg. the header lines of a csv would be very useful, so that you can assess what information you might be downloading.

Some participants specifically mentioned that a summary or snippet of the data would help them decide:

(P) It would be helpful to offer a quick 1 to 2 sentence summary about the data within the dataset so that better judgement could be made about a dataset during the searching process.
(P) Having a snippet of the dataset would have been useful.

Others stated that given the information they were presented with (the title) they would need to open the dataset to make a selection:

(P) I would have looked at the chosen datasets before deciding on their usefulness. This might mean having to go through the ones not chosen at times.

In summary, the comments suggest that viewtype 2 (title) was missing information that participants need to select a dataset out of a pool of search results.

Comments for viewtype 3 (preview)

For viewtype 3 (preview) comments discussed the need for additional information about the dataset, such as provenance or temporal scope:

(P) More detailed metadata of when, how and by whom data were collected and the dataset was generated.
(P) Source of Data, more information on provider of data.
(P) More records e.g.: to see whether more than one year is covered [..].
(P) Sample size and date of when data was collected.

Other comments explicitly mentioned a summary:

(P) More detailed summary of the data.
(P) More data and a summary of what the data shows, if possible.

Others mentioned the need for further documentation and metadata, such as:

(P) Description of the dataset, source with confidence, visualizations would also be useful.
(P) Once the table was selected still to understand them was a bit hard because the legend of the table was missing. I at least would give hint to understand the information of the table (legend or what the acronyms means).

The comments for viewtype 3 (preview) suggest that additional information, beyond the first few rows of a dataset, would be perceived as helpful in the selection of datasets in a search scenario.

5.5 Discussion of findings

In this study, we investigated the influence of different search result presentation types on search performance, interaction and experience. The analysis was focused on addressing the two research questions: (RQ1) How does the dataset presentation mode (text summary, table summary, preview, title) affect search performance, user behaviour and experience? and (RQ2) How does the presentation mode of a dataset summary (as text or as a table) affect search performance, user behaviour and experience?

In this section we discuss the differences we observed regarding interaction- and performance-based metrics, as well as experience-based metrics, and how these potentially relate to one another. We hypothesise that with a higher number of participants we would have observed more statistically significant differences between viewtypes.

Intuitively, viewtype 2 (title) takes less time to look at as there is less information displayed, which was reflected in our results. We further found that on average viewtype 1 (table) took participants the longest time to select datasets. On average, both versions of the summary conditions took participants longer to select datasets compared to the baselines without summaries. This potentially suggests a deeper engagement with the search results when being presented with a dataset summary on the SERP. Some comments point towards participants' interest in even deeper engagement, such as understanding comparability between different datasets already on the SERP:

(P) Would be good to know if the locations in one data group could be compared and weighted against other data sets. Ie. is location and date in one data set comparable to the same location data with ages accompanied to get more complete data?

(P) Consistent and clearly presented definitions of the dimensions of each dataset

In the study in Chapter 3 participants reported that when searching for data 'in the wild' they lack sufficient information to make an informed selection of a dataset, which is consistent with the findings of this study.

Viewtype 1 (table) also showed the highest confidence ratings on average; we hypothesise that there is a connection between the time participants examined the search results and their confidence that they had selected relevant search results. In IIR studies investigating snippet performance, those which take a shorter time can be considered to be more effective (Kelly, 2009), due to the associated lower cost and effort. In our scenario, however, we hypothesise that participants engaging with the SERP for a longer time is a sign of higher interest and deeper exploration of the search result, similar to Maxwell et al. (2017), who investigated snippet length in web search. Thus, longer search time does not need to be seen as an indicator of negative performance in this context.

Looking at the total number of selected search results, we found that participants selected on average fewer results for viewtype 1 (table) and viewtype 3 (preview). While this difference was not found to be statistically significant, it is consistently reflected both in the number of relevant selected results and in the non-relevant ones that were selected by participants. We assume this points towards a more targeted selection of search results. For viewtype 1 (table) this is supported by higher confidence scores on average in contrast to the other viewtypes. We hypothesise that the table representation allows users to more easily target those attributes that make sense in the context of the data search task, which then makes them more confident in their selection. Interestingly, the highest average number of correctly selected results could be observed for viewtype 0 (text), but this viewtype also showed the lowest average confidence scores. This could mean that the textual summary might lead to more precision in the selection of search results, but without making users more confident in their selection. For a successful interaction with a SERP users should both be able to select relevant results and be confident in their own selection (e.g. trust that their selection is correct). However, these results have to be interpreted with caution due to their limited statistical significance.

We observed that viewtype 2 (title) had the highest number of selected but non-relevant results, which supports our assumption that merely displaying the title might lead to less targeted selection of search results. Viewtype 2 (title) was also on average perceived to be more difficult than the others and was considered the least useful, but that was not reflected in the performance measures. We hypothesise that a larger sample and a completely naturalistic setting in which participants follow their own information needs might display this trend also in the ability to select relevant search results, as the qualitative comments suggested a clear perception of not having enough information to make an informed selection. The nature of our result pool and of our tasks might lead people to select datasets they might not normally select. In a natural setting they might exit the search task even prior to selection; however, in our study they were forced to select a minimum of one dataset.

Looking at the qualitative results, the need for summarisation was expressed both by those commenting on the title and on the preview viewtype. However, when presented with a summary (either as text or as a table) participants also mentioned the need to get an idea of what the data looks like. They described that seeing a preview of the data, or an easy to digest summary of all headers or columns of the dataset, would help them properly evaluate the usefulness of a search result, in addition to the summary. These comments were more prevalent for viewtype 0, the text view, than for the presentation as a table. We assume this is due to the higher structuredness of the table summary, which might make a quick assessment of the dataset easier. When comparing traditional lists of search results to a tabular interface, Resnick et al. (2001) found that the tabular structure supported faster consumption of the SERP and was preferred by users. They changed the design of the entire SERP into a tabular structure, in which each search result included additional structured facets, such as date, size or keyword count. While this presentation mode is not directly comparable to our table viewtype (we presented each of the dataset summaries as an individual table, while the SERP design was still a list), we believe that it indicates that search results as tables might be an interesting direction for further research.

We hypothesise that with a larger sample size the differences between participants' confidence in their selected results, as well as in the other measures, might have reached statistical significance; potentially the differences in the metrics we measured were too small to be picked up by a study at this scale. One of the reasons we chose a between-subjects design was to avoid participants being biased by seeing different versions of the SERP. Ideally, such a set-up allows a more objective measure of the effect of SERP design on performance and the other metrics than the subjective choice of a within-subjects design. It would be interesting to explore this in future research.

For viewtype 2 (title) the comments clearly showed that participants did not feel they had been given enough information to make an informed search result selection. Some comments just expressed the need for more information about the covered content; others explicitly mentioned the need for a summary or snippet and a high-level content description of the dataset. Interestingly, both the time span covered by the dataset and information about the dataset's creation were mentioned by the participants. This confirms findings from our previous two studies presented in Chapters 3 and 4.

Overall, we found that participants did not perceive viewtype 2 (title) as displaying enough information to judge whether a dataset is useful for the data search tasks we explored in this study. This was confirmed both by the results of the post-task questionnaire and by our qualitative results.

A small-scale user study by Au et al. (2016) tested the presentation mode for automatically created query-biased summaries from structured data and suggests a preference for non-textual summaries. However, their textual summaries are limited in scope and fluency and are therefore not comparable to what we refer to as meaningful summaries in the context of this work.

The table summary performed best in the interaction- and experience-based metrics that we measured. Based on the findings of this work, including insights from Chapter 3, we hypothesise that a table summary together with a dataset preview would perform best in a dataset search scenario. However, this would need to be validated in future research.

Our study was conducted using a between-subjects design, in contrast to many user studies comparing interfaces. This was a choice in experimental set-up to avoid influences of the different presentation modes during the study. We believe that asking participants for their preference between different viewtypes might not create valid results, as personal preference might not indicate the viewtype with the best performance and confidence. Being able to compare between viewtypes allows participants to draw conclusions about the expected results of the study and can thus lead to contamination (Kelly, 2009). Therefore we chose a between-subjects design, despite the expected difficulties with this approach. The disadvantage of this set-up is that a larger number of participants is needed to create reliable results. Although our overall participation was > 170, when divided into four groups a larger sample size could further improve the statistical significance of the results. However, we believe that our results are potentially more meaningful as they represent user performance and interaction focused entirely on the viewtype in question, and so indicate important directions for further research.

We believe it is worth considering observational methods for studying the effect of different dataset SERPs on user interaction in more detail, as the performance-focused nature of log data potentially misses the rich, in-depth interaction that users engage in when selecting a dataset out of a pool of search results. This could include eye-tracking studies as well as think-aloud protocols. These methods could allow a better understanding of whether people look at the summaries and use them in their decision making process when selecting a dataset.

5.6 Limitations

As described in the methods, our tasks did not include participants' real information needs. However, we designed the tasks carefully, based, amongst other considerations, on real information needs from our previous studies that require datasets as the information source.

The presentation of the summaries could be done in many different ways. Although we displayed tables for one of the viewtypes, the general structure of our SERP still followed a ten blue links paradigm. There are approaches investigating other presentation layouts for data, as mentioned in Chapter 2, but these are outside the scope of this study.

Due to the nature of the study we could not control for confounding variables in the participants' environments, as they conducted the study online and unsupervised. Therefore the study cannot be treated as a controlled experiment.

We further cannot control for a self-selection bias of participants. Our results showed that more than 80% use data for their work, so we can assume that our sample was fairly data literate; it was also UK-centric (just under 70%). We believe that similar studies with participants who are less data literate are important for designing inclusive data search interfaces.

5.7 Summary

Overall, we found that participants did not perceive viewtype 2 (title) as displaying enough information to judge whether a dataset is useful for the data search tasks we explored in this study. This was confirmed both by the results of the post-task questionnaire and by participants' free-text comments, and is in line with the qualitative findings from Chapter 3. Generally, the results somewhat suggest that viewtype 1 (table) and viewtype 3 (preview) performed best, and we believe a combination of the two presentation modes would be an interesting direction to investigate. Viewtype 1 (table) presents a structured version of the summaries discussed in Chapter 4, in which each row represents one attribute of the proposed template, and the results suggest that this presentation mode leads to higher confidence in participants' selection of search results. The experiments in Chapter 4 have shown that both our lab participants and crowdworkers can produce dataset summaries of similar quality. This leads to the assumption that the task of describing individual attributes (from the summary template) might be easier to perform, as constructing a free-text paragraph without guidance is known to be a complex task. Furthermore, a table version of a dataset summary would not need grammatically correct sentences, but would focus on a bare version of the content of the attributes, which might be easier to produce automatically. A further advantage of a table view is its clear structure and the potential to extract particular rows directly from the dataset where possible, and to rely on human input for those attributes that we are currently not able to produce automatically.

This study can be used as an example of how IIR experiments for dataset search can be conducted, and there is a large number of possible variations that would be interesting to investigate, including different task types; a more diverse set of participants, both in terms of location and data literacy; a higher number of participants; or different presentation modes of the summary content – potentially interactive views that allow a more in-depth exploration of the datasets already on the SERP.

Chapter 6

Discussion

In this chapter we discuss the contributions of this work and synthesise the insights gained through the different studies. We then discuss future research directions based on the lessons learned.

Due to the exploratory nature of this still emerging research space, we first conducted a broad scoping study about the information seeking process for structured data. Subsequently, we identified the themes that became the focus of this work: selection criteria and summarisation of datasets for the purpose of selecting a dataset in a search scenario. We used a mixed-methods approach combining semi-structured in-depth interviews with data professionals, supported by a search log analysis of data search queries from the UK open governmental data portal (Chapter 3). The research questions were directed at people's search strategies, their work tasks with data, the selection criteria they apply when choosing a dataset, as well as their exploration strategies for new data. We found that data search is perceived to be a major issue for data professionals and that they often rely on human recommendation. Participants reported often not having enough information about the content and context of a dataset to make an informed decision without downloading and looking at the data. We found that specific relevance, usability and quality aspects were perceived to be different for data than for documents – for example, the methodology used to collect and clean the data, the missing values, the granularity of the captured information, as well as the ability to understand the schema used to organise a dataset.

In Chapter 4 we presented two complementary studies in which we looked at selection criteria for datasets to validate the findings from Chapter 3, and explored the composition of dataset summaries for the purpose of selecting suitable datasets in a search scenario. With the overabundance of structured data, techniques to represent it in a condensed form are becoming more important. Through a lab experiment and a crowdsourcing experiment we investigated what data attributes people choose to mention when summarising a dataset to others. We identified common attributes that people consider important when they are describing a dataset and so contribute to a better understanding of Human Data Interaction. We showed that text summaries are laid out according to common structures, contain four main information types, and cover multiple dataset attributes. This enabled us to better define criteria for textual summaries of datasets for human consumption and give additional insights into the selection criteria that apply to dataset search.

In Chapter 5 we described a follow-up study that investigated how presenting dataset summaries (as suggested in Chapter 4) as snippets affects user interaction with the SERP and the ability to select relevant search results. We found indications that presenting dataset summaries as a table on a SERP (potentially together with a preview of the dataset) is an interesting direction to explore further.

6.1 Contributions

The main contributions of this work lie in advancing our knowledge of user behaviour in dataset search regarding the following topics: categorisation of data-centric work tasks, search strategies in data search, selection criteria in data search, exploration activities for datasets, composition of dataset summaries and insights into SERP design and IIR experimentation for data search. We further present insights into information seeking for structured data in a framework for Human Structured Data Interaction, synthesising findings from this work.

6.1.1 Framework for Human Structured Data Interaction

We presented the results of the study in Chapter 3 as a framework for Human Structured Data Interaction. This provides a novel perspective on the conceptualisation of data-centric tasks, a better understanding of people's search strategies, as well as of the selection and exploration strategies in data search.

The five pillars of this framework are: tasks, search, evaluation, exploration and dataset use, as can be seen in Figure 6.1.

An initial version of the framework was suggested based on findings from the study described in Chapter 3, and the further work described in this thesis confirmed the individual pillars. The adjustments resulting from further insights are explained, per individual pillar, below.

Figure 6.1: Framework for interacting with structured data

6.1.1.1 Taxonomy of data-centric work tasks

Based on the qualitative study in Chapter 3 we proposed a taxonomy of data-centric tasks, oriented on the requirements our participants expressed as selection criteria, with two dimensions: process-oriented or goal-oriented. Process-oriented tasks are tasks in which data is used for something transformative, such as using data in machine learning processes; goal-oriented tasks are those in which data is used, for example, to answer a question. While the boundaries between the two categories are somewhat fluid and the same user might engage in both types of tasks, the primary difference between them lies in the attached information needs (the details a user needs to know about the data in order to select it). For process-oriented tasks, aspects such as timeliness, licenses, updates, quality, methods of data collection and provenance have a high priority. For goal-oriented tasks, intrinsic qualities of data such as coverage and granularity play a larger role.

We proposed another taxonomy, taking a different perspective. Focusing on the specific activities a user intends to accomplish with data, we differentiate between five types of data-centric tasks: linking, analysing, summarising, presenting and exporting. This can inform system designers and publishers of data so they can tailor functionalities to the different interaction requirements of each task type.

In the initial framework we only discussed time series analysis as an example for analysis tasks. We adjusted the tasks described in the framework to the higher-level term analysis, as this better describes the diversity of potential data analysis activities. The students' data search diaries described in Chapter 4 also reflected this wider definition of data analysis, for instance by mentioning comparison or modelling tasks.

These taxonomies, and the broader framework they are part of, aim to help system designers and publishers of data understand what people do when searching for and engaging with datasets.

6.1.1.2 Better understanding of search strategies in data search

The study in Chapter 3 gave insights into search strategies, both from in-depth interviews with data professionals and from analysing the search logs of a data portal. We found that dataset search is often exploratory, involving iterations and complex cognitive processing. Data professionals in our study reported often struggling to find the data they search for. Queries for datasets are short and under-specified (Kacprzak et al., 2019) and we found that participants were not confident that the search functionality would be able to provide relevant data for longer and more specific queries. It appears that the search box of a data portal is currently treated more as a starting point for further exploration, rather than as a direct access route to a suitable dataset. Hence, human recommendations were reported as a widely used search strategy by the majority of the participants. We included dataset search in the final framework to reflect recent developments in dataset-specific vertical search engines.

6.1.1.3 In-depth knowledge on selection criteria in data search

We learned that people experience difficulties finding datasets, partly because the information they need to evaluate a dataset's fitness for use is not always available or easy to interpret out of context. The findings in Chapter 3 showed that participants evaluate three higher-level categories when assessing a dataset's usefulness for a task: relevance, usability and quality. While these are common categories in selection criteria for information objects in general (Bales and Wang, 2005), our findings suggest that the practical details of what this means are specific to datasets.

The analysis of data search diaries described in Chapter 4 showed similar results and contributed to the robustness of our findings. This allowed us to refine and extend the insights into selection criteria by including a wider variety of attributes with more in-depth examples.

We adjusted the terminology of the initial framework from evaluation criteria to selection criteria, as we wanted to focus on the activity of selecting a dataset in a dataset search scenario. We outline how the attributes included in each of these criteria evolved in the course of this work:

Relevance describes whether a dataset's content is considered applicable to a particular task, e.g. whether it is on the correct topic. Temporal and geographical scope were also explicitly mentioned as ways to assess relevance. In the original framework we referred to scope as 'coverage'. We further detailed the concept of context by pointing to the original purpose of the data as well as external requirements or norms influencing data creation.

Usability covers how suitable the dataset is considering practical implications of, e.g., format or license. In order to judge a dataset's usability for a given task, the following attributes have been identified as important: format, size, documentation, language (e.g., used in headers or for string values), comparability (e.g., identifiers, units of measurement), references to connected sources, and access (e.g., license, API).

Quality refers to anything that participants use to judge a dataset's condition or standard for a task, such as completeness or accuracy. Building on the original framework (Chapter 3) we extended the framework to include timeliness, more in-depth information about the methodological set-up, as well as a dataset's perceived accuracy and information about missing values.

Combining insights from the interview study, the analysis of the data search diaries, as well as of over 300 dataset summaries and comparing them to existing metadata guidelines we arrived at four information types and nine attributes that, if available in a search scenario, are likely to allow people to make an informed selection of a dataset.

The value of selection criteria is manifold. Understanding user needs is the first step towards aiding decision making in dataset search, and can inform interface design and information architecture for data discovery systems and focus research efforts on how to best communicate these information needs to users. This is of great importance, given the exploratory nature of data search tasks, as described in Chapter 3. While we were focusing on general data search tasks, it is likely that there are domain-specific selection criteria of a more granular nature, which would be an interesting direction for future work.

6.1.1.4 Analysis of exploration activities for datasets

Both the study in Chapter 3 and the study in Chapter 4 gave insights into exploration activities for datasets. The interviews provide in-depth reports of participants' exploration strategies and allowed us to identify common struggles and barriers. We found a number of key activities that participants engaged in to assess a dataset's suitability for a given task, such as basic visual scans or investigating summary statistics as well as metadata.

The study in Chapter 4 gave us a different perspective on exploration strategies, as we had 30 people describing data that was new to them in a lab experiment and further collected summaries through a crowdsourcing experiment. The analysis of these summaries, as described in Section 6.1.2, showed that participants arrived at common information types and attributes after engaging with a dataset that was new to them for a short amount of time. We extended the framework to include groupings of headers. Observational studies would enable us to validate exploration strategies in practice and to investigate differences between users with different skill sets.

6.1.2 Composition of dataset summaries created by different types of participants (data literate lab participants versus crowd workers)

We presented an in-depth characterisation of the composition of dataset summaries created by different types of participants (data literate lab participants versus crowd workers). We consolidated our findings together with existing guidelines on metadata creation and proposed recommendations for the creation of textual summaries of datasets for human consumption.

We identified key themes and commonly mentioned dataset attributes which people consider when searching for and making sense of data. The most prevalent ones over all the experiments were subtitle (a high-level one-phrase summary describing the topic of the dataset), headers (explicit references to dataset headers) and geographical scope (the geospatial scope of the data at different levels of granularity). Our findings further suggest that dataset summaries can be created using crowdsourcing. We presented these attributes as a template to guide the summary writing process.

This template could be used to build a tool that semi-automatically supports data publishers in the summary writing process. We presented a detailed specification that could be used to guide the development of a dataset summarisation tool incorporating publisher input and automated features, described in more detail, together with a preliminary prototype, in Chapter 4, Section 6.1.2.
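To make the idea of a template-guided summary workflow concrete, the following minimal sketch shows how such a template could be represented in code. The attribute names echo those discussed above (subtitle, headers, geographical scope); the additional prompts, the function name prompt_publisher and the exact wording are illustrative assumptions, not the specification or prototype from Chapter 4.

```python
# A minimal sketch of a publisher-facing summary template, assuming the
# attribute names below; the last two attributes and all question wording
# are hypothetical illustrations.

SUMMARY_TEMPLATE = [
    ("subtitle", "In one phrase, what is this dataset about?"),
    ("headers", "Which columns (headers) does the dataset contain?"),
    ("geographical_scope", "Which geographical area does the data cover?"),
    ("temporal_scope", "Which time period does the data cover?"),      # assumed attribute
    ("quality_notes", "Are values missing or otherwise incomplete?"),  # assumed attribute
]

def prompt_publisher(answers: dict) -> str:
    """Assemble a draft textual summary from the publisher's answers,
    flagging any template question that was left unanswered."""
    parts = []
    for key, question in SUMMARY_TEMPLATE:
        answer = answers.get(key, "").strip()
        if answer:
            parts.append(answer)
        else:
            print(f"Unanswered: {question}")
    return " ".join(parts)

if __name__ == "__main__":
    draft = prompt_publisher({
        "subtitle": "Monthly road traffic counts for major UK roads.",
        "headers": "Columns include road name, region, month and vehicle count.",
        "geographical_scope": "Covers England, Scotland and Wales at road level.",
    })
    print(draft)
```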

The attributes mentioned in the summaries could also indicate those that are useful in search, which, if validated in future work, could increase the discoverability of data on the web. Web search functionalities are tailored to textual sources, therefore having a textual summary containing meaningful content on the dataset could potentially allow general web search engines to index data sources in a similar way as web pages.

6.1.3 Insights into SERP design and interactive IR experimentation for data search

The experiment described in Chapter 5 presented initial results on how to represent dataset summaries (based on results from Chapter 4) in a data search scenario. Some participants who were presented with just the title or a preview of the data explicitly expressed the need for dataset summarisation in their comments. This was in line with findings from Chapter 3, in which data professionals mentioned summaries as a solution that could allow them to easily assess a dataset's fitness for use.

While there were few statistically significant differences regarding participants' search performance, the table summary performed best in the interaction and experience based metrics that we measured. Both when asked whether the SERP provided enough information to judge whether a dataset was useful for the task, and in the confidence scores recorded for each selected dataset, the table summary showed promising results. In contrast, merely displaying the title performed poorly, both in terms of perceived difficulty and usefulness, as well as in terms of false positive selected results – an outcome that was supported by participants' comments. On average, both summary conditions took participants longer to select datasets compared to those without summaries. This potentially suggests a deeper engagement with the search results when being presented with a dataset summary on the SERP.

Overall we learned that summaries presented as a table, together with a preview of the data, would be an interesting direction to investigate further. This reflects the findings from Chapter 3, where participants mentioned the need to access the data, potentially to create a mental model of what it looks like, as well as the need for a summary that guides and aids interpretation of the data to understand its suitability for a task.

We believe the study set-up in itself provides valuable insights for user studies in dataset search. We drew from interactive IR experiments and highlight the need for open-source test corpora, task collections and benchmarks for dataset search, as described in Section 6.2. We discuss our choice of experimental design and metrics in Chapter 5. We believe that there is scope to think about sensitive metrics that allow us to differentiate participants' performance in small-scale studies, which points to future work as described in Section 6.2.

The findings of this thesis revealed unique interaction characteristics in information seeking for structured data. They confirm previous work in Human Data Interaction, such as Gregory et al. (2017); Kern and Mathiak (2015); Boukhelifa et al. (2017). This not only helps us build our understanding of how people interact with and communicate about data, but should also inform metadata standards as well as automatic summary generation for datasets. Overall, our contributions can inform the design of data discovery tools, support the assessment of datasets and help make the exploration of structured data easier for a wider range of users, including less data literate audiences.

We now discuss where we see the most promising directions for future work.

6.2 Future work

This work could be extended in several directions. We grouped them into several themes: data-centric tasks, summary creation, interfaces for dataset search, benchmarks and conversational data search.

We discuss where we see the gaps in the literature for each of these topics, as they emerged as promising directions to extend this work and illustrate the breadth of potential research in dataset search.

6.2.1 Data-centric tasks

Additional research is needed to deepen our understanding of the activities people carry out with structured data. An accepted taxonomy or ontology of data-centric work tasks in general, based on a large-scale study across domains and skill sets, would be valuable to advance our knowledge of how people interact with structured data. Similarly, we lack a detailed classification of information seeking tasks (which can be seen as sub-tasks of work tasks (Byström and Hansen, 2005)) for structured data that interactive components of data discovery tools could be modelled on. We proposed two initial classifications of data-centric tasks in the study in Chapter 3; however, further investigation with a larger number of participants per role type and domain would be needed to characterise them in depth and to examine how they differ between domains and between users with different skill sets.

Such classifications can be valuable in different ways, including incorporating this knowledge into the back-end of systems, for instance in ranking algorithms in dataset search (to improve ranking efficiency), as well as for task-specific or query-biased snippet generation. At the same time, this knowledge could be integrated into the front-end through task-specific interface conditions, for instance through tailored interfaces displaying maps for geospatial attributes, enabling comparisons between datasets, or offering advanced filtering options. The literature suggests that task-enabled search systems have the potential to increase users' chances of accessing useful content when interacting with large collections (Freund, 2013). Similar to Li and Belkin (2008), faceted classification of information seeking tasks could be of value to examine data-centric information seeking tasks in a similar manner. Considering that data-centric tasks might increasingly be performed in collaboration with others, a more granular understanding of such tasks could inform the design of tools to support user interaction at different stages of the workflow with data, grounded in current work practices with data.

Furthermore, a more in-depth analysis of data search logs would be interesting, in which connections between queries and resulting downloads can be made, as well as investigating query refinements within sessions. This could advance our understanding of task-specific user behaviour in dataset search. The search log data that was kindly made available to us by the UK governmental open data portal described in Chapter 3 did not allow us to make a link between a search query and a resulting dataset download, nor did it contain information about query refinements, which is a limitation that comes with the data. Availability of such data could also allow us to get a sense of granular user activities.

6.2.2 Summary creation

Further studies are needed to develop methods to semi-automatically create dataset summaries suitable for human consumption, as well as to investigate the usefulness of summaries as snippets in data search scenarios. Chapter 5 showed first steps in the direction of such an evaluation. Given the lack of guidance for dataset summary creation detailed in Chapter 4, we believe that having a template and recommendations for what to include in a summary is a necessary first step. The crowdsourcing experiments showed that summaries can be created by people with basic data literacy skills who are not familiar with a dataset. Follow-up studies could include crowdworkers iterating on the summaries created with the template, a process that has proven useful for image descriptions and text shortening (Bernstein et al., 2015; Little et al., 2010). Users of a dataset could potentially improve summaries iteratively, based on their in-depth engagement with the data. However, as the production of a dataset summary is a complex and potentially time-intensive task, we believe there is also a large scope for developing semi-automatic summary creation tools that guide a data publisher through the summary writing process. We presented a preliminary prototype of what such a tool could look like in Chapter 4.

Future work could focus on how to improve both the interaction flow and the pre-processing capabilities of similar tools, and develop smart solutions that minimise human input while still creating meaningful dataset summaries for data search.
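As a rough illustration of the pre-processing such a tool might perform, the sketch below derives a few fields from a CSV file that could pre-fill parts of a summary before asking for publisher input. It uses only the Python standard library; the choice of derived fields (headers, row count, share of empty cells) is an assumption about what would be useful to surface, not part of the prototype described in Chapter 4.

```python
# A sketch, under the assumptions stated above, of pre-filling summary fields
# from a CSV file before prompting the publisher.
import csv

def prefill_summary(path: str) -> dict:
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        headers = next(reader, [])          # first row is assumed to hold column headers
        rows = [row for row in reader]

    total_cells = max(len(rows) * len(headers), 1)
    empty_cells = sum(
        1 for row in rows for value in row[:len(headers)] if not value.strip()
    )

    return {
        "headers": headers,
        "row_count": len(rows),
        # Percentage of empty cells, as one rough completeness indicator.
        "missing_share": round(100 * empty_cells / total_cells, 1),
    }
```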

An interesting direction could also be to refine semi-automatic approaches to generate summaries, using the template to prompt crowdworkers to extract these elements from datasets. This may also have the side-effect of producing higher quality descriptions overall, simply by providing more structure to the task and clearer examples and guidance to the crowd workers, as well as validation and training.

We further see a large scope for data-to-text generation, using approaches that learn from manually written summaries which abstract the content of a dataset in a meaningful way, as detailed in Chapter 4. While we believe that this process should be automated as much as possible in order to scale, it is likely that the best solution currently involves a mix of machine learning and rule-based approaches, in combination with involvement from the publisher. As mentioned earlier, we currently lack both training data and a clear understanding of people's interactions with dataset summaries in a search process.

Being able to generate dataset summaries at scale would allow search systems to use them and thus to investigate their usefulness in data search scenarios in the wild. This could include a range of (interactive) information retrieval experiments to evaluate whether indexing such summaries results in better search performance. Web search is tailored to textual sources; therefore having a textual summary containing meaningful content on the dataset could potentially allow general web search engines to index data sources in a similar way as web pages.

As mentioned in Section 6.2.1 there is still a large space to understand the generalisability of the summaries we proposed, and to explore task and domain specific summaries. Query-biased summaries, as proposed by Au et al. (2016), are an interesting step in that direction. There is a large body of research aiming to understand visual representations of data for different contexts. Similarly, we believe that we need to examine textual representations of data in much more detail to understand how to tailor them to specific users and their contexts, as discussed in Section 6.2.1. Another potential direction is dataset summaries tailored to different skill sets. However, we believe that, in the context of universal design (Lidwell et al., 2010), basic, easily understandable summaries, which do not require in-depth knowledge of certain technologies, are likely to be beneficial for everyone.

Indicators for dataset quality are an interesting area for further work and could be included in dataset summaries. The summaries produced by our participants included qualitative statements about completeness or comprehensiveness of the dataset, and the proposed summary tool displays the percentage of missing values. However, our findings from Chapter 3 suggest that quality is inherently task-specific and would therefore need to be tailored to task contexts. If quality is seen as task-dependent, the importance of capturing accurate, understandable and useful metadata attributes becomes even more evident. We could imagine search systems that allow users to select quality metrics that make sense to them in the context of their task, eventually influencing the ranking algorithm.
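A minimal sketch of what such task-dependent, quality-aware ranking could look like is shown below: the user selects the quality signals that matter for their task and results are re-ranked by a weighted combination of relevance and those signals. The signal names, the weights and the simple linear combination are illustrative assumptions rather than a proposed algorithm.

```python
# A sketch of user-selected, quality-aware re-ranking under the assumptions
# stated above: each result carries a relevance score and per-signal quality
# scores in [0, 1]; the weighting scheme is hypothetical.

def rerank(results, selected_signals, quality_weight=0.5):
    """Re-rank results by a weighted mix of relevance and the mean of the
    quality signals the user selected for their task."""
    def score(result):
        signals = [result["quality"].get(s, 0.0) for s in selected_signals]
        quality = sum(signals) / len(signals) if signals else 0.0
        return (1 - quality_weight) * result["relevance"] + quality_weight * quality
    return sorted(results, key=score, reverse=True)

results = [
    {"title": "Traffic counts 2018", "relevance": 0.9,
     "quality": {"completeness": 0.4, "timeliness": 0.3}},
    {"title": "Traffic counts 2021", "relevance": 0.8,
     "quality": {"completeness": 0.9, "timeliness": 0.95}},
]
print([r["title"] for r in rerank(results, ["completeness", "timeliness"])])
```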

Not least, other aspects for future work include exploring the optimal length of dataset summaries displayed on a SERP, as has been done for snippets for web pages, and the potential benefits of interactive summaries, as explained in Section 6.2.3. At the same time, our experiment in Chapter 5 suggests that users might benefit from both a summary and a preview of the dataset in a search scenario. We have yet to explore the best surface representation and placement for them.

6.2.3 Interfaces for dataset search

We still know little about what advanced state-of-the-art search interfaces tailored to structured data should look like. While the visual aspects of SERP design for data search were outside the scope of this thesis, our findings can provide valuable insights on where to focus future research efforts for data search result displays.

Interfaces tailored to user needs in data search should better support users in evaluating the relevance, quality and usability of a dataset result, getting a suitable overview of the dataset, and finding related datasets, amongst other factors. Systems could benefit from understanding whether a user is searching for a dataset (the main focus of this work), for a data point within a dataset, or for an aggregation of data points. For each of these scenarios the supporting information that enables users to evaluate the search result could be slightly different.

Determining where in the search process the additional information that satisfies a user’s selection criteria should be displayed needs to be established based on theoretical and empirical insight. For search-result displays we believe that presenting a summary of the dataset’s content would support users in assessing whether a result is suited for their task. Similarly, our results from Chapter 5 suggest that providing a preview alongside a textual summary could be beneficial in dataset selection scenarios. Future work could be dedicated to generating dataset previews based on user studies investigating the optimal number of rows, choice of columns and interactive features. Similarly, future work could explore whether specific individual elements of the dataset are more useful to display alongside search results than others, such as for instance headers, entities and/or summarising statistics of a relevant column or field. We further envision visual or textual indicators of data quality on the interface, backed up by automatically computed metrics, user-generated reviews and annotations or reuse statistics.
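As an illustration of the design space for previews mentioned above, the sketch below generates a compact preview by keeping the first few rows and a subset of columns; it prefers columns whose sampled values vary most, as a rough proxy for informativeness. The row cap, column cap and the distinct-value heuristic are assumptions for illustration, not empirically validated choices.

```python
# A sketch of compact dataset preview generation under the stated assumptions.
import csv

def make_preview(path, max_rows=5, max_cols=4):
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        headers = next(reader, [])
        # Read at most 200 rows as a sample for both column selection and preview.
        rows = [row for _, row in zip(range(200), reader)]

    # Prefer columns with many distinct sampled values (a rough informativeness proxy).
    def distinct_count(i):
        return len({row[i] for row in rows if i < len(row)})

    keep = sorted(range(len(headers)), key=distinct_count, reverse=True)[:max_cols]
    keep.sort()  # restore the original column order for display

    preview_headers = [headers[i] for i in keep]
    preview_rows = [[row[i] if i < len(row) else "" for i in keep]
                    for row in rows[:max_rows]]
    return preview_headers, preview_rows
```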

As data search is often exploratory and datasets are often used in combination with other datasets or information sources, we believe that future data search interfaces will go beyond the traditional 10-blue-link paradigm. As a dataset is often accompanied by a dataset preview page, there are many options to access and display metadata during a search. For instance, Google's dataset search currently displays a split interface showing a large list of search results for scrolling on the left side and, on the right side of the screen, a reduced version of a dataset preview page with links to one (or multiple) dataset repositories that hold the selected dataset (Figure 6.2).

Figure 6.2: Google dataset search: SERP

Elsevier's DataSearch (https://datasearch.elsevier.com/), for instance, provides several filtering options on the left side and allows the expansion of detailed information about a dataset within the SERP at the same time. On the one hand this caters to user needs such as specifying a time frame (date) and displaying the dataset within its context (in this case by showing connected files). On the other hand this results in a cluttered interface, potentially connected to a high cognitive load for the user.

Figure 6.3: Elsevier DataSearch: SERP + expanded preview for one dataset

Thinking about ways to allow exploration of data through the SERP, there are also opportunities to enrich dataset preview pages with interactive visualisations. This could allow users to choose their area of interest within a larger dataset, as well as providing a comprehensive overview of the content. Allowing users to filter, sort and explore different views of the data on demand is recommended for complex search tasks (White and Roth, 2009). Similarly, the action of zooming in and out of levels of data to support exploration, either already on a SERP or on a dataset preview page, could be explored further.

Our findings, as well as the literature in this space, suggest that approaches that allow an overview of the potential result space are a promising direction for further work. Liu and Jagadish (2009a) discuss how, in the context of database queries, users struggle to issue precise queries without understanding the whole possible result set. Discussing the visual aspects of sensemaking, Russell (2003) also mentions the need to understand what is in a whole collection. Short, underspecified queries are common in dataset search (Kacprzak et al., 2019) and our participants in Chapter 3 reported submitting high-level queries as they do not expect the search functionality to work well. Another reason they reported for underspecified queries was the feeling of lacking an overview of what data is 'out there'. Hence, allowing users to interact with the entire result set and adjust ranking based on their personal exploration of the result set (Klouche et al., 2017) would be a promising direction to explore for dataset search. This could include re-ranking based on different facets that have been shown to be important for dataset search, such as temporal or geospatial scope of the data, provenance, or others. Users could first get an overview of the available data and then tailor the results to their specific needs, based on a learning experience during the search activity. In general web search users refine their query; however, in dataset search over-specified or longer queries currently often do not produce results. Hence, this would be a strategy to overcome this problem and still personalise the result set where possible.

We further see a large space to explore interfaces that allow more complex interaction with datasets, such as advanced querying (Jagadish et al., 2007). This involves mechanisms beyond keyword queries, such as taking a dataset as input and searching for similar ones, or being able to follow links between entities in datasets. In tabular search, the query consists of an existing table and the aim is either to add data to the table or to search for similar tables (based on their schema) within a result set (e.g. Zhang (2018); Cafarella et al. (2008)). At the same time, being able to query datasets on a column, row and cell level (e.g. Pimplikar and Sarawagi (2012)) to match specific selection criteria at the time of the search activity would be an interesting direction to explore further.
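To illustrate the schema-based variant of tabular search mentioned above, the sketch below ranks candidate datasets by Jaccard similarity between their header sets and the headers of a query table. This is a deliberately simple stand-in for the approaches cited (e.g. Zhang (2018); Cafarella et al. (2008)); the header normalisation and scoring are illustrative assumptions.

```python
# A minimal sketch of schema-based table search: rank candidates by the
# Jaccard overlap of (lower-cased) header sets. Dataset ids and headers in
# the example are hypothetical.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def schema_search(query_headers, candidates):
    """candidates: list of (dataset_id, headers). Returns ids ranked by similarity."""
    normalised = [h.strip().lower() for h in query_headers]
    scored = [(jaccard(normalised, [h.strip().lower() for h in headers]), dataset_id)
              for dataset_id, headers in candidates]
    return [dataset_id for score, dataset_id in sorted(scored, reverse=True) if score > 0]

print(schema_search(
    ["Region", "Year", "Population"],
    [("census-2011", ["region", "year", "population", "households"]),
     ("traffic-counts", ["road", "month", "vehicle count"])],
))
```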

We believe useful data discovery tools should support users in making relationships between pieces of information visible. In complex search tasks, users commonly need to link multiple datasets together. The discovery and exploration of links could be supported by visualising connections between different datasets or data points, and possibly by representing data within a network, to help users understand its meaning within the context of other data. This could also be used to create recommendation systems for datasets based on reuse or on datasets which were downloaded together, as well as on the content or structure of the dataset. Techniques used in semantic web technologies to provide graph-based visualisations that display connections between search results and other datasets could be a promising direction to explore further. This includes, for instance, Cluster Maps (e.g. Fluit et al. (2006)) or trying to understand relationships between datasets based on entity similarity (e.g. Zhang and Balog (2018)).
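A simple "downloaded together" recommender of the kind suggested above could be sketched as follows: datasets that co-occur in the same user sessions are recommended for each other. The session log format and the plain co-occurrence counting are assumptions for illustration; a production system would likely combine this with richer signals such as reuse, content or structure.

```python
# A sketch of a co-download recommender under the stated assumptions:
# sessions are sets of dataset ids downloaded together.
from collections import Counter
from itertools import combinations

def co_download_recommender(sessions):
    pair_counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(session), 2):
            pair_counts[(a, b)] += 1

    def recommend(dataset_id, k=3):
        """Return up to k datasets most often downloaded with dataset_id."""
        scores = Counter()
        for (a, b), count in pair_counts.items():
            if a == dataset_id:
                scores[b] += count
            elif b == dataset_id:
                scores[a] += count
        return [other for other, _ in scores.most_common(k)]
    return recommend

recommend = co_download_recommender([
    {"air-quality", "traffic-counts"},
    {"air-quality", "traffic-counts", "weather"},
    {"weather", "crop-yields"},
])
print(recommend("air-quality"))
```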

The development of new approaches and interaction paradigms needs to be tested and iteratively improved. This requires benchmarks to evaluate success both from a system and a user perspective. For instance, even if we could create well performing dataset recommendation approaches, we still lack an in-depth understanding of the signals users are interested in for dataset recommender systems. While there is some work on dataset recommendation, there is to date little awareness of user needs. To give an example, Ellefi et al. (2016) propose an approach to recommend datasets based on schema similarity; however, we would argue this assumption needs to be tested with real users and in dataset search scenarios. The next section describes the need for benchmarks and testing within the data search space in greater detail.

6.2.4 Benchmarks

Research in dataset search would hugely benefit from benchmarks and test collections for (interactive) IR experiments. For traditional text retrieval there is a large space of established test collections (Voorhees et al., 2005) that facilitate experimentation. Given that structured datasets have unique interaction characteristics, we believe that there is a need to establish similar collections for dataset search. This includes corpora of information seeking tasks, connected queries and results annotated with relevance rankings. At the same time, there is a need to improve our understanding of how dataset search success should be measured, both from a system and a user perspective. Not all traditional metrics used in the evaluation of web search results might be applicable to dataset search, and additional metrics might be interesting for datasets, such as completeness, openness, license, cost, or indications of reuse.
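As one example of how graded relevance judgements in such a test collection could be scored, the sketch below computes nDCG, a standard IR metric; whether nDCG (and at which cut-off) is appropriate for dataset search is precisely the open question raised here, so the metric choice is illustrative rather than a recommendation.

```python
# A sketch of scoring one query's ranked results with nDCG, assuming graded
# relevance judgements (0-3) from a hypothetical dataset search test collection.
import math

def dcg(relevances):
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances, k=10):
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg else 0.0

# Example: judgements for the top five results returned for one query.
print(round(ndcg([3, 0, 2, 1, 0]), 3))
```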

To the best of our knowledge there is currently no large-scale open-source log data for dataset search that we can learn from, and anecdotal evidence suggests that many small organisations that capture data-search logs do not necessarily record in-depth interaction data, from which inferences on user behaviour could be made. A collection of standard dataset search tasks, together with available corpora, could accelerate research in this area by facilitating easier and more comparable experimentation.

6.2.5 Conversational data search and future data interaction paradigms

With advances in speech recognition and text-to-speech synthesis, it has become possible to search for information using natural language queries and for an information retrieval system to respond to queries verbally, with systems such as Google Home or Alexa. This means that eventually we will be able to search for datasets using voice, which might imply different information needs when navigating data in this way. We currently do not know what a useful dataset snippet looks like in a spoken conversational search setting. This is interesting because, even if we think about question answering where a system responds with a data point rather than a dataset, a user might want to understand where the answer comes from in order to evaluate it and could then be interested in a spoken dataset summary.

It has been shown that simply transforming textual content into audio does not provide the same user experience that we know from textual/visual systems (Sahib et al., 2012; Trippas et al., 2017). One assumption is that, to be usable in an audio setting, dataset summaries need to be shorter, as it has been shown that the amount of information that can be conveyed to users via audio is limited by the nature of the medium (Yankelovich and Lai, 1999).

Therefore we believe that studies exploring how to shorten the summaries proposed in Chapter 4 can be valuable. Short dataset summaries can be useful as dataset snippets in conversational interfaces, but also potentially for smaller text interfaces, such as on smartphones and wearable devices.

This could be done by exploring a potential ordering preference for the individual summary attributes. Moreover, investigating whether some attributes are considered more useful than others in a dataset selection scenario could be an interesting direction for future work. This could be investigated for both text and voice representations of dataset summaries. Understanding the content categories that users find most useful in judging a dataset's fitness for their task can further provide us with valuable insight into how future question answering systems based on structured data should present the underlying dataset in a meaningful way to users.
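A minimal sketch of such shortening is given below: attributes are emitted in a fixed priority order until a word budget is reached. The priority order and the 30-word budget are illustrative assumptions, not empirical findings about user preferences.

```python
# A sketch of shortening a template-based summary for voice or small screens,
# assuming a hypothetical attribute priority order and word budget.

ATTRIBUTE_PRIORITY = ["subtitle", "geographical_scope", "temporal_scope", "headers"]

def shorten_summary(attributes, word_budget=30):
    parts, used = [], 0
    for key in ATTRIBUTE_PRIORITY:
        sentence = attributes.get(key, "").strip()
        words = len(sentence.split())
        if sentence and used + words <= word_budget:
            parts.append(sentence)
            used += words
    return " ".join(parts)

print(shorten_summary({
    "subtitle": "Monthly road traffic counts for major UK roads.",
    "geographical_scope": "Covers England, Scotland and Wales.",
    "headers": "Columns: road name, region, month, vehicle count.",
}))
```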

This might involve not just presenting the data but pre-empting analysis based on our task context. We imagine task-sensitive data search systems that warn us if the data is not suited to our task, or if there is risk (such as bias) or uncertainty attached to using a dataset. Data search systems could make users aware of the representativeness of the data they are looking at, or scrutinise datasets in other ways. We imagine the tools used to search for and subsequently use data becoming less fragmented and supporting people's entire workflow with data: from formulating an information need, to searching for and exploring a dataset, to using it. Such systems might automatically update the datasets we create and make us aware of changes in connected datasets that are relevant to us. We will get better at automatically creating synergies between heterogeneous data, which might make it easier to attach context to datasets in a way that allows us to make sense of them.

Our findings confirmed the general notion that data is dependent on context and situated in the culture of its creation process. Future systems supporting data-centric work will hopefully have a less deterministic approach to numbers embedded deep in their design, and potentially treat data more like images, as both capture a version of reality but can also be manipulated. Research advancing dataset search in the direction of more customisable and responsible interactions with data can contribute to increasing data literacy by supporting users in searching for, finding and understanding data.

Chapter 7

Conclusions

Here we summarise the contributions of this work and conclude with where we see dataset search as a research area.

7.1 Summary of contributions

The primary goal of this work was to understand how we can support people to select data that is useful for their task. We approached this by advancing our understanding of user behaviour in structured data discovery. The contributions of this work include:

• Categorisations of data-centric work tasks
• A better understanding of search strategies in data search
• In-depth knowledge on selection criteria in data search
• An analysis of exploration activities for datasets
• Insights into the composition of dataset summaries created by different types of participants
• Insights into SERP design and interactive IR experimentation for data search

We translated the contributions (detailed in Section 6.1) to actionable insights into how people could be supported in selecting datasets from a pool of search results and how to further advance Human Data Interaction research for dataset search:

• A framework for Human Structured Data Interaction that can be used as guidance for data publishers, data portals and designers of data discovery tools
• A proposed template to create user centred dataset summaries
• Initial suggestions on how to present these summaries in a data search scenario
• Specifications for a data summarisation tool, together with an initial prototype
• A detailed discussion of potential directions for future research in data search

Data search has received increasing attention in recent years: Google launched its dataset search in beta in 2018. At the same time we see an increasing interest in the scientific community in discovering and reusing research datasets, as well as recent efforts to establish data search as a research area in its own right, for instance through international data search workshops. Given the large amount of available data, and the increasing need for intelligent data discovery (be that on the web or in an enterprise setting), we believe that we have only just started to understand the unique characteristics of structured data as an information source, and the interaction challenges system designers are facing.

We believe there is scope to extend research on user centred dataset discovery in numerous directions, including, but not limited to: data-centric work tasks, summary creation for datasets, interfaces for dataset search, benchmark creation and conversational dataset search (as discussed in Chapter 6).

It will be interesting to see how the broader area of data search (including its technical and infrastructure related challenges) will be approached by researchers and practitioners, and to what extent techniques from other fields (including e.g. databases, information retrieval, and semantic web search) can be applied towards the problem of dataset search.

There is large scope to investigate data search behaviour amongst di↵erent user groups, according to common skill sets or domain expertise. Only with a thorough understanding of user needs in data search can we start to translate these insights into best practices and functionalities that enable data providers to facilitate data search and selection.

Moving from data search to data use, there is an even wider space to explore the current and constantly evolving work practices with structured data and the tools available to do so. Many of the data-centric tasks described by our participants in Chapter 3 are in fact performed in collaboration. Hence, investigating how existing methods, which facilitate collaboration for documents or code, could be applied efficiently to collaboration with structured data, taking the interaction characteristics discussed in this work into account, would be another interesting direction to explore further.

We hope that this work contributes to increasing our understanding of data discovery from a user perspective, and that many others will continue to work to make data easier to find, understand and use, for as many people as possible.

Appendix A

Appendix Chapter 3

A.1 Risk assessment form

Researchers name: Laura M Koesten

Part 1 Dissertation/project activities

What do you intend to do? (Please provide a brief description of your project and details of your proposed methods.)

I intend to do an interview study with people who work with data in their jobs. The aim is to understand how people understand data and the processes of how data is used in a work context. It will include a brief online survey, sent to people or organisations identified as relevant, with the aim of finding suitable participants. In the survey people will be asked if they'd be interested in participating in an interview study concerning their engagement with data in their work. The interviews will be one-off, will be around 45 minutes long, audio recorded and analysed using semantic analysis. Results will be presented at conferences and/or published in conference proceedings or scientific journals without revealing people's identities; all data will be stored password secured.

Will this involve collection of information from other people? (In the case of projects involving fieldwork, please provide a description of your proposed sample/case study site.)

As described above, people will be interviewed and these interviews will be audio recorded. Interviews will take place either via telephone or in a place where other people are present (the office of the Open Data Institute in London or the participant's workplace, at their convenience). To recruit participants a short online survey will be sent out in which participants can enter their email addresses in case they are interested in being contacted for an interview. Participants will be selected to represent a wide range of different user

groups. Informed consent will be received beforehand - for both the survey as well as for the interviews. The optimal number of participants would be between 10 and 20 people.

If relevant, what location/s is/are involved?

Possibly the office of the Open Data Institute (65 Clifton Street, London, EC2A 4JE) or participants' workplaces if suitable. Participants will be offered the choice of Skype or face-to-face interviews, at their convenience.

Will you be working alone or with others?

Two supervisors are involved (Dr. Elena Simperl and Dr. Jeni Tennison).

Part 2 Potential safety issues / risk assessment.

Potential safety issues arising from proposed activity?

No potential safety issues expected.

Person/s likely to be affected? n/a

Likelihood of risk?

No risk anticipated.

Part 3 Precautions / risk reduction

Existing precautions:

Participants will be assured that they can withdraw at any point during the interviews and can withdraw their participation in the study up to one week after the interview was conducted. Participants will not be harassed or coerced into participating. If interviews are conducted face to face, this will be in a place where other people are present, or in a room with glass walls with other people present outside who are able to look into the room. Confidentiality will be ensured by securing collected data with a password and ensuring anonymity in storage and presentation of data. Any personal information collected will be stored separately from the interview data and will be deleted after the research activity.

Proposed risk reduction strategies if existing precautions are not adequate:

Participants will be reminded that they can withdraw at any time during the interview without having to give any reason.

Part 4 International Travel If you intend to travel overseas to carry out fieldwork then you must carry out a risk assessment for each trip you make and attach a copy of the International Travel form to this document. Download the Risk Assessment for International Travel Form. Guidelines on risk assessment for international travel can be located at: www.southampton.ac.uk/socscinet/safety (risk assessment section). Before undertaking international travel and overseas visits all students must: Ensure a risk assessment has been undertaken for all journeys including to conferences and visits to other Universities and organisations. This is University policy and is not optional. Consult the University Finance/Insurance website for information on travel and insurance. Ensure that you take a copy of the University travel insurance information with you and know what to do if you should need medical assistance. Obtain from the Occupational Health Service advice on any medical requirements for travel to areas to be visited. Ensure next of kin are aware of itinerary, contact person and telephone number at the University. Where possible arrange to be met by your host on arrival. If you are unsure whether you are covered by the University insurance scheme for the trip you are undertaking and for the country/countries you intend visiting, then you should contact the University's Insurance Office at [email protected] and check the Foreign and Commonwealth Office website.

Risk Assessment Form for International Travel attached? NO

A.2 Participant information sheet

Project title: Interview study with data workers

You have been invited to take part in a research study that aims at understanding how people engage with data in their job. We want to understand how people gather data, how they understand and assess data, and which tools they use for their respective tasks. We are interviewing people from a variety of backgrounds and domains, and with different levels of expertise.

About us We are researchers working at the Open Data Institute in London, UK and at the Web and Internet Science group, in the Faculty of Physical Sciences and Engineering of the University of Southampton. This project has been approved by the Faculty's Ethics Committee.

How this study is conducted We recruited interviewees via email and social media and we have done a scoping survey to find the best spread of participants for our study. You have been asked to participate in one interview, which will be conducted face-to-face, via video call or phone, according to the participant's preference. Each interview will last approximately 45 minutes and will be recorded for further analysis.

Data usage policy Responses will be treated with the strictest confidentiality. Your contact information will not be shared with anyone else without your permission. The results of the data gathered will be analysed and can be presented at scientific conferences, and/or published in conference proceedings or journals. You can decide to leave the study at any time during data collection and without explanation. You will have the right to ask that any data you have supplied to that point be withdrawn or destroyed. You can find further information on data usage in the informed consent sheet you have been given.

Questions and the right to withdraw If any question appears unclear, far-fetched or difficult, you are welcome to interrupt and ask for clarification, omit or refuse to answer without giving a reason. Moreover, you have the right to have questions about the procedure answered (unless answering these questions would interfere with the study's outcome). If you have any questions regarding this information sheet, please ask the researcher before the beginning of the study. No risks or benefits are known for participants in this study. Your participation is voluntary.

Further information Laura Koesten PhD researcher at The Open Data Institute and the University of Southampton, UK Email: [email protected] You are welcome to contact us with any questions about the study.

A.3 Scoping survey

Online Survey

DO YOU WORK WITH DATA?

We want to improve how people engage with data. We are putting together a study on how different people use data in their work and are looking for interviewees for that study. If you are interested in helping us, please complete this short survey which will help us to identify suitable participants.

It should not take longer than 3 minutes to complete. There are no right or wrong answers, so please respond freely and honestly.

Data Usage Policy

The data of this survey will be used to decide whether you would be a relevant interview participant for our study. You will be asked for your contact information in case a follow-on interview would be useful. Your contact information will not be shared with anyone else without your permission. You are also giving consent for your information to be reported in aggregated statistics, which can be published as open data. These may be reported in academic work and in conferences, but no individual responses that you may choose to provide will be published.

For further information about the survey or the study please contact:

Laura Koesten PhD researcher at The Open Data Institute and the University of Southampton, UK Email: [email protected]

Survey questions

1. I confirm that I have read the Data Usage Policy above. Yes / No

2. Do you collect, store, manage, analyse, interpret or visualise data in your work? Yes / No / I don't know

3. Do you use data to help make decisions in your work? Yes / No / I don't know

4. How often do you use data that is new to you? Daily / weekly / monthly / less frequently than monthly / never

5. Do you ever search for data for your work? Daily / weekly / monthly / less frequently than monthly / never

6. What were you looking for when you last searched for data? Textfield

7. What tool or website did you use to search for data? Textfield

8. When you use data which of these activities do you typically do? Look at visualisations / Create visualisations / Perform data analysis / Create applications or tools / Other, such as: Textfield

9. Which applications or tools do you use when working with data? Textfield

We are asking the following questions to make sure we have a variety of participants in the study:

10. What is your job title? Textfield

11. Which sector do you work in? Dropdown list of sectors, including the option other

12. Which country do you work in? Textfield

13. What is your gender? Male / Female / Other- textfield

14. Are you 18 years or older? Yes/No

15. Please provide your email address so we can contact you for the follow up interview study: Textfield

Thank you very much for taking the time to complete this scoping survey!

A.4 Interview schedule

Interview Schedule
Thank you for agreeing to be interviewed. Thank you for the information you provided in the scoping survey.

Just going through a few points before we start:
• Approximately 45 minutes
• Informed consent
• You don't need to answer any questions you don't want to
• No right or wrong answers
• You can stop at any point without giving a reason
• The interview will be recorded but data will be kept confidential, secured and deleted after analysis
• If quotes are used, they will be redacted so as not to reveal your identity

Outlook and purpose
• About your experiences of using data in your work
• Interviews will be analysed and the results can help to understand the processes of how data is used
• Help improve how people engage with data

Content
• We are interested in the whole process of how you use data in your work to help make decisions
• That includes: how you search for data, how you evaluate whether data is relevant for you, and what you do with the data
• We are interested in your experiences and thoughts when using data

Interview questions

• What's your job title? Please describe your role.

• How do you work with data: what types of data, what does the data look like (e.g. spreadsheets, graphs, CSV, ...)?

• What do you typically do with the data? (high level task) A typical question in your work that you are trying to solve by using data?

• How often do you make decisions within your work that are fully or partially based on data? Think back to a particular decision you made recently. Talk me through how you made that decision. What sources did you consult?

• With how many different datasets (sources) are you interacting in a day / in a week?

• Do you search for new datasets?

• If so, where and how do you search for datasets? What are your experiences when searching for datasets?

• Can you think of a search query you made?

• Exploring a dataset: When you see data on something which is new to you, what do you look at first? When you open a dataset you've never seen before, what do you look at first?

• How do you evaluate whether data / a new dataset is useful for you?

• What are your requirements on a dataset - which kind of information are you looking for initially?

• Which tools do you use to work with the data? What do you like about the tool you are using? What do you dislike?

• Where would you say you learned the skills to search for and use data? In your formal education, on the job, from colleagues, self-taught?

• What are your personal challenges in working with data? Things that you find tricky, things you complain about

• What would your ideal process of using data to make a decision (personal example given at the beginning of the interview) look like? Describe the steps and what would need to work for that ideal process. What would make it easier?

Debrief Is there anything else you would like to say before we end the interview? Thank you for taking the time. We very much appreciate it! If you would like follow-up information to learn more about the results, let us know. You have our email address.

Appendix B

Appendix Chapter 4

B.1 Risk assessment form - lab experiment

ERGO/FPSE/28636, Risk Assessment Form

Researchers name: Laura M Koesten, Emilia Kacprzak

Part 1 Dissertation/project activities What do you intend to do? (Please provide a brief description of your project and details of your proposed methods.) We intend to do a task based lab study, with participants who have basic data literacy skills. The task consists of looking at a dataset in a spreadsheet and describing it in written form. After that participants will be asked to summarise their own description in max 100 words. For some participants this will involve 1 dataset and approximately 20 minutes. For others it involves 3 datasets and a maximum of 60 minutes. Participants will be asked to describe the dataset as if they were describing it to another person who cannot see the data but has to decide whether or not to use it for their data project. Potential participants will have responded to a call via social media or email to participate in the study. After expressing interest to participate they will be asked a few questions assessing their data literacy and collecting basic demographic information (such as age, gender, education level). If considered suitable they will be contacted to agree on the time of the study. If considered unsuitable they will receive a polite response that we either have enough participants or that inclusion criteria regarding demographics and data literacy didn't match theirs. Participants will be given a written information sheet and informed consent will be obtained before the study. The tasks will be conducted face-to-face. After the task has been conducted, the investigator will thank the participant for their time, and ask whether they need any further information. Results of the study will be presented at conferences and/or published in conference proceedings or scientific journals, without revealing people's identities; all data will be stored password secured. Will this involve

collection of information from other people? (In the case of projects involving fieldwork, please provide a description of your proposed sample/case study site.) As described above, people will be taking part in a one-off task based study in which they produce a written output. The study will take place where other people are present (the office of the Open Data Institute in London), in a room with see-through walls. Informed consent will be received beforehand. The optimal number of participants would be between 25 and 30 people.

If relevant, what location/s is/are involved? The office of the Open Data Institute (65 Clifton Street, London, EC2A 4JE). Will you be working alone or with others? Two supervisors are involved (Prof. Elena Simperl and Dr. Jeni Tennison).

Part 2 Potential safety issues / risk assessment. Potential safety issues arising from proposed activity? No potential safety issues expected. Person/s likely to be affected? n/a Likelihood of risk? No risk anticipated.

Part 3 Precautions / risk reduction Existing precautions: Participants will be assured that they can withdraw at any point during the task. Participants will not be harassed or coerced into participating. The task based study will be in a place where other people are present, or in a room with glass walls with other people present outside who are able to look into the room. Confidentiality will be ensured by securing collected data with a password and ensuring anonymity in storage and presentation of data. Any personal information will be deleted after the research activity. Proposed risk reduction strategies if existing precautions are not adequate: Participants will be reminded that they can withdraw at any time during the task without having to give any reason.

Part 4 International Travel No

B.2 Risk assessment form - crowdsourcing experiment

Researcher's name: Laura Koesten

Part 1 Dissertation/project activities

What do you intend to do? (Please provide a brief description of your project and details of your proposed methods.)

I intend to conduct a crowdsourcing experiment to understand how people describe and summarise data. We will do this in 3 rounds using the crowdsourcing platform CrowdFlower. In the first round people will be asked to write 5 short (50-100 word) descriptions of 5 datasets (spreadsheets). In the second round, we will validate these descriptions by showing participants 5 descriptions of the same dataset and asking them to rank them according to their quality and discard wrong descriptions. In the third round we will ask participants to identify if there is anything missing in the description of a dataset and to edit the description to improve its quality. Results of the analysis of these descriptions will be presented at conferences and/or published in conference proceedings or scientific journals. Participants will be asked to self-report basic demographic information, namely their age, gender, nationality and their highest level of education obtained. Responses will be recorded through the CrowdFlower platform. No sensitive information will be collected. Users will have an anonymous identifier and will not be contacted directly by the researchers. All data will be stored password secured. Participants will be paid and will voluntarily decide to take part in the experiment.

Will this involve collection of information from other people? (In the case of projects involving fieldwork, please provide a description of your proposed sample/case study site.)

As described above, responses will be recorded through the CrowdFlower platform. This platform provides an environment to set up the tasks and acts as a mediator between its customer, in this case the researchers, and the participants. Participants must register with the platform, or with one of the user channels connected to it. Users will have an anonymous identifier and will not be contacted directly by the researchers, and no sensitive information will be collected.

If relevant, what location/s is/are involved?

CrowdFlower's channels include people from all over the world; in the case of our experiment we will limit participation to people from countries in which the majority of the population is native English speaking [1]. Participants will carry out the tasks from their own computers.

Will you be working alone or with others?

Two other researchers (Emilia Kacprzak and Tom Blount) and two supervisors are involved (Dr Elena Simperl and Dr Jeni Tennison).

Part 2 Potential safety issues / risk assessment.

Potential safety issues arising from proposed activity? No potential safety issues expected.

Person/s likely to be affected? n/a Likelihood of risk? No risk anticipated.

Part 3 Precautions / risk reduction

[1] https://www.southampton.ac.uk/studentadmin/admissions/admissions-policies/language.page

Existing precautions:

All data will be stored password secured. Participants have an anonymous identifier provided by the CrowdFlower platform.

All participants are managed by CrowdFlower; the researchers never identify or interact with the participants directly. Payments are handled by the CrowdFlower platform.

The process of the task will be explained thoroughly before participants decide to do the task to avoid them spending time on a task they do not want to finish.

Proposed risk reduction strategies if existing precautions are not adequate:

Participants will be reminded that they can withdraw at any time during the task without having to give any reason.

Part 4 International Travel No

B.3 Participant information sheet - lab experiment

Project title: Describing data

You have been invited to take part in a research study that aims at understanding how people understand and describe data. For this study we are asking people from a variety of domains and with different levels of expertise to describe datasets that they are not familiar with in their own words.

About us
We are researchers working at the Open Data Institute in London, UK and at the Web and Internet Science group, in the Faculty of Physical Sciences and Engineering of the University of Southampton. This project has been approved by the Faculty's Ethics Committee.

How this study is conducted
We recruited our participants via email and social media. You have been asked to participate in an exercise, which will be conducted face-to-face or via video according to your preference. We will give you a task for which you need to use data, and we will ask you to look at a dataset and write a description of this dataset for a maximum of 12 minutes per dataset. The purpose of this description is that another person, who is not able to see the dataset, can decide whether or not to use that dataset for their data project. We will then ask you to summarise your own description in max 100 words. We will repeat this for 5 datasets, which means the session will last about 60 minutes. We will ask you a few questions before you start describing the data for the purpose of collecting demographics and to make sure you haven't worked with the datasets before. After the task you will have an opportunity to give us verbal feedback on your experiences if you want to.

Data usage policy
Responses will be treated with the strictest confidentiality. Your contact information will not be shared with anyone else without your permission. The results of the data gathered will be analysed and can be presented at scientific conferences, and/or published in conference proceedings or journals. You can decide to leave the study at any time and without explanation. You can find further information on data usage in the informed consent sheet you have been given.

Questions and the right to withdraw
If the task appears unclear, far-fetched or difficult, you are welcome to interrupt and ask for clarification, or to omit or refuse to do the task without giving a reason. Moreover, you have the right to have questions about the procedure answered (unless answering these questions would interfere with the study's outcome). If you have any questions regarding this information sheet, please ask the researcher before the beginning of the study. No risks or benefits are known for participants in this study. Your participation is voluntary.

Further information
Laura Koesten and Emilia Kacprzak, PhD researchers at The Open Data Institute and the University of Southampton, UK. Email: [email protected] / [email protected]

You are welcome to contact us with any questions about the study.

B.4 Datasets in Set 5

The full datasets can be seen on a GitHub repository created for the study [2], as described in Chapter 4.

Figure B.1: Refugee movements

Figure B.2: Marvel comic characters

Figure B.3: Swineflu deaths

Figure B.4: Police killings

Figure B.5: Earthquakes

[2] https://github.com/describemydataset/DatasetSummaryData2018

B.5 Crowdsourcing Task

Figures B.6 and B.7 show the crowdsourcing task described in Chapter 4, in which we asked crowdworkers to write summaries of datasets.

Figure B.6: Task 1 (1/2)

Figure B.7: Task 1 (2/2)

B.6 Datasets in Set 20

Dataset (Topic): Example characteristics

E1 (Unemployment rates worldwide): 8 rows, 15 columns, temporal information (years), no geospatial info, no empty cells (1 NA)
E2 (Election polls): 7978 rows, 23 columns, temporal information (dates), geospatial info (location codes), string + numeric values, empty cells
E3 (Death penalty US): 52 rows, 12 columns, temporal information (in headers only, referring to historical data), geospatial information (US states), mostly numeric values (integer + float), empty cells, headers clearly understandable
E4 (Shootings US): 337 rows, 12 columns, temporal information (dates), geospatial info (US county level), string + numeric values, 1 empty column (otherwise complete), several columns with links to sources, personal data
E5 (Religion survey): 922 rows, 47 columns, temporal information (as prevalence reports, age range), geospatial info (region), string values only, all headers are survey questions
E6 (Earthquake survey): 1014 rows, 11 columns, geospatial information (US region), string values + currency, many columns appear like answers to fixed categories, few empty cells, headers are survey questions
E7 (Health & wellbeing): 21 rows, 17 columns, geospatial information (country), mostly numeric values (integer + float), percentages, empty cells, some domain-specific language, inconsistencies in formatting, link
E8 (House of Commons speaker): 563 rows, 22 columns, temporal information only in headers, string + numeric values, currencies, empty cells, personal data, links
E9 (Visa decisions): 7470 rows, 15 columns, geospatial information, temporal information (date, time), not all headers understandable
E10 (Love Actually appearances): 72 rows, 15 columns, many empty cells, only string values (true/false)
E11 (Police incidents): >22,000 rows, 10 columns, geospatial information (state codes), temporal information (different date formats, year), complete, string + numeric values, personal data, not all headers easily understandable
E12 (Government websites): 49 rows, 11 columns, mostly numeric values (integers + floats), currency, few empty cells, many null values
E13 (College grads): 174 rows, 21 columns, string + numeric (integers + floats) values, complete
E14 (Government spending): 1923 rows, 11 columns, temporal information (dates), string + numeric values, complete (some ## values), undefined currency
E15 (Facebook fact checking): 2283 rows, 12 columns, temporal information (date), string + numeric values, links, columns with limited categories, empty cells
E16 (Population statistics): 53 rows, 18 columns, geospatial information (country, region), mostly numeric values (integers + floats), complete, headers not easily understandable
E17 (IRS scams): 8892 rows, 10 columns, geospatial information (county), temporal information (date), string + numeric values, empty cells, personal data
E18 (Weather): 366 rows, 13 columns, temporal information (date, year), numeric values only (integer + float), complete
E19 (Human Development Index): 188 rows, 15 columns, geospatial information (country + ISO code), string + numeric values, empty cells
E20 (Life expectancy UK): 435 rows, 10 columns, geospatial information (UK area, area codes), string + numeric values, empty cells, redundant information
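The characteristics listed above (row and column counts, value types, empty cells, and likely temporal columns) are the kind of profile that can be computed directly from a spreadsheet. As an illustrative sketch only, not part of the original study materials, and assuming the pandas library and a hypothetical local file named dataset.csv, such a profile could be derived roughly as follows:

import pandas as pd

def profile(path):
    # Print a rough profile of a CSV file: size, empty cells, value types,
    # and a heuristic guess at temporal columns based on header names.
    df = pd.read_csv(path)
    rows, cols = df.shape
    print(f"{rows} rows, {cols} columns")
    print(f"empty cells: {int(df.isna().sum().sum())}")
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    string_cols = [c for c in df.columns if c not in numeric_cols]
    print(f"numeric columns: {len(numeric_cols)}, string columns: {len(string_cols)}")
    temporal = [c for c in df.columns
                if any(k in str(c).lower() for k in ("date", "year", "time"))]
    print(f"possible temporal columns: {temporal}")

profile("dataset.csv")  # hypothetical file name

Judgements such as whether headers are "clearly understandable" or whether a dataset contains personal data were, of course, made manually and cannot be derived in this way.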

B.7 Dataset summary tool

Figure B.8: Title

Figure B.9: About

Figure B.10: Format

Figure B.11: Header

Figure B.12: Provenance

Figure B.13: Time

Figure B.14: Location

Figure B.15: Quality

Figure B.16: Analysis

B.7.1 Requirements summary tool

Figure B.17: Requirements for summary tool based on the template in Chapter 4

Appendix C

Appendix Chapter 5

C.1 Risk assessment form

There is no direct risk posed to participants as the study is done online, on a voluntary basis and there will be an option to participate anonymously.

There is a very low likelihood of the following risks: a breach of the database; a breach of the individual investigators' computers.

We mitigate these risks by storing data in an encrypted database, protected by a securely generated password. A database breach would have a low impact as no sensitive data is collected for the purpose of the study.

Data analysis will be done on the investigators' computers, which are password protected, and the data will not be accessible to anyone other than the named investigators.

The event of a breach would have a very low impact as no sensitive data is collected for the purpose of the study, and participants' names and email addresses will not be stored locally; they are only used for communication purposes in the case of the 2 participants who win the random prize draw for Amazon vouchers.

For further information related to data protection and storage refer to the DPA plan.

C.2 Recruitment email

Hi, I am part of a team of researchers at the Open Data Institute and at the University of Southampton. We want to understand how we could improve the way people find data online. We are conducting a study on how people search for data to get a better idea of how we should present datasets in search results.


If you would be interested in helping us, please complete this survey: https://wdaqua-survey.herokuapp.com/ It should take no more than 10-15 minutes to complete and there are no right or wrong answers. You will be able to withdraw from the task at any time.

If you complete the survey you can take part in a prize draw for one of two 50 Amazon vouchers! Many thanks for participating - we really appreciate it!

Best regards, Laura (Please use Chrome, Firefox, or IE as a browser).

C.3 Participant information and consent

Searching data on the web

We want to improve how people find data online, therefore we are conducting a study on how people search for data. If you are interested in helping us, please complete this survey. It will help us understand how we should present datasets as search results.

The task takes 10-15 minutes to complete. There are no right or wrong answers and you can withdraw from the task at any time.

Data Usage Policy

You can take part in the study anonymously (without giving your contact information).

The study consists of demographic questions and you will be asked about your experience of working with data. Then you will complete a data searching task, and afterwards you will be asked questions about this searching experience.

By agreeing to participate, you are giving consent for this information to be reported in aggregated statistics, which can be published as open data. These may be reported in academic work and in conferences, but no individual responses that you may choose to provide will be published.

If you want to take part in a prize draw for one of two 50 Amazon vouchers you will be asked to enter your email address and name at the end of the survey. This data will only be used for the purpose of informing the winners of the prize draw. You can separately opt in to being contacted for future studies.

Your contact information will not be shared with anyone else other than the study team, and you have the right to request the deletion of your response. However, if you decide not to provide contact information you will not be eligible for the prize draw and we cannot remove your response at a later time, as we will not be able to identify it.

For further information about the study please contact:

Laura Koesten PhD researcher at The Open Data Institute and the University of Southampton, UK Email: [email protected]

(This study has been approved by the University of Southampton Ethical Advisory Committee (ERGO number 41384))

C.4 Interface conditions

Figure C.1: Viewtype 0 - text

Figure C.2: Viewtype 1 - table

Figure C.3: Viewtype 2 - title

Figure C.4: Viewtype 3 - preview

C.5 Post-task questions

Figure C.5: Post-task questions

C.6 Participant demographics per country

Country: Number of participants

UK: 119
US: 9
AT: 7
DE: 6
IT, FR: 4
IN, AU: 3
NL, MX, RU, MY: 2
MK, RO, AG, CH, SA, FI, Bouvet Island, PK, MD, JP, DZ: 1

Table C.1: Participants' countries of residence, self-reported (countries abbreviated using ISO 3166-1 Alpha-2 codes)

C.7 Searching for information

Figure C.6: Frequency of information search by participants

C.8 Searching for data

Figure C.7: Frequency of data search by participants

Bibliography

Adams, A. and Blandford, A. (2005). Digital libraries’ support for the user’s ’information journey’. In ACM/IEEE Joint Conference on Digital Libraries, JCDL 2005, Denver, CO, USA, June 7-11, 2005, Proceedings, pages 160–169.

Agrawal, S., Chaudhuri, S., and Das, G. (2002). Dbxplorer: A system for keyword-based search over relational databases. In Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA, February 26 - March 1, 2002, pages 5–16. ah Kang, Y. and Stasko, J. T. (2012). Examining the use of a visual analytics system for sensemaking tasks: Case studies with domain experts. IEEE Trans. Vis. Comput. Graph., 18(12):2869–2878.

Albers, M. J. (2015). Human–information interaction with complex information for decision–making. Informatics, 2(2):4–19.

Altman, M., Castro, E., Crosas, M., Durbin, P., Garnett, A., and Whitney, J. (2015). Open journal systems and dataverse integration– helping journals to upgrade data publication for reusable research. The Code4Lib Journal, 30.

Arguello, J. and Capra, R. G. (2014). The e↵ects of vertical rank and border on ag- gregated search coherence and search behavior. In Proceedings of the 23rd ACM In- ternational Conference on Conference on Information and Knowledge Management, CIKM 2014, Shanghai, China, November 3-7, 2014, pages 539–548.

Assaf, A., Troncy, R., and Senart, A. (2015). What’s up lod cloud? In Gandon, F., Gu´eret, C., Villata, S., Breslin, J., Faron-Zucker, C., and Zimmermann, A., editors, The Semantic Web: ESWC 2015 Satellite Events, pages 247–254, Cham. Springer International Publishing.

Atz, U. (2014). The tau of data: A new metric to assess the timeliness of data in catalogues. In Conference for E-Democracy and Open Governement, page 257.

Au, V., Thomas, P., and Jayasinghe, G. K. (2016). Query-biased summaries for tabular data. In Proceedings of the 21st Australasian Document Computing Symposium, ADCS 2016, Caulfield, VIC, Australia, December 5-7, 2016, pages 69–72.

Baeza-Yates, R. A. (2018). Bias on the web. Commun. ACM, 61(6):54–61. 185 186 BIBLIOGRAPHY

Baeza-Yates, R. A. and Ribeiro-Neto, B. A. (2011). Modern Information Retrieval - the concepts and technology behind search, Second edition. Pearson Education Ltd., Harlow, England.

Baker, J., Jones, D. R., and Burkman, J. (2009). Using visual representations of data to enhance sensemaking in data exploration tasks. J. AIS, 10(7):2.

Balakrishnan, A. D., Fussell, S. R., and Kiesler, S. (2008). Do visualizations improve synchronous remote collaboration? In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’08, pages 1227–1236, New York, NY, USA. ACM.

Balatsoukas, P., Morris, A., and O’Brien, A. (2009). An evaluation framework of user interaction with metadata surrogates. J. Information Science, 35(3):321–339.

Balazinska, M., Howe, B., Koutris, P., Suciu, D., and Upadhyaya, P. (2013). A Discus- sion on Pricing Relational Data, pages 167–173. Springer Berlin Heidelberg, Berlin, Heidelberg.

Bales, S. and Wang, P. (2005). Consolidating user relevance criteria: A meta- ethnography of empirical studies. Proceedings of the American Society for Information Science and Technology, 42(1).

Balog, K., Meij, E., and de Rijke, M. (2010). Entity search: Building bridges between two worlds. In Proceedings of the 3rd International Semantic Search Workshop,SEM- SEARCH ’10, pages 9:1–9:5, New York, NY, USA. ACM.

Bando, L. L., Scholer, F., and Turpin, A. (2010). Constructing query-biased summaries: A comparison of human and system generated snippets. In Proceedings of the Third Symposium on Information Interaction in Context, IIiX ’10, pages 195–204, New York, NY, USA. ACM.

Bargmeyer, B. E. and Gillman, D. W. (2000). Metadata standards and metadata reg- istries: An overview. In International Conference on Establishment Surveys II, Buf- falo, New York.

Barry, C. L. (1994). User-defined relevance criteria: An exploratory study. JASIS, 45(3):149–159.

Bates, M. J. (1989). The design of browsing and berrypicking techniques for the online search interface. Online review, 13(5):407–424.

Batini, C., Cappiello, C., Francalanci, C., and Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Comput. Surv., 41(3):16:1–16:52. BIBLIOGRAPHY 187

Bechchi, M., Raschia, G., and Mouaddib, N. (2007). Merging distributed database summaries. In Proceedings of the Sixteenth ACM Conference on Conference on In- formation and Knowledge Management, CIKM ’07, pages 419–428, New York, NY, USA. ACM.

Beek, W., Rietveld, L., Bazoobandi, H. R., Wielemaker, J., and Schlobach, S. (2014). Lod laundromat: a uniform way of publishing other people’s dirty data. In Interna- tional Semantic Web Conference, pages 213–228. Springer.

Belkin, N. J. (1984). Cognitive models and information transfer. Social Science Infor- mation Studies, 4(2-3):111–129.

Belkin, N. J., Cole, M., and Liu, J. (2009). A model for evaluation of interactive information retrieval. In Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation, pages 7–8.

Belkin, N. J. and Croft, W. B. (1992). Information filtering and information retrieval: Two sides of the same coin? Commun. ACM, 35(12):29–38.

Bernstein, M. S., Little, G., Miller, R. C., Hartmann, B., Ackerman, M. S., Karger, D. R., Crowell, D., and Panovich, K. (2015). Soylent: A word processor with a crowd inside. Commun. ACM, 58(8):85–94.

Bertin, J. (2010). Semiology of Graphics - Diagrams, Networks, Maps.ESRI.

Bizer, C., Heath, T., and Berners-Lee, T. (2009). Linked data - the story so far. Int. J. Semantic Web Inf. Syst., 5(3):1–22.

Blandford, A. and Attfield, S. (2010). Interacting with Information. Synthesis Lectures on Human-Centered Informatics. Morgan & Claypool Publishers.

Borlund, P. (2003). The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Inf. Res., 8(3).

Borromeo, R. M., Alsaysneh, M., Amer-Yahia, S., and Leroy, V. (2017). Crowdsourcing strategies for text creation tasks. In Proceedings of the 20th International Conference on Extending Database Technology, EDBT 2017, Venice, Italy, March 21-24, 2017., pages 450–453.

Boukhelifa, N., Perrin, M.-E., Huron, S., and Eagan, J. (2017). How data workers cope with uncertainty: A task characterisation study. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI ’17, pages 3645–3656, New York, NY, USA. ACM.

Boyce, B. R., Meadow, C. T., and Kraft, D. H. (1994). Measurement in information science. Academic Pr. 188 BIBLIOGRAPHY

Boydell, O. and Smyth, B. (2007). From social bookmarking to social summarization: an experiment in community-based summary generation. In Proceedings of the 12th International Conference on Intelligent User Interfaces, IUI 2007, Honolulu, Hawaii, USA, January 28-31, 2007, pages 42–51.

Broder, A. Z. (2002). A taxonomy of web search. SIGIR Forum, 36(2):3–10.

Bryman, A. (2006). Integrating quantitative and qualitative research: how is it done? Qualitative Research, 6(1):97–113.

Buchanan, G., Blandford, A., Thimbleby, H. W., and Jones, M. (2004). Integrating information seeking and structuring: exploring the role of spatial hypertext in a digital library. In HYPERTEXT 2004, Proceedings of the 15th ACM Conference on Hypertext and Hypermedia, August 9-13, 2004, Santa Cruz, California, USA, pages 225–234.

Burton-Taylor (2015). Demand for financial market data and news - Report. Burton- Taylor International Consulting LLC.

Bystr¨om, K. and Hansen, P. (2005). Conceptual framework for tasks in information studies. JASIST, 56(10):1050–1061.

Cafarella, M. J., Halevy, A., and Madhavan, J. (2011). Structured data on the web. Commun. ACM, 54(2):72–79.

Cafarella, M. J., Halevy, A., Wang, D. Z., Wu, E., and Zhang, Y. (2008). Webtables: Exploring the power of tables on the web. Proceedings of the VLDB Endowment, 1(1):538–549.

Calzada Prado, J. and Marzal, M. A.´ (2013). Incorporating data literacy into information literacy programs: Core competencies and contents. Libri, 63(2):123–134.

Campbell, J. L., Quincy, C., Osserman, J., and Pedersen, O. K. (2013). Coding in- depth semistructured interviews problems of unitization and intercoder reliability and agreement. Sociological Methods & Research.

Card, S. K., Mackinlay, J. D., and Shneiderman, B. (1999). Readings in information visualization - using vision to think. Academic Press.

Catarci, T. (2000). What happened when database researchers met usability. Inf. Syst., 25(3):177–212.

Cattaneo, G., Glennon, M., Lifonti, R., Micheletti, G., Woodward, A., Kolding, M., Vacca, A., Croce, C. L., and Osimo, D. (2015). European data market SMART 2013/0063, D6 - First Interim Report.

Ceolin, D., Groth, P. T., Maccatrozzo, V., Fokkink, W., van Hage, W. R., and Not- tamkandath, A. (2016). Combining user reputation and provenance analysis for trust assessment. J. Data and Information Quality, 7(1-2):6:1–6:28. BIBLIOGRAPHY 189

Chatfield, C. (2016). The analysis of time series: an introduction.CRCpress.

Choi, J. and Tausczik, Y. R. (2017). Characteristics of collaboration in the emerging practice of open data analysis. In Proceedings of the 2017 ACM Conference on Com- puter Supported Cooperative Work and Social Computing, CSCW 2017, Portland, OR, USA, February 25 - March 1, 2017, pages 835–846.

Churchill, E. F. (2012). From data divination to data-aware design. Interactions, 19(5):10–13.

Cleverdon, C., Mills, J., and Keen, M. (1966). Factors determining the performance of indexing systems. volume 1. design.

Cole, M., Liu, J., Belkin, N., Bierig, R., Gwizdka, J., Liu, C., Zhang, J., and Zhang, X. (2009). Usefulness as the criterion for evaluation of interactive information retrieval. Proc. HCIR, pages 1–4.

Convertino, G. and Echenique, A. (2017). Self-service data preparation and analysis by business users: New needs, skills, and tools. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, CHI EA ’17, pages 1075–1083, New York, NY, USA. ACM.

Cormode, G. (2015). Compact summaries over large datasets. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’15, pages 157–158, New York, NY, USA. ACM.

Crabtree, A. and Mortier, R. (2015). Human data interaction: Historical lessons from social studies and CSCW. In ECSCW 2015: Proceedings of the 14th European Confer- ence on Computer Supported Cooperative Work, 19-23 September 2015, Oslo, Norway, pages 3–21.

Craswell, N., Zoeter, O., Taylor, M. J., and Ramsey, B. (2008). An experimental com- parison of click position-bias models. In Proceedings of the International Conference on Web Search and Web Data Mining, WSDM 2008, Palo Alto, California, USA, February 11-12, 2008, pages 87–94.

Cutrell, E. and Guan, Z. (2007). What are you looking for?: an eye-tracking study of information usage in web search. In Proceedings of the 2007 Conference on Human Factors in Computing Systems, CHI 2007, San Jose, California, USA, April 28 - May 3, 2007, pages 407–416.

Daniels, P. J. (1986). Cognitive models in information retrieval - an evaluative review. Journal of Documentation, 42(4):272–304.

DataGovUk (2018). Uk open data portal. https://data.gov.uk/.

Datta, R., Joshi, D., Li, J., and Wang, J. Z. (2008). Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv., 40(2):5:1–5:60. 190 BIBLIOGRAPHY

Davies, T. and Frank, M. (2013). ’there’s no such thing as raw data’: exploring the socio-technical life of a government dataset. In Web Science 2013 (co-located with ECRC), WebSci ’13, Paris, France, May 2-4, 2013, pages 75–78.

Dervin, B. (1997). Given a context by any other name: Methodological tools for taming the unruly beast. Information seeking in context, 13:38.

Diriye, A., Wilson, M. L., Blandford, A., and Tombros, A. (2010). Revisiting exploratory search from the hci perspective. HCIR 2010, page 99.

Edwards, P. N., Mayernik, M. S., Batcheller, A. L., Bowker, G. C., and Borgman, C. L. (2011). Science friction: Data, metadata, and collaboration. Social Studies of Science, 41(5):667–690.

Ellefi, M. B., Bellahsene, Z., Dietze, S., and Todorov, K. (2016). Dataset recommenda- tion for data linking: An intensional approach. In The Semantic Web. Latest Advances and New Domains - 13th International Conference, ESWC 2016, Heraklion, Crete, Greece, May 29 - June 2, 2016, Proceedings, pages 36–51.

Ellis, S. E. and Groth, D. P. (2004). A collaborative annotation system for data visu- alization. In Proceedings of the Working Conference on Advanced Visual Interfaces, AVI ’04, pages 411–414, New York, NY, USA. ACM.

Elmqvist, N. (2011). Embodied human-data interaction. In ACM CHI Workshop ”Em- bodied Interaction: Theory and Practice in HCI”, pages 104–107.

Elsevier (2018). Elsevier scientific repository. https://datasearch.elsevier.com/.

Erete, S., Ryou, E., Smith, G., Fassett, K. M., and Duda, S. (2016). Storytelling with data: Examining the use of data by non-profit organizations. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, CSCW ’16, pages 1273–1283, New York, NY, USA. ACM.

Ermilov, I. and Ngomo, A.-C. N. (2016). TAIPAN: Automatic Property Mapping for Tabular Data. In Proceedings of the 20th International Conference on Knowledge Engineering and Knowledge Management.

Faisal, S., Cairns, P. A., and Blandford, A. (2007). Building for users not for experts: Designing a visualization of the literature domain. In 11th International Conference on Information Visualisation, IV 2007, 2-6 July 2007, Z¨urich, Switzerland, pages 707–712.

Fenichel, C. H. (1981). Online searching: Measures that discriminate among users with di↵erent types of experiences. JASIS, 32(1):23–32. BIBLIOGRAPHY 191

Ferreira, R., de Souza Cabral, L., Lins, R. D., e Silva, G. P., Freitas, F., Cavalcanti, G. D., Lima, R., Simske, S. J., and Favaro, L. (2013). Assessing sentence scor- ing techniques for extractive text summarization. Expert Systems with Applications, 40(14):5755 – 5764.

Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Taibi, D., and Nejdl, W. (2014). A scalable approach for eciently generating structured dataset topic profiles. In The Semantic Web: Trends and Challenges - 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25-29, 2014. Proceedings, pages 519–534.

Fidel, R. (2012). Human Information Interaction: An Ecological Approach to Informa- tion Behavior.MitPress.

Fluit, C., Sabou, M., and Van Harmelen, F. (2006). Ontology-based information visu- alization: toward semantic web applications. In Visualizing the semantic web, pages 45–58. Springer.

Fox, S., Karnawat, K., Mydland, M., Dumais, S., and White, T. (2005). Evaluating implicit measures to improve web search. ACM Transactions on Information Systems (TOIS), 23(2):147–168.

Freund, L. (2013). A cross-domain analysis of task and genre e↵ects on perceptions of usefulness. Inf. Process. Manage., 49(5):1108–1121.

Furnas, G. W. and Russell, D. M. (2005). Making sense of sensemaking. In Extended Ab- stracts Proceedings of the 2005 Conference on Human Factors in Computing Systems, CHI 2005, Portland, Oregon, USA, April 2-7, 2005, pages 2115–2116.

Gambhir, M. and Gupta, V. (2017). Recent automatic text summarization techniques: a survey. Artif. Intell. Rev., 47(1):1–66.

Ganoe, C. H., Somervell, J. P., Neale, D. C., Isenhour, P. L., Carroll, J. M., Rosson, M. B., and McCrickard, D. S. (2003). Classroom bridge: Using collaborative public and desktop timelines to support activity awareness. In Proceedings of the 16th Annual ACM Symposium on User Interface Software and Technology, UIST ’03, pages 21–30, New York, NY, USA. ACM.

Gatt, A., Portet, F., Reiter, E., Hunter, J., Mahamood, S., Moncur, W., and Sripada, S. (2009). From data to text in the neonatal intensive care unit: Using NLG technology for decision support and information management. AI Commun., 22(3):153–186.

GESIS (2018). Gesis data search.

Gkatzia, D. (2016). Content selection in data-to-text systems: A survey. CoRR, abs/1610.08375.

Gkatzia, D., Lemon, O., and Rieser, V. (2016). Natural language generation enhances human decision-making with uncertain information. CoRR, abs/1606.03254. 192 BIBLIOGRAPHY

Gonzalez, H., Halevy, A., Jensen, C. S., Langen, A., Madhavan, J., Shapley, R., and Shen, W. (2010a). Google fusion tables: Data management, integration and collabo- ration in the cloud. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC ’10, pages 175–180, New York, NY, USA. ACM.

Gonzalez, H., Halevy, A. Y., Jensen, C. S., Langen, A., Madhavan, J., Shapley, R., Shen, W., and Goldberg-Kidon, J. (2010b). Google fusion tables: web-centered data management and collaboration. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 1061–1066.

GOV.UK (2014). How to make a freedom of information (foi) re- quest. Retrieved September 05, 2016 from https://www.gov.uk/ make-a-freedom-of-information-request/the-freedom-of-information-act.

Goyal, N. and Fussell, S. R. (2016). E↵ects of sensemaking translucence on distributed collaborative analysis. In Proceedings of the 19th ACM Conference on Computer- Supported Cooperative Work & Social Computing, CSCW ’16, pages 288–302, New York, NY, USA. ACM.

Green, J. and Thorogood, N. (2013). Qualitative methods for health research. Sage.

Greenberg, J. (2010). Metadata and digital information. Encyclopedia of Library and Information Sciences,, pages 3610–3623.

Greenberg, S. (2001). Context as a dynamic construct. Human-Computer Interaction, 16(2-4):257–268.

Gregory, K., Cousijn, H., Groth, P. T., Scharnhorst, A., and Wyatt, S. (2018). Understanding data retrieval practices: A social informatics perspective. CoRR, abs/1801.04971.

Gregory, K., Groth, P., Scharnhorst, A., and Wyatt, S. (2019a). Lost or found? discov- ering data needed for research.

Gregory, K., Groth, P. T., Cousijn, H., Scharnhorst, A., and Wyatt, S. (2017). Searching data: A review of observational data retrieval practices. CoRR, abs/1707.06937.

Gregory, K. M., Cousijn, H., Groth, P., Scharnhorst, A., and Wyatt, S. (2019b). Un- derstanding data search as a socio-technical practice. Journal of Information Science, 0(0):0165551519837182.

Greis, M., Joshi, A., Singer, K., Schmidt, A., and Machulla, T. (2018). Uncertainty visualization influences how humans aggregate discrepant information. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI ’18, pages 505:1–505:12, New York, NY, USA. ACM. BIBLIOGRAPHY 193

Grubenmann, T., Bernstein, A., Moor, D., and Seuken, S. (2018). Financing the web of data with delayed-answer auctions. In Proceedings of the 2018 World Wide Web Con- ference, WWW ’18, pages 1033–1042, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.

Guha, R. V., Brickley, D., and Macbeth, S. (2016). Schema.org: evolution of structured data on the web. Commun. ACM, 59(2):44–51.

Gupta, S., Yadav, S., and Prasad, R. (2016). Document retrieval using ecient indexing techniques: A review. International Journal of Business Analytics (IJBAN), 3(4):64– 82.

Gysel, C. V., de Rijke, M., and Kanoulas, E. (2016). Learning latent vector spaces for product search. In Proceedings of the 25th ACM International Conference on Infor- mation and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, pages 165–174.

Hajizadeh, A. H., Tory, M., and Leung, R. (2013). Supporting awareness through col- laborative brushing and linking of tabular data. IEEE Trans. Vis. Comput. Graph., 19(12):2189–2197.

Halevy, A. Y., Korn, F., Noy, N. F., Olston, C., Polyzotis, N., Roy, S., and Whang, S. E. (2016). Goods: Organizing google’s datasets. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 795–806.

He, J., Duboue, P., and Nie, J. (2012). Bridging the gap between intrinsic and perceived relevance in snippet generation. In COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India, pages 1129–1146.

Hearst, M. (2009). Search user interfaces. Cambridge University Press.

Heer, J. and Shneiderman, B. (2012). Interactive dynamics for visual analysis. Commun. ACM, 55(4):45–54.

Heer, J., Vi´egas, F. B., and Wattenberg, M. (2007). Voyagers and voyeurs: Supporting asynchronous collaborative information visualization. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’07, pages 1029–1038, New York, NY, USA. ACM.

Hendler, J., Holm, J., Musialek, C., and Thomas, G. (2012). Us government linked open data: Semantic.data.gov. IEEE Intelligent Systems, 27(3):25–31.

Hidi, S. and Anderson, V. (1986). Producing written summaries: Task demands, cog- nitive operations, and implications for instruction. Review of educational research, 56(4):473–493. 194 BIBLIOGRAPHY

Ingwersen, P. and J¨arvelin, K. (2005). The Turn - Integration of Information Seeking and Retrieval in Context, volume 18 of The Kluwer International Series on Information Retrieval.Kluwer.

Ingwersen, P. E. R. (1992). Information Retrieval Interaction. Taylor Graham.

International, O. K. (2016). Good tables.

Jagadish, H. V., Chapman, A., Elkiss, A., Jayapandian, M., Li, Y., Nandi, A., and Yu, C. (2007). Making database systems usable. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, June 12-14, 2007, pages 13–24.

Jansen, B. J. (2006). Search log analysis: What it is, what’s been done, how to do it. Library & information science research, 28(3):407–432.

Joslyn, S. and LeClerc, J. (2013). Decisions with uncertainty: the glass half full. Current Directions in Psychological Science, 22(4):308–315.

Kacprzak, E., Koesten, L., Ib´a˜nez, L. D., Blount, T., Tennison, J., and Simperl, E. (2019). Characterising dataset search - an analysis of search logs and data requests. J. Web Semant., 55:37–55.

Kammerer, Y. and Gerjets, P. (2010). How the interface design influences users’ sponta- neous trustworthiness evaluations of web search results: Comparing a list and a grid interface. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, ETRA ’10, pages 299–306, New York, NY, USA. ACM.

Kassen, M. (2013). A promising phenomenon of open data: A case study of the chicago open data project. Government Information Quarterly, 30(4):508 – 513.

Kay, M., Kola, T., Hullman, J. R., and Munson, S. A. (2016). When (ish) is my bus?: User-centered visualizations of uncertainty in everyday, mobile predictive systems. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, pages 5092–5103, New York, NY, USA. ACM.

Kay, M., Morris, D., schraefel, m., and Kientz, J. A. (2013). There’s no such thing as gaining a pound: Reconsidering the bathroom scale user interface. In Proceed- ings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’13, pages 401–410, New York, NY, USA. ACM.

Keim, D. A., Andrienko, G. L., Fekete, J., G¨org, C., Kohlhammer, J., and Melan¸con, G. (2008). Visual analytics: Definition, process, and challenges. In Information Visualization - Human-Centered Issues and Perspectives, pages 154–175.

Kelly, D. (2009). Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval, 3(1-2):1–224. BIBLIOGRAPHY 195

Kelly, D. and Azzopardi, L. (2015). How many results per page?: A study of SERP size, search behavior and user experience. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015, pages 183–192.

Kern, D. and Mathiak, B. (2015). Are there any di↵erences in data set retrieval compared to well-known literature retrieval? In Research and Advanced Technology for Digital Libraries - 19th International Conference on Theory and Practice of Digital Libraries, TPDL 2015, Pozna´n, Poland, September 14-18, 2015. Proceedings, pages 197–208.

Kim, J. and Monroy-Hernandez, A. (2016). Storia: Summarizing social media content based on narrative theory using crowdsourcing. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing,CSCW ’16, pages 1018–1027, New York, NY, USA. ACM.

Kindling, M., van de Sandt, S., R¨ucknagel, J., Schirmbacher, P., Pampel, H., Vierkant, P., Bertelmann, R., Kloska, G., Scholze, F., and Witt, M. (2017). The landscape of research data repositories in 2015: A re3data analysis. D-Lib Magazine, 23(3/4).

Klouche, K., Ruotsalo, T., Micallef, L., Andolina, S., and Jacucci, G. (2017). Visual re-ranking for multi-aspect information retrieval. In Proceedings of the 2017 Con- ference on Conference Human Information Interaction and Retrieval, CHIIR 2017, Oslo, Norway, March 7-11, 2017, pages 57–66.

Knight, S. and Burn, J. M. (2005). Developing a framework for assessing information quality on the world wide web. InformingSciJ, 8:159–172.

Koesten, L. M., Kacprzak, E., Tennison, J. F. A., and Simperl, E. (2017). The trials and tribulations of working with structured data: -a study on information seeking be- haviour. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, Denver, CO, USA, May 06-11, 2017., pages 1277–1289.

Kuhlthau, C. (2004). Seeking Meaning: A Process Approach to Library and Information Services. Information management, policy, and services. Libraries Unlimited.

Kukich, K. (1983). Design of a knowledge-based report generator. In Proceedings of the 21st Annual Meeting on Association for Computational Linguistics, ACL ’83, pages 145–150, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kuzon, W., Urbanchek, M., and McCabe, S. (1996). The seven deadly sins of statistical analysis. Annals of plastic surgery, 37:265–272.

Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. biometrics, pages 159–174.

Lavalle, S., Lesser, E., Shockley, R., Hopkins, M. S., and Kruschwitz, N. (2011). Big Data, Analytics and the Path From Insights to Value. MIT Sloan Management Review, 52(2). 196 BIBLIOGRAPHY

Law, A. S., Freer, Y., Hunter, J., Logie, R. H., McIntosh, N., and Quinn, J. (2005). A comparison of graphical and textual presentations of time series data to support med- ical decision making in the neonatal intensive care unit. Journal of clinical monitoring and computing, 19(3):183–194.

Lawrence, B., Jones, C., Matthews, B., Pepler, S., and Callaghan, S. (2011). Citation and peer review of data: Moving towards formal data publication. IJDC, 6(2):4–37.

Lehmberg, O., Ritze, D., Meusel, R., and Bizer, C. (2016). A large public corpus of web tables containing time and context metadata. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11-15, 2016, Companion Volume, pages 75–76.

Li, X., Liu, B., and Yu, P. S. (2008). Time sensitive ranking with application to publica- tion search. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), December 15-19, 2008, Pisa, Italy, pages 893–898.

Li, Y. and Belkin, N. J. (2008). A faceted approach to conceptualizing tasks in infor- mation seeking. Information Processing & Management, 44(6):1822–1837.

Lidwell, W., Holden, K., and Butler, J. (2010). Universal principles of design, revised and updated: 125 ways to enhance usability, influence perception, increase appeal, make better design decisions, and teach through design. Rockport Pub.

Lin, S., Fortuna, J., Kulkarni, C., Stone, M., and Heer, J. (2013). Selecting semantically- resonant colors for data visualization. In Computer Graphics Forum, volume 32, pages 401–410. Wiley Online Library.

Little, G., Chilton, L. B., Goldman, M., and Miller, R. C. (2010). Exploring iterative and parallel human computation processes. In Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP ’10, pages 68–76, New York, NY, USA. ACM.

Liu, B. and Jagadish, H. V. (2009a). Datalens: making a good first impression. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pages 1115– 1118.

Liu, B. and Jagadish, H. V. (2009b). A spreadsheet algebra for a direct data manipu- lation query interface. In Proceedings of the 25th International Conference on Data Engineering, ICDE 2009, March 29 2009 - April 2 2009, Shanghai, China, pages 417–428.

Liu, X., Tian, Y., He, Q., Lee, W., and McPherson, J. (2014). Distributed graph summa- rization. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, Shanghai, China, November 3-7, 2014, pages 799–808. BIBLIOGRAPHY 197

Lopez-Veyna, J. I., Sosa, V. J. S., and L´opez-Ar´evalo, I. (2012). KESOSD: keyword search over structured data. In Proceedings of the Third International Workshop on Keyword Search on Structured Data, KEYS 2012, Scottsdale, AZ, USA, May 20, 2012, pages 23–31.

Madhavan, J., Halevy, A. Y., Cohen, S., Dong, X. L., Je↵ery, S. R., Ko, D., and Yu, C. (2006). Structured data meets the web: A few observations. IEEE Data Eng. Bull., 29(4):19–26.

Maguire, D. J. and Longley, P. A. (2005). The emergence of geoportals and their role in spatial data infrastructures. Computers, Environment and Urban Systems, 29(1):3–14.

Malaverri, J. E. G., Mota, M. S., and Medeiros, C. B. (2013). Estimating the quality of data using provenance: a case study in escience. In 19th Americas Conference on Information Systems, AMCIS 2013, Chicago, Illinois, USA, August 15-17, 2013.

Manyika, M. G. I. J., Chui, M., Groves, P., Farrell, D., Kuiken, S. V., and Doshi, E. A. (2013). Open data: Unlocking innovation and performance with liquid information.

Mao, J., Liu, Y., Zhou, K., Nie, J.-Y., Song, J., Zhang, M., Ma, S., Sun, J., and Luo, H. (2016). When does relevance mean usefulness and user satisfaction in web search? In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’16, pages 463–472, New York, NY, USA. ACM.

Marchionini, G. (1995). Information seeking in electronic environments. In Cambridge Series on Human-Computer Interaction.

Marchionini, G. (2006a). Exploratory search: from finding to understanding. Commun. ACM, 49(4):41–46.

Marchionini, G. (2006b). Toward human-computer information retrieval. Bulletin of the American Society for Information Science and Technology, 32(5):20–22.

Marchionini, G. (2008). Human-information interaction research and development. Li- brary & Information Science Research, 30(3):165–174.

Marchionini, G., Haas, S. W., Zhang, J., and Elsas, J. L. (2005). Accessing government statistical information. IEEE Computer, 38(12):52–61.

Marchionini, G. and White, R. (2007). Find what you need, understand what you find. Int. J. Hum. Comput. Interaction, 23(3):205–237.

Marcos, M., Gavin, F., and Arapakis, I. (2015). E↵ect of snippets on user experience in web search. In Proceedings of the XVI International Conference on Human Computer Interaction, Interacci´on 2015, Vilanova i la Geltr´u, Spain, September 7-9, 2015, pages 47:1–47:8. 198 BIBLIOGRAPHY

Marcu, D. (2000). The theory and practice of discourse parsing and summarization. MIT press.

Marienfeld, F., Schieferdecker, I., Lapi, E., and Tcholtchev, N. (2013). Metadata aggre- gation at govdata. de: An experience report. In Proceedings of the 9th International Symposium on Open Collaboration, page 21. ACM.

Maxwell, D., Azzopardi, L., and Moshfeghi, Y. (2017). A study of snippet length and informativeness: Behaviour, performance and user experience. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Infor- mation Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017, pages 135–144.

Mayer-Sch¨onberger, V. and Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mi✏in Harcourt.

Mayernik, M. (2011). Metadata realities for cyberinfrastructure: Data authors as meta- data creators.

Mei, H., Bansal, M., and Walter, M. R. (2015). What to talk about and how? selective generation using lstms with coarse-to-fine alignment. CoRR, abs/1509.00838.

Mizzaro, S. (1997). Relevance: The whole history. JASIS, 48(9):810–832.

Morsey, M., Lehmann, J., Auer, S., and Ngomo, A.-C. N. (2011). Dbpedia sparql benchmark–performance assessment with real queries on real data. In International Semantic Web Conference, pages 454–469. Springer.

Morton, K., Balazinska, M., Grossman, D., and Mackinlay, J. D. (2014). Support the data enthusiast: Challenges for next-generation data-analysis systems. PVLDB, 7(6):453–456.

Muller, M. J., Lange, I., Wang, D., Piorkowski, D., Tsay, J., Liao, Q. V., Dugan, C., and Erickson, T. (2019). How data science workers work with data: Discovery, capture, curation, design, creation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI 2019, Glasgow, Scotland, UK, May 04-09, 2019, page 126.

Naumann, F. (2014). Data profiling revisited. SIGMOD Rec., 42(4):40–49.

Naumer, C., Fisher, K., and Dervin, B. (2008). Sense-making: a methodological per- spective. In Sensemaking Workshop, CHI’08.

Neumaier, S. and Polleres, A. (2018). Enabling spatio-temporal search in open data. Technical report, Department f¨ur Informationsverarbeitung und Prozessmanagement, WU Vienna University of Economics and Business.

Neumaier, S., Umbrich, J., and Polleres, A. (2016). Automated quality assessment of metadata across open data portals. J. Data and Information Quality, 8(1):2:1–2:29. BIBLIOGRAPHY 199

Nguyen, T. T., Nguyen, Q. V. H., Weidlich, M., and Aberer, K. (2015). Result selection and summarization for web table search. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 231–242. IEEE.

Noy, N., Burgess, M., and Brickley, D. (2019). Google dataset search: Building a search engine for datasets in an open web ecosystem.

Owczarzak, K. and Dang, H. T. (2009). Evaluation of automatic summaries: Metrics under varying data conditions. In Proceedings of the 2009 Workshop on Language Generation and Summarisation, UCNLG+Sum ’09, pages 23–30, Stroudsburg, PA, USA. Association for Computational Linguistics.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The citation ranking: bringing order to the web.

Park, T. K. (1993). The nature of relevance in information retrieval: An empirical study. The library quarterly, 63(3):318–351.

Pasquetto, I. V., Randles, B. M., and Borgman, C. L. (2017). On the reuse of scientific data. Data Science Journal, 16.

Pfister, H. and Blitzstein, J. (2015). cs109/2015, lectures 01-introduction. https: //github.com/cs109/2015/tree/master/Lectures.

Pham, M., Alse, S., Knoblock, C. A., and Szekely, P. A. (2016). Semantic labeling: A domain-independent approach. In The Semantic Web - ISWC 2016 - 15th Inter- national Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part I, pages 446–462.

Pimplikar, R. and Sarawagi, S. (2012). Answering table queries on the web using column keywords. Proceedings of the VLDB Endowment, 5(10):908–919.

Pirolli, P. and Rao, R. (1996). Table lens as a tool for making sense of data. In Proceedings of the workshop on Advanced visual interfaces 1996, Gubbio, Italy, May 27-29, 1996, pages 67–80.

Pressac, J.-B. (2016). Open refine documentation.

Rao, R. and Card, S. K. (1995). Exploring large tables with the table lens. In Conference Companion on Human Factors in Computing Systems, CHI ’95, pages 403–404, New York, NY, USA. ACM.

Rasmussen, J. (1985). Trends in human reliability analysis. Ergonomics, 28(8):1185– 1195.

Reiche, K. J. and H¨ofig, E. (2013). Implementation of metadata quality metrics and application on public government data. In IEEE 37th Annual Computer Software and Applications Conference, COMPSAC Workshops 2013, Kyoto, Japan, July 22-26, 2013, pages 236–241. 200 BIBLIOGRAPHY

Reiter, E. and Dale, R. (1997). Building applied natural language generation systems. Natural Language Engineering, 3(1):57–87.

Resnick, M. L., Maldonado, C., Santos, J. M., and Lergier, R. (2001). Modeling on-line search behavior using alternative output structures. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 45, pages 1166–1170. SAGE Publications Sage CA: Los Angeles, CA.

Rieh, S. Y., Collins-Thompson, K., Hansen, P., and Lee, H. (2016). Towards searching as a learning process: A review of current perspectives and future directions. J. Information Science, 42(1):19–34.

Robertson, S. (2008). On the history of evaluation in IR. J. Information Science, 34(4):439–456.

Robson, C. and McCartan, K. (2016). Real world research. John Wiley & Sons.

Roddick, J. F., Mohania, M. K., and Madria, S. K. (1999). Methods and interpretation of database summarisation. In Database and Expert Systems Applications, 10th In- ternational Conference, DEXA ’99, Florence, Italy, August 30 - September 3, 1999, Proceedings, pages 604–615.

Rose, D. E. and Levinson, D. (2004). Understanding user goals in web search. In Proceedings of the 13th international conference on World Wide Web, WWW 2004, New York, NY, USA, May 17-20, 2004, pages 13–19.

Roth, E. M., Woods, D. D., and Pople Jr, H. E. (1992). Cognitive simulation as a tool for cognitive task analysis. Ergonomics, 35(10):1163–1198.

Rowley, J. (2000). Product search in eshopping: a review and research propositions. Journal of Consumer Marketing, 17(1):20–35.

Russell, D. M. (2003). Learning to see, seeing to learn: visual aspects of sensemaking. In Human Vision and Electronic Imaging VIII, Santa Clara, CA, USA, January 20, 2003, pages 8–21.

Sahib, N. G., Tombros, A., and Stockman, T. (2012). A comparative analysis of the information-seeking behavior of visually impaired and sighted searchers. JASIST, 63(2):377–391.

Saint-Paul, R., Raschia, G., and Mouaddib, N. (2005). General purpose database sum- marization. In Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, August 30 - September 2, 2005, pages 733–744.

Saracevic, T. (1996). Relevance reconsidered. In Proceedings of the second conference on conceptions of library and information science (CoLIS 2), pages 201–218. BIBLIOGRAPHY 201

Saracevic, T. (1997). The stratified model of information retrieval interaction: Extension and applications. In Proceedings of the Annual Meeting-American Society for Infor- mation Science, volume 34, pages 313–327. LEARNED INFORMATION (EUROPE) LTD.

Schamber, L., Eisenberg, M. B., and Nilan, M. S. (1990). A re-examination of relevance: toward a dynamic, situational definition. Inf. Process. Manage., 26(6):755–776.

Schoe↵mann, K., Hudelist, M. A., and Huber, J. (2015). Video interaction tools: A survey of recent work. ACM Comput. Surv., 48(1):14:1–14:34.

Scott, D., Hallett, C., and Fettiplace, R. (2013). Data-to-text summarisation of patient records: Using computer-generated summaries to access patient histories. Patient Education and Counseling, 92(2):153 – 159.

Shekhar, S., Celik, M., George, B., Mohan, P., Levine, N., Wilson, R. E., and Mohanty, P. (2010). Spatial analysis of crime report datasets. National Science Foundation (NSF), Washington, DC., USA.

Shneiderman, B. (1996). The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of the 1996 IEEE Symposium on Visual Languages, Boulder, Colorado, USA, September 3-6, 1996, pages 336–343.

Sieg, A., Mobasher, B., and Burke, R. D. (2007). Web search personalization with onto- logical user profiles. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM 2007, Lisbon, Portugal, November 6-10, 2007, pages 525–534.

Simianu, V. V., Grounds, M. A., Joslyn, S. L., LeClerc, J. E., Ehlers, A. P., Agrawal, N., Alfonso-Cristancho, R., Flaxman, A. D., and Flum, D. R. (2016). Understanding clinical and non-clinical decisions under uncertainty: a scenario-based survey. BMC Medical Informatics and Decision Making, 16(1):153.

Simmhan, Y. L., Plale, B., and Gannon, D. (2008). Karma2: Provenance management for data-driven workflows. Int. J. Web Service Res., 5(2):1–22.

Simon, H. A. (1973). The structure of ill structured problems. Artif. Intell., 4(3):181– 201.

Sripada, S. G., Reiter, E., Davy, I., and Nilssen, K. (2004). Lessons from deploying NLG technology for marine weather forecast text generation. In Proceedings of the 16th European Conference on Artificial Intelligence, ECAI'04, pages 760–764, Amsterdam, The Netherlands. IOS Press.

Sripada, S. G., Reiter, E., Hunter, J., and Yu, J. (2003). Summarizing neonatal time series data. In Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics - Volume 2, EACL '03, pages 167–170, Stroudsburg, PA, USA. Association for Computational Linguistics.

Stamatogiannakis, M., Groth, P. T., and Bos, H. (2014). Looking inside the black-box: Capturing data provenance using dynamic instrumentation. In Provenance and Annotation of Data and Processes - 5th International Provenance and Annotation Workshop, IPAW 2014, Cologne, Germany, June 9-13, 2014. Revised Selected Papers, pages 155–167.

Stasko, J. T., Görg, C., and Liu, Z. (2008). Jigsaw: supporting investigative analysis through interactive visualization. Information Visualization, 7(2):118–132.

Sterling, G. (2015). It's Official: Google Says More Searches Now On Mobile Than On Desktop. Company officially confirms what many have been anticipating for years. Search Engine Land.

Sultanum, N., Brudno, M., Wigdor, D., and Chevalier, F. (2018). More text please! understanding and supporting the use of visualization for clinical text overview. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI 2018, Montreal, QC, Canada, April 21-26, 2018, page 422.

Sutcliffe, A. G. and Ennis, M. (1998). Towards a cognitive theory of information retrieval. Interacting with Computers, 10(3):321–351.

Swamiraj, M. and Freund, L. (2015). Facilitating the discovery of open government datasets through an exploratory data search interface. Open Data Research Symposium.

Taghavi, M., Patel, A., Schmidt, N., Wills, C., and Tew, Y. (2012). An analysis of web proxy logs with query distribution pattern approach for search engines. Computer Standards & Interfaces, 34(1):162–170.

Tam, N. T., Hung, N. Q. V., Weidlich, M., and Aberer, K. (2015). Result selection and summarization for web table search. In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015, pages 231–242.

Thomas, D. R. (2006). A general inductive approach for analyzing qualitative evaluation data. American journal of evaluation, 27(2):237–246.

Thomas, P., Omari, R. M., and Rowlands, T. (2015). Towards searching amongst tables. In Proceedings of the 20th Australasian Document Computing Symposium, ADCS 2015, Parramatta, NSW, Australia, December 8-9, 2015, pages 8:1–8:4.

Tombros, A., Sanderson, M., and Gray, P. (1998). Advantages of query biased summaries in information retrieval. In SIGIR, volume 98, pages 2–10.

Trippas, J. R., Spina, D., Cavedon, L., and Sanderson, M. (2017). How do people interact in conversational speech-only search tasks: A preliminary analysis. In Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, CHIIR 2017, Oslo, Norway, March 7-11, 2017, pages 325–328.

Tukey, J. W. (1970). Exploratory Data Analysis: Limited Preliminary Ed. Addison-Wesley Publishing Company.

Ubaldi, B. (2013). Open Government Data: Towards Empirical Analysis of Open Government Data Initiatives. OECD Publishing.

Umbrich, J., Neumaier, S., and Polleres, A. (2015). Quality assessment and evolution of open data portals. pages 404–411.

van der Meulen, M., Logie, R. H., Freer, Y., Sykes, C., McIntosh, N., and Hunter, J. (2010). When a graph is poorer than 100 words: A comparison of computerised natural language generation, human generated descriptions and graphical displays in neonatal intensive care. Applied Cognitive Psychology, 24(1):77–89.

Venetis, P., Halevy, A., Madhavan, J., Paşca, M., Shen, W., Wu, F., Miao, G., and Wu, C. (2011). Recovering semantics of tables on the web. Proc. VLDB Endow., 4(9):528–538.

Verhulst, S. and Young, A. (2016). Open data impact: When demand and supply meet. Technical report, GovLab.

Viégas, F. B., Wattenberg, M., van Ham, F., Kriss, J., and McKeon, M. M. (2007). Manyeyes: a site for visualization at internet scale. IEEE Trans. Vis. Comput. Graph., 13(6):1121–1128.

Voorhees, E. M. (1999). The TREC-8 question answering track report. In Proceedings of The Eighth Text REtrieval Conference, TREC 1999, Gaithersburg, Maryland, USA, November 17-19, 1999.

Voorhees, E. M., Harman, D. K., et al. (2005). TREC: Experiment and evaluation in information retrieval, volume 63. MIT Press, Cambridge.

W3Schools.com (2016). Browser statistics. http://www.w3schools.com/browsers/ default.asp.

Wand, Y. and Wang, R. Y. (1996). Anchoring data quality dimensions in ontological foundations. Commun. ACM, 39(11):86–95.

Wang, R. Y. and Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. J. of Management Information Systems, 12(4):5–33.

Weerkamp, W., Berendsen, R., Kovachev, B., Meij, E., Balog, K., and de Rijke, M. (2011). People searching for people: analysis of a people search engine log. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011, pages 45–54.

Wen, Y., Zhu, X., Roy, S., and Yang, J. (2018). Interactive summarization and exploration of top aggregate query answers. PVLDB, 11(13):2196–2208.

White, R. W., Bailey, P., and Chen, L. (2009). Predicting user interests from contextual information. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23, 2009, pages 363–370.

White, R. W. and Roth, R. A. (2009). Exploratory Search: Beyond the Query-Response Paradigm. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers.

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., et al. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific data, 3.

Willett, W., Ginosar, S., Steinitz, A., Hartmann, B., and Agrawala, M. (2013). Identifying redundancy and exposing provenance in crowdsourced data analysis. IEEE Trans. Vis. Comput. Graph., 19(12):2198–2206.

Wilson, M. L. (2011). Search User Interface Design. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers.

Wilson, M. L., Kules, B., m. c. schraefel, and Shneiderman, B. (2010). From keyword search to exploration: Designing future search interfaces for the web. Foundations and Trends in Web Science, 2(1):1–97.

Wilson, M. L., m. c. schraefel, and White, R. W. (2009). Evaluating advanced search interfaces using established information-seeking models. JASIST, 60(7):1407–1422.

Wiseman, S., Shieber, S. M., and Rush, A. M. (2017). Challenges in data-to-document generation. CoRR, abs/1707.08052.

Woods, D. D. (1991). The cognitive engineering of problem representations. Human-computer interaction and complex systems, pages 169–188.

Woods, D. D., Patterson, E. S., and Roth, E. M. (2002). Can we ever escape from data overload? a cognitive systems diagnosis. Cognition, Technology & Work, 4(1):22–36.

Wynholds, L., Fearon Jr., D. S., Borgman, C. L., and Traweek, S. (2011). When use cases are not useful: data practices, astronomy, and digital libraries. In Proceedings of the 2011 Joint International Conference on Digital Libraries, JCDL 2011, Ottawa, ON, Canada, June 13-17, 2011, pages 383–386.

Xie, X., Liu, Y., Wang, X., Wang, M., Wu, Z., Wu, Y., Zhang, M., and Ma, S. (2017). Investigating examination behavior of image search users. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, pages 275–284, New York, NY, USA. ACM.

Yankelovich, N. and Lai, J. (1999). Designing speech user interfaces. In CHI ’99 Extended Abstracts on Human Factors in Computing Systems, CHI Extended Abstracts ’99, Pittsburgh, Pennsylvania, USA, May 15-20, 1999, pages 124–125.

Yi, J. S., ah Kang, Y., Stasko, J. T., and Jacko, J. A. (2007). Toward a deeper understanding of the role of interaction in information visualization. IEEE Trans. Vis. Comput. Graph., 13(6):1224–1231.

Yu, G. (2009). The shifting sands in the effects of source text summarizability on summary writing. Assessing Writing, 14(2):116–137.

Yu, J., Reiter, E., Hunter, J., and Mellish, C. (2007). Choosing the content of textual summaries of large time-series data sets. Natural Language Engineering, 13(1):25–49.

Yu, P. S., Li, X., and Liu, B. (2005). Adding the temporal dimension to search - A case study in publication search. In 2005 IEEE / WIC / ACM International Conference on Web Intelligence (WI 2005), 19-22 September 2005, Compiegne, France, pages 543–549.

Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., and Auer, S. (2016). Quality assessment for linked data: A survey. Semantic Web, 7(1):63–93.

Zhang, L., Stoffel, A., Behrisch, M., Mittelstädt, S., Schreck, T., Pompl, R., Weber, S., Last, H., and Keim, D. (2012). Visual analytics for the big data era? A comparative review of state-of-the-art commercial systems. In Visual Analytics Science and Technology (VAST), 2012 IEEE Conference on, pages 173–182. IEEE.

Zhang, S. (2018). SmartTable: Equipping spreadsheets with intelligent assistance functionalities. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '18, pages 1447–1447, New York, NY, USA. ACM.

Zhang, S. and Balog, K. (2018). Ad hoc table retrieval using semantic similarity. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pages 1553–1562.

Zhuge, H. (2016). Multi-dimensional summarization in cyber-physical society. Morgan Kaufmann.