D2.1 Redefining What Data Is and the Terms We Use to Speak of It
To cite this version: Jennifer Edmond, Georgina Nugent-Folan. D2.1 Redefining what data is and the terms we use to speak of it. [Research Report] Trinity College Dublin. 2018. hal-01842385.
HAL Id: hal-01842385
https://hal.archives-ouvertes.fr/hal-01842385
Submitted on 18 Jul 2018

Deliverable Title: D2.1 Redefining what data is and the terms we use to speak of it.
Deliverable Date: 31st March 2018
Version: V2.0
Project Acronym: KPLEX
Project Title: Knowledge Complexity
Funding Scheme: H2020-ICT-2016-1
Grant Agreement number: 732340
Project Coordinator: Dr Jennifer Edmond ([email protected])
Project Management Contact: Michelle Doran ([email protected])
Project Start Date: 01 January 2017
Project End Date: 31 March 2018
WP No.: 2
WP Leader: Dr Jennifer Edmond, TCD
Authors and Contributors (Name and email address): Dr Jennifer Edmond ([email protected]); Dr Georgina Nugent-Folan ([email protected])
Dissemination Level: PU
Nature of Deliverable: R = report

Abstract:
How do we define data? Through a series of interviews with computer scientists and a text mining exercise, this report presents our investigations into how data is defined and conceived of by computer scientists. By establishing a taxonomy of models underlying various conceptions of what data is, as well as a historical overview of when and how certain conceptions came to be dominant, and within what communities, this report moves towards a reconceptualisation of data that can accommodate knowledge complexity in big data environments.

Revision History:
Version 1.0 uploaded to the EC portal, March 2018.
Version 2.0 revised to include the KPLEX Project Glossary of Big Data Terms for Humanists (pages 208-210) and uploaded to the EC portal in July 2018.

Table of Contents:
1. WP Objectives (taken from the DOW)
2. Lit Review
3. Methodology
   a. Interviews (design, implementation and target audience) - list of interview questions provided in Annex 1
   b. Data Analysis (coding etc.)
   c. Data Mining
4. Findings and Results
   a. Results of the coding of interview responses
   b. Results of the data mining exercise
5. Discussion and Conclusion
   a. Trends in the results
   b. Emerging narratives?
6. Recommendations
   a. Recommendations for policy makers?
   b. Recommendations for future ICT projects?
   c. Recommendations for researchers?
   d. Recommendations for research funding bodies?
   e. Other?
7. Annex 1: Interview questions
   Annex 2: WP2 Code List for the Qualitative Analysis of Interviews Used in Atlas.ti
   Annex 3: WP2 Data Mining Results
   Annex 4: A KPLEX Primer for the Digital Humanities
1. WP Objectives

WP2's objective was to deliver a fundamental comparative layer of information which not only supported the research of the other WPs, but also enabled project recommendations and underpinned the discussion of D1.1, the project's final report on the exploitation, translation and reuse potential of project results. The objective was to synthesise findings into both a white paper for policy/general audiences and a journal article. The activities of WP2 were broken down into the following tasks.

T 2.1 Survey of the state of knowledge regarding the development of the concept of data (M 4-6)

Drawing on a wide range of disciplines (primarily history, philosophy, and science and technology studies, but also engineering and computer science), this first task sought to establish a taxonomy of models underlying various conceptions of what data is, as well as a historical overview of when and how certain conceptions came to be dominant, and within what communities. This work allowed the project as a whole to base its further development within a contextualised frame.

T 2.2 Development of data mining exercise and interview questions (M 7-8)

At the conclusion of the first phase (and in conjunction with the consortium's midterm face-to-face meeting), the original objectives of WP2 as outlined in Annex 1 of the DOW were to develop both an online survey and a set of detailed questions to underpin one-on-one interviews with computer science researchers. As outlined in the interim project management report, which was approved by the project's European Project Officer, these objectives were amended to reflect WP2's aim of acquiring rich rather than large responses to the accompanying WP interviews. We therefore set aside the objective of conducting a large survey; in its place, WP2 conducted a data mining exercise across a corpus of computer science journals and proceedings from "big data" conferences, so as to gain a more informed picture of what inherent definitions of the word "data" are expressed in them, and of how transformation processes like data cleaning or processing are viewed and documented (an illustrative sketch of this kind of analysis follows Task 2.3 below). This fed directly into WP2's objective of providing a thorough taxonomy of the various definitions of data in use among different research communities.

Both of these instruments (the interviews and the data mining exercise) were designed to tease out a more detailed picture of what model of data underlies computer science research and development. The data mining exercise operated largely at the level of definitions, while the interviews sought to delve more deeply into awareness and conceptions of the ethics and responsibilities of aggregating data that is an incomplete record of phenomena, as well as any strategies deployed to address this challenge. Our target was to conduct in excess of twelve one-hour interviews.

T 2.3 Delivery and initial data preparation/analysis of the data mining and interview results (M 9-11)

The data mining exercise was performed across a corpus of computer science journals and proceedings from "big data" conferences. Interviewees were recruited from the partner networks (in particular the SFI ADAPT Centre, in which the WP leader is an Associate Investigator).
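This deliverable does not reproduce the mining pipeline itself here. Purely as an illustration of the kind of corpus analysis described in Tasks 2.2 and 2.3, the following minimal sketch (in Python, using only the standard library) extracts keyword-in-context snippets around definitional uses of the word "data" in a directory of plain-text articles. The directory path, cue patterns, and window size are assumptions introduced for the example, not features of the project's actual pipeline.

# Illustrative sketch only: surfaces candidate "definitions" of data in a
# corpus of plain-text articles via keyword-in-context extraction around
# definitional cue phrases. Paths and patterns are hypothetical.

import re
from collections import Counter
from pathlib import Path

# Cue patterns that often introduce an (implicit) definition of "data".
DEFINITIONAL_CUES = [
    r"\bdata\s+(?:is|are)\s+(?:defined\s+as\s+)?",
    r"\bwe\s+(?:define|treat|regard)\s+data\s+as\s+",
    r"\bdata\s+refers?\s+to\s+",
]
CUE_RE = re.compile("|".join(DEFINITIONAL_CUES), re.IGNORECASE)

def keyword_in_context(text, window=120):
    """Yield snippets of `window` characters on either side of each cue match."""
    for match in CUE_RE.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        yield text[start:end].replace("\n", " ").strip()

def mine_corpus(corpus_dir):
    """Collect definitional snippets and tally the word following each cue."""
    snippets = []
    following_words = Counter()
    for path in Path(corpus_dir).glob("*.txt"):  # assumes plain-text articles
        text = path.read_text(encoding="utf-8", errors="ignore")
        for snippet in keyword_in_context(text):
            snippets.append((path.name, snippet))
        # The word immediately after a cue is a rough signal of what "data"
        # is being equated with (e.g. "records", "observations", "bits").
        for match in CUE_RE.finditer(text):
            tail = text[match.end():match.end() + 40].split()
            if tail:
                following_words[tail[0].lower().strip(".,;:")] += 1
    return snippets, following_words

if __name__ == "__main__":
    snippets, following = mine_corpus("corpus/")  # hypothetical directory
    for name, snippet in snippets[:10]:
        print(f"{name}: ...{snippet}...")
    print(following.most_common(20))

The project's actual data mining results are reported in Annex 3; this fragment is intended only to make concrete the notion of mining a corpus for the inherent definitions of "data" expressed within it.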
T 2.4 Analysis of data, write up and editing of reports (M 12-14)

Although data analysis and interview transcription were continuous throughout the phase of Task 2.3, this task took the final data set and determined the overall conclusions and recommendations that would become part of the final WP reports.

T 2.5 Integration of final WP results with overall project (M 15)

The final month of the project was dedicated to the alignment and harmonisation of final results among the WPs, which was pushed forward by the final project plenary meeting at the start of M 15. All of the tasks were led by the WP2 leader (TCD). Input from the other partners (KNAW-DANS, FUB and TILDE) was gathered at the kickoff, midterm and final project meetings, through the monthly PMB online conference calls, and asynchronously through the circulation of draft results.

2. Literature Review for WP2

2.1. The influence of information studies, archival studies, and documentation studies on how we conceive of and identify data.

The definitions, distinctions, and curatorial liberties granted to the documentationalist as outlined in Suzanne Briet's What is Documentation? have become integral to archival scholarship and to conceptions of what data is, so much so that facets of Briet's documentationalist approach have become enshrined in archival practice, providing prescriptive and unquestioned approaches to documentation that have been automatically translated into the digital sphere and co-opted into digital archival practice. Analogue documentation practice, what Ben Kafka refers to as "the 'charismatic megafauna' of paperwork,"1 has greatly influenced digital documentation practice, metadata, and information architecture. Speaking in relation to projects such as Project Gutenberg and Google Books, Rosenberg notes: "Some of these resources are set up in ways that generally mimic print formats. They may offer various search features, hyperlinks, reformatting options, accessibility on multiple platforms, and so forth, but, in essence, their purpose is to deliver a readable product similar to that provided by pulp and ink."2

The longstanding (and long-acknowledged) problems of analogue documentation and analogue forms of metadata have carried over to digital practice, shaping the architecture of digital documentation (and consequently of the archives as a whole). The epistemological implications of these structures, and their effect on how we approach and access stored material (I hesitate to use the term data), are not receiving the attention they deserve.

Briet focuses on the importance of metadata in the form of the catalogue. The central role of metadata has been argued to be necessary for functionality, for practicality (planning research trips, navigating the archive, etc.), and for documenting the contents of the archive, to the point where its role as a mediator between scholar/user and archive has become standard. Metadata is now not only an obligatory intermediary between a scholar and an archive, but a structural facet of the information architecture that mediates the space between user interface and database: Current catalogues, retrospective catalogues, and union catalogues are obligatory documentary tools, and they are the practical intermediaries between graphical documents and their users.