Research data introduction

Mark Scott, Nicki Clarkson, Alison Knight

Guide authors: Mark Scott, Richard Boardman, Philippa Reed, Simon Cox – FEPS; Dorothy Byatt and Isobel Stark – Library

Accompanying guide available at: https://eprints.soton.ac.uk/403440/

Research data management web site: http://library.soton.ac.uk/researchdata

DATA CREATION IN THE ‘GLOBAL DATASPHERE’

[Chart: data created per year in zettabytes (ZB), 2007–2016. Labelled values: 0.281 ZB, 1.0 ZB, 1.8 ZB, 2.8 ZB, 4.4 ZB and 16.1 ZB.]

Source: Estimates from IDC 2008, 2011, 2012, 2014, 2017

DATA CREATION IN THE ‘GLOBAL DATASPHERE’

‘By 2025, embedded data will constitute nearly 20% of all data created.’

• There will be a massive increase in data generated by mobile and real-time applications (e.g. automated machines).
• The Internet of Things is driving real-time data.

Source: IDC 2017

DATA CREATION IN THE ‘GLOBAL DATASPHERE’

‘By 2025, an average connected person anywhere in the world will interact with connected devices nearly 4,800 times per day — basically one interaction every 18 seconds.’

• Driven by embedded devices and Internet of Things

Source: IDC 2017

RESEARCH DATA MANAGEMENT AT THE UNIVERSITY OF SOUTHAMPTON

http://library.soton.ac.uk/researchdata

• Data management planning
• Describing your data for effective reuse
• Sharing your data
• Securing your data
• Storing your data
• Destruction of data
• Guidance on retention periods
• Data access statements
• FAQs
• Useful links
• Advice via email: [email protected]

TALK OUTLINE

1. Five ways to think about research data

2. Why data management is important to you

3. Data management best practices

FIVE WAYS TO THINK ABOUT RESEARCH DATA: 1. CREATION

• Scientific experiment

• Models or simulation

• Derived data

• Reference data

FIVE WAYS TO THINK ABOUT RESEARCH DATA: 2. THE RESEARCH

• Electronic text documents
• Spreadsheets
• Digital objects, e.g. figures, videos
• Database schemas
• Database contents
• Models, algorithms and scripts
• Software configuration
• Software input and output files (pre- and post-process)
• Notebooks and diaries
• Questionnaires, transcripts and codebooks
• Audiotapes and videotapes
• Photographs and films
• Specimens, samples and artefacts
• Methodologies, workflows, procedures and protocols
• Experimental results
• Metadata (data describing data)

FIVE WAYS TO THINK ABOUT RESEARCH DATA: 3. ELECTRONIC REPRESENTATION

• Textual: text files, Microsoft Word, PDF, RTF
• Numerical: Excel, CSV (comma-separated values)
• Multimedia: TIFF image, AVI movie, MP3 audio
• Structured: CSV, database, multi-purpose (XML)
• Software code: Java, C, Matlab
• Software specific: 3D CAD, statistical model
• Discipline specific: Chemistry’s CIF (for crystallography)
• Instrument specific: Archaeology’s laser scanner files

FIVE WAYS TO THINK ABOUT RESEARCH DATA: 4. WHAT FILES MAKE UP YOUR DATA SET?

Data sets come in all sizes and complexities

Complexity        Type of data                   Size
Individual file   Raw CT data                    10–100s of gigabytes
Individual file   Video                          Gigabytes
Individual file   Photograph                     Megabytes
Set of files      Individual frames of a movie   Gigabytes
Set of files      Source code files              Kilobytes/megabytes

FIVE WAYS TO THINK ABOUT RESEARCH DATA: 5. DATA LIFE CYCLE

[Figure: data life cycle diagram showing categories and stages]

WHY IS RESEARCH DATA MANAGEMENT IMPORTANT?

‘Data sharing and management snafu in 3 short acts’ by NYU Health Sciences Library (2012)

FIVE REASONS TO BOTHER WITH RESEARCH DATA MANAGEMENT

1. You may be required to by your funding body
2. The University expects you to (e.g. backups)
3. You may want to use the dataset again
4. Others may want to use your data
5. Others may want to cite your data – datasets can be cited just like journals and papers, which can help your research standing

Data management best practices

DATA MANAGEMENT PLAN

• What is a DMP?

“just a tool for thinking systematically through the kinds of material your work will produce, how you will work with it and ensure its integrity during the project, the possible reuse value of this material, and how it will be made safe and available into the future. A DMP is like an insurance policy for sustainability, ensuring you will maximise research value and have no unpleasant surprises at the close of your project.”

http://training.parthenos-project.eu/sample-page/manage-improve-and-open-up-your-research-and-data/data-management-planning/

DATA MANAGEMENT PLAN

• Compulsory for all new doctoral students
• http://library.soton.ac.uk/researchdata/phd
• You must complete and upload a DMP as part of your 12-month review
• Information, templates and guidance are on the Library website

OPEN ACCESS TO RESEARCH DATA

• Open Research requires data to be FAIR:
    • Findable
    • Accessible
    • Interoperable
    • Reusable
• “As open as possible, as closed as necessary”
• There may be moral, ethical, commercial or legal reasons for not sharing data or for restricting access
• ORCID can link your datasets and associated papers
• Sharing your data can boost citation rates

OPEN ACCESS TO RESEARCH DATA

1. Create a metadata record in a data repository within 12 months of the end of data collection or when publishing a paper
    • Describe what the data is, why, when and how it was generated
    • The data repository can be discipline-specific or EPrints via https://pure.soton.ac.uk
2. Obtain a Digital Object Identifier (DOI) for the record
    • e.g. 10.5258/SOTON/393614
3. If you can, upload the data to the data repository
4. Include a data access statement in any published work
    • Now a requirement of many funders

DATA ACCESS STATEMENTS – EXAMPLES

Openly available data

‘Data published in this paper are available from the University of Southampton repository at 10.5258/SOTON/379558.’ (G. Squicciarini, M.G.R. Toward and D.J. Thompson 2015)

DATA ACCESS STATEMENTS – EXAMPLES

Restricted access – ethical, legal, commercial

‘The study data are not freely available due to legal restrictions, and Government of India’s Health Ministry Screening Committee (HMSC) assessment is required to obtain the data. The Parthenon Cohort team will provide the data on request subject to HMSC approval. For further information contact the corresponding author.’ (Ghattu V. Krishnaveni et al. 2015)

DATA ACCESS STATEMENTS – EXAMPLES

Secondary analysis of existing data

‘This study was a re-analysis of an existing dataset that is publicly available from [organisation] at [web address].’

DATA ACCESS STATEMENTS – EXAMPLES

No new data created

‘No new datasets were created during this study.’

EU General Data Protection Regulation: ‘GDPR’

Came into force on 25 May 2018

Dr. Alison Knight, Legal Services & Data Governance

Background

• GDPR aims to strengthen data protection laws to make them fit for the digital age by giving people more control over their own data
• The current UK Data Protection Act 1998 was derived from EU law (the Data Protection Directive) which is being replaced by the GDPR
• The text of GDPR applies throughout EU Member States, directly embedded into new national data protection legislation (incl. a new UK Data Protection Act 2018 to come)
• Brexit does not affect the implementation of GDPR
• GDPR also has indirect effects on privacy standards outside the EU

Why comply? – New fine levels

Major breaches of data protection are subject to administrative fines of whichever is higher of the following:
• up to 20,000,000 EUR, or
• up to 4% of the total worldwide annual turnover of the preceding financial year (in the case of an undertaking)
Focused on incidents which are likely to cause damage and distress.

Medium breaches of data protection are subject to administrative fines of whichever is higher of the following:
• up to 10,000,000 EUR, or
• up to 2% of the total worldwide annual turnover of the preceding financial year (in the case of an undertaking)
Focused on process failures, for example a failure to report ‘high risk’ breaches to the ICO and the relevant data subjects within 72 hours, or a failure to carry out a DPIA.
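
The ‘whichever is higher’ rule above is simple arithmetic. As a rough illustration only (not legal advice), the sketch below computes the upper limit of a fine for an undertaking. The function name and the turnover figures are hypothetical and chosen only to show when each limit applies.

```python
def max_fine_eur(annual_turnover_eur, major):
    """Upper limit of an administrative fine for an undertaking.

    Takes the higher of the fixed cap and the percentage of the total
    worldwide annual turnover of the preceding financial year.
    """
    fixed_cap = 20_000_000 if major else 10_000_000
    percent = 4 if major else 2
    return max(fixed_cap, annual_turnover_eur * percent / 100)


# Hypothetical turnovers, for illustration only.
print(max_fine_eur(100_000_000, major=True))    # the fixed cap is higher
print(max_fine_eur(1_000_000_000, major=True))  # 4% of turnover is higher
```

For smaller organisations the fixed cap is the binding limit; the percentage only takes over once turnover exceeds 500,000,000 EUR in either band.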

“Any information relating to an identified or identifiable natural person

An identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to:
• an identifier such as a name, an identification number, location data, an online identifier, or
• one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person”

Compare the concept of ‘sensitive’ personal data – delimited categories

Personal data – wide definition, but ultimately context focused

The main data protection principles – Key message: Establishing data processing purpose is fundamental to GDPR compliance

Data protection impact assessments – ‘DPIAs’ (i)

• Used and recommended for some time – now a firm requirement under GDPR along with ‘Privacy by Design’ and ‘Privacy by Default’
• Purpose: an exercise to identify and mitigate ‘high-risk’ data processing “to the rights and freedoms of natural persons”, taking into account the nature, scope, context and purposes of the processing (‘the Data Environment’)
• You must carry out a DPIA if you plan a ‘high-risk’ processing activity, e.g.
    • Sensitive data or criminal convictions/offences in high volume
    • Systematic profiling of individuals leading to automated decisions, e.g. credit checks
    • Large-scale systematic monitoring in a public area, e.g. CCTV
    • Other? To be assessed context by context…
• What's included in a DPIA?
    • Details of processing, including purposes
    • Assessment of necessity and proportionality of processing
    • Assessment of risks to data subjects
    • How risks will be reduced to a ‘non-high’ level

Data protection impact assessments – DPIAs (ii)

• Identify the need for a DPIA: initial form (are you planning to process information relating to living people in your research project?)
• If yes and triggers are met, a DPIA form must be completed and undergo panel review before ethics clearance. This could delay new projects significantly ⇨ build it into your timeline
• “Where appropriate”, you must consult data subjects. This also needs to be factored into timescales.
• Follow up recommendations. Rarely, if the DPIA identifies high-risk processing that cannot be properly mitigated, the data controller (i.e. the University) must consult the supervisory authority (e.g. the ICO).
• A DPIA is not the end: it is an (ongoing) process, LIKE GOOD DATA MANAGEMENT!

Where can you get help?

UoS resources
• https://intranet.soton.ac.uk/sites/gdpr/Pages/Home.aspx
• https://groupsite.soton.ac.uk/Administration/DPIA-Template-and-Resources/Pages/Home.aspx
• To be embedded in Service Now – entry from link in ethics form

ICO Guide to the GDPR and DPIAs
• https://ico.org.uk/for-organisations/guide-to-the-general-data-protection-regulation-gdpr/
• https://ico.org.uk/media/1624219/preparing-for-the-gdpr-12-steps.pdf
• https://ico.org.uk/for-organisations/guide-to-the-general-data-protection-regulation-gdpr/accountability-and-governance/data-protection-impact-assessments/

Email us
• General enquiries: [email protected]
• Legal: [email protected]
• DPIAs: [email protected]
• Information Security: [email protected]

GDPR/DPIA Questions/Discussion Time

RESEARCH DATA MANAGEMENT – BEST PRACTICES

Available on the library’s web site http://library.soton.ac.uk/researchdata
• File naming
• File versioning
• File formats
• Backups
• Encryption
• Data destruction at the end of its life

CAN YOU STILL ACCESS YOUR DATA IN 20 YEARS?

What if the software was no longer available?

• Try to use text files. Formatting might be lost but perhaps the data will still be useful.

• Otherwise, use file formats with openly published specifications to provide some protection.

• Some believe storing a virtual machine containing the software may help in the future – helps medium term but perhaps not long term
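
As a minimal sketch of the advice above, tabular results can be saved as plain CSV together with a small text file describing the columns, so that both stay readable without the original software. The function name, file names, column names and values below are made up for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_open_format(rows, columns, description, out_dir):
    """Write rows as CSV plus a plain-text (JSON) description of the data."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # The data itself: one header row, then one row per record.
    data_path = out / "results.csv"
    with data_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)

    # A small sidecar file saying what the data is.
    meta_path = out / "results_README.json"
    meta_path.write_text(
        json.dumps({"description": description, "columns": columns}, indent=2),
        encoding="utf-8",
    )
    return data_path, meta_path


# Demonstration in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    data_path, meta_path = save_open_format(
        rows=[(1, 20.5), (2, 21.1)],
        columns=["sample_id", "temperature_celsius"],
        description="Hypothetical example measurements",
        out_dir=tmp,
    )
    print(data_path.read_text(encoding="utf-8"))
```

Both output files can be opened in any text editor, which is the property the slide is asking for.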

VERSIONING

• Follow a process that lets you keep versions of files, so you can easily go back if a mistake is made
• For a simple solution, add _1, _2, _3 or the date to file names
    • This is not very robust when dealing with large numbers of files
• Use free specialist software such as Mercurial, Git or Subversion
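
The simple naming scheme above can be automated. The sketch below shows one hypothetical convention (an ISO date plus a running number); it is not a University standard, and the function name and file names are illustrative only.

```python
from datetime import date
from pathlib import Path


def next_version_name(path, existing, today=None):
    """Return the next unused name of the form stem_YYYY-MM-DD_N.suffix."""
    p = Path(path)
    stamp = (today or date.today()).isoformat()
    n = 1
    # Increase the running number until the name is not already taken.
    while f"{p.stem}_{stamp}_{n}{p.suffix}" in existing:
        n += 1
    return f"{p.stem}_{stamp}_{n}{p.suffix}"


existing_files = {"results_2018-05-25_1.csv"}
print(next_version_name("results.csv", existing_files, today=date(2018, 5, 25)))
```

Dates written as YYYY-MM-DD sort chronologically in an ordinary file listing, which is why that form is used here.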

VERSIONING EXAMPLE (GIT)

University GitLab service: https://git.soton.ac.uk

1. init: Initialise a directory so it can track file versions
2. Copy/create files
3. add: Start tracking the necessary files
4. commit: Save a copy of the files, along with a message detailing changes
5. Modify files

BACKUP REGULARLY – OFFSITE!

• Hard drives do fail regularly and disasters do happen!
• The best local example of this is the Mountbatten building fire on 30 October 2005. Many students and staff were suddenly faced with the possibility of having lost months’ worth of research
• iSolutions estimates that there is at least 6 PB not backed up through central provision
• You will not get an extension if you lose data and have no backups

Consider all of the possible ways of losing data, some more likely than others (non-exhaustive list):
❌ Hardware failure
❌ Theft
❌ Lost laptop/hard-drive
❌ User error, e.g. files being deleted
❌ Vendor error, e.g. a software bug causing corruption
❌ Malware
❌ Fire, flood, etc.
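
Some of the failures listed above, such as corruption or accidental deletion, can go unnoticed until a restore is needed. As a minimal sketch, and not a replacement for a proper backup tool, the code below compares SHA-256 checksums of the files in a source folder with those in a backup copy. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_backup_problems(source_dir, backup_dir):
    """List source files that are missing from, or differ in, the backup."""
    source, backup = Path(source_dir), Path(backup_dir)
    problems = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        rel = src_file.relative_to(source)
        copy = backup / rel
        if not copy.is_file():
            problems.append((str(rel), "missing"))
        elif sha256_of(src_file) != sha256_of(copy):
            problems.append((str(rel), "differs"))
    return problems
```

Calling `find_backup_problems("sourcefolder", "destfolder")` returns an empty list when every file in the source folder has an identical copy in the backup folder. It only checks in one direction, so extra files in the backup are not reported.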

BACKUP BEST PRACTICES

• Backup off-site
• Backup using two solutions to two different places
• Built-in tools to get you going:

System        File backup           System backup
Windows 7     Backup and Restore    Backup and Restore
Windows 8–    File History          System Image (8.1–)
Mac 10.5–     Time Machine          Time Machine

• File synchronisation and backup tools:
    • rsync. Example command: rsync -avi sourcefolder/ destfolder/
    • robocopy. Example command: robocopy sourcefolder destfolder /z /e /v
    • SyncToy. Old but still useful Microsoft tool, supported up to Windows 7

UNIVERSITY RESOURCES – FILE STORE

University file store
\\filestore.soton.ac.uk\users\
• 5 GB allocation each

Research file store
\\fsresearch.soton.ac.uk\
• 1440 TB
• Requests of up to 1 TB through [email protected]
• Higher allocations into hundreds of TB will involve faculty approval
• ‘Secure research data storage service to the research community’
• Enterprise class platform with a couple of hundred hard drives connected on an InfiniBand backbone with massive data throughput and expansion potential.
• It has 10 Gbit/s network access. This allows extremely fast data transfers across campus

Both of these central file store areas are backed up off-site – backups retained for 3 months

UNIVERSITY RESOURCES – OFFICE 365

• Gives staff and students access to:
    • The Microsoft Office Online application suite
    • Microsoft Office software on personal equipment (up to 5)
    • OneDrive for Business and SharePoint Online for cloud storage on servers in the European Union (eventually UK)
• It is expected that data is deposited as a dataset in a data repository with a DOI

              OneDrive for Business    SharePoint Online/Teams
Allocation    5 TB                     No limit
Best for      Individual work          Collaborative work
Clients       Mac and Windows          Mac and Windows

Note: When you leave the University, your OneDrive for Business account gets deleted after 30 days

Limitations:
• Maximum 15 GB file size
• Restrictions on characters in the file name (e.g. folders with the name ‘forms’)
• Maximum path length of 400 characters
• No more than 300,000 files recommended

DROPBOX

• Cloud file storage. 2 GB free. £6.58/month for 1 TB.
• Alternatives: Microsoft OneDrive (5 GB free); Box (10 GB free); Google Drive (15 GB free)
• Dropbox ≠ backup
    • Synchronises deletes as well as modifications
    • Full restore can be slow, difficult and messy

Feature                  Preferred alternatives
File synchronisation     OneDrive for Business, SharePoint Online, peer-to-peer
Collaborative working    University file store, SharePoint Online
File sharing             University Dropoff service at https://dropoff.soton.ac.uk

SECURITY – ENCRYPTION

• Data encryption prevents people without the passphrase accessing your data
• Encrypt your laptop in case you lose it

System              Built-in encryption system
Windows Vista–      BitLocker
Mac OS X 10.3–      FileVault
Mac OS X 10.7–      FileVault 2
Linux               LUKS

• Encrypt sensitive data sets when sharing

System                                                       How to encrypt
University Dropoff service at https://dropoff.soton.ac.uk    Tick "Encrypt every file" when creating a new drop-off
Encrypted ZIP                                                Create AES-256 encrypted ZIPs with WinZip or 7-Zip

7-Zip command line example: 7z a newzip.zip folder -tzip -mem=AES256 -mx9 -p

SECURITY – DISPOSING OF DATA

• Does data require destroying at the end of its life? Check your Data Management Plan.

Magnetic media hard drive
Securely erase data with software:
• Darik’s Boot and Nuke (DBAN) boot CD or Parted Magic
• Windows: Eraser
• Mac: Secure Empty Trash or Disk Utility
• Linux: shred, srm or nwipe

Solid state disk
• Secure erasing with software will seriously shorten its life and may not effectively destroy data
• Use the manufacturer’s software to reset all blocks, or consider encryption

Or physical destruction (most secure). Contact [email protected]

SUMMARY

• Ensure proper management of the data throughout its life
• Publish your data along with your publications
• More information on the library’s web site or in the accompanying guide

• Consider the long-term view

UNIVERSITY RESOURCES – CONTACTS

Research data web site http://library.soton.ac.uk/researchdata

Research data advice [email protected]

File store requests [email protected]

GDPR queries [email protected] [email protected]

DPIA queries [email protected]

ACKNOWLEDGEMENTS

• The categorisation of research data collection was defined in Research Information Network (2008).
• The forms of research data and the categorisation of electronic storage of research data were adapted from The University of Edinburgh (2011).
• The following people helped with the preparation of this document:
    • Andy Collins (Human Genetics case study).
    • Thomas Mbuya and Kath Soady (Materials Fatigue Test case study).
    • Gregory Jasion (CFD case study).
    • Simon Coles (Chemistry case study).
    • Graeme Earl (Archaeology case study).
    • Mark Scott, Richard Boardman, Philippa Reed and Simon Cox (overall content).
• We acknowledge ongoing support from the University of Southampton, Robert’s funding, Microsoft, EPSRC, BBSRC, JISC, AHRC and MRC.

REFERENCES

• Digital Curation Centre (2010), ‘DCC Curation Lifecycle Model’. URL: http://www.dcc.ac.uk/resources/curation-lifecycle-model
• Humphrey, C. (2006), ‘e-Science and the Life Cycle of Research’. URL: http://datalib.library.ualberta.ca/ humphrey/lifecycle-science060308.doc
• Research Information Network (2008), ‘Stewardship of digital research data: a framework of principles and guidelines’.
• The University of Edinburgh (2011), ‘Defining research data’. URL: http://www.ed.ac.uk/schools-departments/information-services/services/research-support/data-library/research-data-mgmt/data-mgmt/research-data-definition
• University of York (2012), ‘Archaeology Data Service’. URL: http://archaeologydataservice.ac.uk/
