
Data Sharing in Practice

Katherine Tucker & Chris Harbron

PSI Conference 2018

Disclaimer

This presentation reflects the views of the authors and should not be construed to represent Roche’s views or policies

Agenda

• Why share data?

• Overview of data sharing landscape – how far have we come?

• Connecting the dots & changing regulatory requirements

• UKAN and a practical framework for data anonymisation

• The risk of re-identification

• Quantifying and mitigating the risk

• The concern of multiplicity

Why share data?

Go Green! Why recycling data is good for pharma, for science, and for patients

* Findable, Accessible, Interoperable, Reusable (FAIR)

Overview of data sharing landscape – how far have we come?

• Clinical trial data sharing platforms (IPD – individual participant data)

– Project DataSphere, TransCelerate PSoC

• Collaboration across industry, also academia

– EFPIA/PhRMA, TransCelerate

Connecting the dots & changing regulatory requirements

• Beyond IPD…

• Transparency across the clinical trial lifecycle:

– Registries

– Documents

– Publications

• Regulatory:

– EMA – Policy 0070

– Health Canada – ‘Public Release of Clinical Information’

– FDA – ‘Clinical Data Summary Pilot Program’

• All this sharing is great but data privacy is a big concern - how can we ensure that clinical trial data can be safely shared or published?

UKAN (UK Anonymisation Network) and a practical framework for data anonymisation

• Anonymisation Decision-Making Framework (ADF) published by UKAN

• Anonymisation is a heavily context-dependent process. Only by considering the data and its environment as a total system (which we call the data situation), can one come to a well informed decision about whether and what anonymisation is needed.

• The ADF incorporates two frames of action: contextual and technical

UKAN ADF (continued)

There are five principles upon which the ADF is founded:

1. You cannot decide whether data are safe to share/release or not by looking at the data alone
2. But you still need to look at the data
3. Anonymisation is a process to produce safe data but it only makes sense if what you are producing is safe useful data
4. Zero risk is not a realistic possibility if you are to produce useful data
5. The measures you put in place to manage risk should be proportional to the risk and its likely impact

The Risk Of Re-Identification

Player                  Position  Age    Caps  Goals  Team
                        GK        25     7     0      Stoke City
                        GK        24     2     0      Everton
                        GK        26     0     0      Burnley
Trent Alexander-Arnold  DF        19     0     0      Liverpool
                        DF        32     58    4      Chelsea
                        DF        28     9     0      Manchester City
Phil Jones              DF        26     24    0      Manchester United
                        DF        25     4     0      Leicester City
Danny Rose              DF        27     16    0      Tottenham Hotspur
                        DF        24     24    0      Manchester City
                        DF        27     5     0      Tottenham Hotspur
                        DF        28     34    0      Manchester City
                        DF        32     33    7      Manchester United
                        MF        22     23    2      Tottenham Hotspur
                        MF        24     25    3      Tottenham Hotspur
                        MF        27     38    0      Liverpool
                        MF        25     10    1      Manchester United
Ruben Loftus-Cheek      MF        22     2     0      Crystal Palace
                        FW        24     23    12     Tottenham Hotspur
                        FW        20     17    2      Manchester United
                        FW        23     37    2      Manchester City
                        FW        31     21    7      Leicester City
                        FW        27     37    15     Arsenal

The Risk Of Re-Identification (continued)

Player                  Position  Age    Caps  Goals  Team
Jack Butland            GK        25     7     0      Stoke City
Jordan Pickford         GK        24     2     0      Everton
Nick Pope               GK        26     0     0      Burnley
Trent Alexander-Arnold  DF        19     0     0      Liverpool
Gary Cahill             DF        32     58    4      Chelsea
Fabian Delph            DF        28     9     0      Manchester City
Phil Jones              DF        26     24    0      Manchester United
Harry Maguire           DF        25     4     0      Leicester City
Danny Rose              DF        27     16    0      Tottenham Hotspur
John Stones             DF        24     24    0      Manchester City
Kieran Trippier         DF        27     5     0      Tottenham Hotspur
Kyle Walker             DF        28     34    0      Manchester City
Ashley Young            DF        32     33    7      Manchester United
Dele Alli               MF        22     23    2      Tottenham Hotspur
Eric Dier               MF        24     25    3      Tottenham Hotspur
Jordan Henderson        MF        27     38    0      Liverpool
Jesse Lingard           MF        25     10    1      Manchester United
Ruben Loftus-Cheek      MF        22     2     0      Crystal Palace
Harry Kane              FW        24     23    12     Tottenham Hotspur
Marcus Rashford         FW        20     17    2      Manchester United
Raheem Sterling         FW        23     37    2      Manchester City
Jamie Vardy             FW        31     21    7      Leicester City
Danny Welbeck           FW        27     37    15     Arsenal

The Risk Of Re-Identification (continued)

Player                  Position  Age    Caps  Goals  Team
Jack Butland            GK        20-30  7     0      Stoke City
Jordan Pickford         GK        20-30  2     0      Everton
Nick Pope               GK        20-30  0     0      Burnley
Trent Alexander-Arnold  DF        <20    0     0      Liverpool
Gary Cahill             DF        >30    58    4      Chelsea
Fabian Delph            DF        20-30  9     0      Manchester City
Phil Jones              DF        20-30  24    0      Manchester United
Harry Maguire           DF        20-30  4     0      Leicester City
Danny Rose              DF        20-30  16    0      Tottenham Hotspur
John Stones             DF        20-30  24    0      Manchester City
Kieran Trippier         DF        20-30  5     0      Tottenham Hotspur
Kyle Walker             DF        20-30  34    0      Manchester City
Ashley Young            DF        >30    33    7      Manchester United
Dele Alli               MF        20-30  23    2      Tottenham Hotspur
Eric Dier               MF        20-30  25    3      Tottenham Hotspur
Jordan Henderson        MF        20-30  38    0      Liverpool
Jesse Lingard           MF        20-30  10    1      Manchester United
Ruben Loftus-Cheek      MF        20-30  2     0      Crystal Palace
Harry Kane              FW        20-30  23    12     Tottenham Hotspur
Marcus Rashford         FW        20-30  17    2      Manchester United
Raheem Sterling         FW        20-30  37    2      Manchester City
Jamie Vardy             FW        >30    21    7      Leicester City
Danny Welbeck           FW        20-30  37    15     Arsenal

UKAN Framework

Data Situation Audit

1. Data Situation
2. Legal Responsibilities
3. Know your Data
4. Use Case
5. Ethical Obligations

Disclosure Risk Assessment & Control

6. Assess Disclosure Risk
7. Disclosure Control Processes

Impact Management

8. Stakeholder & Communication
9. Plan What Next
10. Plan for Something Going Wrong

Re-Identification Risk

7. Identify the disclosure control processes that are relevant to your data situation

Reconfigure the Data Environment

Change the Data Specifications

Prob(Re-Identification) = Prob(Attack) x Prob(Re-Identification | Attack)
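The decomposition above can be sketched in a few lines of Python. This is a minimal illustration, not a calibrated risk model: the function name `reidentification_probability` and all numeric values are hypothetical, chosen only to contrast a controlled platform (attack unlikely) with an open release (attack likely).

```python
def reidentification_probability(p_attack: float, p_reid_given_attack: float) -> float:
    """Prob(Re-Identification) = Prob(Attack) x Prob(Re-Identification | Attack)."""
    for p in (p_attack, p_reid_given_attack):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_attack * p_reid_given_attack


# Hypothetical inputs, for illustration only.
# Controlled platform: attack unlikely, but likely to succeed if attempted.
controlled = reidentification_probability(p_attack=0.01, p_reid_given_attack=0.5)

# Open release: attack likely, so the conditional probability must be driven down.
open_release = reidentification_probability(p_attack=0.9, p_reid_given_attack=0.005)

print(f"controlled platform: {controlled:.4f}")
print(f"open release:        {open_release:.4f}")
```

Under these assumed inputs both routes reach a similar overall probability, but by controlling different factors of the product.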

Probability Of…          Re-Identification  Attack  Re-Identification Given An Attack
Data Sharing Platforms   LOW                LOW     HIGH
Open Sharing             LOW                HIGH    LOW
(e.g. EMA Policy 0070, GAAIN)

Data Sharing Platforms: rely on the Data Sharing Platform’s tracking and legal and reputational implications

Open Sharing: need to reduce the risk of a re-identification in the event of an attack to be close to 0

Types Of Attack

Risk Type

– Prosecutor: individual known to be in trial

– Journalist: individual just known to be in wider population

Motivation

– “Nosy Neighbour”

– “Demonstration”

In an open sharing environment: all attack types are feasible

Need to control for the highest risk: Demonstration – showing it is possible. Any study participant can be a target

Assume knowledge of trial participation – could be from multiple external sources

Risk Metrics – Concept Of Equivalence Classes

Player                  Position  Age    Equivalence Class Size
Jack Butland            GK        20-30  3
Jordan Pickford         GK        20-30  3
Nick Pope               GK        20-30  3
Trent Alexander-Arnold  DF        <20    1
Gary Cahill             DF        >30    2
Ashley Young            DF        >30    2
Fabian Delph            DF        20-30  7
Phil Jones              DF        20-30  7
Harry Maguire           DF        20-30  7
Danny Rose              DF        20-30  7
John Stones             DF        20-30  7
Kieran Trippier         DF        20-30  7
Kyle Walker             DF        20-30  7
Dele Alli               MF        20-30  5
Eric Dier               MF        20-30  5
Jordan Henderson        MF        20-30  5
Jesse Lingard           MF        20-30  5
Ruben Loftus-Cheek      MF        20-30  5
Harry Kane              FW        20-30  4
Marcus Rashford         FW        20-30  4
Raheem Sterling         FW        20-30  4
Danny Welbeck           FW        20-30  4
Jamie Vardy             FW        >30    1

Different Risk Metrics

Equivalence class - set of patients with the same set of quasi-identifiers
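As a minimal sketch of this definition, the Python below groups the squad example by its quasi-identifiers and counts the size of each equivalence class. It assumes position and age band are the only quasi-identifiers; the function name `equivalence_class_sizes` and the record layout are illustrative choices, not taken from the slides.

```python
from collections import Counter

# (player, position, age band) from the squad example. Position and age band
# are assumed to be the quasi-identifiers an attacker could know.
SQUAD = [
    ("Jack Butland", "GK", "20-30"), ("Jordan Pickford", "GK", "20-30"),
    ("Nick Pope", "GK", "20-30"), ("Trent Alexander-Arnold", "DF", "<20"),
    ("Gary Cahill", "DF", ">30"), ("Fabian Delph", "DF", "20-30"),
    ("Phil Jones", "DF", "20-30"), ("Harry Maguire", "DF", "20-30"),
    ("Danny Rose", "DF", "20-30"), ("John Stones", "DF", "20-30"),
    ("Kieran Trippier", "DF", "20-30"), ("Kyle Walker", "DF", "20-30"),
    ("Ashley Young", "DF", ">30"), ("Dele Alli", "MF", "20-30"),
    ("Eric Dier", "MF", "20-30"), ("Jordan Henderson", "MF", "20-30"),
    ("Jesse Lingard", "MF", "20-30"), ("Ruben Loftus-Cheek", "MF", "20-30"),
    ("Harry Kane", "FW", "20-30"), ("Marcus Rashford", "FW", "20-30"),
    ("Raheem Sterling", "FW", "20-30"), ("Jamie Vardy", "FW", ">30"),
    ("Danny Welbeck", "FW", "20-30"),
]


def equivalence_class_sizes(records):
    """Count how many records share each (position, age band) combination."""
    return Counter((position, band) for _name, position, band in records)


sizes = equivalence_class_sizes(SQUAD)
for (position, band), size in sorted(sizes.items()):
    print(f"{position:<4}{band:<7}{size}")
```

On this data Trent Alexander-Arnold (DF, <20) and Jamie Vardy (FW, >30) each sit in a class of size 1, while Gary Cahill and Ashley Young share the class (DF, >30).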

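Writing $f_i$ for the size of equivalence class $i$, $J$ for the number of equivalence classes, $n$ for the number of patients, $\tau$ for a threshold class size and $I(\cdot)$ for the indicator function:

Proportion of patients in an equivalence class smaller than a threshold size:

$$\frac{1}{n} \sum_{i=1}^{J} f_i \times I(f_i < \tau)$$

1 / Average size of equivalence classes:

$$\frac{J}{n}$$

1 / Size of smallest equivalence class:

$$\frac{1}{\min_i f_i}$$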
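The three metrics can be computed directly from the equivalence class sizes. The sketch below is illustrative only: it takes f_i as the size of class i, n as the total number of patients and J as the number of classes, uses the squad example grouped by position and age band (an assumed choice of quasi-identifiers), and picks a hypothetical threshold of tau = 3. The function name `risk_metrics` is not from the slides.

```python
# Equivalence class sizes f_i for the squad example, keyed by
# (position, age band). Treating these two columns as the
# quasi-identifiers is an assumption made for illustration.
CLASS_SIZES = {
    ("GK", "20-30"): 3,
    ("DF", "<20"): 1,
    ("DF", ">30"): 2,
    ("DF", "20-30"): 7,
    ("MF", "20-30"): 5,
    ("FW", "20-30"): 4,
    ("FW", ">30"): 1,
}


def risk_metrics(class_sizes, tau):
    """Return the three risk metrics for a collection of class sizes."""
    f = list(class_sizes)
    n = sum(f)  # total number of patients
    J = len(f)  # number of equivalence classes
    return {
        # Proportion of patients in a class smaller than the threshold tau.
        "proportion_below_threshold": sum(fi for fi in f if fi < tau) / n,
        # Reciprocal of the average class size (average size is n / J).
        "average_risk": J / n,
        # 1 / size of the smallest class: the worst-case patient.
        "maximum_risk": 1 / min(f),
    }


# tau = 3 is a hypothetical threshold, not a recommended value.
metrics = risk_metrics(CLASS_SIZES.values(), tau=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With any class of size 1 present, the maximum-risk metric equals 1 regardless of how large the other classes are, which is why it is the most conservative of the three.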

As we are obliged to protect all patients enrolled in the trial from re-identification, and a demonstration attack, which would target the weak spot of the study, is feasible, we focus on the minimum equivalence class size as the risk metric.

Relationship Of Study Size To Risk

Mitigation Strategies

• Generalisation / Grouping

• Dropping Variables

• Reconfigure the Data Environment
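To illustrate the first two strategies, the sketch below applies them to the positions and ages from the squad example and compares equivalence class summaries. It is a rough illustration under stated assumptions: position and age are taken as the only quasi-identifiers, ages of exactly 20 and 30 are placed in the 20-30 band, and the helper names `age_band` and `summarise` are hypothetical.

```python
from collections import Counter

# (position, age) for the 23 squad members in the example table.
RECORDS = [
    ("GK", 25), ("GK", 24), ("GK", 26),
    ("DF", 19), ("DF", 32), ("DF", 28), ("DF", 26), ("DF", 25),
    ("DF", 27), ("DF", 24), ("DF", 27), ("DF", 28), ("DF", 32),
    ("MF", 22), ("MF", 24), ("MF", 27), ("MF", 25), ("MF", 22),
    ("FW", 24), ("FW", 20), ("FW", 23), ("FW", 31), ("FW", 27),
]


def age_band(age):
    """Generalise an exact age into the bands used in the example."""
    if age < 20:
        return "<20"
    if age <= 30:  # assumption: 20 and 30 both fall in the middle band
        return "20-30"
    return ">30"


def summarise(keys):
    """Summarise the equivalence classes formed by the given keys."""
    sizes = Counter(keys)
    return {
        "classes": len(sizes),
        "min_size": min(sizes.values()),
        "singletons": sum(1 for size in sizes.values() if size == 1),
    }


raw = summarise(RECORDS)                                            # no mitigation
banded = summarise((pos, age_band(age)) for pos, age in RECORDS)    # generalisation
position_only = summarise(pos for pos, _age in RECORDS)             # dropping age

for label, summary in [("raw", raw), ("banded", banded), ("position only", position_only)]:
    print(label, summary)
```

On this data, banding age cuts the number of unique records from 15 to 2 but leaves the smallest class at size 1, whereas dropping age altogether raises the smallest class to 3 at the cost of losing the variable.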

Sharing Raw Data

– Need to consider all combinations of all key variables that an attacker could know

– Many variables could never be known by an attacker, so are not of concern

– With larger numbers of key variables, datasets can become granular

Sharing Reports

– Tables presented including key variables or combinations of tables of key variables

– Narratives – focus on single individuals, often with interesting circumstances

• Need to consider the combinations of key variables which normally appear in a narrative

Sharing Data Is Possible

Some Points to Consider in Sharing Data

Impact Of Multiple Tables Within a Report

– Combining disparate pieces of information within a report together

Impact Of Multiple Data Sources

– Multiple Trials

– Other sources – social media, government releases, real world data

– Available now and in the future

Impact Of A Multiplicity of Studies

– P(re-identification) in a trial may be small, but is >0

– P(at least one re-identification) across N trials grows
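The growth of P(at least one re-identification) with the number of trials can be sketched as follows. This is a simplified illustration that assumes the N trials are independent and share the same per-trial probability p, in which case the probability is 1 - (1 - p)^N. Neither assumption comes from the slides, and the value p = 0.001 is purely hypothetical.

```python
def prob_at_least_one(p_single: float, n_trials: int) -> float:
    """P(at least one re-identification) across n_trials.

    Assumes independent trials with a common per-trial probability p_single,
    so the result is 1 - (1 - p_single) ** n_trials.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if n_trials < 0:
        raise ValueError("n_trials must be non-negative")
    return 1.0 - (1.0 - p_single) ** n_trials


# p = 0.001 is a hypothetical per-trial probability, for illustration only.
for n in (1, 10, 100, 1000):
    print(f"N = {n:>4}: {prob_at_least_one(0.001, n):.4f}")
```

Even a small per-trial probability accumulates as the number of shared trials increases, which is the multiplicity concern in concrete terms.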

At some stage, across a company or across the industry as a whole, this becomes relevant, with potential vulnerability to a concerted attack

Conclusion

Data sharing is here to stay

High-profile activity, e.g. recent Facebook publicity

Sharing data is a two-way process: it also gives you access to others’ data, creating a win/win

Mindset Shift – Your data will be shared, but you can access others’ data

Tractable Problem – Possible to overcome challenges as seen in other industries

Maintenance of patient privacy is an absolute priority

Doing now what patients need next

Useful links

• FAIR data principles

• ClinicalStudyDataRequest.com

• YODA

• Vivli

• Project DataSphere

• TransCelerate CDT & PSoC

• EFSPI/PSI WG & publications

• PhUSE WG

• EFPIA/PhRMA principles for responsible data sharing

• ClinicalTrials.gov

• EudraCT

• ICMJE recommendations on data sharing

• EMA Policy 0070

• Health Canada Public Release of Clinical Information

• FDA Clinical Data Summary Pilot Program

• UKAN

• UKAN ADF
