Data-driven discovery

Case studies from patient records and spontaneous reports

Niklas Norén, PhD
Uppsala Monitoring Centre
WHO Collaborating Centre for International Drug Monitoring

ISPE 2011 Mid-Year Meeting. April 9, 2011. Florence, Italy.

Disclosure

• Uppsala Monitoring Centre research is primarily self-funded
• The results on patient records in this presentation came out of a now-finished pilot study co-financed by IMS Health
• Government support through grants
  – IMI PROTECT
  – Monitoring Medicines (FP7)
  – OMOP (FNIH)

Presentation outline

• Data-driven discovery
• Three case studies
• Lessons learned

RTWLAAQZDDYTFGFQXYSSSTFGLOIRQQXY
DDAFAFEKRRQTNGDIRGEEWGWCFSSJWQYS
JKAOFVHVVIKQXYWZSCYGRRWOYSAOVDVA
DWUACVVSVKKIHLZJWDOVEYIQXYAQVVVQ
KMRDPVWVFERUQTESQWMIERFPSYDVDAVQ
JKKOLTHSNMKQXYYFNGHDDLYOCSAOLDZA
VKKIHLZKRRQTNGDIRGEEWGWCFSSJWQXY
QXYMYWZACIYRGEFQXYSWOYSAWLAAQZSE

What is data-driven discovery?

• The application of analytics to detect patterns in data
• Let data lead the way!
  – No pre-specified hypothesis
  – Parallel perspectives on data
  – Many covariates, many pattern types

• Simple diagnostic test: Can you enumerate the possible findings prior to your analysis?

Why is it important?

• Identify the unexpected
• Obtain a more complete perspective of data
• Highlight issues that may alter the interpretation of your primary analysis

When is it relevant?

• Fundamental to broad surveillance
• A core component in signal refinement and refutation
• Useful for data management and quality assurance
• A safeguard in hypothesis-driven research

Pattern discovery
Hand and Bolton. J Appl Statist, 2004

• A pattern can be defined as a local deviation from a global baseline model
• Affects a limited number of covariates and/or data points
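A minimal sketch of this definition, using the letter grid from the slides. The baseline assumed here (each letter occurs anywhere in the grid at its overall frequency) and the helper name `local_excess` are illustrative assumptions, not taken from Hand and Bolton.

```python
from collections import Counter

# Letter grid from the slides: 8 rows of 32 letters.
GRID = [
    "RTWLAAQZDDYTFGFQXYSSSTFGLOIRQQXY",
    "DDAFAFEKRRQTNGDIRGEEWGWCFSSJWQYS",
    "JKAOFVHVVIKQXYWZSCYGRRWOYSAOVDVA",
    "DWUACVVSVKKIHLZJWDOVEYIQXYAQVVVQ",
    "KMRDPVWVFERUQTESQWMIERFPSYDVDAVQ",
    "JKKOLTHSNMKQXYYFNGHDDLYOCSAOLDZA",
    "VKKIHLZKRRQTNGDIRGEEWGWCFSSJWQXY",
    "QXYMYWZACIYRGEFQXYSWOYSAWLAAQZSE",
]


def local_excess(grid, letter, first_row, last_row):
    """Compare the count of `letter` in rows first_row..last_row
    (0-based, inclusive) with the count expected under the global
    baseline, i.e. the letter's overall frequency in the grid."""
    all_letters = "".join(grid)
    global_rate = Counter(all_letters)[letter] / len(all_letters)
    window = "".join(grid[first_row:last_row + 1])
    observed = window.count(letter)
    expected = global_rate * len(window)
    return observed, expected


# Rows 3 to 5 of the grid are indices 2 to 4.
observed, expected = local_excess(GRID, "V", 2, 4)
print(observed, round(expected, 3))
```

For the letter V in rows 3 to 5 this gives 16 observed occurrences against roughly 6.4 expected: a deviation confined to a few rows, which the global letter frequencies alone would not reveal.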

[Figure: the letter grid, with the letter V highlighted where it clusters in rows 3 to 5]

How do you do it?

• No such thing as completely open-ended analysis!
• Need to define:
  – Type of patterns (examples on following slides)
  – Baseline model
  – Covariates
  – Data subset(s)
  – How to follow up
• The challenge is to maintain power to detect the unexpected

Success factors

• Data preparation
  – Effective data management and cleaning
• Robustness to data quality issues
  – ... or relevant patterns may be drowned in noise
• Control of false alerts
  – Some false positives are OK, but the positive predictive value must be acceptable
  – The rate of spurious associations can often be evaluated with Monte Carlo simulation or permutation tests
  – Biases are more difficult!

Record matching
Norén et al. Data Min Knowl Discov, 2007

• Duplicate detection in six million VigiBase reports
• Screen for pairs of suspiciously similar records
• Baseline model: independent reports
• Score each covariate with log-likelihood ratio (matches rewarded, mismatches penalized)

[Figure: the letter grid, with the long letter sequence shared by rows 2 and 7 highlighted]

Record matching method
Norén et al. Data Min Knowl Discov, 2007

• Covariates: country of origin, patient gender, patient age, date of onset, outcome, drugs, suspected ADRs
• Suspected duplicates reviewed by national centre
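The scoring idea (matches rewarded, mismatches penalized, relative to a baseline of independent reports) can be sketched as below. This is a simplified illustration, not the published model: the functions `field_score` and `pair_score`, the value frequencies and the per-field error rates are all hypothetical.

```python
import math


def field_score(a, b, freq, error_rate):
    """Log-likelihood ratio (duplicates vs. independent reports) for one
    covariate. `freq` maps each value to its relative frequency and
    `error_rate` is the assumed chance that a true duplicate disagrees."""
    if a is None or b is None:
        return 0.0  # missing value: no evidence either way
    if a == b:
        # Agreement on a rare value is stronger evidence than on a common one.
        return math.log((1.0 - error_rate) / freq[a])
    chance_match = sum(p * p for p in freq.values())
    return math.log(error_rate / (1.0 - chance_match))


def pair_score(rec_a, rec_b, freqs, error_rates):
    """Total score for a pair of reports: the sum over covariates."""
    return sum(
        field_score(rec_a.get(f), rec_b.get(f), freqs[f], error_rates[f])
        for f in freqs
    )


# Hypothetical frequencies and error rates, for illustration only.
FREQS = {
    "gender": {"F": 0.6, "M": 0.4},
    "country": {"Norway": 0.01, "Sweden": 0.02, "other": 0.97},
}
ERROR_RATES = {"gender": 0.02, "country": 0.01}

rec_a = {"gender": "F", "country": "Norway"}
rec_b = {"gender": "F", "country": "Norway"}
rec_c = {"gender": "M", "country": "Sweden"}

print(pair_score(rec_a, rec_b, FREQS, ERROR_RATES))  # positive: suspicious pair
print(pair_score(rec_a, rec_c, FREQS, ERROR_RATES))  # negative: unlikely duplicates
```

Summing per-covariate scores treats the covariates as independent, and pairs above a chosen threshold would then go to manual review, as in the second bullet above.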

Record matching results
Norén et al. Data Min Knowl Discov, 2007

• 78,000 suspected duplicates (2011)
• ~65% recall, ~80% precision (relative to manual review; small study!)
• Highlighted non-duplicates are typically otherwise related

Country of origin  Patient age  Patient gender  Drugs                  ADRs         Date of onset
Norway             8            F               Epinephrine/           Facial pain  2003-12-16
Norway             18           F               Epinephrine/Lidocaine  Facial pain  2003-12-16
Norway             29           F               Epinephrine/Lidocaine  Facial pain  2003-12-16

• Three reports from the same dentist!

Interaction detection
Norén et al. Stat Med, 2008.

• Drug interaction detection in VigiBase
• Covariates: drugs and ADRs
• Identify excess reporting of ADR with two drugs

[Figure: the letter grid, with the recurring letter sequence QXY highlighted]

Interaction detection method
Norén et al. Stat Med, 2008.

• Baseline model: additive attributable risks
• Shrinkage observed-to-expected ratio to protect against spurious associations
• Suspected interactions assessed by clinical experts
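A rough sketch of how an additive baseline and a shrinkage observed-to-expected ratio fit together. It is a simplification for illustration and should not be read as the estimator published in the Stat Med paper: the function name, the argument names and the shrinkage constant of 0.5 are assumptions made here.

```python
import math


def shrunk_interaction_score(n_both_with_adr, n_both, f10, f01, f00, alpha=0.5):
    """log2 of a shrunk observed-to-expected ratio for an ADR reported
    with two drugs together.

    f00, f10, f01: relative reporting frequency of the ADR on reports with
    neither drug, only drug 1 and only drug 2 (assumed given).
    n_both: number of reports listing both drugs.
    n_both_with_adr: number of those reports that also list the ADR.
    """
    # Additive attributable risks: background risk plus each drug's excess.
    expected_risk = f00 + (f10 - f00) + (f01 - f00)
    expected_risk = min(max(expected_risk, 0.0), 1.0)
    expected = expected_risk * n_both
    # Adding alpha to both counts pulls ratios based on few reports towards 1.
    return math.log2((n_both_with_adr + alpha) / (expected + alpha))


# Hypothetical numbers: 1000 reports list both drugs, 4 expected with the ADR.
print(shrunk_interaction_score(20, 1000, f10=0.002, f01=0.003, f00=0.001))
# One report against 0.04 expected: the raw ratio is 25, the shrunk one is far smaller.
print(shrunk_interaction_score(1, 10, f10=0.002, f01=0.003, f00=0.001))
```

The second call shows the purpose of the shrinkage: a single report on a very rare combination no longer produces an extreme score on its own.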

Interaction detection results
Norén et al. Stat Med, 2008.

• 15,000 triplets with excess reporting rates
• Among those are cases of known interactions such as cerivastatin/gemfibrozil, /clarithromycin etc.
• Also report clusters and a issue:

Drugs          ADR(s)                  # Reports  Expected  Comment
Bupivacain     Strabismus              25         <1        25 reports listing the same
Hyaluronidase                                               five drugs and ADR submitted by the same
Gentamicin                                                  reporter in 1985
Lidocaine      Drug maladministration  51         <1        Confusion of brand names
Citalopram                                                  (Celebrex & Celexa)

Temporal pattern discovery
Norén et al. Data Min Knowl Discov, 2010

• Pattern discovery in IMS UK collection of two million longitudinal patient records
• Covariates: drugs and medical events
• Screen for medical events that occur more often than expected soon after start of treatment

Temporal pattern discovery method
Norén et al. Data Min Knowl Discov, 2010

• Baseline model: Relative frequency of medical event constant over time in exposed patients
• Self-controlled cohort with external control group to adjust for age gradients, variations in use of healthcare and clustering of doctor’s visits
• Follow-up:
  – Visualisation of temporal patterns
  – Computerized highlighting of potential confounders
  – Clinical assessment of patient details
  – Secondary analysis (related drugs or events, stratification...)

Temporal pattern discovery results
Norén et al. Data Min Knowl Discov, 2010

• 42,000 associations between drugs and events
• A variety of temporal patterns

Lessons learned

• Many different patterns can be highlighted as deviations from a single simple baseline model
• A substantial proportion of findings relate to data quality issues or highlight aspects of data that are important for interpretation of the primary analysis
• Major intellectual input required after initial discovery!

More lessons learned

• 10,000+ patterns -> additional triages required
  – Emerging patterns
  – Focus areas
  – Predictive models
• Careful communication!
• Many times, biases dominate!
• Multiple comparisons are a real issue in some applications
  – Naive sub-group analyses in spontaneous reports can lead to ~50% false positive rates (Hopstadius and Norén. Submitted, 2011)

[Figure: the letter grid, with several patterns highlighted at once]

References

1. Hand DJ, Bolton R. Pattern discovery and detection: a unified statistical methodology. Journal of Applied Statistics 2004; 31(8):885-924.
2. Norén GN, Orre R, Bate A, Edwards IR. Duplicate detection in adverse drug reaction surveillance. Data Mining and Knowledge Discovery 2007; 14(3):305-328.
3. Norén GN, Sundberg R, Bate A, Edwards IR. A statistical methodology for drug–drug interaction surveillance. Statistics in Medicine 2008; 27:3057-3070.
4. Norén GN, Hopstadius J, Bate A, Star K, Edwards IR. Temporal pattern discovery in longitudinal electronic patient records. Data Mining and Knowledge Discovery 2010; 20(3):361-387.
5. Norén GN, Edwards IR. Opportunities and challenges of adverse drug reaction surveillance in electronic patient records. Review 2010; 4(1):17-20.
6. Hopstadius J, Norén GN. Robust discovery of local associations in high-dimensional binary data. Submitted, 2011.

Challenges for the future

• Other types of patterns
  – Syndromes
  – Outlying temporal patterns
  – Complex event sequences
• Unstructured data – free text
• Multiple comparisons
  – Communication strategies
  – Secondary analyses spuriously contradicting the original finding -> risk of false negatives
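The multiple-comparisons point, together with the earlier remark that the rate of spurious associations can often be evaluated with Monte Carlo simulation or permutation tests, can be illustrated with a small simulation. Everything in it is an assumption made for illustration (the sample sizes, the two-proportion z statistic, the max-statistic permutation threshold); it does not reproduce the analysis in Hopstadius and Norén.

```python
import math
import random


def z_stat(exposed_idx, outcome, n):
    """Two-proportion z statistic: outcome rate in exposed vs. unexposed."""
    n1 = len(exposed_idx)
    n0 = n - n1
    total = sum(outcome)
    if n1 == 0 or n0 == 0 or total == 0 or total == n:
        return 0.0  # statistic undefined: treat as no evidence
    events1 = sum(outcome[i] for i in exposed_idx)
    p1 = events1 / n1
    p0 = (total - events1) / n0
    p = total / n
    return (p1 - p0) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n0))


def screen(n=200, n_covariates=200, n_perm=100, seed=1):
    """Screen covariates that are all unrelated to the outcome by construction."""
    rng = random.Random(seed)
    outcome = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
    # Each covariate is stored as the list of patient indices where it is present.
    covariates = [
        [i for i in range(n) if rng.random() < 0.5] for _ in range(n_covariates)
    ]
    observed = [abs(z_stat(c, outcome, n)) for c in covariates]
    naive_alerts = sum(z > 1.96 for z in observed)  # nominal 5% level each

    # Permutation test: shuffle the outcome and record the largest statistic
    # over all covariates, giving a threshold that accounts for the screening.
    shuffled = outcome[:]
    max_null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        max_null.append(max(abs(z_stat(c, shuffled, n)) for c in covariates))
    max_null.sort()
    threshold = max_null[(95 * n_perm) // 100]
    adjusted_alerts = sum(z > threshold for z in observed)
    return naive_alerts, threshold, adjusted_alerts


naive_alerts, threshold, adjusted_alerts = screen()
print(naive_alerts, round(threshold, 2), adjusted_alerts)
```

Because no covariate is truly associated with the outcome here, every naive alert is a false positive. The permutation threshold removes most of them, at the price of reduced power, which is the tension noted under "How do you do it?".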