Data-driven discovery
Case studies from patient records and spontaneous reports
Niklas Norén, PhD
Uppsala Monitoring Centre
WHO Collaborating Centre for International Drug Monitoring
ISPE 2011 Mid-Year Meeting. April 9, 2011. Florence, Italy.

Disclosure
• Uppsala Monitoring Centre research primarily self-funded
• The results on patient records in this presentation came out of a now-finished pilot study co-financed by IMS Health
• Government support through grants
  – IMI PROTECT
  – Monitoring Medicines (FP7)
  – OMOP (FNIH)

Presentation outline
• Data-driven discovery
• Three case studies
• Lessons learned

[Figure: grid of seemingly random letters]

What is data-driven discovery?
• The application of analytics to detect patterns in data
• Let data lead the way!
  – No pre-specified hypothesis
  – Parallel perspectives on data
  – Many covariates, many pattern types
• Simple diagnostic test: Can you enumerate the possible findings prior to your analysis?

Why is it important?
• Identify the unexpected
• Obtain a more complete perspective of data
• Highlight issues that may alter the interpretation of your primary analysis

When is it relevant?
• Fundamental to broad surveillance
• A core component in signal refinement and refutation
• Useful for data management and quality assurance
• A safeguard in hypothesis-driven research

Pattern discovery
Hand and Bolton, J Appl Statist, 2004
• A pattern can be defined as a local deviation from a global baseline model
• Affects a limited number of covariates and/or data points
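
The definition above can be made concrete with a toy letter grid. The following is a minimal sketch, not a method from the cited paper: the function name `local_deviations`, the window size and the shrinkage constant are all illustrative assumptions. The global baseline is the overall letter frequency, and each window is scored by how strongly a letter's local count exceeds what that baseline predicts.

```python
from collections import Counter

def local_deviations(grid, height=3, width=5, shrink=0.5):
    """Rank (window, letter) pairs by a shrunk observed-to-expected ratio.

    grid is a list of equally long rows (strings or lists of letters).
    The baseline assumes every letter is spread evenly over the grid.
    """
    cells = [ch for row in grid for ch in row]
    global_freq = {ch: n / len(cells) for ch, n in Counter(cells).items()}
    results = []
    for r in range(len(grid) - height + 1):
        for c in range(len(grid[0]) - width + 1):
            # Count letters inside the current window.
            window = Counter(
                ch for row in grid[r:r + height] for ch in row[c:c + width]
            )
            for ch, observed in window.items():
                expected = global_freq[ch] * height * width
                ratio = (observed + shrink) / (expected + shrink)
                results.append((ratio, r, c, ch))
    # Strongest local deviations first.
    return sorted(results, reverse=True)
```

The first element of the returned list is the window and letter that deviate most from the global baseline. Any real application would replace letters with covariates and the window with a data subset.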
[Figure: letter grid with a local concentration of the letter V highlighted]

How do you do it?
• No such thing as completely open-ended analysis!
• Need to define:
  – Type of patterns (examples on following slides)
  – Baseline model
  – Covariates
  – Data subset(s)
  – How to follow up
• The challenge is to maintain power to detect the unexpected

Success factors
• Data preparation
  – Effective data management and cleaning
• Robustness to data quality issues
  – ... or relevant patterns may be drowned in noise
• Control of false alerts
  – Some false positives are OK, but positive predictive value must be acceptable
  – Rate of spurious associations can often be evaluated with Monte Carlo simulation or permutation tests
  – Biases are more difficult!

Record matching
Norén et al., Data Min Knowl Discov, 2007
• Duplicate detection in six million VigiBase reports
• Screen for pairs of suspiciously similar records
• Baseline model: independent reports
• Score each covariate with log-likelihood ratio (matches rewarded, mismatches penalized)

[Figure: letter grid with a long letter sequence shared by two rows highlighted]

Record matching method
Norén et al., Data Min Knowl Discov, 2007
• Covariates: country of origin, patient gender, patient age, date of onset, outcome, drugs, suspected ADRs
• Suspected duplicates reviewed by national centre
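
The scoring idea from the record matching slides (each covariate contributes a log-likelihood ratio against a baseline of independent reports, with matches rewarded and mismatches penalized) might look roughly like the sketch below. This is a deliberately simplified illustration, not the published model: the function name `llr_match_score`, the `value_freq` table and the fixed `error_rate` are assumptions made for the example.

```python
import math

def llr_match_score(rec_a, rec_b, value_freq, error_rate=0.05):
    """Sum per-covariate log-likelihood ratios: duplicates vs. independent reports.

    value_freq maps each covariate to {value: relative frequency in the database}.
    error_rate is an assumed probability that true duplicates disagree on a covariate.
    """
    score = 0.0
    for field, freqs in value_freq.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # missing information contributes no evidence either way
        if a == b:
            # Independent reports agree on value a with probability freqs[a] ** 2,
            # duplicates with roughly freqs[a] * (1 - error_rate).
            score += math.log2((1 - error_rate) / freqs[a])
        else:
            # Probability that two independent reports disagree on this covariate.
            p_disagree = 1.0 - sum(f * f for f in freqs.values())
            score += math.log2(error_rate / p_disagree)
    return score
```

Under these assumptions a match on a rare value (an unusual drug or a specific onset date) earns more than a match on a common one (patient gender), which is what makes a pair "suspiciously similar" rather than merely similar.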
Record matching results
Norén et al., Data Min Knowl Discov, 2007
• 78,000 suspected duplicates (2011)
• ~65% recall, ~80% precision (relative to manual review; small study!)
• Highlighted non-duplicates typically otherwise related

Country of origin  Patient age  Patient gender  Drugs                  ADRs         Date of onset
Norway             8            F               Epinephrine/Lidocaine  Facial pain  2003-12-16
Norway             18           F               Epinephrine/Lidocaine  Facial pain  2003-12-16
Norway             29           F               Epinephrine/Lidocaine  Facial pain  2003-12-16

• Three reports from the same dentist!

Interaction detection
Norén et al., Stat Med, 2008
• Drug interaction detection in VigiBase
• Covariates: drugs and ADRs
• Identify excess reporting of ADR with two drugs

[Figure: letter grid with recurring occurrences of the letter combination QXY highlighted]

Interaction detection method
Norén et al., Stat Med, 2008
• Baseline model: additive attributable risks
• Shrinkage of the observed-to-expected ratio to protect against spurious associations
• Suspected interactions assessed by clinical experts
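
To make the two ideas above tangible, here is a rough sketch under stated assumptions. It uses a plain additive baseline on the relative reporting frequency scale and a shrinkage constant of 0.5. Both choices, along with the function names, are illustrative; the estimator in the cited paper may differ in detail.

```python
import math

def additive_expected(n_both, adr_a, n_a, adr_b, n_b, adr_none, n_none):
    """Expected number of ADR reports with both drugs under additive risks.

    adr_x / n_x are ADR reports / all reports with drug A only, drug B only,
    and neither drug. The excess risks of A and B over background are added.
    """
    f_none = adr_none / n_none
    f_a = adr_a / n_a
    f_b = adr_b / n_b
    f_both = f_none + (f_a - f_none) + (f_b - f_none)
    f_both = min(max(f_both, 0.0), 1.0)  # keep the baseline a valid proportion
    return f_both * n_both

def shrunk_log_oe(observed, expected, shrink=0.5):
    """log2 observed-to-expected ratio, pulled toward 0 when counts are small."""
    return math.log2((observed + shrink) / (expected + shrink))
```

With a single observed report and an expected count of 0.01, the raw ratio would be 100, whereas the shrunk ratio is about 3. This is how shrinkage keeps one-off reports from dominating a screen of many drug-drug-ADR triplets.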
Interaction detection results
Norén et al., Stat Med, 2008
• 15,000 triplets with excess reporting rates
• Among those are cases of known interactions such as cerivastatin/gemfibrozil, digoxin/clarithromycin etc.
• Also report clusters and a patient safety issue:

Drugs                                                        ADR(s)                  # Reports  Expected  Comment
Bupivacain, Hyaluronidase, Cefazolin, Gentamicin, Lidocaine  Strabismus              25         <1        25 reports listing the same five drugs and ADR, submitted by the same reporter in 1985
Celecoxib, Citalopram                                        Drug maladministration  51         <1        Confusion of brand names (Celebrex & Celexa)

Temporal pattern discovery
Norén et al., Data Min Knowl Discov, 2010
• Pattern discovery in IMS UK collection of two million longitudinal patient records
• Covariates: drugs and medical events
• Screen for medical events that occur more often than expected soon after start of treatment

Temporal pattern discovery method
Norén et al., Data Min Knowl Discov, 2010
• Baseline model: relative frequency of medical event constant over time in exposed patients
• Self-controlled cohort with external control group to adjust for age gradients, variations in use of healthcare and clustering of doctor’s visits
• Follow-up:
  – Visualisation of temporal patterns
  – Computerized highlighting of potential confounders
  – Clinical assessment of patient details
  – Secondary analysis (related drugs or events, stratification...)

Temporal pattern discovery results
Norén et al., Data Min Knowl Discov, 2010
• 42,000 associations between drugs and events
• A variety of temporal patterns

Lessons learned
• Many different patterns can be highlighted as deviations from a single simple baseline model
• A substantial proportion of findings relate to data quality issues or highlight aspects of data that are important for interpretation of the primary analysis
• Major intellectual input required after initial discovery!

More lessons learned
• 10,000+ patterns -> additional triages required
  – Emerging patterns
  – Focus areas
  – Predictive models
• Careful communication!
• Many times, biases dominate!
• Multiple comparisons are a real issue in some applications
  – Naive sub-group analyses in spontaneous reports can lead to ~50% false positive rates (Hopstadius and Norén, submitted, 2011)

[Figure: letter grid with many different letter groups highlighted at once]

References
1. Hand DJ, Bolton R. Pattern discovery and detection: a unified statistical methodology. Journal of Applied Statistics, 2004. 31(8):885-924.
2. Norén GN, Orre R, Bate A, Edwards IR. Duplicate detection in adverse drug reaction surveillance. Data Mining and Knowledge Discovery, 2007. 14(3):305-328.
3. Norén GN, Sundberg R, Bate A, Edwards IR. A statistical methodology for drug–drug interaction surveillance. Statistics in Medicine, 2008. 27:3057-3070.
4. Norén GN, Hopstadius J, Bate A, Star K, Edwards IR. Temporal pattern discovery in longitudinal electronic patient records. Data Mining and Knowledge Discovery, 2010. 20(3):361-387.
5. Norén GN, Edwards IR. Opportunities and challenges of adverse drug reaction surveillance in electronic patient records. Pharmacovigilance Review, 2010. 4(1):17-20.
6. Hopstadius J, Norén GN. Robust discovery of local associations in high-dimensional binary data. Submitted, 2011.

Challenges for the future
• Other types of patterns
  – Syndromes
  – Outlying temporal patterns
  – Complex event sequences
• Unstructured data
  – free text
• Multiple comparisons
  – Communication strategies
  – Secondary analyses spuriously contradicting the original finding -> risk of false negatives
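
The multiple-comparisons concern, together with the earlier remark that the rate of spurious associations can often be evaluated with permutation tests, can be illustrated with a small sketch. It uses a max-statistic permutation test, a generic technique chosen for this example rather than anything described in the talk. The function names and the simple rate-difference statistic are assumptions.

```python
import random

def max_excess(outcome, covariates):
    """Largest excess in outcome rate for records with vs. without any covariate."""
    best = 0.0
    for cov in covariates:
        with_c = [o for o, c in zip(outcome, cov) if c]
        without = [o for o, c in zip(outcome, cov) if not c]
        if with_c and without:
            diff = sum(with_c) / len(with_c) - sum(without) / len(without)
            best = max(best, diff)
    return best

def permutation_p_value(outcome, covariates, n_perm=200, seed=0):
    """P-value for the strongest association, adjusted for screening all covariates.

    Shuffling the outcome breaks any real link to the covariates while keeping
    the covariates' own structure, so each shuffle shows how large the best
    finding can get by chance alone.
    """
    rng = random.Random(seed)
    observed = max_excess(outcome, covariates)
    shuffled = list(outcome)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_excess(shuffled, covariates) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the null distribution is built from the maximum over all covariates, the resulting p-value already accounts for the number of comparisons. As noted on the success factors slide, this addresses spurious associations only; biases need separate treatment.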