The EuResist Database

Yardena Peres, Carmel Kent IT for Healthcare & Life sciences Group

IBM Labs in Haifa © 2006 IBM Corporation IBM Labs in Haifa

Outline

• Background
  - EuResist project
• Data Integration
  - Virtual vs. Physical Data Integration
  - EuResist Schema
  - Availability
• Data privacy
• Mapping challenges
• Data Cleansing
• Alignment with Standards
• Current Collaborations


Background - EuResist project

• Funded by EU FP6
• Multidisciplinary Consortium
  - Informa S.r.l., Rome, Italy
  - University of Siena, Italy
  - University of Roma TRE, Italy
  - Karolinska University Hospital, Sweden
  - University of Köln, Germany
  - Max Planck Institute, Germany
  - KFKI Research Institute (RMKI), Hungary
  - Kingston University, UK
  - EFPIA (European Federation of Pharmaceutical Industries and Associations)
  - IBM Haifa Research Labs, Israel


Background – EuResist Goals

• Aims at developing a European integrated system for the clinical management of antiretroviral drug resistance.
  - Integrates biomedical information from three large genotype-response correlation databases
    * ARCA - Italy
    * Arevir - Germany
    * Karolinska - Sweden
  - Develops an array of engines for effective prediction of the response to treatment
    * Case-based reasoning
    * Machine learning (Bayesian Networks, Support Vector Machines)
    * Graph theory
    * Evolution model
    * Fuzzy logic
  - Combines the engines into a predictive system publicly available on the web

EuResist at a glance

[Diagram: the ARCA, Arevir and Karolinska source databases pass through data integration & cleansing into the EuResist database (DB2 V9), which is used for training the prediction system.]


Data Integration – Virtual vs. Physical

• Database Federation (Virtual)
  - Provides users with a virtual data mart, without moving any data
  - Not recommended for heavy-duty queries that need access to the lowest level of detail
• Data Consolidation (Physical)
  - Requires periodically moving data to a dedicated data mart
  - Provides high availability, high performance & a high level of data quality
• EuResist project
  - The integrated database is a consolidated database
  - We have created an HIV-specific data mart, based on the prediction engines' requirements


Data Integration – EuResist Schema

[Figure: EuResist schema diagrams (three slides)]

Data Integration – Availability

• The EuResist database, running on a DB2 v9 server, is available to all partners
  - Releases:
    * 18 May 2006
    * 20 June 2006
    * 4 September 2006
    * 3 October 2006
    * 31 October 2006
    * 7 December 2006
    * 21 January 2007
    * ?? April 2007
  - Thoroughly documented (schema, mappings, cleansing activities, how to access the DB, etc.)


Data Integration – January 21 Release

                                     ARCA   Arevir   Karolinska   EuResist
Patients                             8651      953         4207      13811
Therapies                           25222     4787        14211      44220
Therapy Compounds                   73498    14221        22482     110201
CD4 Isolates                       100948    24742        85035     210725
Viral Load Isolates                 77236    22171        58718     158125
Raw Sequences                       13784     1274         1184      16242
Protease Sequences                  12545     1274         1054      14873
Protease Mutations                 109544    11827         8823     130194
Reverse Transcriptase Sequences     12609     1273          841      14723
Reverse Transcriptase Mutations    288032    28161        27186     343379
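
The EuResist column in the January 21 release figures appears to be the sum of the three source columns. A quick way to confirm this is sketched below; the dictionary layout and the `mismatches` helper are illustrative assumptions, while the numbers are copied from the slide.

```python
# Counts per source (ARCA, Arevir, Karolinska) followed by the reported
# EuResist total, copied from the January 21 release figures.
release = {
    "Patients": (8651, 953, 4207, 13811),
    "Therapies": (25222, 4787, 14211, 44220),
    "Therapy Compounds": (73498, 14221, 22482, 110201),
    "CD4 Isolates": (100948, 24742, 85035, 210725),
    "Viral Load Isolates": (77236, 22171, 58718, 158125),
    "Raw Sequences": (13784, 1274, 1184, 16242),
    "Protease Sequences": (12545, 1274, 1054, 14873),
    "Protease Mutations": (109544, 11827, 8823, 130194),
    "Reverse Transcriptase Sequences": (12609, 1273, 841, 14723),
    "Reverse Transcriptase Mutations": (288032, 28161, 27186, 343379),
}


def mismatches(table):
    """Return rows whose reported total differs from the sum of the sources."""
    return {
        name: (sum(row[:3]), row[3])
        for name, row in table.items()
        if sum(row[:3]) != row[3]
    }


print(mismatches(release))  # an empty dict means every total is consistent
```

A check of this kind can be rerun after each release to catch rows lost or duplicated during consolidation.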


Data Privacy

• The data in EuResist is fully anonymised
  - patientID is just a serial number
  - originalID is just a reference to the originating record in the data source (Arevir, ARCA or Karolinska)
• The data in all data sources (Arevir, ARCA, Karolinska) is fully anonymised
• It is impossible to trace a patient back from the database content
• No privacy issues have arisen, and the project obtained approval from the ethical committees involved
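
The ID scheme described above can be illustrated with a small sketch. It uses Python's built-in `sqlite3` as a stand-in for the project's DB2 server; the `patient` table, the `source` column and the sample identifiers are hypothetical, and only the `patientID` and `originalID` names come from the slide.

```python
import sqlite3


def consolidate(conn, source_name, source_ids):
    """Insert source records; the database assigns the serial patientID."""
    conn.executemany(
        "INSERT INTO patient (source, originalID) VALUES (?, ?)",
        [(source_name, original_id) for original_id in source_ids],
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient ("
    " patientID INTEGER PRIMARY KEY AUTOINCREMENT,"  # serial number only
    " source TEXT NOT NULL,"                         # hypothetical column
    " originalID TEXT NOT NULL,"                     # key in the source DB
    " UNIQUE (source, originalID))"
)

# Made-up, already anonymised source identifiers.
consolidate(conn, "ARCA", ["a-17", "a-42"])
consolidate(conn, "Arevir", ["17"])

rows = conn.execute(
    "SELECT patientID, source, originalID FROM patient ORDER BY patientID"
).fetchall()
print(rows)
```

Because `patientID` carries no information and `originalID` points only at an already anonymised source record, the consolidated table adds no new identifying data.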


Mapping Challenges

• Straightforward mappings (e.g., date of birth)
• Complex mappings (e.g., Karolinska therapies)
• Misleading mappings (e.g., Arevir sequence date)
• Mappings that require data analysis
  - Data sources extract mutations by aligning their raw sequences
    * Different wild types (HXB2, consensusB)
    * Different alignment algorithms
    * Different alignment parameters
  - EuResist aligns all raw sequences
    * With the chosen wild type (consensusB)
    * Using a chosen alignment software (MyLap)
    * Based on a configurable set of parameters
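
To see why mutation lists from different sources are not directly comparable, consider the minimal sketch below. It is not the project's alignment pipeline: it assumes the sample is already aligned to the reference without gaps, and the sequences are short made-up strings rather than real HXB2 or consensus B sequences.

```python
def extract_mutations(reference, aligned):
    """List substitutions as '<ref><position><observed>', 1-based positions.

    Assumes `aligned` is already aligned to `reference` with no gaps.
    """
    if len(reference) != len(aligned):
        raise ValueError("sequences must have equal length after alignment")
    return [
        f"{ref}{pos}{obs}"
        for pos, (ref, obs) in enumerate(zip(reference, aligned), start=1)
        if ref != obs
    ]


# Toy amino-acid strings standing in for two different wild types.
reference_a = "ACDEFGHIKL"
reference_b = "ACDQFGHIKL"
sample = "ACDQFGHIRL"

print(extract_mutations(reference_a, sample))  # ['E4Q', 'K9R']
print(extract_mutations(reference_b, sample))  # ['K9R']
```

The same sample yields two different mutation lists depending on the reference, which is why re-aligning all raw sequences against one chosen wild type gives comparable results.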


Data Cleansing

• An integral part of the data integration & feeding process
• The quality of the data is crucial for the training and validation of the prediction engines
• Eliminate errors, bring consistency, report problems
  - Data could be "good", "bad" or "suspicious"
• Requires domain knowledge
• An ongoing process
• The set of rules and checks defined is also useful for checking the input provided by the user to the predictive system
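
A rule-based check that sorts records into "good", "bad" and "suspicious" might look like the sketch below. The `Therapy` record, the individual rules and the `MAX_PLAUSIBLE_DAYS` threshold are all hypothetical illustrations, not the project's actual cleansing rules.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical threshold: therapies longer than this are flagged for review.
MAX_PLAUSIBLE_DAYS = 15 * 365


@dataclass
class Therapy:
    start: date
    stop: Optional[date]  # None means the therapy is still ongoing


def classify_therapy(therapy, today):
    """Return 'good', 'bad' or 'suspicious' for one therapy record."""
    if therapy.start > today:
        return "bad"  # starts in the future: certainly an error
    if therapy.stop is not None and therapy.stop < therapy.start:
        return "bad"  # ends before it starts: certainly an error
    if therapy.stop is not None:
        if (therapy.stop - therapy.start).days > MAX_PLAUSIBLE_DAYS:
            return "suspicious"  # possible but unlikely: report it
    return "good"


today = date(2007, 1, 21)
print(classify_therapy(Therapy(date(2005, 3, 1), date(2005, 9, 1)), today))
print(classify_therapy(Therapy(date(2005, 9, 1), date(2005, 3, 1)), today))
print(classify_therapy(Therapy(date(1985, 1, 1), date(2006, 1, 1)), today))
```

Keeping each rule as a plain function of the record means the same checks can be applied both to source data during loading and to user input submitted to the predictive system.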


Alignment with Standards

• Health Level Seven (HL7) is a standard, domain-specific, common protocol for the exchange of health care information.
• HL7 v3 consists of two main elements:
  - a messaging standard
  - an information model, the RIM (Reference Information Model)
• EuResist defines an HIV-specific data mart which is derived from the generic HL7 RIM
• By mapping the EuResist schema to HL7, we provide a standard interface for sharing data.
  - Provide mapping from HL7 RIM to EuResist
  - Generate HL7 messages from EuResist data
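
As a rough illustration of what mapping a EuResist record onto the RIM involves, the sketch below turns one viral load row into a simplified observation object. It is not a schema-valid HL7 v3 message and not the project's mapping: the `Observation` class is a stripped-down stand-in for a RIM Act, the row field names other than `patientID` are assumptions, and `VIRAL_LOAD` is a placeholder rather than a real terminology code. The `OBS` class code and `EVN` mood code follow HL7 v3 conventions for an observation that has occurred.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """Simplified stand-in for a RIM Act of class Observation."""
    class_code: str      # RIM Act.classCode
    mood_code: str       # RIM Act.moodCode
    code: str            # what was measured (placeholder code here)
    value: float
    unit: str
    effective_time: str  # HL7-style timestamp, YYYYMMDD
    subject_id: int      # EuResist patientID


def viral_load_to_observation(row):
    """Map one viral load row (hypothetical field names) to an Observation."""
    return Observation(
        class_code="OBS",
        mood_code="EVN",
        code="VIRAL_LOAD",  # placeholder, not a real terminology code
        value=float(row["copies_per_ml"]),
        unit="copies/mL",
        effective_time=row["sample_date"].strftime("%Y%m%d"),
        subject_id=row["patientID"],
    )


obs = viral_load_to_observation(
    {"patientID": 1, "sample_date": date(2006, 5, 18), "copies_per_ml": 50000}
)
print(obs)
```

A real implementation would serialise such objects into HL7 v3 XML and validate them against the published schemas.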


Alignment with Standards

[Diagram: HL7 v3 message creation, linking the EuResist integrated database, an HL7 RIM repository (HL7 V3.0 certified) and HL7 3.0 messages]


Current Collaborations

• Stanford University (http://hidb.stanford.edu) provides a public HIV drug resistance database
• Stanford is creating a large dataset of treatment change episodes (TCEs) for analytic studies
  - treatments, sequences, viral load levels, CD4 counts
• Stanford and EuResist have different schemas for the same data
  - We can convert from one schema to the other
  - We can share data via standards (HL7)


Thank You

Gracias (Spanish) • תודה / Toda (Hebrew) • Obrigado (Brazilian Portuguese) • Danke (German) • Grazie (Italian) • Merci (French) • Tack (Swedish) • Köszönöm (Hungarian)
