The EuResist Database
Yardena Peres, Carmel Kent
IT for Healthcare & Life Sciences Group, IBM Labs in Haifa
© 2006 IBM Corporation IBM Labs in Haifa
Outline
- Background: EuResist project
- Data Integration
- Virtual vs. Physical Data Integration
- EuResist Schema
- Availability
- Data Privacy
- Mapping Challenges
- Data Cleansing
- Alignment with Standards
- Current Collaborations
Background - EuResist project
- Funded by EU FP6
- Multidisciplinary consortium:
  - Informa S.r.l., Rome, Italy
  - University of Siena, Italy
  - University of Roma TRE, Italy
  - Karolinska University Hospital, Sweden
  - University of Köln, Germany
  - Max Planck Institute, Germany
  - KFKI Research Institute (RMKI), Hungary
  - Kingston University, UK
  - EFPIA (European Federation of Pharmaceutical Industries and Associations)
  - IBM Haifa Research Labs, Israel
Background – EuResist Goals
- Aims to develop a European integrated system for the clinical management of antiretroviral drug resistance
- Integrates biomedical information from three large genotype-response correlation databases:
  - ARCA (Italy)
  - Arevir (Germany)
  - Karolinska (Sweden)
- Develops an array of engines for effective prediction of the response to treatment:
  - Case-based reasoning
  - Machine learning (Bayesian networks, support vector machines)
  - Graph theory
  - Evolution model
  - Fuzzy logic
- Combines the engines into a predictive system that is publicly available on the web
EuResist at a glance
[Diagram: the ARCA, Arevir and Karolinska databases feed a Data Integration & Cleansing step that populates the EuResist database (DB2 V9), which is used for training the Prediction System.]
Data Integration – Virtual vs. Physical
- Database Federation (Virtual)
  - Provides users with a virtual data mart, without moving any data
  - Not recommended for heavy-duty queries that need access to the lowest level of detail
- Data Consolidation (Physical)
  - Requires periodically moving data to a dedicated data mart
  - Provides high availability, high performance and a high level of data quality
- EuResist project
  - The integrated database is a consolidated database
  - We have created an HIV-specific data mart, based on the requirements of the prediction engines
Data Integration – EuResist Schema
[Schema diagrams (three slides)]
Data Integration – Availability
- The EuResist database, running on a DB2 v9 server, is available to all partners
- Releases:
  - 18 May 2006
  - 20 June 2006
  - 4 September 2006
  - 3 October 2006
  - 31 October 2006
  - 7 December 2006
  - 21 January 2007
  - ?? April 2007
- Thoroughly documented (schema, mappings, cleansing activities, how to access the DB, etc.)
Data Integration – January 21 Release
                                    ARCA   Arevir  Karolinska  EuResist
Patients                            8651      953        4207     13811
Therapies                          25222     4787       14211     44220
Therapy Compounds                  73498    14221       22482    110201
CD4 Isolates                      100948    24742       85035    210725
Viral Load Isolates                77236    22171       58718    158125
Raw Sequences                      13784     1274        1184     16242
Protease Sequences                 12545     1274        1054     14873
Protease Mutations                109544    11827        8823    130194
Reverse Transcriptase Sequences    12609     1273         841     14723
Reverse Transcriptase Mutations   288032    28161       27186    343379
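In the table above, each EuResist count should equal the sum of the three source counts. A minimal sketch of that consistency check follows; the dictionary layout and the function name `inconsistent_rows` are illustrative assumptions, and the numbers are transcribed from the table.

```python
# Counts transcribed from the January 21 release table, in the order
# (ARCA, Arevir, Karolinska, EuResist total).
RELEASE_COUNTS = {
    "Patients": (8651, 953, 4207, 13811),
    "Therapies": (25222, 4787, 14211, 44220),
    "Therapy Compounds": (73498, 14221, 22482, 110201),
    "CD4 Isolates": (100948, 24742, 85035, 210725),
    "Viral Load Isolates": (77236, 22171, 58718, 158125),
    "Raw Sequences": (13784, 1274, 1184, 16242),
    "Protease Sequences": (12545, 1274, 1054, 14873),
    "Protease Mutations": (109544, 11827, 8823, 130194),
    "Reverse Transcriptase Sequences": (12609, 1273, 841, 14723),
    "Reverse Transcriptase Mutations": (288032, 28161, 27186, 343379),
}


def inconsistent_rows(counts):
    """Return the names of rows whose source counts do not sum to the total."""
    return [
        name
        for name, (*sources, total) in counts.items()
        if sum(sources) != total
    ]


print(inconsistent_rows(RELEASE_COUNTS))  # prints []
```

An empty list means every integrated total matches its sources, which is the case for this release.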
Data Privacy
- The data in EuResist is fully anonymised
  - patientID is just a serial number
  - originalID is just a reference to the originating record in the data source (Arevir, ARCA or Karolinska)
- The data in all data sources (Arevir, ARCA, Karolinska) is fully anonymised
- It is impossible to trace a patient back from the database content
- No privacy issues have arisen, and the project obtained approval from the ethical committees involved
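The identifier scheme described above can be sketched as follows. This is only an illustration, not the project's loading code: the `source` field, the function name and the example identifiers are hypothetical, and only the `patientID` and `originalID` names come from the slide.

```python
from itertools import count


def consolidate_patients(sources):
    """Give every source record a serial patientID and keep its originalID.

    `sources` maps a data source name to the (already anonymised)
    record identifiers used inside that source.
    """
    serial = count(1)  # patientID is just a running number
    rows = []
    for source_name, original_ids in sources.items():
        for original_id in original_ids:
            rows.append(
                {
                    "patientID": next(serial),
                    "source": source_name,      # hypothetical column
                    "originalID": original_id,  # reference back to the source record
                }
            )
    return rows


# Hypothetical identifiers, for illustration only.
demo = consolidate_patients(
    {"ARCA": ["a-17", "a-42"], "Arevir": ["x9"], "Karolinska": ["k3"]}
)
for row in demo:
    print(row)
```

Because the serial number carries no information and the source identifiers are themselves anonymised, nothing in such a row identifies a person.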
Mapping Challenges
- Straightforward mappings (e.g., date of birth)
- Complex mappings (e.g., Karolinska therapies)
- Misleading mappings (e.g., Arevir sequence date)
- Mappings that require data analysis
- Data sources extract mutations by aligning their raw sequences, using:
  - Different wild types (HXB2, consensusB)
  - Different alignment algorithms
  - Different alignment parameters
- EuResist aligns all raw sequences:
  - With the chosen wild type (consensusB)
  - Using a chosen alignment software (MyLap)
  - Based on a configurable set of parameters
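To illustrate why the choice of wild type matters, here is a minimal sketch of mutation extraction once a sequence has already been aligned to a reference. It does not reproduce the alignment software named above: the function name, the gap symbol and the toy sequences are assumptions, and the reference is not a real wild type.

```python
def extract_mutations(reference, aligned_query, gap="-"):
    """List substitutions of an aligned query relative to a reference.

    Both sequences are assumed to be amino-acid strings that were already
    aligned to the same length, with no gaps in the reference, so that
    string positions map directly to reference positions (1-based).
    """
    if len(reference) != len(aligned_query):
        raise ValueError("sequences must be aligned to the same length")
    mutations = []
    for position, (ref_aa, query_aa) in enumerate(
        zip(reference, aligned_query), start=1
    ):
        if query_aa == gap or query_aa == ref_aa:
            continue  # position not covered, or identical to the reference
        mutations.append(f"{ref_aa}{position}{query_aa}")
    return mutations


# Toy sequences, not real HIV references.
reference = "ACDEFGHIKL"
query = "ACDQFGH-KV"
print(extract_mutations(reference, query))  # prints ['E4Q', 'L10V']
```

The same query compared against a different reference would generally yield a different mutation list, which is why aligning every raw sequence against one wild type with one parameter set makes the mutations comparable across sources.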
Data Cleansing
- An integral part of the data integration and feeding process
- The quality of the data is crucial for the training and validation of the prediction engines
- Eliminate errors, bring consistency, report problems
- Data can be "good", "bad" or "suspicious"
- Requires domain knowledge
- An ongoing process
- The rules and checks defined are also useful for checking the input that users provide to the predictive system
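The three-way classification above could be organised as a list of rule functions, with the most severe verdict winning. The sketch below is an assumption about structure only: the field names, the two rules and the CD4 threshold are hypothetical and are not the project's actual cleansing rules.

```python
from datetime import date

# Ordering used to pick the most severe verdict across all rules.
SEVERITY = {"good": 0, "suspicious": 1, "bad": 2}


def check_cd4(record):
    """Hypothetical rule: negative or missing counts are bad, huge ones suspicious."""
    cd4 = record.get("cd4")
    if cd4 is None or cd4 < 0:
        return "bad"
    if cd4 > 2000:  # illustrative threshold, not a project value
        return "suspicious"
    return "good"


def check_therapy_dates(record):
    """Hypothetical rule: a therapy cannot stop before it starts."""
    start, stop = record.get("start"), record.get("stop")
    if start is None:
        return "bad"
    if stop is not None and stop < start:
        return "bad"
    return "good"


RULES = [check_cd4, check_therapy_dates]


def classify(record, rules=RULES):
    """Return the most severe verdict produced by any rule."""
    return max((rule(record) for rule in rules), key=SEVERITY.get)


record = {"cd4": 5000, "start": date(2005, 1, 1), "stop": date(2005, 6, 1)}
print(classify(record))  # prints suspicious
```

Keeping each rule as a small independent function means the same list can be run both during data loading and against input submitted to the predictive system.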
Alignment with Standards
- Health Level Seven (HL7) is a standard, domain-specific, common protocol for the exchange of health care information
- HL7 v3 consists of two main elements:
  - A messaging standard
  - An information model, the RIM (Reference Information Model)
- EuResist defines an HIV-specific data mart that is derived from the generic HL7 RIM
- By mapping the EuResist schema to HL7, we provide a standard interface for sharing data:
  - Provide a mapping from the HL7 RIM to EuResist
  - Generate HL7 messages from EuResist data
Alignment with Standards
[Diagram: HL7 v3 message creation. Labels: HL7 3.0 Msg, HL7 V3.0 Certified, HL7 RIM, EuResist Repository, Integrated database.]
Current Collaborations
- Stanford University (http://hidb.stanford.edu) provides a public HIV drug resistance database
- Stanford is creating a large dataset of TCEs (treatment change episodes) for analytic studies
  - Treatments, sequences, virus load levels, CD4 counts
- Stanford and EuResist have different schemas for the same data
  - We can convert from one schema to the other
  - We can share data via standards (HL7)
Thank You
Thank You (English), Gracias (Spanish), Obrigado (Brazilian Portuguese), Danke (German), Grazie (Italian), Merci (French), תודה / Toda (Hebrew), Tack (Swedish), Köszönöm (Hungarian)