The EuResist Database
Yardena Peres, Carmel Kent
IT for Healthcare & Life Sciences Group
IBM Labs in Haifa
© 2006 IBM Corporation

Outline
- Background
  - EuResist project
- Data Integration
  - Virtual vs. Physical Data Integration
  - EuResist Schema
  - Availability
- Data privacy
- Mapping challenges
- Data Cleansing
- Alignment with Standards
- Current Collaborations

Background – EuResist project
- Funded by EU FP6
- Multidisciplinary consortium
  - Informa S.r.l., Rome, Italy
  - University of Siena, Italy
  - University of Roma TRE, Italy
  - Karolinska University Hospital, Sweden
  - University of Köln, Germany
  - Max Planck Institute, Germany
  - KFKI Research Institute (RMKI), Hungary
  - Kingston University, UK
  - EFPIA (European Federation of Pharmaceutical Industries and Associations)
  - IBM Haifa Research Labs, Israel

Background – EuResist Goals
- Aims to develop a European integrated system for the clinical management of antiretroviral drug resistance
- Integrates biomedical information from three large genotype-response correlation databases
  - ARCA (Italy)
  - Arevir (Germany)
  - Karolinska (Sweden)
- Develops an array of engines for effective prediction of the response to treatment
  - Case-based reasoning
  - Machine learning (Bayesian Networks, Support Vector Machines)
  - Graph theory
  - Evolution model
  - Fuzzy logic
- Combines the engines into a predictive system publicly available on the web

EuResist at a glance
[Diagram, labels from top to bottom: Prediction System; Training; EuResist database (DB2 V9); Data Integration & Cleansing; ARCA, Arevir, Karolinska]

Data Integration – Virtual vs. Physical
- Database Federation (Virtual)
  - Provides users with a virtual data mart, without moving any data
  - Not recommended for heavy-duty queries that need access to the lowest level of detail
- Data Consolidation (Physical)
  - Requires periodic moving of data to a dedicated data mart
  - Provides high availability, high performance and a high level of data quality
- EuResist project
  - The integrated database is a consolidated database
  - We have created an HIV-specific data mart, based on the prediction engines' requirements

Data Integration – EuResist Schema
[Schema diagrams, three slides; figures not preserved]

Data Integration – Availability
- The EuResist database, running on a DB2 v9 server, is available to all partners
- Releases:
  - 18 May 2006
  - 20 June 2006
  - 4 September 2006
  - 3 October 2006
  - 31 October 2006
  - 7 December 2006
  - 21 January 2007
  - ?? April 2007
- Thoroughly documented (schema, mappings, cleansing activities, how to access the DB, etc.)
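The physical consolidation approach described above can be illustrated with a minimal sketch. It is not the project's actual loader: an in-memory SQLite database stands in for the DB2 data mart, and the table name, column names and sample rows are all hypothetical. The only idea taken from the slides is that records from the three sources are copied into one dedicated database rather than queried in place.

```python
import sqlite3

# Hypothetical source extracts: lists of (original record id, birth year).
# Neither the ids nor the values come from the real databases.
SOURCES = {
    "ARCA": [("A-001", 1960), ("A-002", 1975)],
    "Arevir": [("17", 1982)],
    "Karolinska": [("K9", 1968)],
}


def consolidate(sources):
    """Copy every source record into one physical table; return the connection."""
    conn = sqlite3.connect(":memory:")  # stand-in for the dedicated data mart
    conn.execute(
        "CREATE TABLE patients ("
        " patient_id INTEGER PRIMARY KEY AUTOINCREMENT,"  # new serial number
        " source TEXT NOT NULL,"
        " original_id TEXT NOT NULL,"  # reference back to the source record
        " birth_year INTEGER,"
        " UNIQUE (source, original_id))"
    )
    for source, rows in sources.items():
        conn.executemany(
            "INSERT INTO patients (source, original_id, birth_year)"
            " VALUES (?, ?, ?)",
            [(source, original_id, year) for original_id, year in rows],
        )
    conn.commit()
    return conn


conn = consolidate(SOURCES)
counts = dict(conn.execute("SELECT source, COUNT(*) FROM patients GROUP BY source"))
total = conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0]
print(counts, total)
```

Once the data has been moved, per-source and overall counts come from a single local query, which is the kind of heavy-duty access that the federation approach is not recommended for.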
Data Integration – January 21 Release

                                    ARCA  Arevir  Karolinska  EuResist
Patients                            8651     953        4207     13811
Therapies                          25222    4787       14211     44220
Therapy Compounds                  73498   14221       22482    110201
CD4 Isolates                      100948   24742       85035    210725
Viral Load Isolates                77236   22171       58718    158125
Raw Sequences                      13784    1274        1184     16242
Protease Sequences                 12545    1274        1054     14873
Protease Mutations                109544   11827        8823    130194
Reverse Transcriptase Sequences    12609    1273         841     14723
Reverse Transcriptase Mutations   288032   28161       27186    343379

Data Privacy
- The data in EuResist is fully anonymised
  - patientID is just a serial number
  - originalID is just a reference to the originating record in the data source (Arevir, ARCA or Karolinska)
- The data in all data sources (Arevir, ARCA, Karolinska) is fully anonymised
- It is impossible to trace a patient back from the database content
- No privacy issues have arisen, and the project obtained approval from the ethical committees involved

Mapping Challenges
- Straightforward mappings (e.g., date of birth)
- Complex mappings (e.g., Karolinska therapies)
- Misleading mappings (e.g., Arevir sequence date)
- Mappings that require data analysis
  - Data sources extract mutations by aligning their raw sequences, using
    - Different wild types (HXB2, consensusB)
    - Different alignment algorithms
    - Different alignment parameters
  - EuResist aligns all raw sequences
    - With the chosen wild type (consensusB)
    - Using a chosen alignment software (MyLap)
    - Based on a configurable set of parameters

Data Cleansing
- An integral part of the data integration and feeding process
- The quality of the data is crucial for the training and validation of the prediction engines
- Eliminate errors, bring consistency, report problems
- Data can be "good", "bad" or "suspicious"
- Requires domain knowledge
- An ongoing process
- The rules and checks defined are also useful for checking the input provided by the user to the predictive system

Alignment with Standards
- Health Level Seven (HL7) is a standard, domain-specific, common protocol for the exchange of health care information
- HL7 v3 consists of two main elements:
  - A messaging standard
  - An information model, the RIM (Reference Information Model)
- EuResist defines an HIV-specific data mart which is derived from the generic HL7 RIM
- By mapping the EuResist schema to HL7, we provide a standard interface for sharing data
  - Provide a mapping from the HL7 RIM to EuResist
  - Generate HL7 messages from EuResist data

Alignment with Standards
[Diagram labels: HL7 3.0 Msg; HL7 V3.0 Certified HL7 RIM Repository; EuResist Integrated database; HL7 3.0 Msg; HL7 v3 Message creation]

Current Collaborations
- Stanford University (http://hidb.stanford.edu) provides a public HIV drug resistance database
- Stanford is creating a large dataset of TCEs (treatment change episodes) for analytic studies
  - Treatments, sequences, virus load levels, CD4 counts
- Stanford and EuResist have different schemas for the same data
  - We can convert from one schema to the other
  - We can share data via standards (HL7)

Thank You
- English: Thank You
- Hebrew: תודה (Toda)
- Spanish: Gracias
- Brazilian Portuguese: Obrigado
- German: Danke
- Italian: Grazie
- French: Merci
- Swedish: Tack
- Hungarian: Köszönöm
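Note on Data Cleansing: the "good", "bad" or "suspicious" classification can be sketched as a small rule function. This is an illustration only. The slides do not list the project's actual rules, so the three checks below, the 3650-day threshold and the sample records are hypothetical stand-ins for rules that would need domain knowledge to define.

```python
from datetime import date

# Hypothetical plausibility threshold; not taken from the project.
MAX_PLAUSIBLE_DAYS = 3650


def classify_therapy(start, stop):
    """Return (status, reason) for one therapy record.

    status is 'good', 'bad' or 'suspicious'; reason is what would be
    reported back to the data provider (empty for good records).
    """
    if start is None:
        return "bad", "missing start date"
    if stop is not None and stop < start:
        return "bad", "stop date precedes start date"
    if stop is not None and (stop - start).days > MAX_PLAUSIBLE_DAYS:
        return "suspicious", "unusually long therapy"
    return "good", ""


# Invented sample records: (start date, stop date or None if ongoing).
records = [
    (date(2004, 1, 1), date(2005, 1, 1)),
    (date(2005, 6, 1), date(2005, 1, 1)),
    (None, date(2005, 1, 1)),
    (date(1990, 1, 1), date(2006, 1, 1)),
    (date(2006, 3, 1), None),
]

report = [classify_therapy(start, stop) for start, stop in records]
for (start, stop), (status, reason) in zip(records, report):
    print(start, stop, status, reason)
```

Because the checks live in one function, the same function could be applied both to records arriving from the data sources and to input typed in by a user of the predictive system, which is the reuse the Data Cleansing slide points to.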