Biogrid Australia (Formerly Biogrid)

BioGrid Australia (formerly BioGrid) Record Linkage The Vision – remove the ‘silos’ Population data Hospital data Research Research Disease Sub‐specialty /research data Project Project Gene Expression 2 1 Protein Expression Genotypes Public Domain Data A generic informatics platform which provides opportunities for collaboration across organisations and expansion to other research areas. Why link databases? • Record linkage for Clinical purposes: – Instant sharing of information across treatment centres (hospitals, GP’s, etc) • Research power: – Increase the sample size – Increase the potential for research collaborations • Main Issues: – Lack of a common identifier at national or even state level (unlike many countries) • Key Question for Linkage: – What level of accuracy is acceptable for each type of linkage? BioGrid ‐ Privacy and Ethics: Required approval from all sites to use their identifying information for linkage purposes . This includes the information being copied to a central location so it can be “matched” against patients from other sites Required a legal opinion that we could use the last 5 digits of the Medicare number as part of the linkage algorithm. There is a provision in the Medicare Act that the whole number cannot be used for identification purposes (a legacy of the Australia Card?) Data available in de‐identified form only, and therefore only suitable for research Health data always kept separate from identifying data Institute-specific data loaded Authorised researchers into institute-specific query the Federated Data Local Research Repository nightly Repository for analysis. Metadata Repository Hospital #A ETL Internet • Multiple LRR • Source • databases VPN De-identified Federated Linked data SAS eg Gastro Queries Data Analysis Integrator Reports Hospital # B VPN ETL GenBank Unique LRR UniProt • Multiple Subject LocusLink Identifier • Source PubMed • databases Public Data Sources The BioGrid Model Institute-specific data loaded Authorised researchers into institute-specific query the Federated Data Local Research Repository nightly Repository for analysis. Metadata Repository Hospital # A ETL Internet Multiple LRR Source VPN De-identified Databases SAS Federated Linked data Queries Data Statistical Nightly upload of data Integrator analysis To computer at the site Reports Hospital # B VPN ETL GenBank LRR Unique Multiple UniProt Subject LocusLink Source Indentifier PubMed Databases Public Data Sources The BioGrid Model Institute-specific data loaded Authorised researchers into institute-specific query the Federated Data Local Research Repository nightly Repository for analysis. Metadata Repository Hospital # A ETL Internet Multiple LRR Source VPN De-identified Databases SAS Federated Linked data Queries Data Statistical No identified health data Integrator analysis leaves the site Reports Hospital # B VPN ETL GenBank LRR Unique Multiple UniProt Subject LocusLink Source Indentifier PubMed Databases Public Data Sources The BioGrid Model Institute-specific data loaded Authorised researchers into institute-specific query the Federated Data Local Research Repository nightly Repository for analysis. Metadata Repository Hospital # A ETL Internet Multiple LRR Source VPN De-identified Databases SAS Federated Linked data Queries Data Statistical No health data stored analysis centrallyIntegrator Reports Hospital # B VPN ETL GenBank LRR Unique Multiple UniProt Subject LocusLink Source Indentifier PubMed Databases Public Data Sources The BioGrid Model Institute-specific data loaded Authorised researchers into institute-specific query the Federated Data Local Research Repository nightly Repository for analysis. Metadata Repository Hospital # A ETL Internet Multiple LRR Source VPN De-identified Databases SAS Federated Linked data Queries Data Statistical Individual records linked analysis Integrator Reports • 6 id fields sent along secure lines to secure computer Hospital # B • Secure computerVPN generates the linkage key • LinkageETL key sent to site’s computer GenBank LRR Unique UniProt Multiple • Linkage key store is ‘encrypted’ LocusLink • Health data only availableSubject with key, not with Source Indentifier PubMed Databases identifiers Public Data Sources The BioGrid Model Institute-specific data loaded Authorised researchers into institute-specific query the Federated Data Local Research Repository nightly Repository for analysis. Metadata RepositoryRESEARCH Hospital # A -Authorised access only ETL -De-identified data onlyInternet sent Multiple LRR -No health data stored Source VPN centrallyDe-identified Databases SAS Federated Linked data Queries Data Statistical analysis Integrator Reports Hospital # B VPN ETL GenBank LRR Unique Multiple UniProt Subject LocusLink Source Indentifier PubMed Databases Public Data Sources The BioGrid Model RCH/ MCRI South Aust Bendigo •Respiratory Authorised researchers and •Oncology •Diabetes RAH applications query the Federated •Crohns Q Elizabeth Data Repository for analysis. Flinders MC Hume Monash MC •Oncology •Oncology •Oncology •Cystic Fibrosis •Tissue Bank Gippsland •Cystic Fibrosis Tasmania •Oncology Box Hill RHH Business •Diabetes •Oncology Grampians Glossary •Oncology •Oncology St Vincents •Neuroscience ACT Barwon •Diabetes Canberra •Oncology Internet •Oncology •Oncology RWH Peninsula NSW •Oncology •Oncology St Vincents •Diabeties •Diabetes POW Federated de-identified Queries PeterMac •Oncology Cabrini Analysis •Tissue Bank Data data •Oncology Oncology VPN - each site Reports •Tissues Bank Integrator Queensland •PET images Epworth Brisbane •Oncology Alfred •Oncology •Cystic Fibrosis •Epilepsy •Neuroscience DHS •Oncology •VAED etc GenBank Austin UniProt •Oncology AIHW @ MH Linkage LocusLink •Tissues Bank • Death + cause Keys •Diabetes PubMed Melbourne + Data cached on Public Data Sources Western computers at the •Oncology •Tissues Bank sites - owned and •Diabetes BioGrid Australia •Neuroscience controlled by site. •MRI Images Health through information Scope - Clinical Datasets Cancer: Neuroscience: • Colorectal • Epilepsy •Brain BioGrid •MS • Breast •Stroke • Lung •Neuropsychiatry •Sarcoma D e • Gynaecology - id e Diabetes • Prostate n ti fie d • Type 1 • Head & Neck d a • Type 2 • Upper GI ta Other • Melanoma • Gestational • Renal • Crohns • Prostate • Cystic Fibrosis •Rare • Bone Density • Well Women’s • MRI and PET Researcher Types of Linkage: • Probabilistic • Hashing (exact) Probabilistic: – Requires the two sets of identifiers to be brought together for comparison – weights are assigned to each identifier – Often includes options for soundex, transposition detection, exclusion of dummy patients (“Babe 1”) – Final result is based on total of weighted matched fields, minus unmatched fields – User must define a “threshold value” that determines a match Linkage using eIndex ETL Data Upload LRR BioGrid Integrator Clinical Data e.g. BP (FDI) Source Identifiers (Name, DOB, Sex, Source DaSourtabaseces partial Medicare number) Databases Databases USI ETL Data Upload LRR eIndex Clinical Data e.g. BP Identifiers Source Identifiers (Name, DOB, Sex, Source (Name Dob, etc) DaSourtabaseces partial Medicare number) USI Databases Databases USI Operation of eIndex (1) • Patient matches in BioGrid are determined by using these basic demographic fields: – Surname – Given name – Middle name / initial – Date of Birth – Gender – Digits 5 to 9 of the Medicare Number (Note that Medicare legislation does not allow the use of the full Medicare Number for identification purposes. However this partial use has been approved by the BioGrid Ethics Committee and legal advice.) Operation of eIndex (2) • The matching process includes logic to handle many common types of errors, e.g. – Names, dates and numbers which differ in only one position or in the transposition of two adjacent characters are given a weight for a possible match – Names are also checked for “soundex” matches and given a positive weight if this is the case – “Dummy” names are ignored. These include “UNKNOWN” and commonly used hospital names such as “TWIN 1”, “TRIPLET 1”, “BABE” and so on Operation of eIndex (3) Hashing (“exact”): – A non-reversible hash is generated at each site and these are brought together for comparison – Therefore does not require the 2 sets of identifiers to be brought together. This can be very important if the ethics or legal requirements preclude identifiers leaving both sites. – In a simple hash system, the match must be exact – all or nothing. This makes them almost useless in real-life situations Linkage using GRHANITE™ (1) ETL Data Upload LRR with Hashes BioGrid Integrator (FDI) G Clinical Data e.g. BP Source Hashes (encrypted identity) Source DaSourtabaseces Databases Databases USI ETL Data Upload LRR with Hashes G GRHANITE Clinical Data e.g. BP Hashes Source Hashes (encrypted identity) Source (encrypted USI DaSourtabaseces Databases identifiers Databases USI Linkage using GRHANITE™ (2) Some data has additionalThis form encryption of encryption because it has isbeen deemed sensitive. non-reversibleDate of Birth has been and extracted heavily here, but can only be decrypted for research use if a specific study requiresprotected age-related against information dictionary attack Linkage using GRHANITE™ (3) ¾ “Hashing” creates a very long character string based on the incoming identifiers. The BioGrid- Grhanite system uses the SHA-256 algorithm, which cannot be reversed to reveal the original identifiers. ¾ The hash values are spread across the possible output values in a pseudo-random

Biogrid Australia (Formerly Biogrid)

Uniprot at EMBL-EBI's Role in CTTV

The ELIXIR Core Data Resources: Fundamental Infrastructure for The

Webnetcoffee

The Biogrid Interaction Database

PINOT: an Intuitive Resource for Integrating Protein-Protein Interactions James E

The Uniprot Knowledgebase BLAST

Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases

Functionfinderspresnotes.Pdf

Protein Function Prediction in Uniprot with the Comparison of Structural Domain Arrangements

Uniprot.Ws: R Interface to Uniprot Web Services

Automated Data Integration for Developmental Biological Research

Three-Dimensional Structures of Carbohydrates and Where to Find Them