Population-Based Linkage of Big Data in Dental Research
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Environmental Research and Public Health Review Population-Based Linkage of Big Data in Dental Research Tim Joda 1,*, Tuomas Waltimo 2, Christiane Pauli-Magnus 3, Nicole Probst-Hensch 4 and Nicola U. Zitzmann 1 1 Department of Reconstructive Dentistry, University Center for Dental Medicine Basel, 4056 Basel, Switzerland; [email protected] 2 Department of Oral Health & Medicine Dentistry, University Center for Dental Medicine Basel, 4056 Basel, Switzerland; [email protected] 3 Department of Clinical Research & Clinical Trial Unit, Faculty of Medicine, University of Basel, 4031 Basel, Switzerland; [email protected] 4 Department of Epidemiology & Public Health, Swiss Tropical & Public Health Institute Basel, University of Basel, 4051 Basel, Switzerland; [email protected] * Correspondence: [email protected]; Tel.: +4161-267-2631 Received: 19 October 2018; Accepted: 23 October 2018; Published: 25 October 2018 Abstract: Population-based linkage of patient-level information opens new strategies for dental research to identify unknown correlations of diseases, prognostic factors, novel treatment concepts and evaluate healthcare systems. As clinical trials have become more complex and inefficient, register-based controlled (clinical) trials (RC(C)T) are a promising approach in dental research. RC(C)Ts provide comprehensive information on hard-to-reach populations, allow observations with minimal loss to follow-up, but require large sample sizes with generating high level of external validity. Collecting data is only valuable if this is done systematically according to harmonized and inter-linkable standards involving a universally accepted general patient consent. Secure data anonymization is crucial, but potential re-identification of individuals poses several challenges. Population-based linkage of big data is a game changer for epidemiological surveys in Public Health and will play a predominant role in future dental research by influencing healthcare services, research, education, biotechnology, insurance, social policy and governmental affairs. Keywords: big data; patient-generated health data (PGHD); register-based controlled (clinical) trials [RC(C)T]; epidemiological research; public health 1. Introduction Big data and its impact on society is an omnipresent topic affecting all spheres of our lives. We are now more globally connected than ever, and society continuously advances into data-driven infrastructures and organizations. The ‘digital revolution’ has the potential to enrich society by informing social policy decisions, but might also cause harm and put sensitive information at risk of abuse. How to best harness and use big data for social good remains a core focus within the biomedical community [1]. Population-based linkage of patient-level information is a powerful tool in health economics. Health data can be gathered from routine care process and other expanded sources including data on the social determinants of health, such as material circumstances (e.g., housing, transport, food security/safety, environmental degradation). Big data collaborations involve a diverse range of stakeholders with different analytical, technical and political capabilities (Figure1). The linkage of individual patient data obtained from different sources opens new strategies for research. Establishing large population-based patient cohorts may help identify unknown correlations Int. J. Environ. Res. Public Health 2018, 15, 2357; doi:10.3390/ijerph15112357 www.mdpi.com/journal/ijerph Int. J. Environ. Res. Public Health 2018, 15, 2357 2 of 5 Int. J. Environ. Res. Public Health 2018, 15, x 2 of 5 ofof symptoms and and diseases, diseases, novel novel prognostic prognostic factors, factors, new new treatment treatment concepts, concepts, new new revenue revenue sources andand facilitate facilitate the the evaluation evaluation of of the the healthcare healthcare system system [2]. [2 ].The The linkage linkage of ofpatient-level patient-level information information to population-basedto population-based citizen citizen cohorts cohorts and and biobanks biobanks provides provides the the required required reference reference of of diagnostic diagnostic and and screeningscreening cutoffs cutoffs that could detect new biomarkers through personalized health research [[3].3]. FigureFigure 1. 1. HealthHealth data data sources, sources, associated associated stakeholders stakeholders and and capabilities. capabilities. 2. Data Collection and Processing 2. Data Collection and Processing AsAs conventional methodsmethods for for collecting collecting data data such such as (prospective)as (prospective) clinical clinical trials trials have have become become more morecomplex complex due to highdue costs,to high time costs, expenditures time expendit and difficultures and patient difficult recruitment, patient linked recruitment, individual-level linked individual-leveldata is a valuable data and is a promising valuable and additional promising tool additional in healthcare tool relatedin healthcare science related [4]. The science strengths [4]. The of strengthsregister-based of register-based controlled (clinical) controlled trials (clinical) (RC(C)T) trials is that (RC(C)T) these are is well-characterized,that these are well-characterized, particularly for particularlymedical and for dental medical research, and dental provide research, comprehensive provide informationcomprehensive on hard-to-reachinformation on populations hard-to-reach and populationsallow observations and allow with observations minimal loss with to follow-up,minimal loss but to require follow-up, large but sample require sizes. large This sample type of sizes. trial Thisgenerates type of evidence trial generates with a highevidence level with of external a high validitylevel of [external5]. validity [5]. CollectingCollecting data data is is only only valuable valuable if this if this is done is done systematically systematically according according to harmonized to harmonized and inter- and linkableinter-linkable data datastandards standards to produce to produce high-quality high-quality data. data. This This will will avoid avoid the the garbage-in-garbage-out garbage-in-garbage-out issueissue that hashas afflictedafflicted big big data’s data’s reputation reputation in in the the past. past. The The ubiquitous ubiquitous generation generation and and storage storage of data of datainevitably inevitably leads leads to huge to accumulationhuge accumulation of information. of information. Efficient Efficient data management data management not only not involves only involvesthe assessment the assessment and storage and of information,storage of information, but must also but ensure must that also the ensure data can that be the accessed data quicklycan be accessedutilizing quickly user-friendly utilizing software user-frien for fastdly software and easy for filtering fast and [6]. easy However, filtering the [6]. best However, data processing the best data and processinganalysis algorithms and analysis are algorithms only as good are as only the as reliability good as andthe reliability validity of and the validity original of data. the original Despite data. best Despiteefforts, inadequaciesbest efforts, inadequacies in data quality in in data general quality do occur.in general A crucial do occur. problem A crucial is missing problem and incompleteis missing anddata, incomplete which can occurdata, which when datacan isoccur not capturedwhen data in ais standardizednot captured manner in a stan ordardized not recorded manner for specificor not recordedpopulation for cohorts specific [ 7population]. cohorts [7]. DigitalDigital information information that that is is not not yet being used is known as ‘dark data’. Do Do we we as biomedical researchersresearchers havehave the the obligation obligation to exploitto exploit the wealththe wealth of dark of data, dark which data, may which contain may vital contain information vital informationthat could lead that to could future lead achievements to future achievements such as cures such for diseases? as cures However,for diseases? these However, data may these be of data low mayquality, be of records low quality, may be records incomplete, may be processed incomplete, incorrectly, processed or storedincorrectly, in file or formats stored or in onfile devices formats that or on devices that have become obsolete over time. Furthermore, by the time the data can be cleaned Int. J. Environ. Res. Public Health 2018, 15, 2357 3 of 5 have become obsolete over time. Furthermore, by the time the data can be cleaned and processed it may be too old to be useful and would distract us from other discovery research, given that we live in a resource constrained environment. 3. Protection of Human Rights The use of linked biomedical data to support register-based research poses a unique set of methodological challenges, such as disclosing sensitive information about individuals whose consent cannot be easily obtained in retrospect. Data anonymization is a type of information sanitization whose primary intent is privacy protection. It is the process of either encoding or removing personally identifiable information from datasets. Anonymization methods include encryption, hashing, generalization, pseudonymization and perturbation. De-anonymization is the reverse engineering process used to detect the source data. The most common technique of de-anonymization is cross-referencing data from multiple