SMALL CELL RISK ASSESSMENT

IQEDfoot-project

PURPOSE: SMALL CELL RISK ANALYSIS FOR THE IQEDFOOT PROJECT

SMALL CELL RISK ASSESSMENT FORM FOR DATA USE WITHIN THE HEALTHDATA.BE PLATFORM V0.0

Project title

It is important to ensure that the title of the study is clear, easy to understand and accurately reflects the main purpose/focus of the project.

Initiative for Quality improvement and Epidemiology in multidisciplinary Diabetic Foot Clinics (IQED-Foot)

Applicant

Institution Sciensano

Address Juliette Wytsmanstraat 14, 1050 Brussel

Principal investigator (contact details) Kris Doggen, [email protected], 02 642 57 22

Disclosure risk assessor

Institution P-95

Address Koning Leopold III laan 1, 3001 Heverlee

Assessor (contact details) Kaatje Bollaerts: [email protected]; Margarita Riera: [email protected]; Maria Alexandridou: [email protected]

Signature

Applicant: Name: Date: Signature:

Assessor: Name: Date: Signature:

I. DESCRIPTION OF THE DATA USE (filled by applicant)

* Should be aligned with the authorization request

1. Data use scenario

Data use scenario:
- New data collection: SECTION A
- Changes to existing data collection: SECTION B
- Re-use of existing data: SECTION C
- Publication of private and/or public reports: SECTION D

SECTION A: New data collection

A.1. Motivation of the data request (Max 500 words)

Provide the necessary background information and key references, providing evidence that the applicants know the relevant scientific literature. Clearly describe the reasons for data collection (mention legal obligations if any), the research questions and their relevance for policy making and science. Provide a concise overview of the objectives, methods and data analysis for the proposed research.

The Initiative for Quality improvement and Epidemiology in the multidisciplinary diabetic foot clinics (IQEDfoot) project is commissioned by the National Institute for Health and Disability Insurance (NIHDI) to study the quality of care for adult diabetic patients, in particular those suffering from a diabetic foot and treated in specialized clinics. Since 2005, Sciensano (formerly the Scientific Institute for Public Health) has been collecting and analyzing the data and managing the database, and distributes a center-specific report to each center and a general report to NIHDI.

A.2. Objective(s)

Describe the objectives ordered from most to least important in sufficient detail, allowing assessment as to whether the data collection and intended analyses meet the objectives.

The objective of the study is the realization of an audit of the quality of the care provided to diabetic patients and the promotion of quality improvement among medical care providers in diabetic foot clinics, leading to accreditation of the clinic.

A.3. Target population

Describe the reference or target population, its key features and size.

In Belgium, there is no systematic registration of the diagnosis of diabetes. However, an estimate of the prevalence of diabetes can be made, based on data from health insurance (anti-diabetic drug consumption) and the Belgian Health Survey. The prevalence of diabetes in Belgium is estimated at 8.0% of the adult Belgian population (both known and unknown diabetes). The prevalence of diabetic foot problems among people with diabetes is unknown. The total population of patients treated at one of the 35 diabetic foot clinics in Belgium is estimated to be 2,200.

A.4. Population intended to be covered by the new data collection

Describe the population that will be covered by the data collection and its intended size. Include some consideration of whether the sample-size/power will be sufficient to meet the scientific objectives of the project.

The study includes patients who have new onset of diabetes-related foot ulcers (minimum Wagner grade 2) or who suffer from a neurogenic arthropathy (Charcot foot) and are treated in a foot clinic which has a multidisciplinary team and sees at least 52 different patients per year.

A.5. Study design

Describe design characteristics that might be important for the small cell risk analysis.

This is a prospective study design. The first data collection took place in 2005 and is expected to be repeated every 2-3 years. There have been 5 audits in total, each covering 18 months of data collection. For every audit, each participating clinic provides data from the first 52 patients with a new diabetes-related foot ulcer or a new Charcot foot. Every patient must be followed up for a period of 6 months, with a maximum of 7 months.

A.6. Variables

Give an overview of the key variables (or groups) of variables of the study

Information on the following groups of variables is collected as part of the IQEDfoot study:
- Hospital and care provider identifier
- Socio-demographics
- Type of diabetes foot problems
- Treatment
- Outcomes of the patient and foot problems

A.7. Data/statistical analyses planned

An overview of the data management and data analysis to be performed should be covered in this section. Applicants should ensure that analytical methods proposed are consonant with the objectives of the project and the data collected.

Two types of analyses are performed:
1) Repeated cross-sectional analyses (RCS). In this approach, each audit is considered a study in itself. For each audit, we calculate indicators and aggregate them at the level of centers and at the national level. Within each audit we study between-center variability. Across audits, we study the evolution of the national data and of the between-center variability.
2) Longitudinal patient-level analyses. Since 2005, patients are identified by a pseudonymized national registry number, making it possible to follow patients over time, even though not every patient is sampled during each audit. We do longitudinal analyses to describe the natural history of the disease, the evolution of risk factors and the incidence of complications.

A.8. Plans for disseminating and communicating study results, including target audience

Describe the way the results will be disseminated and the intended target audience.

Results are disseminated in three ways:
1) Individualized feedback reports with anonymous benchmarking to centers. After each audit, centers receive an individualized feedback report that allows them to compare the patient characteristics, treatment and outcomes to those of their peers and to the national average.
2) National report. After each audit, national and anonymized center-level data are presented in a public report intended primarily for Riziv-Inami. The goal of this report is to communicate the important findings and trends observed in the data.
3) Scientific publications. Based on our own interests and topics suggested by the Group of Experts, we analyse specific aspects.

A.9. Codebook

The assessor should have access to the codebook, listing all variables that will be collected, the variable name, short description, variable type (binary, categorical, continuous) and possible values (in case of a categorical variable) or value range (in case of a continuous variable).

The codebook can be found at Codelist_IQEDfoot.xlsx (Annex)


II. SMALL CELL RISK ASSESSMENT (filled by assessor)

1. Identify direct identifiers, indirect identifiers and sensitive information

Complete the codebook by indicating whether variables are direct identifiers, indirect identifiers or contain sensitive information.

The data provided for the small cell risk analysis (SCRA) came from the fifth IQEDfoot data collection, which ran from January 2016 until July 2017. During this data collection, a total of 1,889 diabetic patients with foot problems were sampled out of the estimated 2,200 conventional patients, resulting in a sampling fraction of 86%. A patient’s follow-up time was at most 7 months.

For the SCRA, we use the patient’s identifier, the indirect identifiers, the sensitive variables and the date of last consultation. The date of last consultation is used to keep the latest record of the patient for the small cell risk analysis. A list of the indirect identifiers and sensitive variables is provided in Codelist_IQEDfoot.xlsx file.

If there were multiple records for a patient, we used the last record. If the last record had missing information in an important variable, but the information was present in a previous record, the information was carried over to the last record.
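The record selection described above can be sketched as follows. This is a minimal illustration, not the assessor's actual procedure: the field names `patient_id` and `last_consult`, the use of `None` for missing values, and the choice to carry over every field (rather than only the important variables) are assumptions.

```python
from datetime import date

def consolidate(records, id_key="patient_id", date_key="last_consult"):
    """Keep one record per patient: the latest one, with missing values
    filled from the most recent earlier record that has them."""
    by_patient = {}
    # Process oldest first so that later non-missing values overwrite earlier ones.
    for rec in sorted(records, key=lambda r: r[date_key]):
        merged = by_patient.setdefault(rec[id_key], {})
        for field, value in rec.items():
            if value is not None:
                merged[field] = value
            else:
                merged.setdefault(field, None)  # stays missing unless seen earlier
    return list(by_patient.values())

# Hypothetical toy records (values are illustrative only).
records = [
    {"patient_id": "A", "last_consult": date(2016, 3, 1), "CD_PAT_SEX": "M", "CD_STAND": "Y"},
    {"patient_id": "A", "last_consult": date(2016, 9, 1), "CD_PAT_SEX": None, "CD_STAND": "N"},
    {"patient_id": "B", "last_consult": date(2017, 1, 5), "CD_PAT_SEX": "F", "CD_STAND": None},
]
latest = {r["patient_id"]: r for r in consolidate(records)}
```

In this toy input, patient A keeps the values of the later record, except for sex, which is missing there and is taken from the earlier record.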

File name: Codelist_IQEDfoot.xlsx

Classification of variables: Marga Riera, MD

2. Disclosure risk assessment based on direct identifiers

Identify the direct identifiers. These are variables that unambiguously identify units of observation (e.g. names, addresses, phone numbers, social insurance numbers).

There are no direct identifiers in the data.

3. Disclosure risk assessment based on indirect identifiers

Assess the disclosure risk based on indirect identifiers. These are variables that, in combination with other indirect identifiers, can be used to disclose the identity of individuals or institutions.

Given that sample uniques (i.e. patients with a unique pattern of indirect identifiers) are more likely to be identified, one way to assess disclosure risk is to calculate the number of subjects in the sample having the same distinct pattern of indirect identifiers. This approach is called k-anonymity. It is typically required that each pattern of indirect identifiers has at least 3 sample records (k = 3) to ensure confidentiality. The variables classified as indirect identifiers are summarized in Table 1.

Table 1. Indirect identifiers

Indirect identifier | Description of indirect identifier
CD_RIZIV_TREAT_PHYS | RIZIV code of the treating physician
CD_DATA_PROV_VAL | RIZIV code of the data provider
CD_ACT_CHARCOT_IMMO_OTH | Immobilisation of the leg up to the knee by other means, e.g. diabetic walker, but not including crutches, bed rest, wheelchair
CD_ACT_CHARCOT_IMMO_TCC | Immobilisation of the leg up to the knee by total contact cast (TCC)
CD_IMMOB_FT_CASK | Immobilisation of the foot by a cast, e.g. scotch cast boot
CD_IMMOB_FT_SHOE | Immobilisation of the foot by a shoe, e.g. Barouk shoe
CD_IMMOB_OTH | Immobilisation of foot and lower leg by other means (so no TCC), e.g. diabetic walker; does not include crutches, bed rest, wheelchair
CD_IMMOB_TOT_TCC | Immobilisation by total contact cast (TCC)
CD_IMMOB_ULC | Immobilisation around the ulcer, e.g. felt, silicone orthosis
CD_MAJOR_AMP | Has the patient ever received a major amputation of the lower limbs?
CD_ORTHOSURG_MAJOR_AMP | Has the patient a major amputation?
CD_PAT_PLC_RESDC | Place of residence - Postcode
CD_PAT_SEX | Sex
CD_STAND | Is the patient able to stand up without help?
DT_ORTHOSURG_MAJOR_AMP | The date at which the major amputation was performed
DT_PAT_DOD | Date of death
NR_PAT_BIRTHM | Birth month
NR_PAT_BIRTHY | Birth year
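To make the k-anonymity check concrete, the sketch below counts the records whose pattern of indirect identifiers is shared by fewer than k records, which is how a "violation" is read here. The function name `count_violations`, the field names `birth_year` and `postcode`, and the toy records are illustrative assumptions, not the assessor's code or data.

```python
from collections import Counter

def count_violations(records, identifiers, k):
    """Number of records whose pattern of indirect identifiers
    occurs fewer than k times in the data."""
    patterns = [tuple(rec[v] for v in identifiers) for rec in records]
    sizes = Counter(patterns)
    return sum(1 for p in patterns if sizes[p] < k)

# Hypothetical toy data: two men share a pattern, the two women are unique.
toy = [
    {"CD_PAT_SEX": "M", "birth_year": 1950, "postcode": "1050"},
    {"CD_PAT_SEX": "M", "birth_year": 1950, "postcode": "1050"},
    {"CD_PAT_SEX": "F", "birth_year": 1950, "postcode": "1050"},
    {"CD_PAT_SEX": "F", "birth_year": 1962, "postcode": "3001"},
]
ids = ["CD_PAT_SEX", "birth_year", "postcode"]
violations_k2 = count_violations(toy, ids, k=2)  # only the sample uniques
violations_k3 = count_violations(toy, ids, k=3)  # every record, no group has 3 members
```

With k = 2, a violation is exactly a sample unique; raising k can only increase the number of violations.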

K-anonymity

There are 1,889 patients in our data out of an estimated population size of 2,200 patients.

Prior to the analysis, we transformed the month (NR_PAT_BIRTHM) and year of birth (NR_PAT_BIRTHY) into one variable (TRNSFM_dob). All variables containing information on immobilisation (CD_ACT_CHARCOT_IMMO_OTH, CD_ACT_CHARCOT_IMMO_TCC, CD_IMMOB_FT_CASK, CD_IMMOB_FT_SHOE, CD_IMMOB_OTH, CD_IMMOB_TOT_TCC, CD_IMMOB_ULC) have been transformed into one variable (TRNSFM_immob), and all variables containing information on major amputation (CD_MAJOR_AMP and CD_ORTHOSURG_MAJOR_AMP) have been transformed into one variable (TRNSFM_major_amp).
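One plausible way to build these combined variables is sketched below. The combination rule (a combined flag is "Y" if any of its source variables is "Y"), the "Y"/"N" coding and the "YYYY-MM" format for TRNSFM_dob are assumptions made for illustration; the source does not state how the variables were combined.

```python
IMMOB_VARS = [
    "CD_ACT_CHARCOT_IMMO_OTH", "CD_ACT_CHARCOT_IMMO_TCC", "CD_IMMOB_FT_CASK",
    "CD_IMMOB_FT_SHOE", "CD_IMMOB_OTH", "CD_IMMOB_TOT_TCC", "CD_IMMOB_ULC",
]
MAJOR_AMP_VARS = ["CD_MAJOR_AMP", "CD_ORTHOSURG_MAJOR_AMP"]

def combine_any(record, variables, positive="Y"):
    """Return 'Y' if any of the listed variables is positive, else 'N' (assumed rule)."""
    return "Y" if any(record.get(v) == positive for v in variables) else "N"

def add_derived(record):
    """Return a copy of the record with the three combined variables added."""
    out = dict(record)
    # Birth year and month merged into a single 'YYYY-MM' value (assumed format).
    out["TRNSFM_dob"] = f'{record["NR_PAT_BIRTHY"]:04d}-{record["NR_PAT_BIRTHM"]:02d}'
    out["TRNSFM_immob"] = combine_any(record, IMMOB_VARS)
    out["TRNSFM_major_amp"] = combine_any(record, MAJOR_AMP_VARS)
    return out

# Hypothetical patient record.
patient = {
    "NR_PAT_BIRTHY": 1950, "NR_PAT_BIRTHM": 3,
    "CD_IMMOB_FT_SHOE": "Y", "CD_MAJOR_AMP": "N", "CD_ORTHOSURG_MAJOR_AMP": "N",
}
derived = add_derived(patient)
```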

We checked 2-anonymity in our sample and population using the indirect identifiers listed in Table 1. If an indirect identifier has been used to create a new variable, the new variable has been used instead. The risk of identity disclosure is high, with 99.8% of the sampled patients violating 2-anonymity, meaning that almost every patient has a unique pattern of indirect identifiers in the data set (Table 2).

Table 2. K-anonymity based on the indirect identifiers: RIZIV code of the treating physician, combined immobilization (Y/N), RIZIV code of the data provider, combined major amputation (Y/N), postcode, sex, stand (Y/N), date of major amputation, date of death, combined month and year of birth (set 1).

K-anonymity | Nr violations (sample) | % violations (sample)
2 | 1885 | 99.8
3 | 1889 | 100

To find guidance on how best to improve the k-anonymity, we recalculated the number of 2-anonymity violations using a leave-one-out procedure, excluding one variable at a time (Table 3). The variable whose exclusion yielded the largest reduction in violations is date of birth (27.6%), followed by postcode (2.4%).
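The leave-one-out procedure can be sketched as below: the violation count is recomputed once per excluded identifier and expressed as a percentage reduction relative to the full set. The helper names and toy records are illustrative assumptions.

```python
from collections import Counter

def count_violations(records, identifiers, k=2):
    """Number of records whose identifier pattern occurs fewer than k times."""
    patterns = [tuple(rec[v] for v in identifiers) for rec in records]
    sizes = Counter(patterns)
    return sum(1 for p in patterns if sizes[p] < k)

def leave_one_out(records, identifiers, k=2):
    """For each identifier, recount violations without it and report the
    percentage reduction relative to the count with all identifiers."""
    baseline = count_violations(records, identifiers, k)
    rows = {}
    for left_out in identifiers:
        kept = [v for v in identifiers if v != left_out]
        n = count_violations(records, kept, k)
        reduction = 100.0 * (baseline - n) / baseline if baseline else 0.0
        rows[left_out] = (n, round(reduction, 1))
    return baseline, rows

# Hypothetical toy data in which every full pattern is unique.
toy = [
    {"sex": "M", "dob": "1950-01", "postcode": "1050"},
    {"sex": "M", "dob": "1950-02", "postcode": "1050"},
    {"sex": "F", "dob": "1961-07", "postcode": "3001"},
    {"sex": "F", "dob": "1961-07", "postcode": "3000"},
]
baseline, rows = leave_one_out(toy, ["sex", "dob", "postcode"])
```

The same percentage formula reproduces the reported figures: (1885 - 1364) / 1885 gives 27.6% for date of birth and (1885 - 1839) / 1885 gives 2.4% for postcode.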

Table 3. 2-anonymity based on the leave-one-out procedure applied to the indirect identifiers: RIZIV code of the treating physician, combined immobilization (Y/N), RIZIV code of the data provider, combined major amputation (Y/N), postcode, sex, stand (Y/N), date of major amputation, date of death, combined month and year of birth (set 1).

Variables | Nr violations | % reduction in (sample) violations
All | 1885 | .
Excl. treating physician | 1883 | 0.1
Excl. immobilization (Y/N) | 1885 | 0
Excl. data provider | 1885 | 0
Excl. comb. major amputation | 1885 | 0
Excl. postcode | 1839 | 2.4
Excl. sex | 1879 | 0.3
Excl. stand | 1885 | 0
Excl. date of major amputation | 1885 | 0
Excl. date of death | 1885 | 0
Excl. date of birth | 1364 | 27.6

Excl. = excluding; Y/N = yes/no

As a first step to reduce the percentage of patients violating 2-anonymity, we changed date of birth to year of birth, leading to a limited reduction, from 99.8% to 97.9%. In the next step, we recoded postcode to 2-level NIS code, again leading to a limited reduction, from 97.9% to 90.5%. Masking the current treating physician, so that it is no longer an indirect identifier, further reduced the percentage to 88.0%. Next, the NIS code was transformed into regions, leading to a reduction to 80.2%. Finally, we transformed year of birth into 5-year age groups (15-19y, 20-24y, etc.), reducing the percentage to 48.0%. These steps and the corresponding changes in k-anonymity violations are summarized in Table 4.

Table 4. Changes in k-anonymity for the different data manipulation steps (% violations in the sample, by k).

Variables [change from previous set] | k = 2 | k = 3 | k = 4
Set 1: treating physician, immobilization (Y/N), data provider, major amputation (Y/N), postcode, sex, stand (Y/N), date of major amputation, date of death, date of birth | 99.8 | 100 | -
Set 2: treating physician, immobilization (Y/N), data provider, major amputation (Y/N), postcode, sex, stand (Y/N), date of major amputation, date of death, year of birth [date of birth -> year of birth] | 97.9 | 99.8 | 100
Set 3: treating physician, immobilization (Y/N), data provider, major amputation (Y/N), NIS-code, sex, stand (Y/N), date of major amputation, date of death, year of birth [postcode -> 2-level NIS code] | 90.5 | 97.4 | 99.6
Set 4: immobilization (Y/N), data provider, major amputation (Y/N), NIS-code, sex, stand (Y/N), date of major amputation, date of death, year of birth [mask treating physician] | 88.0 | 97.0 | 99.6
Set 5: immobilization (Y/N), data provider, major amputation (Y/N), region, sex, stand (Y/N), date of major amputation, date of death, year of birth [2-level NIS code -> region] | 80.2 | 93.9 | 98.9
Set 6: immobilization (Y/N), data provider, major amputation (Y/N), region, sex, stand (Y/N), date of major amputation, date of death, age group [year of birth -> 5-year age group] | 48.0 | 68.2 | 81.4
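The effect of such generalisation steps can be illustrated with a small sketch that recodes birth year into 5-year age groups and postcode into region, then recounts 2-anonymity violations. The reference year used to derive age, the field names, and the three-entry postcode lookup are assumptions for illustration only; an actual recoding would rely on an official postcode/NIS table.

```python
from collections import Counter

def count_violations(records, identifiers, k=2):
    """Number of records whose identifier pattern occurs fewer than k times."""
    patterns = [tuple(rec[v] for v in identifiers) for rec in records]
    sizes = Counter(patterns)
    return sum(1 for p in patterns if sizes[p] < k)

def age_group(age, width=5):
    """Map an age to a label such as '15-19y' or '20-24y'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}y"

# Illustrative lookup only (hypothetical, far from complete).
REGION_OF_POSTCODE = {"1050": "Brussels", "3000": "Flanders", "3001": "Flanders"}

def generalise(record, reference_year):
    """Replace birth year by age group and postcode by region."""
    out = dict(record)
    out["age_group"] = age_group(reference_year - out.pop("birth_year"))
    out["region"] = REGION_OF_POSTCODE[out.pop("postcode")]
    return out

# Hypothetical toy records.
toy = [
    {"sex": "M", "birth_year": 1950, "postcode": "3000"},
    {"sex": "M", "birth_year": 1952, "postcode": "3001"},
    {"sex": "F", "birth_year": 1960, "postcode": "1050"},
]
before = count_violations(toy, ["sex", "birth_year", "postcode"])
coarse = [generalise(r, reference_year=2017) for r in toy]
after = count_violations(coarse, ["sex", "age_group", "region"])
```

In the toy data the two men fall into the same age group and region after recoding, so only one sample unique remains.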

4. Impact of potential disclosure

Assess the impact of potential disclosure based on the sensitive variables.

A potential identity disclosure is particularly problematic if sensitive information is revealed. If a patient is sample unique, then any sensitive information is revealed upon identity disclosure. Sensitive information from patients that do not violate 2-anonymity might still be disclosed if all patients sharing the same pattern of identifying variables share the same sensitive information. We therefore calculated per distinct pattern of identifying variables, the proportion of patients with sensitive values for each sensitive variable in turn. If this proportion equals 1 or is very high (>0.95), sensitive information might get disclosed for all persons with that combination of identifying variables. We call this statistic M-sensitivity. The variables classified as sensitive variables and their sensitive values are summarized in Table 5.

Table 5. Sensitive variables and sensitive values

Sensitive variable | Description of sensitive variable | Sensitive values
CD_CNTRY | Country of residence | Non-Belgian
CD_KIDNEY_TRSPLT | Has the patient already received a kidney transplant or is he receiving hemodialysis or peritoneal dialysis? | Positive
CD_MINOR_AMP | Has the patient ever received a minor amputation of the lower limbs? | Positive
CD_ORTHOSURG_MINOR_AMP | Has the patient a minor amputation? | Positive
CD_STA_SMOK | Smoking status | Ex-smoker, Smoker

Prior to the analysis, we transformed all variables containing information on minor amputation (CD_MINOR_AMP and CD_ORTHOSURG_MINOR_AMP) into one combined variable. There are 906 unique patients (48%), i.e. patients that violate 2-anonymity. If the researcher knows diabetic patients with a foot ulcer or Charcot foot, the sensitive values of these unique patients should be suppressed. An exemption may be warranted depending on the importance of the sensitive information for research.

Next, we applied the M-sensitivity statistic using all the indirect identifiers of set 6 (Table 4). There are 110 distinct patterns of the identifying variables for which the proportion of patients having a sensitive value is higher than 0.95. The patterns can be broken down as follows:
- For 100 patients, a distinct pattern of indirect identifiers might lead to disclosure of sensitive information regarding minor amputation.
- For 163 patients, a distinct pattern of indirect identifiers might lead to disclosure of sensitive information regarding smoking status.
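A minimal sketch of the M-sensitivity check follows: records are grouped by their pattern of identifying variables, and a pattern is flagged when the share of members with a sensitive value exceeds the threshold. Skipping groups of size one (sample uniques, which are already exposed on identification) is an assumption of this sketch, as are the function name, the field names and the "Never smoked" value.

```python
from collections import defaultdict

def m_sensitivity(records, identifiers, sensitive_var, sensitive_values,
                  threshold=0.95, min_group_size=2):
    """Return {pattern: (group size, share with a sensitive value)} for every
    identifier pattern whose share of sensitive values exceeds the threshold."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[v] for v in identifiers)].append(rec)
    flagged = {}
    for pattern, members in groups.items():
        if len(members) < min_group_size:
            continue  # assumed: sample uniques are handled separately
        share = sum(m[sensitive_var] in sensitive_values for m in members) / len(members)
        if share > threshold:
            flagged[pattern] = (len(members), share)
    return flagged

# Hypothetical toy records.
toy = [
    {"sex": "M", "age_group": "65-69y", "CD_STA_SMOK": "Smoker"},
    {"sex": "M", "age_group": "65-69y", "CD_STA_SMOK": "Ex-smoker"},
    {"sex": "F", "age_group": "65-69y", "CD_STA_SMOK": "Smoker"},
    {"sex": "F", "age_group": "65-69y", "CD_STA_SMOK": "Never smoked"},
    {"sex": "F", "age_group": "70-74y", "CD_STA_SMOK": "Smoker"},
]
flagged = m_sensitivity(toy, ["sex", "age_group"], "CD_STA_SMOK", {"Ex-smoker", "Smoker"})
patients_at_risk = sum(size for size, _ in flagged.values())
```

Here only the pattern (M, 65-69y) is flagged: both of its members have a sensitive smoking status, so learning that someone belongs to that group reveals the status without identifying the individual.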

Based on the impact assessment, we conclude that kidney transplant is well protected. Country of residence has only 5 non-Belgian values. These 5 non-Belgians are all sample uniques. We recommend regrouping them into a single non-Belgian category.

For 100 (5.3%) of the patients, sensitive information regarding minor amputation might be revealed. However, a diabetic person who has a foot ulcer or a Charcot foot has a high chance of having received, or going on to receive, a minor amputation. So, if the minor amputation is disclosed in this registry, where most diabetics have issues with foot ulcers (99.9%), this adds little information beyond what could already be expected.

For 163 (8.6%) of the patients, sensitive information regarding smoking status might be revealed. However, there is a strong research interest in this variable, which justifies keeping it.

Patients with diabetes and a history of cardiovascular disease (CVD) are at an increased risk of subsequent CVD events and require secondary prevention. Smoking cessation is an important (secondary) prevention measure for CVD and, as such, smoking is an important variable to monitor for the IQEDfoot population [1]. Smoking history is also an important covariate for estimating the CVD risk [2].

Finally, there were only 10 non-Belgian patients who are sample uniques. We recommend regrouping the variable on residency into Belgian residency versus others.

[1] Pagidipati NJ, Navar AM, Pieper KS, Green JB, Bethel MA, Armstrong PW, et al. Secondary Prevention of Cardiovascular Disease in Patients With Type 2 Diabetes Mellitus: International Insights From the TECOS Trial (Trial Evaluating Cardiovascular Outcomes With Sitagliptin). Circulation. 2017;136(13):1193-203.
[2] Stevens RJ, Coleman RL, Adler AI, Stratton IM, Matthews DR, Holman RR. Risk factors for myocardial infarction case fatality and stroke case fatality in type 2 diabetes: UKPDS 66. Diabetes Care. 2004;27(1):201-7.

5. Recommended disclosure control strategies

We recommend the following data transformations to minimize the risk of disclosing a patient's identity:
1. recoding the place of residence to a larger geographical scale (region)
2. masking the RIZIV code of the treating physician (creating an anonymous physician id)
3. recoding the date of birth information to 5-year age groups (15-19y, 20-24y, etc.)

After these data transformation steps, the percentage of patients violating 2-anonymity within the sample fell from 99.8% to 48.0% (n = 906 patients). The percentage of patients violating 2-anonymity within the total patient population is expected to be 41.2%, as the sampling fraction is estimated to be 86%.
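The figures above can be reproduced with the short calculation below. Dividing the 906 violating patients by the estimated population size is one reading that matches the reported 41.2%; it is an interpretation, not a formula stated by the assessor, and it implicitly treats the unsampled patients as adding no further violations.

```python
sample_size = 1889        # patients in the fifth data collection
population_size = 2200    # estimated number of patients in the foot clinics
violating = 906           # patients violating 2-anonymity after the transformations

sampling_fraction = sample_size / population_size
pct_sample = 100 * violating / sample_size
pct_population = 100 * violating / population_size

print(f"sampling fraction: {sampling_fraction:.0%}")
print(f"violations in sample: {pct_sample:.1f}%")
print(f"violations relative to population: {pct_population:.1f}%")
```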

To reduce the impact of potential disclosure, we further recommend:
4. regrouping country of residence (CD_CNTRY) into Belgian residency versus others.

Given the nature of the indirect identifying variables, it is very likely that the identity of a patient might be disclosed if the researcher knows a diabetic person with either a foot ulcer or a Charcot foot. However, the probability of the researcher knowing such a person is low given the relatively low incidence of patients with a diabetes-related foot ulcer or Charcot foot. Regarding country of residence, all 5 non-Belgians are sample uniques. We recommend regrouping the variable ‘country of residence’ into Belgium vs others.

In the future, additional data collections will be carried out as part of the IQEDfoot project. As the population to sample from continuously changes over time (only patients with a new diabetes-related foot ulcer or a new Charcot foot are eligible for sampling), our recommendations remain valid for all future data collections and combining data across different audits is allowed.


If new variables are collected that are either sensitive or can be used as indirect identifiers, their impact on the SCRA needs to be assessed. For the 2018 audit, the new information listed below will be collected. However, neither new indirect identifiers nor new sensitive variables are being added. Hence, there is no need to do a new quantitative SCRA and our recommendations remain valid.

- Has the patient ever had a Charcot foot?
- Which joints are affected by the Charcot process?
- Immobilization of the lower limb using a non-removable device?
- Has arterial Doppler test been carried out?
- Has a sharp debridement been carried out?
- At the end of follow-up, were there no more active wounds?
- During follow-up, did the Charcot foot reactivate?

Annex

Codelist_IQEDfoot.xlsx
