MP RJ Vaitinadin – PhD Dissertation

A Study of Cardiometabolic Traits and their Progression, over a Decade, in a Croatian

Island Population

A dissertation submitted to the

Graduate School of the University of Cincinnati

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in the Division of Epidemiology of the Department of Environmental Health of the College of Medicine

by Nataraja Sarma Vaitinadin

M.B.B.S. - Jawaharlal Institute of Post-Graduate Medical Education and Research

M.P.H. - University of Cincinnati

March, 2019

Committee Chair: Ranjan Deka, Ph.D.


Abstract

The growth of computation and big data has allowed us to ask increasingly complex questions about healthcare. One of the key questions is how to use data to predict outcomes. With the alarming rise in cardiometabolic diseases and the associated healthcare costs, it is time to focus energy and resources on developing methodologies to predict future disease status. We have implemented a workflow that takes data from epidemiologic field studies and develops prediction models to determine disease status a decade into the future. The models were tested for accuracy and were used to develop a diagnostic test. The test was validated using an ROC (Receiver Operating Characteristic) curve. Further, the model was cross-validated on untrained data.

The robustness of the approach was demonstrated across four different disease outcomes – high blood pressure, coronary heart disease, diabetes and gout. We further examined changes in biomarker measurements at the population level, over time and with sex. Both these approaches target two important aspects of – precision medicine and . To that extent, the study is a successful demonstration of utilizing field data to develop and test contrasting research questions – at the individual level and at the population level, simultaneously.


Acknowledgements

At the outset, I would like to record my gratitude to the faculty and staff at the Department of Environmental Health for the wonderful experiences during the graduate program. I will always be in debt to Dr Ranjan Deka, my advisor and mentor, and to Dr Aimin Chen, for their kindness, steadfast guidance and financial support throughout the program. Dr Deka and Dr Chen have taught me the art and the science of epidemiologic research – that unique blend of skills where methodologic rigor happily co-exists with inspiring moments of discovery. I would like to thank Dr Marepalli Rao for spending long hours putting up with my ignorance of and for inspiring my interest in the field. I would like to thank my qualifying committee – Dr Aimin Chen, Dr Marepalli Rao and Dr Eric Hall – for their guidance and support in understanding the research process. I am grateful to my dissertation committee – Dr Ranjan Deka (Chair), Dr Aimin Chen, Dr Roman Jandarov and Dr Jane C. Khoury – for all their precious time spent in trying to work through my inadequacies, even while encouraging and supporting me at every turn of this arduous journey. I would also like to thank Dr Mary-Beth Genter, Angela Riall, Amy Itescu and Kathy McCann for their patience and support through the administrative minefield.

I would like to thank all my teachers, friends and colleagues – past and present – for all the experiences that have moulded my life. I am grateful for the steadfast support, sacrifices and understanding of Aparna Sreeshankaren, Mrs and Mr Sreeshankaren, and Chandramouli Shankaren. Manasvini and Anirudh … you inspire me every day! I have no illusions about how insignificant I am in the grand scheme of things … and yet, through all of life’s challenges, there seems to be the invisible hand … guiding … protecting … thank you, Mahaperiyava! How can I forget my late grandparents? Their love and encouragement played no small part in helping me accomplish everything I have done so far.

I will always be in debt to my parents, Mrs (Late) and Mr Vaitinadin.

Mom … I miss you, this one is for you, up there!

Nataraja Sarma Vaitinadin,

March 27, 2019


Table of Contents

Chapter 1: Cardiometabolic Traits Study – Introduction, Hypothesis and Specific Aims
    Introduction to Cardiometabolic Traits and Disease
    Hypothesis and Specific Aims

Chapter 2: Background, Significance and Public Health Importance
    Background
    Significance and Public Health Importance

Chapter 3: Methods
    Description of Study Population
    Laboratory and Other Data Collection Methods
    Cardiometabolic Disease and Trait Characterization
    Statistical Methods and Specific Aims
    Predictive Modeling
    Diagnostic Test and Receiver Operating Characteristic
    Rigor and Transparency

Chapter 4: The High Blood Pressure (HBP) Cohort – Results & Discussion
    Descriptive Statistics
    Exploratory Data Analysis
    Epidemiologic Association Measures
    Predictive Modeling for HBP
    Diagnostic Test for HBP Status
    Repeated Cross Validation

Chapter 5: The Coronary Heart Disease (CHD) Cohort – Results & Discussion
    Descriptive Statistics
    Exploratory Data Analysis
    Epidemiologic Association Measures
    Predictive Modeling for CHD
    Diagnostic Test for CHD Status
    Repeated Cross Validation

Chapter 6: The Diabetes Cohort – Results & Discussion
    Descriptive Statistics
    Exploratory Data Analysis
    Epidemiologic Association Measures
    Predictive Modeling for Diabetes
    Diagnostic Test for Diabetes Status
    Repeated Cross Validation

Chapter 7: The Gout Cohort – Results & Discussion
    Descriptive Statistics
    Exploratory Data Analysis
    Epidemiologic Association Measures
    Predictive Modeling for Gout
    Diagnostic Test for Gout Status
    Repeated Cross Validation

Chapter 8: Study of Population Means of Fibrinogen - Results and Discussion
    Descriptive Statistics and Paired t-Test
    Does time influence the population mean of fibrinogen?
    Is there an interaction between time and sex that affects the population mean of fibrinogen?

Chapter 9: Study of Population Means of Uric Acid - Results and Discussion
    Descriptive Statistics and Paired t-Test
    Does time influence the population mean of uric acid?
    Is there an interaction between time and sex that affects the population mean of uric acid?

Chapter 10: Conclusion
    Major Conclusions
    Strengths and Limitations
    Future Research

References

Appendix

Key Terms


Chapter 1: Cardiometabolic Traits Study - Introduction, Hypothesis and Specific Aims

1.1 Introduction to Cardiometabolic Traits and Disease

Cardiometabolic diseases are complex diseases that arise from complicated interactions between genetic, lifestyle, and environmental factors, which makes these medical conditions difficult to understand, let alone manage (Lander & Schork 1994; Kibertsis & Roberts 2002). The cardiometabolic conditions that we focus on in this study are high blood pressure, coronary heart disease, diabetes and gout.

High blood pressure refers to increased vascular resistance to blood flow. Typically, normal blood pressure is around 120 mm Hg systolic and 80 mm Hg diastolic. A systolic pressure at or above 140 mm Hg, or a diastolic pressure at or above 90 mm Hg, is considered high blood pressure. This can manifest as breathlessness, reduced blood supply to organs, added pressure on the heart, and a higher risk of rupture and bleeding from blood vessels in the brain (Merai, 2016; Mozzafarian, 2015; Heidenreich, 2011; Palar, 2009).

Coronary heart disease refers to the development of atherosclerotic plaques that progressively diminish blood supply to cardiac tissue, impairing cardiac function. This can manifest as reduced exercise tolerance, breathlessness, chest pains and, eventually, a myocardial infarction, also known as a heart attack (ACCF/AHA/ACP/AATS/PCNA/SCAI/STS, 2012; Diamond, 1982).

Diabetes, in our context type 2 diabetes, refers to the inability to maintain normal levels of blood glucose due to insulin resistance, that is, the inability of secreted insulin to act effectively in reducing blood sugar levels (Beckman, 2002). This results in higher levels of insulin, higher levels of blood glucose and, over time, higher HbA1c (glycosylated hemoglobin) levels, a measure of the average blood sugar level over the previous 120 days. This manifests as reduced healing capability, increased hunger, increased urination and increased thirst (Beckman, 2002; Combettes, 2006).

Gout is a metabolic condition characterized by the deposition of monosodium urate crystals in joint spaces and, eventually, even in the kidneys, due to prolonged elevated levels of uric acid in the blood caused by derangements in the mechanisms involved in metabolizing uric acid. This results in severe joint pain, inflammation and kidney stones (Dalbeth, 2016; Emmerson, 1996; Kuo, 2015).

1.2 Hypothesis and Specific Aims

We hypothesize that prior disease status and values of cardiometabolic traits/derived physiological parameters are significantly associated with future disease status and values of cardiometabolic traits/derived physiological parameters. We will employ the following two aims to test the hypothesis.

Specific Aim 1 Evaluate the relationship between prior cardiometabolic traits and the risk of developing cardio-metabolic disease over a ten-year period.

Specific Aim 2 Evaluate the relationship between cardiometabolic traits, their change over time and possible interaction terms.


Chapter 2: Background, Significance and Public Health Importance

2.1 Background

Cardiometabolic disease is a broad term that encompasses disorders of the heart (coronary heart disease), of blood vessels (high blood pressure) and of metabolism (diabetes and gout), among others (Ndisang, 2018; Khan, 2016). Cardiometabolic parameters such as blood sugar, blood pressure, lipid profile, creatinine and uric acid levels have afforded valuable guidance, when followed longitudinally, on the development of disease and mortality. Discrete fasting blood glucose trajectories have been shown to be associated with the development of myocardial infarction in non-diabetics (Jin, 2017), colorectal cancer in postmenopausal women (Kabat, 2012), arterial stiffening in non-diabetics (Wu, 2017), and rotator cuff disease (Lin, 2015).

2.2 Significance and Public Health Importance

Longitudinal follow-up of blood pressure changes has been able to predict mortality in diabetics (Wu, 2016). Longitudinal levels of serum uric acid have been able to predict decline in renal function and the development of chronic kidney disease (Tsai, 2017; Desai, 2018), metabolic syndrome and hypertension (Sun, 2015; Sundstrom, 2004).

In the United States, where most of the population is still Caucasian, the CDC reports that about 370,000 people die of coronary heart disease annually (CDC – heart disease), that 1 in 3 adults have high blood pressure, which kills about 410,000 people per year (CDC – high blood pressure), and that diabetes affects about 30 million people and is the 7th leading cause of death (CDC – diabetes). According to the National Kidney Foundation, gout (diagnosed by detecting uric acid crystals in joints) affects about 8 million adults in the USA, eventually leading to joint and kidney disease (Kidney.org; United States Renal Data System, 2016). High blood pressure alone costs the country about 49 billion dollars annually (CDC – high blood pressure), diabetes costs about 327 billion dollars (CDC – diabetes) and coronary heart disease costs about 200 billion dollars annually (CDC – heart disease). The billions of hours of labor and productivity lost add to society’s burden of caring for an already aging population.

The study population, off the Adriatic coast, is unique for its relative isolation, which confers an almost homogenous genetic and environmental make-up and thereby reduces variability (Deka, 2012; Karns, 2013). Study of the island population continues to provide insights into the natural history of these traits; further studies on the population are highlighted in the following chapter.

The study is highly significant from the following standpoints: 1) it leverages the uniqueness of this population, namely its relative isolation and homogenous genetic and environmental factors, to study cardiometabolic traits; 2) there are currently no studies that have accomplished almost identical pairing of subjects in a ten-year study of cardiometabolic traits within a relatively homogenous population; and 3) earlier studies on the population have yielded valuable genotypic data, and the current study would extend that to unlocking potential new phenotypic information that could help predict future cardiometabolic disease, information that would help with population-level strategies to reduce morbidity and mortality.


Chapter 3: Methods

3.1 Description of Study Population

The proposed study utilizes a longitudinal design to explore changes in cardiometabolic traits/derived physiological parameters and disease status over a period of ten years. The study will analyze data collected from the field over two time points, ten years apart – baseline at 2007-2008 and 2017.

The proposed study is a longitudinal design, of about 550 individuals living off the Croatian coast, in the Adriatic Sea. The map below is from Deka et al, 2012.

The population is unique for its relative isolation, conferring an almost homogenous genetic and environmental make-up – thereby reducing variability. Earlier studies on the population have yielded valuable genotypic data on cardiometabolic disease (Sahay et al, 2015; Missoni et al, 2013; Sahay et al, 2013; Karns et al, 2013; Deka et al, 2012; Karns et al, 2012; Zhang et al, 2012; Karns et al, 2011; Zhang et al, 2011; Zhang et al, 2010).

The study population was the basis for a genetic epidemiologic study concerning an isolated island population located on the eastern Adriatic coast of Croatia (Kolcic et al, 2006; Deka et al, 2012; Rudan I et al, 2003; Pucarin-Cvetkovic et al, 2006).


Croatians are Slavs by descent, and the islanders are mostly Slavs as well. They are thought to have emigrated to these islands during two periods: 1) between the 6th and 8th centuries AD, and 2) between the 15th and 18th centuries (Rudan et al, 1992; Deka et al, 2012). The islanders have remained relatively isolated from the mainland populations due to geography. This has resulted in their genetic homogeneity, along with homogenous dietary, lifestyle and environmental exposure patterns. The research was based on the middle Dalmatian island of Hvar. Hvar is the largest Adriatic island, home to 11,103 people according to the 2001 census (Karns et al, 2013; Deka et al, 2012). Also relevant to the research was the fact that the islanders presented rates of obesity, hypertension, and metabolic syndrome akin to those of populations in the United States (Kolcic et al, 2006; Deka et al, 2012; Rudan I et al, 2003; Pucarin-Cvetkovic et al, 2006).

3.2 Laboratory and Other Data Collection Methods

Subject-level data were collected from 1,300 individuals between the ages of 18 and 80 years during the first round, and they were followed up during 2017. There was no preference for disease status or medication use. Nonprobability sampling methods were employed to gather the study subjects: the study was advertised through village leaders, and only volunteers who met the age criteria could participate (Deka et al, 2012; Karns et al, 2013).

Data collection was completed in two waves during the first round, in 2007 and 2008, followed by the second round in 2017. The data have since been cleaned and prepared for analysis, so there will be no requirement for subject recruitment. There are about 560 and 615 subjects in the curated data from the two rounds, collated for the purpose of this analysis.


Anthropometric measurements such as height (Ht), weight (Wt), waist circumference (WC), and hip circumference (HC) were measured manually. This information was used to compute body mass index (BMI), as (Wt in kg)/(Ht in m)², and waist–hip ratio (WC/HC). As for the cardiometabolic traits, during the first round the second and third of three blood pressure measurements (taken with a mercury sphygmomanometer in a resting position) were used to calculate mean systolic and diastolic blood pressures (SBP and DBP, respectively); in 2017, the mean of all three measurements was used to obtain blood pressure values. Subjects undertook a 12-hour fast before blood samples were drawn; the separated serum was frozen and sent to the laboratory in Zagreb, Croatia. There, biochemical analysis of the samples measured fasting plasma glucose (FPG; enzymatic hexokinase assay CHOD-PAP), glycated hemoglobin (HbA1c; immune-inhibition assay), high-density lipoprotein (HDL; homogeneous enzyme inhibition assay), low-density lipoprotein (LDL; Friedewald calculation), total cholesterol (TC; photometric color test CHOD-PAP), triglycerides (TG; photometric color test GPO-PAP), and uric acid (UA; enzymatic color test) levels. A self-reported survey questionnaire was used to obtain information on diagnoses of stroke, type 2 diabetes, gout, coronary heart disease, and kidney disease, and these were confirmed through a review of medications prescribed, medical charts, and other clinical data. The Institutional Review Board of the University of Cincinnati and the ethics committee of the Institute for Anthropological Research, Zagreb approved the study protocols. Informed written consent was obtained from the participants (Sahay et al, 2015; Missoni et al, 2013; Sahay et al, 2013; Karns et al, 2013; Deka et al, 2012; Karns et al, 2012; Zhang et al, 2012; Karns et al, 2011; Zhang et al, 2011; Zhang et al, 2010).
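The derived measures described above can be illustrated with a short sketch. It is written in Python purely for illustration (the study's analyses were run in R), the function names and the example subject are hypothetical, and the Friedewald function assumes that all lipid values are in mmol/L, which the text does not state.

```python
from statistics import mean


def body_mass_index(weight_kg, height_m):
    """BMI: weight in kilograms divided by the square of height in metres."""
    return weight_kg / height_m ** 2


def waist_hip_ratio(waist, hip):
    """Waist-hip ratio; both circumferences must share the same unit."""
    return waist / hip


def mean_bp_first_round(readings):
    """2007-2008 rule: average the second and third of three readings."""
    if len(readings) != 3:
        raise ValueError("expected exactly three readings")
    return mean(readings[1:])


def mean_bp_second_round(readings):
    """2017 rule: average all three readings."""
    if len(readings) != 3:
        raise ValueError("expected exactly three readings")
    return mean(readings)


def ldl_friedewald(tc, hdl, tg):
    """Friedewald estimate of LDL, assuming all inputs are in mmol/L."""
    return tc - hdl - tg / 2.2


# Hypothetical subject, for illustration only
print(round(body_mass_index(81.0, 1.80), 1))
print(mean_bp_first_round([150, 140, 130]))
print(round(ldl_friedewald(5.0, 1.2, 2.2), 2))
```

Keeping the two blood pressure rules as separate functions makes the difference between the rounds explicit when the two data sets are combined.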


3.3 Cardiometabolic Disease and Trait Characterization

For the purposes of this study, cardio-metabolic diseases are identified as high blood pressure (HBP), coronary heart disease (CHD), diabetes (Diab) and gout (Gout).

Diagnostic criteria for diabetes were fasting plasma glucose (FPG) ≥ 7.0 mmol/L or glycated hemoglobin (HbA1c) ≥ 6.5%. For hypertension (high blood pressure), subjects with SBP ≥ 140 mmHg or DBP ≥ 90 mmHg were considered hypertensive (Deka et al, 2012).
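As a minimal sketch, the two sets of diagnostic criteria can be written as simple threshold rules. The function names are hypothetical, and the sketch looks only at the measured values; it does not model the questionnaire and chart review described in the data collection section.

```python
def meets_diabetes_criteria(fpg_mmol_l, hba1c_percent):
    """True if FPG >= 7.0 mmol/L or HbA1c >= 6.5 %."""
    return fpg_mmol_l >= 7.0 or hba1c_percent >= 6.5


def meets_hypertension_criteria(sbp_mmhg, dbp_mmhg):
    """True if SBP >= 140 mmHg or DBP >= 90 mmHg."""
    return sbp_mmhg >= 140 or dbp_mmhg >= 90


# Hypothetical readings, for illustration only
print(meets_diabetes_criteria(6.2, 6.7))
print(meets_hypertension_criteria(128, 84))
```

Either condition alone is sufficient in both rules, so a subject with a normal FPG but an elevated HbA1c still meets the diabetes criteria.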

Cardio-metabolic trait data on systolic and diastolic blood pressures, lipid profile – high density lipoprotein, low density lipoprotein & total cholesterol, fasting blood glucose levels, glycosylated hemoglobin (HbA1c), creatinine and uric acid levels were collected twice – 2007 and 2017, in addition to information about disease status – yes/no.

Demographic data on age, sex, height, weight and body mass index (BMI) are available for the two time points – 2007 and 2017.

3.4 Statistical Methods and Specific Aims

Specific Aim 1 Evaluate the relationship between the prior cardiometabolic traits and a diagnosis of cardiometabolic disease at the 10-year time point.

Hypothesis We hypothesize that prior cardiometabolic traits are significant predictors of HBP, CHD, diabetes and gout development at the end of 10 years, in 2017, among individuals not diagnosed with the respective diseases in 2007.

Statistical Analyses We will generate summary statistics for all the variables collected in the study. This will include mean, median, standard deviation, range and frequency. We will develop a logistic regression model to predict disease status in 2017 among individuals who were disease negative in 2007. We included age, sex, height, weight and BMI as covariates because they have also been shown to be involved in the pathogenesis of cardiometabolic disease. Male sex and older age have been shown to be associated with a higher prevalence of cardiometabolic disease. Therefore, we performed secondary analyses to examine the interaction of cardiometabolic traits with sex and age. All analyses were carried out using the R-Studio statistical analysis package (Rstudio team, 2015). We stratified the development of disease based on sex, age (above or below 50 years) and co-morbidities.

For the primary analysis, to reduce the risk arising from multiple comparisons, we plan to stratify, if necessary, using the following cardiometabolic traits – systolic blood pressure, diastolic blood pressure, HbA1c and uric acid levels (Althouse, 2016) – and to use the same traits for Bonferroni correction to reduce type 1 error.

The sample size of about 560 individuals is sufficiently powered for multivariate analyses of this kind at a two-sided confidence level of 95% and power of 80%. Missing data will be discarded. For a two-sided confidence level of 95% and power of 80%, hypothetical sample size calculations using an unmatched model, at a 10-20% prevalence rate of illness in the population, indicate that detection of an odds ratio of about 2 would require a sample size of about 560 (Sullivan et al, 2007).
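A calculation of this kind can be sketched as follows. This is one possible reading, not necessarily the calculation that was run: it assumes an unmatched design with equal group sizes, treats the quoted percentage as the proportion exposed in the comparison group, and uses the uncorrected Fleiss formula. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def exposed_proportion_in_cases(p_controls, odds_ratio):
    """Exposure proportion among cases implied by the odds ratio."""
    return odds_ratio * p_controls / (1 + p_controls * (odds_ratio - 1))


def fleiss_cases_needed(p_controls, odds_ratio, alpha=0.05, power=0.80, ratio=1.0):
    """Number of cases needed (uncorrected Fleiss formula).

    ratio is controls per case; the number of controls is ratio * cases.
    """
    p1 = exposed_proportion_in_cases(p_controls, odds_ratio)
    p2 = p_controls
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + ratio * p2) / (1 + ratio)
    q_bar = 1 - p_bar
    numerator = (
        z_alpha * sqrt((ratio + 1) * p_bar * q_bar)
        + z_beta * sqrt(ratio * p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (ratio * (p1 - p2) ** 2))


for p in (0.10, 0.20):
    per_group = fleiss_cases_needed(p, odds_ratio=2.0)
    print(p, per_group, 2 * per_group)
```

Under these assumptions, a 10% proportion gives a figure in the low 280s per group, a total in the mid-500s, which is of the same order as the sample size of about 560 quoted above; a 20% proportion requires fewer subjects.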

Specific Aim 2 Evaluate the relationship between cardiometabolic traits, their change over time and possible interaction terms.

The study collected various cardiometabolic traits for the same individual twice over the ten-year period. We examined change over time, across the population, for the cardiometabolic traits.

The cardiometabolic traits studied were fibrinogen and uric acid levels in the population at the two time points.

Hypothesis We hypothesize that there is a significant difference in the cardiometabolic traits and the derived physiological parameters between 2007 and 2017, and that the difference is related to demographic and lifestyle factors.

Statistical Analyses We generated summary statistics for the variables collected in the study. These include mean, median, standard deviation, range and frequency. We employed the paired t-test, repeated measures ANOVA and linear mixed effects models to study changes in the biomarkers over time; these approaches answer questions similar to those addressed by generalized estimating equations (GEE) (Liang, 1986), namely whether there is a population-level difference between 2007 and 2017. Co-morbidities and older age have been shown to be associated with a higher prevalence of cardiometabolic disease, thereby impacting vascular functions. Therefore, we performed secondary analyses to examine the interaction between time and sex. All analyses were carried out using the R-Studio statistical analysis package (Rstudio team, 2015). We stratified the development of disease based on sex, age (above or below 50 years) and co-morbidities.
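The simplest of these approaches, the paired t-test, can be sketched in a few lines. The sketch is in Python for illustration only, the function name and biomarker values are hypothetical, and it returns the test statistic and degrees of freedom without a p-value, since the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first_round, second_round):
    """t statistic and degrees of freedom for paired measurements.

    Each position in the two lists must refer to the same subject.
    """
    if len(first_round) != len(second_round):
        raise ValueError("paired data must have equal length")
    diffs = [b - a for a, b in zip(first_round, second_round)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical biomarker values for four subjects, for illustration only
t_stat, dof = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 5.0])
print(round(t_stat, 3), dof)
```

The test works on within-subject differences, which is why subjects missing either measurement cannot contribute to it.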

The sample size of about 560 individuals is sufficiently powered for multivariate analyses of this kind at a two-sided confidence level of 95% and power of 80%. Missing data and outliers were discarded, with consideration for the age and morbidity patterns of the population, and this did not affect power. For a two-sided confidence level of 95% and power of 80%, hypothetical sample size calculations using an unmatched model, at a 10-20% prevalence rate of illness in the population, indicate that detection of an odds ratio of about 2 would require a sample size of about 560 (Sullivan et al, 2007).

3.5 Predictive Modeling

The intersection of computation and big data has provided opportunities to ask increasingly complex questions in healthcare. Chief among them are questions about predicting disease sufficiently ahead of time (Bernard, 2017; Solomon, 2015; Karim et al, 2017). At a time when healthcare costs are rising alarmingly, the ability to predict future disease can re-define the care paradigm from treating illness to preventing illness. Beyond changing care priorities, this capability will save lives and money and keep more members of society productive, thereby reducing the burden on those working (Bernard, 2017; Solomon, 2015; Karim, 2017).

3.6 Diagnostic Test and Receiver Operating Characteristic

We used the predictive models to develop a link function, the logit (a linear combination of predictors), which served as the basis of a diagnostic test to predict outcome. We then evaluated the test's performance by generating a receiver operating characteristic (ROC) curve from the sensitivity and specificity values produced by the prediction models (Hoo, 2017; Draijer, 2019; Carter, 2016). The area under the curve provided information on the usefulness of the diagnostic test (Hoo, 2017; Draijer, 2019; Carter, 2016).
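The construction of a ROC curve and its area can be sketched as below. The scores stand in for values of the link function, the labels for observed disease status (1 = disease), and all names and data are hypothetical. The area is obtained with the trapezoidal rule, which is one common choice and an assumption here.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs, one per candidate cut-off.

    A subject tests positive when the score is at or above the cut-off.
    Assumes both outcome classes are present in labels.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under consecutive ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores and outcomes, for illustration only
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(round(area_under_curve(roc_points(scores, labels)), 3))
```

An area of 0.5 corresponds to a test no better than chance, and an area of 1 to a test that separates the two groups perfectly.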

3.7 Rigor and Transparency

The University of Cincinnati Institutional Review Board approved and oversees the entire study. All aspects of personnel training compliance procedures were completed.


Chapter 4: The High Blood Pressure (HBP) Cohort – Results & Discussion

4.1 Descriptive Statistics

Description                  #
Total                        372
Male                         164
Female                       208
Males with HBP in 2017       52
Females with HBP in 2017     58

Table 4.1 Cohort size and sex distribution

Age group in years    Male    Female
20-40                 41      52
40-60                 86      120
60-80                 36      36
Over 80               1       0

Table 4.2 Cohort distribution across sex and age groups

The HBP cohort consisted of individuals who did not have an HBP diagnosis in 2007; by the end of the follow-up in 2017, some of them had developed HBP. There were more females than males in the cohort. The follow-up revealed that 110 of the subjects, 52 males and 58 females, developed HBP by 2017. The bulk of the cohort (206 of 372) was in the 40-60-year age group, with the rest split between the 20-40-year group (93) and those above 60 years of age (73).

The sex-specific descriptive statistics of the continuous variables are presented in the appendix. The means of most of the continuous variables measured were higher for males. Biceps skin fold thickness, triceps skin fold thickness, abdominal skin fold thickness, fibrinogen, HDL, LDL and cholesterol levels were higher for females (p-value significant), as were hip circumference, subscapular skin fold thickness and supra-iliac skin fold thickness (p-value not significant).


4.2 Exploratory Data Analysis

The primary aim of the analysis was to develop a model using 2007 data to predict HBP status in 2017, from among a cohort that did not have any subject with HBP in 2007. We employed a logistic regression model, the assumptions of which were satisfied as under,

1) outcome variable – binary – presence or absence of HBP,

2) normality is not a requirement of the predictors, although we assessed normality through histogram and qq-plots, as presented below,

3) there is a linear relationship between the logit of the outcome and the predictor variables,

4) there are no extreme outliers – extreme outliers were removed based on physiological, morbidity and age considerations for the population.

Multicollinearity is not considered a problem among predictors when developing a predictive model for a binary outcome, unlike in causal modeling, where interactions between predictors could impact results. In these circumstances, it is recommended that the accuracy of the model be cross-validated on untrained data, which we have done. Correlation results are presented in the appendix.
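The idea of validating on data that the model has not seen can be sketched with a repeated k-fold split. The number of folds, the number of repeats and the function name are hypothetical choices for illustration, not the settings used in this study.

```python
import random


def repeated_kfold_indices(n_samples, k=5, repeats=3, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold validation.

    Each repeat reshuffles the subjects and splits them into k folds;
    every fold serves once as the held-out (untrained) test set.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train_idx, test_idx


# Ten hypothetical subjects, five folds, two repeats
for train_idx, test_idx in repeated_kfold_indices(10, k=5, repeats=2):
    print(sorted(test_idx))
```

A model would be fitted on each training set and its accuracy measured on the matching test set; averaging over folds and repeats gives a more stable estimate than a single split.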

The data were further explored using principal components analysis (PCA) to understand the extent of variance explained. PCA was carried out on the continuous biomarkers of the CHD cohort. The first, second and third principal components explained 79.03%, 8.76% and 3.97% of the variance in the data, respectively; more detailed PCA results are presented in the appendix.


4.3 Epidemiologic Association Measures

Age vs Sex

HBP Cohort                   Count Data        Cell Percentages   Row Percentages    Column Percentages
                             Male    Female    Male    Female     Male    Female     Male    Female
Age gp less than 50 years    73      105       19.62   28.23      41.01   58.99      44.51   50.48
Age gp 50 or more years      91      103       24.46   27.69      46.91   53.09      55.49   49.52

Age vs HBP Status

HBP Cohort                   Count Data        Cell Percentages   Row Percentages    Column Percentages
                             HBP     No HBP    HBP     No HBP     HBP     No HBP     HBP     No HBP
Age gp less than 50 years    24      154       6.45    41.40      13.48   86.52      21.82   58.78
Age gp 50 or more years      86      108       23.12   29.03      44.33   55.67      78.18   41.22

Sex vs HBP Status

HBP Cohort                   Count Data        Cell Percentages   Row Percentages    Column Percentages
                             HBP     No HBP    HBP     No HBP     HBP     No HBP     HBP     No HBP
Male                         52      112       13.98   30.11      31.71   68.29      47.27   42.75
Female                       58      150       15.59   40.32      27.88   72.12      52.73   57.25

The 50 and over age group formed a higher share of the cohort, compared to those younger, and there were more females than males in the cohort.

The cumulative incidence of HBP was 29.57%. Incidence of HBP among those 50 and over was 44.33%, and 13.48% among those younger.

In terms of relative risk, those aged 50 and over had 3.29 times the risk of developing HBP compared to those under 50, over a ten-year period.

Incidence of HBP among males and females was, respectively, 31.71% and 27.88%.


Males had 1.14 times the risk of females for developing HBP over a ten-year period.
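The association measures in this section follow directly from the counts in the contingency tables. The short Python sketch below reproduces the arithmetic; the function names are hypothetical, and the counts are those reported for the HBP cohort.

```python
def cumulative_incidence(new_cases, at_risk):
    """Proportion of an initially disease-free group that developed disease."""
    return new_cases / at_risk


def relative_risk(cases_a, total_a, cases_b, total_b):
    """Risk in group A divided by risk in group B."""
    return cumulative_incidence(cases_a, total_a) / cumulative_incidence(cases_b, total_b)


# Counts from the Age vs HBP Status and Sex vs HBP Status tables
rr_age = relative_risk(86, 194, 24, 178)   # 50 or more years vs under 50
rr_sex = relative_risk(52, 164, 58, 208)   # males vs females

print(round(rr_age, 2))  # 3.29
print(round(rr_sex, 2))  # 1.14
```

The same functions apply to any of the four disease cohorts once the corresponding 2 x 2 counts are available.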

4.4 Predictive Modeling for HBP

Logistic regression was employed for predictive modeling. The outcome variable was HBP status in 2017 – yes or no. The predictors were the continuous variables, data on which were collected in 2007. As discussed in exploratory data analysis, logistic regression assumptions were satisfied. Model building using the Akaike Information Criterion (AIC) was employed to pick the predictive model. Prediction accuracy of the AIC model was compared with that of the full model, presented in the appendix, and the AIC model was found to be satisfactory and parsimonious. Model fit was assessed using a chi-squared test. The selected model was also used to develop a diagnostic test, whose accuracy was measured using a ROC curve.
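The mechanics of fitting a logistic regression and comparing models by AIC can be sketched on a toy example. This is not the model reported in this chapter: the data are hypothetical, there is a single predictor, and the fit uses a hand-written Newton-Raphson loop in Python rather than the R routines used for the analysis.

```python
from math import exp, log


def fit_logistic(x, y, iterations=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson.

    Returns (b0, b1, log_likelihood). Assumes the data are not
    perfectly separable, otherwise the estimates diverge.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix entries (2 x 2, symmetric)
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    p = [1.0 / (1.0 + exp(-(b0 + b1 * xi))) for xi in x]
    loglik = sum(yi * log(pi) + (1 - yi) * log(1.0 - pi) for yi, pi in zip(y, p))
    return b0, b1, loglik


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical centred predictor and binary outcome, for illustration only
x = [-2, -1, -1, 0, 0, 0, 1, 1, 2, 2]
y = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]
b0, b1, loglik = fit_logistic(x, y)

p_bar = sum(y) / len(y)
null_loglik = len(y) * (p_bar * log(p_bar) + (1 - p_bar) * log(1 - p_bar))

print("odds ratio per unit of x:", exp(b1))
print("AIC with predictor:", aic(loglik, 2))
print("AIC intercept only:", aic(null_loglik, 1))
```

The exponentiated coefficient is the odds ratio for a one-unit increase in the predictor, and the model with the lower AIC is preferred because the criterion penalizes each additional parameter.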

Age, Ht, Wt, BMI, SBP, DBP, UAC, SbS and Calcium levels are the significant predictors.

The odds ratios calculated from the coefficient estimates for all the significant predictors except Wt, UAC and SbS are over 1, indicating a positive relationship with the outcome. So, an increase in each of these predictors is associated with an increase in the probability of HBP. Wt, UAC and SbS have odds ratios that are less than 1, indicating a negative relationship with the outcome. So, an increase in each of these predictors is associated with a decrease in the probability of HBP.

HBP Cohort - Logistic regression results of chosen model

Predictor    p-value    Signif.   Odds ratio         2.5 %      97.5 %
                                  (point estimate)
Intercept    0.000768   ***       1.46e-35           1.41e-56   7.50e-16
Age          0.005569   **        1.05               1.01       1.08
Ht           0.011054   *         1.41               1.08       1.84
Wt           0.009640   **        0.69               0.52       0.91
BMI          0.003056   **        3.36               1.54       7.73
WHR          0.059737   .         121.69             0.90       1.99e+04
SBP          0.004961   **        1.05               1.01       1.08
DBP          0.046155   *         1.06               1.00       1.13
UAC          0.005425   **        0.97               0.95       0.99
UAW          0.054281   .         1.05               0.99       1.11
TrS          0.097659   .         1.05               0.99       1.11
SbS          0.036217   *         0.94               0.88       0.99
Creatinine   0.054646   .         0.98               0.96       1.00
Calcium      0.028263   *         19.7               1.53       316.42
HbA1c        0.140233             0.67               0.39       1.12
Insulin      0.052911   .         1.06               0.99       1.13

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Misclassification rate = 20.43%; model accuracy = 79.57%

4.5 Diagnostic Test for HBP Status

The chosen logistic regression model was used to develop a diagnostic test, accuracy being estimated by calculating the area under the receiver operating characteristic curve.

For prediction purposes, a link function was developed. Here, the logit (linear combination of predictors) of the regression equation was used as the link function. The link function served as the biomarker for the diagnostic test.
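
A minimal sketch of how such a biomarker could be computed for one subject: the score is the linear predictor on the logit scale (what the text calls the link function), and the inverse logit maps it back to a probability. The intercept, coefficients and measurements below are hypothetical and are not the fitted values.

```python
import math


def linear_predictor(intercept, coefficients, values):
    """Logit-scale score: intercept plus the sum of coefficient * predictor value."""
    return intercept + sum(coefficients[name] * values[name] for name in coefficients)


def inverse_logit(score):
    """Map a logit-scale score back to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical coefficients (log odds ratios) and one subject's 2007 measurements.
intercept = -8.0
coefficients = {"age": 0.05, "sbp": 0.04}
subject = {"age": 55, "sbp": 130}

biomarker = linear_predictor(intercept, coefficients, subject)
probability = inverse_logit(biomarker)
print(f"biomarker (logit) = {biomarker:.2f}, probability = {probability:.3f}")
```

Because the inverse logit is monotonic, a cut-off on the logit-scale biomarker is equivalent to a cut-off on the predicted probability.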

The observed means of the cohort with HBP are different from those of the cohort without HBP, and the biomarker values tend to be different for the HBP population. These differences, seen in the non-parametric density curves shown here, were employed to develop a diagnostic test.

Prediction accuracy, as mentioned earlier, was 79.57%. The positive predictive value of the test was 0.54, and the negative predictive value was 0.91. A cut-off value for the biomarker function was needed that would yield sufficiently high sensitivity and specificity, such that a biomarker value at or above the cut-off gives a positive test (presence of HBP) and a value below the cut-off gives a negative test (absence of HBP).
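
The measures named in this section all derive from the same 2x2 table of test result against true status. The sketch below uses hypothetical counts, since the confusion matrix itself is not given in the text.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic accuracy measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among those with the disease
        "specificity": tn / (tn + fp),  # true negatives among those without it
        "ppv": tp / (tp + fp),          # diseased among those testing positive
        "npv": tn / (tn + fn),          # disease-free among those testing negative
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts, not taken from the dissertation.
metrics = diagnostic_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity depend only on the cut-off, whereas the predictive values also depend on how common the outcome is in the cohort.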

We assessed the usefulness of the biomarker by calculating the area under the ROC curve (AUC), which was 0.8716 (95% confidence interval 0.8347-0.9085). Since the interval excludes 0.5, the null hypothesis that the biomarker has no discriminative ability is rejected, so the biomarker is useful. The optimal cut-off is -0.8003: at or above this value of the biomarker (link function), the test is positive, predicting HBP as the outcome; below it, the test is negative, predicting absence of HBP. Sensitivity and specificity are 0.80 and 0.79, respectively.
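
The AUC and the cut-off search can be sketched as follows. The source does not state how the optimal cut-off was chosen; maximizing Youden's J (sensitivity + specificity - 1) is assumed here as one common criterion. The biomarker values are hypothetical.

```python
def auc_mann_whitney(scores_pos, scores_neg):
    """AUC as the probability that a random case scores above a random non-case (ties count 0.5)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def youden_cutoff(scores_pos, scores_neg):
    """Cut-off maximizing sensitivity + specificity - 1, reading 'score >= cut-off' as test positive."""
    best = None
    for cut in sorted(set(scores_pos) | set(scores_neg)):
        sensitivity = sum(s >= cut for s in scores_pos) / len(scores_pos)
        specificity = sum(s < cut for s in scores_neg) / len(scores_neg)
        j = sensitivity + specificity - 1.0
        if best is None or j > best[0]:
            best = (j, cut, sensitivity, specificity)
    return best[1], best[2], best[3]


# Hypothetical logit-scale biomarker values for cases and non-cases.
cases = [-0.2, 0.4, 1.1, -0.9, 0.8]
non_cases = [-2.1, -1.5, -0.7, -1.0, 0.1, -1.8]

auc = auc_mann_whitney(cases, non_cases)
cutoff, sens, spec = youden_cutoff(cases, non_cases)
print(f"AUC = {auc:.3f}, cut-off = {cutoff}, sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

An AUC of 0.5 corresponds to a biomarker that ranks cases and non-cases no better than chance, which is why a confidence interval lying above 0.5 supports its usefulness.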

4.6 Repeated Cross–validation

Cross-validation is the process of validating a predictive model by splitting the given data into training and testing components. Here the data was split 5-fold: the AIC-based model, chosen by stepwise logistic regression of the full model, was trained on four of the folds and tested on the remaining fold (an 80:20 split). Repeated cross-validation refers to repeating this random splitting multiple times. The process was repeated 10 times, with stepwise regression using the Akaike Information Criterion choosing the final model from the training data each time, and the average was taken as the accuracy of the model.
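
The repeated splitting can be sketched as below, reading the procedure as ten random 80:20 splits, one accuracy value per split. The `evaluate` callback stands in for the model fitting and scoring step; it and the placeholder majority-class evaluator are hypothetical and are not the stepwise logistic regression used in the study.

```python
import random


def repeated_holdout(n_samples, evaluate, n_repeats=10, test_fraction=0.2, seed=0):
    """Average a test-set score over repeated random train/test splits.

    `evaluate(train_idx, test_idx)` is supplied by the caller; it should fit the
    model on the training rows only (including any stepwise selection) and
    return the accuracy on the test rows.
    """
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    scores = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random split for every repeat
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores), scores


# Placeholder data and evaluator (hypothetical): predict the training majority class.
labels = [1] * 30 + [0] * 70


def majority_class_accuracy(train_idx, test_idx):
    share_positive = sum(labels[i] for i in train_idx) / len(train_idx)
    majority = 1 if share_positive > 0.5 else 0
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)


average_accuracy, per_split = repeated_holdout(len(labels), majority_class_accuracy)
print(f"average accuracy over {len(per_split)} splits: {average_accuracy:.3f}")
```

Keeping model selection inside `evaluate` ensures the test rows play no part in choosing the predictors.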

HBP Cohort - Logistic regression results of chosen model from one of 10 repeats

Predictor    p-value    Signif.   Odds ratio         Confidence interval
                                  (point estimate)   2.5 %      97.5 %
Intercept    7.82e-05   ***       3.36e-05           1.61e-07   0.0046
WC           8.48e-05   ***       1.10               1.05       1.15
SBP          6.18e-19   ***       1.07               1.05       1.09
UAC          0.000206   ***       0.96               0.94       0.98
UAW          0.052236   .         1.05               1.00       1.10
TrS          0.006386   **        1.05               1.01       1.12
Glucose      0.013291   *         1.06               1.16       3.16
HbA1c        0.028531   *         0.44               0.21       0.88

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Misclassification rate = 28.57%; model accuracy = 71.43%

WC, SBP, UAC, TrS, glucose and HbA1c are the significant predictors in this repeat. The odds ratios calculated from the coefficient estimates for WC, SBP, TrS and glucose are over 1, indicating a positive relationship with the outcome. So, increases in these predictors are associated, respectively, with an increase in the probability of HBP. UAC and HbA1c have odds ratios less than 1, indicating a negative relationship with the outcome. So, increases in UAC and HbA1c are associated with a decrease in the probability of HBP.

The following were the accuracy values obtained for each of the ten runs: 72.73, 76.62, 77.92, 67.53, 74.03, 70.13, 80.52, 75.32, 81.82 and 71.43. So, the average accuracy of the prediction model was calculated as 74.81%. This process has allowed us to develop multiple models of comparable predictive accuracy.
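
The reported average can be reproduced from the ten accuracy values listed above. The standard deviation is added here only as an illustration of how the run-to-run spread could be summarized; it is not reported in the text.

```python
from statistics import mean, stdev

# Accuracy (%) of each of the ten HBP cross-validation runs, as listed in the text.
hbp_accuracies = [72.73, 76.62, 77.92, 67.53, 74.03, 70.13, 80.52, 75.32, 81.82, 71.43]

average = mean(hbp_accuracies)
spread = stdev(hbp_accuracies)  # sample standard deviation across the ten runs
print(f"mean = {average:.3f}%, SD = {spread:.2f}")
print(f"range = {min(hbp_accuracies)} to {max(hbp_accuracies)}")
```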

Chapter 5: The Coronary Heart Disease (CHD) Cohort – Results & Discussion

5.1 Descriptive Statistics

Description                  #
Total                        481
Male                         216
Female                       265
Males with CHD in 2017       12
Females with CHD in 2017     6

Table 5.1 Cohort size and sex distribution

Age group in years   Male   Female
20-40                41     51
40-60                105    146
60-80                69     68
Over 80              1      0

Table 5.2 Cohort distribution across sex and age groups

The CHD cohort consisted of individuals who did not have a CHD diagnosis in 2007, and, by the end of the follow-up in 2017, some of them developed CHD. There were more females than males in the cohort, although more males developed CHD. The follow-up revealed that 18 of the subjects, 12 males and 6 females, developed CHD by 2017. The bulk of the cohort was in the 40-60-year age group, with most of the rest above 60 years of age.

The sex-specific descriptive statistics of the continuous variables are presented in the appendix. The means of most of the continuous variables were higher for males. Biceps skin fold thickness, triceps skin fold thickness, abdominal skin fold thickness, fibrinogen, HDL, LDL and cholesterol levels were higher for females (p-value significant), as were hip circumference, subscapular skin fold thickness and supra-iliac skin fold thickness (p-value not significant).

5.2 Exploratory Data Analysis

The primary aim of the analysis was to develop a model using 2007 data to predict CHD status in 2017, in a cohort in which no subject had CHD in 2007. We employed a logistic regression model, the assumptions of which were satisfied as follows:

1) the outcome variable is binary: presence or absence of CHD;

2) normality is not a requirement of the predictors, although we assessed normality through histograms and Q-Q plots, as presented below;

3) there is a linear relationship between the logit of the outcome and the predictor variables;

4) there are no extreme outliers: extreme outliers were removed based on physiological, morbidity and age considerations for the population.
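
One way to inspect the third assumption is to bin a predictor, compute the empirical logit of the outcome in each bin, and check whether the points fall roughly on a straight line. The source does not say how linearity was assessed, so the sketch below is an assumed approach, and the ages and outcomes in it are hypothetical.

```python
import math


def empirical_logits(predictor, outcome, n_bins=4):
    """Return (bin mean of predictor, empirical logit of outcome) for equal-size bins."""
    pairs = sorted(zip(predictor, outcome))
    size = len(pairs) // n_bins
    points = []
    for b in range(n_bins):
        # The last bin absorbs any leftover observations.
        chunk = pairs[b * size:] if b == n_bins - 1 else pairs[b * size:(b + 1) * size]
        events = sum(y for _, y in chunk)
        non_events = len(chunk) - events
        # Adding 0.5 to each count keeps the logit finite when a bin has no events.
        logit = math.log((events + 0.5) / (non_events + 0.5))
        points.append((sum(x for x, _ in chunk) / len(chunk), logit))
    return points


# Hypothetical ages and outcomes, for illustration only.
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]
chd = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
points = empirical_logits(ages, chd, n_bins=4)
for bin_mean, logit in points:
    print(f"mean age {bin_mean:.1f}: empirical logit {logit:.2f}")
```

With a rare outcome such as CHD, bins need to be wide enough to contain events, so such a plot is only a rough guide.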

Multicollinearity among predictors is not considered a problem when developing a predictive model for a binary outcome, unlike in causal modeling, where interactions between predictors could impact results. In these circumstances, it is recommended that the accuracy of the model be cross-validated on untrained data, which we have done. Correlation results are presented in the appendix.

The data was further explored using principal components analysis (PCA) to understand the extent of variance explained. PCA was carried out on the continuous biomarkers of the CHD cohort. The first, second and third principal components explained 79.03%, 8.76% and 3.97% of the variance in the data, respectively. More detailed PCA results are presented in the appendix.
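
The share of variance explained by a component is its eigenvalue divided by the sum of all eigenvalues, so the shares can be accumulated to see how much the leading components capture together. The short sketch below does this with the three percentages reported for the CHD cohort; it does not rerun the PCA.

```python
from itertools import accumulate

# Percent of variance explained by the first three components (CHD cohort, as reported).
explained = [79.03, 8.76, 3.97]
cumulative = list(accumulate(explained))
for k, total in enumerate(cumulative, start=1):
    print(f"first {k} component(s): {total:.2f}% of variance")
```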

5.3 Epidemiologic Association Measures

Age vs Sex (CHD Cohort)

Age group            Count          Cell %          Row %           Column %
                     Male  Female   Male   Female   Male   Female   Male   Female
Less than 50 years   75    110      15.59  22.87    40.54  59.46    34.72  41.51
50 or more years     141   155      29.31  32.22    47.64  52.36    65.28  58.49

Age vs CHD Status (CHD Cohort)

Age group            Count          Cell %          Row %           Column %
                     CHD   No CHD   CHD    No CHD   CHD    No CHD   CHD    No CHD
Less than 50 years   1     184      0.21   38.25    0.54   99.46    5.56   39.74
50 or more years     17    279      3.53   58       5.74   94.26    94.44  60.26

Sex vs CHD Status (CHD Cohort)

Sex                  Count          Cell %          Row %           Column %
                     CHD   No CHD   CHD    No CHD   CHD    No CHD   CHD    No CHD
Male                 12    204      2.49   42.41    5.56   94.44    66.67  44.06
Female               6     259      1.25   53.85    2.26   97.74    33.33  55.94

The 50 and over age group formed a higher share of the cohort, compared to those younger, and there were more females than males in the cohort.

The overall incidence of CHD was 3.74%. Incidence of CHD among those 50 and over was 5.74%, and 0.54% among those younger.

In terms of relative risk, those aged 50 and over had 10.63 times the risk of developing CHD compared to those under 50, over a ten-year period.
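
Because only one subject under 50 developed CHD, the precision of this relative risk is worth examining. The sketch below recomputes it from the Age vs CHD status counts and adds a 95% confidence interval on the log scale (a standard large-sample approximation). The interval is an illustration added here and is not reported in the source.

```python
import math


def relative_risk_with_ci(a, b, c, d, z=1.96):
    """Relative risk with a Wald-type confidence interval computed on the log scale.

    a, b: cases and non-cases in the exposed group
    c, d: cases and non-cases in the unexposed group
    """
    rr = (a / (a + b)) / (c / (c + d))
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Age vs CHD status counts: 50 or more years 17 CHD / 279 no CHD; under 50 years 1 / 184.
rr, lower, upper = relative_risk_with_ci(17, 279, 1, 184)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The single case in the younger group dominates the standard error, so the interval is wide even though its lower bound stays above 1.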

The incidence of CHD among males and females was 5.56% and 2.26%, respectively.

Males had 2.46 times the risk of females for developing CHD over a ten-year period.

5.4 Predictive Modeling for CHD

Logistic regression was employed for predictive modeling. The outcome variable was CHD status in 2017 (yes or no). The predictors were the continuous variables measured in 2007. As discussed in the exploratory data analysis, the logistic regression assumptions were satisfied. The predictive model was selected by model building with the Akaike Information Criterion (AIC). The prediction accuracy of the AIC model was compared with that of the full model, presented in the appendix, and found to be satisfactory, with the AIC model being more parsimonious. Model fit was assessed using a chi-squared test. The selected model was also used to develop a diagnostic test, whose accuracy was measured using an ROC curve.

Logistic regression results of chosen model

Predictor                       p-value   Signif.   Odds ratio         Confidence interval
                                                    (point estimate)   2.5 %          97.5 %
Intercept                       3.4e-05   ***       2.420946e-12       3.982895e-18   4.550740e-07
Age                             0.03356   *         1.06               1.007          1.1214
SBP                             0.03495   *         1.02               1.002          1.058
Abdominal Skin Fold Thickness   0.07668   .         0.95               0.897          1.004
Calcium                         0.00258   **        1553.24            13.89          206717.3

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Misclassification rate = 3.95%; model accuracy = 96.05%

Age, systolic blood pressure and calcium levels are the significant predictors. The odds ratios calculated from the co-efficient estimates for age, systolic blood pressure and calcium levels are over 1, indicating a positive relationship with the outcome. So, increases in these predictors are associated, respectively, with an increase in the probability of CHD.

5.5 Diagnostic Test for CHD Status

The chosen logistic regression model was used to develop a diagnostic test, accuracy being estimated by calculating the area under the receiver operating characteristic curve.

For prediction purposes, a link function was developed. Here, the logit (linear combination of predictors) of the regression equation was used as the link function. The link function served as the biomarker for the diagnostic test.

The observed means of the cohort with CHD are different from those of the cohort without CHD, and the biomarker values tend to be different for the CHD population. These differences, seen in the non-parametric density curves shown here, were employed to develop a diagnostic test.

Prediction accuracy, as mentioned earlier, was 96.05%. The positive predictive value of the test was 0, and the negative predictive value was 99.78%. A cut-off value for the biomarker function was needed that would yield sufficiently high sensitivity and specificity, such that a biomarker value at or above the cut-off gives a positive test (presence of CHD) and a value below the cut-off gives a negative test (absence of CHD).
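
For context when reading a high accuracy next to a positive predictive value of 0: when few subjects develop the outcome, a rule that predicts "no disease" for everyone is already correct for most of the cohort. The sketch below computes that majority-class accuracy from the cohort sizes and case counts given in the descriptive statistics. It is an illustrative calculation, not an analysis from the dissertation.

```python
def majority_class_accuracy(n_total, n_cases):
    """Accuracy obtained by predicting the more common outcome for every subject."""
    return max(n_cases, n_total - n_cases) / n_total


# Cohort sizes and 2017 case counts as reported in the descriptive statistics.
cohorts = {"CHD": (481, 18), "Diabetes": (488, 25), "Gout": (488, 33)}
baselines = {name: majority_class_accuracy(n, cases) for name, (n, cases) in cohorts.items()}
for name, value in baselines.items():
    print(f"{name}: majority-class accuracy = {100 * value:.2f}%")
```

This is why sensitivity, specificity and the AUC are more informative than raw accuracy for rare outcomes.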

We assessed the usefulness of the biomarker by calculating the area under the ROC curve (AUC), which was 0.8482 (95% confidence interval 0.7787-0.9177). Since the interval excludes 0.5, the null hypothesis that the biomarker has no discriminative ability is rejected, so the biomarker is useful. The optimal cut-off is -3.089: at or above this value of the biomarker (link function), the test is positive, predicting CHD as the outcome; below it, the test is negative, predicting absence of CHD. Sensitivity and specificity are 0.8889 and 0.7732, respectively.

5.6 Repeated Cross–validation

Cross-validation is the process of validating a predictive model by splitting the given data into training and testing components. Here the data was split 5-fold: the AIC-based model, chosen by stepwise logistic regression of the full model, was trained on four of the folds and tested on the remaining fold (an 80:20 split). Repeated cross-validation refers to repeating this random splitting multiple times. The process was repeated 10 times, with stepwise regression using the Akaike Information Criterion choosing the final model from the training data each time, and the average was taken as the accuracy of the model.

CHD Cohort - Logistic regression results of chosen model from one of 10 repeats

Predictor                         p-value    Signif.   Odds ratio         Confidence interval
                                                       (point estimate)   2.5 %       97.5 %
Intercept                         0.000209   ***       2.061e-09          3.504e-14   7.604e-05
Sex / Female                      0.001757   **        0.035              0.0036      0.254
Age                               0.005954   **        1.081              1.027       1.148
Triceps Skin Fold Thickness       0.047223   *         1.138              1.001       1.297
Supra-iliac Skin Fold Thickness   0.023780   *         0.902              0.820       0.983
Calcium                           0.000856   ***       632.04             12.142      31955
HbA1c                             0.234102             0.537              0.154       1.140
Fibrinogen                        0.015929   *         1.654              1.10        2.52

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Misclassification rate = 4.95%; model accuracy = 95.05%; split = 80:20

Female sex, age, triceps skin fold thickness, supra-iliac skin fold thickness, calcium and fibrinogen are the significant predictors. The odds ratios calculated from the coefficient estimates for age, triceps skin fold thickness, calcium and fibrinogen are over 1, indicating a positive relationship with the outcome. So, increases in these predictors are associated, respectively, with an increase in the probability of CHD. Female sex and supra-iliac skin fold thickness have odds ratios less than 1, indicating a negative relationship with the outcome. So, being female and an increase in supra-iliac skin fold thickness are associated with a decrease in the probability of CHD, which is consistent with female sex being protective for CHD.

The following were the accuracy values obtained for each of the ten runs: 95.05, 95.05, 95.05, 96.04, 95.05, 96.04, 96.04, 96.04, 95.05 and 95.05. So, the average accuracy of the prediction model was calculated as 95.45%. This process has allowed us to develop multiple models of comparable predictive accuracy.

Chapter 6: The Diabetes Cohort – Results & Discussion

6.1 Descriptive Statistics

Description                     #
Total                           488
Male                            212
Female                          276
Males with diabetes in 2017     14
Females with diabetes in 2017   11

Table 6.1 Cohort size and sex distribution

Age group in years   Male   Female
20-40                41     51
40-60                102    146
60-80                68     78
Over 80              1      1

Table 6.2 Cohort distribution across sex and age groups

The diabetes cohort consisted of individuals who did not have a diabetes diagnosis in 2007, and, by the end of the follow-up in 2017, some of them developed diabetes. There were more females than males in the cohort, although more males developed diabetes. The follow-up revealed that 25 of the subjects, 14 males and 11 females, developed diabetes by 2017. The bulk of the cohort was in the 40-60-year age group, with most of the rest above 60 years of age.

The sex-specific descriptive statistics of the continuous variables are presented in the appendix. The means of most of the continuous variables were higher for males. Biceps skin fold thickness, triceps skin fold thickness, abdominal skin fold thickness, LDL, cholesterol and fibrinogen levels were higher for females (p-value significant), as were hip circumference, subscapular skin fold thickness, supra-iliac skin fold thickness and HbA1c (p-value not significant).

6.2 Exploratory Data Analysis

The primary aim of the analysis was to develop a model using 2007 data to predict diabetes status in 2017, in a cohort in which no subject had diabetes in 2007. We employed a logistic regression model, the assumptions of which were satisfied as follows:

1) the outcome variable is binary: presence or absence of diabetes;

2) normality is not a requirement of the predictors, although we assessed normality through histograms and Q-Q plots, as presented below;

3) there is a linear relationship between the logit of the outcome and the predictor variables;

4) there are no extreme outliers: extreme outliers were removed based on physiological, morbidity and age considerations for the population.

Multicollinearity among predictors is not considered a problem when developing a predictive model for a binary outcome, unlike in causal modeling, where interactions between predictors could impact results. In these circumstances, it is recommended that the accuracy of the model be cross-validated on untrained data, which we have done. Correlation results are presented in the appendix.

The data was further explored using principal components analysis (PCA) to understand the extent of variance explained. PCA was carried out on the continuous biomarkers of the diabetes cohort. The first, second and third principal components explained 78.39%, 9.07% and 4.36% of the variance in the data, respectively. More detailed PCA results are presented in the appendix.

6.3 Epidemiologic Association Measures

Age vs Sex (Diabetes Cohort)

Age group            Count          Cell %          Row %           Column %
                     Male  Female   Male   Female   Male   Female   Male   Female
Less than 50 years   75    110      15.37  22.54    40.54  59.46    35.38  39.86
50 or more years     137   166      28.07  34.02    45.21  54.79    64.62  60.14

Age vs Diabetes Status (Diabetes Cohort)

Age group            Count          Cell %          Row %           Column %
                     Diab  No Diab  Diab   No Diab  Diab   No Diab  Diab   No Diab
Less than 50 years   4     181      0.82   37.09    2.16   97.84    16     39.09
50 or more years     21    282      4.30   57.79    6.93   93.07    84     60.91

Sex vs Diabetes Status (Diabetes Cohort)

Sex                  Count          Cell %          Row %           Column %
                     Diab  No Diab  Diab   No Diab  Diab   No Diab  Diab   No Diab
Male                 14    198      2.87   40.57    6.60   93.40    56     42.73
Female               11    265      2.25   54.30    3.99   96.01    44     57.24

The 50 and over age group formed a higher share of the cohort, compared to those younger, and there were more females than males in the cohort.

The overall incidence of diabetes was 5.12%.

Incidence of diabetes among those 50 and over was 6.93%, and 2.16% among those younger.

In terms of relative risk, those aged 50 and over had 3.21 times the risk of developing diabetes compared to those under 50, over a ten-year period.

The incidence of diabetes among males and females was 6.60% and 3.99%, respectively.

Males had 1.66 times the risk of females for developing diabetes over a ten-year period.

6.4 Predictive Modeling for Diabetes

Logistic regression was employed for predictive modeling. The outcome variable was diabetes status in 2017 – yes or no. The predictors were the continuous variables, data on which were collected in 2007. As discussed in exploratory data analysis, logistic regression assumptions were satisfied.

Diabetes Cohort - Logistic regression results of chosen model

Predictor    p-value    Signif.   Odds ratio         Confidence interval
                                  (point estimate)   2.5 %       97.5 %
Intercept    0.00302    **        1.078903e-86       2.47e-148   1e-32
Age          0.01339    *         1.13               1.03        1.25
Ht           0.00447    **        2.88               1.43        6.32
Wt           0.00875    **        0.42               0.21        0.79
BMI          0.00948    **        1.43               2.0         122.32
WC           0.12431              0.85               0.69        1.04
WHR          0.04576    *         8.329926e+07       2.66        9.70e+15
UAC          0.00477    **        1.06               1.02        1.00
UAW          0.05427    .         0.86               0.74        0.99
SpS          0.00664    **        1.20               1.05        1.38
AbS          0.10933              0.91               0.81        1.01
Creatinine   0.12492              1.03               0.99        1.06
Glucose      5.60e-06   ***       1.77               5.94        73.15
HDL          0.02829    *         0.08               0.006       0.65
Calcium      1.72e-05   ***       8.595741e-07       5.92e-10    2.55e-04
HbA1c        0.02069    *         3.45               1.16        9.85
Insulin      0.09785    .         0.90               0.79        1.02
Sex/Female   0.08298    .         12.6               0.79        265.99

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Misclassification rate = 2.87%; model accuracy = 97.13%

The predictive model was selected by model building with the Akaike Information Criterion (AIC). The prediction accuracy of the AIC model was compared with that of the full model, presented in the appendix, and found to be satisfactory, with the AIC model being more parsimonious. Model fit was assessed using a chi-squared test. The selected model was also used to develop a diagnostic test, whose accuracy was measured using an ROC curve.

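Age, Ht, Wt, BMI, WHR, UAC, SpS, glucose, HDL, calcium and HbA1c are the significant predictors. The odds ratios calculated from the coefficient estimates of these predictors, save for Wt, HDL and calcium, are over 1, indicating a positive relationship with the outcome. So, increases in these predictors are associated, respectively, with an increase in the probability of diabetes. Wt, HDL and calcium have odds ratios less than 1, indicating a negative relationship with the outcome. So, increases in these predictors are associated with a decrease in the probability of diabetes.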

6.5 Diagnostic Test for Diabetes Status

The chosen logistic regression model was used to develop a diagnostic test, accuracy being estimated by calculating the area under the receiver operating characteristic curve.

For prediction purposes, a link function was developed. Here, the logit (linear combination of predictors) of the regression equation was used as the link function. The link function served as the biomarker for the diagnostic test.

The observed means of the cohort with diabetes are different from those of the cohort without diabetes, and the biomarker values tend to be different for the diabetes population. These differences, seen in the non-parametric density curves shown here, were employed to develop a diagnostic test. Prediction accuracy, as mentioned earlier, was 97.13%. The positive predictive value of the test was 0.56, and the negative predictive value was 0.99. A cut-off value for the biomarker function was needed that would yield sufficiently high sensitivity and specificity, such that a biomarker value at or above the cut-off gives a positive test (presence of diabetes) and a value below the cut-off gives a negative test (absence of diabetes).

We assessed the usefulness of the biomarker by calculating the area under the ROC curve (AUC), which was 0.9673 (95% confidence interval 0.9392-0.9954). Since the interval excludes 0.5, the null hypothesis that the biomarker has no discriminative ability is rejected, so the biomarker is useful. The optimal cut-off is -2.26: at or above this value of the biomarker (link function), the test is positive, predicting diabetes as the outcome; below it, the test is negative, predicting absence of diabetes. Sensitivity and specificity are 0.88 and 0.9309, respectively.

6.6 Repeated Cross–validation

Cross-validation is the process of validating a predictive model by splitting the given data into training and testing components. Here the data was split 5-fold: the AIC-based model, chosen by stepwise logistic regression of the full model, was trained on four of the folds and tested on the remaining fold (an 80:20 split). Repeated cross-validation refers to repeating this random splitting multiple times. The process was repeated 10 times, with stepwise regression using the Akaike Information Criterion choosing the final model from the training data each time, and the average was taken as the accuracy of the model.

Female sex, age, height, weight, BMI, WHR, abdominal skin fold thickness, supra-iliac skin fold thickness, glucose, HDL and calcium are the significant predictors. The odds ratios calculated from the coefficient estimates of these predictors, save for weight, abdominal skin fold thickness, HDL and calcium, are over 1, indicating a positive relationship with the outcome. So, increases in these predictors are associated, respectively, with an increase in the probability of diabetes. Weight, abdominal skin fold thickness, HDL and calcium have odds ratios less than 1, indicating a negative relationship with the outcome. So, an increase in any of these predictors is associated with a decrease in the probability of diabetes.

The following were the accuracy values obtained for each of the ten runs: 93.14, 97.06, 91.18, 94.12, 95.10, 94.12, 93.14, 94.12, 92.16 and 96.08. So, the average accuracy of the prediction model was calculated as 94.02%. This process has allowed us to develop multiple models of comparable predictive accuracy.

Logistic regression results of chosen model from one of 10 repeats

Predictor      p-value    Signif.   Odds ratio         Confidence interval
                                    (point estimate)   2.5 %       97.5 %
Intercept      0.00162    **        5.11e-100          1.28e-167   1.13e-41
Sex / Female   0.03406    *         23.50              1.48        566.49
Age            0.00778    **        1.15               1.04        1.29
Ht             0.00214    **        3.47               1.64        8.28
Wt             0.00445    **        0.34               0.15        0.69
BMI            0.00390    **        22.50              2.97        223.29
WC             0.07150    .         0.83               0.66        1.01
WHR            0.01922    *         2.33e+09           62.28       5.39e+17
UAC            0.05152    .         1.05               1.00        1.09
UAW            0.05824    .         0.86               0.72        0.99
SpS            0.00451    **        1.28               1.09        1.55
AbS            0.04925    *         0.89               0.77        0.99
Glucose        8.52e-05   ***       19.52              5.44        112.40
HDL            0.00541    **        0.017              6.08e-04    0.21
Calcium        0.00176    **        4.55e-05           2.98e-08    0.01
HbA1c          0.07529    .         2.70               0.85        7.70

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Misclassification rate = 3.92%; model accuracy = 96.08%

Chapter 7: The Gout Cohort – Results & Discussion

7.1 Descriptive Statistics

Description                 #
Total                       488
Male                        211
Female                      277
Males with Gout in 2017     21
Females with Gout in 2017   12

Table 7.1 Cohort size and sex distribution

Age group in years   Male   Female
20-40                40     51
40-60                104    148
60-80                66     77
Over 80              1      1

Table 7.2 Cohort distribution across sex and age groups

The gout cohort consisted of individuals who did not have a gout diagnosis in 2007, and, by the end of the follow-up in 2017, some of them developed gout. There were more females than males in the cohort, although more males developed gout. The follow-up revealed that 33 of the subjects, 21 males and 12 females, developed gout by 2017. The bulk of the cohort was in the 40-60-year age group, with most of the rest above 60 years of age.

The sex-specific descriptive statistics of the continuous variables are presented in the appendix. The mean age of the subjects, as of 2007, was about the same for both sexes, and the means of most of the continuous variables were higher for males.

Abdominal skin fold thickness, fibrinogen, HDL and cholesterol levels were higher for females (p-value significant), as were LDL levels, hip circumference, subscapular skin fold thickness and supra-iliac skin fold thickness (p-value not significant).

7.2 Exploratory Data Analysis

The primary aim of the analysis was to develop a model using 2007 data to predict gout status in 2017, in a cohort in which no subject had gout in 2007. We employed a logistic regression model, the assumptions of which were satisfied as follows:

1) the outcome variable is binary: presence or absence of gout;

2) normality is not a requirement of the predictors, although we assessed normality through histograms and Q-Q plots, as presented below;

3) there is a linear relationship between the logit of the outcome and the predictor variables;

4) there are no extreme outliers: extreme outliers were removed based on physiological, morbidity and age considerations for the population.

Multicollinearity among predictors is not considered a problem when developing a predictive model for a binary outcome, unlike in causal modeling, where interactions between predictors could impact results. In these circumstances, it is recommended that the accuracy of the model be cross-validated on untrained data, which we have done. Correlation results are presented in the appendix.

The data was further explored using principal components analysis (PCA) to understand the extent of variance explained. PCA was carried out on the continuous biomarkers of the gout cohort. The first, second and third principal components explained 77.5%, 9.4% and 4.5% of the variance in the data, respectively. More detailed PCA results are presented in the appendix.

7.3 Epidemiologic Association Measures

Age vs Sex (Gout Cohort)

Age group            Count          Cell %          Row %           Column %
                     Male  Female   Male   Female   Male   Female   Male   Female
Less than 50 years   74    109      15.16  22.34    40.44  59.56    35.07  39.35
50 or more years     137   168      28.07  34.43    44.92  55.08    64.93  60.65

Age vs Gout Status (Gout Cohort)

Age group            Count          Cell %          Row %           Column %
                     Gout  No Gout  Gout   No Gout  Gout   No Gout  Gout   No Gout
Less than 50 years   5     178      1.02   36.48    2.73   97.27    15.15  39.12
50 or more years     28    277      5.74   56.76    9.18   90.82    84.85  60.88

Sex vs Gout Status (Gout Cohort)

Sex                  Count          Cell %          Row %           Column %
                     Gout  No Gout  Gout   No Gout  Gout   No Gout  Gout   No Gout
Male                 21    190      4.30   38.93    9.95   90.05    63.64  41.76
Female               12    265      2.46   54.30    4.33   95.67    36.36  58.24

The 50 and over age group formed a higher share of the cohort, compared to those younger, and there were more females than males in the cohort.

The cumulative incidence of gout was 6.76%.

Cumulative incidence of gout among those 50 and over was 9.18 %, and 2.73 % among those younger.

In terms of relative risk, those aged 50 and over had 3.36 times the risk of developing gout compared to those under 50, over a ten-year period.

Cumulative incidence of gout among males and females was 9.95% and 4.33%, respectively.

Males had 2.29 times the risk of females for developing gout over a ten-year period.

7.4 Predictive Modeling for Gout

Logistic regression was employed for predictive modeling. The outcome variable was gout status in 2017 (yes or no). The predictors were the continuous variables, data on which were collected in 2007. As discussed in exploratory data analysis, logistic regression assumptions were satisfied. Stepwise model selection using the Akaike Information Criterion (AIC) was employed to pick the predictive model. The prediction accuracy of the AIC model was compared with that of the full model, presented in the appendix, and the AIC model was found to be satisfactorily accurate and more parsimonious. Model fit was assessed using a chi-squared test. The selected model was also used to develop a diagnostic test, whose accuracy was measured using an ROC curve.

Logistic regression results of chosen model

Predictor       p-value    Signif.   Odds Ratio          Confidence Interval
                                     (point estimate)    2.5 %      97.5 %
Intercept       1.79e-05   ***       0.0000009           -20.52     -7.75
Age             0.0108     *         1                   0.013      0.090
HC              0.0220     *         1.06                0.009      0.112
Creatinine      0.1402               0.97                -0.052     0.006
UricAcid        1.18e-07   ***       1.01                0.010      0.022
Triglycerides   0.0848     .         1.39                -0.059     0.713
LDL             0.0186     *         0.59                -0.978     -0.097

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The confidence limits in this table are on the log-odds (coefficient) scale.

Misclassification rate = 7.18%
Model Accuracy = 92.82%


Age, hip circumference, uric acid and LDL are the significant predictors. The odds ratios calculated from the coefficient estimates for age, hip circumference and uric acid are over 1, indicating a positive relationship with the outcome. So, an increase in each of these predictors is associated with an increase in the probability of gout. LDL has an odds ratio less than 1, indicating a negative relationship with the outcome. So, a decrease in LDL is associated with an increase in the probability of gout.
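The link between the confidence limits in the regression table and the odds ratios can be illustrated by exponentiating the limits. The Python sketch below assumes that the 2.5 % and 97.5 % columns are on the log-odds (coefficient) scale, which the negative values suggest; it is an illustration only and not the code used in the study.

```python
import math

# 95% confidence limits as printed in the regression table,
# assumed here to be on the log-odds (coefficient) scale.
log_odds_ci = {
    "Age": (0.013, 0.090),
    "HC": (0.009, 0.112),
    "Creatinine": (-0.052, 0.006),
    "UricAcid": (0.010, 0.022),
    "Triglycerides": (-0.059, 0.713),
    "LDL": (-0.978, -0.097),
}


def to_odds_ratio_scale(interval):
    """Exponentiate both limits to move from log-odds to odds ratios."""
    lower, upper = interval
    return math.exp(lower), math.exp(upper)


odds_ratio_ci = {name: to_odds_ratio_scale(ci) for name, ci in log_odds_ci.items()}

# An interval that excludes 1 on the odds-ratio scale corresponds to a
# predictor that is significant at the 5% level.
significant = [
    name for name, (lower, upper) in odds_ratio_ci.items() if lower > 1 or upper < 1
]

for name, (lower, upper) in odds_ratio_ci.items():
    print(f"{name}: {lower:.3f} to {upper:.3f}")
print("Interval excludes 1:", significant)
```

Under this assumption, the predictors whose intervals exclude 1 are age, hip circumference, uric acid and LDL, the same four predictors flagged as significant by their p-values.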

7.5 Diagnostic Test for Gout Status

The chosen logistic regression model was used to develop a diagnostic test, accuracy being estimated by calculating the area under the receiver operating characteristic curve.

For prediction purposes, a link function was developed. Here, the logit of the regression equation, that is, the linear combination of the predictors weighted by their coefficients, was used as the link function. The value of the link function serves as the biomarker for the diagnostic test.
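As an illustration of how such a biomarker value is obtained for one individual, the Python sketch below computes the linear combination of predictors and the corresponding predicted probability. The intercept, coefficients and measurements are hypothetical placeholders chosen for the example; they are not the fitted values of the model in this chapter.

```python
import math


def linear_predictor(intercept, coefficients, values):
    """Logit score: intercept plus the coefficient-weighted sum of predictors."""
    return intercept + sum(coefficients[name] * values[name] for name in coefficients)


def logistic(score):
    """Inverse of the logit: converts a score to a predicted probability."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical values for illustration only (NOT the fitted model).
intercept = -14.0
coefficients = {"Age": 0.05, "UricAcid": 0.016, "LDL": -0.53}
person = {"Age": 60, "UricAcid": 420, "LDL": 3.0}

score = linear_predictor(intercept, coefficients, person)
probability = logistic(score)
print(f"Biomarker (logit) value: {score:.2f}")
print(f"Predicted probability: {probability:.4f}")
```

Working on the logit scale keeps the biomarker a simple weighted sum of the measurements, while the logistic transformation recovers a probability when one is needed.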


The observed mean of the biomarker in the cohort with gout is different from that in the cohort without gout, and the biomarker values tend to be different for the gout population. These differences, seen in the non-parametric density curves shown here, were employed to develop a diagnostic test.

Prediction accuracy, as mentioned earlier, was 92.82%. The positive predictive value of the test was 12.12%, and the negative predictive value of the test was 98.68%. A cut-off value for the biomarker function was needed that would give sufficiently high sensitivity and specificity, such that a biomarker value at or above the cut-off gives a positive test, indicating gout, and a value below the cut-off gives a negative test, indicating absence of gout.

We assessed the usefulness of the biomarker by calculating the area under the ROC curve, which was 0.9194, with a 95% confidence interval of 0.886 to 0.9529. Because this interval excludes 0.5, the null hypothesis that the biomarker performs no better than chance is rejected. So, the biomarker is useful. The optimal cut-off is -2.497519. At or above this value of the biomarker or link function, the test is positive, predicting gout as an outcome. Below this cut-off, the test is negative, predicting absence of gout. Sensitivity and specificity are 0.9091 and 0.8264, respectively.
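The decision rule and the sensitivity and specificity reported above can be sketched in Python as follows. The counts of true and false positives and negatives are an assumption: they are reconstructed from the reported sensitivity and specificity together with the 33 gout cases and 455 non-cases in the cohort, and are not taken from the study output. Youden's J is shown as one common criterion for choosing a cut-off; the text does not state which criterion was used.

```python
import math

CUT_OFF = -2.497519  # optimal cut-off on the logit scale, as reported above


def classify(score, cut_off=CUT_OFF):
    """Test is positive at or above the cut-off, negative below it."""
    return "positive" if score >= cut_off else "negative"


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Assumed counts, reconstructed from the reported sensitivity (0.9091)
# and specificity (0.8264) with 33 cases and 455 non-cases.
tp, fn, tn, fp = 30, 3, 376, 79

sens, spec = sensitivity_specificity(tp, fn, tn, fp)
youden_j = sens + spec - 1  # one common summary used to pick a cut-off

# The same cut-off expressed as a predicted probability.
cut_off_probability = 1.0 / (1.0 + math.exp(-CUT_OFF))

print(f"Sensitivity: {sens:.4f}, specificity: {spec:.4f}, Youden's J: {youden_j:.4f}")
print(f"Cut-off as probability: {cut_off_probability:.4f}")
print(classify(-2.0), classify(-3.0))
```

Expressed as a probability, the cut-off is far below 0.5, which is typical when the outcome is uncommon in the cohort.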

7.6 Repeated Cross–validation

Cross-validation is the process of validating a predictive model by splitting the given data into training and testing components. Here, the data was split into five folds, giving a 4:1 (80:20) ratio of training to testing data. Stepwise logistic regression of the full model, using the Akaike Information Criterion, was run on four of the folds to choose the final model, and the chosen model was then tested on the remaining fold. Repeated cross-validation refers to the process of randomly splitting the data multiple times. The process was repeated 10 times, and the average was taken as the accuracy of the model.
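The repeated random 4:1 splitting described above can be sketched in Python as follows. This is an illustration using only the standard library, not the code used in the study; the cohort size of 488 is taken from the tables in section 7.3, and the seed and function names are arbitrary choices of ours. Model selection and fitting on each training set are left out.

```python
import random


def repeated_splits(n_subjects, repeats=10, test_fraction=0.2, seed=2017):
    """Yield (train, held_out) index lists for repeated random 4:1 splits."""
    rng = random.Random(seed)
    n_test = round(n_subjects * test_fraction)
    for _ in range(repeats):
        indices = list(range(n_subjects))
        rng.shuffle(indices)  # a fresh random split for every repeat
        yield indices[n_test:], indices[:n_test]


def accuracy(predicted, observed):
    """Share of held-out subjects whose predicted status matches the observed one."""
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return matches / len(observed)


splits = list(repeated_splits(488))
for train, held_out in splits:
    # In each repeat, the model would be selected and fitted on `train`
    # and its accuracy computed on `held_out`.
    print(len(train), len(held_out))
```

Averaging the accuracy over the repeats gives a more stable estimate than any single split, because no one random partition of the data dominates the result.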

Gout Cohort: Logistic regression results of chosen model from one of 10 repeats

Predictor           p-value    Signif.   Odds Ratio          Confidence Interval
                                         (point estimate)    2.5 %          97.5 %
Intercept           3.53e-06   ***       2.197530e-07        2.280056e-10   0.0001
Age                 0.0907     .         1.034654            0.99           1.08
Hip Circumference   0.0974     .         1.046087            0.99           1.10
UricAcid            2.39e-07   ***       1.013890            1.01           1.02
Triglycerides       0.0539     .         1.442565            0.99           2.10
LDL                 0.0185     *         0.587411            0.37           0.91
HbA1c               0.0354     *         1.585973            0.997          2.44

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Misclassification rate = 6.86%
Model Accuracy = 93.14%


Uric acid, LDL and HbA1c are the significant predictors. The odds ratios calculated from the coefficient estimates for uric acid and HbA1c are over 1, indicating a positive relationship with the outcome. So, an increase in each of these predictors is associated with an increase in the probability of gout. LDL has an odds ratio less than 1, indicating a negative relationship with the outcome. So, a decrease in LDL is associated with an increase in the probability of gout.

The accuracy values (%) obtained for the ten runs were 92.16, 89.22, 91.18, 91.18, 91.18, 93.14, 92.16, 90.20, 93.14 and 93.14. So, the average accuracy of the prediction model was 91.67%. This process has allowed us to develop multiple models of comparable predictive accuracy.
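The average accuracy can be checked from the ten reported values. The short Python sketch below is for illustration; the standard deviation is an addition of ours to show the spread between runs and is not reported in the study.

```python
from statistics import mean, stdev

# Accuracy (%) of each of the ten repeats, as listed above.
accuracies = [92.16, 89.22, 91.18, 91.18, 91.18, 93.14, 92.16, 90.20, 93.14, 93.14]

average = mean(accuracies)
spread = stdev(accuracies)  # sample standard deviation across repeats

print(f"Average accuracy: {average:.2f}%")
print(f"Range: {min(accuracies):.2f}% to {max(accuracies):.2f}%")
print(f"Standard deviation across repeats: {spread:.2f}")
```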


Chapter 8: Study of Population Means of Fibrinogen - Results and Discussion

The population mean is a useful descriptor of a population. Biomarkers, measured as continuous variables, yield population means that are useful for understanding the effects other variables have on them, and hence on the health of the population. We studied the population mean of the biomarker fibrinogen. The biomarker levels were measured twice, once in 2007 and again in 2017, and the samples are matched. We compared the population means of the two measurements to understand the significance of time for the difference between them. We also wanted to understand the significance of sex, and of its interaction with time, for that difference.

8.1 Descriptive Statistics and Paired t-Test

Descriptor   #
Total        531
Male         235
Female       296

Fibrinogen (g/L)   2007        2017
Mean               3.58        2.86
Range              1.20-6.90   1.10-6.70
Median             3.50        2.70
SD                 1.13        1.12


We have seen the general trend of age distribution in the previous chapters and, as seen before, there are more females than males. At a glance, the overall mean level in 2017 seems lower than 10 years earlier.

Paired t-Test results

Trait              n     2007 Mean   2007 SD   2017 Mean   2017 SD   p-value
Fibrinogen (g/L)   531   3.58        1.13      2.86        1.12      < 0.001

There is a significant difference in the population means of fibrinogen levels between 2007 and 2017. More detailed population specific means and significance levels, along with boxplots and qq plots of the distribution are provided in the appendix.
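For reference, the paired t statistic is computed from the within-subject differences between the two measurements. The Python sketch below shows the calculation on a small made-up data set; the five pairs of values are hypothetical and are not measurements from the study, and the function name is ours.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return the paired t statistic and its degrees of freedom."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    differences = [a - b for b, a in zip(before, after)]
    n = len(differences)
    standard_error = stdev(differences) / math.sqrt(n)
    return mean(differences) / standard_error, n - 1


# Hypothetical fibrinogen values (g/L) for five matched subjects.
fibrinogen_2007 = [3.6, 4.1, 2.9, 3.8, 3.3]
fibrinogen_2017 = [2.9, 3.5, 2.6, 3.0, 2.8]

t_stat, df = paired_t_statistic(fibrinogen_2007, fibrinogen_2017)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

A negative statistic here means that the second measurement is lower on average than the first, which is the direction seen for fibrinogen in this cohort.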

8.2 Does time influence the population mean of fibrinogen?

In this paired cohort, we wanted to understand the effect of time on the population mean of fibrinogen. To answer this question, we ran a repeated measures ANOVA using time as a within-subject factor and fibrinogen levels as the outcome variable. Time is referred to as ‘Biomarker’ in the model, referring to the quantity being measured two times, 2007 and 2017.

            Df     Sum Sq   Mean Sq   F-value   p-value
Biomarker   1      25.3     25.3      20.12     8.09e-06 ***
Residuals   1058   1330.7   1.258

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


The results indicate that time has a significant effect on fibrinogen levels, suggesting that aging potentially decreased the mean fibrinogen levels in this population.

8.3 Is there an interaction between time and sex that affects the population mean of fibrinogen?

After learning that time significantly affects the population mean fibrinogen levels, we wanted to understand the interaction effect between time and sex, in affecting the population means of fibrinogen. To answer this question, we ran a mixed model with time as the within-subject factor, and sex as the between-subject factor. Time is referred to as ‘Biomarker’ in the model, referring to the quantity being measured two times, 2007 and 2017. Sex is referred to as ‘sex_gps’ in the model, possible values being male and female.

Type III Analysis of Variance Table with Satterthwaite's method

                    DenDF   NumDF   Sum Sq   Mean Sq   F-value   p-value
Biomarker           529     1       136.34   136.34    140.14    <2.2e-16 ***
sex_gps/Female      529     1       10.30    10.30     10.58     0.001 **
Biomarker:sex_gps   529     1       0.74     0.74      0.76      0.382

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The results indicate that time and sex each have a significant effect on the population mean of fibrinogen. There is, however, no significant interaction effect between time and sex. A more detailed explanation of the result is given in the appendix, along with a qq plot of the residuals, which satisfies the model assumption.


Chapter 9: Study of Population Means of Uric Acid - Results and Discussion

The population mean is a useful descriptor of a population. Biomarkers, measured as continuous variables, yield population means that are useful for understanding the effects other variables have on them, and hence on the health of the population. We studied the population mean of the biomarker uric acid. The biomarker levels were measured twice, once in 2007 and again in 2017, and the samples are matched. We compared the population means of the two measurements to understand the significance of time for the difference between them. We also wanted to understand the significance of sex, and of its interaction with time, for that difference.

9.1 Descriptive Statistics and Paired t-Test

Descriptor   #
Total        550
Male         242
Female       308

Uric Acid (umol/L)   2007        2017
Mean                 297.8       316.6
Range                115 - 671   117 - 684
Median               288         311
SD                   88.83       89.96


We have seen the general trend of age distribution in the previous chapters and, as seen before, there are more females than males. At a glance, the overall mean level in 2017 seems higher than 10 years earlier.

Paired t-Test results

Trait                n     2007 Mean   2007 SD   2017 Mean   2017 SD   p-value
Uric Acid (umol/L)   550   297.81      88.83     316.62      89.96     <0.001

There is a significant difference in the population means of uric acid levels between 2007 and 2017. More detailed population specific means and significance levels, along with boxplots and qq plots of the distribution are provided in the appendix.

9.2 Does time influence the population mean of uric acid?

In this paired cohort, we wanted to understand the effect of time on the population mean of uric acid. To answer this question, we ran a repeated measures ANOVA using time as a within-subject factor and uric acid levels as the outcome variable. Time is referred to as ‘Biomarker’ in the model, referring to the quantity being measured two times, 2007 and 2017.

            Df     Sum Sq    Mean Sq   F-value   p-value
Biomarker   1      46268     46268     5.786     0.0163 *
Residuals   1096   8764166   7997

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


The results indicate that time has a significant effect on uric acid levels, suggesting that aging potentially increased the mean uric acid levels in this population.
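The entries of the ANOVA table for uric acid are related by simple arithmetic: each mean square is the sum of squares divided by its degrees of freedom, and the F-value is the ratio of the two mean squares. The Python sketch below reproduces the tabulated values from the sums of squares and degrees of freedom; it is an illustration and not the code used in the study.

```python
def f_statistic(ss_effect, df_effect, ss_residual, df_residual):
    """Return the effect mean square, residual mean square and F-value."""
    ms_effect = ss_effect / df_effect
    ms_residual = ss_residual / df_residual
    return ms_effect, ms_residual, ms_effect / ms_residual


# Sums of squares and degrees of freedom from the uric acid ANOVA table.
ms_time, ms_resid, f_value = f_statistic(46268, 1, 8764166, 1096)

print(f"Mean square (Biomarker): {ms_time:.0f}")
print(f"Mean square (Residuals): {ms_resid:.0f}")
print(f"F-value: {f_value:.3f}")
```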

9.3 Is there an interaction between time and sex that affects the population mean of uric acid?

After learning that time significantly affects the population mean uric acid levels, we wanted to understand the interaction effect between time and sex, in affecting the population means of uric acid. To answer this question, we ran a mixed model with time as the within-subject factor, and sex as the between-subject factor. Time is referred to as ‘Biomarker’ in the model, referring to the quantity being measured two times, 2007 and 2017. Sex is referred to as ‘sex_gps’ in the model, possible values being male and female.

Type III Analysis of Variance Table with Satterthwaite's method

                    DenDF   NumDF   Sum Sq   Mean Sq   F-value   p-value
Biomarker           548     1       91719    91719     46.28     2.697e-11 ***
sex_gps/Female      548     1       479913   479913    242.14    <2.2e-16 ***
Biomarker:sex_gps   548     1       3248     3248      1.64      0.201

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The results indicate that time and sex each have a significant effect on the population mean of uric acid. There is, however, no significant interaction effect between time and sex. A more detailed explanation of the result is given in the appendix, along with a qq plot of the residuals, which satisfies the model assumption.


Chapter 10: Conclusion

10.1 Major Conclusions

The current study approached population health through two distinct approaches:

1) Subject specific research

2) Population specific research

Subject Specific Research

The advent of ever-increasing computing power has enabled us to ask more complex questions of health data. The dawn of precision medicine and its marriage to big data has ushered in a new era of healthcare, one in which the vast repositories of data can be tapped and customized to answer specific health questions about a given individual. This is reflected in the first aim.

The aim was to develop useful predictive models of the data that could answer subject specific questions, such as disease status 10 years into the future. This was accomplished by developing a logistic regression model that was in turn used in two different ways.

First, we trained the chosen model on one part of the data and tested its prediction accuracy on an untrained part of the data. We were successful in developing predictive models of useful accuracy.

Second, we used the chosen model to develop a diagnostic test using the logit of the regression model. The diagnostic test was strengthened by the development of a receiver operating characteristic (ROC) curve, which was used to determine usefulness of the biomarker. This was further utilized to optimize the cut-off value to more acceptable sensitivity and specificity values for the test.


The above approaches were successfully validated on four separate outcomes – high blood pressure, coronary heart disease, diabetes and gout. This will enable field personnel to utilize the models to predict future disease status of the local population.

Population Specific Research

Public health aims to accomplish the best possible outcomes for the maximum number of people in the community. This requires research to better understand population level data and how they change over time or with other factors. This information could be used to better understand morbidity and mortality patterns in the community. Two biomarkers, fibrinogen and uric acid, were studied to understand change over time and with respect to sex, using mixed models.

10.2 Strengths and Limitations

Strengths

1) The study developed useful predictive models for four separate outcomes that could be deployed in the field to predict disease status in the future.

2) The study developed a diagnostic test that can be deployed to study population trends in disease incidence and prevalence, and to verify the findings with the accuracy check provided by an ROC curve.

3) The study developed a valuable workflow to study biomarkers and changes in population means with respect to time and sex, using mixed models. This approach can be extended to more biomarkers and potential interaction terms.

4) The study approaches can be implemented in the local population to develop a robust evidence base upon which to formulate health policy.


Limitations

1) Community survey research entails the limitations of reporting inaccuracy, memory issues and other problems of survey research.

2) Errors in biomarker quantitation and anthropometric measurements cannot be ruled out completely in large scale studies.

3) Population structures change over time. The current findings are applicable to the existing population structure in this isolated community. Immigration trends can alter prevalence and incidence of diseases and in turn limit the value of the study, in terms of formulating health policy.

10.3 Future Research

Future research would

1) include integration of genomic and biomarker data to develop more accurate predictive models for disease outcomes. This would have a significant impact in formulating accurate health policy.

2) include integration of genomic data into population level markers, to better understand community level impact of the interaction between the genome and the environment. Again, this would result in the development of more relevant health policies for the community.



Appendix

Sex Specific Summary Statistics for the HBP cohort
Histograms of continuous variables in the HBP cohort
QQ-plots of continuous variables in the HBP cohort
Correlation results of continuous biomarkers of the HBP cohort (insignificant correlations were left out and cells whitened)
PCA results of the HBP cohort
Sex Specific Summary Statistics for the CHD cohort
Histograms of continuous variables in the CHD cohort
QQ-plots of continuous variables in the CHD cohort
Correlation results of continuous biomarkers of the CHD cohort (insignificant correlations were left out and cells whitened)
PCA results of the CHD cohort
Sex Specific Summary Statistics for the Diabetes cohort
Histograms of continuous variables in the Diabetes cohort
QQ-plots of continuous variables in the Diabetes cohort
Correlation results of continuous biomarkers of the Diabetes cohort (insignificant correlations were left out and cells whitened)
PCA results of the Diabetes cohort
Sex Specific Summary Statistics for the Gout cohort
Histograms of continuous variables in the Gout cohort
QQ-plots of continuous variables in the Gout cohort
Correlation results of continuous biomarkers of the Gout cohort (insignificant correlations were left out and cells whitened)
PCA results of the Gout cohort
Fibrinogen (g/l) Paired t-test results, residuals and detailed explanation of the mixed model results (output)
Uric Acid (umol/l) Paired t-test results, residuals and detailed explanation of the mixed model results (output)


Appendix Sex Specific Summary Statistics for the HBP cohort, SD = Standard Deviation

Columns: Trait; Male (n = 164): Mean, SD; Female (n = 208): Mean, SD; p-value

Age (years) 49.91 12.766 48.88 12.306 0.4316

Height (cm) 178.99 6.277 165.85 5.947 0.000

Weight (kg) 86.833 11.395 70.485 10.998 0.000

BMI (kg/m2) 96.91 8.596 86.037 11.514 0.000

WC (cm) 0.95 0.059 0.85 0.073 0.000

HC (cm) 102.01 7.219 101.80 9.585 0.806

WHR 0.950 0.059 0.845 0.073 0.000

UAC (mm) 300.24 21.309 279.22 27.18 0.000

UAW (mm) 73.05 6.301 62.88 5.933 0.000

BiS (mm) 14.63 7.328 18.58 7.477 0.000

TrS (mm) 14.03 4.748 25.09 6.572 0.000

SbS (mm) 22.12 5.780 21.96 7.645 0.754

SpS (mm) 27.47 7.835 27.98 8.535 0.549

AbS (mm) 29.71 8.086 30.63 9.502 0.315

Glucose (mmol/L) 5.91 1.097 5.51 0.671 0.000

HbA1c (%) 5.59 0.625 5.55 0.550 0.511

DBP (mmHg) 81.25 7.855 77.44 8.872 0.000

SBP (mmHg) 127.58 14.912 121.62 18.295 0.001

HDL (mmol/L) 1.29 0.286 1.51 0.286 0.000

LDL (mmol/L) 3.60 1.038 3.74 1.009 0.176

Cholesterol (mmol/L) 5.66 1.194 5.85 1.176 0.134

Triglycerides (mmol/L) 1.61 0.994 1.29 0.722 0.001

Calcium (mmol/L) 2.37 0.144 2.38 0.113 0.626

Creatinine (umol/L) 95.66 15.502 77.44 11.399 0.000

Fibrinogen (g/L) 3.39 1.115 3.72 1.140 0.005

Uric Acid (umol/L) 351.82 78.208 237.97 61.042 0.000


Appendix Histograms of continuous variables in the HBP cohort


Appendix QQ-plots of continuous variables in the HBP cohort


Appendix Correlation results of continuous biomarkers of the HBP cohort (insignificant correlations were left out and cells whitened)


Appendix PCA results of the HBP cohort


Appendix Sex Specific Summary Statistics in the CHD cohort, SD = Standard Deviation

Columns: Trait; Male (n = 216): Mean, SD; Female (n = 265): Mean, SD; p-value

Age (years) 52.99 12.85 51.67 12.66 0.264

Height (cm) 178.13 6.632 164.97 5.963 0.000

Weight (kg) 87.30 11.648 72.03 11.452 0.000

BMI (kg/m2) 27.47 2.969 26.51 4.249 0.004

WC (cm) 98.17 8.895 88.20 12.057 0.000

HC (cm) 102.60 7.456 103.41 9.784 0.306

WHR 0.957 0.059 0.852 0.074 0.000

UAC (mm) 300.41 21.348 283.65 28.882 0.000

UAW (mm) 73.39 6.053 63.60 5.958 0.000

BiS (mm) 14.42 7.024 19.34 7.483 0.000

TrS (mm) 14.13 4.966 25.58 6.568 0.000

SbS (mm) 22.93 6.207 23.30 8.096 0.577

SpS (mm) 27.62 7.961 28.88 8.419 0.091

AbS (mm) 29.38 8.306 31.70 9.336 0.004

Glucose (mmol/L) 6.01 1.090 5.60 0.736 0.000

HbA1c (%) 5.63 0.658 5.62 0.650 0.905

DBP (mmHg) 83.03 8.402 79.78 9.904 0.000

SBP (mmHg) 132.18 17.682 126.92 20.469 0.003

HDL (mmol/L) 1.31 0.303 1.50 0.295 0.000

LDL (mmol/L) 3.62 0.997 3.80 0.996 0.051

Cholesterol (mmol/L) 5.68 1.137 5.92 1.170 0.022

Triglycerides (mmol/L) 1.59 0.959 1.38 0.764 0.010

Calcium (mmol/L) 2.37 0.136 2.37 0.114 0.543

Creatinine (umol/L) 96.10 15.451 79.27 13.896 0.000

Fibrinogen (g/L) 3.40 1.100 3.80 1.169 0.001

Uric Acid (umol/L) 352.87 77.306 252.42 71.456 0.000

Appendix Histograms of continuous variables in the CHD cohort

Appendix QQ-plots of continuous variables in the CHD cohort

Appendix Correlation results of continuous biomarkers of the CHD cohort (non-significant correlations are omitted and their cells left blank)

Appendix PCA results of the CHD cohort

Appendix Sex-Specific Summary Statistics of the diabetes cohort (SD = Standard Deviation)

Trait                   Male (n = 216)      Female (n = 265)    p-value
                        Mean      SD        Mean      SD
Age (years)             53.09     13.178    52.47     13.127    0.608
Height (cm)             178.26    6.554     164.76    6.080     <0.001
Weight (kg)             87.66     11.773    71.99     44.459    <0.001
BMI (kg/m²)             27.55     3.015     26.56     4.218     0.003
WC (cm)                 98.50     9.033     88.18     11.837    <0.001
HC (cm)                 102.75    7.538     103.54    9.809     0.315
WHR                     0.96      0.060     0.85      0.072     <0.001
UAC (mm)                300.56    22.261    283.72    29.004    <0.001
UAW (mm)                73.28     5.950     63.57     6.043     <0.001
BiS (mm)                14.58     7.112     19.40     7.469     <0.001
TrS (mm)                14.30     5.008     25.64     6.530     <0.001
SbS (mm)                23.06     6.291     23.29     8.115     0.722
SpS (mm)                27.77     8.108     28.57     8.344     0.281
AbS (mm)                29.47     8.618     31.52     9.204     0.012
Glucose (mmol/L)        5.85      0.756     5.53      0.625     <0.001
HbA1c (%)               5.56      0.532     5.59      0.572     0.551
DBP (mmHg)              83.47     8.383     79.98     10.048    <0.001
SBP (mmHg)              132.95    18.610    127.71    21.224    0.004
HDL (mmol/L)            1.30      0.306     1.50      0.293     <0.001
LDL (mmol/L)            3.61      0.993     3.82      0.975     0.019
Cholesterol (mmol/L)    5.68      1.139     5.95      1.143     0.012
Triglycerides (mmol/L)  1.63      0.967     1.38      0.750     0.002
Calcium (mmol/L)        2.37      0.135     2.37      0.123     0.822
Creatinine (umol/L)     96.54     15.555    78.97     13.733    <0.001
Fibrinogen (g/L)        3.40      1.096     3.75      1.180     0.001
Uric Acid (umol/L)      355.33    76.412    252.26    70.553    <0.001

Appendix Histograms of continuous variables in the diabetes cohort

Appendix QQ-plots of continuous variables in the diabetes cohort

Appendix Correlation results of continuous biomarkers of the diabetes cohort (non-significant correlations are omitted and their cells left blank)

Appendix PCA results of the diabetes cohort

Appendix Sex-Specific Summary Statistics in the gout cohort (SD = Standard Deviation)

Trait                   Male (n = 211)      Female (n = 277)    p-value
                        Mean      SD        Mean      SD
Age (years)             52.921    12.923    52.527    13.076    0.739
Height (cm)             178.259   6.534     164.746   6.085     <0.001
Weight (kg)             87.518    11.660    71.866    11.315    <0.001
BMI (kg/m²)             27.504    2.997     26.517    4.183     0.003
WC (cm)                 98.386    9.119     88.304    11.854    <0.001
HC (cm)                 102.757   7.508     103.419   9.684     0.395
WHR                     0.958     0.060     0.853     0.073     <0.001
UAC (mm)                301.171   21.857    283.296   28.368    <0.001
UAW (mm)                73.508    6.045     63.563    6.005     <0.001
BiS (mm)                14.728    7.090     19.295    7.437     <0.001
TrS (mm)                14.347    5.021     25.592    6.541     <0.001
SbS (mm)                23.033    6.152     23.272    8.105     0.711
SpS (mm)                28.039    8.055     28.736    8.452     0.354
AbS (mm)                29.711    8.525     31.590    9.220     0.020
Glucose (mmol/L)        6.026     1.096     5.587     0.742     <0.001
HbA1c (%)               5.636     0.670     5.634     0.654     0.962
DBP (mmHg)              83.045    8.302     79.964    10.043    <0.001
SBP (mmHg)              132.043   17.509    127.695   21.342    0.014
HDL (mmol/L)            1.301     0.293     1.493     0.292     <0.001
LDL (mmol/L)            3.656     0.985     3.820     0.978     0.068
Cholesterol (mmol/L)    5.717     1.127     5.952     1.153     0.024
Triglycerides (mmol/L)  1.594     0.954     1.394     0.761     0.013
Calcium (mmol/L)        2.370     0.135     2.370     0.121     0.978
Creatinine (umol/L)     95.796    14.825    78.729    13.339    <0.001
Fibrinogen (g/L)        3.404     1.083     3.737     1.167     0.001
Uric Acid (umol/L)      347.839   73.359    252.079   70.333    <0.001

Appendix Histograms of continuous variables in the gout cohort

Appendix QQ-plots of continuous variables in the gout cohort

Appendix Correlation results of continuous biomarkers of the gout cohort (non-significant correlations are omitted and their cells left blank)

Appendix PCA results of the gout cohort

Appendix Fibrinogen (g/L): paired t-test results, residuals, and detailed explanation of the mixed model results (output)

Summary Statistics
Year      Mean      Median    Range       SD
2007      3.58      3.5       1.2-6.9     1.13
2017      2.86      2.7       1.1-6.7     1.12

Paired t-test
                    2007      2017
Mean (Overall)      3.58      2.86
Mean (Male)         3.41      2.75
Mean (Female)       3.71      2.94

p-value (Overall)   <2.2e-16
p-value (Male)      6.90e-11
p-value (Female)    1.015e-15

Wilcoxon signed rank test
p-value             <2.2e-16

The population median for the 2007 levels was significantly different from the population median for the 2017 levels of fibrinogen.

The overall model predicting Measurement (formula = Measurement ~ Biomarker * sex_gps + (1 | SUBJECT_ID)) has a total explanatory power (conditional R²) of 30.30%, in which the fixed effects explain 10.63% of the variance (marginal R²). The model's intercept is at 3.42 (SE = 0.073, 95% CI [3.27, 3.56]).
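Written out, the model formula above corresponds to a random-intercept model of the following form. The symbols are introduced here for illustration only; the reference levels (2007 and male) are inferred from the coefficient labels, and the normality of the random terms is the usual assumption for this kind of model rather than something stated in the output.

```latex
\[
y_{ij} = \beta_0
       + \beta_1\,\mathrm{Y2017}_{ij}
       + \beta_2\,\mathrm{Female}_{i}
       + \beta_3\,\bigl(\mathrm{Y2017}_{ij} \times \mathrm{Female}_{i}\bigr)
       + u_i + \varepsilon_{ij},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
\]
```

Here $y_{ij}$ is the measurement for subject $i$ at occasion $j$, $\beta_0$ is the intercept (the expected 2007 value for men), and $u_i$ is the subject-specific random intercept from the `(1 | SUBJECT_ID)` term.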

Within this model: The effect of BiomarkerFibrinogen_2017 is significant (beta = -0.67, SE = 0.091, 95% CI [-0.85, -0.49], t(529) = -7.34, p < .001) and can be considered as medium (std. beta = -0.57, std. SE = 0.077). The effect of sex_gpsFemale is significant (beta = 0.30, SE = 0.098, 95% CI [0.11, 0.49], t(1009) = 3.09, p < .01) and can be considered as small (std. beta = 0.26, std. SE = 0.083). The effect of BiomarkerFibrinogen_2017:sex_gpsFemale is not significant (beta = -0.11, SE = 0.12, 95% CI [-0.35, 0.13], t(529) = -0.87, p > .1) and can be considered as very small (std. beta = -0.090, std. SE = 0.10).
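The fixed-effect estimates can be read as building blocks for the four sex-by-year means. The sketch below simply adds the reported coefficients; the dictionary keys and the function name `predicted_mean` are hypothetical, and it assumes treatment coding with 2007 and male as reference levels, as the coefficient labels suggest.

```python
# Fixed-effect estimates for fibrinogen (g/L), copied from the model output.
# Assumption: treatment coding with reference levels 2007 and Male.
FIXED_EFFECTS = {
    "intercept": 3.42,            # expected value for men in 2007
    "year_2017": -0.67,           # change from 2007 to 2017 in men
    "female": 0.30,               # female minus male difference in 2007
    "year_2017_x_female": -0.11,  # additional 2017 change in women
}

def predicted_mean(effects, year_2017, female):
    """Fixed-effects prediction for one sex-by-year cell, rounded to 2 decimals."""
    value = effects["intercept"]
    if year_2017:
        value += effects["year_2017"]
    if female:
        value += effects["female"]
    if year_2017 and female:
        value += effects["year_2017_x_female"]
    return round(value, 2)

cells = {
    ("Male", 2007): predicted_mean(FIXED_EFFECTS, False, False),
    ("Male", 2017): predicted_mean(FIXED_EFFECTS, True, False),
    ("Female", 2007): predicted_mean(FIXED_EFFECTS, False, True),
    ("Female", 2017): predicted_mean(FIXED_EFFECTS, True, True),
}

for (sex, year), mean in cells.items():
    print(f"{sex:6s} {year}: {mean:.2f} g/L")
```

The four values agree with the sex-specific means in the paired t-test summary for fibrinogen (3.41 and 2.75 for men, 3.71 and 2.94 for women) to within rounding of the coefficients.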

Appendix Uric Acid (umol/L): paired t-test results, residuals, and detailed explanation of the mixed model results (output)

Summary Statistics
Year      Mean      Median    Range       SD
2007      297.8     288       115-671     88.83
2017      316.6     311       117-684     89.96

Paired t-test
                    2007      2017
Mean (Overall)      297.81    316.62
Mean (Male)         351.95    366.88
Mean (Female)       255.27    277.13

p-value (Overall)   0.0005
p-value (Male)      0.0326
p-value (Female)    0.0004

Wilcoxon signed rank test
p-value             6.05e-14

The population median for the 2007 levels was significantly different from the population median for the 2017 levels of uric acid.

The overall model predicting Measurement (formula = Measurement ~ Biomarker * sex_gps + (1 | SUBJECT_ID)) has a total explanatory power (conditional R²) of 75.50%, in which the fixed effects explain 27.62% of the variance (marginal R²). The model's intercept is at 351.95 (SE = 4.92, 95% CI [342.32, 361.59]).
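The gap between the conditional and marginal values indicates how much of the variation is attributable to stable between-subject differences. The sketch below assumes the two quantities follow the usual definitions (marginal: fixed effects only; conditional: fixed effects plus the random intercept), which the output does not state explicitly. The function name `variance_shares` is hypothetical.

```python
def variance_shares(marginal_r2, conditional_r2):
    """Split total variance into fixed, random-intercept, and residual shares.

    Assumes marginal R2 = fixed / total and
    conditional R2 = (fixed + random intercept) / total.
    """
    random_intercept = conditional_r2 - marginal_r2
    residual = 1.0 - conditional_r2
    # Share of the variance left after the fixed effects that is
    # attributable to subjects (an intraclass correlation).
    icc = random_intercept / (random_intercept + residual)
    return {
        "fixed": marginal_r2,
        "random_intercept": random_intercept,
        "residual": residual,
        "icc": icc,
    }

# Uric acid model: marginal R2 = 27.62%, conditional R2 = 75.50%.
shares = variance_shares(0.2762, 0.7550)
for name, value in shares.items():
    print(f"{name:17s} {value:.3f}")
```

Under these assumptions, roughly 48% of the variance in uric acid is attributable to between-subject differences, and about two thirds of the variance not explained by the fixed effects lies between subjects rather than within them.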

Within this model: The effect of BiomarkerUricAcid_2017 is significant (beta = 14.93, SE = 4.05, 95% CI [7.00, 22.87], t(548) = 3.69, p < .001) and can be considered as very small (std. beta = 0.17, std. SE = 0.045). The effect of sex_gpsFemale is significant (beta = -96.68, SE = 6.57, 95% CI [-109.56, -83.81], t(762) = -14.71, p < .001) and can be considered as large (std. beta = -1.08, std. SE = 0.073). The effect of BiomarkerUricAcid_2017:sex_gpsFemale is not significant (beta = 6.92, SE = 5.41, 95% CI [-3.68, 17.52], t(548) = 1.28, p > .1) and can be considered as very small (std. beta = 0.077, std. SE = 0.060).

Key Terms

BiS – Biceps Skin fold Thickness
TrS – Triceps Skin fold Thickness
SbS – Subscapular Skin fold Thickness
SpS – Supra-iliac Skin fold Thickness
AbS – Abdominal Skin fold Thickness
UAW – Upper Arm Width
UAC – Upper Arm Circumference
