Methods for Data Analysis in Split-Mouth Randomized Clinical Trials: a Simulation Study

by

Romina Brignardello Petersen

A thesis submitted in conformity with the requirements for the degree of Master of Science (Clinical Epidemiology and Health Care Research)

Institute of Health Policy, Management & Evaluation

University of Toronto

© Copyright by Romina Brignardello Petersen 2012

Abstract

Split-mouth trials are a type of randomized controlled trial in dentistry in which divisions of the mouth are the units of randomization. Since there is more than one tooth in each mouth division, the structure of the data is complex, which can create difficulties in the statistical analysis. The aim of this study was to determine the most appropriate method to analyze split-mouth trials with continuous outcomes, with regard to the treatment effect estimates, power, type-I error, confidence interval coverage and confidence interval width. A superiority split-mouth trial in the field of periodontology was simulated, using two mouth divisions and varying underlying study characteristics such as correlation among teeth, treatment effects and sample size. Twenty-four statistical methods were compared across 315 scenarios. The performance of the statistical methods depended mainly on the correlation among the data, and a paired t-test performed the best across the different scenarios.

Acknowledgments

I would like to thank my supervisor, Dr. George Tomlinson, for his encouragement, guidance, efforts and patience to explain things clearly and through all this process. This thesis would not have been possible without his help and he exceeded all my expectations as a supervisor. I really hope that we continue working together.

I thank my thesis committee members, Alex Jadad and Carlos Quiñonez, for their support and valuable advice and feedback.

To Alonso, for believing in my project and coming with me to Canada. His motivation, support and love have been essential throughout this experience.

Table of Contents

List of Tables
List of Figures
List of Appendices
1 Introduction
2 Literature review
  2.1 Split-mouth randomized controlled trials
  2.2 Periodontal diseases
  2.3 Split-mouth trials in periodontal diseases
  2.4 Statistical analysis in split-mouth trials
    2.4.1 Nested nature of the data within a split
    2.4.2 Nested nature of the data within a subject
    2.4.3 Approaches to analyze pre-post treatment data
  2.5 Choice between the different methods and potential consequences of an inappropriate analysis
  2.6 Simulation studies
3 Rationale for this study
4 Research question
5 Objectives
  5.1 General objective
  5.2 Specific objectives
6 Methods
  6.1 Study design
  6.2 Rationale for using a simulation study
  6.3 Procedures and assumptions
    6.3.1 Models to generate the data
    6.3.2 Computational methods
    6.3.3 Level of dependence between simulated datasets
  6.4 Scenarios investigated
  6.5 Analyses performed
    6.5.1 Full data versus collapsed data
    6.5.2 Statistical methods to evaluate
    6.5.3 Estimates obtained from each analysis
    6.5.4 Assessment of the performance of each of the methods: outcomes
    6.5.5 Weighting of the criteria of performance assessment to compare the methods
  6.6 Number of simulations
  6.7 Analysis and summary of the results
7 Results
  7.1 Bias
  7.2 Power
    7.2.1 Small sized trial (Figure 7)
    7.2.2 Medium sized trial (Figure 8)
    7.2.3 Large sized trial (Figure 9)
  7.3 Type-I error rate
  7.4 Confidence interval coverage
  7.5 Mean square error
  7.6 Confidence interval width
  7.7 Comparison of the methods of analysis
    7.7.1 Small sized trial (Table 6)
    7.7.2 Medium sized trial (Table 7)
    7.7.3 Large sized trial (Table 8)
  7.8 Exploration of other assumption: variation in the treatment effect in each patient
  7.9 Summary of the results: comparison of the simple and complex methods of data analysis
8 Discussion
9 Conclusions
References
Appendix 1: R program used to generate the data
Appendix 2: R programs used to analyze the data
Appendix 3: The power of the statistical methods increases with the number of teeth
Appendix 4: The differences between baseline and post treatment differences in the treated and untreated sides are not correlated

List of Tables

1. Methods of analysis used in each dataset

2. Overall results: Mean of each of the outcomes per method of analysis

3. Overall results: Mean of each of the outcomes per method of analysis using a weak correlation

4. Overall results: Mean of each of the outcomes per method of analysis using a moderate correlation

5. Overall results: Mean of each of the outcomes per method of analysis using a strong correlation

6. Comparison of the methods in a trial with 2 teeth and 20 patients

7. Comparison of the methods in a trial with 4 teeth and 20 patients

8. Comparison of the methods in a trial with 10 teeth and 50 patients

9. Comparison of the methods in a trial with 4 teeth and 20 patients where there is random variability in the treatment effect across patients

10. Comparison of the simple and complex methods of data analysis in split-mouth trials

List of Figures

1. Healthy versus periodontitis

2. Structure of the data from split-mouth trials: nested nature of the data

3. Structure of the data from split-mouth trials: measurements at baseline and after the treatment

4. Key aspects to consider when planning, executing and reporting a simulation study

5. Steps performed in this study

6. Schematic of the correlations between pairs of teeth

7. Power versus treatment effect in a scenario with 20 patients and 2 teeth

8. Power versus treatment effect in a scenario with 20 patients and 4 teeth

9. Power versus treatment effect in a scenario with 50 patients and 10 teeth

10. Type-I error rate versus number of patients in a scenario with 4 teeth

11. Confidence interval coverage versus treatment effect in a scenario with 4 teeth and 20 patients

12. Confidence interval width in a scenario with 4 teeth, 20 patients and a treatment effect of 0.6

13. Power and type-I error rate versus treatment effect in a scenario with 4 teeth and 20 patients

14. Power versus treatment effect in a scenario with 4 teeth and 20 patients when treatment effect varied from subject to subject

15. Type-I error rate versus number of teeth in a scenario with 20 patients when treatment effect varied from subject to subject

16. Confidence interval coverage versus number of teeth in a scenario with 20 patients

17. Confidence interval width versus number of teeth in a scenario with 20 patients

18. Comparison of the type-I error rate and power of the methods of analysis in a scenario with 4 teeth and 20 patients

19. Power versus number of teeth in a scenario with 20 patients and a treatment effect equal to 0.3

List of Appendices

1. R programs used to generate the data

2. R programs used to analyze the data

3. The power of the statistical methods increases with the number of teeth

4. The differences between baseline and post treatment differences in the treated and untreated sides are not correlated

1 Introduction

Randomized controlled trials usually use the individual as the unit of randomization; however, there are designs in which the unit of randomization is larger and others in which it is smaller [1]. In cluster-randomized trials, groups of people are randomly allocated to receive different interventions [2]. On the other hand, some trials use body parts as the unit of randomization, for example, limbs [3] and eyes [4-6]. In the field of dentistry, these trials are called split-mouth trials.

A split-mouth trial is a study design in which divisions of the mouth within a patient constitute the experimental units randomly assigned to treatments [7]. Since their introduction in 1968 [8], they have become very popular in dentistry, particularly in the fields of periodontology and orthodontics [7, 9].

The goal of split-mouth trials is to remove between-subject variability, because patients act as their own controls. Nevertheless, a related downside is that the structure of the data is complex, which makes the data analysis more difficult than in a parallel two-armed trial [10-12]. This difficulty in data analysis has been discussed and explored in other fields of medicine, such as ophthalmology [4-6] and orthopedics [3]; however, split-mouth trials can have as many experimental units as divisions of the mouth, and can include up to 16 teeth per experimental unit, whereas the trials in ophthalmology and orthopedics have only two experimental units. Therefore, the statistical analysis of split-mouth trials presents more challenges than the analysis of trials in medical fields where multiple body parts are used.

There have been reports that split-mouth trials are not well analyzed [7, 9, 13], which weakens or completely removes the advantage of the design. Many methods have been proposed to solve this problem [14-19]; however, there is no evidence supporting the use of any of these methods over the others, nor a description of the appropriateness of the methods that are commonly used.

The results derived from statistical analysis are the basis for drawing conclusions from a study, and clinicians often make decisions based on these results. Since inappropriate statistical methods can lead to inappropriate clinical decisions, it is important to determine to what extent the conclusions of a study would change depending on the choice of the statistical approach.

Simulation studies can be used to assess the appropriateness and accuracy of different analytical techniques [20, 21]. Furthermore, simulation studies are the only way of evaluating the overall operating characteristics when different trial designs or analytical techniques are used [21]. The aim of this thesis is to assess how different statistical methods for data analysis of a computer-simulated split-mouth trial perform under different conditions.

The rest of the thesis is laid out as follows. Section 2 presents a literature review with descriptions of the split-mouth design, the field in which the simulated study will take place, the statistical analysis of split-mouth trials with continuous outcomes, and how simulation studies are done. Sections 3 to 5 explain the rationale for doing this study and state its research question and objectives. Section 6 reports the methods of this study. Section 7 presents the results of the simulations and Section 8 discusses these results and their implications. Section 9 lists the conclusions of this study and recommendations for further study.

2 Literature review

2.1 Split-mouth randomized controlled trials

A split-mouth trial is a study design in which anatomical regions of n subjects’ mouths are divided into p homogeneous within-patient experimental units. Subsequently, each of the p treatment modalities is randomly assigned to one within-patient experimental unit [11]. This design was introduced by Ramfjord et al. [8] in 1968, who compared the efficacy of two periodontal treatments by randomly allocating the treatment methods to half of each patient’s dentition, divided by the mid-sagittal plane. However, since the dentition can be divided into many different segments, different split-mouth configurations can be found. In fact, it has been reported that researchers have randomly allocated treatments to full or half contralateral sides, diagonal quadrants, all quadrants, contralateral or ipsilateral sextants, and the maxillae versus mandible [7, 12].

The main advantage of split-mouth studies is that, since the subjects act as their own controls, inter-subject variability in the average response is removed, increasing the power of the study compared to a design where a subject (whole mouth) is assigned to a treatment [9, 14]. Due to the increased statistical efficiency, fewer patients are needed to detect a given treatment effect [7, 9, 10, 12]. However, it has been claimed that the gain in efficiency only occurs when disease characteristics are symmetrically distributed over the within-patient experimental units and there are enough sites with disease per experimental unit [12]. The need for subjects who meet these criteria threatens the generalizability of results and the rate of patient recruitment [10, 14].
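A short worked equation may make the source of this efficiency gain concrete. It is a sketch under assumptions that are not stated in the text: each side of the mouth contributes one summary response, both sides share the same variance $\sigma^2$, and the two responses from the same patient have correlation $\rho$. The symbols $Y_T$, $Y_C$, $\sigma^2$ and $\rho$ are introduced here for illustration only.

```latex
% Within-patient (split-mouth) comparison: both responses come from the same patient
\operatorname{Var}(Y_T - Y_C)
  = \sigma^2 + \sigma^2 - 2\rho\sigma^2
  = 2\sigma^2 (1 - \rho)

% Between-patient (whole-mouth) comparison: responses come from independent patients
\operatorname{Var}(Y_T - Y_C) = 2\sigma^2

% Ratio of the two variances
\frac{2\sigma^2 (1 - \rho)}{2\sigma^2} = 1 - \rho
```

Under these assumptions, any positive within-patient correlation makes the split-mouth comparison more precise, and the gain disappears as $\rho$ approaches zero. Asymmetric disease distribution, as discussed above, would violate the equal-variance simplification used here.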

A potential disadvantage of the split-mouth design is that it may lead to biased treatment effect estimates, due to carry-across effects [7, 10, 11, 14]. A carry-across effect occurs when the treatment performed in one part of the mouth can affect the treatment responses in other parts of the mouth [7, 11, 14]. If two active treatments are being compared, the treatment effect can be underestimated or overestimated, due to the leakage from one side to the other. Moreover, if the two treatments are administered at the same time, this leakage can go both ways. When an active-placebo controlled trial is performed, the presence of a carry-across effect would lead to an underestimation of the treatment effect [14]. Since the presence of the carry-across effect cannot be detected using statistical tests, a split-mouth trial estimate of a treatment effect can be confounded by the carry-across effect [10, 11, 14]. Therefore, in order to improve the validity of these trials, a priori knowledge should indicate the absence of any carry-across effects [11, 14].

The main issue when using a split-mouth design is the statistical analysis of the results. It has been pointed out that the analysis of split-mouth trials is more complex than the analysis of a whole-mouth design [7, 10, 14]. Authors agree that the essential feature to consider when analyzing split-mouth trials is that the treatment responses within an individual are correlated, and thus, this correlation must be taken into account [7, 9, 10, 14]. However, there is no consensus as to which specific method should be used to analyze the data, and many methods have been proposed to analyze binary [14-18], continuous [7, 9, 10, 14], and survival outcomes [7, 19].

2.2 Periodontal diseases

Periodontal diseases comprise a variety of conditions affecting the health of the periodontium [22], the specialized tissues that surround and support the teeth. They are divided into two main categories depending on the occurrence of attachment loss: gingivitis and periodontitis. Gingivitis is the inflammation of the gingiva, without loss of connective tissue attachment [22, 23]. On the other hand, periodontitis is the presence of gingival inflammation at sites where there has been pathological detachment of collagen fibers from cementum, and the junctional epithelium has migrated apically, in addition to connective attachment loss that has led to the resorption of coronal portions of tooth-supporting alveolar bone [23] (Figure 1).

Moderate generalized periodontitis is one of the most prevalent diseases in our society, affecting a majority of adults [24]. A systematic review of surveys and epidemiologic studies reported that the prevalence of periodontitis in the U.S.A. can range from 24% to 92% depending on the subpopulation, and is as high as 86% in Canada [25].

Periodontal disease diagnoses are made based on a clinical examination, which includes collection of demographic data, medical history, periodontal probe measurements, radiographic findings and other clinical observations. Even though there are other complementary exams available, like assessment of the gingival crevicular fluid and subgingival microflora composition, most practitioners do not use them, since they have not been validated as diagnostic tools and are time consuming and costly [26].

Two probe measurements are used in clinical practice to diagnose periodontal diseases: probing depth and clinical attachment level. Probing depth is the distance from the gingival margin to the base of the probeable crevice. Clinical attachment level is the distance between the cementoenamel junction and the base of the probeable crevice [26] (Figure 1). Since there is no agreement about which of these measurements should be preferred, routine periodontal examination includes both. However, in order to have a more precise diagnosis of the severity of the disease, clinical attachment level measurements are considered [27].

Figure 1: Healthy periodontium versus periodontitis. The left side of the figure shows the anatomy of a healthy periodontium and its main components. The right side of the figure illustrates the presence of chronic periodontitis, where inflammation of the gingiva, loss of connective tissue attachment, and a periodontal pocket, instead of the healthy sulcus, can be observed. A periodontal probe is used to quantify the severity of the disease, by measuring the probing depth (PD) and clinical attachment level (CAL). Illustration from the learning center of the Oral-B website for professionals [28], adapted using the concepts in the book “Clinical Periodontology and Implant Dentistry” [29].

Chronic periodontitis is caused and perpetuated by bacteria living in the biofilm present in supragingival and subgingival plaque [30]. The bacteria from the subgingival plaque invade and attack the tissues using their virulence factors, provoking direct damage and an immune response responsible for the destruction of the periodontal tissues (immune pathology) [31]. The goals of periodontal therapy are to alter or eliminate the microbial etiology and contributing risk factors for periodontitis, limiting the progression of the disease, and preventing its recurrence. In addition, regeneration of the periodontal tissues may be attempted, where indicated. Control of contributing systemic risk factors, supra- and subgingival scaling and root planing, and elimination of local risk factors like occlusal trauma, filling overhangs, over-contoured crowns and ill-fitting prosthetic appliances, are among the main interventions administered to patients diagnosed with chronic periodontitis [32, 33].

Scaling is defined as the removal of plaque, calculus, and stains from the crown and root surfaces of the teeth. Root planing is a treatment procedure designed to remove cementum or surface dentin that is rough, impregnated with calculus, or contaminated with toxins or microorganisms [34]. Scaling and root planing have been demonstrated to be effective for reducing clinical inflammation, causing a microbial change to a less pathogenic subgingival flora, decreasing probing depth, gaining clinical attachment and lessening disease progression [34-48]. To date, they remain the main interventions applied to patients with periodontitis. Current clinical trials focus on the efficacy of different methods for administering these interventions [49-57], and their combination with local antibiotics or antiseptics [58-64].

The desired outcomes of periodontal therapy in patients with moderate and severe chronic periodontitis include reduction of clinical signs of inflammation, reduction of detectable plaque to a level compatible with gingival health, reduction of probing depths, and stabilization or gain of clinical attachment [32, 33].

2.3 Split-mouth trials in periodontal diseases

The split-mouth design has been used in many fields in dentistry. A bibliometric study of the split-mouth trials published in journals indexed in PubMed during 2004 showed that 32% of them were done in the field of periodontology, 29% in orthodontics, 18% in cariology and 21% in the remaining fields [7].

Split-mouth trials may provide an efficient research tool for periodontal research if proper selection criteria are used to identify a patient population where the design is appropriate and validity issues are addressed [12]. The design is very attractive due to the symmetry of the mouth and the generalized nature of the disease, which may explain why most split-mouth trials are performed in the field of periodontology.

In 1990, Hujoel and Loesche studied the efficiency of split-mouth designs in the field of periodontology [12]. With the aims of investigating the similarity and suitability of 7 different types of split-mouth designs, according to the segments of the mouth used, and of determining the relative efficiency of split-mouth designs as compared to whole-mouth designs, they conducted a clinical trial in 69 patients with advanced periodontitis. All patients received a full-mouth treatment, and the data were analyzed as if they came from different types of trials, e.g., a full-mouth trial, and split-mouth trials dividing the mouth in halves, quadrants, and sextants. The authors found significant differences between the experimental units with respect to the amount, distribution and severity of periodontal disease. Even though pockets with probing depth over 6 mm were distributed asymmetrically for a significant proportion of patients for all types of split-mouth designs, pockets with probing depth from 4 to 6 mm were symmetrically distributed for most types of split-mouth designs, except when the mouth was divided in sextants. In addition, it was observed that division of the mouth into two experimental units was more efficient than divisions into more than two. The authors concluded that when disease characteristics are symmetrically distributed over the within-patient experimental units, the split-mouth design provides a moderate to large gain in relative efficiency, compared to a whole-mouth design.

Although true clinical endpoints like tooth loss or pain due to periodontal abscesses should be preferred in periodontal clinical trials [65], these outcomes represent very advanced or very unusual conditions when considering the signs, symptoms and progression of periodontal disease.

Therefore, authors often use surrogate outcomes to determine the effectiveness of periodontal therapies, typically the anatomical measures pocket depth and attachment level [65, 66]. Both parameters are usually measured to the nearest millimeter using graduated periodontal probes; therefore, they are continuous outcomes. However, some studies report that these outcomes were measured to the nearest half-millimeter [67-69] or to the nearest 1/10 millimeter [70].

2.4 Statistical analysis in split-mouth trials

The first authors who reported problems with the design and statistical analysis of split-mouth trials were Hujoel and Moulton, in 1988 [9]. They studied 22 split-mouth trials published in the field of periodontology, and found that only 5 of them had used appropriate methods to analyze the results. They reported that, of the remaining trials, 7 used one-way analysis of variance (ANOVA), 5 used the two-sample t-test and the other 5 did not mention the statistical test used. In 1997, Lesaffre et al. [7] reported that 19 out of 34 split-mouth trials published in all areas of dentistry used appropriate analytical statistical methods, and only one presented an appropriate sample size calculation. Among the inappropriate tests used for analyzing continuous data were the one-way ANOVA, the analysis of covariance and the Mann-Whitney test. Both studies considered that an appropriate analysis had been performed only if a statistical test that accounted for the paired nature of the data had been used; however, there are many other factors to consider.

Three main aspects should be considered when planning the statistical analysis of a split-mouth trial: (1) the nested nature of the data within a division of the mouth (split), (2) the nested nature of the data within a subject, and (3) the different approaches to analyze pre-post treatment data.

2.4.1 Nested nature of the data within a split

No matter which particular split-mouth design is chosen, usually there will be more than one tooth in each split. Using as an example a clinical scenario in which a split-mouth trial is undertaken to compare the efficacy of two periodontal treatments, with the splits determined by a line in the mid-sagittal plane (the most common combination), a subject with a complete dentition has 16 teeth on the right side and 16 teeth on the left side (Figure 2). In addition, a routine periodontal examination to assess the amount of periodontal damage consists of recording the probing depth and clinical attachment level at six sites around each tooth [71]. Therefore, in a patient with complete dentition there are 96 measurements of each parameter per split. As a consequence, many options arise when deciding how to analyze these data.

Two main approaches can be taken when dealing with data from one split: to use all the measurements separately (full data), or to average the measurements (collapsed data). The first approach is simple; however, there is a considerable amount of data per split. The second approach seems easy as well; nevertheless, the variances for the resulting treatment effect estimates can vary according to the method chosen for averaging the data.

Even though it has been suggested that the data obtained from individual sites should be summarized within experimental units, and the statistical analysis should use these summaries [72], the published literature shows that authors use many different approaches. Examples of these approaches are using the mean per tooth [73], using only one site representing the measurements of a tooth [74], using the full data considering each site [75], using the mean of the site measurements per experimental unit [76, 77], and using the percentage of sites with a measurement higher or lower than an arbitrary value [78-80].
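The approaches listed above can be sketched in a few lines of code. The following is an illustrative Python sketch only, not the R code used in this thesis; the data values, the dictionary layout and the function names (`full_data`, `mean_per_tooth`, `mean_per_split`, `percent_sites_above`) are hypothetical and chosen for the example.

```python
import statistics

# Hypothetical probing depths (mm) for one split: tooth (FDI number) -> six site measurements.
split = {
    "16": [3, 4, 3, 5, 4, 3],
    "11": [2, 2, 3, 2, 3, 2],
}

def full_data(split):
    """Full data: every site measurement is kept as its own observation."""
    return [depth for sites in split.values() for depth in sites]

def mean_per_tooth(split):
    """Collapse the six sites of each tooth into one tooth-level mean."""
    return {tooth: statistics.mean(sites) for tooth, sites in split.items()}

def mean_per_split(split):
    """Collapse all site measurements in the split into a single mean."""
    return statistics.mean(full_data(split))

def percent_sites_above(split, threshold):
    """Percentage of sites with a measurement above an arbitrary threshold (mm)."""
    depths = full_data(split)
    return 100 * sum(d > threshold for d in depths) / len(depths)

print(len(full_data(split)), "site-level observations")
print(mean_per_tooth(split))
print(mean_per_split(split))
print(percent_sites_above(split, 3))
```

Each function returns a different unit of analysis from the same raw measurements, which is why the variance of the resulting treatment effect estimate can depend on the collapsing method chosen.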

Figure 2: Structure of the data from split-mouth trials: nested nature of the data. The figure illustrates a split-mouth trial in which the divisions of the mouth are determined by the mid-sagittal plane. Nested data within a split: on both the right and left sides there are 16 teeth, each of them represented by a square. Nested data within a subject: many correlations can be observed between the teeth. The contralateral correlation refers to the correlation between a tooth and its homologous tooth, e.g., tooth 11 with 21. The ipsilateral correlation refers to the correlation between teeth from the same side, e.g., tooth 15 with 13 and 12. The up/down correlation refers to the correlation between a tooth and another from the opposite jaw, e.g., tooth 17 with 47. Teeth nomenclature corresponds to the FDI World Dental Federation [81].

2.4.2 Nested nature of the data within a subject

In the simplest split-mouth design the mouth of a patient is divided in two; however, studies in which the mouth is divided in up to six splits are commonly found in the literature [7]. As a consequence, different numbers of units of observation are nested within a patient.

There is some literature suggesting that periodontal disease is symmetrical and that teeth within a mouth are correlated with each other [73, 82-84]. In 1987, Fleiss et al. studied the correlation between all the combinations of pairs of teeth using the Ramfjord teeth [84], a partial recording protocol that consists of measuring teeth 16, 11, 24, 36, 31, and 44 [85]. They found correlations from 0.1 to 0.62 for probing depth measurements and from 0.47 to 0.82 for clinical attachment level measurements. In 2001, Mombelli and Meier [82] reported correlations ranging from 0.37 to 0.5 between probing depths recorded on the right and left side of the mouth, in patients with moderate to advanced periodontal disease. In 2011, Darby et al. [73] showed that in patients with generalized severe chronic periodontitis, there is no evidence that homologous teeth have different probing depths. The latter two studies reported that other clinical parameters such as attachment level, recession, bleeding on probing and furcation involvement had similar distributions between the right and left sides of the mouth. As a result, there is evidence that supports the claim of many authors regarding the need to account for the correlation of the data when analyzing split-mouth trials (Figure 2).

Different methods have been proposed to analyze split-mouth trials with continuous outcomes. The simplest analysis recommended is a paired t-test or the two-way ANOVA, which are equivalent in the case of two measurements per subject when one factor is the subject and the other is the treatment received [9]. More complex analyses involve the use of statistical modeling, using repeated measures ANOVA [10], mixed-effects ANOVAs, or generalized estimating equations [7, 14]. A brief description of the main approaches used in the literature can be found below.

a. Paired t-test

The paired t-test is a method of analysis for paired samples, for example, when observations from two independent samples are matched into pairs by some covariates or when there is self-pairing at one time point or at different times. It uses as the variable of interest the difference in measurements between each pair and it calculates the mean and standard error of these differences [86, 87]. In a split-mouth trial, the paired t-test will calculate the difference in the outcome of interest between the experimental and the control side of the mouth in each patient and then it will average these differences and calculate the standard error of the difference. When there is more than one tooth in each side of the mouth, an intuitive extension of the paired t-test will calculate the difference in the outcome of interest between homologous teeth of the experimental and control side of the mouth and then will average these differences.
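The calculation described above can be sketched as follows. This is an illustrative Python example, not the R code used in this thesis; it assumes one summary value per side for each patient, and the function name `paired_t` and the data are hypothetical.

```python
import math
import statistics

def paired_t(treated, control):
    """Return (mean difference, standard error, t statistic, degrees of freedom)."""
    if len(treated) != len(control):
        raise ValueError("each patient needs one value per side")
    # Within-patient differences: experimental side minus control side.
    diffs = [t - c for t, c in zip(treated, control)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    se = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, se, mean_diff / se, n - 1

# Hypothetical probing depths (mm), one value per side for five patients.
treated = [3.1, 2.8, 3.5, 3.0, 2.6]
control = [3.6, 3.1, 3.9, 3.2, 3.1]

mean_diff, se, t_stat, df = paired_t(treated, control)
print(f"mean difference = {mean_diff:.2f} mm, SE = {se:.3f}, t = {t_stat:.2f}, df = {df}")
```

The t statistic would then be compared against a t distribution with n - 1 degrees of freedom. For the extension to several teeth per side, one option is to pass the differences between homologous teeth instead, although doing so treats teeth from the same patient as independent pairs.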

b. Repeated measures ANOVA

This statistical technique is an extension of the paired t-test to the situation when a subject or unit of analysis has more than two measurements. In the setting of multiple measurements over time in a subject, it allows an assessment of whether the treatment groups show different responses over time, that is, it evaluates the presence of a treatment by time interaction effect. This analysis includes data from all time points. Downsides of this analysis are its vulnerability to large effects from missing values (when there is a missing observation, the patient has to be excluded from the analysis or the value has to be imputed, which is very likely to introduce bias) and that observations on all patients need to be made at the same time points [88].

c. Mixed effect models

The term ‘mixed effects models’ embraces models with fixed and random effects, covariance pattern models, and combinations of these. Mixed effect models can be used to analyze correlated data, such as longitudinal data, repeated measures data, clustered data and multivariate data [89]. Among the advantages of this technique are that it uses all available data (that is, the patient is not excluded from the analysis when there is a missing observation); it is unbiased in the presence of data missing at random; it has flexibility in the approaches to modeling a time effect; and it allows the use of realistic but parsimonious variance and correlation patterns. Disadvantages of the mixed effect models are their complexity and the difficulty of assessing violations of assumptions [88]. It has been recommended that, when using this approach to analyze data from split-mouth studies, the subjects and measurements should be treated as random effects and the treatment and site should be treated as fixed variables [14].

d. Generalized estimating equations

This extension of the generalized linear model was proposed by Liang and Zeger in 198690. It provides a method for the analysis of correlated dependent variables that are normally or non-normally distributed91. The generalized estimating equation methodology involves fitting a generalized linear model to the marginal distribution of the repeated outcomes92 and estimating the standard errors of the parameter estimates using an approach that accounts for their correlation. Therefore, robust estimates of standard errors are obtained, which leads to correct inferences91, 93.


2.4.3 Approaches to analyze pre-post treatment data

In the simplest design of a split-mouth trial in periodontology there are at least two measurements over time, the measurement before a treatment is administered, and a measurement performed after the treatment could have had an effect (Figure 3). Moreover, there are trials in which the measurements are performed more than once after the treatment administration. Therefore, the data analysis should account for this.

In 1989, as a response to the complexity of the methods described in the literature to analyze data from studies with more than one measurement over time, Matthews et al.94 brought to researchers’ attention the possibility of using summary measures of the repeated measurements. With this method, a summary measure that represents the response to a treatment of each patient is chosen, and then a simple two-group comparison is done.


Figure 3: Structure of the data from split-mouth trials: measurements at baseline and after the treatment. The figure illustrates that, besides the nested nature of the data, measurements are performed at baseline and after the treatment, which adds another correlation to consider when analyzing the data (pre-post correlation).

Even with this simplification, since in many clinical trials the goal is to evaluate the average response to a treatment over time93, 95, the analyst is faced with the following choices for the method of analysis: 1) using the mean for each patient post-treatment measurements as the summary measure, 2) using the difference between the mean pre-treatment measurement and the mean post treatment measurement for each patient as the summary measure, and 3) using the mean baseline measurement for each patient as a covariate in a linear model that compares the post-treatment means95.

The t-test and the two-way ANOVA are the traditional methods to perform the analysis using the first and second approaches, and the ANCOVA is the statistical method that allows use of the baseline measurements as a covariate88. It has been reported that the ANCOVA is superior to the other two approaches because it always has a smaller variance95, 96.

In addition, more complex methods that do not use summary measures have been described to analyze pre-post treatment data, such as repeated measures ANOVA, multivariate ANOVA, multilevel models, mixed effects models, and marginal models88, 92, 95. Reports from the late 1980s showed that these methods were rarely used in medical journals and that the data analyses were purely descriptive97, 98; however, it has been suggested that their use has increased over time88, 92.

2.5 Choice between the different methods and potential consequences of an inappropriate analysis

When deciding what method and approach will be used to analyze data from split-mouth trials, a starting point is the decision on what will be the unit of analysis. In 1985, Blomqvist discussed the experimental and observational units in studies in periodontology and illustrated the consequences of an incorrect choice of unit of analysis99. The experimental unit refers to those units that are randomized, as opposed to the observational unit, which is the most elementary unit from which information is available100. Using these definitions, in a split-mouth randomized trial the experimental unit is the side or portion of the mouth randomized to receive one of the treatments under evaluation, whereas the observational units are the sites or teeth in which an outcome is measured.

Since the experimental unit is what was randomized, it is assumed that it was randomly chosen from some population, which means that only inferences regarding the experimental unit can be made, and thus, this should be the unit of analysis100. If correlated observational units, as in the case of split-mouth trials, are used as the unit of analysis, an erroneous standard error is obtained when applying standard formulas to the individual measurements (because the sample size used in the calculations is too large); as a result, standard errors and p-values can be underestimated99. An incorrect choice of unit of analysis may result from the selection of full data or collapsed data, or from the selection of a statistical method that treats the observational units as if they are independent when they are not.

In addition, other issues arise when analyzing correlated data as if it were independent. Some of these issues can be seen through an examination of the variances of the means of correlated values and differences in these means. First, the variance of the mean of k observations on one side of the mouth sharing a correlation ρ is larger than σ²/k, the variance of the mean of k independent observations, as shown in the following formula

$$\operatorname{Var}(\bar{X}_{k}) = \frac{\sigma^{2}\,[1 + (k - 1)\rho]}{k}$$

If k teeth on each of n patients are averaged together, the variance of this mean is

$$\operatorname{Var}(\bar{X}_{nk}) = \frac{\sigma^{2}\,[1 + (k - 1)\rho]}{nk}$$
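As a numerical illustration of the two variance formulas above, the following Python sketch evaluates them for one set of values. The function name `var_mean_correlated` and the inputs (σ = 0.6, k = 5, ρ = 0.4) are illustrative assumptions, not results from the thesis.

```python
def var_mean_correlated(sigma, k, rho, n=1):
    """Variance of the mean of k equicorrelated values per patient, averaged over n patients.

    Implements sigma^2 * [1 + (k - 1) * rho] / (n * k); patients are assumed independent.
    """
    return sigma ** 2 * (1 + (k - 1) * rho) / (n * k)

sigma = 0.6  # assumed standard deviation of a single measurement
independent = var_mean_correlated(sigma, k=5, rho=0.0)  # reduces to sigma^2 / k
correlated = var_mean_correlated(sigma, k=5, rho=0.4)
inflation = correlated / independent  # equals 1 + (k - 1) * rho

print(independent, correlated, inflation)
```

With these inputs the variance of the mean is inflated by the factor 1 + (k − 1)ρ = 2.6 relative to the independent case, which is why treating the k teeth as independent understates the standard error.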

Second, the variance of the estimated difference between experimental and control sides can be overestimated. Because the sides within a patient are correlated, the variance of the difference between the means of the two sides is not equal to the sum of the variances of each mean. Thus, the computation of the variance should not only add the variances of both sides but also subtract a term that depends on the correlation between the means of the two sides, as shown in the following equation for the variance of the difference between two means with correlation ρ:

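$$\operatorname{Var}(\bar{Y}_{1} - \bar{Y}_{2}) = \operatorname{Var}(\bar{Y}_{1}) + \operatorname{Var}(\bar{Y}_{2}) - 2\rho\sqrt{\operatorname{Var}(\bar{Y}_{1})\operatorname{Var}(\bar{Y}_{2})}$$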

This contrasts with the variance of the estimated difference when data are independent (or when data are treated as if they were independent), which is calculated using the formula

$$\operatorname{Var}(\bar{Y}_{1} - \bar{Y}_{2}) = \operatorname{Var}(\bar{Y}_{1}) + \operatorname{Var}(\bar{Y}_{2})$$


In consequence, if the paired nature of the sides is not considered, the resulting variance will be higher than the true variance, which is obtained when pairing is considered. This is especially important when considering differences in means of positively correlated values, as the correlation between the two means is larger than the correlation between individual values.
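The contrast between the paired and the independent variance of a difference can be checked with a short Python sketch. The helper `var_difference` and the input values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def var_difference(var1, var2, rho=0.0):
    """Variance of the difference between two means whose correlation is rho."""
    return var1 + var2 - 2 * rho * math.sqrt(var1 * var2)

var_side = 0.072  # assumed variance of the mean on each side of the mouth
paired = var_difference(var_side, var_side, rho=0.6)  # pairing taken into account
unpaired = var_difference(var_side, var_side)          # sides treated as independent

print(paired, unpaired)
```

With a positive correlation between the two sides, ignoring the pairing gives a variance (0.144) that is larger than the variance that accounts for it (0.0576), so an unpaired analysis is conservative in this respect.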

A final point relates to the degrees of freedom of statistical tests. For a comparison of means based on two independent samples, where each sample has n independent observations, the t-test has 2n-2 degrees of freedom. In a split-mouth study, the observations in a side and in opposite sides are not independent, so this number of degrees of freedom is too large. A larger number of degrees of freedom results in a smaller ‘critical value’ for the t-test and increases the chance of finding a statistically significant result, all other things being equal.

Considering all of the above, theoretically, the choice of statistical method should account for the paired nature of the data and the unit of analysis. The use of an inappropriate test could bias the p-value from statistical testing in either direction, that is, it could be underestimated or overestimated.

2.6 Simulation studies

Simulation studies use computer-intensive techniques to provide quantitative evidence for the performance of a specific trial design, analytical method or decision rule21. Although their main use has been described in the fields of pharmacology, pharmacometrics and drug development21, 101-103, they can also be used to assess the appropriateness and accuracy of different analytical techniques20, 21. In addition, simulation studies are often the only way of evaluating the overall performance metrics, such as required sample size, type-I error and power, when different trial designs or analytical techniques are used, allowing the assessment of the probability of making correct or incorrect clinical decisions21.

Other methods have been used for studying the performance of different statistical techniques. For example, some authors analyze a real dataset with various techniques and compare the results4, 104, 105. Nevertheless, when using this approach it is not possible to know the true values of the parameters that are being estimated, and thus there is no certainty about the reliability of the results. On the other hand, analytic or algebraic techniques can be used to derive formulas for estimating different parameters and evaluating their performance metrics106-108. However, when some of the methods under comparison are complex and depend on iterative algorithms, it can be unfeasible to use these analytic techniques. Considering this, simulation studies are the best approach to comparing complex statistical methods and evaluating many performance metrics at the same time.

Clinical trial simulations consist in generating trial outcomes for artificial subjects based on given characteristics, utilizing data generation models that use these characteristics to create a dataset, and analyzing these data. The traditional approach is to first specify a fixed and known treatment effect (for example, a Poisson distribution for incidence of a disease where the rate parameter depends on the exposure to a risk factor). The next step is to undertake simulations to create random variable response observations for each of the artificial subjects, based on the given characteristics of the clinical trial (for example, in a study of 200 control and 100 treated patients, either an event time, or censoring at the end of a 2-year follow-up period). This process generates a whole dataset of outcomes and treatment assignments, which is analyzed in exactly the same way that a real data set would be analyzed. For a given set of design parameters (e.g., baseline rates and treatment effects), the only difference between the datasets created (and the results of the analyses) in each of the simulation repetitions is the random variability in the response21. This repeated sampling of data from the same true model mirrors the way that ‘real’ data would vary depending on which particular sample is obtained, so that estimates of the parameters of interest also show the same variability as would be expected in the real world. The data generation models can either reflect biological systems (e.g. mechanistic models) or be based on previous data from other studies and literature reviews (e.g. empirical models)21, 109.

As mentioned above, the performance of multiple statistical methods can be assessed in these studies20. The multiple methods are used to analyze the same dataset, allowing comparisons of performance metrics. Comparisons can be made between the metrics of each method and the data generation model, to evaluate how the specific test performs, and between statistical methods, to determine which of the methods performs best21.
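The idea of analyzing each simulated dataset with more than one method can be sketched as follows. This is a simplified Python illustration, not the code used in the thesis (which was written in R): each patient contributes one value per side, the two sides share an assumed correlation `rho`, and the helper names, seed and parameter values are all hypothetical.

```python
import math
import random
import statistics

def simulate_trial(rng, n_patients, effect, sd, rho):
    """One dataset: a control and an experimental value per patient, correlated within patient."""
    control, experimental = [], []
    for _ in range(n_patients):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        control.append(sd * z1)
        # Mixing z1 and z2 gives correlation rho between the two sides.
        experimental.append(effect + sd * (rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    return control, experimental

def paired_se(control, experimental):
    """Standard error of the mean within-patient difference."""
    diffs = [e - c for c, e in zip(control, experimental)]
    return statistics.stdev(diffs) / math.sqrt(len(diffs))

def unpaired_se(control, experimental):
    """Standard error that treats the two sides as independent samples."""
    n = len(control)
    return math.sqrt(statistics.variance(control) / n + statistics.variance(experimental) / n)

rng = random.Random(2012)  # arbitrary fixed seed so the run can be repeated
paired, unpaired = [], []
for _ in range(500):
    # The same dataset is analyzed with both methods.
    control, experimental = simulate_trial(rng, n_patients=20, effect=0.3, sd=0.6, rho=0.6)
    paired.append(paired_se(control, experimental))
    unpaired.append(unpaired_se(control, experimental))

mean_paired_se = statistics.mean(paired)
mean_unpaired_se = statistics.mean(unpaired)
print(mean_paired_se, mean_unpaired_se)
```

Because both methods see exactly the same datasets, any difference between the two averages reflects the methods rather than sampling variability between datasets.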

Several data generation models can be used in simulation studies. This may be useful to assess the impact of the different statistical methods on different subpopulations. Thus, it is possible to establish what statistical method should be used when analyzing data from each of the subpopulations21.

The literature concerning the planning, execution and reporting of simulation studies describes key issues to consider. Among these issues are the rationale for conducting a simulation study, its objectives, the procedures performed and the assumptions made, the scenarios investigated, and the analyses performed (Figure 4)20, 21, 102, 103.

Figure 4: Key aspects to consider when planning, executing and reporting a simulation study

a. Rationale and objectives of the study

It is very important to explicitly state what the aims of the study are and why the simulation study is the most appropriate design to reach these aims20, 21, 103. These objectives act as a frame that focuses the simulation study, avoiding unnecessary procedures20. For example, Brookes et al.110, 111 performed a simulation study with the aim to “quantify the extent to which subgroup analysis may be misleading”. They describe the rationale for using this design by listing reasons such as the control over the underlying distributions and the nature of the alterations of the parameters, which makes “the interpretation of the results as transparent as possible for the general audience”.

b. Simulation procedures and assumptions

The models used to generate the data, the computational methods used to generate the data, and the level of dependence between the simulated datasets are among the procedures that should be fully described in the report of a simulation study20, 21, 102.

Computer simulation requires the specification of a model with parameters reflecting the situation being simulated. This model should at least approximate a description of a clinical effect of a treatment, when appropriate, which should be stated in the report103. The sources of variability, covariate relationships and uncertainties in parameter estimates should be stated as well21. In addition, it is required to assume a distribution for the data20. Moreover, simulated data must account for stochastic processes such as dropouts and missingness21 when these are present in the clinical setting. For the results to be generalizable and credible, it is crucial that the datasets simulated using the model have some resemblance to reality. Thus, common approaches to generate the model are to use a real data set or published data20, 21. Brookes et al. used different data generation models by specifying four different general scenarios in their investigation of subgroup effects: 1) no overall treatment effect and no subgroup effect, 2) overall treatment effect but no subgroup effect, 3) no overall treatment effect but differential subgroup effect, and 4) overall treatment effect and differential subgroup effect. In addition, they generated binary, continuous and survival outcome data. Moreover, they used different magnitudes of treatment effects and differential subgroups effects110, 111.


The computational methods used to generate the data should be clearly stated as well. These methods include, but are not limited to the software used, the choice of random number generators and the starting seeds20, 21, 102. The code used is useful as well, since programming languages differ across software and the process might not be as transparent as possible if the specific code used is not reported. A detailed description of these methods allows replication of the simulation and facilitates the understanding of all the procedures performed21. For their simulation study, Brookes et al. used FORTRAN, a general purpose programming language. In addition, they repeated a representative sample of the analyses using Stata. They added an appendix with the data generation method110, 111.

Finally, a simulation study involves generating several datasets, which can be fully independent or moderately independent. The generated datasets are fully independent when a totally different set of data is created for each scenario and method of analysis, whereas moderately independent simulation consists of generating a completely different dataset for each scenario but analyzing each of these datasets with all the statistical methods. The latter is the choice when the aim of the study is to compare different statistical methods for the same scenario20.

c. Scenarios investigated

As mentioned above, the scenarios investigated in a simulation study should resemble reality. Therefore, the most common settings should be reproduced and a range of plausible parameter values should be covered20. All the factors influencing the number of scenarios investigated and the methods for evaluating these scenarios must be clearly specified and justified21. For example, when simulating survival data, Brookes et al. used a mean survival time of 36 months and a follow-up period of 60 months110, 111.

d. Analyses performed

The statistical methods used to analyze the generated datasets should be described in detail. In addition, the estimates stored after each round of simulation and how these estimates were summarized have to be explained. Finally, specification of the criteria for assessing the performance of the statistical methods for different scenarios (outcomes of the simulation study), and how these criteria were weighted to make a final decision about the overall performance of the tests and to compare different tests among each other, is required20, 21. Brookes et al. analyzed continuous data using least squares regression techniques for the comparison of means, binary data using logistic regression based on maximum likelihood methods, and survival data using Cox proportional hazard models110. Since they were not interested in comparing the statistical methods, they only used one method to analyze each outcome.


3 Rationale for this study

The choice of an appropriate statistical method for analyzing data from a clinical trial has clinical, ethical and research implications.

Conclusions from a study are often used in the process of clinical decision-making. Inappropriate use of statistical methods leads to incorrect results that may be used in clinical practice112. Thus, it is necessary that conclusions drawn from a study are supported by the data. For this to be possible, it is crucial that the most suitable statistical methods are used to analyze the data113. An appropriate statistical method yields trustworthy p-values (that is, when there is no effect, there should be a 5% chance of getting a p-value below 0.05), produces confidence intervals with the nominal coverage (e.g., the 95% confidence intervals constructed should contain the true parameter 95% of the time), and is powerful.
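These three properties can be estimated by simulation, as in the following Python sketch. It is a simplified illustration under stated assumptions: within-patient differences are drawn directly from a normal distribution, the critical value uses a normal approximation instead of the t distribution (so the rejection rate under the null is expected to sit slightly above 5%), and the function names, seed and parameter values are hypothetical.

```python
import math
import random
import statistics

# Normal approximation to the two-sided 5% critical value (an assumption of this sketch).
Z_975 = statistics.NormalDist().inv_cdf(0.975)

def simulate_differences(rng, n_patients, effect, sd_diff):
    """Within-patient differences (experimental minus control) for one simulated trial."""
    return [rng.gauss(effect, sd_diff) for _ in range(n_patients)]

def confidence_interval(diffs):
    """Approximate 95% confidence interval for the mean difference."""
    mean = statistics.mean(diffs)
    half_width = Z_975 * statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean - half_width, mean + half_width

def rejection_and_coverage(effect, n_reps=2000, n_patients=50, sd_diff=0.5, seed=1):
    """Proportion of trials rejecting 'no effect' and proportion of intervals covering the truth."""
    rng = random.Random(seed)
    rejected = covered = 0
    for _ in range(n_reps):
        low, high = confidence_interval(simulate_differences(rng, n_patients, effect, sd_diff))
        rejected += not (low <= 0 <= high)  # interval excludes zero: p-value below 0.05
        covered += low <= effect <= high     # interval contains the true effect
    return rejected / n_reps, covered / n_reps

type_i_error, coverage_null = rejection_and_coverage(effect=0.0)  # no true effect
power, coverage_alt = rejection_and_coverage(effect=0.3)          # a true effect exists
print(type_i_error, coverage_null, power, coverage_alt)
```

When the true effect is zero the rejection rate estimates the type-I error; when it is non-zero the same rate estimates the power.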

Amongst the ethical concerns arising from an inappropriate analysis of a clinical trial are the misuse of patients and resources, and the publication of incorrect or misleading results. The statistical analysis of a clinical trial is essential for this trial to be ethically conducted. Exposing patients to potential harms or inconvenience, and spending time and resources in performing a clinical trial is useless if the analysis of the results is not done correctly. Besides, errors in the data analysis stage lead to the publication of inaccurate, misleading or incorrect results114.

An inappropriate statistical analysis could lead to larger than acceptable chances of type-I and type-II errors. A type-I error, that is, rejecting the null hypothesis when it is true, may lead to the implementation of treatments that are not effective, which in the case of periodontal trials may result in an increase in burden and costs for the patient. A type-II error, that is, not rejecting the null hypothesis of no treatment effect when a treatment is effective, may lead to the decision of not administering a treatment that actually is effective.

In addition, split-mouth trials that report inaccurate or incorrect results could have an impact on the meta-analysis of systematic reviews. If meta-analyses pool trials with inaccurate results, their results, particularly the confidence intervals, will be unreliable. Therefore, inappropriate conclusions regarding the clinical and statistical significance of the pooled estimate, and a distortion of the estimates of heterogeneity among the trials will result14.

Finally, to realize the benefits of the split-mouth design, there must be application of appropriate statistical techniques in the analysis9. While studies like the reviews published by Hujoel9 and Lesaffre7 have shown that analysis of these studies appears to be suboptimal, there is no certainty that the results of these studies are invalid. Furthermore, to date there is no evidence demonstrating which statistical technique leads to the most correct results or how inappropriate the commonly used analyses are. Therefore, the aim of this study was to determine the performance of each of the methods used and recommended in the literature to analyze split-mouth trials in terms of the treatment effect estimates, confidence interval width, confidence interval coverage, power, and type I error.


4 Research question

What is the most appropriate method to analyze the results from split-mouth trials with continuous outcomes, with regards to the treatment effect estimates, power, type-I error, confidence interval coverage and confidence interval width?


5 Objectives

5.1 General objective

To determine which of the statistical methods used and recommended in the literature for the analysis of the results of split-mouth trials performs best with regard to the treatment effect estimates, power, type-I error, confidence interval coverage and confidence interval width.

5.2 Specific objectives

- To describe the impact of each of the statistical methods used and recommended in the literature for analysis of the results of split-mouth trials on the treatment effect estimates, power, type-I error, confidence interval coverage and confidence interval width.
- To determine the impact of the true treatment effect, correlation among the measurements and sample size on the performance of the statistical methods used and recommended in the literature for analysis of the results of split-mouth trials.
- To compare the performance of the statistical methods used and recommended in the literature for analysis of the results of split-mouth trials.


6 Methods

6.1 Study design

We performed a simulation study.

Two factors were varied: (1) the statistical method of analysis, and (2) the underlying study characteristics.

The statistical methods of analysis used were the t-test for independent samples, the t-test for paired samples, the Wilcoxon signed-rank test, the Wilcoxon rank-sum test, ANCOVA, mixed effects ANOVA and generalized estimating equations. These tests were chosen based on the techniques of analysis of split-mouth trials with continuous outcomes commonly found in the literature and the recommendations made by authors who have published articles regarding this topic7, 9, 10, 14.

The underlying study characteristics varied were the overall treatment effect, correlations between pairs of teeth, and the sample size for the number of patients and the number of teeth. Further details are given below in section 6.3.1.

6.2 Rationale for using a simulation study

This design was chosen because it allows the assessment of the appropriateness and accuracy of different statistical methods in relation to a known truth20.

In addition, this design allows control over the underlying distributions of the data and study characteristics. Moreover, more than one parameter can be varied at once; therefore, the influence of each of the characteristics on the results can be studied. As a consequence, this is the design that allows achieving the objectives of this study110.


6.3 Procedures and assumptions

6.3.1 Models to generate the data

We simulated a superiority two-arm split-mouth randomized clinical trial comparing two active interventions in patients with chronic moderate periodontitis. The interventions were randomly assigned to each half of a patient’s dentition, as determined by the mid-sagittal plane, which is the most common division used in split-mouth periodontal trials7. The outcome of interest was probing depth, measured at baseline and after the treatment. A single value per tooth was used at each time, representing the mean of the six probing depth measurements usually performed on each tooth. The values of probing depth were considered as nested within the side of the mouth, and the sides as nested within a patient. Therefore, there was some correlation in the disease characteristics among all teeth: not only were teeth within a side correlated, but evidence also suggests a high correlation between a tooth and its contralateral tooth73, 82.

The parameters and trial characteristics considered to be important for this simulation were:

a. True overall treatment effect: difference between the effects of the treatments in each side of the mouth (i.e. effect of scaling and root planing versus effect of scaling and root planing plus an adjuvant therapy). The treatment effect has an impact on the clinical and statistical significance of the results, and on the power of the statistical tests.

b. Standard deviation of the measurements: this parameter determines the variability of the measurements when generating the data, and when it is considered relative to the treatment effect, it has an impact on the statistical significance of the results.

c. Correlation between pairs of teeth: according to different authors, the correlation among the measurements is the essential feature to consider for choosing the statistical method of data analysis. Tests that consider the correlated or paired nature of the data should perform better than tests that treat the data as independent7, 9, 10, 12, 14.

d. Sample size: the sample size has an impact on the statistical significance of the results, the confidence interval widths, and on the power of the statistical methods. In this study, two sample sizes were considered: the total number of patients in the trial and the number of teeth per division of the mouth.

The following initial assumptions were made:

- The values of probing depth in one subject at baseline and after treatment follow a multivariate normal distribution
- There is no systematic difference between the right and left sides of the mouth in the population
- The effect of the treatment is constant across teeth and across subjects

In addition, in section 6.4 we explored the impact of the methods of data analysis on the results when relaxing the last assumption and assuming instead that there is random variation in the treatment effect across subjects. This assumption was only explored in a subset of scenarios chosen to illustrate the impact of the methods of data analysis more clearly.

6.3.2 Computational methods

The simulations were performed using the software R115, version 2.12. The function mvrnorm116, from the MASS package117, was used to generate the probing depths at baseline and after treatment, assuming a multivariate normal distribution. The starting seed was arbitrarily chosen and kept fixed for each group of simulations. The code used for programming and performing the simulations can be found in Appendix 1.
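For readers unfamiliar with multivariate normal generation, the following Python sketch shows the general idea with a Cholesky factorization and a fixed seed. It is not the thesis code (which used mvrnorm in R and appears in Appendix 1): the helper names, the seed, the means of 5.0 and 4.0 mm and the two-variable setting (baseline and post-treatment depth of a single tooth) are assumptions made for illustration.

```python
import math
import random

def cholesky(matrix):
    """Lower-triangular factor L with L times its transpose equal to matrix (positive definite input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def sample_mvn(rng, means, covariance, n_draws):
    """Draw n_draws vectors from a multivariate normal with the given means and covariance."""
    lower = cholesky(covariance)
    draws = []
    for _ in range(n_draws):
        z = [rng.gauss(0, 1) for _ in means]  # independent standard normal values
        draws.append([m + sum(lower[i][k] * z[k] for k in range(i + 1))
                      for i, m in enumerate(means)])
    return draws

sd, rho_wd = 0.6, 0.6  # assumed measurement SD and within-tooth correlation over time
covariance = [[sd ** 2, rho_wd * sd ** 2],
              [rho_wd * sd ** 2, sd ** 2]]
rng = random.Random(42)  # arbitrary fixed seed, so the same data are generated on every run
draws = sample_mvn(rng, means=[5.0, 4.0], covariance=covariance, n_draws=5000)
print(draws[0])
```

A Cholesky factorization requires a positive definite covariance matrix; this differs from mvrnorm, which works from an eigendecomposition.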

In order to speed up the simulations, the package doMC118 was utilized to make use of multicore processors of the computer in which the simulation was run.


6.3.3 Level of dependence between simulated datasets

Different sets of datasets were generated for each of the scenarios investigated. Each dataset was analyzed using all the statistical methods under comparison (moderately independent datasets)20.

6.4 Scenarios investigated

A total of 315 scenarios were simulated for studying the impact of the data analysis method on the results using the main assumptions. These scenarios were formed by making all possible combinations of the values for each of the parameters and characteristics of the trials. Figure 5 illustrates how each of the scenarios was configured.

Figure 5: Steps performed in this study. A different scenario was created for each of the possible combinations of the parameters and trial characteristics, thus a total of 315 scenarios were explored. After specifying these parameters and characteristics for one scenario, a dataset was created and it was analyzed using each of the 24 methods. This process was repeated 3000 times. Finally, the outcomes were calculated for each of the clinical scenarios.


The values used for each of the parameters were:

a. True overall treatment effect: values relative to the standard deviation (SD) of the measurements were used. The 7 values used were 0, 0.2SD, 0.5SD, 0.8SD, 1SD, 1.2SD and 1.5SD. These represent small, medium, large and a number of very large effect sizes according to the usual criteria119, 120.

b. Standard deviation of the measurements: it was set at 0.6 in both the experimental and control groups, at both baseline and follow-up. This value was chosen based on published studies regarding periodontal measurements121-126.

c. Correlation between pairs of teeth: different matrices were used. The correlations ranged from values representing weak to strong correlation. It was not possible to obtain data from a clinical study to calculate these correlations. Furthermore, literature regarding correlation between pairs of teeth of right versus left side of the mouth was scarce, and it did not provide enough information to create a matrix of correlations with specific values for each pair of teeth. Three types of correlations were present considering the design of the study: 1) correlation between teeth at one time, 2) correlation between teeth at different times, and 3) correlation within a tooth at different times. Note that the correlation between a tooth at any given time and that same tooth at the same time is by definition equal to 1. Figure 6 shows a schematic of the matrix of correlations.


Figure 6: Schematic of the correlations between pairs of teeth. The matrix illustrates the correlations of one of the scenarios used for this simulation study (two measurements per tooth, two teeth per split and two splits). Three types of correlations can be observed: 1. Correlation within a tooth at different times (ρwd), 2. Correlation between teeth at one time point (ρbs), and 3. Correlation between teeth at different times (ρbd).

No matter what the actual values of the different correlations are, the correlation within a tooth at different times (i.e., of a tooth with itself) should be the largest in every scenario. The correlation between teeth at one time point should be the second highest, because the teeth are clustered within a side of the mouth. Finally, the smallest correlation is the correlation between different teeth at different times.

Considering the above, the following correlation matrices were set:

i. Weak correlations: ρbs = 0.0, ρbd = 0.0, ρwd = 0.4

ii. Moderate correlations: ρbs = 0.4, ρbd = 0.2, ρwd = 0.6

iii. Strong correlations: ρbs = 0.6, ρbd = 0.4, ρwd = 0.8


Where ρbs is the correlation between teeth at the same time, ρbd is the correlation between teeth at different times, and ρwd is the correlation within a tooth at different times.
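One way to assemble a correlation matrix from these three values is sketched below in Python. It assumes, for illustration only, that a single ρbs applies to every pair of distinct teeth at the same time and a single ρbd to every pair of distinct teeth at different times, regardless of side; the function name `correlation_matrix` and the ordering of rows are hypothetical and may not match the layout of Figure 6.

```python
def correlation_matrix(n_teeth, rho_bs, rho_bd, rho_wd, n_times=2):
    """Correlation matrix with rows and columns ordered by time, then by tooth."""
    labels = [(time, tooth) for time in range(n_times) for tooth in range(n_teeth)]
    matrix = []
    for time_a, tooth_a in labels:
        row = []
        for time_b, tooth_b in labels:
            if tooth_a == tooth_b:
                # Same tooth: 1 at the same time, rho_wd across times.
                row.append(1.0 if time_a == time_b else rho_wd)
            else:
                # Different teeth: rho_bs at the same time, rho_bd across times.
                row.append(rho_bs if time_a == time_b else rho_bd)
        matrix.append(row)
    return matrix

# (rho_bs, rho_bd, rho_wd) for the three settings listed above.
SCENARIOS = {"weak": (0.0, 0.0, 0.4), "moderate": (0.4, 0.2, 0.6), "strong": (0.6, 0.4, 0.8)}

# Two teeth per side and two sides give 4 teeth, hence an 8 x 8 matrix for 2 time points.
moderate = correlation_matrix(4, *SCENARIOS["moderate"])
for row in moderate:
    print(row)
```

A matrix built this way should still be checked before use, since not every combination of the three correlations yields a positive definite matrix.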

d. Sample size: The numbers of patients used were 10, 20, 30, 40 and 50; and the numbers of teeth per side were 1, 2 and 5. These values were taken as representative of typical study sizes in split mouth trials included in systematic reviews of periodontal therapies127-129.

In the additional set of simulations exploring the assumption that there is a random variation in the treatment effect across subjects, a total of 54 scenarios were simulated using the following subset of parameters and creating all possible scenarios by combining them:

a. True overall treatment effect: 9 different values were used, relative to the standard deviation; these were 0SD, 0.1SD, 0.2SD, 0.3SD, 0.4SD, 0.5SD, 0.6SD, 0.7SD and 0.8SD

b. Standard deviation of the measurements: the same value mentioned above

c. Correlations among the measurements: weak and strong correlations were used

d. Sample size: we used 20 patients and 1, 2 and 5 teeth per side.

e. Standard deviation of the random variability of treatment effect per patient or tooth: this value was used to generate datasets with an added variability in the treatment effect due to random differences among patients or teeth. The value used for both cases was 0.15.
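The scenario counts follow from taking all combinations of the listed values, which can be verified with a short Python sketch. The variable names are illustrative; the values are those given in this section.

```python
from itertools import product

# Main simulations: effects (in SD units), correlation settings, patients, teeth per side.
main_grid = list(product(
    [0, 0.2, 0.5, 0.8, 1, 1.2, 1.5],
    ["weak", "moderate", "strong"],
    [10, 20, 30, 40, 50],
    [1, 2, 5],
))

# Additional simulations with random variation in the treatment effect.
additional_grid = list(product(
    [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    ["weak", "strong"],
    [20],
    [1, 2, 5],
))

print(len(main_grid), len(additional_grid))
```

The products are 7 × 3 × 5 × 3 = 315 and 9 × 2 × 1 × 3 = 54, matching the totals stated in the text.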

6.5 Analyses performed

6.5.1 Full data versus collapsed data

All the datasets generated were analyzed using full data and collapsed data. Full data consisted of one value of probing depth per tooth, assuming that this value represented the mean of the up to 6 measurements that can be taken on each tooth. Collapsed data consisted of the mean probing depth per side per patient (Figure 5).
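The collapsing step can be illustrated as follows. This is a minimal sketch in which the record layout (patient, side, tooth, probing depth), the function name and the example values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def collapse(records):
    """Average tooth-level probing depths into one value per (patient, side)."""
    groups = defaultdict(list)
    for patient, side, tooth, depth in records:
        groups[(patient, side)].append(depth)
    return {key: mean(values) for key, values in groups.items()}


# Full data: one probing depth per tooth (made-up values)
full = [
    (1, "treated", "t1", 3.0), (1, "treated", "t2", 4.0),
    (1, "control", "t1", 5.0), (1, "control", "t2", 6.0),
]
collapsed = collapse(full)
print(collapsed)
```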

6.5.2 Statistical methods to evaluate

A total of 24 different approaches to analyze the generated data were used (Figure 5). Table 1 displays a summary of all the analyses performed in each dataset. Detailed descriptions for each of the methods are given in sections 6.5.2.1 to 6.5.2.7.

Table 1: Methods of analysis used in each dataset

| Data structure | Approach to analyze repeated measurements data | Method |
| --- | --- | --- |
| Full data | Using only post-treatment measurements | 1. Unpaired t-test |
| | | 2. Paired t-test |
| | | 3. Wilcoxon signed-rank test |
| | | 4. Wilcoxon rank-sum test |
| | Using the difference between post-treatment and baseline measurements | 5. Unpaired t-test |
| | | 6. Paired t-test |
| | | 7. Wilcoxon signed-rank test |
| | | 8. Wilcoxon rank-sum test |
| | Using post-treatment measurements, adjusting for baseline | 9. ANCOVA |
| | Using both pre-treatment and post-treatment measurements | 10. Mixed effects ANOVA simple model |
| | | 11. Mixed effects ANOVA complex model |
| | | 12. GEE model |
| Collapsed data | Using only post-treatment measurements | 13. Unpaired t-test |
| | | 14. Paired t-test |
| | | 15. Wilcoxon signed-rank test |
| | | 16. Wilcoxon rank-sum test |
| | Using the difference between post-treatment and baseline measurements | 17. Unpaired t-test |
| | | 18. Paired t-test |
| | | 19. Wilcoxon signed-rank test |
| | | 20. Wilcoxon rank-sum test |
| | Using post-treatment measurements, adjusting for baseline | 21. ANCOVA |
| | Using both pre-treatment and post-treatment measurements | 22. Mixed effects ANOVA simple model |
| | | 23. Mixed effects ANOVA complex model |
| | | 24. GEE model |

6.5.2.1 Two-independent samples t-test86, 87

The t-test for two independent samples was used to test the hypothesis

H0: µ1 = µ2 (the mean of probing depth of the intervention sides is equal to the mean of probing depth in the control sides), vs.

H1: µ1 ≠ µ2 (the mean of probing depth of the intervention sides is different from the mean of probing depth in the control sides), using a 5% level of significance, and assuming that the probing depths in each group were independent, normally distributed, and that both groups had equal variances.

The effect estimate and its standard error were calculated using the formulas

$$\mathrm{est} = \bar{x}_1 - \bar{x}_2$$

$$\mathrm{se} = s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where s is the pooled standard deviation and n1 and n2 represent the sample size (number of teeth in each group in the full data analyses and number of sides in the collapsed analyses) of the intervention and control groups, respectively.

A test statistic was computed dividing the estimate by the standard error, which follows a central t distribution with n1+n2-2 degrees of freedom when there is no difference in the true means. This information was used to obtain a two-tailed p-value associated with this test and either reject H0 if p<0.05 or not reject H0 if p>0.05.

Finally, the lower and upper limits of the 95% confidence interval were computed using the equations

$$\left(\mathrm{est} - t_{n_1+n_2-2,\,1-\alpha/2}\,\mathrm{se},\;\; \mathrm{est} + t_{n_1+n_2-2,\,1-\alpha/2}\,\mathrm{se}\right)$$

where $t_{n_1+n_2-2,\,1-\alpha/2}$ represents the value of a t statistic with n1+n2-2 degrees of freedom associated with a right tail probability equal to α/2, i.e., the critical value for rejecting the null hypothesis.
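The calculations for the two-independent-samples t-test can be sketched as below. The function name and the data are hypothetical, and the critical value is passed in by the caller (2.447 is the 0.975 quantile of a t distribution with 6 degrees of freedom) because the Python standard library has no t quantile function.

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t(group1, group2, t_crit):
    """Estimate, standard error, t statistic and confidence interval."""
    n1, n2 = len(group1), len(group2)
    est = mean(group1) - mean(group2)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * variance(group1)
                  + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    se = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)
    t_stat = est / se  # compared with a t distribution, n1 + n2 - 2 df
    ci = (est - t_crit * se, est + t_crit * se)
    return est, se, t_stat, ci


# Made-up probing depths for 4 intervention and 4 control units
intervention = [3.1, 2.8, 3.4, 3.0]
control = [3.6, 3.9, 3.5, 4.0]
est, se, t_stat, ci = unpaired_t(intervention, control, t_crit=2.447)
print(est, se, t_stat, ci)
```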

6.5.2.2 T-test for paired samples86, 87

The t-test for paired samples was used to test the hypothesis

H0: Δ=0, vs.

H1: Δ≠0 using a 5% level of significance, and assuming that the data come from paired measurements and are normally distributed, where Δ represents the true mean difference in probing depth between the intervention and control sides. This value was estimated through $\bar{d}$, which in the full data analyses was calculated by taking the difference between the probing depths of homologous teeth and averaging these differences, while in the collapsed analyses it was calculated by taking the difference between the mean probing depths of the intervention and control sides, and averaging these differences.

The standard error of $\bar{d}$ was calculated using the equation

$$\mathrm{se} = \frac{1}{\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}{n-1}}$$

where di represents each of the differences calculated, and n represents the total number of differences.

A test statistic was computed dividing the estimate by the standard error, which follows a t distribution with n-1 degrees of freedom when the true delta is zero. This information was used to obtain a two-tailed p-value associated with this test and either reject H0 if p<0.05 or not reject H0 if p>0.05.

Finally, the lower and upper limits of the 95% confidence interval were computed using the equations

$$\left(\bar{d} - t_{n-1,\,1-\alpha/2}\,\mathrm{se},\;\; \bar{d} + t_{n-1,\,1-\alpha/2}\,\mathrm{se}\right)$$
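The paired calculations can be sketched in the same way. The function name and the data are hypothetical, and the critical value is supplied by the caller (3.182 is the 0.975 quantile of a t distribution with 3 degrees of freedom).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(treated, control, t_crit):
    """Mean difference, standard error, t statistic and confidence interval."""
    diffs = [a - b for a, b in zip(treated, control)]
    n = len(diffs)
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(n)  # sample SD of the differences over sqrt(n)
    t_stat = d_bar / se          # compared with a t distribution, n - 1 df
    ci = (d_bar - t_crit * se, d_bar + t_crit * se)
    return d_bar, se, t_stat, ci


# Made-up probing depths for 4 pairs of homologous teeth (or 4 pairs of sides)
treated = [3.1, 2.8, 3.4, 3.0]
control = [3.6, 3.9, 3.5, 4.0]
d_bar, se_d, t_paired, ci_paired = paired_t(treated, control, t_crit=3.182)
print(d_bar, se_d, t_paired, ci_paired)
```

Only the differences within pairs enter the standard error, so any component of variability shared by both members of a pair cancels out.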

6.5.2.3 Wilcoxon signed-rank test130, 131

The Wilcoxon signed-rank test was used to test the hypothesis

H0: Δ=0, vs.

H1: Δ≠0 using a 5% level of significance, and where Δ represents the true median score difference between the probing depth values of the intervention and control sides. This value was estimated through di, which in the full data analyses was calculated by taking the difference between the probing depths of homologous teeth, while in the collapsed analyses it was calculated by taking the difference between the mean probing depths of the intervention and control sides. Next, the absolute values of these differences were ranked, the sums of the ranks of the positive and of the negative differences were computed, and the smaller of these two sums (ignoring the signs, denoted by T) was used to calculate a z statistic.

The standard deviation was calculated using the equation

$$\mathrm{sd}_T = \sqrt{\frac{n(n+1)(2n+1)}{24}}$$

The result was used to obtain a p-value, assuming that z has a normal distribution with mean zero and the standard deviation above when the median difference is zero.
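The steps of the signed-rank test can be sketched as follows. Several details are assumptions made for illustration rather than taken from the analysis programs: zero differences are dropped, tied absolute differences receive their average rank, and z is formed by centring T at its null mean n(n+1)/4 and dividing by the standard deviation above (the usual normal approximation). The function names and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Rank values from smallest to largest; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def signed_rank(diffs):
    """T, its standard deviation, the z statistic and a two-sided p-value."""
    nonzero = [d for d in diffs if d != 0]  # assumption: zeros are dropped
    n = len(nonzero)
    ranks = average_ranks([abs(d) for d in nonzero])
    positive = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    negative = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    t_stat = min(positive, negative)
    sd_t = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (t_stat - n * (n + 1) / 4) / sd_t   # centred at the null mean of T
    p_value = 2 * NormalDist().cdf(-abs(z))
    return t_stat, sd_t, z, p_value


# Made-up differences between intervention and control
t_stat, sd_t, z, p_value = signed_rank([-0.5, -1.1, 0.2, -1.0, -0.3, -0.8])
print(t_stat, sd_t, z, p_value)
```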

6.5.2.4 Wilcoxon rank-sum test130, 131

The Wilcoxon rank-sum test was used to test the hypothesis

H0: F1(x)=F2(x), vs.

H1: F1(x)=F2(x-Δ) using a 5% level of significance and assuming that the distributions of probing depths in the experimental and control sides follow a similar distribution, and where F represents the cumulative distribution function of each group, obtained by ranking each of the probing depth values from the smallest to the largest, and Δ represents the shift in distribution between the intervention and control groups.

The smaller of the two sums of ranks is denoted by W. Using the equation

$$\mathrm{sd}_W = \sqrt{\frac{n_{F_1}\, n_{F_2}\left(n_{F_1} + n_{F_2} + 1\right)}{12}}$$

a standard deviation of W was computed, which was used to calculate a z statistic and to obtain a p-value.

6.5.2.5 ANCOVA132

The analysis of covariance was used to test the hypothesis

H0: βg = 0, vs.

H1: βg ≠ 0

where βg represents the relationship between the treatment group and the values of probing depth.

To estimate βg using the full data, the following model was constructed

$$pd_{ij} = \alpha + \beta_g x_{ijg} + \beta_t x_{ijt} + \varepsilon_{ij}$$

where xijg represents the treatment group to which the measurement of tooth j of patient i belongs, and xijt represents the baseline value of probing depth of that tooth.

To estimate βg using the collapsed data, the following model was constructed

$$pd_{is} = \alpha + \beta_g x_{isg} + \beta_t x_{ist} + \varepsilon_{is}$$

where xisg represents the treatment group to which the measurement of side s of patient i belongs, and xist represents the baseline value of probing depth of that side.

A p-value was obtained based on a t statistic calculated using the estimate $\hat{\beta}_g$ and its standard error, and a level of significance of 5% was used.
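The point estimates of the ANCOVA model can be obtained by ordinary least squares. The sketch below covers only the coefficients, not the standard error or the p-value, and it is not the program used in this study: the 0/1 coding of treatment group, the function names and the data (built so that the true coefficients are known) are illustrative assumptions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ancova_fit(post, group, baseline):
    """Least-squares (alpha, beta_g, beta_t) for post = alpha + beta_g*group + beta_t*baseline."""
    design = [[1.0, g, x] for g, x in zip(group, baseline)]
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(design, post)) for i in range(3)]
    return solve(xtx, xty)  # normal equations: (X'X) coef = X'y


# Made-up, noise-free data with alpha = 1.0, beta_g = -0.6, beta_t = 0.5
baseline = [4.0, 5.0, 6.0, 4.5, 5.5, 6.5]
group = [1, 1, 1, 0, 0, 0]  # 1 = intervention, 0 = control (assumed coding)
post = [1.0 - 0.6 * g + 0.5 * x for g, x in zip(group, baseline)]
alpha_hat, beta_g_hat, beta_t_hat = ancova_fit(post, group, baseline)
print(alpha_hat, beta_g_hat, beta_t_hat)
```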

6.5.2.6 Mixed effects model

A mixed effect model was used to test the hypothesis

H0: βγ = 0 vs.

H1: βγ ≠ 0 where βγ represents the interaction of treatment and time in the model for probing depth values. To estimate βγ, two different models were constructed. The first one we call the “simple” random effects model. It handles the clustering of teeth within a subject by introducing a random effect for the subject. This allows all the teeth of a subject, as a group, to tend to have values higher or lower than the overall average.

$$pd_{gti} = \alpha + \beta x_{gti} + \gamma x_{gti} + \beta\gamma x_{gti} + \eta_i + \varepsilon_{gti}$$

In the simple model, g runs over treatment (treated or untreated), t runs over time (pre or post-treatment), and i runs over patients. β represents the effect of the treatment on the probing depth measurement, γ represents the effect of the time at which the measurement was performed, and ηi represents the random effect for patient. The time and treatment were treated as fixed variables, whereas the patient was treated as a random variable133.

The random effects for subject, ηi, are assumed to be independent and normally distributed with mean 0 and between-subject variance σ²(η).

The residual errors, εgti, are assumed to be independent and normally distributed with mean 0 and variance σ²(ε).
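To make the structure of the simple model concrete, the sketch below generates data from it rather than fitting it. The 0/1 coding of treatment and time, the function name and every parameter value are hypothetical choices for illustration only.

```python
import random


def simulate_simple_model(n_patients, alpha, beta, gamma, beta_gamma,
                          sd_patient, sd_residual, seed=1):
    """Rows of (patient, treated, post, probing depth) from the simple model."""
    rng = random.Random(seed)
    rows = []
    for patient in range(n_patients):
        eta = rng.gauss(0.0, sd_patient)  # random effect shared by the patient
        for treated in (0, 1):            # g: untreated or treated
            for post in (0, 1):           # t: pre- or post-treatment
                eps = rng.gauss(0.0, sd_residual)  # residual error
                depth = (alpha + beta * treated + gamma * post
                         + beta_gamma * treated * post + eta + eps)
                rows.append((patient, treated, post, depth))
    return rows


# Hypothetical values; beta_gamma is the treatment-by-time interaction
example = simulate_simple_model(n_patients=3, alpha=3.0, beta=0.0, gamma=-0.5,
                                beta_gamma=-0.6, sd_patient=0.4, sd_residual=0.3)
for row in example:
    print(row)
```

Because eta is drawn once per patient and added to all of that patient's measurements, those measurements move up or down together, which is how the model induces correlation within a subject.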

In addition, a “complex” model was constructed

$$pd_{gtki} = \alpha + \beta x_{gtki} + \gamma x_{gtki} + \beta\gamma x_{gtki} + \tau_{ki} + \eta_i + \varepsilon_{gtki}$$

In the complex model, g runs over treatment (treated or untreated), t runs over time (pre or post-treatment), k runs over tooth in the full data analysis and over side in the collapsed analysis, and i runs over patients. β represents the effect of the treatment on the probing depth measurement, γ represents the effect of the time at which the measurement was performed, τ represents the random differences between teeth in the full data analysis and between sides in the collapsed analysis, and η represents the random differences between patients. The time and treatment were treated as fixed variables, whereas the patient and tooth or side were treated as random variables133.

The random effects for tooth or side, τki, are assumed to be independent and normally distributed with mean 0 and between-tooth (or between-side) variance σ²(τ).

The random effects for subject, ηi, are assumed to be independent and normally distributed with mean 0 and between-subject variance σ²(η).

The residual errors, εgtki, are assumed to be independent and normally distributed with mean 0 and variance σ²(ε).

A p-value was obtained based on a t statistic calculated using the estimate of βγ and its standard error, and a level of significance of 5% was used.

6.5.2.7 Generalized estimating equations

To estimate the effect of the treatment over time in the probing depth values, the following model was constructed when using full data

$$pd_{ij} = \alpha + \beta x_{ij} + \gamma x_{ij} + \beta\gamma x_{ij} + \varepsilon_{ij}$$

where j runs over teeth, i runs over patients, β represents the effect of the treatment on the probing depth measurement, and γ represents the effect of the time at which the measurement was performed134.

The following model was constructed when using collapsed data

$$pd_{is} = \alpha + \beta x_{is} + \gamma x_{is} + \beta\gamma x_{is} + \varepsilon_{is}$$

where s runs over side, i runs over patients, β represents the effect of the treatment on the probing depth measurement, and γ represents the effect of the time at which the measurement was performed134.

An exchangeable correlation structure was assumed; that is, the analysis was done assuming that any pair of observations from the same subject has the same correlation91. Using a sandwich estimator135, robust standard errors were calculated. Finally, a p-value was obtained under the assumption that the estimate divided by its robust standard error followed a normal distribution when the treatment effect was zero. The gee package in R was used to perform these analyses136.

The programs used to perform the statistical analyses of each simulated dataset can be found in Appendix 2.

6.5.3 Estimates obtained from each analysis

The point estimate of the treatment effect, the confidence interval and the p-value were the findings obtained for each of the methods in each round of simulations.

6.5.4 Assessment of the performance of each of the methods: outcomes

6.5.4.1 Primary outcome

The primary outcome of this study was the bias in the point effect estimate. It was chosen as the primary outcome because the point effect estimate is the result that most of the clinicians use to judge whether a treatment is effective and whether it should be administered to patients.

Bias in the point estimate of the treatment effect was defined as the difference between the mean of the values of the point estimate obtained in each round of simulations and the true value of the overall treatment effect.

$$\mathrm{Bias} = \frac{1}{B}\sum_{b=1}^{B}\hat{\beta}_b - \beta$$

where B is the number of simulations performed, $\hat{\beta}_b$ is the estimate of the treatment effect obtained in a round of simulations, and β is the true treatment effect.

A positive value for bias can be interpreted as systematic overestimation of the true treatment effect; whereas a negative value for bias can be interpreted as systematic underestimation of the true treatment effect.

6.5.4.2 Secondary outcomes

The secondary outcomes of this study were the mean square error, confidence interval width, confidence interval coverage, power, and type-I error.

The mean square error of the estimate was defined as the squared bias plus the variance of the estimates from the simulations.

$$\mathrm{MSE} = \mathrm{bias}^2 + \frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\beta}_b - \bar{\hat{\beta}}\right)^2$$

where B is the number of simulations performed, $\hat{\beta}_b$ is the estimate of the treatment effect obtained in a round of simulations, $\bar{\hat{\beta}}$ is the mean of these B estimates, and β is the true treatment effect, which enters through the bias. Mean squared error can be large if an estimate is either biased or variable across samples; in either case, the estimate tends not to be equal to the true value β.
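Bias and mean square error, as defined in words above (squared bias plus the variance of the estimates), can be computed from a list of simulated estimates as in this small sketch. The function names and the four estimates are hypothetical; an actual run would use one estimate per simulated dataset.

```python
from statistics import mean, variance


def bias(estimates, true_effect):
    """Mean of the estimates minus the true treatment effect."""
    return mean(estimates) - true_effect


def mean_square_error(estimates, true_effect):
    """Squared bias plus the sample variance (divisor B - 1) of the estimates."""
    return bias(estimates, true_effect) ** 2 + variance(estimates)


# Made-up estimates from 4 rounds of simulation, true effect 0.6
estimates = [0.58, 0.63, 0.61, 0.57]
print(bias(estimates, 0.6))               # negative: underestimation on average
print(mean_square_error(estimates, 0.6))
```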

The confidence interval width was defined as the mean of the differences between the upper and lower limit of the confidence intervals obtained.

$$\mathrm{Width} = \frac{1}{B}\sum_{b=1}^{B}\left(UL_b - LL_b\right)$$

where B is the number of simulations performed, UL is the upper limit of the confidence interval, and LL is the lower limit of the confidence interval.

The confidence interval coverage was defined as the proportion of simulated datasets in which the obtained confidence interval contained the true treatment effect.

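$$\mathrm{Coverage} = P\left(LL_{\hat{\beta}_b} < \beta < UL_{\hat{\beta}_b}\right)$$

where LL is the lower limit of the confidence interval and UL is the upper limit of the confidence interval.

Since this is a binary variable, it could take two values

$$Y_b = \begin{cases} 0 & \text{if } LL_{\hat{\beta}_b} > \beta \;\lor\; UL_{\hat{\beta}_b} < \beta \\ 1 & \text{if } LL_{\hat{\beta}_b} < \beta < UL_{\hat{\beta}_b} \end{cases}$$

Then

$$\mathrm{Coverage} = \frac{1}{B}\sum_{b=1}^{B} Y_b$$

where B is the number of simulations performed.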
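The confidence interval width and coverage defined above can be computed from the simulated intervals as follows. This is a sketch; the function names and the four intervals are hypothetical.

```python
def ci_width(intervals):
    """Mean of (upper limit - lower limit) over the simulated datasets."""
    return sum(upper - lower for lower, upper in intervals) / len(intervals)


def ci_coverage(intervals, true_effect):
    """Proportion of intervals that contain the true treatment effect."""
    indicators = [1 if lower < true_effect < upper else 0
                  for lower, upper in intervals]
    return sum(indicators) / len(indicators)


# Made-up (lower, upper) limits from 4 rounds of simulation, true effect 0.6
intervals = [(0.2, 0.9), (0.65, 1.1), (0.1, 0.7), (0.3, 0.8)]
print(ci_width(intervals))
print(ci_coverage(intervals, 0.6))  # the second interval misses 0.6
```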

The empirical power was defined as the proportion of simulated datasets in which the null hypothesis of no effect was rejected at a 5% significance level, when the null hypothesis was false. This was computed for each of the non-zero values of the treatment effect.

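$$\mathrm{Power} = P(p < 0.05 \mid \beta \neq 0)$$

Since this is a binary variable, it could take two values

$$Y_b = \begin{cases} 0 & \text{if } p > 0.05 \mid \beta \neq 0 \\ 1 & \text{if } p < 0.05 \mid \beta \neq 0 \end{cases}$$

Then

$$\mathrm{Power} = \frac{1}{B}\sum_{b=1}^{B} Y_b$$

where B is the number of simulations performed.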

The empirical type-I error was defined as the proportion of simulated datasets in which the p-value for the test of the null hypothesis of no treatment effect was less than 0.05, when the treatment effect was zero.

$$\text{Type I error} = P(p < 0.05 \mid \beta = 0)$$

Since this is a binary variable, it could take two values

$$Y_b = \begin{cases} 0 & \text{if } p > 0.05 \mid \beta = 0 \\ 1 & \text{if } p < 0.05 \mid \beta = 0 \end{cases}$$

Then

$$\text{Type I error} = \frac{1}{B}\sum_{b=1}^{B} Y_b$$

where B is the number of simulations performed.
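Empirical power and empirical type-I error are the same calculation applied to different scenarios: the proportion of p-values below 0.05, taken over datasets simulated with a non-zero effect (power) or with a zero effect (type-I error). A minimal sketch, with a hypothetical function name and made-up p-values:

```python
def rejection_rate(p_values, alpha=0.05):
    """Proportion of simulated datasets in which the null hypothesis is rejected."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)


# Made-up p-values from scenarios with a non-zero and a zero true effect
p_nonzero_effect = [0.01, 0.20, 0.03, 0.04]
p_zero_effect = [0.50, 0.04, 0.30, 0.80]
print(rejection_rate(p_nonzero_effect))  # empirical power
print(rejection_rate(p_zero_effect))     # empirical type-I error
```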

6.5.5 Weighting of the criteria of performance assessment to compare the methods

The criteria for choosing which of the tests performed best were the following:

a. Relative lack of bias, based on the bias in the point effect estimate: the mean bias should be equal to zero or close to this value.

b. Highest power given a nominal or lower type-I error rate (5% or less): the highest power and a type I error rate equal to or lower than the significance level used to reject the null hypothesis were preferred. In the presence of two methods with the same power but different type-I error rates, the one with the lower type-I error rate was selected.

c. Nominal confidence interval coverage (95%): the confidence interval coverage should be equal to the confidence level used to construct the confidence intervals.

d. Lowest mean square error

e. Narrowest confidence interval

These criteria were used hierarchically. That is, unbiasedness was considered more important than a nominal type-I error rate, which in turn was considered more important than having the highest power, and so forth. In addition, these criteria were applied in a scenario in which most of the methods had power close to 0.8, and the correlation among the measurements was moderate or strong.
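One way to picture a hierarchical use of these criteria is a lexicographic sort, in which a later criterion only breaks ties left by the earlier ones. The sketch below is an assumption about how the hierarchy could be operationalized, not the procedure followed in this study. The input values are the overall means for four full-data methods in Table 2, used purely as illustrative inputs, whereas the criteria themselves were applied within specific scenarios.

```python
# name: (bias, type-I error, power, CI coverage, MSE, CI width)
methods = {
    "UTP": (0.00, 0.02, 0.79, 0.98, 0.01, 0.50),
    "PTP": (0.00, 0.05, 0.84, 0.95, 0.01, 0.39),
    "PTD": (0.00, 0.05, 0.86, 0.95, 0.01, 0.37),
    "GEE": (0.00, 0.07, 0.87, 0.93, 0.01, 0.34),
}


def sort_key(item):
    name, (bias, type1, power, coverage, mse, width) = item
    return (
        abs(bias),             # a. closest to zero bias
        type1 > 0.05,          # b. type-I error at or below nominal comes first
        -power,                # b. then highest power
        type1,                 # b. equal power: lower type-I error
        abs(coverage - 0.95),  # c. coverage closest to nominal
        mse,                   # d. lowest mean square error
        width,                 # e. narrowest confidence interval
    )


ranked = [name for name, _ in sorted(methods.items(), key=sort_key)]
print(ranked)
```

With these inputs the GEE is ranked last despite having the highest power, because its type-I error rate exceeds the nominal 5%.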

6.6 Number of simulations

A total of 3000 simulations were performed.

The number of simulations required was based on the precision of the outcomes of interest20. For continuous parameter estimates the expected 100(1-α)% confidence interval half-width δ is related to the number of repetitions B and the standard deviation of the estimate σ through the relationship

$$\delta = \frac{Z_{1-\alpha/2}\,\sigma}{\sqrt{B}}$$

When σ is not known, the relative precision (δ/σ) can be specified as

$$\frac{\delta}{\sigma} = \frac{Z_{1-\alpha/2}}{\sqrt{B}}$$

For binary outcomes the expected 100(1-α)% confidence interval half-width δ is related to the number of repetitions B and the true proportion p through the relationship

$$\delta = Z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{B}}$$

Using these formulas, it was estimated that with more than 2000 repetitions, the relative precision of the bias estimate would be < 0.043, the precision of the power estimate would be < 0.0175, and the precision of the type-I error estimate would be < 0.01.
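The formulas above can be evaluated directly. In the sketch below the function names are hypothetical, and the proportions plugged in for power (0.8) and type-I error (0.05) are assumptions about the values behind the quoted thresholds. With B = 2000 the results are close to those thresholds, and with B = 3000 they are smaller.

```python
from math import sqrt
from statistics import NormalDist


def relative_precision(n_sims, conf=0.95):
    """Half-width divided by sigma for a continuous estimate: Z / sqrt(B)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return z / sqrt(n_sims)


def proportion_precision(p, n_sims, conf=0.95):
    """Half-width for a proportion: Z * sqrt(p (1 - p) / B)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return z * sqrt(p * (1 - p) / n_sims)


for n_sims in (2000, 3000):
    print(n_sims,
          relative_precision(n_sims),           # bias, relative to sigma
          proportion_precision(0.80, n_sims),   # power, assuming p = 0.8
          proportion_precision(0.05, n_sims))   # type-I error, assuming p = 0.05
```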

6.7 Analysis and summary of the results

A descriptive analysis of all the outcomes of interest (bias, power, type I and II error, MSE and CI width) was performed. For all outcomes, the mean and its standard deviation across different scenarios were calculated for each number of patients, number of teeth, treatment effect, correlation between the observations, and method of analysis used. In addition, the specific estimates of the outcomes were explored for each combination of these parameters, in order to choose representative clinical scenarios in which to display the results in more detail.

7 Results

The mean of the outcomes across scenarios per method is shown in Table 2. The GEE using full and collapsed data was the method with the highest power, the highest type-I error rate, the lowest confidence interval coverage and the narrowest confidence interval width.

Some differences can be observed when analyzing the mean of each outcome across scenarios per correlation between the measurements. When there is a weak correlation among the measurements, the mean power is similar for all methods, except for the LME simple using collapsed data, which has the lowest mean power. The mean type-I error rate is 0.05 for most of the methods, and the mean confidence interval coverage is 0.95 for most of the methods as well (see Table 3). When the correlation among the measurements is moderate, the mean power and mean confidence interval coverage tend to increase. Only the paired t-test and Wilcoxon signed-rank test with all approaches (full and collapsed data, and only post-treatment measurements and differences between baseline and post-treatment measurements) had a nominal type-I error rate (see Table 4). When the correlation among the measurements is strong, the unpaired t-test, the Wilcoxon rank-sum test and the LME simple using collapsed data had the lowest mean power. The mean type-I error rate and mean coverage behave similarly to when there is a moderate correlation (see Table 5).

Since overall results show the mean of the different outcomes across different scenarios, they can conceal interesting findings for specific situations. Therefore, all the outcomes were explored in each scenario, and three scenarios were chosen to illustrate the obtained results:

1) small sized trial: 2 teeth (1 per side) and 20 patients

2) medium sized trial: 4 teeth (2 per side) and 20 patients

3) large sized trial: 10 teeth (5 per side) and 50 patients

Table 2: Overall results: Mean of each of the outcomes per method of analysis

| Method of analysis | Bias, mean | Bias, SD | Power, mean | Power, SD | Type-I error, mean | Type-I error, SD | CI coverage, mean | CI coverage, SD | MSE, mean | MSE, SD | CI width, mean | CI width, SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| I. Full data | | | | | | | | | | | | |
| 1. UTP | 0.00 | 0.00 | 0.79 | 0.33 | 0.02 | 0.02 | 0.98 | 0.02 | 0.01 | 0.01 | 0.50 | 0.23 |
| 2. PTP | 0.00 | 0.00 | 0.84 | 0.27 | 0.05 | 0.00 | 0.95 | 0.00 | 0.01 | 0.01 | 0.39 | 0.22 |
| 3. PWP | - | - | 0.84 | 0.28 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.40 | 0.22 |
| 4. UWP | - | - | 0.79 | 0.33 | 0.02 | 0.02 | 0.98 | 0.02 | - | - | 0.51 | 0.24 |
| 5. UTD | 0.00 | 0.00 | 0.83 | 0.30 | 0.02 | 0.02 | 0.98 | 0.02 | 0.01 | 0.02 | 0.43 | 0.23 |
| 6. PTD | 0.00 | 0.00 | 0.86 | 0.26 | 0.05 | 0.00 | 0.95 | 0.00 | 0.01 | 0.02 | 0.37 | 0.24 |
| 7. PWD | - | - | 0.85 | 0.27 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.38 | 0.24 |
| 8. UWD | - | - | 0.82 | 0.31 | 0.02 | 0.02 | 0.98 | 0.02 | - | - | 0.45 | 0.24 |
| 9. ANCOVA | 0.00 | 0.00 | 0.85 | 0.28 | 0.02 | 0.02 | 0.98 | 0.02 | 0.01 | 0.01 | 0.38 | 0.20 |
| 10. LME simple | 0.00 | 0.00 | 0.77 | 0.35 | 0.01 | 0.01 | 0.99 | 0.01 | 0.01 | 0.02 | 0.54 | 0.27 |
| 11. LME complex | 0.00 | 0.00 | 0.83 | 0.30 | 0.02 | 0.02 | 0.98 | 0.02 | 0.01 | 0.02 | 0.42 | 0.21 |
| 12. GEE | 0.00 | 0.00 | 0.87 | 0.25 | 0.07 | 0.01 | 0.93 | 0.01 | 0.01 | 0.02 | 0.34 | 0.20 |
| II. Collapsed data | | | | | | | | | | | | |
| 13. UTP | 0.00 | 0.00 | 0.75 | 0.36 | 0.02 | 0.02 | 0.98 | 0.02 | 0.01 | 0.01 | 0.58 | 0.23 |
| 14. PTP | 0.00 | 0.00 | 0.84 | 0.28 | 0.05 | 0.00 | 0.95 | 0.00 | 0.01 | 0.01 | 0.40 | 0.22 |
| 15. PWP | - | - | 0.84 | 0.28 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.41 | 0.23 |
| 16. UWP | - | - | 0.74 | 0.36 | 0.02 | 0.02 | 0.98 | 0.02 | - | - | 0.60 | 0.24 |
| 17. UTD | 0.00 | 0.00 | 0.79 | 0.33 | 0.02 | 0.02 | 0.98 | 0.02 | 0.01 | 0.02 | 0.50 | 0.21 |
| 18. PTD | 0.00 | 0.00 | 0.85 | 0.27 | 0.05 | 0.00 | 0.95 | 0.00 | 0.01 | 0.02 | 0.38 | 0.24 |
| 19. PWD | - | - | 0.85 | 0.27 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.39 | 0.25 |
| 20. UWD | - | - | 0.78 | 0.34 | 0.02 | 0.02 | 0.98 | 0.02 | - | - | 0.51 | 0.23 |
| 21. ANCOVA | 0.00 | 0.00 | 0.82 | 0.32 | 0.02 | 0.02 | 0.98 | 0.02 | 0.01 | 0.01 | 0.44 | 0.18 |
| 22. LME simple | 0.00 | 0.00 | 0.67 | 0.36 | 0.02 | 0.01 | 0.98 | 0.01 | 0.03 | 0.02 | 0.74 | 0.27 |
| 23. LME complex | 0.00 | 0.00 | 0.80 | 0.33 | 0.02 | 0.03 | 0.98 | 0.03 | 0.01 | 0.02 | 0.48 | 0.20 |
| 24. GEE | 0.00 | 0.00 | 0.87 | 0.25 | 0.07 | 0.01 | 0.93 | 0.01 | 0.01 | 0.02 | 0.34 | 0.20 |

CI: confidence interval, MSE: mean square error, UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LME simple: linear mixed effects model with time and treatment treated as fixed variables and patient treated as random variable, LME complex: linear mixed effects model with time and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

Table 3: Overall results: Mean of each of the outcomes per method of analysis using a weak correlation

| Method of analysis | Bias, mean | Bias, SD | Power, mean | Power, SD | Type-I error, mean | Type-I error, SD | CI coverage, mean | CI coverage, SD | MSE, mean | MSE, SD | CI width, mean | CI width, SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| I. Full data | | | | | | | | | | | | |
| 1. UTP | 0.00 | 0.00 | 0.79 | 0.31 | 0.05 | 0.00 | 0.95 | 0.00 | 0.02 | 0.02 | 0.50 | 0.24 |
| 2. PTP | 0.00 | 0.00 | 0.79 | 0.31 | 0.05 | 0.00 | 0.95 | 0.00 | 0.02 | 0.02 | 0.51 | 0.25 |
| 3. PWP | - | - | 0.78 | 0.32 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.52 | 0.26 |
| 4. UWP | - | - | 0.78 | 0.31 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.51 | 0.25 |
| 5. UTD | 0.00 | 0.00 | 0.77 | 0.32 | 0.05 | 0.00 | 0.95 | 0.00 | 0.02 | 0.02 | 0.55 | 0.26 |
| 6. PTD | 0.00 | 0.00 | 0.76 | 0.32 | 0.05 | 0.00 | 0.95 | 0.00 | 0.02 | 0.02 | 0.56 | 0.27 |
| 7. PWD | - | - | 0.76 | 0.33 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.57 | 0.28 |
| 8. UWD | - | - | 0.76 | 0.32 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.56 | 0.27 |
| 9. ANCOVA | 0.00 | 0.00 | 0.81 | 0.30 | 0.05 | 0.00 | 0.95 | 0.00 | 0.02 | 0.02 | 0.46 | 0.22 |
| 10. LME simple | 0.00 | 0.00 | 0.71 | 0.36 | 0.02 | 0.00 | 0.98 | 0.00 | 0.02 | 0.02 | 0.67 | 0.29 |
| 11. LME complex | 0.00 | 0.00 | 0.77 | 0.31 | 0.05 | 0.00 | 0.95 | 0.01 | 0.02 | 0.02 | 0.53 | 0.24 |
| 12. GEE | 0.00 | 0.00 | 0.78 | 0.30 | 0.07 | 0.02 | 0.93 | 0.01 | 0.02 | 0.02 | 0.51 | 0.22 |
| II. Collapsed data | | | | | | | | | | | | |
| 13. UTP | 0.00 | 0.00 | 0.79 | 0.31 | 0.05 | 0.00 | 0.95 | 0.00 | 0.02 | 0.02 | 0.50 | 0.24 |
| 14. PTP | 0.00 | 0.00 | 0.78 | 0.31 | 0.05 | 0.00 | 0.95 | 0.00 | 0.02 | 0.02 | 0.52 | 0.25 |
| 15. PWP | - | - | 0.77 | 0.32 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.53 | 0.26 |
| 16. UWP | - | - | 0.78 | 0.32 | 0.05 | 0.01 | 0.95 | 0.01 | - | - | 0.52 | 0.25 |
| 17. UTD | 0.00 | 0.00 | 0.76 | 0.32 | 0.05 | 0.00 | 0.95 | 0.00 | 0.02 | 0.02 | 0.55 | 0.26 |
| 18. PTD | 0.00 | 0.00 | 0.76 | 0.32 | 0.05 | 0.00 | 0.95 | 0.00 | 0.02 | 0.02 | 0.57 | 0.27 |
| 19. PWD | - | - | 0.75 | 0.33 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.58 | 0.28 |
| 20. UWD | - | - | 0.75 | 0.33 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.57 | 0.27 |
| 21. ANCOVA | 0.00 | 0.00 | 0.81 | 0.30 | 0.05 | 0.00 | 0.95 | 0.00 | 0.02 | 0.02 | 0.47 | 0.22 |
| 22. LME simple | 0.00 | 0.00 | 0.64 | 0.36 | 0.03 | 0.01 | 0.97 | 0.01 | 0.04 | 0.03 | 0.81 | 0.31 |
| 23. LME complex | 0.00 | 0.00 | 0.77 | 0.31 | 0.06 | 0.01 | 0.94 | 0.01 | 0.02 | 0.02 | 0.53 | 0.24 |
| 24. GEE | 0.00 | 0.00 | 0.78 | 0.30 | 0.07 | 0.02 | 0.93 | 0.01 | 0.02 | 0.02 | 0.51 | 0.22 |

CI: confidence interval, MSE: mean square error. Abbreviations for the methods of analysis are as in Table 2.

Table 4: Overall results: Mean of each of the outcomes per method of analysis using a moderate correlation

| Method of analysis | Bias, mean | Bias, SD | Power, mean | Power, SD | Type-I error, mean | Type-I error, SD | CI coverage, mean | CI coverage, SD | MSE, mean | MSE, SD | CI width, mean | CI width, SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| I. Full data | | | | | | | | | | | | |
| 1. UTP | 0.00 | 0.00 | 0.79 | 0.33 | 0.01 | 0.00 | 0.99 | 0.00 | 0.01 | 0.01 | 0.50 | 0.23 |
| 2. PTP | 0.00 | 0.00 | 0.85 | 0.27 | 0.05 | 0.00 | 0.95 | 0.00 | 0.01 | 0.01 | 0.39 | 0.19 |
| 3. PWP | - | - | 0.84 | 0.28 | 0.05 | 0.01 | 0.95 | 0.00 | - | - | 0.40 | 0.20 |
| 4. UWP | - | - | 0.79 | 0.33 | 0.01 | 0.00 | 0.99 | 0.00 | - | - | 0.51 | 0.25 |
| 5. UTD | 0.00 | 0.00 | 0.82 | 0.32 | 0.01 | 0.00 | 0.99 | 0.00 | 0.01 | 0.01 | 0.44 | 0.21 |
| 6. PTD | 0.00 | 0.00 | 0.88 | 0.24 | 0.05 | 0.00 | 0.95 | 0.00 | 0.01 | 0.01 | 0.32 | 0.16 |
| 7. PWD | - | - | 0.88 | 0.25 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.33 | 0.16 |
| 8. UWD | - | - | 0.81 | 0.32 | 0.01 | 0.00 | 0.99 | 0.00 | - | - | 0.46 | 0.22 |
| 9. ANCOVA | 0.00 | 0.00 | 0.84 | 0.30 | 0.01 | 0.00 | 0.99 | 0.00 | 0.01 | 0.01 | 0.40 | 0.19 |
| 10. LME simple | 0.00 | 0.00 | 0.76 | 0.37 | 0.00 | 0.00 | 1.00 | 0.00 | 0.01 | 0.01 | 0.56 | 0.25 |
| 11. LME complex | 0.00 | 0.00 | 0.83 | 0.31 | 0.01 | 0.00 | 0.99 | 0.00 | 0.01 | 0.01 | 0.43 | 0.19 |
| 12. GEE | 0.00 | 0.00 | 0.89 | 0.22 | 0.07 | 0.01 | 0.93 | 0.01 | 0.01 | 0.01 | 0.30 | 0.13 |
| II. Collapsed data | | | | | | | | | | | | |
| 13. UTP | 0.00 | 0.00 | 0.74 | 0.37 | 0.01 | 0.01 | 0.99 | 0.01 | 0.01 | 0.01 | 0.60 | 0.21 |
| 14. PTP | 0.00 | 0.00 | 0.84 | 0.28 | 0.05 | 0.01 | 0.95 | 0.00 | 0.01 | 0.01 | 0.40 | 0.19 |
| 15. PWP | - | - | 0.84 | 0.28 | 0.05 | 0.01 | 0.95 | 0.00 | - | - | 0.41 | 0.20 |
| 16. UWP | - | - | 0.74 | 0.37 | 0.01 | 0.01 | 0.99 | 0.01 | - | - | 0.62 | 0.23 |
| 17. UTD | 0.00 | 0.00 | 0.77 | 0.37 | 0.00 | 0.00 | 1.00 | 0.00 | 0.01 | 0.01 | 0.55 | 0.19 |
| 18. PTD | 0.00 | 0.00 | 0.88 | 0.25 | 0.05 | 0.00 | 0.95 | 0.00 | 0.01 | 0.01 | 0.33 | 0.16 |
| 19. PWD | - | - | 0.87 | 0.25 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.34 | 0.16 |
| 20. UWD | - | - | 0.76 | 0.37 | 0.00 | 0.00 | 1.00 | 0.00 | - | - | 0.57 | 0.20 |
| 21. ANCOVA | 0.00 | 0.00 | 0.80 | 0.35 | 0.00 | 0.01 | 1.00 | 0.00 | 0.01 | 0.01 | 0.49 | 0.17 |
| 22. LME simple | 0.00 | 0.00 | 0.66 | 0.38 | 0.01 | 0.00 | 0.99 | 0.00 | 0.02 | 0.02 | 0.77 | 0.26 |
| 23. LME complex | 0.00 | 0.00 | 0.78 | 0.36 | 0.00 | 0.01 | 1.00 | 0.00 | 0.01 | 0.01 | 0.53 | 0.17 |
| 24. GEE | 0.00 | 0.00 | 0.89 | 0.22 | 0.07 | 0.01 | 0.93 | 0.01 | 0.01 | 0.01 | 0.30 | 0.13 |

CI: confidence interval, MSE: mean square error. Abbreviations for the methods of analysis are as in Table 2.

Table 5: Overall results: Mean of each of the outcomes per method of analysis using a strong correlation

| Method of analysis | Bias, mean | Bias, SD | Power, mean | Power, SD | Type-I error, mean | Type-I error, SD | CI coverage, mean | CI coverage, SD | MSE, mean | MSE, SD | CI width, mean | CI width, SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| I. Full data | | | | | | | | | | | | |
| 1. UTP | 0.00 | 0.00 | 0.80 | 0.35 | 0.00 | 0.00 | 1.00 | 0.00 | 0.01 | 0.01 | 0.49 | 0.23 |
| 2. PTP | 0.00 | 0.00 | 0.90 | 0.22 | 0.05 | 0.00 | 0.95 | 0.00 | 0.01 | 0.01 | 0.28 | 0.14 |
| 3. PWP | - | - | 0.90 | 0.22 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.29 | 0.14 |
| 4. UWP | - | - | 0.79 | 0.35 | 0.00 | 0.00 | 1.00 | 0.00 | - | - | 0.50 | 0.24 |
| 5. UTD | 0.00 | 0.00 | 0.89 | 0.25 | 0.01 | 0.00 | 0.99 | 0.00 | 0.00 | 0.00 | 0.31 | 0.15 |
| 6. PTD | 0.00 | 0.00 | 0.93 | 0.18 | 0.05 | 0.00 | 0.95 | 0.00 | 0.00 | 0.00 | 0.23 | 0.11 |
| 7. PWD | - | - | 0.93 | 0.19 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.23 | 0.11 |
| 8. UWD | - | - | 0.88 | 0.26 | 0.01 | 0.00 | 0.99 | 0.00 | - | - | 0.32 | 0.16 |
| 9. ANCOVA | 0.00 | 0.00 | 0.90 | 0.24 | 0.01 | 0.00 | 0.99 | 0.00 | 0.00 | 0.00 | 0.30 | 0.14 |
| 10. LME simple | 0.00 | 0.00 | 0.85 | 0.31 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.39 | 0.17 |
| 11. LME complex | 0.00 | 0.00 | 0.89 | 0.25 | 0.01 | 0.00 | 0.99 | 0.00 | 0.00 | 0.00 | 0.31 | 0.14 |
| 12. GEE | 0.00 | 0.00 | 0.94 | 0.16 | 0.07 | 0.01 | 0.93 | 0.01 | 0.00 | 0.00 | 0.21 | 0.09 |
| II. Collapsed data | | | | | | | | | | | | |
| 13. UTP | 0.00 | 0.00 | 0.72 | 0.39 | 0.00 | 0.00 | 1.00 | 0.00 | 0.01 | 0.01 | 0.65 | 0.21 |
| 14. PTP | 0.00 | 0.00 | 0.90 | 0.22 | 0.05 | 0.00 | 0.95 | 0.00 | 0.01 | 0.01 | 0.28 | 0.14 |
| 15. PWP | - | - | 0.90 | 0.23 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.29 | 0.14 |
| 16. UWP | - | - | 0.72 | 0.39 | 0.00 | 0.00 | 1.00 | 0.00 | - | - | 0.66 | 0.23 |
| 17. UTD | 0.00 | 0.00 | 0.84 | 0.31 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.39 | 0.13 |
| 18. PTD | 0.00 | 0.00 | 0.93 | 0.19 | 0.05 | 0.00 | 0.95 | 0.00 | 0.00 | 0.00 | 0.23 | 0.11 |
| 19. PWD | - | - | 0.93 | 0.19 | 0.05 | 0.00 | 0.95 | 0.00 | - | - | 0.24 | 0.11 |
| 20. UWD | - | - | 0.84 | 0.31 | 0.00 | 0.00 | 1.00 | 0.00 | - | - | 0.40 | 0.14 |
| 21. ANCOVA | 0.00 | 0.00 | 0.85 | 0.30 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.37 | 0.13 |
| 22. LME simple | 0.00 | 0.00 | 0.72 | 0.36 | 0.02 | 0.01 | 0.98 | 0.00 | 0.02 | 0.01 | 0.64 | 0.21 |
| 23. LME complex | 0.00 | 0.00 | 0.85 | 0.30 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.37 | 0.12 |
| 24. GEE | 0.00 | 0.00 | 0.94 | 0.16 | 0.07 | 0.01 | 0.93 | 0.01 | 0.00 | 0.00 | 0.21 | 0.09 |

CI: confidence interval, MSE: mean square error. Abbreviations for the methods of analysis are as in Table 2.

7.1 Bias

The mean bias per method and its standard deviation were calculated for the different correlations, numbers of teeth, numbers of patients and treatment effects. Rounded to two decimal places, the mean bias in all cases was 0 (SD=0.01).

7.2 Power

For most of the methods under study, the power depends on the correlation among the measurements, except for the unpaired t-test and the Wilcoxon rank-sum test using post-treatment full data. When the correlations are weak, the power of the methods is lower. It must be noted that, since the power is near 100% for a wide range of the treatment effect values used, there is a ceiling effect when calculating the mean power across some scenarios. Therefore, overall values of power are not reported. It can be observed that the correlation has a more important influence on the power when the clinical scenario represents a trial of small size, than when the trial is medium sized or large.

7.2.1 Small sized trial (Figure 7)

When the treatment effect is 0.6 (i.e., equal to one standard deviation) and the correlation is weak, most of the methods have a power higher than 0.8. The method with the lowest power is the linear mixed effects simple model using collapsed data (LMESC), which has a power of 0.58. The other methods with a power below 0.8 have a power of 0.77 (linear mixed effects simple model using full data, LMESF), 0.78 (Wilcoxon signed-rank test using the differences between pre- and post-treatment data, collapsed and full, PWDC and PWDF, respectively), and 0.79 (Wilcoxon rank-sum test using the differences between pre- and post-treatment data, collapsed and full, UWDC and UWDF, respectively).

When the correlation among the measurements is moderate, the only method with a power lower than 0.8 at a treatment effect of 0.6 is the LMESC (power=0.73). Most of the methods have a power higher than 0.8 when the treatment effect is 0.48, except for the LMESC (power=0.47), the unpaired t-test using the post-treatment full data (UTPF, power=0.72), the Wilcoxon rank-sum test using the post-treatment full data (UWPF, power=0.71), LMESF (power=0.68), the unpaired t-test using post-treatment collapsed data (UTPC, power=0.72), and the Wilcoxon rank-sum test using the post-treatment collapsed data (UWPC, power=0.71).

Finally, when the correlation among the measurements is strong, all the methods have a power over 0.8 when the treatment effect is higher than 0.48 (i.e., 0.8 standard deviations).

Figure 7: Power versus treatment effect in a scenario with 20 patients and 2 teeth

The top 12 panels show the full data analysis, whereas the bottom 12 panels show collapsed data analysis. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

7.2.2 Medium sized trial (Figure 8)

The power rises faster with the treatment effect when the trial has 4 teeth and 20 patients. Even though it still depends on the treatment effect, most of the methods reach a power of 0.8 when the treatment effect is 0.48, with the exceptions noted below.

When there is a weak correlation among the measurements, the method with the lowest power is the LMESC, which has a power of 0.6 when the treatment effect is 0.48, and a power of 0.8 when the treatment effect is 0.6, whereas all the other methods have a power over 0.95 when the treatment effect is 0.6.

When the correlation among the measurements is moderate and the treatment effect is 0.48, all of the methods have a power higher than 0.9, except for the LMESC, which has a power of 0.65. The power increases to over 0.99 for all these methods when the treatment effect is 0.6. The power of the LMESC in this scenario is 0.88.

Finally, when the correlation among the measurements is strong, all the methods have a power over 0.8 when the treatment effect is higher than 0.48. Most of the methods have a power higher than 0.99 when the treatment effect is 0.6, with the exception of the UTPC (power= 0.92) and the LMESC (power=0.83).

Figure 8: Power versus treatment effect in a scenario with 20 patients and 4 teeth

The top 12 panels show the full data analysis, whereas the bottom 12 panels show collapsed data analysis. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

7.2.3 Large sized trial (Figure 9)

The relationship between the correlation of the measurements and the treatment effect size is not as important as in the small and medium sized trials. At all correlations, all of the methods reach a power of 0.8 when the treatment effect is 0.48. The UTPC and UWPC have a lower power when the correlation is weak than when it is strong. The method with the lowest power in all scenarios is the LMESC.

Figure 9: Power versus treatment effect in a scenario with 50 patients and 10 teeth

The top 12 panels show full data analysis, whereas the bottom 12 panels show collapsed data analysis. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

7.3 Type-I error rate

The type-I error rate of each of the methods was similar across all the scenarios. It was independent of the number of patients and the number of teeth for all methods, with the exception of the generalized estimating equations using full (GEEF) and collapsed (GEEC) data, for which it ranged from 0.1 when the number of patients was 10 to 0.06 when the number of patients was 50.

Figure 10 illustrates the type-I error rate for each of the methods in the scenario with 4 teeth. With the paired t-test, the Wilcoxon signed-rank test and the GEE, the type-I error rate was independent of the correlation among the measurements. The UTPF, UWPF, UTPC, UWPC, the ANCOVA using full and collapsed data (ANCOVAF and ANCOVAC, respectively), and the complex linear mixed effects model using full and collapsed data (LMECF and LMECC, respectively) had the nominal type-I error rate when the correlation among the measurements was weak, whereas a type-I error rate close to 0 was observed when the correlations were moderate or strong, with no important differences between the two. The paired t-test and the Wilcoxon signed-rank test were the only methods for which the type-I error rate was 0.05 regardless of the correlation.
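The contrast between paired and unpaired tests can be reproduced in a small simulation. The sketch below is illustrative only and is not the code used in this study: it assumes one value per side for each of 20 patients, a standard deviation of 0.6, a shared patient component that induces the correlation, and hypothetical correlations of 0.1 and 0.9 to stand for weak and strong. The critical values are hard-coded t quantiles for 19 and 38 degrees of freedom, so the sketch is valid only for 20 patients.

```python
import random
from math import sqrt

N_PATIENTS = 20
T_CRIT_PAIRED = 2.093    # 0.975 quantile of t with 19 df (N_PATIENTS - 1)
T_CRIT_UNPAIRED = 2.024  # 0.975 quantile of t with 38 df (2 * N_PATIENTS - 2)


def mean(xs):
    return sum(xs) / len(xs)


def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def rejection_rates(rho, sd=0.6, n_sims=2000, seed=1):
    """Share of null trials rejected by a paired and by an unpaired t-test.

    Assumed model (illustrative): each patient contributes one value per
    side, and a shared patient component gives the sides correlation rho.
    """
    rng = random.Random(seed)
    paired = unpaired = 0
    for _sim in range(n_sims):
        treated, control = [], []
        for _patient in range(N_PATIENTS):
            shared = rng.gauss(0, sd * sqrt(rho))
            treated.append(shared + rng.gauss(0, sd * sqrt(1 - rho)))
            control.append(shared + rng.gauss(0, sd * sqrt(1 - rho)))
        diffs = [t - c for t, c in zip(treated, control)]
        t_paired = mean(diffs) / sqrt(sample_var(diffs) / N_PATIENTS)
        se_unpaired = sqrt((sample_var(treated) + sample_var(control)) / N_PATIENTS)
        t_unpaired = mean(diffs) / se_unpaired
        paired += abs(t_paired) > T_CRIT_PAIRED
        unpaired += abs(t_unpaired) > T_CRIT_UNPAIRED
    return paired / n_sims, unpaired / n_sims


weak = rejection_rates(rho=0.1)    # (paired rate, unpaired rate)
strong = rejection_rates(rho=0.9)
print("weak correlation:  ", weak)
print("strong correlation:", strong)
```

In this simplified setting the unpaired test ignores the covariance between the two sides, so its standard error is too large when the sides are strongly correlated and it almost never rejects, while the paired test stays near the nominal 0.05.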

Figure 10: Type-I error rate versus number of patients in a scenario with 4 teeth

The top 12 panels show full data analysis, whereas the bottom 12 panels show collapsed data analysis. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

7.4 Confidence interval coverage

The confidence interval coverage was independent of the treatment effect and number of teeth for all methods of analysis. The number of patients only influenced the confidence interval coverage when analyzing the data using GEEF and GEEC.

Figure 11 illustrates the confidence interval coverage of each of the methods in a scenario with 4 teeth and 20 patients. The results of this scenario are similar to what was observed in all the other scenarios. The GEEF and GEEC were the methods with the lowest confidence interval coverage, which ranged from 0.92 to 0.94. All the other methods had confidence interval coverage around the nominal level or above.

The confidence interval coverage does not depend on the correlation when using the paired t-test, Wilcoxon signed-rank test and GEE, whereas all the other methods have lower confidence interval coverage when the correlation among the data is weak than when it is moderate or strong. The unpaired t-test, Wilcoxon rank-sum test, ANCOVA and the complex linear mixed effects model had nominal confidence interval coverage when the correlation was weak.

Figure 11: Confidence interval coverage versus treatment effect in a scenario with 4 teeth and 20 patients

The top 12 panels show full data analysis, whereas the bottom 12 panels show collapsed data analysis. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

7.5 Mean square error

Since all the methods were unbiased, the mean square error is a reflection of the variance of the estimates produced by the various methods. When the correlation among the measurements was weak, all the methods had a mean square error of 0.2, except the LMESC, which had a mean square error of 0.4. When the correlations were either moderate or strong, the mean square error of all the methods was 0.1, whereas it was 0.2 for the LMESC.
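The link between unbiasedness and the mean square error follows from the standard decomposition of the mean square error into variance and squared bias. The notation is introduced here for illustration only: $\hat{\theta}$ denotes the estimate produced by a method and $\theta$ the true treatment effect.

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \mathrm{Var}(\hat{\theta}) + \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2
```

When the bias term $\mathbb{E}[\hat{\theta}] - \theta$ is zero, the second term vanishes and the mean square error equals the variance of the estimates.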

7.6 Confidence interval width

The confidence interval width was independent of the treatment effect size and dependent on the number of teeth and number of patients in all the analysis methods. The correlation among the measurements influenced the confidence interval width in most of the methods, except for the UTPF and UWPF.

Figure 12 illustrates the confidence interval width in a scenario with 4 teeth, 20 patients and a treatment effect of 0.6. In general, a stronger correlation among the measurements is associated with a narrower confidence interval. The methods for which this association was strongest were the PTDF, PWDF, PWDC and PTDC. When using these methods, the confidence interval width decreased from 0.6 mm when the correlation was weak to 0.24 mm when the correlation was strong. The confidence interval width of the UTPC and UWPC showed the opposite relationship with the correlation: their width was 0.55 mm when the correlation was weak, whereas it was 0.71 mm when the correlation was strong.
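One way to see why the two groups of methods can move in opposite directions is a simplified variance calculation. The setting is an assumption made for illustration and may differ from the simulated data structure: every tooth has variance $\sigma^2$, all teeth in the same mouth share a common correlation $\rho$, each side has $m$ teeth, there are $n$ patients, and $\bar{Y}_T$ and $\bar{Y}_C$ denote the treated and control side means of one patient.

```latex
\mathrm{Var}(\bar{Y}_T) = \mathrm{Var}(\bar{Y}_C) = \frac{\sigma^2\,[1 + (m-1)\rho]}{m},
\qquad
\mathrm{Cov}(\bar{Y}_T, \bar{Y}_C) = \rho\,\sigma^2

% Paired analysis of side means: the covariance is subtracted
\mathrm{Var}_{\text{paired}} = \frac{2\sigma^2 (1 - \rho)}{m\,n}

% Unpaired analysis of side means: the covariance is ignored
\mathrm{Var}_{\text{unpaired}} = \frac{2\sigma^2\,[1 + (m-1)\rho]}{m\,n}
```

Since the confidence interval width is proportional to the square root of the variance used by the method, under these assumptions the paired interval narrows as $\rho$ increases, while the unpaired interval on collapsed data widens. This agrees in direction with the pattern described above.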

Figure 12: Confidence interval width in a scenario with 4 teeth, 20 patients and a treatment effect of 0.6

Each panel represents the correlation among the measurements. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

7.7 Comparison of the methods of analysis

7.7.1 Small sized trial (Table 6)

The methods that had the highest power while maintaining a nominal type-I error rate and confidence interval coverage, and that also had the narrowest confidence intervals overall, were the PTDF and PTDC. The PWDF and PWDC showed very similar results.

When there was a weak correlation among the data, the majority of the methods performed appropriately; however, the LMESC had a lower power and type-I error rate, higher confidence interval coverage and wider confidence interval width when compared to the other methods. The GEEF and GEEC were the methods with the highest type-I error rate (0.08) and lowest confidence interval coverage (0.93).

The PTDF, PTDC, PWDF and PWDC were the only methods that had a nominal type-I error rate and confidence interval coverage when the correlation among the measurements was moderate or strong. The PTPF, PTPC, PWPF and PWPC also performed appropriately in terms of these outcomes. The GEEF and GEEC showed the highest type-I error rate (0.07 when the correlation was moderate and 0.08 when it was strong), and the lowest confidence interval coverage (0.93 when the correlation was moderate and 0.92 when it was strong). The unpaired t-tests had the lowest type-I error rate (0.01 and 0, with moderate and strong correlations respectively), the highest confidence interval coverage (0.99 and 1, with moderate and strong correlations respectively), and the second widest confidence intervals (0.76). The LMESC showed similar results, but even wider confidence intervals (0.78).

Table 6: Comparison of the methods in a trial with 2 teeth and 20 patients

Column headings give the outcome (Power; Type-I: type-I error rate; Coverage: confidence interval coverage; Width: confidence interval width), the treatment effect and the correlation.

| Method of analysis | Power 0.48 W | Power 0.48 M | Power 0.48 S | Power 0.6 W | Power 0.6 M | Power 0.6 S | Type-I 0.0 W | Type-I 0.0 M | Type-I 0.0 S | Coverage 0.6 W | Coverage 0.6 M | Coverage 0.6 S | Width 0.6 W | Width 0.6 M | Width 0.6 S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I. Full data | | | | | | | | | | | | | | | |
| 1. UTP | 0.70 | 0.72 | 0.82 | 0.88 | 0.92 | 0.97 | 0.04 | 0.01 | 0.00 | 0.95 | 0.99 | 1.00 | 0.76 | 0.76 | 0.76 |
| 2. PTP | 0.68 | 0.87 | 0.99 | 0.86 | 0.97 | 1.00 | 0.05 | 0.04 | 0.04 | 0.94 | 0.95 | 0.94 | 0.78 | 0.61 | 0.43 |
| 3. PWP | 0.65 | 0.85 | 0.99 | 0.84 | 0.97 | 1.00 | 0.05 | 0.04 | 0.04 | 0.95 | 0.95 | 0.95 | 0.80 | 0.62 | 0.44 |
| 4. UWP | 0.68 | 0.71 | 0.79 | 0.85 | 0.91 | 0.95 | 0.05 | 0.01 | 0.00 | 0.95 | 0.99 | 1.00 | 0.78 | 0.78 | 0.78 |
| 5. UTD | 0.61 | 0.85 | 1.00 | 0.81 | 0.98 | 1.00 | 0.05 | 0.01 | 0.01 | 0.95 | 0.99 | 0.99 | 0.84 | 0.68 | 0.48 |
| 6. PTD | 0.60 | 0.97 | 1.00 | 0.80 | 1.00 | 1.00 | 0.05 | 0.05 | 0.05 | 0.95 | 0.95 | 0.95 | 0.86 | 0.50 | 0.35 |
| 7. PWD | 0.57 | 0.96 | 1.00 | 0.78 | 1.00 | 1.00 | 0.05 | 0.05 | 0.05 | 0.95 | 0.95 | 0.95 | 0.88 | 0.51 | 0.36 |
| 8. UWD | 0.59 | 0.82 | 0.99 | 0.79 | 0.97 | 1.00 | 0.05 | 0.01 | 0.01 | 0.95 | 0.99 | 0.99 | 0.86 | 0.70 | 0.49 |
| 9. ANCOVA | 0.77 | 0.93 | 1.00 | 0.92 | 0.99 | 1.00 | 0.04 | 0.01 | 0.01 | 0.94 | 0.99 | 0.99 | 0.71 | 0.61 | 0.46 |
| 10. LME simple | 0.46 | 0.68 | 0.98 | 0.71 | 0.93 | 1.00 | 0.02 | 0.00 | 0.00 | 0.98 | 1.00 | 1.00 | 0.99 | 0.83 | 0.58 |
| 11. LME complex | 0.64 | 0.87 | 1.00 | 0.83 | 0.98 | 1.00 | 0.06 | 0.01 | 0.01 | 0.94 | 0.99 | 0.99 | 0.81 | 0.66 | 0.47 |
| 12. GEE | 0.66 | 0.98 | 1.00 | 0.84 | 1.00 | 1.00 | 0.08 | 0.07 | 0.08 | 0.93 | 0.93 | 0.92 | 0.78 | 0.45 | 0.32 |
| II. Collapsed data | | | | | | | | | | | | | | | |
| 13. UTP | 0.70 | 0.72 | 0.82 | 0.88 | 0.92 | 0.97 | 0.04 | 0.01 | 0.00 | 0.95 | 0.99 | 1.00 | 0.76 | 0.76 | 0.76 |
| 14. PTP | 0.68 | 0.87 | 0.99 | 0.86 | 0.97 | 1.00 | 0.05 | 0.04 | 0.04 | 0.94 | 0.95 | 0.94 | 0.78 | 0.61 | 0.43 |
| 15. PWP | 0.65 | 0.85 | 0.99 | 0.84 | 0.97 | 1.00 | 0.05 | 0.04 | 0.04 | 0.95 | 0.95 | 0.95 | 0.80 | 0.62 | 0.44 |
| 16. UWP | 0.68 | 0.71 | 0.79 | 0.85 | 0.91 | 0.95 | 0.05 | 0.01 | 0.00 | 0.95 | 0.99 | 1.00 | 0.78 | 0.78 | 0.78 |
| 17. UTD | 0.61 | 0.85 | 1.00 | 0.81 | 0.98 | 1.00 | 0.05 | 0.01 | 0.01 | 0.95 | 0.99 | 0.99 | 0.84 | 0.68 | 0.48 |
| 18. PTD | 0.60 | 0.97 | 1.00 | 0.80 | 1.00 | 1.00 | 0.05 | 0.05 | 0.05 | 0.95 | 0.95 | 0.95 | 0.86 | 0.50 | 0.35 |
| 19. PWD | 0.57 | 0.96 | 1.00 | 0.78 | 1.00 | 1.00 | 0.05 | 0.05 | 0.05 | 0.95 | 0.95 | 0.95 | 0.88 | 0.51 | 0.36 |
| 20. UWD | 0.59 | 0.82 | 0.99 | 0.79 | 0.97 | 1.00 | 0.05 | 0.01 | 0.01 | 0.95 | 0.99 | 0.99 | 0.86 | 0.70 | 0.49 |
| 21. ANCOVA | 0.77 | 0.93 | 1.00 | 0.92 | 0.99 | 1.00 | 0.04 | 0.01 | 0.01 | 0.94 | 0.99 | 0.99 | 0.71 | 0.61 | 0.46 |
| 22. LME simple | 0.37 | 0.47 | 0.71 | 0.58 | 0.73 | 0.91 | 0.02 | 0.01 | 0.01 | 0.97 | 0.99 | 0.98 | 1.12 | 0.98 | 0.78 |
| 23. LME complex | 0.64 | 0.87 | 1.00 | 0.83 | 0.98 | 1.00 | 0.06 | 0.01 | 0.01 | 0.94 | 0.99 | 0.99 | 0.81 | 0.66 | 0.47 |
| 24. GEE | 0.66 | 0.98 | 1.00 | 0.84 | 1.00 | 1.00 | 0.08 | 0.07 | 0.08 | 0.93 | 0.93 | 0.92 | 0.78 | 0.45 | 0.32 |

UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LME simple: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LME complex: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations, W: weak, M: moderate, S: strong

7.7.2 Medium sized trial (Table 7)

The power of the different methods was higher and the confidence intervals were narrower than in the small sized trial. The performance of the methods with regard to the type-I error rate and the confidence interval coverage was similar to the previous scenario. Figure 13 shows a comparison of the power and type-I error rate of each of the methods depending on the treatment effect in this scenario. The method with the best performance was the PTDF, which always showed a high power, adequate type-I error rate and confidence interval coverage, and the narrowest confidence interval.

Table 7: Comparison of the methods in a trial with 4 teeth and 20 patients

Column headings give the outcome (Power; Type-I: type-I error rate; Coverage: confidence interval coverage; Width: confidence interval width), the treatment effect and the correlation.

| Method of analysis | Power 0.48 W | Power 0.48 M | Power 0.48 S | Power 0.6 W | Power 0.6 M | Power 0.6 S | Type-I 0.0 W | Type-I 0.0 M | Type-I 0.0 S | Coverage 0.6 W | Coverage 0.6 M | Coverage 0.6 S | Width 0.6 W | Width 0.6 M | Width 0.6 S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I. Full data | | | | | | | | | | | | | | | |
| 1. UTP | 0.95 | 0.98 | 1.00 | 0.99 | 1.00 | 1.00 | 0.05 | 0.01 | 0.00 | 0.95 | 0.98 | 1.00 | 0.53 | 0.53 | 0.53 |
| 2. PTP | 0.94 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.05 | 0.06 | 0.05 | 0.94 | 0.95 | 0.94 | 0.54 | 0.42 | 0.30 |
| 3. PWP | 0.93 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.05 | 0.05 | 0.05 | 0.95 | 0.95 | 0.95 | 0.55 | 0.43 | 0.30 |
| 4. UWP | 0.94 | 0.97 | 0.99 | 0.99 | 1.00 | 1.00 | 0.05 | 0.01 | 0.00 | 0.95 | 0.98 | 1.00 | 0.55 | 0.54 | 0.54 |
| 5. UTD | 0.90 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 0.05 | 0.01 | 0.01 | 0.95 | 0.99 | 0.99 | 0.58 | 0.47 | 0.33 |
| 6. PTD | 0.89 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 0.05 | 0.05 | 0.05 | 0.95 | 0.95 | 0.96 | 0.59 | 0.34 | 0.24 |
| 7. PWD | 0.88 | 1.00 | 1.00 | 0.97 | 1.00 | 1.00 | 0.05 | 0.05 | 0.05 | 0.95 | 0.95 | 0.95 | 0.60 | 0.35 | 0.25 |
| 8. UWD | 0.89 | 0.99 | 1.00 | 0.97 | 1.00 | 1.00 | 0.05 | 0.01 | 0.01 | 0.95 | 0.99 | 0.99 | 0.60 | 0.48 | 0.34 |
| 9. ANCOVA | 0.97 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.01 | 0.01 | 0.95 | 0.99 | 0.99 | 0.49 | 0.42 | 0.31 |
| 10. LME simple | 0.79 | 0.97 | 1.00 | 0.94 | 1.00 | 1.00 | 0.01 | 0.00 | 0.00 | 0.98 | 1.00 | 1.00 | 0.73 | 0.61 | 0.43 |
| 11. LME complex | 0.91 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 0.06 | 0.01 | 0.01 | 0.95 | 0.99 | 0.99 | 0.57 | 0.46 | 0.33 |
| 12. GEE | 0.90 | 1.00 | 1.00 | 0.98 | 1.00 | 1.00 | 0.07 | 0.07 | 0.08 | 0.93 | 0.93 | 0.93 | 0.53 | 0.53 | 0.53 |
| II. Collapsed data | | | | | | | | | | | | | | | |
| 13. UTP | 0.94 | 0.93 | 0.92 | 0.99 | 1.00 | 1.00 | 0.05 | 0.00 | 0.00 | 0.95 | 0.99 | 1.00 | 0.56 | 0.32 | 0.23 |
| 14. PTP | 0.93 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.05 | 0.06 | 0.05 | 0.95 | 0.95 | 0.94 | 0.54 | 0.64 | 0.71 |
| 15. PWP | 0.92 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 0.05 | 0.06 | 0.05 | 0.94 | 0.95 | 0.94 | 0.55 | 0.43 | 0.30 |
| 16. UWP | 0.93 | 0.90 | 0.90 | 0.98 | 1.00 | 1.00 | 0.05 | 0.00 | 0.00 | 0.94 | 0.99 | 1.00 | 0.57 | 0.44 | 0.31 |
| 17. UTD | 0.89 | 0.98 | 1.00 | 0.97 | 1.00 | 1.00 | 0.05 | 0.00 | 0.00 | 0.95 | 1.00 | 1.00 | 0.56 | 0.65 | 0.71 |
| 18. PTD | 0.87 | 1.00 | 1.00 | 0.97 | 1.00 | 1.00 | 0.05 | 0.05 | 0.05 | 0.95 | 0.96 | 0.95 | 0.59 | 0.59 | 0.42 |
| 19. PWD | 0.85 | 1.00 | 1.00 | 0.97 | 1.00 | 1.00 | 0.05 | 0.05 | 0.06 | 0.95 | 0.96 | 0.95 | 0.61 | 0.35 | 0.25 |
| 20. UWD | 0.87 | 0.96 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.00 | 0.00 | 0.95 | 1.00 | 1.00 | 0.63 | 0.36 | 0.25 |
| 21. ANCOVA | 0.96 | 0.99 | 1.00 | 0.80 | 0.88 | 0.95 | 0.05 | 0.00 | 0.00 | 0.94 | 1.00 | 1.00 | 0.61 | 0.60 | 0.43 |
| 22. LME simple | 0.60 | 0.65 | 0.83 | 0.98 | 1.00 | 1.00 | 0.03 | 0.01 | 0.02 | 0.96 | 0.99 | 0.98 | 0.50 | 0.52 | 0.40 |
| 23. LME complex | 0.91 | 0.98 | 1.00 | 0.98 | 1.00 | 1.00 | 0.06 | 0.00 | 0.00 | 0.94 | 1.00 | 1.00 | 0.86 | 0.83 | 0.69 |
| 24. GEE | 0.90 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.07 | 0.07 | 0.08 | 0.93 | 0.93 | 0.93 | 0.57 | 0.57 | 0.40 |

UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LME simple: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LME complex: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations, W: weak, M: moderate, S: strong

Figure 13: Power and type-I error rate versus treatment effect in a scenario with 4 teeth and 20 patients

Each panel represents a treatment effect, whereas the colors of the dots represent different correlations among the measurements. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

7.7.3 Large sized trial (Table 8)

All the methods had essentially 100% power. When the correlation among the measurements was weak, all the methods performed appropriately, showing nominal type-I error rate and confidence interval coverage, and narrow confidence intervals. When the correlation was moderate or strong, the methods performed as described in the previous scenarios.

The PTDF was the method with the best performance across correlations. The other paired t-tests performed appropriately in all scenarios as well.

Table 8: Comparison of the methods in a trial with 10 teeth and 50 patients

Column headings give the outcome (Power; Type-I: type-I error rate; Coverage: confidence interval coverage; Width: confidence interval width), the treatment effect and the correlation.

| Method of analysis | Power 0.48 W | Power 0.48 M | Power 0.48 S | Power 0.6 W | Power 0.6 M | Power 0.6 S | Type-I 0.0 W | Type-I 0.0 M | Type-I 0.0 S | Coverage 0.6 W | Coverage 0.6 M | Coverage 0.6 S | Width 0.6 W | Width 0.6 M | Width 0.6 S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I. Full data | | | | | | | | | | | | | | | |
| 1. UTP | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.01 | 0.00 | 0.95 | 0.99 | 1.00 | 0.21 | 0.21 | 0.21 |
| 2. PTP | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.05 | 0.05 | 0.95 | 0.95 | 0.95 | 0.21 | 0.16 | 0.12 |
| 3. PWP | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.05 | 0.04 | 0.95 | 0.95 | 0.95 | 0.22 | 0.17 | 0.12 |
| 4. UWP | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.01 | 0.00 | 0.95 | 0.99 | 1.00 | 0.22 | 0.22 | 0.21 |
| 5. UTD | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.01 | 0.01 | 0.95 | 0.99 | 0.99 | 0.23 | 0.19 | 0.13 |
| 6. PTD | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.05 | 0.05 | 0.95 | 0.95 | 0.95 | 0.23 | 0.13 | 0.09 |
| 7. PWD | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.05 | 0.05 | 0.95 | 0.95 | 0.95 | 0.24 | 0.14 | 0.10 |
| 8. UWD | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.01 | 0.01 | 0.95 | 0.99 | 1.00 | 0.24 | 0.19 | 0.14 |
| 9. ANCOVA | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.01 | 0.00 | 0.95 | 0.99 | 1.00 | 0.19 | 0.17 | 0.13 |
| 10. LME simple | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.01 | 0.00 | 0.00 | 0.99 | 1.00 | 1.00 | 0.29 | 0.25 | 0.17 |
| 11. LME complex | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.01 | 0.01 | 0.95 | 0.99 | 0.99 | 0.23 | 0.19 | 0.13 |
| 12. GEE | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.06 | 0.06 | 0.06 | 0.94 | 0.94 | 0.94 | 0.23 | 0.13 | 0.09 |
| II. Collapsed data | | | | | | | | | | | | | | | |
| 13. UTP | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.00 | 0.00 | 0.95 | 1.00 | 1.00 | 0.21 | 0.34 | 0.41 |
| 14. PTP | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.05 | 0.04 | 0.95 | 0.95 | 0.95 | 0.22 | 0.17 | 0.12 |
| 15. PWP | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.04 | 0.04 | 0.04 | 0.96 | 0.95 | 0.95 | 0.22 | 0.17 | 0.12 |
| 16. UWP | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.00 | 0.00 | 0.95 | 1.00 | 1.00 | 0.22 | 0.35 | 0.41 |
| 17. UTD | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.06 | 0.00 | 0.00 | 0.95 | 1.00 | 1.00 | 0.23 | 0.33 | 0.23 |
| 18. PTD | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.06 | 0.05 | 0.04 | 0.95 | 0.95 | 0.95 | 0.23 | 0.14 | 0.10 |
| 19. PWD | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.05 | 0.04 | 0.96 | 0.96 | 0.95 | 0.24 | 0.14 | 0.10 |
| 20. UWD | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.00 | 0.00 | 0.95 | 1.00 | 1.00 | 0.24 | 0.33 | 0.24 |
| 21. ANCOVA | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.05 | 0.00 | 0.00 | 0.95 | 1.00 | 1.00 | 0.20 | 0.29 | 0.22 |
| 22. LME simple | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.03 | 0.01 | 0.02 | 0.97 | 0.99 | 0.98 | 0.42 | 0.45 | 0.39 |
| 23. LME complex | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.06 | 0.00 | 0.00 | 0.95 | 1.00 | 1.00 | 0.23 | 0.33 | 0.23 |
| 24. GEE | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.06 | 0.06 | 0.06 | 0.94 | 0.94 | 0.94 | 0.23 | 0.13 | 0.09 |

UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LME simple: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LME complex: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations, W: weak, M: moderate, S: strong

7.8 Exploration of another assumption: variation in the treatment effect in each patient

The mean bias of each method and its standard deviation were calculated across the different correlations, numbers of teeth, and treatment effects. Rounded to two decimal places, the mean bias for all cases was 0 (SD=0.01). Thus, the mean square error only reflects the variance of the estimates.

The power depended on the treatment effect and the number of teeth in all methods: the higher these parameters, the higher the power. It also depended on the correlation among the measurements in all methods except the UTPF, UWPF, UTPC, and UWPC. For these four methods the power was very similar whether the correlation among the measurements was weak or strong, while for all the other methods the power was higher when the measurements were strongly correlated. Figure 14 shows the power of each of the methods of analysis versus the treatment effect.

Figure 14: Power versus treatment effect in a scenario with 4 teeth and 20 patients when treatment effect varied from subject to subject.

The top 12 panels show full data analysis, whereas the bottom 12 panels show collapsed data analysis. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

The type-I error rate was dependent on the correlation in most of the methods, except for PTPC, PWPC, PTDC, PWDC, GEEF and GEEC. Most of the methods had an adequate type-I error rate when the correlation among the measurements was weak; nevertheless, the type-I error rate increased or decreased when the correlation was strong (Figure 15). The paired t-tests and Wilcoxon signed-rank tests using full data showed a type-I error rate close to the nominal value when the correlation among the measurements was weak (for all numbers of teeth) and when the correlation was strong and the number of teeth was 2, but the rate increased to values close to 10% when the number of teeth was 10. However, when using collapsed data, the type-I error rate of these tests was appropriate in all scenarios.
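The difference between the full and the collapsed analysis lies in the unit that enters the paired test. The following minimal Python sketch shows the collapsing step followed by a paired t statistic on the side means, so that each patient contributes a single difference. It is an illustration rather than the code used in this study: the record layout, the side labels, the function names and the data values are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def collapse_by_side(records):
    """Average tooth-level values into one mean per (patient, side)."""
    sums = defaultdict(lambda: [0.0, 0])
    for patient, side, value in records:
        cell = sums[(patient, side)]
        cell[0] += value
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def paired_t_statistic(side_means, treated="treated", control="control"):
    """Paired t statistic and degrees of freedom on per-patient differences."""
    patients = sorted({patient for patient, _side in side_means})
    diffs = [side_means[(p, treated)] - side_means[(p, control)] for p in patients]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    return mean_diff / sqrt(var_diff / n), n - 1


# Hypothetical tooth-level data: (patient id, side label, outcome).
records = [
    (1, "treated", 2.0), (1, "treated", 3.0), (1, "control", 3.0), (1, "control", 4.0),
    (2, "treated", 2.5), (2, "treated", 2.5), (2, "control", 4.0), (2, "control", 4.0),
    (3, "treated", 3.0), (3, "treated", 2.0), (3, "control", 3.5), (3, "control", 3.5),
]
side_means = collapse_by_side(records)
t_stat, df = paired_t_statistic(side_means)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

In this sketch the degrees of freedom equal the number of patients minus one, regardless of the number of teeth per side, which keeps the patient as the unit of analysis.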

The coverage was not dependent on the treatment effect and the number of teeth. When the correlation among the measurements was weak, all the methods had coverage close to the nominal value; however, when the correlation was strong, the coverage increased or decreased (Figure 16). The PTPC, PWPC, PTDC and PWDC had a nominal coverage in all scenarios.

Figure 15: Type I error rate versus number of teeth in a scenario with 20 patients when treatment effect varied from subject to subject.

The top 12 panels show full data analysis, whereas the bottom 12 panels show collapsed data analysis. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

Figure 16: Confidence interval coverage versus number of teeth in a scenario with 20 patients

Each panel represents the number of teeth and the colors of the dots represent the correlation among the measurements. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

The confidence interval width was independent of the treatment effect in all methods. It was dependent on the correlation in all methods except the UTPF and UWPF. In general, the confidence intervals became narrower as the number of teeth and the correlation among the measurements increased; however, in the UTPC and UWPC the width was independent of the number of teeth when the correlation among the measurements was strong (see Figure 17).

A comparison of the methods shows that the statistical methods that treat the data as if they were independent, the ANCOVA, and the mixed effects models have a very low type-I error rate when there is a strong correlation among the measurements. In addition, the power of these methods for small treatment effects is lower than that of the other methods (see Figure 18 and Table 9). The confidence interval coverage is at its nominal value when using the PTDC and PWDC, regardless of the correlation among the measurements (Figure 16).

All methods performed appropriately when the correlation among the measurements was weak, whereas the PTDC was the method that performed best when the data were strongly correlated.

Figure 17: Confidence interval width versus number of teeth in a scenario with 20 patients

Each panel represents the number of teeth and the colors of the dots represent the correlation among the measurements. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

Figure 18: Comparison of the type-I error rate and power of the methods of analysis in a scenario with 4 teeth and 20 patients

Each panel represents the treatment effect and the colors of the dots represent the correlation among the measurements. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations

Table 9: Comparison of the methods in a trial with 4 teeth and 20 patients where there is random variability in the treatment effect across patients

Method of analysis    Power           Power           Type-I error    CI coverage     CI width
                      (TE 0.30)       (TE 0.48)       (TE 0.00)       (TE 0.48)       (TE 0.48)
                      Weak    Strong  Weak    Strong  Weak    Strong  Weak    Strong  Weak    Strong

I. Full data
1. UTP                0.58    0.65    0.93    0.99    0.05    0.00    0.95    1.00    0.54    0.53
2. PTP                0.57    0.96    0.92    1.00    0.05    0.06    0.95    0.93    0.55    0.31
3. PWP                0.55    0.95    0.91    1.00    0.05    0.06    0.95    0.94    0.56    0.32
4. UWP                0.57    0.62    0.93    0.99    0.05    0.00    0.96    1.00    0.55    0.54
5. UTD                0.52    0.97    0.88    1.00    0.06    0.02    0.94    0.98    0.59    0.35
6. PTD                0.51    1.00    0.87    1.00    0.06    0.06    0.94    0.93    0.60    0.26
7. PWD                0.49    0.99    0.86    1.00    0.06    0.06    0.94    0.93    0.61    0.26
8. UWD                0.50    0.96    0.86    1.00    0.06    0.02    0.94    0.98    0.61    0.36
9. ANCOVA             0.65    0.98    0.96    1.00    0.06    0.01    0.95    0.98    0.50    0.33
10. LME simple        0.34    0.89    0.78    1.00    0.02    0.00    0.98    1.00    0.73    0.44
11. LME complex       0.53    0.97    0.89    1.00    0.06    0.02    0.94    0.98    0.58    0.34
12. GEE               0.55    0.99    0.88    1.00    0.08    0.07    0.93    0.92    0.57    0.26

II. Collapsed data
13. UTP               0.56    0.28    0.92    0.90    0.05    0.00    0.96    1.00    0.56    0.71
14. PTP               0.53    0.94    0.90    1.00    0.04    0.05    0.96    0.95    0.58    0.33
15. PWP               0.51    0.93    0.89    1.00    0.04    0.05    0.96    0.95    0.59    0.34
16. UWP               0.54    0.28    0.91    0.87    0.04    0.00    0.96    1.00    0.57    0.72
17. UTD               0.50    0.86    0.87    1.00    0.06    0.00    0.95    1.00    0.61    0.44
18. PTD               0.48    0.99    0.85    1.00    0.06    0.05    0.95    0.94    0.63    0.28
19. PWD               0.45    0.99    0.83    1.00    0.06    0.05    0.95    0.94    0.64    0.29
20. UWD               0.47    0.84    0.85    1.00    0.05    0.00    0.95    0.99    0.63    0.45
21. ANCOVA            0.62    0.90    0.95    1.00    0.05    0.00    0.95    1.00    0.52    0.42
22. LME simple        0.37    0.81    0.78    1.00    0.03    0.00    0.98    1.00    0.72    0.48
23. LME complex       0.53    0.89    0.88    1.00    0.07    0.00    0.94    0.99    0.59    0.42
24. GEE               0.55    0.99    0.88    1.00    0.08    0.07    0.93    0.92    0.57    0.26

TE: treatment effect, CI: confidence interval, Weak/Strong: correlation among the measurements. UTP: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LME simple: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LME complex: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations.

7.9 Summary of the results: comparison of the simple and complex methods of data analysis

A comparison between the paired t-test and the best-performing complex linear mixed effects model is shown in Table 10. The power of the simpler method is higher when the correlation among the measurements is strong. In addition, the LMECF has a type-I error rate higher than the nominal value when it is assumed that there is random variation in the treatment effect across subjects, and a very low type-I error rate when it is assumed that the treatment effect is constant across subjects and the correlation among the measurements is strong. However, this low type-I error rate compromises the power, which is 13 percentage points lower than the power of the PTDF. The confidence interval coverage is very high when using the LMECF in both scenarios. Finally, when there is a strong correlation among the measurements, the paired t-test results in narrower confidence intervals.

Table 10: Comparison of the simple and complex methods of data analysis in split-mouth trials

Assumption: The treatment effect is constant across subjects and teeth
Scenario: 2 teeth (1 per side) and 20 patients

Method of analysis    Power           Type I error    CI coverage     CI width
                      (TE 0.48)       (TE 0)          (TE 0.48)       (TE 0.48)
                      Weak    Strong  Weak    Strong  Weak    Strong  Weak    Strong
PTDF                  0.60    1.00    0.05    0.05    0.95    0.96    0.86    0.35
LMECF                 0.64    0.87    0.06    0.01    0.94    0.99    0.81    0.47

Assumption: The treatment effect has random variation across subjects
Scenario: 4 teeth (1 per side) and 20 patients

Method of analysis    Power           Type I error    CI coverage     CI width
                      (TE 0.3)        (TE 0)          (TE 0.3)        (TE 0.3)
                      Weak    Strong  Weak    Strong  Weak    Strong  Weak    Strong
PTDC                  0.48    0.99    0.06    0.05    0.95    0.95    0.62    0.28
LMECF                 0.53    0.97    0.08    0.07    0.94    0.99    0.58    0.34

TE: treatment effect, CI: confidence interval, Weak/Strong: correlation among the measurements. PTDF: paired t-test using pre-post differences and full data, PTDC: paired t-test using pre-post differences and collapsed data, LMECF: complex linear mixed effects model using full data.

8 Discussion

The results of this study show that the performance of the statistical method used to analyze split-mouth clinical trials with continuous outcomes depends on the sample size, the number of teeth analyzed and the correlations among the measurements. As expected, all tests resulted in unbiased treatment effect point estimates; therefore, the mean square error was a reflection of the estimated variance. The power of the different tests was seen to be dependent on the number of teeth, number of patients, treatment effect and correlation among the measurements. The type-I error rate and the confidence interval coverage were associated with the correlation among the measurements in some of the methods, but they were not associated with the treatment effect, number of patients or number of teeth. The confidence interval width depended on the number of patients and number of teeth in all tests, and on the correlation among the measurements in most of the methods; however, it was independent of the treatment effect. Unexpectedly, the paired t-test using the difference between the baseline and post-treatment measurements and full data (PTDF) was the test that performed most appropriately in the different clinical scenarios when it was assumed that the treatment had a constant effect across teeth and subjects. The paired t-test using the difference between the baseline and post-treatment measurements and collapsed data (PTDC) was the test that performed most appropriately when it was assumed that the treatment had a constant effect across teeth but not across subjects.

Authors have been concerned about the statistical analysis of split-mouth trials for more than two decades. In 1988, Hujoel and Moulton published the first study that evaluated the statistical tests used to analyze split-mouth trials in the field of periodontology, and reported that only 23% of the trials had performed an appropriate statistical analysis9. This article was the starting point of publications regarding validity issues in split-mouth trials, and articles regarding their efficiency12, design10, 11, 14, adequate data analysis10, 14, 16-19 and reporting7, 13 can be found in the literature up to very recently. All these articles base their discussion on three main aspects: 1) the potential for bias caused by the confounding of treatment effects with carry-across effects, 2) the threat to the recruitment of patients caused by the need to enroll patients with a symmetrical oral disease distribution, which in addition jeopardizes the generalizability of the results, and 3) the need to consider the paired nature of the data when analyzing the results. The last of these issues was the motivation for this study.

The aim of this study was to determine whether different statistical methods for analyzing data from a computer-simulated split-mouth randomized clinical trial with continuous outcomes yield different results in terms of treatment effect estimate, mean square error and confidence interval width; and to evaluate the performance of the methods based on these outcomes and also in terms of their power, type-I error rate and confidence interval coverage. The field of periodontology was chosen because this is the field in dentistry in which most of the split-mouth trials are performed7. Since probing depth is one of the main outcomes of interest in periodontal trials32, 33, 137, it was the outcome of interest for this study as well; however, our results are applicable to other continuous outcomes in periodontal trials, such as clinical attachment level and gingival recession; and in other fields in dentistry in which split-mouth trials with continuous outcomes are used, such as orthodontics and oral and maxillofacial surgery.

A simulation study is the only design that allows the comparison of the performance of different statistical methods in relation to a known truth20, 21, 101. In addition, this design enables control over the underlying study characteristics, and these characteristics can be varied to determine their influence on the performance of the methods110. One of the main issues of this design, however, is that it is a theoretical approach, and that a distribution of the data and values of different parameters must be assumed, which could endanger the applicability of the results.

In order to obtain results that were applicable to clinical practice, a range of values was chosen for the different parameters. Three different sets of correlations representing a weak, moderate and strong correlation among the measurements, seven treatment effect values, three different numbers of teeth per side, and different numbers of patients were used. Every possible combination of these trial characteristics was explored, which led to a total of 315 simulated clinical scenarios. In addition, 54 scenarios were simulated in order to explore another assumption. All the decisions regarding these characteristics were based on published literature73, 82, 83, 122-126, 128, 138, 139 and clinical experience; nevertheless, the published literature concerning the correlation among periodontal measurements was very scarce, and provided limited information of interest. For this reason, attempts were made to base this decision on real clinical data; however, such data were not available.
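The factorial structure of the scenario grid described above can be sketched as follows. This is only an illustration: the level counts for correlation (3), treatment effect (7) and teeth per side (3) come from the text, the count of five patient-number levels is inferred from 315 / (3 × 7 × 3), and every level label is a hypothetical placeholder rather than a value used in the thesis.

```python
from itertools import product

# Level counts follow the text; the labels themselves are hypothetical placeholders.
correlations = ["weak", "moderate", "strong"]                   # 3 correlation sets
treatment_effects = [f"effect_{i}" for i in range(1, 8)]        # 7 treatment effect values
teeth_per_side = [f"teeth_level_{i}" for i in range(1, 4)]      # 3 numbers of teeth per side
patient_levels = [f"patients_level_{i}" for i in range(1, 6)]   # 5 levels, inferred from 315 / 63

# Every possible combination of the trial characteristics is one scenario.
scenarios = list(product(correlations, treatment_effects, teeth_per_side, patient_levels))
n_scenarios = len(scenarios)
print(n_scenarios)
```

Enumerating the grid this way makes it easy to confirm that a fully crossed design with these level counts yields the 315 scenarios reported.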

The statistical methods assessed in this study were not only those recommended by the authors to analyze data from split-mouth studies, but also ones that are used in published split-mouth trials and considered inappropriate by these authors. In addition, when using those methods other issues regarding data analysis were addressed, such as the approach for analyzing pre-post data, and the use of individual measurements or summary measures per unit of randomization. Moreover, the manner in which a statistical analysis would be performed in practice was always considered when defining the different methods and approaches, especially when constructing the models for the linear mixed effects and generalized estimating equation analyses. A total of 24 different statistical methods were used to analyze each of the generated datasets, which allowed the comparison of the methods.

The outcomes chosen aimed to compare the statistical methods not only in terms of their statistical performance, but also regarding their potential influence on the estimates that are interpreted by clinicians and used for decision-making purposes. The bias and the confidence interval width were assessed with the latter aim. The power and type-I error rate of each of the methods were important outcomes as well, because hypothesis testing is the most common approach when making inferences in dentistry.

As previously stated, none of the statistical methods was expected to result in biased estimates of the treatment effect. Since the estimation of the treatment effect is based on comparing means, the final value will be the same whether this comparison is made by averaging all the data of the treated sides and all the data of the untreated sides and then subtracting one average from the other (the method that an unpaired test uses), or by subtracting one observation from the other in each pair of data and then averaging these differences (the method that a paired test uses). If missing data were present, those two values could differ; however, there were no missing data in the datasets generated for these simulations.
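The equality of the two point estimates can be checked numerically. The sketch below uses made-up data (20 hypothetical patients, one value per side, arbitrary means and standard deviations) and is not the simulation code of the thesis; it only illustrates that the difference of means equals the mean of the paired differences when the data are complete, and that the two can diverge once one observation is missing.

```python
import random
from statistics import mean

random.seed(1)
n_patients = 20  # hypothetical sample size

# Hypothetical outcome values (mm), one per patient and side.
treated = [random.gauss(1.5, 0.8) for _ in range(n_patients)]
control = [random.gauss(1.0, 0.8) for _ in range(n_patients)]

# Unpaired approach: average each side, then subtract the averages.
unpaired_estimate = mean(treated) - mean(control)

# Paired approach: subtract within each patient, then average the differences.
paired_estimate = mean([t - c for t, c in zip(treated, control)])

# With one control value missing, the unpaired approach still uses all treated
# values, whereas the paired approach can only use the complete pairs.
unpaired_missing = mean(treated) - mean(control[1:])
paired_missing = mean([t - c for t, c in zip(treated[1:], control[1:])])

print(unpaired_estimate, paired_estimate)
print(unpaired_missing, paired_missing)
```

The two complete-data estimates agree up to floating-point rounding, so any difference between paired and unpaired methods lies in the estimated variance and not in the point estimate.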

The overall power was similar across the different methods. It has been claimed that the use of an inappropriate analysis, e.g., an analysis that does not consider the paired nature of the data, leads to a higher type-II error rate and consequently, to an important loss of power9. The results of this study show that, when considering the mean power of each method across all scenarios, the difference between the power of the paired methods and the unpaired ones ranges from 3% to 9%. These differences are not as high as they could be, which may be due to the fact that the unpaired analyses consider all observations as independent, which results in a higher effective sample size and thus in a power increase. This argument is supported by the fact that the power was dependent on the sample size, showing higher values when the number of patients or number of teeth increased, all other conditions being the same (see Appendix 3). In addition, when observing the mean power across scenarios by correlation, the unpaired methods had higher power than the paired methods when the correlation among the measurements was weak, whereas the paired methods had from 5% to 10% more power when the correlation among the measurements was strong. Since the paired methods consider the clustering of the observations in the calculation of the variances, it was not surprising that they had higher power than the unpaired methods, which overestimate the variance when the correlation among the measurements is strong, as described in section 2.5.

The type-I error rates were lower than the nominal value for some of the methods, particularly the methods that did not consider the paired nature of the data, when the correlation among the measurements was moderate or strong. This means that p-values above the significance level of 0.05 were observed in more than 95% of the simulations. High p-values are associated with low values of the t statistic. Since all methods are essentially unbiased, the numerator of the t statistic is the same for all methods, and a small t statistic results from a large standard error of the treatment effect. Standard errors are a reflection of the variance. When the data are correlated, the true variance of the difference between the treatment means depends on this correlation, so the estimated variance is incorrect when the correlation is not taken into account. Unpaired tests treat observations as if they were independent from each other; therefore, when the measurements are positively correlated, the estimated variance is higher than the true variance, standard errors are higher, and t statistics are lower.
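The argument can be written out with the standard variance formula for a difference of two correlated means. The notation is an assumption introduced here for illustration and covers only the simplest case of one measurement per side: the treated-side and control-side means over n patients are written as \bar{Y}_T and \bar{Y}_C, both sides share the variance \sigma^2, and \rho is the within-patient correlation.

```latex
% True variance of the estimated treatment effect (paired data, correlation \rho)
\operatorname{Var}\left(\bar{Y}_T - \bar{Y}_C\right)
  = \frac{\sigma^2}{n} + \frac{\sigma^2}{n} - \frac{2\rho\,\sigma^2}{n}
  = \frac{2\sigma^2\,(1 - \rho)}{n}

% Variance assumed by an unpaired analysis (covariance term ignored)
\operatorname{Var}_{\text{unpaired}}\left(\bar{Y}_T - \bar{Y}_C\right)
  = \frac{2\sigma^2}{n}

% Ratio of assumed to true variance
\frac{\operatorname{Var}_{\text{unpaired}}}{\operatorname{Var}}
  = \frac{1}{1 - \rho}
```

Under these assumptions, a correlation of 0.8 would make the unpaired variance five times larger than the true variance, which is consistent with the very low type-I error rates observed for the unpaired methods when the correlation is strong.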

The higher values of the variance estimated for the unpaired methods are also reflected in the high values of confidence interval coverage observed. The coverage of a confidence interval should be approximately equal to the nominal coverage rate20, 140. Over-coverage suggests that the results are too conservative20, i.e., the observed confidence intervals are too wide, and thus the proportion of them containing the true treatment effect value is too high. Since confidence interval estimation depends on the standard error of the treatment effect, it can be understood why the methods that do not consider the pairing of the data and estimate higher values of variance would result in wider confidence intervals.

These wider confidence intervals were observed when analyzing the mean confidence interval width of each method. Resembling what was observed in the type-I error rates and confidence interval coverage, the higher the correlation among the measurements, the wider the confidence intervals estimated using the unpaired methods, because of the estimation of higher variances. In contrast, the methods that considered the paired structure of the data resulted in narrower confidence intervals as the correlation among the measurements increased.
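The pattern described in the preceding paragraphs (low type-I error rates and wide confidence intervals for unpaired methods under strong correlation) can be reproduced with a small Monte Carlo sketch. This is not the simulation used in the thesis: it assumes one measurement per side, 20 patients, unit variances, a true treatment effect of zero, and two illustrative correlation values (0.1 and 0.8). The critical values are the usual two-sided 5% t quantiles for 19 and 38 degrees of freedom.

```python
import math
import random
from statistics import mean, stdev

T_CRIT_PAIRED = 2.093    # 0.975 quantile of t with 19 degrees of freedom
T_CRIT_UNPAIRED = 2.024  # 0.975 quantile of t with 38 degrees of freedom


def simulate(rho, n_patients=20, n_sims=2000, seed=0):
    """Return {method: (type-I error rate, mean CI width)} under a null effect."""
    rng = random.Random(seed)
    rejections = {"paired": 0, "unpaired": 0}
    widths = {"paired": 0.0, "unpaired": 0.0}
    for _ in range(n_sims):
        treated, control = [], []
        for _ in range(n_patients):
            shared = rng.gauss(0, 1)  # patient-level component shared by both sides
            treated.append(math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1))
            control.append(math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1))
        estimate = mean(treated) - mean(control)
        # Paired standard error: based on the within-patient differences.
        se_paired = stdev([t - c for t, c in zip(treated, control)]) / math.sqrt(n_patients)
        # Unpaired (pooled, equal n) standard error: ignores the pairing.
        se_unpaired = math.sqrt((stdev(treated) ** 2 + stdev(control) ** 2) / n_patients)
        for name, se, crit in (("paired", se_paired, T_CRIT_PAIRED),
                               ("unpaired", se_unpaired, T_CRIT_UNPAIRED)):
            half_width = crit * se
            rejections[name] += abs(estimate) > half_width  # CI excludes the true value 0
            widths[name] += 2 * half_width
    return {name: (rejections[name] / n_sims, widths[name] / n_sims) for name in rejections}


weak = simulate(rho=0.1)
strong = simulate(rho=0.8)
print("weak  :", weak)
print("strong:", strong)
```

Because the true effect is zero, the coverage of each interval is one minus the reported rejection rate. In this simplified setting the paired analysis stays near the nominal 5% error rate and its intervals narrow as the correlation grows, while the unpaired analysis becomes increasingly conservative.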

The use of models that consider the paired and the nested structure of the data has been suggested in the literature14. Theoretically, these models should result in adequate variance estimates. Two models were tested in this study. The first was a simple model in which the predictor variables were the treatment, the time of the measurement (baseline or after treatment), the interaction between these two, and a random effect or clustering variable that accounted for the nested nature of the teeth within subject; its parameters were estimated by using a linear mixed effects model and generalized estimating equations. The second was a complex model that included a random effect for the tooth and the patient, and that considered that the tooth was nested within a patient; it was tested using a linear mixed effects model. The results showed that these models did not perform better than the simplest approaches, such as the paired t-test using only the post-treatment measurements, illustrating that a complex statistical method does not lead to better results if it is not specified correctly.
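One way to write the two mixed effects models described above is sketched below. The notation is an assumption made for illustration and is not taken from the thesis: y_{ijt} is the probing depth of tooth j in patient i at time t, trt and time are indicator variables, u_i is a patient-level random intercept and v_{ij} is a random intercept for the tooth nested within the patient. The fixed part of the complex model is assumed here to be the same as that of the simple model.

```latex
% Simple model: fixed effects for treatment, time and their interaction,
% plus a random intercept for the patient
y_{ijt} = \beta_0 + \beta_1\,\mathrm{trt}_{ij} + \beta_2\,\mathrm{time}_{t}
        + \beta_3\,\mathrm{trt}_{ij}\,\mathrm{time}_{t} + u_i + \varepsilon_{ijt},
\qquad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ijt} \sim N(0, \sigma^2)

% Complex model: additional random intercept for the tooth nested within the patient
y_{ijt} = \beta_0 + \beta_1\,\mathrm{trt}_{ij} + \beta_2\,\mathrm{time}_{t}
        + \beta_3\,\mathrm{trt}_{ij}\,\mathrm{time}_{t} + u_i + v_{ij} + \varepsilon_{ijt},
\qquad v_{ij} \sim N(0, \sigma_v^2)
```

In this formulation the treatment effect of interest is the interaction coefficient \beta_3, and the two models differ only in how much of the within-patient correlation structure the random effects are allowed to absorb.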

It is very important to consider that the interpretation of the results should be made not only in the context of the statistical performance of the methods, but also considering a clinical perspective. In order to interpret the findings of this study in terms of their clinical implication, the clinical significance of the treatment effects should be considered.

There is no agreement about the standards to quantify clinical significance in periodontal trials65, 66. Periodontal diseases are site specific; thus, the clinical significance of a response to a therapy should be established depending on the type of defect and disease66, 141. The statement of the American Academy of Periodontology establishes as a desired outcome of periodontal disease treatment the “reduction of probing depths”32, 33, with no further guidance regarding what the minimal magnitude of this reduction should be. Hujoel discusses this issue by providing an example of two clinical trials with the same results that were interpreted differently by the expert panels responsible for drug approval65. Both trials reported a statistically significant reduction in probing depth of 0.2 mm; however, the results of the first trial142 were considered clinically insignificant, whereas the results of the second trial143 were considered clinically significant.

This controversy translates into a difficulty when interpreting results, because it makes it hard to judge which of the treatment effects explored has enough clinical importance to be considered when comparing the statistical methods. This adds to the difficulties of measuring probing depth, the outcome used in this study, in real clinical trials. The reliability of the measurements of probing depths has been explored in many clinical studies, and it has been known for a long time that probing measurements may present measurement errors derived from the instruments, patients, examiner and disease status67-70, 121, 144-149. It has been reported that the standard error of the measurements of probing depth is 0.45 mm70, and that the measurements can be reproduced within 0.5 mm 75% of the time69 and within 1 mm over 90% of the time67, 69. Therefore, measurement error would make it harder to detect a difference between treatments, masking or underestimating real treatment effects.

From a clinical standpoint, the interpretation of the results of this study, and the decision of which of the methods should be preferred in order to have the most reliable treatment effect estimates, depends on the knowledge of the periodontal disease and what assumptions seem to represent it better. For example, if according to clinical experience it is reasonable to assume that the correlation among periodontal measurements of a patient is very low, then any of the statistical methods evaluated in this study would perform appropriately, and the use of one over the others would not result in different conclusions. This is due to the fact that very low correlations mean essentially that the observations are independent, and thus it is reasonable to analyze the data as if they are. In contrast, if it must be assumed that the correlation among the measurements is strong, and that there is a constant effect of the treatment across teeth, then a paired t-test using pre-post differences and full data, or any of the other methods that performed similarly, such as the Wilcoxon signed-rank test using pre-post differences and full data, the paired t-test using pre-post differences and collapsed data, and the Wilcoxon signed-rank test using pre-post differences and collapsed data should be used.

The issue of the statistical analysis when performing studies in multiple body parts has been addressed in other medical areas as well3-6. In 1998, Murdoch et al.5 reviewed the statistical methods used in articles published in the British Journal of Ophthalmology, and concluded that many of the approaches failed to consider all the data available and that this may lead to inappropriate conclusions. In 2006, Bryant et al.3 aimed to determine the frequency of inappropriate inclusion of non-independent limb or joint observations in clinical studies, and concluded that a high proportion of clinical studies used multiple observations inappropriately. The authors discuss that this may lead to biased results. Both studies, however, address the issue of an inappropriate statistical analysis only from a theoretical point of view.

There are two studies published in the field of ophthalmology that have tried to quantify the impact of those inappropriate analyses on the results of a trial. In 1987, Newcombe and Duff6 performed a simulation study in which they tested the false positive rate of independent t-tests for analyzing paired data, and they found that this rate was 0.2. In 2000, Cheng et al.4 observed that, when analyzing a real dataset, there were discrepancies between the different methods and that the generalized estimating equations had increased precision. The situation in dentistry is similar to what was just described: many articles discuss the potential effect of conducting an incorrect statistical analysis; nevertheless, there are no publications quantifying the actual effect of a so-called incorrect analysis in the results of a study.

A limitation of this study is that many assumptions were made, and that the validity of the results depends on the validity of those assumptions. In addition, only continuous outcomes were considered in this study, and thus, these results are not applicable to areas in which binary or count outcomes are the outcomes of interest, such as caries. The exploration of scenarios that depart from the assumptions used, such as the presence of outliers, outcome measurements that are not normally distributed, and missing data, is a research area that could be developed.

The characteristics of the type of periodontology trial we simulated in this study make missing data unlikely. It has been suggested that probing depth, the outcome used in this simulation, is associated with an increased risk of tooth loss; however, survival analysis has shown that the risk of losing a tooth with a high value of probing depth (>7 mm) is almost the same as the risk of losing a tooth with a low value of probing depth (<4 mm), up to the third year after receiving periodontal treatment150, 151. This may be due to the fact that teeth with a bad prognosis are extracted before the periodontal treatment is administered, and thus all teeth that receive treatment have high probabilities of remaining in the mouth for a period of time longer than the follow-up period used in most of the clinical trials. The cohort study performed by McGuire and Nunn shows that only 20% of the teeth with high values of probing depth in one site are lost at 6 years after the periodontal treatment, and that differences in probing depth may only be an important predictor of tooth loss at this time point151. In addition, since both units of observation are within the same individual, a dropout of an individual from a split-mouth trial results in missing data for both groups under comparison, which would not bias the estimates of the treatment effect but could affect the estimates of the variance of this effect. Therefore, the influence of missing data on the different statistical methods was not explored, because this scenario is not common in practice.

The results of this study are partially in agreement with previous literature on the topic. Even though we agree that paired tests should be preferred over unpaired tests, the unpaired tests do not perform as badly as suggested by other authors. Moreover, the rate of false positive results obtained when using unpaired tests was very low when compared with the one reported in the simulation study performed in ophthalmology6. These differences may be due to the outcomes or the standard deviations used in this study; however, since the report of the study in question lacks valuable information, no statements can be made regarding the cause of the difference. In addition, the assumptions made to obtain these results must be considered, since they may be the cause of the unexpected findings, for example, that a paired t-test seems to perform better from a statistical point of view than a complex linear mixed effects model. Further exploration of other scenarios and other assumptions will allow a better understanding of the results obtained using different statistical approaches for analyzing data from split-mouth randomized clinical trials.

9 Conclusions

The performance of the statistical methods used to analyze the data from split-mouth trials depends mainly on the correlation among the data. When there is a weak correlation among the data, the performance of the methods differs slightly from a statistical point of view, but this would not affect the overall conclusions of a study. When there is a moderate or strong correlation among the measurements, the performance of the statistical methods that do not consider this correlation is inferior to the performance of the statistical methods that do consider it. This is reflected in wider confidence intervals, which may influence the judgment regarding the clinical significance of the results. Under the assumptions that probing depth measurements are normally distributed, and that the treatment effect is constant across teeth, the paired t-test using the differences between baseline and post-treatment measurements and full data is the only method that performs appropriately in all of the scenarios; therefore, if those assumptions are judged as clinically reasonable, that is the statistical method that should be preferred. Under the assumptions that probing depth measurements are normally distributed, and that the treatment effect is constant across teeth but has a random variation across patients, the paired t-test using the differences between baseline and post-treatment measurements and collapsed data is the only method that performs appropriately in all of the scenarios; therefore, if those assumptions are judged as clinically reasonable, that is the statistical method that should be preferred. The performance of the more complex statistical approaches suggested in the literature is conditioned by the model specification; therefore, such methods should only be used when there is certainty in this regard.

Appendix 1: R program used to generate the data

#Function for generating the dataset

library(MASS)  #provides mvrnorm()

#Defining the parameters
sdteeth <- 0.6
npatients.list <- c(10, 20, 30, 40, 50)
btmt.list <- sdteeth*c(0, 0.2, 0.5, 0.8, 1, 1.2, 1.5)
rho.list <- matrix(c(0,0,0.4,0.4,0.2,0.6,0.7,0.6,0.8), ncol=3, byrow=TRUE)
nteeth.list <- c(2, 4, 10)

dataset.parts <- function(npatients, nteeth, muteeth, sdteeth, btime, btmt, rhobs,
                          rhobd, rhowd){

  #Creating the vector with the sds
  sds <- rep(sdteeth, nteeth*2)

  #Creating the mus vector
  mupre <- rep(0, nteeth)
  #right side is the treated side (tooth number is > 0.5 * the number of teeth)
  mupost <- mupre+btime+btmt*((1:nteeth)>nteeth/2)
  mus <- c(mupre, mupost)

  #Creating the matrix of correlations

  #Matrix of correlations between teeth at occasion 1
  C11 <- matrix(ncol=nteeth, nrow=nteeth)
  C11[row(C11)==col(C11)] <- 1     #when they are equal put 1
  C11[row(C11)!=col(C11)] <- rhobs #when they are different put rhobs

  #Matrix of correlation at occasion 2
  C22 <- C11

  #Matrix of correlation between occasions
  C12 <- matrix(ncol=nteeth, nrow=nteeth)
  C12[row(C12)==col(C12)] <- rhowd
  C12[row(C12)!=col(C12)] <- rhobd

  #Joining the matrices (in columns and then in rows)
  mc1 <- cbind(C11, C12)
  mc2 <- cbind(C12, C22)
  mc <- rbind(mc1, mc2)

  #Creating the variance-covariance matrix
  vcm <- t(mc*sds)*sds

  #Creating the variable for patient id
  #The first nteeth*2 obs belong to patient 1, the second nteeth*2 obs belong to
  #patient 2 and so on
  pid <- rep(1:npatients, each=nteeth*2)
  pid <- factor(pid)

  #Creating the variable for tooth id
  #The 1st, nteeth+1th, nteeth*2+1th and so on obs correspond to tooth 1, the 2nd,
  #nteeth+2th, nteeth*2+2th and so on obs belong to tooth 2, etc
  tid <- rep(1:nteeth, npatients*2)
  tid <- factor(tid)

  #Creating the variable for time and making it a factor
  #Need nteeth 1s, nteeth 2s, nteeth 1s, nteeth 2s and so on
  time <- rep(c("b", "a"), npatients, each=nteeth)
  time <- factor(time, levels=c("b","a"))

  #Creating the variable for side and making it a factor
  #Need nteeth/2 ls, nteeth/2 rs, nteeth/2 ls, nteeth/2 rs, and so on
  side <- rep(c("l", "r"), npatients*2, each=nteeth/2)
  side <- factor(side, levels=c("l","r"))

  #Creating the data frame
  d <- list(d.preds=data.frame(pid=pid,
                               tid=tid,
                               time=time,
                               side=side),
            mus=mus,
            vcm=vcm)

  #Returning the dataframe
  d
}

dataset <- function(npatients, mus, vcm, d.preds, round.it=FALSE){

  #Generating the data
  data <- mvrnorm(npatients, mus, vcm)

  #Rearranging the data of pd in the form y11,y12,.., y1nt*2,..,ynpnt*2
  pd <- c(t(data))
  if(round.it){
    #round PD values to the nearest 0.5 mm
    pd <- round(pd*2,0)/2
  }

  #Creating the data frame
  d <- cbind(pd=pd, d.preds)

  #Returning the dataframe
  d
}
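The block structure of the correlation matrix assembled in dataset.parts can be checked in isolation. The following sketch rebuilds that matrix for four teeth and one illustrative set of correlations, draws a large sample with mvrnorm(), and compares the empirical correlations with the target. The variable names mirror those of Appendix 1, but the specific correlation values, the number of draws, and the seed are assumptions chosen for this example.

```r
library(MASS)  # provides mvrnorm(); MASS is a recommended R package

nteeth  <- 4    # assumed number of teeth per patient
sdteeth <- 0.6  # standard deviation of the measurements, as in Appendix 1
rhobs   <- 0.4  # assumed correlation: different teeth, same occasion
rhobd   <- 0.2  # assumed correlation: different teeth, different occasions
rhowd   <- 0.6  # assumed correlation: same tooth, different occasions

# Within-occasion block: 1 on the diagonal, rhobs elsewhere
C11 <- matrix(rhobs, nrow = nteeth, ncol = nteeth)
diag(C11) <- 1

# Between-occasion block: rhowd on the diagonal, rhobd elsewhere
C12 <- matrix(rhobd, nrow = nteeth, ncol = nteeth)
diag(C12) <- rhowd

# Full correlation matrix for (baseline teeth, post-treatment teeth)
mc <- rbind(cbind(C11, C12), cbind(C12, C11))

# Scale rows and columns by the standard deviations to get the covariance matrix
sds <- rep(sdteeth, nteeth * 2)
vcm <- t(mc * sds) * sds

# A valid covariance matrix must have strictly positive eigenvalues
eigenvalues <- eigen(vcm, symmetric = TRUE, only.values = TRUE)$values

# One row per simulated patient, one column per tooth-by-occasion measurement
set.seed(2012)
x <- mvrnorm(20000, mu = rep(0, nteeth * 2), Sigma = vcm)

# Largest absolute gap between empirical and target correlations
max_abs_error <- max(abs(cor(x) - mc))
max_abs_error
```

Checking the eigenvalues before simulating is a design choice of this sketch: some combinations of the three correlations do not yield a valid covariance matrix, and a check of this kind flags them before any data are drawn.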


Appendix 2: R programs used to analyze the data

#Collapsing the data: calculating the mean per side per patient and creating the other
#variables to have the collapsed dataset

collapse.d <- function(d, d.predcoll){
  pdcoll <- with(d,
                 tapply(pd, list(pid, side, time), mean))
  dcol <- data.frame(pdcoll=c(pdcoll), d.predcoll)
  dcol
}

collapse.pred <- function(npatients, nteeth){

  pidcoll <- rep(1:npatients, 4)
  pidcoll <- factor(pidcoll)
  sidecoll <- rep(c("l", "r"), each=npatients, length.out=npatients*4)
  sidecoll <- factor(sidecoll, levels=c("l","r"))
  timecoll <- rep(c("b", "a"), each=npatients*2)
  timecoll <- factor(timecoll, levels=c("b","a"))

  predcoll <- data.frame(pidcoll=pidcoll,
                         sidecoll=sidecoll,
                         timecoll=timecoll)
  predcoll
}


#Uncollapsed data analysis

#I. Post treatment measurements

#Unpaired t-test

unpttestuncolpost <- function(d){
  choosetime <- d$time=="a"
  chooseside <- d$side=="r"
  result <- t.test(d$pd[choosetime&chooseside], d$pd[choosetime&!chooseside], paired=FALSE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- result$estimate
  est <- est[1] - est[2]
  c(pval, ci, est)
}

#Paired t-test

pttestuncolpost <- function(d){
  choosetime <- d$time=="a"
  chooseside <- d$side=="r"
  result <- t.test(d$pd[choosetime&chooseside], d$pd[choosetime&!chooseside], paired=TRUE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- result$estimate
  c(pval, ci, est)
}

#Wilcoxon signed rank test

pwilcoxuncolpost <- function(d){
  choosetime <- d$time=="a"
  chooseside <- d$side=="r"
  result <- wilcox.test(d$pd[choosetime&chooseside], d$pd[choosetime&!chooseside],
                        paired=TRUE, conf.int=TRUE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- NA
  c(pval, ci, est)
}

#Wilcoxon rank sum test

uwilcoxuncolpost <- function(d){
  choosetime <- d$time=="a"
  chooseside <- d$side=="r"
  result <- wilcox.test(d$pd[choosetime&chooseside], d$pd[choosetime&!chooseside],
                        paired=FALSE, conf.int=TRUE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- NA
  c(pval, ci, est)
}

#II. Pre-post differences

#Creating a dataframe with the differences

uncoldiff <- function(d){
  before <- d[d$time=="b", "pd"]
  after <- d[d$time=="a", "pd"]
  ppdu <- after-before
  side <- d$side[d$time=="b"]
  pid <- d$pid[d$time=="b"]
  uddiff <- data.frame(before=before,
                       after=after,
                       ppdu=ppdu,
                       side=side,
                       pid=pid)

  uddiff
}

#Unpaired t test

unpttestuncoldiff <- function(d){
  ppduleft <- d$ppdu[d$side=="l"]
  ppduright <- d$ppdu[d$side=="r"]
  result <- t.test(ppduright, ppduleft, paired=FALSE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- result$estimate
  est <- est[1] - est[2]
  c(pval, ci, est)
}


#Paired t test

pttestuncoldiff <- function(d){
  ppduleft <- d$ppdu[d$side=="l"]
  ppduright <- d$ppdu[d$side=="r"]
  result <- t.test(ppduright, ppduleft, paired=TRUE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- result$estimate
  c(pval, ci, est)
}

#Wilcoxon signed rank test

pwilcoxuncoldiff <- function(d){
  ppduleft <- d$ppdu[d$side=="l"]
  ppduright <- d$ppdu[d$side=="r"]
  result <- wilcox.test(ppduright, ppduleft, paired=TRUE, conf.int=TRUE)

  pval <- result$p.value
  ci <- result$conf.int
  est <- NA
  c(pval, ci, est)
}


#Wilcoxon rank sum test

uwilcoxuncoldiff <- function(d){
  ppduleft <- d$ppdu[d$side=="l"]
  ppduright <- d$ppdu[d$side=="r"]
  result <- wilcox.test(ppduright, ppduleft, paired=FALSE, conf.int=TRUE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- NA
  c(pval, ci, est)
}


#III. Post controlling for pre

#ANCOVA

ancovauncoll <- function(d){
  choosetime <- d$time=="a"
  chooseside <- d$side=="r"
  result <- summary(lm(d$pd[choosetime]~d$side[choosetime]+d$pd[!choosetime]))
  pval <- result$coef[2,4]
  est <- result$coef[2,1]
  se <- result$coef[2,2]
  df.resid <- result$df[2]
  ci <- est + c(-1,1)*qt(0.975, df=df.resid)*se
  c(pval, ci, est)
}


#Repeated measurements

#Mixed model

library(nlme)  #provides lme()
library(lme4)  #provides lmer()

#Simple

mixeduncollsim <- function(d){
  preresult <- try(lme(pd~side+time+side*time, data=d, random=~1|pid), silent=TRUE)
  #try() returns an object of class "try-error" when the model fit fails
  if(!inherits(preresult, "try-error")){
    result <- summary(preresult)
    pval <- result$tTable[4,5]
    est <- result$tTable[4,1]
    se <- result$tTable[4,2]
    df <- result$tTable[4,3]
    ci <- est + c(-1,1)*qt(0.975, df=df)*se
    c(pval, ci, est)
  }
  else{
    c(NA, NA, NA, NA)
  }
}

#Complex

mixeduncollcom <- function(d){
  preresult <- try(lmer(pd~side+time+side*time+(1|pid)+(1|pid:tid),
                        data=d), silent=TRUE)
  #try() returns an object of class "try-error" when the model fit fails
  if(!inherits(preresult, "try-error")){
    result <- coeffun(preresult)
    pval <- result[4,4]
    est <- result[4,1]
    se <- result[4,2]
    ci <- est + c(-1,1)*qnorm(0.975)*se
    c(pval, ci, est)
  }
  else{
    c(NA, NA, NA, NA)
  }
}
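The repeated-measurements functions above read the treatment effect from the fourth row of the coefficient table, which corresponds to the side-by-time interaction. As a hedged illustration of what that coefficient represents, the sketch below fits the same fixed-effects formula by ordinary least squares on toy data and compares the interaction estimate with the difference between sides in mean pre-post change. lm() is used here only to keep the example self-contained in base R; it ignores the within-patient correlation that the mixed models account for, so its standard errors are not comparable with theirs. The sample size, effect sizes, and seed are assumptions made for this example.

```r
# Illustrative toy data only; all numeric settings below are assumptions.
set.seed(7)
d <- expand.grid(pid = 1:15, side = c("l", "r"), time = c("b", "a"))
d$side <- factor(d$side, levels = c("l", "r"))  # "l" = control, "r" = treated
d$time <- factor(d$time, levels = c("b", "a"))  # "b" = before, "a" = after

# Probing depth falls by 1 mm on both sides and by a further 0.5 mm on the treated side
d$pd <- 5 - 1.0 * (d$time == "a") -
  0.5 * (d$time == "a" & d$side == "r") +
  rnorm(nrow(d), sd = 0.6)

# Same fixed-effects formula as in the appendix functions
fit <- lm(pd ~ side + time + side * time, data = d)
coefs <- summary(fit)$coefficients

# Row 4 is the side-by-time interaction
interaction_name <- rownames(coefs)[4]
interaction_est  <- coefs[4, 1]

# Difference between sides in mean change (after minus before)
cell <- tapply(d$pd, list(d$side, d$time), mean)
did <- (cell["r", "a"] - cell["r", "b"]) - (cell["l", "a"] - cell["l", "b"])

c(interaction = interaction_est, difference_in_changes = did)
```

Because the model with both main effects and their interaction is saturated for the four side-by-time cells, the two quantities agree, which is why the interaction row can be interpreted as the treatment effect on the change in probing depth.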

#GEE

#simple

geeuncolsim <- function(d){
  d <- d[order(d$pid),]
  result <- summary(mygee(pd ~ side + time + side*time, id = pid, data = d, corstr = "exchangeable"))
  est <- result$coef[4,1]
  se <- result$coef[4,4]
  z <- result$coef[4,5]
  pval <- 2*(1 - pnorm(abs(z)))
  ci <- est + c(-1.96, 1.96)*se
  c(pval, ci, est)
}

#complex

#geeuncolcom<- function(d){

#geeglm(pid~time+side+time*side+(1|pid)+(1|pid:tid), data=d)
#Error in model.frame.default(formula = pid ~ time + side + time * side +  :
# variable lengths differ (found for '1 | pid:tid')


# Collapsed data analysis

#I. Post treatment measurements

#Unpaired t-test

unpttestcolpost <- function(d){
  choosetime <- d$timecoll == "a"
  chooseside <- d$sidecoll == "r"
  result <- t.test(d$pdcoll[choosetime & chooseside], d$pdcoll[choosetime & !chooseside],
                   paired = FALSE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- result$estimate
  est <- est[1] - est[2]
  c(pval, ci, est)
}


#Paired t-test

pttestcolpost <- function(d){
  choosetime <- d$timecoll == "a"
  chooseside <- d$sidecoll == "r"
  result <- t.test(d$pdcoll[choosetime & chooseside], d$pdcoll[choosetime & !chooseside],
                   paired = TRUE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- result$estimate
  c(pval, ci, est)
}


#Wilcoxon signed rank test

pwilcoxcolpost <- function(d){
  choosetime <- d$timecoll == "a"
  chooseside <- d$sidecoll == "r"
  result <- wilcox.test(d$pdcoll[choosetime & chooseside],
                        d$pdcoll[choosetime & !chooseside],
                        paired = TRUE, conf.int = TRUE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- NA
  c(pval, ci, est)
}


#Wilcoxon rank sum test

uwilcoxcolpost <- function(d){
  choosetime <- d$timecoll == "a"
  chooseside <- d$sidecoll == "r"
  result <- wilcox.test(d$pdcoll[choosetime & chooseside],
                        d$pdcoll[choosetime & !chooseside],
                        paired = FALSE, conf.int = TRUE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- NA
  c(pval, ci, est)
}



#II. Pre-post differences

#Creating the diff dataset of collapsed data

colldiff <- function(d){
  before <- d[d$timecoll == "b", "pdcoll"]
  after <- d[d$timecoll == "a", "pdcoll"]
  ppdc <- after - before
  side <- d$sidecoll[d$timecoll == "b"]
  pid <- d$pidcoll[d$timecoll == "b"]
  cddiff <- data.frame(before = before,
                       after = after,
                       ppdc = ppdc,
                       side = side,
                       pid = pid)
  cddiff
}


#Unpaired t test

unpttestcoldiff <- function(d){
  ppdcleft <- d$ppdc[d$side == "l"]
  ppdcright <- d$ppdc[d$side == "r"]
  result <- t.test(ppdcright, ppdcleft, paired = FALSE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- result$estimate
  est <- est[1] - est[2]
  c(pval, ci, est)
}


#Paired t test

pttestcoldiff <- function(d){
  ppdcleft <- d$ppdc[d$side == "l"]
  ppdcright <- d$ppdc[d$side == "r"]
  result <- t.test(ppdcright, ppdcleft, paired = TRUE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- result$estimate
  c(pval, ci, est)
}


#Wilcoxon signed rank test

pwilcoxcoldiff <- function(d){
  ppdcleft <- d$ppdc[d$side == "l"]
  ppdcright <- d$ppdc[d$side == "r"]
  result <- wilcox.test(ppdcright, ppdcleft, paired = TRUE, conf.int = TRUE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- NA
  c(pval, ci, est)
}


#Wilcoxon rank sum test

uwilcoxcoldiff <- function(d){
  ppdcleft <- d$ppdc[d$side == "l"]
  ppdcright <- d$ppdc[d$side == "r"]
  result <- wilcox.test(ppdcright, ppdcleft, paired = FALSE, conf.int = TRUE)
  pval <- result$p.value
  ci <- result$conf.int
  est <- NA
  c(pval, ci, est)
}


#III. Post controlling for pre

#ANCOVA

ancovacoll <- function(d){
  choosetime <- d$timecoll == "a"
  chooseside <- d$sidecoll == "r"
  result <-
    summary(lm(d$pdcoll[choosetime] ~ d$sidecoll[choosetime] + d$pdcoll[!choosetime]))
  pval <- result$coef[2,4]
  est <- result$coef[2,1]
  se <- result$coef[2,2]
  df.resid <- result$df[2]
  ci <- est + c(-1,1)*qt(0.975, df = df.resid)*se
  c(pval, ci, est)
}

#Repeated measurements

#Mixed model

#simple

mixedcollsim <- function(d){
  preresult <- try(lme(pdcoll ~ sidecoll + timecoll + sidecoll*timecoll,
                       data = d,
                       random = ~1|pidcoll), silent = TRUE)
  #try() returns an object of class "try-error" when the model fit fails
  if(!inherits(preresult, "try-error")){
    result <- summary(preresult)
    pval <- result$tTable[4,5]
    est <- result$tTable[4,1]
    se <- result$tTable[4,2]
    df <- result$tTable[4,3]
    ci <- est + c(-1,1)*qt(0.975, df = df)*se
    c(pval, ci, est)
  }
  else{
    c(NA, NA, NA, NA)
  }
}

#complex

mixedcollcom <- function(d){
  preresult <- try(lmer(pdcoll ~ sidecoll + timecoll + sidecoll*timecoll + (1|sidecoll)
                        + (1|pidcoll:sidecoll), data = d), silent = TRUE)
  if(!inherits(preresult, "try-error")){
    result <- coeffun(preresult)
    pval <- result[4,4]
    est <- result[4,1]
    se <- result[4,2]
    ci <- est + c(-1,1)*qnorm(0.975)*se
    c(pval, ci, est)
  }
  else{
    c(NA, NA, NA, NA)
  }
}

#GEE

#simple

geecolsim <- function(d){
  d <- d[order(d$pidcoll),]
  result <- summary(mygee(pdcoll ~ sidecoll + timecoll + sidecoll*timecoll, id = pidcoll, data = d,
                          corstr = "exchangeable"))
  est <- result$coef[4,1]
  se <- result$coef[4,4]
  z <- result$coef[4,5]
  pval <- 2*(1 - pnorm(abs(z)))
  ci <- est + c(-1.96, 1.96)*se
  c(pval, ci, est)
}


Appendix 3: The power of the statistical methods increases with the number of teeth

Figure 19: Power versus number of teeth in a scenario with 20 patients and a treatment effect equal to 0.3

The top 12 panels show full data analysis, whereas the bottom 12 panels show collapsed data analysis. UTU: unpaired t-test using only post-treatment data, PTP: paired t-test using only post-treatment data, PWP: Wilcoxon signed-rank test using only post-treatment data, UWP: Wilcoxon rank sum test using only post-treatment data, UTD: unpaired t-test using pre-post differences, PTD: paired t-test using pre-post differences, PWD: Wilcoxon signed-rank test using pre-post differences, UWD: Wilcoxon rank sum test using pre-post differences, ANCOVA: analysis of covariance using post-treatment measurements adjusted by baseline measurements, LMES: linear mixed effects model with probing depth and treatment treated as fixed variables and patient treated as random variable, LMEC: linear mixed effects model with probing depth and treatment treated as fixed variables and patient and tooth/side treated as random variables, GEE: generalized estimating equations


Appendix 4: The differences between baseline and post-treatment measurements in the treated and untreated sides are not correlated

This appendix aims to show that the differences obtained when using the statistical approach that calculates the difference between baseline and post-treatment measurements are not correlated, thus explaining why the statistical tests that use these differences perform appropriately even when they do not consider the pairing of the data.

Let X be the matrix that contains the values of the observations of a split-mouth trial in which two teeth per side were used

$$X' = \begin{pmatrix} U_1^b & U_2^b & T_1^b & T_2^b & U_1^a & U_2^a & T_1^a & T_2^a \end{pmatrix}_{1 \times 8}$$

where T represents the treated side, U represents the untreated side, the subscript shows the number of the tooth, and the superscript shows whether the measurement was taken before (b) or after (a) the treatment.

Let L be the contrast matrix

$$L = \begin{pmatrix} 1 & 0 & -1 & 0 & -1 & 0 & 1 & 0 \\ 0 & 1 & 0 & -1 & 0 & -1 & 0 & 1 \end{pmatrix}_{2 \times 8}$$

Then, LX is the matrix of the differences between the baseline-post treatment measurements of the treated and untreated sides

$$LX = \begin{pmatrix} (T_1^a - U_1^a) - (T_1^b - U_1^b) \\ (T_2^a - U_2^a) - (T_2^b - U_2^b) \end{pmatrix}_{2 \times 1}$$

Which reorganized is

$$LX = \begin{pmatrix} D_1 = (T_1^a - T_1^b) - (U_1^a - U_1^b) \\ D_2 = (T_2^a - T_2^b) - (U_2^a - U_2^b) \end{pmatrix}_{2 \times 1}$$
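The equivalence of the two forms of LX can be checked numerically. The following is a minimal sketch in R; the probing depth values are made up for illustration and are not taken from the simulation.

```r
# Hypothetical measurements for one patient, ordered as in X'
# (U = untreated, T = treated; b = before, a = after; 1, 2 = tooth).
X <- c(Ub1 = 5.0, Ub2 = 4.5, Tb1 = 5.2, Tb2 = 4.8,
       Ua1 = 4.6, Ua2 = 4.4, Ta1 = 3.9, Ta2 = 3.7)

# Contrast matrix L as defined in this appendix
L <- rbind(c(1, 0, -1, 0, -1, 0, 1, 0),
           c(0, 1, 0, -1, 0, -1, 0, 1))

# Differences obtained through the contrast matrix
D <- as.vector(L %*% X)

# The same differences written out in the reorganized form
D_direct <- unname(c((X["Ta1"] - X["Tb1"]) - (X["Ua1"] - X["Ub1"]),
                     (X["Ta2"] - X["Tb2"]) - (X["Ua2"] - X["Ub2"])))

print(D)
print(D_direct)
```

Both vectors should agree for any choice of X, since the reorganization only reorders the terms.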

The variance of this matrix is

Var(LX) = LV(X)L'

The product LV(X) is

$$L\,V(X) = \begin{pmatrix} 1 & 0 & -1 & 0 & -1 & 0 & 1 & 0 \\ 0 & 1 & 0 & -1 & 0 & -1 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & A & A & A & B & C & C & C \\ A & 1 & A & A & C & B & C & C \\ A & A & 1 & A & C & C & B & C \\ A & A & A & 1 & C & C & C & B \\ B & C & C & C & 1 & A & A & A \\ C & B & C & C & A & 1 & A & A \\ C & C & B & C & A & A & 1 & A \\ C & C & C & B & A & A & A & 1 \end{pmatrix}$$

$$= \begin{pmatrix} 1-A-B+C & A-A-C+C & A-1-C+B & A-A-C+C & B-C-1+A & C-C-A+A & C-B-A+1 & C-C-A+A \\ A-A+C-C & 1-A+C-B & A-A+C-C & A-1+B-C & C-C+A-A & B-C+A-1 & C-C+A-A & C-B+1-A \end{pmatrix}$$

where V(X) is the variance-covariance matrix. This matrix is obtained by multiplying the correlation matrix by $\sigma^2$. Since we are not interested in the actual value of $\sigma^2$, we are not including it in this demonstration. Therefore, V(X) is equal to the correlation matrix we used in this simulation study. A represents the correlation between teeth at one time, B represents the correlation within a tooth at different times, and C represents the correlation between teeth at different times.

Let K=1-A-B+C, then

LV(X) L’ = Var(LX)

$$\begin{pmatrix} K & 0 & -K & 0 & -K & 0 & K & 0 \\ 0 & K & 0 & -K & 0 & -K & 0 & K \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & 0 \\ 0 & -1 \\ -1 & 0 \\ 0 & -1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} K+K+K+K & 0 \\ 0 & K+K+K+K \end{pmatrix}$$

Where we can see that due to the structure of the correlation matrix used, the correlation of the differences is equal to 0.
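The result can also be verified numerically. The sketch below, in R, builds a correlation matrix with the structure described in this appendix and computes LV(X)L'. The values chosen for A, B and C are arbitrary assumptions for illustration, not the values used in the simulation scenarios.

```r
# Illustrative correlations (assumed values, chosen only for this check)
A <- 0.5  # between teeth at one time
B <- 0.7  # within a tooth at different times
C <- 0.3  # between teeth at different times

# Within-time block: 1 on the diagonal, A elsewhere
W <- matrix(A, 4, 4)
diag(W) <- 1

# Between-time block: B on the diagonal, C elsewhere
Bt <- matrix(C, 4, 4)
diag(Bt) <- B

# 8 x 8 correlation matrix V(X) with the structure shown in this appendix
V <- rbind(cbind(W, Bt), cbind(Bt, W))

# Contrast matrix L
L <- rbind(c(1, 0, -1, 0, -1, 0, 1, 0),
           c(0, 1, 0, -1, 0, -1, 0, 1))

# Var(LX) = L V(X) L'
varLX <- L %*% V %*% t(L)
K <- 1 - A - B + C

print(varLX)
print(4 * K)
```

Under these assumptions the off-diagonal elements of `varLX` are zero and both diagonal elements equal 4K, in agreement with the derivation.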