
Zoonoses and Public Health

REVIEW ARTICLE

Statistical Evaluation of Test Accuracy Studies for Toxoplasma gondii in Food Animal Intermediate Hosts

I. A. Gardner1, M. Greiner2 and J. P. Dubey3

1 Department of Medicine and Epidemiology, School of Veterinary Medicine, University of California, Davis, CA, USA
2 Federal Institute for Risk Assessment, Epidemiology, Biostatistics and Mathematical Modelling Unit, Berlin, Germany
3 United States Department of Agriculture, Agricultural Research Service, Animal and Natural Resources Institute, Animal Parasitic Diseases Laboratory, Beltsville, MD, USA

Impacts
• Receiver-operating characteristic (ROC) analysis and the area under the ROC curve should be used to compare tests for Toxoplasma gondii against a highly accurate reference standard.
• In the absence of a perfect reference standard, sensitivity and specificity of tests under evaluation can be estimated by maximum likelihood or Bayesian latent class methods. These methods are increasingly accepted as a legitimate approach in chronic infectious diseases.
• Recommendations are made regarding methods for improved analysis and reporting of studies for T. gondii.

Keywords:
Diagnostic test evaluation; test accuracy; receiver-operating characteristic analysis; likelihood ratios; latent class analysis; Toxoplasma gondii

Correspondence:
I. A. Gardner. Department of Medicine and Epidemiology, School of Veterinary Medicine, University of California, One Shields Ave, Davis, CA 95616, USA. Tel.: 530-752-6992; Fax: 530-752-0414; E-mail: [email protected]

Received for publication August 5, 2008

doi: 10.1111/j.1863-2378.2009.01281.x

Summary
The availability of accurate diagnostic tests is essential for the detection and control of Toxoplasma gondii infections in both definitive and intermediate hosts. Sensitivity, specificity and the area under the receiver-operating characteristic (ROC) curve are commonly used measures of test accuracy for infectious diseases such as toxoplasmosis. These test performance characteristics are important considerations when selecting from among a group of tests for a specific testing purpose. In this study, we reviewed statistical approaches to evaluation of tests for toxoplasmosis with and without a gold-standard (reference) test, including use of ROC analysis and likelihood ratios, which retain the diagnostic information inherent in a quantitative test result. We use previously published data from a comparison of the accuracy of serological tests for swine toxoplasmosis to demonstrate suggested methods of analysis. We make recommendations for statistical analysis and reporting of test evaluation studies for T. gondii in food animals based on our own experiences and those of others.

© 2009 Blackwell Verlag GmbH • Zoonoses Public Health. 57 (2010) 82–94

Introduction

Livestock, poultry and wild game are commonly infected with Toxoplasma gondii following ingestion of pasture, feed or water contaminated with cat faeces containing oocysts. The occurrence of T. gondii cysts in edible muscles of these species poses a potential risk to humans if the meat is consumed without thorough cooking to temperatures of at least 67°C (Dubey et al., 1990).

Accurate and reliable diagnostic tests are essential for the detection, surveillance and control of infections in intermediate hosts and to minimize the risk of human infection associated with ingestion of undercooked meat harbouring T. gondii cysts. Serological tests are used mostly for detection of infected animals because they are inexpensive and results can be obtained rapidly. Comprehensive studies comparing the accuracy of multiple serological tests with bioassay-based reference standards have been reported in pigs (Dubey et al., 1995a) but not in other livestock hosts such as goats and sheep, in wild game or in backyard poultry.

Design and reporting standards for test accuracy studies have been developed for human diseases (Bossuyt et al., 2003a,b) and these guidelines have been refined for infectious diseases, especially in the developing world (Peeling et al., 2006; TDR Diagnostics Evaluation Expert Panel, 2006). In addition, quality assessment criteria for inclusion of studies in systematic reviews of diagnostic test accuracy are now published (Whiting et al., 2003, 2004). In contrast, standards have not been developed for infectious animal diseases, although two of the authors (IG, MG) have made general recommendations about epidemiological (population-based) approaches to test evaluation (Greiner and Gardner, 2000a). In addition, the World Organization for Animal Health provides guidelines for international certification of tests for trade and non-trade applications and has a registry of tests that are validated as fit for specific purposes (Office International des Epizooties [OIE], 2008).

Evaluation of the accuracy of diagnostic tests should be based on representative field samples from naturally infected animals rather than experimental challenge studies alone (Greiner and Gardner, 2000a). Experimental challenge can provide estimates of time to detect infection and duration of detectable infection, but the doses of organisms, challenge routes and experimental conditions (environment, husbandry, lack of concurrent infections) may not be representative of field situations. If study design recommendations are followed to minimize biases (Greiner and Gardner, 2000a; Bossuyt et al., 2003a,b), use of appropriate field samples mitigates inferential problems associated with the generalization of findings from challenge studies and allows for estimation of the sensitivity and specificity of the test under evaluation (termed the index test in Bossuyt et al., 2003a) relative to those of a reference standard for a specific testing purpose. The reference standard is often termed a 'gold standard' if it provides perfect classification of infection status. For chronic infectious diseases such as toxoplasmosis, it is often possible to establish an animal's infection status definitively by post-mortem examination followed by additional tests of tissues that are considered predilection sites for cysts. Toxoplasma gondii cysts not only have a high affinity for skeletal and cardiac muscles and brain in most species but can also be found in visceral organs such as lungs, liver and kidneys (Dubey et al., 1998).

Latent-class statistical methods that do not require designation of a gold standard provide a flexible approach to estimate test accuracy (Hui and Walter, 1980; Enøe et al., 2000) and potentially can prevent the bias that occurs in estimates of sensitivity and specificity if the test under evaluation is compared with an imperfect reference standard. However, to obtain confidence intervals (CI) of the same width for sensitivity and specificity, sample sizes for a latent class analysis are typically much larger than those when a gold standard is available (Georgiadis et al., 2005). In addition, latent class methods do not correct for other biases such as verification (work-up) bias that occur in diagnostic test evaluation studies because of failure to consider important design aspects (Bossuyt et al., 2003b).

In this study, we describe statistical approaches to the evaluation of tests for T. gondii in food animal intermediate hosts. We use data from a previously published evaluation study of serological tests for swine toxoplasmosis (Dubey et al., 1995a) to demonstrate methods that depend or do not depend on the availability of a gold standard reference test. Finally, we make recommendations regarding the analysis and reporting of test evaluation studies for toxoplasmosis with the overall goal of improving their quality.

Example: Evaluation of the Accuracy of Serological Testing for Detection of T. gondii Infection in Sows

Serological tests are a practical and inexpensive method for estimating the prevalence of T. gondii infection in food animals and wild game (reviewed by Tenter et al., 2000; Dubey and Jones, 2008). Toxoplasma infections may cause reproductive failure in pregnant females but most infections are subclinical and hence, tests usually are validated for detection of subclinical infection.

To demonstrate recommended approaches to evaluation of test accuracy, we re-analyzed data from a study conducted by one of us (JPD) to evaluate the sensitivity and specificity of five serological tests (MAT = modified agglutination test; ELISA = enzyme-linked immunosorbent assay; LAT = latex agglutination test; IHAT = indirect hemagglutination inhibition test; DT = Sabin-Feldman dye test) for toxoplasmosis in 1000 naturally exposed sows (Dubey et al., 1995a). Hearts were collected in batches from a swine slaughterhouse in Iowa and serological tests were performed on clotted heart blood in collaborators' laboratories. Only 893 sera were suitable for DT because of bacterial contamination and anti-complementary substances. One serum was missing an ELISA result. All serological tests were conducted in a blinded fashion.

Mouse bioassay, which is often used as a reference standard in test evaluation studies, was performed on all 1000 samples using 100 g of homogenized, acidic pepsin-digested heart tissue, and a subset of samples (n = 183) yielding low titres on the MAT was selectively followed up by bioassay in T. gondii-free cats using five times more heart muscle (500 g). For bioassay, hearts were allocated into two groups: group 1 consisted of the first 463 hearts, of which 42 (9.1%) were bioassayed in cats, and group 2 consisted of the remaining 537 samples, of which 141 (26.3%) were bioassayed in cats. The higher percentage of

group 2 samples bioassayed in cats was mainly attributable to increased cat availability when the samples were collected. Detailed descriptions of bioassay procedures and relationships between antibody test results and isolation of viable T. gondii in tissues are reported elsewhere (Dubey et al., 1995b).

In the dataset, serological results [titres for MAT, LAT and IHAT; optical density (OD) values for ELISA; positive or negative for DT] were recorded along with mouse and cat bioassay results (positive or negative) for each sample. The composite reference standard was isolation of viable T. gondii by either cat or mouse bioassay. One sample with serological results had no bioassay results, reducing the total sample size to 999. Sows whose samples yielded a positive result on either bioassay (n = 170) were considered infected, and sows whose samples yielded a negative bioassay result (n = 829) were considered non-infected.

We restrict our presentation of data analysis strategies mostly to the MAT and ELISA (tests under evaluation) and mouse and cat bioassay results (reference standards). For receiver-operating characteristic (ROC) analysis and likelihood ratio (LR) calculations, we used MedCalc (MedCalc version 9.4, Mariakerke, Belgium) because of its ease of use, including the ready importation of Excel files. For Bayesian analyses, we used WinBUGS (Lunn et al., 2000: software available at http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml) because it is freely available and code can be written to allow the estimation of sensitivity, specificity and predictive values of multiple tests (Branscum et al., 2005). Our evaluation of test accuracy parameters is based on results of individual sows, although tests are commonly interpreted at the herd level in epidemiological studies and risk-based surveillance programmes (Martin et al., 1992; Christensen and Gardner, 2000).

Evaluation of Tests for T. gondii Infection Using Methods that Assume the Reference Test is a Gold Standard

The traditional approach to test evaluation assumes a perfect reference standard against which the results of one or more tests are compared. For T. gondii infections, bioassay using tissues fed to T. gondii-free cats is essentially a gold standard (perfectly specific and >95% sensitive based on the experience of one of the authors, JPD), but the method is too expensive and labour intensive, and is not ethically justifiable, except for targeted studies. False-negative results may occur when limited amounts of tissues are fed to cats or when cysts are not uniformly distributed through muscle tissue in a carcass. In an experimental study, Dubey et al. (1995b) found that T. gondii was not detected in 25% of tested heart muscles when T. gondii cysts were present in other organs. Bioassay in mice is less sensitive than bioassay in cats and hence, is a poorer reference standard. In the Dubey et al. (1995a) study, the relative sensitivity of mouse to cat bioassay was about 50%, although the samples verified selectively by cat bioassay were from sows with low MAT titres that may have had few cysts.

Area under the receiver-operating characteristic curve

Use of ROC analysis provides a cut-off-independent approach for evaluation of the global accuracy of a test whose results are measured on an ordinal or continuous scale. The area under the ROC curve (AUC) provides a single numerical estimate of overall accuracy that can be interpreted as the average probability that an infected animal will have a higher test value than a non-infected animal. The main justification for ROC analysis is that cut-off values for test interpretation may change depending on the purpose of testing (e.g. screening versus confirmation) and with the prevalence of infection, the costs of test errors and the availability of other tests. Readers interested in detailed descriptions of ROC analysis are referred elsewhere (Zweig and Campbell, 1993; Greiner et al., 2000; Gardner and Greiner, 2006).

The accuracy of the MAT, ELISA, IHAT and LAT, measured as the AUC, and 95% CI for differences in AUC were estimated relative to the composite reference standard of mouse and cat bioassay results. The MAT and ELISA were more accurate (AUC = 0.881 and 0.856 respectively) than either LAT or IHAT (AUC = 0.800 and 0.733 respectively). These conclusions were verified by calculations of AUC differences and 95% CI between all six test pairs (Table 1). The ROC plots for each test are shown in Fig. 1.

Titre-specific likelihood ratios

Likelihood ratios (LR) are calculated as the ratio of the proportion of sows with T. gondii infection (as defined by the composite reference standard) that have a given MAT titre to the proportion of sows without T. gondii infection that have the same MAT titre (Simel et al., 1993). For example, for a titre of >4000, the LR is calculated as (5/170) ÷ (1/829) = 24.38 (Table 2). LR and 95% CI were calculated for each possible MAT titre value using the LRs (2 × k table) module in MedCalc. As expected, LR values increased with increasing MAT titres up to 1 : 800, but then decreased or remained constant thereafter (Table 2).

Logistic regression can be used to model LR and to ensure that increasingly higher titres have higher LR values. For these data, we coded the composite reference


Table 1. Pairwise differences in area under the receiver-operating characteristic curve (AUC) and 95% CI for serological tests based on bioassay results from 170 Toxoplasma gondii-infected and 829 non-infected sows

Test pairs    AUC difference   95% CI
MAT-ELISA     0.025            −0.013 to 0.064
MAT-IHAT      0.148            0.104 to 0.192
MAT-LAT       0.081            0.043 to 0.119
ELISA-IHAT    0.123            0.074 to 0.171
ELISA-LAT     0.056            0.010 to 0.101
LAT-IHAT      0.067            0.024 to 0.110

AUC for the modified agglutination test (MAT), enzyme-linked immunosorbent assay (ELISA), latex agglutination test (LAT) and indirect hemagglutination inhibition test (IHAT) were 0.881, 0.856, 0.800 and 0.733 respectively.

Table 2. Relative frequencies of MAT reciprocal titre results by infection status for 170 Toxoplasma gondii-infected (I+) and 829 non-infected (I−) sows and likelihood ratios (LR) calculated from the empirical titre data and by logistic regression

Titre    I+    I−    Total   LR (titre)   95% CI for LR (titre)   LR (logistic regression)
<20      29    748   777     0.19         0.14–0.26               0.22
20       13    22    35      2.88         1.48–5.61               1.95
40       8     13    21      3.00         1.26–7.13               3.14
80       30    20    50      7.32         4.26–12.57              5.11
200      24    8     32      14.63        6.69–32.01              9.77
400      24    7     31      16.72        7.32–38.18              16.00
800      20    4     24      24.38        8.44–70.43              26.23
2000     9     3     12      14.63        4.00–53.48              50.44
4000     8     3     11      13.00        3.49–48.52              82.74
>4000    5     1     6       24.38        2.87–207.39             135.72
Total    170   829   999

For purposes of the analysis, a single MAT titre of 320 was collapsed with titre values of 400 and a single titre of 1280 was collapsed with titre values of 2000.
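As a cross-check, the empirical titre-specific LRs in Table 2 can be recomputed from the tabulated counts. The sketch below is illustrative only: the function and variable names are ours, and the log-scale (delta-method) interval is an assumed, commonly used construction rather than a statement of the exact algorithm implemented in MedCalc.

```python
import math

# Counts from Table 2: reciprocal MAT titre -> (infected, non-infected) sows
TABLE2_COUNTS = {
    "<20": (29, 748), "20": (13, 22), "40": (8, 13), "80": (30, 20),
    "200": (24, 8), "400": (24, 7), "800": (20, 4), "2000": (9, 3),
    "4000": (8, 3), ">4000": (5, 1),
}
N_INFECTED = sum(pos for pos, _ in TABLE2_COUNTS.values())      # 170
N_NONINFECTED = sum(neg for _, neg in TABLE2_COUNTS.values())   # 829


def likelihood_ratio(pos, neg, n_pos=N_INFECTED, n_neg=N_NONINFECTED):
    """LR for one titre category: P(titre | infected) / P(titre | non-infected)."""
    return (pos / n_pos) / (neg / n_neg)


def lr_confidence_interval(pos, neg, n_pos=N_INFECTED, n_neg=N_NONINFECTED, z=1.96):
    """Approximate 95% CI built on the log scale (assumed method, see text above).

    Requires non-zero counts in both groups; Table 2 has no empty cells.
    """
    lr = likelihood_ratio(pos, neg, n_pos, n_neg)
    se_log = math.sqrt(1 / pos - 1 / n_pos + 1 / neg - 1 / n_neg)
    return lr * math.exp(-z * se_log), lr * math.exp(z * se_log)


# Print one line per titre category for comparison with Table 2
for titre, (pos, neg) in TABLE2_COUNTS.items():
    lower, upper = lr_confidence_interval(pos, neg)
    print(f"{titre:>6}  LR = {likelihood_ratio(pos, neg):6.2f}  "
          f"95% CI = {lower:.2f} to {upper:.2f}")
```

The point estimates obtained this way agree with the LR (titre) column of Table 2 to within rounding in the second decimal place.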

standard of mouse and cat bioassay results (Y) as 1 and 0 for positive and negative status, respectively, and used a log2 transformation of the MAT titre. We added 1 before transformation because we assigned a value of 0 to all samples that were negative at 1 : 20. The transformed MAT titre was thus $X = \log_2(\mathrm{MAT} + 1)$. MAT titres >4000 were collapsed into a single category because of sparse data. For the logistic regression analysis, the value $X = \log_2(8001)$ was assigned to all results for MAT >4000 (censored values). Fitting the logistic regression model, $\operatorname{logit} \Pr(Y = 1) = a + bX$, results in estimates of the coefficients $a$ and $b$. Using the prevalence ($P$) of $Y = 1$ in the sample, we derive the critical value $x' = [\operatorname{logit}(P) - a]/b$, which corresponds to the value of $X$ such that $\mathrm{LR}(X) = 1$. From results of the logistic regression, the titre-specific LR can then be calculated as $\mathrm{LR}(X) = \exp[b(X - x')]$ for each MAT titre value. A comparison of LR estimates from the empirical titre data and based on logistic regression is shown in Table 2.

For calculation of LR for ELISA results, a smoothed ROC curve could be fitted to the data and the LR estimated for each unique ELISA test value (Choi, 1998). In practice, continuous data are usually categorized into intervals of results. The number of intervals chosen should be guided by the sample size used in the test evaluation study. As more intervals are used, the precision in the interval-specific LR decreases.

For clinical applications, LR from the test evaluation study can be combined with estimates of pre-test infection probabilities to yield post-test infection probabilities (Simel et al., 1993). These calculations can be modified to allow for sequential use of tests assuming conditional independence of the sensitivities and specificities and hence, conditional independence of LRs.

Fig. 1. Receiver-operating characteristic (ROC) plots for the enzyme-linked immunosorbent assay (ELISA), modified agglutination test (MAT), indirect hemagglutination inhibition test (IHAT) and latex agglutination test (LAT) for subclinical Toxoplasma gondii infection in 170 infected and 829 non-infected sows, as determined by the composite reference standard of cat and mouse bioassay results. The diagonal line from the bottom left to the top right hand corner of the figure represents the plot of a test with no discriminatory power, with area under the ROC plot of 0.5. [Plot not reproduced; vertical axis: Sensitivity, horizontal axis: 100-Specificity, both from 0 to 100.]

Comparison of the sensitivity and specificity of dichotomized test results

The sensitivity and specificity of the MAT (titre ≥1 : 20 is positive) and ELISA tests (OD ≥0.360 is positive) were


compared using McNemar's test for correlated proportions in subgroups of infected (n = 169) and non-infected sows (n = 829), respectively (Table 3).

The MAT was significantly more sensitive (83.4% versus 73.4%; P < 0.0001) and more specific (90.2% versus 85.9%; P < 0.0001) than ELISA. In addition, the sensitivity and specificity covariances (Gardner et al., 2000) were moderate (66% and 51% of maximum possible values for sensitivity and specificity respectively). Hence, gains in sensitivity from using both tests with a parallel interpretation (87.6% for MAT and ELISA versus 83.4% for MAT alone) and gains in specificity from a serial interpretation (94.3% for MAT and ELISA versus 90.2% for MAT alone) were not substantial. Additional analyses (data not shown) indicated that use of IHAT and LAT tests failed to yield improvements in the joint sensitivity or specificity above the values obtained when MAT and ELISA were used in combination. We refer readers interested in further discussion of log-linear and logistic modelling approaches to estimation of test dependence and evaluation of the accuracy of tests in combination to Hanson et al. (2000).

Table 3. Pairwise frequencies of modified agglutination test (M) and enzyme-linked immunosorbent assay (E) results for Toxoplasma gondii in 998 sows based on bioassay results

Bioassay             Result   M+E+   M+E−   M−E+   M−E−   Total
Combined cat/mouse   +        117    24     7      21     169
                     −        47     34     70     678    829
Mouse                +        73     17     4      13     107
                     −        91     41     73     686    891
Total                         164    58     77     699    998

Since the Dubey et al. (1995a) study was reported, further refinements of the ELISA have resulted in improved sensitivity. In recent studies, a commercial test kit (Toxoplasma Microwell Immunoassay Kit; Safe-Path Laboratories, Carlsbad, CA, USA) that uses a whole-tachyzoite antigen was shown to be more sensitive than the MAT (Gamble et al., 2005; Hill et al., 2006).

Evaluation of Tests for T. gondii Infection in the Absence of a Perfect Reference Standard

Although used as a reference standard in T. gondii test evaluation studies, bioassay of tissue in mice often yields false-negative results. Hence, without use of statistical methods that adjust for imperfect accuracy of the reference standard, it is impossible to show that a serological test is more sensitive than mouse bioassay if the latter were considered perfectly specific. Use of an imperfect reference standard, such as mouse bioassay, yields biased estimates of sensitivity and specificity of the test under evaluation (Staquet et al., 1981). However, a flawed reference standard might still be useful for comparative ranking of the relative sensitivity and specificity of different serological tests when used on the same set of samples.

When a perfect reference test is not available, researchers commonly estimate and report either the sensitivity and specificity relative to the imperfect standard or the kappa statistic, which quantifies test agreement beyond chance. Use of relative sensitivity and specificity has the limitation described in the prior paragraph, and the magnitude of kappa depends on prevalence (Byrt et al., 1993). However, from a test substitution perspective, high to very high (>0.9) kappa values might be adequate for laboratory diagnosticians to replace an existing test with a new test on the basis of criteria such as cost and sample-throughput capacity.

Latent class analysis is preferred to use of relative sensitivity and specificity because a test does not need to be designated as perfect (i.e. a gold standard) for purposes of the analysis, and the sensitivity and specificity of the reference standard can be estimated in addition to that of a new test (Hui and Walter, 1980; Enøe et al., 2000; Johnson et al., 2001). Latent class methods are appropriate to compare antigen and antibody detection tests for T. gondii because tissue cysts and antibodies to cysts are potentially detectable for most of an infected animal's life. The utility of latent class methods is gaining recognition internationally for test evaluation for important diseases that impact animal trade and the approach is mentioned in OIE's Manual of Diagnostic Tests and Vaccines for Terrestrial Animals (OIE, 2008).

Two statistical approaches [maximum likelihood (ML) and Bayesian] based on latent class methods can be used for estimation of test sensitivity and specificity. Bayesian approaches are especially suited to situations where prior information is available about test performance and when the problem is not identifiable, i.e. when there is no unique set of ML parameter values. The WinBUGS software allows easy implementation of Markov-chain Monte Carlo methods for Bayesian estimation, and simple ML analyses can be done using a web-based interface (Pouillot et al., 2002). Bayesian methods can also be used to estimate herd-level sensitivity and specificity (Su et al., 2007) but this scenario is not considered here. Prior information about model parameters used in the Bayesian analyses may affect the final estimates depending on the relative strength of evidence provided by the priors (level of prior ) and the data (uncertainty attributable to finite sample sizes). Therefore, the sources of prior information must be well documented in Bayesian analyses. In addition, the influence of the selected prior distributions on the resulting estimates should be assessed.
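The trade-off between prior and data can be illustrated in a deliberately simplified setting: a single proportion (here, MAT sensitivity) estimated in animals of known infection status, where a beta prior combined with binomial data gives a beta posterior in closed form. This is not the latent class model itself, only a sketch of how such a sensitivity check might look. The function names and the small 6-out-of-10 data set are hypothetical; the 141 MAT-positive results among 170 bioassay-positive sows come from Table 2, and beta (24.05, 5.72) is the prior the article uses later for MAT sensitivity.

```python
def beta_posterior(a, b, successes, trials):
    """Conjugate update: beta(a, b) prior plus binomial data gives a beta posterior."""
    return a + successes, b + (trials - successes)


def beta_mean(a, b):
    """Mean of a beta(a, b) distribution."""
    return a / (a + b)


def posterior_mean(a, b, successes, trials):
    """Posterior mean of a proportion under a beta(a, b) prior."""
    return beta_mean(*beta_posterior(a, b, successes, trials))


PRIORS = {
    "non-informative beta(1, 1)": (1.0, 1.0),
    "informative beta(24.05, 5.72)": (24.05, 5.72),
}
DATASETS = {
    "141/170 (Table 2)": (141, 170),
    "6/10 (hypothetical small study)": (6, 10),
}

# Compare how far the two priors pull the estimate for each data set
for data_label, (positives, n) in DATASETS.items():
    for prior_label, (a, b) in PRIORS.items():
        estimate = posterior_mean(a, b, positives, n)
        print(f"{data_label:32s} {prior_label:30s} posterior mean = {estimate:.3f}")
```

With 170 animals the two priors give almost the same posterior mean, whereas with only 10 animals the informative prior dominates, which is the kind of sensitivity to prior choice that the text recommends reporting.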


Evaluation of two tests applied to all individuals in two populations

The simplest case that can be solved by ML and Bayesian methods without use of informative prior information about one or more parameters (e.g. sensitivity) is the scenario when two tests are applied to all individuals in two populations. These populations can be natural aggregates, such as herds or flocks, or can be artificially created by stratification by an assumed risk factor for infection (e.g. covariates such as age or parity). Subject to the validity of assumptions of constant sensitivity and specificity across populations, distinct prevalences and conditional independence in sensitivities and specificities, there are six degrees of freedom to estimate the six unknown parameters (two sensitivities, two specificities and two prevalences). Hence, prior information is not a prerequisite for the analysis.

For the data reported in Dubey et al. (1995a), the two populations could be based on mouse or composite cat/mouse bioassay results (Table 3), making no assumptions about whether these were perfect reference tests, or based on the groups, as this allows for a natural categorization of the data that was independent of test results (Table 4). The latter approach is based on the assumption that an unknown risk factor induced different prevalences in the two groups of samples (Toft et al., 2005).

Comparison of MAT and bioassay accuracy

For this comparison, an assumption of independence in sensitivities is reasonable for the MAT and bioassay results as the tests measure different biological phenomena, e.g. one test detects antibody (an indirect measure of infection) and the other detects antigen (Gardner et al., 2000). As bioassay is perfectly specific for practical purposes, MAT and bioassay specificities are conditionally independent by definition.

The MAT and the composite cat and mouse bioassay (CMB) results were evaluated by ML in a prior study (Gardner, 2004) and estimates (MAT sensitivity = 82.9% and specificity = 90.2%; CMB sensitivity and specificity = 100%) were identical to those obtained in Dubey et al. (1995a) where CMB was treated as a gold standard. We now analyze the same data using a Bayesian two-test two-population model using non-informative prior [beta (1,1)] distributions on all other parameters and a highly informative prior [beta (9999,1)] distribution for specificity of CMB. The modelling of bioassay specificity using this prior distribution allowed for the possibility of rare false-positive results (approximately 1 in 10 000 truly negative samples) caused by cross-contamination of samples in the laboratory.

The model (Appendix: Code A) was fit in WinBUGS, and posterior inferences about test parameters were based on the 2.5th, 50th (median) and 97.5th percentiles of the Monte Carlo sample. The 2.5th and 97.5th percentile values were the lower and upper endpoints of 95% probability intervals (PI). Estimates in Table 5 were based on 50,000 iterations after discarding the first 5,000 iterations and were minimally different from the gold-standard analysis. Use of a beta (1,1) prior as an alternative to the beta (9999,1) prior resulted in minor changes in posterior inferences, although the median estimate for CMB specificity was lower and the estimate for MAT sensitivity somewhat higher (data not shown).

Comparison of MAT and ELISA accuracy

Kappa can be estimated from the pairwise results of MAT and ELISA collapsed across CMB status (total row in Table 3) or group (total row in the lower section of Table 4). For these data, kappa equals 0.621 which

Table 4. Pairwise frequencies of modified agglutination test (M) and combined cat/mouse bioassay (CMB) results and modified agglutination test and enzyme-linked immunosorbent assay (E) results for Toxoplasma gondii in 999 sows based on testing of pigs by group

Group    M+CMB+   M+CMB−   M−CMB+   M−CMB−   Total
1        37       55       7        363      462
2        104      26       22       385      537
Total    141      81       29       748      999

Group    M+E+     M+E−     M−E+     M−E−     Total
1        67       25       41       329      462
2        97       33       36       371      537
Total    164      58       77       700      999

Table 5. Bayesian estimates [median and 95% probability intervals (PI)] of sensitivity (Se) and specificity (Sp) of the modified agglutination test (MAT) and combined cat and mouse bioassay (CMB) results for Toxoplasma gondii based on a beta (9999,1) prior for CMB specificity and beta (1,1) priors for other parameters

Parameter              Posterior median   95% PI        True value (gold-standard analysis)
Se MAT                 0.826              0.765–0.878   0.829
Sp MAT                 0.906              0.884–0.929   0.902
Se CMB                 0.973              0.872–0.999   1
Sp CMB                 0.9999             0.9996–1.0    1
Prevalence (group 1)   0.101              0.075–0.135   0.095
Prevalence (group 2)   0.241              0.205–0.282   0.235
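The "true value" column of Table 5 and the kappa statistic for MAT versus ELISA can both be recovered directly from the counts in Table 4. The following is a minimal sketch under that reading of the tables; the helper name `cohen_kappa` and the variable names are ours and do not come from the article's appendix code.

```python
def cohen_kappa(both_pos, a_pos_only, b_pos_only, both_neg):
    """Cohen's kappa for two binary tests A and B applied to the same animals."""
    n = both_pos + a_pos_only + b_pos_only + both_neg
    observed = (both_pos + both_neg) / n
    a_positive = (both_pos + a_pos_only) / n
    b_positive = (both_pos + b_pos_only) / n
    expected = a_positive * b_positive + (1 - a_positive) * (1 - b_positive)
    return (observed - expected) / (1 - expected)


# Upper section of Table 4 by group: (M+CMB+, M+CMB-, M-CMB+, M-CMB-)
MAT_VS_CMB = {1: (37, 55, 7, 363), 2: (104, 26, 22, 385)}

# Gold-standard analysis: treat CMB as a perfect classifier of infection status
mat_pos_infected = sum(row[0] for row in MAT_VS_CMB.values())     # 141
mat_pos_noninf = sum(row[1] for row in MAT_VS_CMB.values())       # 81
mat_neg_infected = sum(row[2] for row in MAT_VS_CMB.values())     # 29
mat_neg_noninf = sum(row[3] for row in MAT_VS_CMB.values())       # 748

se_mat = mat_pos_infected / (mat_pos_infected + mat_neg_infected)
sp_mat = mat_neg_noninf / (mat_neg_noninf + mat_pos_noninf)
prevalence = {g: (row[0] + row[2]) / sum(row) for g, row in MAT_VS_CMB.items()}

# Total row of the lower section of Table 4: M+E+, M+E-, M-E+, M-E-
kappa_mat_elisa = cohen_kappa(164, 58, 77, 700)

print(f"Se MAT = {se_mat:.3f}, Sp MAT = {sp_mat:.3f}")
print(f"Prevalence group 1 = {prevalence[1]:.3f}, group 2 = {prevalence[2]:.3f}")
print(f"Kappa (MAT vs ELISA) = {kappa_mat_elisa:.3f}")
```

Note that kappa is computed without any reference to infection status, which is why it cannot indicate which of the two tests is more accurate.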

indicates moderate agreement beyond chance. However, kappa provides no indication of the accuracy of either the MAT or ELISA or the superiority of one test over the other.

If the goal were to obtain estimates of the sensitivity and specificity of the MAT and ELISA, a different analysis from that used for MAT and CMB is necessary because the two tests are conditionally dependent (correlated), as they both measure serum antibodies. Hence, an appropriate Bayesian model is one that incorporates the conditional dependence between the two serological tests rather than one that assumes independence as was used for MAT and CMB.

First, we analysed the MAT and ELISA data by group (Table 4) without considering true infection status as determined by CMB. Use of non-informative prior distributions in a Bayesian conditional independence model (Code B) results in overestimation of the sensitivity and specificity of both the ELISA and the MAT (Table 6). This is not corrected by the use of moderately informative prior information about the MAT, where the distributions are centred at the true values [sensitivity modelled as beta (24.05, 5.72) and specificity as beta (22.99, 3.42)].

Next, we analysed the data correctly to account for the conditional dependence in the sensitivities and specificities of the MAT and ELISA, as found in the gold-standard analysis. The model has two additional parameters and hence, prior information must be provided for two parameters to ensure identifiability. Priors for sensitivity and specificity of the MAT could be elicited from experts as the test was used for many years. We use the same priors as used in the independence model but the WinBUGS code (Code C) now includes additional parameters that allow for dependence in test results. Posterior median estimates from WinBUGS (80.8% and 89.3% for MAT sensitivity and specificity; 69.5% and 84.8% for ELISA sensitivity and specificity) were close to the true values and the four 95% PI included the true values (Table 6). The specificity correlation between MAT and ELISA results of non-infected pigs was positive. The sensitivity correlation for infected pigs was also positive, but its 95% PI included zero (null value). Intervals for the correlations were wide (data not shown). More detailed analysis of these data is described in Georgiadis et al. (2003), including the use of informative priors on other parameters and sensitivity analysis based on different priors. Mainar-Jaime and Barbera (2007) used the same Bayesian dependence model to evaluate the accuracy of MAT and ELISA to detect serum antibodies to T. gondii in two populations of sheep.

Comparison of MAT, ELISA and mouse bioassay accuracy

As mouse bioassay data were available for all pigs in Dubey et al. (1995a), these data could also be incorporated into the analysis. The MAT and ELISA remain as the two dependent tests and two new populations are created based on results of mouse bioassay (Table 3). This approximates a gold-standard analysis because, with the exception of rare errors, all mouse-positive animals are truly infected (prevalence close to 1) and the error rate among mouse-negative results should be low (prevalence <10%), although the exact value would be known with less certainty. Our analysis used beta (9999,1) and beta (1,10) prior distributions for prevalence in the mouse-positive and mouse-negative classifications, respectively. The major benefit from

Table 6. Posterior medians and 95% probability intervals (PI) for the sensitivity (Se) and specificity (Sp) of the modified agglutination test (MAT) and enzyme-linked immunosorbent assay (ELISA) for Toxoplasma gondii in sows based on three different Bayesian models

                                    Model 1                       Model 2                       Model 3
                       True         Posterior                     Posterior                     Posterior
Parameter              value        median      95% PI            median      95% PI            median      95% PI

Se MAT                 0.829        0.854       0.672–0.993       0.809       0.640–0.924       0.836       0.759–0.836
Sp MAT                 0.902        0.960       0.917–0.998       0.893       0.800–0.967       0.919       0.847–0.987
Se ELISA               0.729        0.843       0.715–0.990       0.695       0.343–0.920       0.712       0.624–0.792
Sp ELISA               0.859        0.932       0.891–0.995       0.848       0.761–0.931       0.867       0.807–0.918
Prevalence (group 1)   0.095        0.201       0.139–0.299       0.140       0.017–0.252       0.9999      0.9996–1.0
Prevalence (group 2)   0.235        0.244       0.177–0.326       0.189       0.050–0.304       0.090       0.004–0.172

True values are based on the analysis with the composite cat and mouse bioassay results as the gold standard.
Model 1: Independence model based on group data in the lower section of Table 4 with non-informative beta (1,1) priors on all parameters.
Model 2: Dependence model based on group data in the lower section of Table 4 with a beta (24.05, 5.72) prior for MAT sensitivity and a beta (22.99, 3.42) prior for MAT specificity. Non-informative priors are used for other parameters.
Model 3: Dependence model with non-informative priors for MAT and ELISA and populations based on mouse bioassay results (Table 3) with prevalence modelled as beta (9999,1) for mouse-positive and beta (1,10) for mouse-negative groups.

the use of a Bayesian model based on 3 tests is that the 95% PI for test accuracy parameters are narrower (Table 6) and from a philosophical perspective, the analysis is richer as it makes use of all the data. The model also allows reasonably precise estimation of the sensitivity correlation in infected pigs (0.442, 95% PI = 0.238 to 0.614) and specificity correlation in non-infected pigs (0.389, 95% PI = −0.01 to 0.574) as it approximates a gold-standard analysis.

An alternative Bayesian model for the MAT, ELISA and mouse data is one that considers results of three tests (two conditionally dependent and a third conditionally independent, namely bioassay) in a single population (i.e. sampling group is ignored). This model is highly dependent on prior information as there are more parameters than degrees of freedom. Without specification of reasonable priors on the prevalence and the accuracy of the MAT and ELISA, this model performs poorly (data not shown).

ROC analysis of ELISA tests

In the absence of a gold standard, the ROC curve and its associated AUC cannot be estimated unless prior information about the accuracy of at least 1 test can be incorporated in a Bayesian analysis. Estimation of the curve and its AUC is possible for two ELISA tests under the assumption of bivariate binormality (with or without transformation of the data) provided there is adequate separation of the responses of infected and non-infected groups of pigs (Choi et al., 2006). Because the assumption of normality is very restrictive, non-parametric and semi-parametric approaches have been developed and may be especially useful when the distribution of test results in infected animals is bimodal (Branscum et al., 2008). This example is beyond the scope of this review but should be considered when investigators compare the accuracy of several ELISA tests when no bioassay results are available.

Conclusions and Recommendations

Reporting of sensitivity and specificity values and 95% CI is considered standard practice for test evaluation studies. For quantitative tests, AUC and LRs provide additional useful information about test accuracy and its clinical utility because they prevent information loss based on the arbitrary selection of a single cut-off value that might not be optimal for all testing purposes. These measures should also be reported where relevant.

Comparison of reported sensitivity and specificity estimates for the same test (even if using identical protocols including reagents and strains of T. gondii) is problematic because sampling designs are not consistent among studies and estimates may be confounded by factors such as age and type of production system. Hence, there is an increasing realization of the importance of comparing multiple test methods on the same set of samples. There is a paucity of reference materials for evaluation studies for serological tests for T. gondii, and standardization and harmonization of test methods has been recommended to help ensure comparability of surveillance data for T. gondii (EFSA, 2007).

Although the use of a perfect reference standard is the ideal approach for test evaluation studies of infectious diseases in animals, latent class methods will have an increasing importance in the future because of their flexibility to account for imperfect reference tests. However, the methods should be used carefully and include a thorough evaluation of underlying assumptions, including the effects of use of the selected prior distributions on posterior inferences, and convergence of Markov chains in a Bayesian analysis (Toft et al., 2005). For studies comparing two serological tests for T. gondii with no bioassay data, we recommend the use of the Bayesian dependence model (Code C) over reporting of kappa values. For studies with data from two serological tests and mouse bioassay but not cat bioassay data, we recommend a Bayesian dependence model (Code D) with populations created on the basis of mouse bioassay results.

To facilitate improved analysis and reporting of studies for tests for toxoplasmosis, we have updated guidelines for test evaluation studies (Greiner and Gardner, 2000a), incorporating relevant recommendations from human health studies and our own experiences (Table 7). We recommend that authors, reviewers and journal editors use this checklist to guide reporting of test evaluation studies in animal health and veterinary public health journals. Although important, we do not present design considerations and refer readers to the STARD guidelines (Bossuyt et al., 2003a,b), which require adaptation to animal health studies.

Table 7. Checklist of statistical analysis and associated reporting considerations for studies evaluating diagnostic tests for Toxoplasma gondii in intermediate hosts

Statistical methods

Evaluation of a new test compared with a perfect reference test
  In this setting, the negative/positive classification for infection is based on results of the reference test. Methods for calculating or comparing diagnostic accuracy should be described, including methods for CI estimation. The selected sample sizes of infected and non-infected animals should be justified.
  For binary tests, comparisons of the sensitivity (specificity) of tests among subpopulations can be done by Pearson's chi-square or Fisher's exact test for independent samples.
  For paired designs, comparisons of the sensitivity and specificity of pairs of binary tests (or ordinal/continuous tests where results are dichotomized) can be done by McNemar's chi-square test.
  For ordinal/continuous tests, ROC analysis should be used and the area under the curve (AUC) and a 95% CI should be reported.
  When multiple ordinal/continuous tests are compared, differences in AUC for pairs of tests should be reported with 95% CI.
  If tests are used clinically for case management decisions in individuals, likelihood ratios and 95% CI should be reported for each value of an ordinal test (e.g. MAT) or intervals of test results created from continuous results (e.g. ELISA).
  If a diagnostic test, validated according to these guidelines, is used for prevalence estimation, an adjustment for sensitivity and specificity should be made as described by Greiner and Gardner (2000b).

Evaluation of tests in the absence of a perfect reference test
  When latent class methods (maximum likelihood or Bayesian) are used for parameter estimation, reasons why a perfect reference standard was not used should be given. In this case, classification of the sample into positive/negative for two or more tests is possible, whereas a classification into true positive and true negative is impossible. Model assumptions (e.g. constant test accuracy across populations) should be justified from both statistical and biological perspectives. The number of degrees of freedom and parameters to be estimated in the model should be listed.
  Sources of prior distributions for a Bayesian analysis (e.g. published data or whether expert-elicited) should be described and the corresponding prior distributions (e.g. beta) for modelling sensitivity, specificity and prevalence should be given.
  If a diagnostic test, validated according to these guidelines, is used for prevalence estimation on the same sample, an adjusted (for the sensitivity [Se] and specificity [Sp] of all tests) prevalence estimate is directly obtained from the latent class analysis. On the other hand, for use in other study populations, the Se and Sp estimates obtained from latent class analysis should be used for adjustment as described by Greiner and Gardner (2000b).

Presentation of results

Evaluation of a new test compared with a perfect reference test
  Sensitivity and specificity estimates should be presented with 95% CI (calculated using an exact method). Estimates should also be presented by relevant subpopulations, e.g. separate sensitivity estimates for subclinical and clinical cases, if relevant.
  The source 2 × 2 table should be displayed and additional measures (e.g. predictive values) should be presented for cross-sectional sampling designs.
  The number of uninterpretable, indeterminate and intermediate (suspicious) results, if any, and reasons for missing data should be provided. Results of all samples should be accounted for in the study, including a description of how outlier values were handled.
  If the test is to be used for herd diagnosis/testing, herd sensitivity and specificity estimates and 95% CI should be provided for likely herd test sizes.
  When multiple tests are evaluated, relative frequencies of pairwise test results (++, +−, −+, −−) should be presented separately for infected and non-infected populations so that measures of test dependence (sensitivity and specificity covariances) can be calculated.

Evaluation of tests in the absence of a perfect reference test
  The source data tables with cross-classified test results for each population should be presented, and sensitivity, specificity and 95% CI (or probability intervals) should be presented for frequentist (or Bayesian) approaches.
  Prevalence estimates, if relevant, for the populations may be presented with 95% intervals.
  If more than two populations or more than two tests are available, the model fit should be investigated using a chi-square test, cross-validation or other appropriate methods.
  For a Bayesian analysis, the convergence of the Markov chains should be assessed by running multiple chains from dispersed starting values, visual inspection of trace plots and assessment of Monte Carlo error. Results of the sensitivity analysis evaluating the effects of different prior distributions on parameter estimates should be reported.

Acknowledgements

We thank Adam Branscum, Marios Georgiadis and Wesley Johnson for their assistance with writing the original WinBUGS code used in these analyses.

References

Bossuyt, P. M., J. B. Reitsma, D. E. Bruns, C. A. Gatsonis, P. P. Glasziou, L. M. Irwig, J. G. Lijmer, D. Moher, D. Rennie, and H. C. M. de Vet, 2003a: Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Clin. Chem. 49, 1–6.
Bossuyt, P. M., J. B. Reitsma, D. E. Bruns, C. A. Gatsonis, P. P. Glasziou, L. M. Irwig, D. Moher, D. Rennie, H. C. M. de Vet, and J. G. Lijmer, 2003b: The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin. Chem. 49, 7–18.
Branscum, A. J., I. A. Gardner, and W. O. Johnson, 2005: Estimation of diagnostic-test sensitivity and specificity through Bayesian modeling. Prev. Vet. Med. 68, 145–163.
Branscum, A. J., W. O. Johnson, T. E. Hanson, and I. A. Gardner, 2008: Bayesian semiparametric ROC curve estimation and disease diagnosis. Stat. Med. 27, 2474–2496.
Byrt, T., J. Bishop, and J. B. Carlin, 1993: Bias, prevalence and kappa. J. Clin. Epidemiol. 46, 423–429.
Choi, B. H. K., 1998: Slopes of a receiver operating characteristic curve and likelihood ratios for a diagnostic test. Am. J. Epidemiol. 148, 1127–1132.
Choi, Y. K., W. O. Johnson, M. T. Collins, and I. A. Gardner, 2006: Bayesian inferences for receiver operating characteristic curves in the absence of a gold standard. J. Agric. Biol. Environ. Stat. 11, 210–229.
Christensen, J., and I. A. Gardner, 2000: Herd-level interpretation of test results for epidemiologic studies of animal diseases. Prev. Vet. Med. 45, 83–106.
Dubey, J. P., and J. L. Jones, 2008: Toxoplasma gondii infection in humans and animals in the United States. Int. J. Parasitol. 38, 1257–1278.
Dubey, J. P., A. Kotula, A. Sharar, C. D. Andrews, and D. S. Lindsay, 1990: Effect of high temperature on infectivity of


Toxoplasma gondii tissue cysts in pork. J. Parasitol. 76, 201–204.
Dubey, J. P., P. Thulliez, R. M. Weigel, C. D. Andrews, P. Lind, and E. C. Powell, 1995a: Sensitivity and specificity of various serologic tests for detection of Toxoplasma gondii infection in naturally infected sows. Am. J. Vet. Res. 56, 1030–1036.
Dubey, J. P., P. Thulliez, and E. C. Powell, 1995b: Toxoplasma gondii in Iowa sows: comparison of antibody titers to isolation of T. gondii by bioassays in mice and cats. J. Parasitol. 81, 48–53.
Dubey, J. P., D. S. Lindsay, and C. A. Speer, 1998: Structures of Toxoplasma gondii tachyzoites, bradyzoites and sporozoites and biology and development of tissue cysts. Clin. Microbiol. Rev. 11, 267–299.
EFSA, 2007: Scientific opinion of the panel on biological hazards on a request from EFSA on surveillance and monitoring of Toxoplasma in humans, food and animals (adopted on October 17, 2007). EFSA J. 583, 1–64.
Enøe, C., M. P. Georgiadis, and W. O. Johnson, 2000: Estimation of sensitivity and specificity of diagnostic tests and disease prevalence when the true disease state is unknown. Prev. Vet. Med. 45, 61–81.
Gamble, H. R., J. P. Dubey, and D. N. Lambillotte, 2005: Comparison of a commercial ELISA with the modified agglutination test for detection of Toxoplasma infection in the domestic pig. Vet. Parasitol. 128, 177–181.
Gardner, I. A., 2004: An epidemiologic critique of current microbial risk assessment practices: the importance of prevalence and test accuracy data. J. Food Prot. 67, 2000–2007.
Gardner, I. A., and M. Greiner, 2006: Receiver-operating characteristic curves and likelihood ratios: improvements over traditional methods for the evaluation and application of veterinary clinical pathology tests. Vet. Clin. Pathol. 35, 8–17.
Gardner, I. A., H. Stryhn, P. Lind, and M. T. Collins, 2000: Conditional dependence affects the diagnosis and surveillance of animal diseases. Prev. Vet. Med. 45, 107–122.
Georgiadis, M. P., W. O. Johnson, I. Gardner, and R. Singh, 2003: Correlation-adjusted estimation of sensitivity and specificity of two diagnostic tests. Appl. Stat. 52, 63–76.
Georgiadis, M. P., W. O. Johnson, and I. A. Gardner, 2005: Sample size determination for estimation of the accuracy of two conditionally independent tests in the absence of a gold standard. Prev. Vet. Med. 71, 1–10.
Greiner, M., and I. A. Gardner, 2000a: Epidemiologic issues in the validation of veterinary diagnostic tests. Prev. Vet. Med. 45, 3–22.
Greiner, M., and I. A. Gardner, 2000b: Application of diagnostic tests in veterinary epidemiologic studies. Prev. Vet. Med. 45, 43–59.
Greiner, M., D. Pfeiffer, and R. D. Smith, 2000: Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests. Prev. Vet. Med. 45, 23–41.
Hanson, T. E., W. O. Johnson, and I. A. Gardner, 2000: Log-linear and logistic modeling of dependence among diagnostic tests. Prev. Vet. Med. 45, 123–137.
Hill, D. E., S. Chirukandoth, J. P. Dubey, J. K. Lunney, and H. R. Gamble, 2006: Comparison of detection methods for Toxoplasma gondii in naturally and experimentally infected swine. Vet. Parasitol. 141, 9–17.
Hui, S. L., and S. D. Walter, 1980: Estimating the error rates of diagnostic tests. Biometrics 36, 167–171.
Johnson, W. O., J. L. Gastwirth, and L. M. Pearson, 2001: Screening without a gold standard: the Hui-Walter paradigm revisited. Am. J. Epidemiol. 153, 921–924.
Lunn, D. J., A. Thomas, N. Best, and D. Spiegelhalter, 2000: WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Stat. Comp. 10, 325–337.
Mainar-Jaime, R. C., and M. Barbera, 2007: Evaluation of the diagnostic accuracy of the modified agglutination test (MAT) and an indirect ELISA for the detection of serum antibodies against Toxoplasma gondii in sheep through Bayesian approaches. Vet. Parasitol. 148, 122–129.
Martin, S. W., M. Shoukri, and M. A. Thorburn, 1992: Evaluating the health status of herds based on tests applied to individuals. Prev. Vet. Med. 14, 33–43.
Office International des Epizooties (OIE), 2008: Manual of Diagnostic Tests and Vaccines for Terrestrial Animals, OIE, Paris. Available at: www.oie.int/eng/normes/mmanual/A_summry.htm (accessed on 13 May 2009).
Peeling, R. W., P. G. Smith, and P. M. Bossuyt, 2006: A guide for diagnostic evaluations. Nat. Rev. Microbiol. 4(Suppl. 12), S2–S6.
Pouillot, R., G. Gerbier, and I. A. Gardner, 2002: "TAGS", a program for the evaluation of test accuracy in the absence of a gold standard. Prev. Vet. Med. 53, 67–81.
Simel, D. L., G. P. Samsa, and D. B. Matchar, 1993: Likelihood ratios for continuous test results – making the clinicians' job easier or harder? J. Clin. Epidemiol. 46, 85–93.
Staquet, M., M. Rozencweig, Y. J. Lee, and F. M. Muggia, 1981: Methodology for the assessment of new dichotomous diagnostic tests. J. Chronic Dis. 34, 599–610.
Su, C. L., I. A. Gardner, and W. O. Johnson, 2007: Bayesian estimation of aggregate test accuracy based on different sampling schemes. J. Agric. Biol. Environ. Stat. 12, 250–271.
TDR Diagnostics Evaluation Expert Panel, 2006: Evaluation of diagnostic tests for infectious diseases: general principles. Nat. Rev. Microbiol. 4(Suppl. 12), S20–S32.
Tenter, A., A. R. Heckeroth, and L. M. Weiss, 2000: Toxoplasma gondii: from animals to humans. Int. J. Parasitol. 30, 1217–1258.
Toft, N., E. Jorgensen, and S. Hojsgaard, 2005: Diagnosing diagnostic tests: evaluating the assumptions underlying the estimation of sensitivity and specificity in the absence of a gold standard. Prev. Vet. Med. 68, 19–33.
Whiting, P., A. W. S. Rutjes, J. Dinnes, J. B. Reitsma, P. M. M. Bossuyt, and J. Kleijnen, 2003: The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med. Res. Methodol. 3, 25.


Whiting, P., A. W. S. Rutjes, J. Dinnes, J. B. Reitsma, P. M. M. Bossuyt, and J. Kleijnen, 2004: Development and validation of methods for assessing the quality of diagnostic accuracy studies. Health Technol. Assess. 8, 1–234.
Zweig, M. H., and G. Campbell, 1993: Receiver-operating characteristic (ROC) plots – a fundamental evaluation tool in clinical medicine. Clin. Chem. 39, 561–577.

Appendix: WinBUGS code for Bayesian analyses

Code A
Two conditionally independent tests (MAT and bioassay)
Populations created based on group of sample collection, either group 1 or group 2
Informative prior on specificity of CMB only

model;
{
   # observed 2 x 2 tables (first index: MAT +/-, second index: CMB +/-) for groups 1 and 2
   y1[1:Q, 1:Q] ~ dmulti(p1[1:Q, 1:Q], n1)
   y2[1:Q, 1:Q] ~ dmulti(p2[1:Q, 1:Q], n2)
   # cell probabilities, group 1
   p1[1,1] <- pi1*SeMAT*SeCMB + (1-pi1)*(1-SpMAT)*(1-SpCMB)
   p1[1,2] <- pi1*SeMAT*(1-SeCMB) + (1-pi1)*(1-SpMAT)*SpCMB
   p1[2,1] <- pi1*(1-SeMAT)*SeCMB + (1-pi1)*SpMAT*(1-SpCMB)
   p1[2,2] <- pi1*(1-SeMAT)*(1-SeCMB) + (1-pi1)*SpMAT*SpCMB
   # cell probabilities, group 2
   p2[1,1] <- pi2*SeMAT*SeCMB + (1-pi2)*(1-SpMAT)*(1-SpCMB)
   p2[1,2] <- pi2*SeMAT*(1-SeCMB) + (1-pi2)*(1-SpMAT)*SpCMB
   p2[2,1] <- pi2*(1-SeMAT)*SeCMB + (1-pi2)*SpMAT*(1-SpCMB)
   p2[2,2] <- pi2*(1-SeMAT)*(1-SeCMB) + (1-pi2)*SpMAT*SpCMB
   # priors
   SeMAT ~ dbeta(1,1)
   SpMAT ~ dbeta(1,1)
   SeCMB ~ dbeta(1,1)
   SpCMB ~ dbeta(9999,1)   # approx. 1 false positive in 10,000
   pi1 ~ dbeta(1,1)
   pi2 ~ dbeta(1,1)
}

# data
list(n1 = 462, n2 = 537, y1 = structure(.Data = c(37,55,7,363), .Dim = c(2,2)), y2 = structure(.Data = c(104,26,22,385), .Dim = c(2,2)), Q = 2)

# initial values
list(pi1 = 0.1, pi2 = 0.3, SeMAT = 0.80, SpMAT = 0.90, SeCMB = 0.90, SpCMB = 0.9995)
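The cell probabilities in Code A have the form of the two-test, two-population conditional independence model commonly attributed to Hui and Walter (1980). As a rough check outside WinBUGS, the Python sketch below evaluates the same four expressions for one population at the initial values listed in Code A. The function name `independence_cells` is hypothetical and is not part of the original code; the sketch only reproduces the arithmetic and does not fit the model.

```python
def independence_cells(pi, se1, sp1, se2, sp2):
    """Cell probabilities for two conditionally independent binary tests.

    Returns [[p11, p12], [p21, p22]], where the first index is the result of
    test 1 (1 = positive, 2 = negative) and the second index is the result of
    test 2, mirroring p1[,] and p2[,] in Code A.
    """
    p11 = pi * se1 * se2 + (1 - pi) * (1 - sp1) * (1 - sp2)
    p12 = pi * se1 * (1 - se2) + (1 - pi) * (1 - sp1) * sp2
    p21 = pi * (1 - se1) * se2 + (1 - pi) * sp1 * (1 - sp2)
    p22 = pi * (1 - se1) * (1 - se2) + (1 - pi) * sp1 * sp2
    return [[p11, p12], [p21, p22]]


# Initial values from Code A for group 1 (test 1 = MAT, test 2 = CMB).
cells_group1 = independence_cells(pi=0.1, se1=0.80, sp1=0.90, se2=0.90, sp2=0.9995)

# The four cells of a valid multinomial probability table must sum to 1.
total_group1 = sum(sum(row) for row in cells_group1)
```

Because the four cells must sum to 1, each group contributes three degrees of freedom, so two groups give six, which matches the six parameters in Code A (two prevalences and the sensitivity and specificity of each test).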

Code B
Two conditionally dependent tests (MAT and ELISA) modelled as if independent
Populations created based on group of sample collection; 1 pig missing ELISA result
Non-informative priors for all parameters

model;
{
   # observed 2 x 2 tables (first index: MAT +/-, second index: ELISA +/-) for groups 1 and 2
   y1[1:Q, 1:Q] ~ dmulti(p1[1:Q, 1:Q], n1)
   y2[1:Q, 1:Q] ~ dmulti(p2[1:Q, 1:Q], n2)
   # cell probabilities, group 1
   p1[1,1] <- pi1*SeMAT*SeELISA + (1-pi1)*(1-SpMAT)*(1-SpELISA)
   p1[1,2] <- pi1*SeMAT*(1-SeELISA) + (1-pi1)*(1-SpMAT)*SpELISA
   p1[2,1] <- pi1*(1-SeMAT)*SeELISA + (1-pi1)*SpMAT*(1-SpELISA)
   p1[2,2] <- pi1*(1-SeMAT)*(1-SeELISA) + (1-pi1)*SpMAT*SpELISA
   # cell probabilities, group 2
   p2[1,1] <- pi2*SeMAT*SeELISA + (1-pi2)*(1-SpMAT)*(1-SpELISA)
   p2[1,2] <- pi2*SeMAT*(1-SeELISA) + (1-pi2)*(1-SpMAT)*SpELISA
   p2[2,1] <- pi2*(1-SeMAT)*SeELISA + (1-pi2)*SpMAT*(1-SpELISA)
   p2[2,2] <- pi2*(1-SeMAT)*(1-SeELISA) + (1-pi2)*SpMAT*SpELISA
   # priors
   SeMAT ~ dbeta(1,1)
   SpMAT ~ dbeta(1,1)
   SeELISA ~ dbeta(1,1)
   SpELISA ~ dbeta(1,1)
   pi1 ~ dbeta(1,1)
   pi2 ~ dbeta(1,1)
}

# data
list(n1 = 462, n2 = 537, y1 = structure(.Data = c(67,25,41,329), .Dim = c(2,2)), y2 = structure(.Data = c(97,33,36,371), .Dim = c(2,2)), Q = 2)

# initial values
list(pi1 = 0.1, pi2 = 0.2, SeMAT = 0.80, SpMAT = 0.90, SeELISA = 0.70, SpELISA = 0.80)
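The review contrasts latent class estimates with kappa, which measures agreement but not accuracy. As an illustration only, the Python sketch below computes Cohen's kappa from the MAT and ELISA cross-classifications in the Code B data list. Pooling the two groups is an assumption made here for simplicity and may differ from how kappa was computed in the review; the helper name `cohen_kappa` is hypothetical.

```python
def cohen_kappa(table):
    """Cohen's kappa for a 2 x 2 table [[a, b], [c, d]] of paired test results."""
    (a, b), (c, d) = table
    n = a + b + c + d
    observed = (a + d) / n  # proportion of concordant results
    # agreement expected by chance from the row and column margins
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)


# Counts from the Code B data list (first index MAT, second index ELISA).
group1 = [[67, 25], [41, 329]]
group2 = [[97, 33], [36, 371]]

# Pool the two groups cell by cell (an assumption of this sketch).
pooled = [[group1[i][j] + group2[i][j] for j in range(2)] for i in range(2)]
kappa_pooled = cohen_kappa(pooled)
```

Kappa is symmetric in the two discordant cells, so it is unaffected by which test is placed on the rows. Whatever its value, it says nothing about the sensitivity or specificity of either test, which is why the dependence model in Code C is recommended instead.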

Code C
Two conditionally dependent tests (MAT and ELISA) modelled as dependent; priors on lambdas and gammas induce priors on ELISA; Georgiadis parameterization
Populations created based on group of sample collection; 1 pig missing ELISA result
Informative priors for MAT but not on prevalence or lambdas and gammas

model;
{
   # observed 2 x 2 tables (first index: MAT +/-, second index: ELISA +/-) for groups 1 and 2
   y1[1:Q, 1:Q] ~ dmulti(p1[1:Q, 1:Q], n1)
   y2[1:Q, 1:Q] ~ dmulti(p2[1:Q, 1:Q], n2)
   # cell probabilities: mixture of infected (eta) and non-infected (theta) joint probabilities
   p1[1,1] <- pi1*eta11 + (1-pi1)*theta11
   p1[1,2] <- pi1*eta12 + (1-pi1)*theta12
   p1[2,1] <- pi1*eta21 + (1-pi1)*theta21
   p1[2,2] <- pi1*eta22 + (1-pi1)*theta22
   p2[1,1] <- pi2*eta11 + (1-pi2)*theta11
   p2[1,2] <- pi2*eta12 + (1-pi2)*theta12
   p2[2,1] <- pi2*eta21 + (1-pi2)*theta21
   p2[2,2] <- pi2*eta22 + (1-pi2)*theta22
   # joint test probabilities in infected animals
   eta11 <- lambdaD*SeMAT
   eta12 <- SeMAT - eta11
   eta21 <- gammaD*(1-SeMAT)
   eta22 <- 1 - eta11 - eta12 - eta21
   # joint test probabilities in non-infected animals
   theta11 <- 1 - theta12 - theta21 - theta22
   theta12 <- gammaDc*(1-SpMAT)
   theta21 <- SpMAT - theta22
   theta22 <- lambdaDc*SpMAT
   # induced ELISA accuracy and test correlations
   SeELISA <- eta11 + eta21
   SpELISA <- theta22 + theta12
   rhoD <- (eta11 - SeMAT*SeELISA) / sqrt(SeMAT*(1-SeMAT)*SeELISA*(1-SeELISA))
   rhoDc <- (theta22 - SpMAT*SpELISA) / sqrt(SpMAT*(1-SpMAT)*SpELISA*(1-SpELISA))
   # priors
   pi1 ~ dbeta(1,1)
   pi2 ~ dbeta(1,1)
   SeMAT ~ dbeta(24.05, 5.72)   # mode = 0.83; 95% sure >0.68
   SpMAT ~ dbeta(22.99, 3.42)   # mode = 0.90; 95% sure >0.70
   lambdaD ~ dbeta(1,1)
   gammaD ~ dbeta(1,1)
   lambdaDc ~ dbeta(1,1)
   gammaDc ~ dbeta(1,1)
}

# data
list(n1 = 462, n2 = 537, Q = 2, y1 = structure(.Data = c(67,25,41,329), .Dim = c(2,2)), y2 = structure(.Data = c(97,33,36,371), .Dim = c(2,2)))

# initial values
list(pi1 = 0.07, pi2 = 0.20, SeMAT = 0.83, SpMAT = 0.90, lambdaD = 0.50, lambdaDc = 0.50, gammaD = 0.50, gammaDc = 0.50)
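To make the parameterization in Code C easier to follow, the Python sketch below reproduces the infected-animal part of the model: it maps MAT sensitivity and the two conditional probabilities onto the joint probabilities (eta), the induced ELISA sensitivity and the sensitivity correlation (rhoD). Reading lambdaD as P(ELISA+ | MAT+, infected) and gammaD as P(ELISA+ | MAT−, infected) is an interpretation inferred from the code, and the function name `dependence_cells` is hypothetical. The non-infected part is the mirror image with SpMAT, lambdaDc and gammaDc.

```python
from math import sqrt


def dependence_cells(se_mat, lam, gam):
    """Joint MAT/ELISA probabilities among infected animals, as in Code C.

    lam is read as P(ELISA+ | MAT+, infected) and gam as
    P(ELISA+ | MAT-, infected); this reading is an assumption of the sketch.
    Returns ((eta11, eta12, eta21, eta22), se_elisa, rho).
    """
    eta11 = lam * se_mat                    # MAT+, ELISA+
    eta12 = se_mat - eta11                  # MAT+, ELISA-
    eta21 = gam * (1 - se_mat)              # MAT-, ELISA+
    eta22 = 1 - eta11 - eta12 - eta21       # MAT-, ELISA-
    se_elisa = eta11 + eta21
    rho = (eta11 - se_mat * se_elisa) / sqrt(
        se_mat * (1 - se_mat) * se_elisa * (1 - se_elisa)
    )
    return (eta11, eta12, eta21, eta22), se_elisa, rho


# At the initial values in Code C the two conditional probabilities are equal,
# so the ELISA result does not depend on the MAT result and rho is zero.
etas_init, se_init, rho_init = dependence_cells(0.83, 0.50, 0.50)

# Illustrative (not estimated) values with lam > gam give a positive correlation.
etas_dep, se_dep, rho_dep = dependence_cells(0.83, 0.80, 0.40)
```

When the two conditional probabilities are equal the model collapses to conditional independence, so the independence model of Code B is a special case. The uniform priors on the lambdas and gammas let the data indicate how far the tests depart from it.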

Code D
Two conditionally dependent tests (MAT and ELISA) modelled as dependent; priors on lambdas and gammas induce priors on ELISA; Georgiadis parameterization
Two populations created based on mouse bioassay; 1 pig missing ELISA result
Informative priors for prevalence only: pi1 = mouse positive and pi2 = mouse negative

model;
{
   # observed 2 x 2 tables (first index: MAT +/-, second index: ELISA +/-) for mouse-positive and mouse-negative pigs
   y1[1:Q, 1:Q] ~ dmulti(p1[1:Q, 1:Q], n1)
   y2[1:Q, 1:Q] ~ dmulti(p2[1:Q, 1:Q], n2)
   # cell probabilities: mixture of infected (eta) and non-infected (theta) joint probabilities
   p1[1,1] <- pi1*eta11 + (1-pi1)*theta11
   p1[1,2] <- pi1*eta12 + (1-pi1)*theta12
   p1[2,1] <- pi1*eta21 + (1-pi1)*theta21
   p1[2,2] <- pi1*eta22 + (1-pi1)*theta22
   p2[1,1] <- pi2*eta11 + (1-pi2)*theta11
   p2[1,2] <- pi2*eta12 + (1-pi2)*theta12
   p2[2,1] <- pi2*eta21 + (1-pi2)*theta21
   p2[2,2] <- pi2*eta22 + (1-pi2)*theta22
   # joint test probabilities in infected animals
   eta11 <- lambdaD*SeMAT
   eta12 <- SeMAT - eta11
   eta21 <- gammaD*(1-SeMAT)
   eta22 <- 1 - eta11 - eta12 - eta21
   # joint test probabilities in non-infected animals
   theta11 <- 1 - theta12 - theta21 - theta22
   theta12 <- gammaDc*(1-SpMAT)
   theta21 <- SpMAT - theta22
   theta22 <- lambdaDc*SpMAT
   # induced ELISA accuracy and test correlations
   SeELISA <- eta11 + eta21
   SpELISA <- theta22 + theta12
   rhoD <- (eta11 - SeMAT*SeELISA) / sqrt(SeMAT*(1-SeMAT)*SeELISA*(1-SeELISA))
   rhoDc <- (theta22 - SpMAT*SpELISA) / sqrt(SpMAT*(1-SpMAT)*SpELISA*(1-SpELISA))
   # priors
   pi1 ~ dbeta(9999,1)
   pi2 ~ dbeta(1,10)
   SeMAT ~ dbeta(1,1)
   SpMAT ~ dbeta(1,1)
   lambdaD ~ dbeta(1,1)
   gammaD ~ dbeta(1,1)
   lambdaDc ~ dbeta(1,1)
   gammaDc ~ dbeta(1,1)
}

# data
list(n1 = 107, n2 = 891, Q = 2, y1 = structure(.Data = c(73,17,4,13), .Dim = c(2,2)), y2 = structure(.Data = c(91,41,73,686), .Dim = c(2,2)))

# initial values
list(pi1 = 0.999, pi2 = 0.10, SeMAT = 0.83, SpMAT = 0.90, lambdaD = 0.50, lambdaDc = 0.50, gammaD = 0.50, gammaDc = 0.50)
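The two prevalence priors in Code D, beta (9999,1) and beta (1,10), both have closed-form distribution functions, so what they imply can be checked without special software. The Python sketch below is a minimal illustration using the standard results that beta (a,1) has CDF x^a and beta (1,b) has CDF 1 − (1 − x)^b. The function names are hypothetical and the sketch is not part of the original analysis.

```python
def quantile_beta_a1(q, a):
    """Quantile of a beta(a, 1) distribution, whose CDF is x**a."""
    return q ** (1.0 / a)


def quantile_beta_1b(q, b):
    """Quantile of a beta(1, b) distribution, whose CDF is 1 - (1 - x)**b."""
    return 1.0 - (1.0 - q) ** (1.0 / b)


# Prior for prevalence among mouse-positive pigs: beta(9999, 1).
pi1_lower = quantile_beta_a1(0.025, 9999)   # lower 2.5% quantile of the prior
pi1_mean = 9999 / (9999 + 1)                # prior mean a / (a + b)

# Prior for prevalence among mouse-negative pigs: beta(1, 10).
pi2_median = quantile_beta_1b(0.5, 10)      # prior median
pi2_upper = quantile_beta_1b(0.95, 10)      # 95th percentile of the prior
pi2_mean = 1 / (1 + 10)                     # prior mean a / (a + b)
```

Under these formulas the mouse-positive prior places almost all of its mass above 0.9996, while the mouse-negative prior has a mean near 0.09 with a long right tail. A check of this kind can help readers judge how strongly a stated prior constrains prevalence before the sensitivity analysis recommended in Table 7.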
