Follicle Stimulating Hormone is an accurate predictor of azoospermia in childhood cancer survivors

Thomas W Kelsey1, Lauren McConville2, Angela B Edgar3, Alex I Ungurianu1, Rod Mitchell4, Richard A Anderson4 and W Hamish B Wallace3

1 School of Computer Science, University of St. Andrews, St. Andrews, United Kingdom
2 School of Medicine, University of Edinburgh, Edinburgh EH16 4TJ, United Kingdom
3 Department of Haematology/Oncology, Royal Hospital for Sick Children, Edinburgh EH9 1LF, United Kingdom
4 MRC Centre for Reproductive Health, University of Edinburgh, EH16 4TJ, United Kingdom

For correspondence:
Department of Haematology/Oncology
Royal Hospital for Sick Children
17 Millerfield Place
Edinburgh EH9 1LW
UK
Email [email protected]

Key words: FSH / azoospermia / childhood cancer / late effects

DISCLOSURE STATEMENT: The authors have no conflicts of interest to disclose.

Abstract

Study question: How accurate is FSH as a predictor of azoospermia in survivors of childhood cancer?

Summary answer: FSH is an accurate predictor, with a diagnostic threshold of 17 IU/L giving 94% probability of avoiding a misdiagnosis of azoospermia.

What is known already: The accuracy of FSH as a predictor of azoospermia in adult survivors of childhood cancer is unclear, with conflicting results in the published literature.

Study design, size, duration: A systematic review and post hoc analysis of combined data (n = 367) were performed on all published studies containing extractable data on both serum FSH concentration and semen concentration in survivors of childhood cancer.

Participants/materials, setting, methods: To identify relevant studies based on the PRISMA statement, PubMed and Medline databases were searched up to September 2016 by two blinded investigators. Articles were included if they contained both serum FSH concentration and semen concentration, if they used World Health Organisation certified methods for semen analysis, and if all study participants were childhood cancer survivors.

Main results and the role of chance: There was no evidence for either publication bias or heterogeneity for the five studies. For the combined data (n=367) the optimal FSH threshold was 10.4 IU/L with specificity 81% (95% CI 76% - 86%) and sensitivity 83% (95% CI 76% - 89%). The AUC was 0.89 (95%CI 0.86 – 0.93). A range of threshold FSH values for the diagnosis of azoospermia with their associated sensitivities and specificities were calculated.

Limitations, reasons for caution: Semen sample analysis remains the gold standard for diagnosis of azoospermia; our findings provide an alternative and inferior method for patients who are reluctant to submit semen samples.

Wider implications of the findings: This study provides strong supporting evidence for the use of serum FSH as a surrogate biomarker for azoospermia in adult males who have been treated for childhood cancer.

Study funding/competing interest(s): RTM is supported by a Wellcome Trust Intermediate Clinical Fellowship (Grant No: 098522). TWK is supported by EPSRC grant EP/P015638/1. The funding bodies played no role in the design, methods, data management or analysis or in the decision to publish. The authors have no conflicts of interest to declare.

Introduction

The potential impact of childhood cancer treatment on male fertility is a significant issue both for families at the time of diagnosis and for the young adult survivor (Anderson, et al., 2015, Skinner, et al., 2017). Treatment at any age with chemotherapy agents (particularly high doses of alkylating agents) and pelvic radiotherapy may damage the testes, resulting in impaired sperm production (Chow, et al., 2016, Green, et al., 2014, Greenfield, et al., 2007, Greenfield, et al., 2010, Skinner, et al., 2017, van Beek, et al., 2007). While semen analysis remains the gold standard, a serum biomarker of sufficient accuracy, for example Follicle Stimulating Hormone (FSH), would provide a useful indirect assessment of fertility.

The feedback relationship between the seminiferous tubule and the hypothalamus/pituitary underpins the putative value of FSH and inhibin B in the quantitative assessment of spermatogenesis (McCullagh, 1932). FSH concentrations are negatively related to sperm concentration in both normal men and in those with testicular dysfunction, whereas serum inhibin B is positively related (Anderson, et al., 1997, Illingworth, et al., 1996, Jensen, et al., 1997). Both can be used to aid discrimination of obstructive vs non-obstructive azoospermia in infertile men (Toulis, et al., 2010) without clear benefit of one over the other, likely reflecting their interdependence and relationship to maturational stages of spermatogenesis (Okuma, et al., 2006).

The ready availability and acceptability of serum FSH analysis compared to semen analysis makes it of potential value as a predictor of azoospermia in childhood cancer survivors (CCS), but the literature contains conflicting reports of the sensitivity and specificity of plasma concentrations of FSH in this context. Green et al. (Green, et al., 2013) found that FSH was unsuitable as a predictor of azoospermia in CCS, whilst Romerius et al. (Romerius, et al., 2011) concluded that FSH was an excellent predictor. It is possible that sources of heterogeneity such as diagnosis, treatment regimens or pubertal status may account for this difference. It is also possible that there is little or no inherent heterogeneity, in which case data can be combined from multiple studies in order to provide a dataset suitable for improved assessment of the true level of diagnostic strength.

In this study we identified studies that have reported FSH and sperm concentrations in CCS, and used them (a) to test the data for homogeneity and (b) to assess the value of FSH as a diagnostic predictor of azoospermia in CCS.

Patients and methods

Using an established methodology (Iliodromiti, et al., 2013, Iliodromiti, et al., 2016, Kelsey, et al., 2013), a scoping search was carried out using relevant MeSH headings which generated 680 results on PubMed and 973 on Scopus. The abstracts of all studies identified were screened, and any studies in cancer survivors that had data on semen analysis and FSH levels were read in full. Studies were selected if they met the following criteria: (i) they contained both serum FSH concentration and semen concentration (either as explicit values or reported in a scatterplot); (ii) World Health Organisation (WHO) certified methods (World Health Organisation, 2010) were used in the semen analysis; (iii) the study participants were all childhood cancer survivors, or data were clearly demarcated between childhood cancer survivors and normal controls, in which case only cancer survivor data were extracted; (iv) all study designs were included except case reports.

In addition to data identified from a systematic search of the literature, we included our own data (SI 1) used (but not explicitly reported or given as a scatterplot) in a CCS semen quality study (Thomson, et al., 2002). This study involved 33 male survivors of childhood cancer recruited from the oncology database at the Royal Hospital for Sick Children, Edinburgh, from whom FSH levels were obtained in addition to semen concentrations determined according to WHO protocols.

Although different FSH assays were used in the included studies, a detailed comparison has shown ‘fair to strong consistency’ between the relevant assays (Radicioni, et al., 2013); extracted data were therefore used without further conversion.

Approval was not required from an ethics committee or institutional review board since our research was limited to use of previously collected, non-identifiable data that have been published in peer reviewed journals, which is specifically excluded from Research Ethics Committee review by the National Research Ethics Service guidelines of the UK Health Research Authority (HRA, 2013).

The risks of publication bias and potential small study effect were visually assessed by constructing funnel plots, in which calculated diagnostic accuracy is set against statistical precision (Sterne, et al., 2011). In addition, we performed a linear regression of log diagnostic odds ratios on the inverse root of effective sample sizes as a test for funnel plot asymmetry, where a non-zero slope coefficient is suggestive of significant asymmetry and small study bias (Deeks, et al., 2005).
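The asymmetry regression described above can be sketched as follows. This is a minimal illustration in Python, not the R analysis used in the study. It assumes that the regression is weighted by effective sample size and that a 0.5 continuity correction is added to every cell; the function names (deeks_point, weighted_linear_fit) and the 2x2 counts are hypothetical and were invented for illustration only.

```python
import math


def deeks_point(tp, fp, fn, tn):
    """Return (x, y, weight) for one study's 2x2 table.

    x is 1/sqrt(ESS), y is ln(DOR) and weight is ESS, where ESS is the
    effective sample size 4*n1*n2/(n1+n2) for n1 azoospermic and n2
    non-azoospermic participants.
    """
    n1, n2 = tp + fn, fp + tn
    ess = 4.0 * n1 * n2 / (n1 + n2)
    # Assumed 0.5 continuity correction so that zero cells do not break the log.
    a, b, c, d = (cell + 0.5 for cell in (tp, fp, fn, tn))
    return 1.0 / math.sqrt(ess), math.log((a * d) / (b * c)), ess


def weighted_linear_fit(xs, ys, ws):
    """Weighted least-squares fit of y = intercept + slope * x."""
    total = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical (TP, FP, FN, TN) counts, invented purely for illustration.
example_tables = [(30, 10, 10, 50), (8, 3, 4, 15), (12, 5, 6, 20)]
xs, ys, ws = zip(*(deeks_point(*t) for t in example_tables))
intercept, slope = weighted_linear_fit(xs, ys, ws)
```

A slope close to zero, judged against its standard error (not computed in this sketch), would be read as no evidence of funnel plot asymmetry.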

Statistical analysis

Initial analysis considered the heterogeneity or otherwise of the included studies. This was tested using four distinct techniques: visually by forest plots (Sedgwick, 2015), numerically by calculating the slope of the affine regression equation linking the study diagnostic odds ratios (DOR) to the study thresholds (Moses, et al., 1993, Walter, 2002) (where a slope close to zero shows homogeneity of the studies), and statistically by (i) calculating the p-value for the chi-squared test of the null hypothesis that the studies are homogeneous (a high p-value provides no evidence of heterogeneity) and (ii) calculating Higgins' I2 statistic for measuring inconsistency in meta-analyses (Higgins, et al., 2003) (a small value suggests homogeneity). Two statistical tests were used as the interpretation of I2 can be misleading, since the importance of inconsistency depends on several factors and the magnitude and direction of effects could lead to a small I2 despite a large chi-squared p-value (Julian P T Higgins, 2011).
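The two statistical heterogeneity measures can be illustrated with a short sketch. It is written in Python rather than R and is not the mada implementation; it assumes that the per-study effect is the log DOR with a 0.5 continuity correction, whereas the chi-squared tests reported in this study were applied to sensitivity and specificity separately. The function names are hypothetical.

```python
import math


def log_dor_and_variance(tp, fp, fn, tn):
    """Log diagnostic odds ratio and its approximate variance for one study."""
    # Assumed 0.5 continuity correction on every cell.
    a, b, c, d = (cell + 0.5 for cell in (tp, fp, fn, tn))
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def cochran_q_and_i2(effects, variances):
    """Cochran's Q, its degrees of freedom, and Higgins' I2 in percent."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I2 = 100 * (Q - df) / Q, with negative values set to zero.
    i2 = 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2
```

Under homogeneity Q is referred to a chi-squared distribution with df degrees of freedom; the p-value is omitted here because the Python standard library has no chi-squared distribution function.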

After combining the data into a single set of (FSH, azoospermic or not azoospermic) pairs, a ROC curve was constructed. 95% confidence intervals for the AUC were calculated using 200 bootstraps of the data set, as were the optimal threshold (i.e. the level of FSH that maximizes the probability of a randomly-selected (azoospermic, not azoospermic) pair from the CCS population being correctly diagnosed) and the 95% confidence intervals for the specificity and sensitivity at each threshold value. All analyses were performed using the mada and pROC packages for the R statistical language (R Development Core Team, 2010).
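The ROC analysis can be sketched in a few lines. The following is an illustrative Python version, not the pROC code used for the reported results. It assumes that a participant is classified as azoospermic when FSH is at or above the threshold, that the optimal threshold is the one maximizing sensitivity plus specificity (Youden's index), and that bootstrap resampling is stratified by azoospermia status; all function names are hypothetical.

```python
import random


def auc(pos, neg):
    """Probability that a random azoospermic FSH exceeds a random
    non-azoospermic FSH, counting ties as one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(pos, neg, threshold):
    """Sensitivity and specificity when FSH >= threshold predicts azoospermia."""
    sens = sum(p >= threshold for p in pos) / len(pos)
    spec = sum(n < threshold for n in neg) / len(neg)
    return sens, spec


def best_threshold(pos, neg):
    """Observed FSH value maximizing sensitivity + specificity (assumed criterion)."""
    candidates = sorted(set(pos) | set(neg))
    return max(candidates, key=lambda t: sum(sens_spec(pos, neg, t)))


def bootstrap_auc_ci(pos, neg, n_boot=200, seed=0):
    """Percentile 95% CI for the AUC from stratified bootstrap resamples."""
    rng = random.Random(seed)
    stats = sorted(
        auc(rng.choices(pos, k=len(pos)), rng.choices(neg, k=len(neg)))
        for _ in range(n_boot)
    )
    low = stats[int(0.025 * n_boot)]
    high = stats[min(n_boot - 1, int(0.975 * n_boot))]
    return low, high
```

Here pos and neg are lists of FSH values (IU/L) for azoospermic and non-azoospermic survivors respectively. The same resampling loop could collect sensitivity and specificity at each candidate threshold to give the intervals of the kind reported in Table 2.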

Results

The application of inclusion and exclusion criteria to the studies found in the literature yielded four sources of FSH and semen concentration in CCS (Table 1, SI 2, Fig. 1) (Green, et al., 2013, Lahteenmaki, et al., 2008, Rendtorff, et al., 2012, van Beek, et al., 2007). The Chi-squared statistical test for funnel plot asymmetry (Fig. 2) did not reach statistical significance (p=0.32 for sensitivity; p = 0.17 for specificity), suggesting that neither studies with small sample size nor studies with results lacking statistical significance are missing from the literature. As all the included studies used WHO protocols, we conclude that they are at low risk of bias and have low concern about applicability, as specified by the QUADAS-2 and STARD frameworks for reporting diagnostic accuracy (Bossuyt, et al., 2015, Whiting, et al., 2011).

The confidence intervals for the log-adjusted DOR for each study have similar ranges, suggesting a lack of significant study heterogeneity (Fig. 3). The slope of the regression equation linking the study log DOR to the study FSH thresholds was close to zero (slope = -0.01), providing numerical evidence for study homogeneity. The chi-squared p-values were 0.32 for study sensitivity and 0.17 for study specificity, supplying no statistically significant evidence for the hypothesis that the studies are heterogeneous. The Higgins' I2 statistic was 0%, the lowest possible indication of study heterogeneity. Taken together, and in conjunction with the lack of publication bias, we conclude that the studies are homogeneous in terms of dependency on FSH thresholds to determine diagnostic accuracy, and hence that combining the study data into a single set results in a representative sample of the CCS population in terms of FSH levels and sperm concentrations.

For the combined data (n=367, SI 1, SI 2) the optimal FSH threshold was 10.4 IU/L with specificity 81% (95% CI 76% - 86%) and sensitivity 82% (95% CI 76% - 88%). The AUC was 0.89 (95%CI 0.85 – 0.92), demonstrating that FSH is a strong predictor of azoospermia for CCS (Fig. 4).

The optimal threshold maximizes the chance of a correct classification for an arbitrary survivor of childhood cancer. In order to quantify FSH levels that minimize misdiagnosis of azoospermia, a range of threshold FSH values for the diagnosis of azoospermia were calculated, together with the median and 95% confidence intervals for their associated sensitivities and specificities (Table 2). A diagnostic threshold of 17 IU/L for FSH gives 94% probability of avoiding misdiagnosis of azoospermia, with 95% confidence interval 90 - 97% (Table 2).
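As a small worked illustration of how such a conservative threshold might be read off Table 2, the sketch below (Python, using the specificity medians and lower 95% confidence bounds reported in Table 2) returns the lowest FSH threshold whose lower bound reaches a chosen target. The selection rule and the 0.90 target are assumptions made for this example, and the function name is hypothetical.

```python
# (threshold in IU/L, specificity median, lower 95% CI bound), copied from Table 2.
TABLE_2_SPECIFICITY = [
    (9, 0.743, 0.690), (10, 0.783, 0.730), (10.4, 0.814, 0.761),
    (11, 0.827, 0.774), (12, 0.858, 0.810), (13, 0.885, 0.845),
    (14, 0.898, 0.858), (15, 0.916, 0.881), (16, 0.925, 0.889),
    (17, 0.938, 0.903), (18, 0.947, 0.916), (19, 0.951, 0.920),
    (20, 0.969, 0.943),
]


def lowest_threshold(rows, min_lower_bound):
    """Lowest threshold whose specificity lower bound meets the target.

    Returns (threshold, median specificity), or None if no row qualifies.
    """
    for threshold, median, lower in sorted(rows):
        if lower >= min_lower_bound:
            return threshold, median
    return None


# Assumed target: lower confidence bound for specificity of at least 0.90.
choice = lowest_threshold(TABLE_2_SPECIFICITY, 0.90)
```

With this assumed target the rule selects 17 IU/L, whose median specificity of 0.938 corresponds to the 94% probability quoted above.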

Discussion

We have shown that FSH has strong diagnostic power, with 89% probability that FSH levels will correctly rank a randomly chosen (azoospermic, not azoospermic) pair of survivors of childhood cancer (i.e. the area under the ROC curve) (Hanley and McNeil, 1982), with 95% confidence that this probability is within 85% and 92% (Fig. 4). We have also calculated clinically useful diagnostic levels for a range of FSH thresholds (Table 2).

We have assessed heterogeneity of existing studies using visual, numeric modelling and two distinct statistical tests; none of these suggested any important level of heterogeneity (Figs. 2 and 3). This result is of clinical and biomedical interest in its own right, but also allows us safely to combine the data into a single set, which has greater power for statistical analysis than any single study reported to date. While different FSH assays were used in the studies included in this analysis, there is good concordance between them (Radicioni, et al., 2013).

FSH, inhibin B, and more recently anti-Mullerian hormone have been previously investigated as biomarkers of seminiferous tubule function, often to attempt to predict the surgical recovery of sperm in azoospermic men (Toulis, et al., 2010). The latter two are products of the Sertoli cell, with potentially an additional contribution to serum inhibin B from germ cells (Makanji, et al., 2014). In the post-chemotherapy testis, the key pathology determining azoospermia or not is the presence or absence of spermatogonial stem cells at the end of treatment. This differs therefore from the situation in the more general male infertility population, where disorders of spermatogenic maturation are relatively common, with likely impact on the germ cell-Sertoli cell interaction and production of inhibin B, and feedback regulation of FSH. It is thus possible that serum biomarkers of spermatogenesis may be more accurate in post-chemotherapy assessment than with the wide range of pathologies seen in the general infertile population.

Our calculated optimal FSH threshold for classifying a CCS as azoospermic is 10.4 IU/L, where optimal means providing the best tradeoff between sensitivity (i.e. minimized prediction of non-zero sperm concentration for CCS who are in reality azoospermic) and specificity (i.e. minimized prediction of azoospermia for CCS who in reality have non-zero sperm concentration). In clinical practice of long-term follow up of CCS, however, we suggest that a more conservative threshold is more appropriate, since a false positive diagnosis of azoospermia is worse than a false negative. It should also be emphasized that in some azoospermic CCS it is possible to obtain sperm by micro-TESE (Shin, et al., 2016). The bootstrap sampling used to provide the optimal threshold (necessarily) allows calculation of confidence intervals for sensitivities and specificities for all potential thresholds, and from these we observe that a diagnostic threshold of 17 IU/L for FSH has a 94% probability of avoiding this misdiagnosis, with 95% confidence interval of 90% - 97% (Table 2). Our specificity results are in quantitative agreement with a study that reported mean FSH of 22 IU/L in 21 azoospermic CCS compared to 9 IU/L in 10 controls, with 81% specificity at a 10 IU/L cutoff (Wilhelmsson, et al., 2014) compared to our value of 78%. However, our calculated sensitivity at this cutoff is higher: 83% compared to 56% (Wilhelmsson, et al., 2014).

Serum assessment of FSH is therefore a useful test before the patient is ready to submit a semen sample for analysis, and the present analysis indicates high predictive accuracy. Attempts to survey CCS with universal semen analysis have demonstrated the reluctance of these patients to submit semen samples. In contrast, a blood test is less intrusive and more acceptable to these young CCS (Lahteenmaki, et al., 2008). The use of hormone measurement in dried blood spots sent by post has recently been evaluated in the analysis of reproductive function in female cancer survivors (Roberts, et al., 2016), and this technique has clear potential to be useful in the male case.

This study provides strong supporting evidence for the use of serum FSH as a useful surrogate biomarker for spermatogenesis in adult males who have been treated for childhood cancer; however, semen analysis should always be encouraged and remains the gold standard test of spermatogenesis.

Author contributions

TWK: study design, data collection, analysis, drafting and finalising manuscript
LM: study design, data collection, drafting and finalising manuscript
AU: data analysis, drafting and finalising manuscript
WHBW: study design, data analysis, drafting and finalising manuscript
ABE: data analysis, drafting and finalising manuscript
RTM: data analysis, drafting and finalising manuscript
RAA: data analysis, drafting and finalising manuscript

References

Anderson RA, Mitchell RT, Kelsey TW, Spears N, Telfer EE, Wallace WH. Cancer treatment and gonadal function: experimental and established strategies for fertility preservation in children and young adults. Lancet Diabetes Endocrinol 2015;3: 556-567.
Anderson RA, Wallace EM, Groome NP, Bellis AJ, Wu FC. Physiological relationships between inhibin B, follicle stimulating hormone secretion and spermatogenesis in normal men and response to gonadotrophin suppression by exogenous testosterone. Hum Reprod 1997;12: 746-751.
Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, Lijmer JG, Moher D, Rennie D, de Vet HC et al. STARD 2015: An Updated List of Essential Items for Reporting Diagnostic Accuracy Studies. Clin Chem 2015;61: 1446-1452.
Chow EJ, Stratton KL, Leisenring WM, Oeffinger KC, Sklar CA, Donaldson SS, Ginsberg JP, Kenney LB, Levine JM, Robison LL et al. Pregnancy after chemotherapy in male and female survivors of childhood cancer treated between 1970 and 1999: a report from the Childhood Cancer Survivor Study cohort. Lancet Oncol 2016;17: 567-576.
Deeks JJ, Macaskill P, Irwig L. The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. J Clin Epidemiol 2005;58: 882-893.
Green DM, Liu W, Kutteh WH, Ke RW, Shelton KC, Sklar CA, Chemaitilly W, Pui CH, Klosky JL, Spunt SL et al. Cumulative alkylating agent exposure and semen parameters in adult survivors of childhood cancer: a report from the St Jude Lifetime Cohort Study. Lancet Oncol 2014;15: 1215-1223.
Green DM, Zhu L, Zhang N, Sklar CA, Ke RW, Kutteh WH, Klosky JL, Spunt SL, Metzger ML, Navid F et al. Lack of specificity of plasma concentrations of inhibin B and follicle-stimulating hormone for identification of azoospermic survivors of childhood cancer: a report from the St Jude lifetime cohort study. J Clin Oncol 2013;31: 1324-1328.
Greenfield DM, Walters SJ, Coleman RE, Hancock BW, Eastell R, Davies HA, Snowden JA, Derogatis L, Shalet SM, Ross RJ. Prevalence and consequences of androgen deficiency in young male cancer survivors in a controlled cross-sectional study. J Clin Endocrinol Metab 2007;92: 3476-3482.

Greenfield DM, Walters SJ, Coleman RE, Hancock BW, Snowden JA, Shalet SM, DeRogatis LR, Ross RJ. Quality of life, self-esteem, fatigue, and sexual function in young men after cancer: a controlled cross-sectional study. Cancer 2010;116: 1592-1601.
Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143: 29-36.
Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ 2003;327: 557-560.
HRA. Does my project require review by a Research Ethics Committee? 2013.
Iliodromiti S, Kelsey TW, Anderson RA, Nelson SM. Can anti-Mullerian hormone predict the diagnosis of polycystic ovary syndrome? A systematic review and meta-analysis of extracted data. J Clin Endocrinol Metab 2013;98: 3332-3340.
Iliodromiti S, Sassarini J, Kelsey TW, Lindsay RS, Sattar N, Nelson SM. Accuracy of circulating adiponectin for predicting gestational diabetes: a systematic review and meta-analysis. Diabetologia 2016;59: 692-699.
Illingworth PJ, Groome NP, Byrd W, Rainey WE, McNeilly AS, Mather JP, Bremner WJ. Inhibin-B: a likely candidate for the physiologically important form of inhibin in men. J Clin Endocrinol Metab 1996;81: 1321-1325.
Jensen TK, Andersson AM, Hjollund NH, Scheike T, Kolstad H, Giwercman A, Henriksen TB, Ernst E, Bonde JP, Olsen J et al. Inhibin B as a serum marker of spermatogenesis: correlation to differences in sperm concentration and follicle-stimulating hormone levels. A study of 349 Danish men. J Clin Endocrinol Metab 1997;82: 4059-4063.
Julian P T Higgins SG. Cochrane Handbook for Systematic Reviews of Interventions, 2011. The Cochrane Collaboration.
Kelsey TW, Dodwell SK, Wilkinson AG, Greve T, Andersen CY, Anderson RA, Wallace WH. Ovarian volume throughout life: a validated normative model. PLoS One 2013;8: e71465.
Lahteenmaki PM, Arola M, Suominen J, Salmi TT, Andersson AM, Toppari J. Male reproductive health after childhood cancer. Acta Paediatr 2008;97: 935-942.
Makanji Y, Zhu J, Mishra R, Holmquist C, Wong WP, Schwartz NB, Mayo KE, Woodruff TK. Inhibin at 90: from discovery to clinical application, a historical review. Endocr Rev 2014;35: 747-794.
McCullagh DR. Dual Endocrine Activity of the Testes. Science 1932;76: 19-20.

Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993;12: 1293-1316.
Okuma Y, O'Connor AE, Hayashi T, Loveland KL, de Kretser DM, Hedger MP. Regulated production of activin A and inhibin B throughout the cycle of the seminiferous epithelium in the rat. J Endocrinol 2006;190: 331-340.
R Development Core Team. R: A language and environment for statistical computing. 2010. R Foundation for Statistical Computing, Vienna, Austria.
Radicioni A, Lenzi A, Spaziani M, Anzuini A, Ruga G, Papi G, Raimondo M, Foresta C. A multicenter evaluation of immunoassays for follicle-stimulating hormone, luteinizing hormone and testosterone: concordance, imprecision and reference values. J Endocrinol Invest 2013;36: 739-744.
Rendtorff R, Beyer M, Muller A, Dittrich R, Hohmann C, Keil T, Henze G, Borgmann A. Low inhibin B levels alone are not a reliable marker of dysfunctional spermatogenesis in childhood cancer survivors. Andrologia 2012;44 Suppl 1: 219-225.
Roberts SC, Seav SM, McDade TW, Dominick SA, Gorman JR, Whitcomb BW, Su HI. Self-collected dried blood spots as a tool for measuring ovarian reserve in young female cancer survivors. Hum Reprod 2016;31: 1570-1578.
Romerius P, Stahl O, Moell C, Relander T, Cavallin-Stahl E, Wiebe T, Giwercman YL, Giwercman A. High risk of azoospermia in men treated for childhood cancer. Int J Androl 2011;34: 69-76.
Sedgwick P. How to read a forest plot in a meta-analysis. BMJ 2015;351: h4028.
Shin T, Kobayashi T, Shimomura Y, Iwahata T, Suzuki K, Tanaka T, Fukushima M, Kurihara M, Miyata A, Kobori Y et al. Microdissection testicular sperm extraction in Japanese patients with persistent azoospermia after chemotherapy. Int J Clin Oncol 2016;21: 1167-1171.
Skinner R, Mulder RL, Kremer LC, Hudson MM, Constine LS, Bardi E, Boekhout A, Borgmann-Staudt A, Brown MC, Cohn R et al. Recommendations for gonadotoxicity surveillance in male childhood, adolescent, and young adult cancer survivors: a report from the International Late Effects of Childhood Cancer Guideline Harmonization Group in collaboration with the PanCareSurFup Consortium. The Lancet Oncology 2017;18: e75-e90.

Sterne JA, Sutton AJ, Ioannidis JP, Terrin N, Jones DR, Lau J, Carpenter J, Rucker G, Harbord RM, Schmid CH et al. Recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials. BMJ 2011;343: d4002.
Thomson AB, Campbell AJ, Irvine DC, Anderson RA, Kelnar CJ, Wallace WH. Semen quality and spermatozoal DNA integrity in survivors of childhood cancer: a case-control study. Lancet 2002;360: 361-367.
Toulis KA, Iliadou PK, Venetis CA, Tsametis C, Tarlatzis BC, Papadimas I, Goulis DG. Inhibin B and anti-Mullerian hormone as markers of persistent spermatogenesis in men with non-obstructive azoospermia: a meta-analysis of diagnostic accuracy studies. Hum Reprod Update 2010;16: 713-724.
van Beek RD, Smit M, van den Heuvel-Eibrink MM, de Jong FH, Hakvoort-Cammel FG, van den Bos C, van den Berg H, Weber RF, Pieters R, de Muinck Keizer-Schrama SM. Inhibin B is superior to FSH as a serum marker for spermatogenesis in men treated for Hodgkin's lymphoma with chemotherapy during childhood. Hum Reprod 2007;22: 3215-3222.
Walter SD. Properties of the summary receiver operating characteristic (SROC) curve for diagnostic test data. Stat Med 2002;21: 1237-1256.
Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, Leeflang MM, Sterne JA, Bossuyt PM, QUADAS-2 Group. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155: 529-536.
Wilhelmsson M, Vatanen A, Borgstrom B, Gustafsson B, Taskinen M, Saarinen-Pihkala UM, Winiarski J, Jahnukainen K. Adult testicular volume predicts spermatogenetic recovery after allogeneic HSCT in childhood and adolescence. Pediatr Blood Cancer 2014;61: 1094-1100.
World Health Organisation. Examination and processing of human semen. 5th edn, 2010. The World Health Organisation.

Table 1. Characteristics of the included studies.

| 1st Author | Year | PubMed ID | Number of CCS | Age (years, median & range) |
|---|---|---|---|---|
| Green | 2012 | 23423746 | 257 | 30.5, 19.7 – 59.1 |
| Lähteenmäki | 2008 | 18430073 | 23 | 20.5, 15.6 – 31.2 |
| Rendtorff | 2012 | 21726269 | 37 | 25, 19 – 45 |
| van Beek | 2007 | 17981817 | 17 | 27, 17.7 – 42.6 |
| Thomson | 2002 | 12241775 | 33 | 21.9, 16.5 – 35.2 |

Table 2. Sensitivity and specificity of FSH-based azoospermia diagnosis for a range of threshold values. Median and 95% CI are calculated from 2,000 stratified bootstrap replicates of the combined data (n = 367).

| Threshold FSH (IU/L) | Specificity median | Specificity 95% CI | Sensitivity median | Sensitivity 95% CI |
|---|---|---|---|---|
| 9 | 0.743 | 0.690 – 0.801 | 0.851 | 0.787 – 0.901 |
| 10 | 0.783 | 0.730 – 0.836 | 0.830 | 0.766 – 0.894 |
| 10.4 | 0.814 | 0.761 – 0.863 | 0.823 | 0.759 – 0.897 |
| 11 | 0.827 | 0.774 – 0.872 | 0.801 | 0.731 – 0.865 |
| 12 | 0.858 | 0.810 – 0.903 | 0.773 | 0.702 – 0.837 |
| 13 | 0.885 | 0.845 – 0.925 | 0.752 | 0.681 – 0.823 |
| 14 | 0.898 | 0.858 – 0.938 | 0.716 | 0.638 – 0.780 |
| 15 | 0.916 | 0.881 – 0.951 | 0.695 | 0.617 – 0.766 |
| 16 | 0.925 | 0.889 – 0.956 | 0.660 | 0.582 – 0.731 |
| 17 | 0.938 | 0.903 – 0.969 | 0.638 | 0.560 – 0.723 |
| 18 | 0.947 | 0.916 – 0.974 | 0.610 | 0.532 – 0.688 |
| 19 | 0.951 | 0.920 – 0.978 | 0.589 | 0.511 – 0.674 |
| 20 | 0.969 | 0.943 – 0.991 | 0.553 | 0.468 – 0.638 |

Figure legends

Figure 1.
Flow-chart of systematic search methodology. n = number of studies; N = number of childhood cancer survivors fulfilling criteria.

Figure 2.
Funnel plots for specificity (upper panel) and sensitivity (lower panel) relating study size to reported diagnostic accuracy for the five studies listed in Table 1. The Chi-squared statistical test for funnel plot asymmetry did not reach statistical significance (p=0.32 for sensitivity; p = 0.17 for specificity), suggesting a lack of publication bias.

Figure 3.
Forest plot of 95% confidence limits for the log-adjusted diagnostic odds ratio for the five studies listed in Table 1. The vertical dashed line denotes the line of no effect. Visual inspection shows that each study is statistically significant in its own right, that the intervals overlap to a great extent, and that therefore the studies are unlikely to be heterogeneous.

Figure 4.
Receiver-operator characteristic (ROC) curve analysis of FSH as predictor of azoospermia (combined cohort: n=367). Area under the curve: 0.89 (95% CI 0.85 – 0.92). The optimal diagnostic threshold is 10.4 IU/L, with sensitivity 0.823 and specificity 0.814.