
ORIGINAL ARTICLE

Robustness and Reproducibility of Radiomics in Magnetic Resonance Imaging: A Phantom Study

Bettina Baeßler, MD,*† Kilian Weiss, PhD,‡ and Daniel Pinto dos Santos, MD*

Received for publication August 16, 2018; and accepted for publication, after revision, October 6, 2018.

From the *Department of Radiology, University Hospital of Cologne, Cologne; †Institute of Clinical Radiology and Nuclear Medicine, University Medical Centre Mannheim, Medical Faculty Mannheim, Heidelberg University, Mannheim; and ‡Philips Healthcare Hamburg, Hamburg, Germany.

Conflicts of interest and sources of funding: none declared.

Supplemental digital contents are available for this article. Direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal's Web site (www.investigativeradiology.com).

Correspondence to: Bettina Baeßler, MD, Institute of Clinical Radiology and Nuclear Medicine, University Medical Centre Mannheim, Theodor-Kutzer-Ufer 1-3, D-68167 Mannheim, Germany. E-mail: [email protected].

Copyright © 2018 Wolters Kluwer Health, Inc. All rights reserved.
ISSN: 0020-9996/19/5404–0221
DOI: 10.1097/RLI.0000000000000530

Objectives: The aim of this study was to investigate the robustness and reproducibility of radiomic features in different magnetic resonance imaging sequences.

Materials and Methods: A phantom was scanned on a clinical 3 T system using fluid-attenuated inversion recovery (FLAIR), T1-weighted (T1w), and T2-weighted (T2w) sequences with low and high matrix size. For retest data, scans were repeated after repositioning of the phantom. Test and retest datasets were segmented using a semiautomated approach. Intraobserver and interobserver comparison was performed. Radiomic features were extracted after standardized preprocessing of images. Test-retest robustness was assessed using concordance correlation coefficients, dynamic range, and Bland-Altman analyses. Reproducibility was assessed by intraclass correlation coefficients.

Results: The number of robust features (concordance correlation coefficient and dynamic range ≥ 0.90) was higher for features calculated from FLAIR than from T1w and T2w images. High-resolution FLAIR images provided the highest percentage of robust features (n = 37/45, 81%). No considerable difference in the number of robust features was observed between low- and high-resolution T1w and T2w images (T1w low: n = 26/45, 56%; T1w high: n = 25/45, 54%; T2w low: n = 21/45, 46%; T2w high: n = 24/45, 52%). A total of 15 (33%) of 45 features showed excellent robustness across all sequences and demonstrated excellent intraobserver and interobserver reproducibility (intraclass correlation coefficient ≥ 0.75).

Conclusions: FLAIR delivers the most robust substrate for radiomic analyses. Only 15 of 45 features showed excellent robustness and reproducibility across all sequences. Care must be taken in the interpretation of clinical studies using nonrobust features.

Key Words: radiomics, texture analysis, magnetic resonance imaging, MRI, robustness, reproducibility, phantom study

(Invest Radiol 2019;54: 221–228)

The recent rise of machine learning techniques and the exponential growth of computational power enable researchers to exploit large numbers of quantitative features derived from medical images,1 a field called radiomics. The term radiomics refers to the characterization of a tissue (eg, a tumor2–4 or the myocardium5,6) by extraction of high-dimensional, mineable data from various sources of medical images, including computed tomography (CT) and magnetic resonance imaging (MRI). Various classes of radiomic features can be extracted, including morphological (shape), intensity-based (histogram), and various textural features. The subsequent analysis of these features ultimately aims at supporting clinical decision making and leads the way toward precision medicine, while overcoming the limitations of a purely visual image interpretation.

The extensive and encouraging preliminary research in the field has led to an increasing desire and need for translating radiomic image analysis to clinical practice. The establishment of novel quantitative imaging biomarkers, however, has to be preceded by an extensive knowledge of the robustness and reproducibility of the underlying quantitative imaging features. Recently, an Image Biomarker Standardization Initiative has been formed by Zwanenburg and colleagues,7 addressing the need for standardization of the radiomic feature extraction process. In addition, many studies have been published, mostly in the environment of CT and positron emission tomography imaging, highlighting the challenges of reproducibility of radiomic features when using different vendors, scanners, or acquisition and reconstruction settings.7–15

Conversely, only very few studies have analyzed the robustness of radiomic features in MRI.16–19 Given the qualitative nature of most MRI techniques and the known variations of the resulting absolute signal intensities,20 we hypothesized that the robustness of radiomic features extracted from MRI scans largely depends on the MRI sequence used for image acquisition as well as on acquisition and reconstruction settings.

Thus, the presented phantom study aims at evaluating the robustness and reproducibility of radiomic imaging features for the most commonly used MRI sequences in a phantom at 3 T and proposing a set of robust features, which can be reliably used in future clinical studies.

MATERIALS AND METHODS

Study Design

A total of 4 onions, 4 limes, 4 kiwifruits, and 4 apples placed on a box made out of styrofoam served as our radiomics phantom (Fig. 1). The different vegetables/fruits were supposed to reflect different signal intensities, shapes, and tissue textures. Image acquisition was repeated immediately after repositioning of the phantom and replanning of all sequences to obtain test and retest data.

Magnetic Resonance Imaging

The phantom was placed in a clinical 3 T scanner (Ingenia; Philips Healthcare, Best, the Netherlands) and imaged using the standard body-matrix coil and built-in spine matrix coil. Six different MRI sequences were acquired (Fig. 1): (1) low-resolution fluid-attenuated inversion recovery (FLAIR), (2) high-resolution FLAIR, (3) low-resolution T1-weighted (T1w), (4) high-resolution T1w, (5) low-resolution T2-weighted (T2w), and (6) high-resolution T2w. Imaging parameters are shown in Table 1. The resolution was changed by alteration of the matrix size while keeping all other imaging parameters constant. The high-resolution sequences were adopted from the standard clinical brain imaging sequences used in our department.

Image Segmentation

Image segmentation was performed semiautomatically using the 3-dimensional (3D) Slicer open-source software platform (version 4.8;

Investigative Radiology • Volume 54, Number 4, April 2019 www.investigativeradiology.com 221


FIGURE 1. Phantom images acquired with different MRI sequences. Images of the vegetable/fruit radiomics phantom (upper left), an exemplary 2-dimensional slice after segmentation with colored segmentation label maps (upper mid), and the 3D segmentation label (upper right). In the middle row from left to right, exemplary images of the phantom acquired with FLAIR, T1w, and T2w low-resolution imaging; in the bottom row, images acquired with FLAIR, T1w, and T2w high-resolution imaging. T1w indicates T1-weighted; T2w, T2-weighted.

www.slicer.org21) as follows (Fig. 1): (1) after loading the Digital Imaging and Communications in Medicine (DICOM) files, a region of interest (ROI) was placed separately in each vegetable/fruit using the Segment Editor; (2) an additional ROI was placed in the volume outside the vegetables/fruits; (3) the grow from seeds algorithm was used to segment all 16 vegetables/fruits semiautomatically and to separate their 3D volumes of interest (VOIs) from the surrounding volume; (4) the semiautomatically generated VOIs were corrected manually to exclude, for example, partial volume artifacts by removing the border zone between fruit/vegetable and surrounding air as well as the most apical and basal slice of each fruit/vegetable using a brush-erase tool; and (5) the corresponding label map was exported and saved as a .nii


TABLE 1. Imaging Sequence Parameters

Parameter | FLAIR Low Resolution | FLAIR High Resolution | T1 Low Resolution | T1 High Resolution | T2 Low Resolution | T2 High Resolution
Voxel size (acquired), mm | 1.2 × 1.5 × 5.5 | 0.8 × 1.1 × 5.5 | 1.4 × 1.4 × 5.5 | 0.9 × 0.9 × 5.5 | 0.8 × 1.0 × 5.5 | 0.56 × 0.70 × 5.5
Voxel size (reconstructed), mm | 0.45 × 0.45 × 5.5 | 0.45 × 0.45 × 5.5 | 0.45 × 0.45 × 5.5 | 0.45 × 0.45 × 5.5 | 0.30 × 0.30 × 5.5 | 0.30 × 0.30 × 5.5
Field of view | 300 × 300 × 77 | 300 × 300 × 77 | 300 × 300 × 77 | 300 × 300 × 77 | 300 × 300 × 77 | 300 × 300 × 77
TR, ms | 12000 | 1200 | 366 | 366 | 2500 | 2500
TE, ms | 140 | 140 | 13.4 | 13.4 | 80 | 80
Flip angle, degrees | 90 | 90 | 90 | 90 | 90 | 90

FLAIR indicates fluid-attenuated inversion recovery.

file. Image segmentation with manual correction took approximately 10 minutes per image stack. After segmentation of the test and retest dataset, one observer repeated the segmentation of all images in the test dataset after a pause of 2 weeks and in random order to allow for intraobserver comparison. A second observer analyzed all images of the test dataset to allow for assessment of interobserver reproducibility.

Radiomic Feature Extraction

Extraction of radiomic features was performed using the open-source software platform LIFEx (version 4.00; www.lifexsoft.org22) on a standard personal computer. The LIFEx platform was chosen because it represents a freeware dedicated to easy image preprocessing and calculation of radiomic features, which does not require any specific programming skills and thus can be easily used by clinical researchers throughout the field.

DICOM data were loaded into the software, and the corresponding label map (.nii file) was imported into the ROI manager. Before feature extraction, all images underwent a standardized preprocessing as follows22: spatial resampling to 1 × 1 × 1 mm using a Lagrangian polynomial of degree 5; intensity discretization to 64 gray levels (performed after the voxel resampling step); relative intensity rescaling by mean ± 3 standard deviations.17 Gray-level co-occurrence matrix (GLCM) features were computed at 4 interpixel distances. After standardized preprocessing, a total of 45 radiomic features corresponding to 6 different matrices/feature classes (histogram matrix, shape matrix, GLCM, gray-level run length matrix [GLRLM], neighboring gray-level dependence matrix [NGLDM], and gray-level zone length matrix [GLZLM]) were extracted (Supplementary Table 1, Supplemental Digital Content 1, http://links.lww.com/RLI/A415). Feature extraction of all 16 VOIs for one DICOM dataset took approximately 5 to 10 minutes.

Statistical Analysis

Test-retest robustness was assessed across all 16 (4 × 4) fruits/vegetables using concordance correlation coefficients (CCCs) and calculation of the dynamic range (DR) as previously described.8,14 Briefly, the normalized DR for a feature was defined as DR = 1 − (average difference / range).8,23 Values close to 1 imply that the feature has a large biological range with good reproducibility. The percentage of features with a CCC and DR of ≥0.85, ≥0.90, and ≥0.95 was calculated, respectively. Excellent reproducibility was then defined as CCC and DR ≥ 0.90, as previously described.8

Intraobserver and interobserver reproducibility was tested using intraclass correlation coefficients (ICCs). Intraclass correlation coefficients were defined as excellent (ICC ≥ 0.75), good (ICC = 0.60–0.74), moderate (ICC = 0.40–0.59), and poor (ICC ≤ 0.39) as proposed previously.24,25

To account for subtle intrareader differences in image segmentation potentially biasing the assessment of test-retest robustness (due to the lack of a fully automated segmentation approach), calculated CCCs were corrected for intraobserver variability as follows: CCCcorr = CCC + (1 − intraobserver ICC).

Statistical analysis was performed in R (version 3.4.0; R Foundation for Statistical Computing, Vienna, Austria26) with RStudio (version 1.0.136; RStudio, Boston, MA27) using the R packages DescTools,28 psych,29 ggplot2,30 and Epi.31

RESULTS

Test-Retest Robustness of Radiomic Features

Across all feature classes, the number of robust features, in general, was higher for features calculated from FLAIR than from T1w and T2w images (Fig. 2). High-resolution FLAIR images provided the highest percentage of robust features (n = 37/45, 81%), followed by low-resolution FLAIR images (n = 31/45, 68%; Fig. 2). Feature robustness was slightly lower for low-resolution T2w images (n = 21/45, 46%) than for high-resolution T2w images (n = 24/45, 52%) and low- and high-resolution T1w images (n = 26/45, 56%; and n = 25/45, 54%; Fig. 2). For FLAIR and T2w images, high-resolution images resulted in a higher number of robust features than low-resolution images, whereas the difference between low- and high-resolution images was marginal for T1w images (Fig. 2).

Regarding the different feature classes, shape features were most robust across all MRI sequences (n = 3/4, 83%), followed by GLZLM (n = 8/11, 75%) and GLRLM features (n = 7/11, 66%; Fig. 3). Histogram and GLCM features were least robust (n = 3/9, 38% and n = 3/7, 45%). Shape features extracted from T2w images (low- and high-resolution) showed higher robustness (n = 4/4, 100%) as compared with those derived from the other MRI sequences (Supplementary Fig. 1, Supplemental Digital Content 2, http://links.lww.com/RLI/A410).

Robustness of all individual radiomics features is shown in Supplemental Figure 2, Supplemental Digital Content 3, http://links.lww.com/RLI/A411. A total of n = 15/45 (33%) features showed excellent robustness across all sequences (Table 2). Bland-Altman analyses for the 15 robust features are shown in Supplementary Figures 3 to 8 (one Figure for each sequence variant), Supplemental Digital Content 4, http://links.lww.com/RLI/A412.

Intraobserver and Interobserver Reproducibility of Radiomic Features

In general, radiomic features showed an excellent intraobserver reproducibility (ICC ≥ 0.90 for all; Fig. 4A). Interobserver reproducibility was better for features derived from FLAIR and T2w images (ICC ≥0.75) than from T1w images (ICC between 0.60 and 0.71; Fig. 4A).

Similarly, features from all feature classes demonstrated an excellent intraobserver reproducibility (ICC ≥ 0.90 for all; Fig. 4B).


FIGURE 2. Percentage of robust features for individual MRI sequences. Robustness is shown for 3 different cutoffs for CCC and DR (≥0.85, ≥0.90, and ≥0.95). FLAIR high-resolution imaging shows the highest percentage of robust features among all sequences, followed by FLAIR low-resolution imaging. T1w and T2w imaging deliver a considerably lower number of robust features. CCC indicates concordance correlation coefficient; DR, dynamic range; FLAIR, fluid-attenuated inversion recovery; T1w, T1-weighted; T2w, T2-weighted.
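The robustness criterion summarized in Figure 2 combines two test-retest statistics per feature. The following is a minimal Python sketch of that criterion, not the authors' R analysis. It assumes Lin's concordance correlation coefficient computed with population (1/n) moments, and it takes the DR as 1 − (mean absolute test-retest difference / range), with the range pooled over test and retest values. The function names, the pooling choice, and the example numbers are illustrative assumptions.

```python
from statistics import fmean


def concordance_correlation(test, retest):
    """Lin's concordance correlation coefficient for paired measurements."""
    if len(test) != len(retest) or len(test) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    n = len(test)
    mean_t, mean_r = fmean(test), fmean(retest)
    # Population (1/n) variances and covariance.
    var_t = sum((t - mean_t) ** 2 for t in test) / n
    var_r = sum((r - mean_r) ** 2 for r in retest) / n
    cov = sum((t - mean_t) * (r - mean_r) for t, r in zip(test, retest)) / n
    denom = var_t + var_r + (mean_t - mean_r) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when all values are identical")
    return 2 * cov / denom


def dynamic_range(test, retest):
    """Normalized DR = 1 - (mean absolute difference / pooled range)."""
    if len(test) != len(retest) or not test:
        raise ValueError("need two equally long, non-empty series")
    pooled = list(test) + list(retest)
    span = max(pooled) - min(pooled)  # assumption: range over both scans
    if span == 0:
        raise ValueError("DR is undefined when the range is zero")
    mean_abs_diff = fmean([abs(t - r) for t, r in zip(test, retest)])
    return 1 - mean_abs_diff / span


def is_robust(test, retest, cutoff=0.90):
    """True when both CCC and DR reach the cutoff (0.90 in the study)."""
    return (concordance_correlation(test, retest) >= cutoff
            and dynamic_range(test, retest) >= cutoff)


if __name__ == "__main__":
    # Hypothetical feature values for 4 objects, test vs retest scan.
    test = [10.0, 20.0, 30.0, 40.0]
    retest = [11.0, 19.0, 31.0, 39.0]
    print(concordance_correlation(test, retest), dynamic_range(test, retest))
    print(is_robust(test, retest))
```

Running the same check at cutoffs of 0.85, 0.90, and 0.95 would give the three bars per sequence shown in the figure. The CCC correction for intraobserver variability described in the Statistical Analysis section is deliberately left out of this sketch.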

Interobserver reproducibility was excellent for histogram, GLRLM, NGLDM, and GLZLM features (ICC ≥ 0.75 for all), whereas shape and GLCM features showed slightly lower (although still good) interobserver reproducibility (ICC, 0.70–0.73; Fig. 4B).

Intraobserver and interobserver reproducibility of all individual radiomics features is shown in Figure 5. All features showed an excellent intraobserver reproducibility (ICC ≥ 0.75), whereas considerable variations were observed for interobserver reproducibility. Of the 15 features showing excellent robustness across all sequences during test-retest analysis, all features also demonstrated an excellent intraobserver and interobserver reproducibility (Table 2).

DISCUSSION

Despite a continuously evolving body of literature concerning the application of radiomics on MRI in clinical studies, little is known about the robustness and reproducibility of radiomic features derived from different MRI sequences. Thus, the interpretation of clinical studies based on MRI rather than CT as well as the critical process of standardization remain challenging. Our study systematically investigated the test-retest robustness and intraobserver and interobserver reproducibility of radiomic features derived from FLAIR, T1w, and T2w low- and high-resolution images commonly used in routine clinical practice in a well-controlled phantom setup. The major findings are that

FIGURE 3. Percentage of robust features for different feature classes. Percentage of robust features for the different radiomic feature classes (matrices), averaged across all MRI sequences. Robustness is shown for 3 different cutoffs for CCC and DR (≥0.85, ≥0.90, and ≥0.95). Shape features show the highest percentage of robust features, followed by GLZLM and GLRLM features. Histogram and GLCM features are least robust. CCC indicates concordance correlation coefficient; DR, dynamic range; GLCM, gray-level co-occurrence matrix; GLRLM, gray-level run length matrix; NGLDM, neighboring gray-level dependence matrix; GLZLM, gray-level zone length matrix.
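The histogram and texture matrices compared in Figure 3 are all built on gray levels that result from the intensity preprocessing described under Radiomic Feature Extraction (relative rescaling by mean ± 3 standard deviations and discretization to 64 gray levels). The Python sketch below illustrates one way these two steps can be combined for the voxel intensities of a single VOI. It is an assumption-laden illustration, not the LIFEx implementation: the function name, the use of the population standard deviation, the clipping of values outside the bounds, and the equal-width binning are all hypothetical choices.

```python
import math
from statistics import fmean, pstdev


def rescale_and_discretize(intensities, n_levels=64, n_sd=3.0):
    """Map intensities to integer gray levels 0..n_levels-1.

    Bounds are mean - n_sd*SD and mean + n_sd*SD of the given values
    (assumed relative rescaling); values outside are clipped to the
    first or last bin (assumed behavior).
    """
    if len(intensities) < 2:
        raise ValueError("need at least 2 intensity values")
    mu = fmean(intensities)
    sd = pstdev(intensities)
    if sd == 0:
        raise ValueError("cannot rescale a constant region")
    lower = mu - n_sd * sd
    width = 2 * n_sd * sd
    levels = []
    for value in intensities:
        fraction = (value - lower) / width          # 0..1 inside the bounds
        index = math.floor(fraction * n_levels)     # equal-width bins
        levels.append(min(max(index, 0), n_levels - 1))
    return levels


if __name__ == "__main__":
    # Hypothetical VOI: 99 background-like voxels and one bright outlier.
    voi = [0.0] * 99 + [1000.0]
    gray = rescale_and_discretize(voi)
    print(gray[0], gray[-1])
```

In this sketch a single extreme voxel shifts the mean and widens the bounds for every other voxel, which is consistent with the sensitivity of intensity-based features to outliers discussed in the article.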


TABLE 2. The 15 Robust and Reproducible Features Across All Sequences

Matrix | Feature | CCC (95% CI) | DR | Intraobserver ICC (95% CI) | Interobserver ICC (95% CI)
Shape | Volume, mL | 0.994 (0.986–0.997) | 0.96 | 0.989 (0.972–0.996) | 0.964 (0.914–0.985)
Shape | Volume (voxel) | 0.994 (0.986–0.997) | 0.96 | 0.989 (0.972–0.996) | 0.964 (0.914–0.985)
Shape | Compacity | 0.961 (0.913–0.984) | 0.93 | 0.949 (0.872–0.980) | 0.798 (0.600–0.904)
GLCM | Correlation | 0.992 (0.978–0.997) | 0.94 | 0.965 (0.909–0.987) | 0.854 (0.666–0.940)
GLRLM | SRE | 0.949 (0.886–0.978) | 0.94 | 0.992 (0.979–0.997) | 0.880 (0.713–0.953)
GLRLM | GLNU | 0.974 (0.934–0.990) | 0.94 | 0.986 (0.965–0.995) | 0.893 (0.750–0.957)
GLRLM | RLNU | 0.996 (0.990–0.999) | 0.96 | 0.990 (0.975–0.996) | 0.973 (0.932–0.989)
GLRLM | RP | 0.945 (0.880–0.976) | 0.95 | 0.993 (0.980–0.997) | 0.881 (0.719–0.953)
NGLDM | Busyness | 0.990 (0.973–0.996) | 0.96 | 0.989 (0.970–0.996) | 0.942 (0.851–0.978)
GLZLM | SZE | 0.971 (0.927–0.989) | 0.95 | 0.990 (0.974–0.997) | 0.920 (0.808–0.968)
GLZLM | HGZE | 0.981 (0.953–0.993) | 0.95 | 0.987 (0.968–0.995) | 0.961 (0.904–0.984)
GLZLM | SZHGE | 0.954 (0.890–0.983) | 0.94 | 0.978 (0.941–0.992) | 0.893 (0.744–0.958)
GLZLM | GLNU | 0.981 (0.953–0.993) | 0.96 | 0.991 (0.976–0.997) | 0.926 (0.838–0.968)
GLZLM | ZLNU | 0.981 (0.953–0.976) | 0.93 | 0.976 (0.936–0.991) | 0.750 (0.524–0.831)
GLZLM | RP | 0.944 (0.872–0.976) | 0.94 | 0.990 (0.973–0.996) | 0.781 (0.525–0.908)

CCC indicates concordance correlation coefficient; CI, confidence interval; DR, dynamic range; ICC, intraclass correlation coefficient; GLCM, gray-level co-occurrence matrix; GLRLM, gray-level run length matrix; NGLDM, neighboring gray-level dependence matrix; GLZLM, gray-level zone length matrix.
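The ICC columns of Table 2 are read against the reproducibility bands given in the Statistical Analysis section (excellent ≥ 0.75, good 0.60–0.74, moderate 0.40–0.59, poor ≤ 0.39). A minimal Python sketch of that mapping follows. The function name is hypothetical, and the handling of values that fall between the stated bands (for example 0.745) is an assumption: here they are assigned to the lower band.

```python
def classify_icc(icc):
    """Return the reproducibility band for an intraclass correlation coefficient.

    Thresholds follow the bands stated in the article; values between two
    stated bands (e.g. 0.745) fall into the lower band by assumption.
    """
    if icc >= 0.75:
        return "excellent"
    if icc >= 0.60:
        return "good"
    if icc >= 0.40:
        return "moderate"
    return "poor"


if __name__ == "__main__":
    # Interobserver ICC point estimates copied from Table 2.
    interobserver = {
        "Shape Volume, mL": 0.964,
        "Shape Compacity": 0.798,
        "GLZLM ZLNU": 0.750,
    }
    for feature, icc in interobserver.items():
        print(feature, classify_icc(icc))
```

All three point estimates shown land in the "excellent" band. The sketch classifies point estimates only and ignores the confidence intervals reported in the table.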

FIGURE 4. Intraobserver and interobserver reproducibility for individual MRI sequences and feature classes. Intraobserver (upper row) and interobserver (bottom row) reproducibility of radiomic features for individual MRI sequences (A) and feature classes (B). Intraobserver reproducibility is considerably higher than interobserver reproducibility. Features derived from T1w images are less reproducible than those derived from FLAIR or T2w imaging. For feature classes, GLCM features are least reproducible at the interobserver level, whereas histogram and NGLDM features show best interobserver reproducibility. ICC indicates intraclass correlation coefficients; FLAIR, fluid-attenuated inversion recovery; T1w, T1-weighted; T2w, T2-weighted; GLCM, gray-level co-occurrence matrix; GLRLM, gray-level run length matrix; NGLDM, neighboring gray-level dependence matrix; GLZLM, gray-level zone length matrix.


FIGURE 5. Intraobserver and interobserver reproducibility for individual radiomic features. Intraobserver (upper row) and interobserver (bottom row) reproducibility of radiomic features for all individual radiomic features averaged across the different MRI sequences. The cutoff for definition of “excellent reproducibility” is shown as a red line. Features with ICCs above this cutoff at both the intraobserver and interobserver level are highlighted in yellow. Intraobserver reproducibility is excellent for all features, but considerable variations can be seen for interobserver reproducibility. ICC indicates intraclass correlation coefficients.

(1) considerable differences exist in the number of robust and reproducible features between different MRI sequences; (2) high-resolution FLAIR images deliver the highest percentage of robust features in the setting of our fruit/vegetable phantom study, followed by low-resolution FLAIR images, whereas T1w and T2w images result in a considerably smaller number of reliable features; and (3) a total of 15 of 45 features show excellent robustness across all sequences and demonstrate excellent reproducibility.

Despite considerable variations in the number of reproducible radiomic features between sequences, we could show that 15 features were extremely robust to variations between test and retest acquisition, segmentation done by different observers, and, most importantly, different MRI sequence types. Consequently, we suppose that these 15 features can reliably be applied for the design of radiomics signatures within clinical studies, whereas care must be taken when selecting one or more of the nonrobust features during a multistep feature selection process. Nonrobust features could undergo an a priori elimination before any model building, which also has the potential to abbreviate the dimension reduction process and to avoid model overfitting. In addition, the presence of 15 robust and reproducible features helps in the interpretation of already published clinical studies and assists in assessing the reliability of a published radiomics signature.

It has to be noted, however, that these conclusions are drawn from a phantom study using vegetables and fruits, combined with a semiautomatic segmentation approach with manual corrections instead of a clinical dataset requiring manual delineation of, for example, a tumor. Recently, Saha et al19 could show that radiomic features extracted from fibroglandular tissue in the breast had significantly higher interobserver stability than those derived from breast cancers. They concluded that the main reason for this observation is that the fibroglandular tissue segmentation label is less affected by the reader's annotation due to the higher homogeneity of the underlying tissue. The signal intensity differences between a tumor and the surrounding tissue might be much larger and thus lead to more pronounced interreader differences. Consequently, the results of our study should be validated in a clinical study using MRI test-retest data from cancer patients before being extensively applied for an a priori feature selection.

Similar to the study by Saha et al,19 we observed a considerable influence of interobserver variability on the extracted features. Despite an excellent intraobserver reproducibility and although we used a semiautomatic approach for image segmentation, many features appear to be highly susceptible to subjective bias. Consequently, reducing the effect of subjective bias through fully automatic image segmentation is vital for future radiomics research, leading the way toward a combination


of radiomics with artificial intelligence and deep learning for automated image segmentation.

Histogram features, which have been extensively used in the radiomics literature, resulted in less reproducible features than shape, GLZLM, and GLRLM features. It is well known that, in contrast to Hounsfield units in CT imaging or the newer quantitative MRI techniques such as T1 or T2 mapping, relative signal intensities in MRI are not comparable between different scanners, field strengths, and examinations. It is thus not surprising that histogram features, which rely on absolute gray-level values encoded in each pixel, are prone to outliers, for example, those caused by subtle differences in image segmentation or by different coil sensitivity profiles between 2 scans. Representing simple statistical measures of absolute signal intensities, histogram features such as mean, kurtosis, skewness, and percentiles are heavily influenced by single pixel outliers, as opposed to the more robust feature classes relying on the distribution of all pixels and their neighborhoods in the image, where outliers might cancel out.

Although we performed standardized postprocessing of our imaging data before feature extraction such as voxel size resampling, we observed some differences between images acquired with low and high resolution. Although the difference in the number of robust features was marginal for T1w and T2w low- and high-resolution images, high-resolution FLAIR imaging resulted in a considerably higher number of robust features as compared with low-resolution FLAIR imaging. The influence of matrix size and image interpolation on texture features had previously been systematically investigated by Mayerhoefer and colleagues,16 showing that images acquired at higher resolution allow for better classification accuracy. In their study, misclassification rates were extremely high at very low resolutions (matrix size 16 × 16), leading to the conclusion that texture features might lose their ability to adequately describe the "true" physical or biological properties of the analyzed tissue below certain resolution levels and highlighting the importance of adequate spatial resolution in MRI for radiomic studies.16 Image interpolation improved classification rates, which might be an explanation for the minor differences between feature robustness in low- and high-resolution T1w and T2w images resampled to 1 × 1 × 1 mm in our study.

The low resolution used in our study still was considerably higher than the lowest resolution used by Mayerhoefer et al16 and also higher compared with the smallest structures of the fruits/vegetables used in our phantom. This might explain the higher robustness of our low-resolution sequence variants as compared with the study by Mayerhoefer et al.16 Because the aim of our study was to determine whether sequence variants that are likely to be used in routine clinical imaging are robust enough for radiomic analyses, we limited our analysis to 2 sequence variants frequently encountered in clinical practice instead of the lowest possible resolution.

The higher robustness of radiomic features derived from FLAIR images as compared with T1w and especially T2w images (which differ from FLAIR only by some minor acquisition settings such as the lack of cerebrospinal fluid suppression) remains difficult to explain. One potential explanation might be that FLAIR images allow for a more subtle gray-level discretization after standardized image postprocessing due to suppression of the cerebrospinal fluid signal and thus the spread of more gray levels across a smaller range. This might increase the robustness, especially of intensity, but also of texture features. It must be noted, however, that the MR sequences used in our study differed considerably with regard to their base resolution, for example, the voxel size of the high-resolution FLAIR images being about the same as the voxel size of the low-resolution T2w images. Although we performed postacquisition voxel resampling, this might not lead to a complete removal of the initial resolution differences. In addition, the robustness of features extracted from different MR sequences may also depend on the physical properties of the analyzed object and thereby lead to a superior robustness of FLAIR images for radiomic analyses in fruits/vegetables. Studies using images from routine clinical practice should be performed to assess the transferability of our results to imaging, for example, in cancer patients.

Our study has several limitations. First, it is a tissue phantom study without specific "lesions," where results might not always be transferable to routine clinical practice. This is especially the case in cancer patients, where reproducible tumor delineation is crucial. However, because clinical test-retest MRI data are difficult to obtain in cancer patients due to the need for repeated contrast administration in the light of the ongoing debate concerning gadolinium, our study represents a first step toward closing the gap of knowledge in MRI radiomics research. Second, we investigated only the 3 most commonly used MRI sequences and did not include quantitative MRI techniques such as T1 or T2 mapping or diffusion-weighted imaging. Third, we did not investigate the effect of preprocessing for standardization, for example, the resampled spatial resolution and gray-level discretization, on feature robustness in our data. Future phantom as well as clinical studies should aim at analyzing robustness and reproducibility of radiomic features also in quantitative MRI scans and compare the results to those of the present study.

In conclusion, our study shows that FLAIR imaging, especially at higher resolutions, delivers the highest number of robust radiomic features and thus can be reliably applied in future clinical studies. In contrast, radiomics extracted from T1w and T2w imaging should be used with care, and only robust and reproducible features should be selected for building a radiomics signature. The 15 features with excellent robustness and reproducibility across all sequences might be most suitable for the design of future radiomics signatures, although these results should be validated in a clinical study in cancer patients. Reducing the effect of subjective bias shown by high interobserver variability of radiomic features through fully automatic image segmentation is vital for future radiomics research, leading the way toward a combination of radiomics with artificial intelligence and deep learning for automated image segmentation.

REFERENCES

1. Obermeyer Z, Emanuel EJ. Predicting the future - big data, machine learning, and clinical medicine. N Engl J Med. 2016;375:1216–1219.
2. Lambin P, Leijenaar RTH, Deist TM, et al. Radiomics: the bridge between medical imaging and personalized medicine. Nat Rev Clin Oncol. 2017;14:749–762.
3. Kim JH, Ko ES, Lim Y, et al. Breast cancer heterogeneity: MR imaging texture analysis and survival outcomes. Radiology. 2016;160261.
4. Ng F, Ganeshan B, Kozarski R, et al. Assessment of primary colorectal cancer heterogeneity by using whole-tumor texture analysis: contrast-enhanced CT texture as a biomarker of 5-year survival. Radiology. 2013;266:177–184.
5. Baessler B, Luecke C, Lurz J, et al. Cardiac MRI texture analysis of T1 and T2 maps in patients with infarctlike acute myocarditis. Radiology. 2018;180411.
6. Baessler B, Mannil M, Oebel S, et al. Subacute and chronic left ventricular myocardial scar: accuracy of texture analysis on nonenhanced cine MR images. Radiology. 2017;170213.
7. Zwanenburg A, Leger S, Vallieres M, et al. Image biomarker standardisation initiative. arXiv. 2017;1612.07003v5.
8. Balagurunathan Y, Kumar V, Gu Y, et al. Test-retest reproducibility analysis of lung CT image features. J Digit Imaging. 2014;27:805–823.
9. Lubner MG, Smith AD, Sandrasegaran K, et al. CT texture analysis: definitions, applications, biologic correlates, and challenges. Radiographics. 2017;37:1483–1503.
10. Mackin D, Fave X, Zhang L, et al. Measuring computed tomography scanner variability of radiomics features. Invest Radiol. 2015;50:757–765.
11. Nyflot MJ, Yang F, Byrd D, et al. Quantitative radiomics: impact of stochastic effects on textural feature analysis implies the need for standards. J Med Imag (Bellingham, Wash). 2015;2:041002.
12. Shafiq-Ul-Hassan M, Zhang GG, Latifi K, et al. Intrinsic dependencies of CT radiomic features on voxel size and number of gray levels. Med Phys. 2017;44:1050–1062.
13. Shaikh FA, Kolowitz BJ, Awan O, et al. Technical challenges in the clinical application of radiomics. JCO Clinical Cancer Informatics. 2017;1:1–8.
14. Zhao B, Tan Y, Tsai WY, et al. Reproducibility of radiomics for deciphering tumor phenotype with imaging. Sci Rep. 2016;6:23428.

© 2018 Wolters Kluwer Health, Inc. All rights reserved. www.investigativeradiology.com 227

Copyright © 2019 Wolters Kluwer Health, Inc. All rights reserved. Baeßler et al Investigative Radiology • Volume 54, Number 4, April 2019

15. Zhao B, Tan Y, Tsai WY, et al. Exploring variability in CT characterization of tumors: a preliminary phantom study. Transl Oncol. 2014;7:88–93.
16. Mayerhoefer ME, Szomolanyi P, Jirak D, et al. Effects of magnetic resonance image interpolation on the results of texture-based pattern classification: a phantom study. Invest Radiol. 2009;44:405–411.
17. Collewet G, Strzelecki M, Mariette F. Influence of MRI acquisition protocols and image intensity normalization methods on texture classification. Magn Reson Imaging. 2004;22:81–91.
18. Park JE, Kim HS. Radiomics as a quantitative imaging biomarker: practical considerations and the current standpoint in neuro-oncologic studies. Nucl Med Mol Imag. 2018;52:99–108.
19. Saha A, Harowicz MR, Mazurowski MA. Breast cancer MRI radiomics: an overview of algorithmic features and impact of inter-reader variability in annotating tumors. Med Phys. 2018;45:3076–3085.
20. Giri S, Chung YC, Merchant A, et al. T2 quantification for improved detection of myocardial edema. J Cardiovasc Magn Reson. 2009;11:56.
21. Fedorov A, Beichel R, Kalpathy-Cramer J, et al. 3D Slicer as an image computing platform for the Quantitative Imaging Network. Magn Reson Imaging. 2012;30:1323–1341.
22. Nioche C, Orlhac F, Boughdad S, et al. LIFEx: a freeware for radiomic feature calculation in multimodality imaging to accelerate advances in the characterization of tumor heterogeneity. Cancer Res. 2018.
23. Schuster DP. The opportunities and challenges of developing imaging biomarkers to study lung function and disease. Am J Respir Crit Care Med. 2007;176:224–230.
24. Khan JN, Singh A, Nazir SA, et al. Comparison of cardiovascular magnetic resonance feature tracking and tagging for the assessment of left ventricular systolic strain in acute myocardial infarction. Eur J Radiol. 2015;84:840–848.
25. Schmidt B, Dick A, Treutlein M, et al. Intra- and inter-observer reproducibility of global and regional magnetic resonance feature tracking derived strain parameters of the left and right ventricle. Eur J Radiol. 2017;89:97–105.
26. R: A language and environment for statistical computing. [computer program]. Vienna, Austria: R Foundation for Statistical Computing; 2017.
27. RStudio: Integrated Development for R. [computer program]. Version 1.0.136. Boston, MA; 2016.
28. DescTools: Tools for Descriptive Statistics [computer program]. Version 0.99.24; 2018.
29. Psych: Procedures for Psychological, Psychometric, and Personality Research [computer program]. Version 1.7.8. Northwestern University, Evanston, IL; 2017.
30. Ggplot2: Elegant Graphics for Data Analysis [computer program]. New York, NY: Springer-Verlag; 2009.
31. Epi: A Package for Statistical Analysis in Epidemiology. [computer program]. Version 2.12; 2017.
