Diagnosis by Volatile Organic Compounds in Exhaled Breath from Lung Cancer Patients Using Support Vector Machine Algorithm

sensors

Article Diagnosis by Volatile Organic Compounds in Exhaled Breath from Lung Cancer Patients Using Support Vector Machine Algorithm

Yuichi Sakumura 1,*, Yutaro Koyama 1, Hiroaki Tokutake 1, Toyoaki Hida 2, Kazuo Sato 3, Toshio Itoh 4, Takafumi Akamatsu 4 and Woosuck Shin 4,*

1 Department of Information Science and Technology, Aichi Prefectural University, Nagakute 480-1198, Japan; [email protected] (Y.K.); [email protected] (H.T.) 2 Department of Thoracic Oncology, Aichi Cancer Center, 1-1 Kanokoden, Chikusa-ku, Nagoya 464-8681, Japan; [email protected] 3 Department of Mechanical Engineering, Aichi Institute of Technology, Toyota 470-0392, Japan; [email protected] 4 Department of Materials and Chemistry, National Institute of Advanced Industrial Science and Technology (AIST), Shimo-Shidami, Moriyama-ku, Nagoya 463-8560, Japan; [email protected] (T.I.); [email protected] (T.A.) * Correspondence: [email protected] (Y.S.); [email protected] (W.S.); Tel.: +81-561-64-1111 (Y.S.); +81-52-736-7107 (W.S.)

Academic Editor: W. Rudolf Seitz Received: 15 November 2016; Accepted: 29 January 2017; Published: 4 February 2017

Abstract: Monitoring exhaled breath is a very attractive, noninvasive screening technique for early diagnosis of diseases, especially lung cancer. However, the technique provides insufficient accuracy because the exhaled air has many crucial volatile organic compounds (VOCs) at very low concentrations (ppb level). We analyzed the breath exhaled by lung cancer patients and healthy subjects (controls) using gas chromatography/mass spectrometry (GC/MS), and performed a subsequent statistical analysis to diagnose lung cancer based on the combination of multiple lung cancer-related VOCs. We detected 68 VOCs as marker species using GC/MS analysis. We reduced the number of VOCs and used support vector machine (SVM) algorithm to classify the samples. We observed that a combination of five VOCs (CHN, methanol, CH3CN, isoprene, 1-propanol) is sufficient for 89.0% screening accuracy, and hence, it can be used for the design and development of a desktop GC-sensor analysis system for lung cancer.

Keywords: lung cancer; volatile organic compounds (VOCs); exhaled air; screening; gas chromatography– mass spectrometry analysis; support vector machine (SVM)

1. Introduction Balancing the quality of life and sharp increase in healthcare expenses is an important social issue. Breath analysis is a noninvasive technique, allows easy sample collection, and provides quick results; thus, it is gaining attention as a new diagnostic technology. Breath is composed mainly of nitrogen (the most abundant gas in the atmosphere) along with carbon dioxide produced by respiration, oxygen that was not consumed, and water vapor. In addition, it contains more than 100 additional types of gas components in different concentrations, which provide information that may be useful to monitor health conditions such as disease or stress. Gas-sensing technologies (e.g., selective and quantitative gas detection) are necessary to measure the concentration of different gas species related to halitosis, metabolism, and diseases.

Sensors 2017, 17, 287; doi:10.3390/s17020287 www.mdpi.com/journal/sensors Sensors 2017, 17, 287 2 of 12

Some volatile organic compounds (VOCs) in exhaled breath are expected to be useful as biomarkers for diseases, including cancer [1,2]. Lung cancer has become a major concern in Japan because it is the top cause of death by disease in the country. Lung cancer has a high mortality rate because it has often progressed by the time a patient perceives any symptoms and is diagnosed. If lung cancer is detected earlier, it can be treated by surgery and subsequent chemotherapy. However, no good early diagnostics are available. It is difficult to detect lung cancer using chest X-ray radiography (CXR) [3]; thus, diagnosis is provided after sputum analysis and extensive examination with low-dose computed tomographic (LDCT) scanning. Monitoring the breath is one of the most noninvasive screening techniques available for early diagnosis [4–11]; however, this method is limited by its poor accuracy because the exhaled breath has many VOCs at very low concentrations (ppb level), and there is no clear protocol of breath sampling [12]. Gas chromatography–mass spectrometry (GC/MS) is one of the best methods for detecting low-concentration VOCs; however, this method is expensive, and the instrumentation is not portable. If some specific VOCs have sufficient information for diagnosing diseases, one could measure these VOCs with relatively high resolution using a suitable technique, and detection of the other gas species would not be necessary. VOCs have been reported as biomarkers for lung cancer [5,8,13–18]; however, various other compounds have also been reported as possible biomarkers, suggesting disagreement in literature. For these reasons, VOC patterns (i.e., not single VOCs, but combinations of several VOCs) should be used for exhaled breath analysis for the diagnosis of diseases [5,8,13–19]. Here, we seek to determine which gas species are more important and how many are necessary to ensure system reliability. The optimized prototype system should be of reasonable size and cost; the number of gas species used should be fewer than 10, and the system should be able to detect the essential components of those gases. Recent studies have demonstrated computer-assisted diagnosis by measuring multiple VOCs. Various algorithms have been applied to examine lung cancer diagnosis using multiple VOCs; for example, forward stepwise discriminant analysis [5,13], partial least-squares regression [14], logistic regression [15,18], random forest classification [20], weighted digital sum discriminator [21], and linear canonical discriminant analysis with principal component analysis (PCA) [17]. The support vector machine (SVM) is a powerful supervised machine learning model based on statistical learning theory [22]. SVM has been successfully used in the field of brain science to classify brain tumors [23,24], Alzheimer’s disease [25,26], and depression [27,28], based on magnetic resonance imaging (MRI) data. SVM-based research has been performed for VOC analysis to classify lung cancer cells [29], smoking subjects [30], patients with chronic obstructive pulmonary disease [31], and patients with head-and-neck cancer [32]. Many studies have examined VOCs in the exhaled breath; however, SVM-based analysis with a raw VOC data set has not been examined to select the essential VOCs for diagnosing lung cancer. We have developed a prototype system for monitoring exhaled breath that can replace GC/MS. It combines a highly sensitive gas sensor with GC to separate the gases. The prototype system has simple GC columns, a simple gas-condenser unit, and SnO2-based semiconductor gas sensors [33]. For further development of the prototype system, we analyzed the VOCs detected in the breath of lung cancer patients and healthy subjects (controls) to determine the most effective combination of VOCs for diagnosing lung cancer. We applied a nonlinear SVM classification to various subsets of the VOCs detected by the GC/MS system and did not preprocess the VOCs by PCA because it is difficult to select VOCs from the principal components, which are composed of the multiple VOC features, and validate diagnose ability by the selected VOCs. We also did not select VOCs that have a significant difference in concentration between cancer and healthy samples. It is likely that the diagnosis is possible by the VOCs that have no significant difference in VOC concentration. We performed leave-one-out cross-validation of the samples (patients and controls) for each of the combinations of VOCs and evaluated the true positive rate (sensitivity) and accuracy of diagnosis for the left-out sample. We found that a specific combination of a small number of VOCs could diagnose lung cancer with a high accuracy. Sensors 2017, 17, 287 3 of 12

2. Breath Gas Analysis and Diagnosis Methods Sensors 2017, 17, 287 3 of 12 2.1. Breath Collection and Analysis 2. Breath Gas Analysis and Diagnosis Methods After obtaining approval from the local ethics committees of Aichi Cancer Center and the National Institute2.1. Breath of Advanced Collection and Industrial Analysis Science and Technology and written informed consent from the participants,After 107 obtaining patients approval with lung from cancer the local and ethics 29 healthy committees individuals of Aichi wereCancer enrolled Center inand this the study. The stageNational and Institute histology of Advanced of the lung Industrial cancer were Science omitted and Technology and it was and simply written labeled informed as “lung consent cancer”. The numbersfrom the participants, of patients 107 with patients stage with I, II, lung III andcancer IV and lung 29 cancerhealthy wereindividuals 55, 15, were 28 andenrolled 9, respectively. in this The numbersstudy. The of stage smoker, and histology ex-smoker, of the and lung nonsmoker cancer were for omitted lung cancer and it patientswas simply were labeled 47, 15 asand “lung 45 and thosecancer”. for healthy The numbers individuals of patients were 5, wi 3 andth stage 21, respectively.I, II, III, and IV lung cancer were 55, 15, 28, and 9, Humanrespectively. breath The numbers and ambient of smoker, air ex-smoker, in a room an atd nonsmoker Aichi Cancer for lung Center cancer were patients collected were 47, using an Analytic15, and 45, Barrier and those Bag (Omifor healthy Odor-Air individuals Service were Corp., 5, 3, Omihachiman, and 21, respectively. Japan). Before sample collection, Human breath and ambient air in a room at Aichi Cancer Center were collected using an the volunteers did not eat or smoke for several hours and they stayed in the room for at least 10 min. Analytic Barrier Bag (Omi Odor-Air Service Corp., Omihachiman, Japan). Before sample collection, All volunteers blew their alveolar breath into a 1 L Analytic Barrier Bag immediately after they exhaled the volunteers did not eat or smoke for several hours and they stayed in the room for at least 10 theirmin. respiratory All volunteers tract airblew in their a consultation alveolar breath room. into Thea 1 L breathAnalytic was Barrier analyzed Bag immediately using a GCMS-QP2010 after they instrumentexhaled (Shimadzu, their respiratory Kyoto, tract Japan) air equipped in a consultation with a TD-2 room. gas-condensing The breath was unit analyzed (Shimadzu) using (Figure a 1). The TD-2GCMS-QP2010 has a gas instrument aspiration (Shimadzu, unit and Kyoto, a cold Japan) trap for equipped condensation with a TD-2 of low-concentration gas-condensing unit VOCs. The GC/MS(Shimadzu) system (Figure used 1). The helium TD-2 gas has (99.9995% a gas aspiration purity, unit Taiyo and Nippon a cold trap Sanso, for Japan)condensation as the of carrier gas. Alow-concentration DB-1 series 123-1063 VOCs. The gas GC/MS column system (Agilent used Technologies, helium gas (99.9995% Santa Clara, purity, CA, Taiyo USA) Nippon was used. The backgroundSanso, Japan) VOCsas the incarrier the roomgas. A air DB-1 and series the VOCs123-1063 from gas thecolumn exhaled (Agilent air wereTechnologies, analyzed, Santa and the concentrationsClara, CA, USA) of background was used. VOCsThe background were subtracted VOCs in from the room the results air and for the the VOCs exhaled from the air priorexhaled to data air were analyzed, and the concentrations of background VOCs were subtracted from the results for analysis. The concentrations of these VOCs were excluded from the results of breath analysis in this the exhaled air prior to data analysis. The concentrations of these VOCs were excluded from the study.results The detailsof breath are analysis reported in this elsewhere study. The [33 details]. are reported elsewhere [33].

FigureFigure 1. 1. BreathBreath sampling sampling and and gas gas analysis analysis by GC/MS. by GC/MS.

We detected 63 VOCs. Among them, three VOCs may come from cancer treatment drugs and We40 VOCs detected were 63 present VOCs. at Amonglow concentrations, them, three close VOCs to maythe detection come from limit cancer of GC/MS. treatment We deleted drugs and 40 VOCsthesewere 43 VOCs present and examined at low concentrations, the SVM diagnosis close using to the the detection remaining limit 20 VOCs of GC/MS. (Table 1). We deleted these 43 VOCs and examined the SVM diagnosis using the remaining 20 VOCs (Table1). Table 1. Selected volatile organic compounds (VOCs) for the computer-assisted diagnostic analysis. Table 1. ButaneSelected †,‡ volatile organicCH3CN †,‡compounds (VOCs)CHCl3 †,‡ for the computer-assistedMethanol † diagnosticAcetone analysis. ‡ CHN ‡ Ethanol ‡ 1-Propanol 2-Propanol C8H16 Isoprene †,‡ Dichlorobenzene†,‡ C8H17OH†,‡ Xylene† Methylcyclohexane‡ Butane CH3CN CHCl3 Methanol Acetone Toluene C2H3CN Limonene Nonanal Unknown 1 ‡ ‡ † WilcoxonCHN test; < 0.05Ethanol, ‡ Kolmogorov–Smirnov1-Propanol test; < 0.05 2-Propanol, 1 this VOC could not C be8H identified.16 Isoprene Dichlorobenzene C8H17OH Xylene Methylcyclohexane 1 Toluene C2H3CN Limonene Nonanal Unknown † Wilcoxon test: p < 0.05; ‡ Kolmogorov–Smirnov test: p < 0.05; 1 this VOC could not be identiﬁed.

Sensors 2017, 17, 287 4 of 12 Sensors 2017, 17, 287 4 of 12

2.2. Data Sets 2.2. Data Sets As listed in Table 1, many of the VOCs have no significant differences between cancer and As listed in Table1, many of the VOCs have no signiﬁcant differences between cancer and healthy healthy samples, and the concentration distributions of the VOCs (nine VOCs are selected and samples, and the concentration distributions of the VOCs (nine VOCs are selected and shown in shown in Figure 2) show unclear boundaries between the cancer and healthy control samples. Some Figureof the2) showlung cancer unclear samples boundaries contained between higher the cancerconcentrations and healthy of VOCs control than samples. the healthy Some ofsamples; the lung cancerhowever, samples there contained was a wide higher overlap concentrations of the types of of VOCs VOCs found than in the each healthy sample. samples; This suggests however, that there it wasis aquite wide difficult overlap to of diagnose the types lung of VOCscancer foundusing ina single each sample.VOC; thus, This diagnosis suggests using that itmultiple is quite VOCs difﬁcult tois diagnose necessary. lung cancer using a single VOC; thus, diagnosis using multiple VOCs is necessary.

CH3CN CHCl3 Methanol 0.5 0.35 0.4 0.3 0.4 0.25 0.3 0.3 0.2 0.2 0.15 0.2 Probability Probability 0.1 Probability 0.1 0.1 0.05

0 0 0 00.10.20.3 00.511.522.5 0 20406080100 ppb) ppb) ppb) (a) (b) (c)

CHN Ethanol 1-Propanol 0.35 0.35 0.3

0.3 0.3 0.25 0.25 0.25 0.2 0.2 0.2 0.15 0.15 0.15

Probability Probability Probability 0.1 0.1 0.1 0.05 0.05 0.05 0 0 01234 0 50 100 150 200 0246 ppb) ppb) ppb) (d) (e) (f)

Isoprene C2H3CN Limonene 0.35 0.35 0.5 0.3 0.3 0.4 0.25 0.25 0.2 0.2 0.3 0.15 0.15 0.2 Probability Probability 0.1 0.1 Probability 0.1 0.05 0.05

0 0 0 0 50 100 150 01234 00.511.5 ppb) ppb) ppb) (g) (h) (i)

Figure 2. Comparison of VOC concentration distributions from lung cancer (red, n = 107) and healthy Figure 2. Comparison of VOC concentration distributions from lung cancer (red, n = 107) and (green, n = 29) controls’ breath; (a) CH3CN; (b) CHCl3; (c) methanol; (d) CHN; (e) ethanol; (f) 1-propanol; healthy (green, n = 29) controls’ breath; (a) CH3CN; (b) CHCl3;(c) methanol; (d) CHN; (e) ethanol; (g) isoprene; (h) C2H3CN; and (i) limonene. The VOCs in (a–e) show significant differences between (f) 1-propanol; (g) isoprene; (h)C2H3CN; and (i) limonene. The VOCs in (a–e) show significant samples, while those in (f–i) do not show significant differences (Table 1). The distributions of the differences between samples, while those in (f–i) do not show significant differences (Table1). remaining 11 VOCs are shown in the Supplementary Information (Figure S1). The distributions of the remaining 11 VOCs are shown in the Supplementary Information (Figure S1). There are 1,048,575 (= ∑ C, where C represents -combinations of elements) VOC 20 combinationsThere are 1,048,575of the 20 VOCs (=∑i= listed1 20C iin, whereTable 1.n CWek represents applied nonlineark-combinations SVM diagnosis of n elements)to each of the VOC combinationscombinations of and the 20evaluated VOCs listed their inaccuracy Table1. Welevels, applied as described nonlinear below. SVM diagnosisA VOC that to each has ofno the combinationscontribution and toward evaluated improving their the accuracy diagnostic levels, accuracy as described should below. be removed A VOC from that hasthe data no contribution set even towardif it has improving a large contribution the diagnostic to the accuracy principal should component be removed space, and from vice the versa. data setIn addition, even if it reducing has a large contributionthe number to of thepossible principal VOCs component is helpful for space, designing and vice a portable versa. VOC In addition, detector. reducing the number of possibleThe VOCs imbalanced is helpful sample for designing numbers a portablein the VOC VOC da detector.ta set (lung cancer patients: 107; healthy individuals:The imbalanced 29) likely sample cause an numbers inappropriate in the classification. VOC data set To (lungcomplete cancer the patients:sample numbers, 107; healthy we individuals:introduced 29)a synthetic likely cause minority an inappropriateoversampling te classification.chnique [34–36]. To completeOne original the sample sample (healthy numbers, we introduced a synthetic minority oversampling technique [34–36]. One original sample (healthy Sensors 2017,, 1717,, 287287 5 of 12

control in our case) was randomly chosen, and two virtual samples were interpolated at a random controlpoint Sensorsbetween in our 2017 case), the17, 287 chosen was randomly sample and chosen, the two and twosample virtuals that samples are nearest were to interpolated the chosen atone a5 random (caseof 12 of pointnearest between number the =2 chosen; Figure sample 3).and By therepeating two samples this process, that are we nearest provided to the 107 chosen healthy one (casecontrol of nearestsamples.control number in ourk = case)2; Figure was randomly3). By repeating chosen, thisand two process, virtual we samples provided were 107 interpolated healthy control at a random samples. point between the chosen sample and the two samples that are nearest to the chosen one (case of nearest number =2; Figure 3). By repeating this process, we provided 107 healthy control samples.

Figure 3.3. Schematic illustrating the oversampling techni techniqueque to obtain the same number of healthy control samples to to that that of of the the lung lung cancer cancer patien patients.ts. After After one one sample sample (red) (red is randomly) is randomly chosen, chosen, two twosamples samplesFigure (blue) 3. (blue Schematicare) arerandomly randomly illustrating interpolated interpolated the oversampling on onthe the linestechni lines quebetween between to obtain the the thechosen chosen same samplenumber sample ofand and healthy the two nearestcontrol samples samples ((yellow).yellow to that). of the lung cancer patients. After one sample (red) is randomly chosen, two samples (blue) are randomly interpolated on the lines between the chosen sample and the two nearest samples (yellow). 2.3. SVM Classifier Classifier SVM2.3. SVM is an Classifier algorithm that determines a flat flat classification classification boundary between two-class data data sets. sets. The concentrationconcentrationSVM is an distribution distributionalgorithm that of of thedetermines the selected selected a VOCsflat classificationVOCs is broad is broad and boundary shows and between unclearshows two-class boundariesunclear databoundaries betweensets. thebetween cancerThe the concentration and cancer healthy and samples;distribution healthy thus,samples; of itthe is selected difficultthus, it VOCs tois clearlydifficult is broad classify to clearlyand the shows VOCclassify unclear samples the boundariesVOC using samples linear SVMusing [ between22linear]. We SVM introducedthe cancer[22]. We and a nonlinearintroduced healthy samples; SVM a nonlinear [37 thus,] with itSVM ais Gaussian difficult [37] with to kernel clearlya Gaussian function, classify kernel whichthe VOC function, is widelysamples which used foris widely classifyingusing used linear biologicalfor SVM classifying [22]. data We introduced setsbiological (e.g., microarrayadata nonlinear sets (e.g., SVM gene microarray[37] expressions with a Gaussian gene [38 –expressions40 kernel], DNA function, fragments [38–40], which DNA [41], is widely used for classifying biological data sets (e.g., microarray gene expressions [38–40], DNA andfragments cell shapes [41], [and42]). cell In general, shapes the[42]). data In pointgeneral, coordinates the data arepoint transformed coordinates to aare higher transformed dimensional to a fragments [41], and cell shapes [42]). In general, the data point coordinates are transformed to a coordinatehigher dimensional space, where coordinate SVM can drawspace, a flatwhere boundary SVM betweencan draw the a transformed flat boundary two-class between data setsthe higher dimensional coordinate space, where SVM can draw a flat boundary between the transformed two-class data sets (Figure 4). The coordinate transformation is characterized by the (Figuretransformed4). The coordinate two-class transformation data sets (Figure is characterized4). The coordinate by the transformation kernel function. is characterized We used a by Gaussian the kernel function. We used a Gaussian2 2 kernel function, exp−‖ −‖ ⁄2 , where and kernelkernel function, function.exp We−k xused1 − xa2 Gaussiank /2σ ,kernel where function,x1 and xexp2 represent−‖ − normalized‖ ⁄2 , where VOC data and points andrepresentσ representis a parameternormalized normalized that VOC scales VOC data data the points distancepoints and and between is is a a parameter parameter the points. thatthat Another scales scales parameter,the the distance distance Cbetween, regulatesbetween the the penaltypoints.points. Another for misclassification. Another parameter, parameter, Here,, regulates, regulates we set theσ the= penalty penalty1.5 and for forC misclassification.=misclassification.1000 to reduce Here, theHere, numberwe we set set =1.5 of =1.5 data and points and = 1000 that = determine1000 to reduce the to reduce classification the numberthe number boundary of dataof data points (supportpoints that that vectors), determine determine by thethe which classification classification SVM can boundary avoid boundary overfitting (support (support the vectors), by which SVM can avoid overfitting the dataset. All computations were performed using dataset.vectors), Allby computationswhich SVM can were avoid performed overfitting using the SVM dataset. functions All computations (svmtrain, svmclassify) were performed within using the SVM functions (svmtrain, svmclassify) within the Statistics and Machine Learning Toolbox of StatisticsSVM functions and Machine (svmtrain, Learning svmclassify) Toolbox ofwithin MATLAB the (MathWorks).Statistics and Machine Learning Toolbox of MATLABMATLAB (MathWorks). (MathWorks). Transformed coordinates Transformed coordinates

Lung cancer Lung cancer Healthy Healthy

(a) (b) Figure 4. Schematic illustrating (nonlineara) support vector machine (SVM). (a) The two-class(b) data set is Figure 4. Schematic illustrating nonlinear support vector machine (SVM). (a) The two-class data set composed of two VOCs (VOC 1 and VOC 2; left panel), which are transformed into a different coordinate is composed of two VOCs (VOC 1 and VOC 2; left panel), which are transformed into a different Figure space4. Schematic (right panel) illustrating where the nonlinear dataset can support be classified vector by machinea flat boundary. (SVM). (b () aThe) The SVM two-class boundary data (thick set is coordinate space (right panel) where the dataset can be classified by a flat boundary; (b) The SVM composedline) ofis determinedtwo VOCs using(VOC data 1 and points VOC called 2; left support panel) ve, ctorswhich (thick are circles).transformed The number into a ofdifferent support coordinatevectors spaceboundary (rightshould panel) be(thick small where to line) avoid isthe determined ov dataseterfitting can the using bedata classified data points. points by called a flat supportboundary. vectors (b) The (thick SVM circles). boundary The number (thick line) isof determined support vectors using should data points be small called to avoid support overfitting vectors the(thick data circles). points. The number of support vectors should2.4. be smallEvaluation to avoid of Classification overfitting theAccuracy data points. 2.4. EvaluationWe introduced of Classification the leave-one-out Accuracy cross-validation (LOOCV) method, which is widely used in 2.4. Evaluation of Classification Accuracy Webiology introduced [43,44] and the breath leave-one-out gas analysis cross-validation [5,14,45,46], to evaluate (LOOCV) the method,capability which of the SVM is widely diagnosis used in biologyWefor [ 43 introducedthe,44 given] and data breath the set. leave-one-out gasIn LOOCV, analysis one [5cross-validation, 14data,45 ,point46], to is evaluateleft (LOOCV) out of the the capability method,data set to which of evaluate the SVMis thewidely diagnosis accuracy used forin of the diagnosis, while the remaining data points are used to train the classifier. Then, the left-out thebiology given [43,44] data set.and In breath LOOCV, gas oneanalysis data point[5,14,45,46], is left out to evaluate of the data the set capability to evaluate of the SVM accuracy diagnosis of the data point is diagnosed by the trained classifier (Figure 5). This process is repeated for each sample diagnosis,for the given while data the set. remaining In LOOCV, data one points data are point used is to left train out the of classifier.the data set Then, to evaluate the left-out the data accuracy point of the diagnosis, while the remaining data points are used to train the classifier. Then, the left-out data point is diagnosed by the trained classifier (Figure 5). This process is repeated for each sample

Sensors 2017, 17, 287 6 of 12

Sensors 2017, 17, 287 6 of 12 is diagnosedSensors 2017 by, 17, the 287 trained classiﬁer (Figure5). This process is repeated for each sample to6 compute of 12 the trueto compute positive the rate true (TPR positive= TP /rate(TP (TPR+ FN=)),⁄ true negative + ), true rate negative (TNR = rateTN (TNR/TN=++ FP⁄), and accuracy), to compute the true positive rate (TPR=⁄ + ), true negative rate (TNR=+⁄ ), (ACCand= (accuracyTP + TN (ACC)/(TP = +FN + + TN⁄ + +FP )), + where + 29 healthy), where controls29 healthy were controls used were as true used negative as and accuracy (ACC = + ⁄ + + + ), where 29 healthy controls were used as samples.true negative These values samples. equal These 100% values if a completelyequal 100% if accurate a completely diagnosis accurate is achieved. diagnosis Weis achieved. applied We LOOCV true negative samples. These values equal 100% if a completely accurate diagnosis is achieved. We to allapplied of the VOCLOOCV combinations to all of the to VOC screen combinations the effective to combinationsscreen the effective for cancer combinations diagnosis. for cancer diagnosis.applied LOOCV to all of the VOC combinations to screen the effective combinations for cancer diagnosis. Lung cancer ID Healthy control ID 1234Lung cancer ID n 1Healthy control ID n Train 1234・・・ n 1・・・ n Testrain ・・・・・・Test Train ・・・・・・Testrain ・・・・・・Test ・・・

Train ・・・・・・Testrain ・・・・・・Test Train ・・・・・・Testrain ・・・・・・Test

Figure 5. 5. SchematicSchematic illustrating illustrating the the leave-one-out leave-one-out cross-validation cross-validation (LOOCV) (LOOCV) procedure. procedure. A data Apoint data ispoint isrepeatedlyFigure repeatedly 5. Schematic exchanged exchanged illustrating to categorize to categorize the the leave-one-out training the training and cross-validation testing and data testing set. (LOOCV) data set. procedure. A data point is repeatedly exchanged to categorize the training and testing data set. 3. Results and Discussion 3. Results3. Results and and Discussion Discussion 3.1. Optimal Number of VOCs for Classification 3.1. Optimal3.1. Optimal Number Number of VOCs of VOCs for for Classification Classification The diagnostic accuracy using the original data set (n = 107 for lung cancer patients and n = 29 The diagnostic accuracy using the original data set (n = 107 for lung cancer patients and n = 29 for healthyThe diagnostic individuals) accuracy and oversampled using the original healthy data samples set (n (n= 107 = 78) for depends lung cancer on the patients number and of VOCsn = 29 for healthytrainedfor healthy individuals)by theindividuals) SVM classifier, and and oversampled oversampled as summarized healthy healthy in Figure samplessamples 6a. (n ( nThe == 78) 78)accuracy depends depends increases on onthe thenumber as number more of VOCs of VOCs trainedaretrained byincluded, the by SVMthe and SVM classifier, the classifier, maximum as as summarized accuracysummarized is achieved inin Figure using 66a.a. The The9 or accuracy accuracy10 VOCs, increases increaseswhile the as more asbest more TPR VOCs VOCsis are included,saturatedare included, even and and for the onethe maximum maximumVOC, and accuracy theaccuracy best TNR isis achievedachieved decreases using usingabove 9 94or VOCs. or 10 10 VOCs, VOCs,In contrast, while while thethe thebestnumbers bestTPR of TPRis is saturatedcorrespondingsaturated even even for support onefor one VOC, VOC,vectors and and of the thethe best best ACC TNR TNR and decreasesdecr TPReases classification above above 4 4VOCs. are VOCs. the In lowest Incontrast, contrast, (18.7% the numbers theof the numbers data of of correspondingpoints)corresponding when support there support are vectors 5vectors trained of of theVOCs, the ACC ACC and andand that TPR of TNR cl classificationassification reaches almost are are the bottom the lowest lowest for (18.7% 4 (18.7%VOCs of the (Figure of data the data points) when there are 5 trained VOCs, and that of TNR reaches almost bottom for 4 VOCs (Figure points)6b). when These there results are suggest 5 trained that, VOCs, without and thatoverfitting, of TNR 5 reaches VOCs are almost sufficient bottom for for 89.0% 4 VOCs diagnostic (Figure 6b). 6b). These results suggest that, without overfitting, 5 VOCs are sufficient for 89.0% diagnostic Theseaccuracy, results suggest and that that, the without95% TPR- overfitting, and 89% TNR-based 5 VOCs are diagnoses sufficient are for possible 89.0% diagnosticwhen using accuracy, 5 and 4 and VOCs,accuracy, respectively. and that the 95% TPR- and 89% TNR-based diagnoses are possible when using 5 and 4 that theVOCs, 95% respectively. TPR- and 89% TNR-based diagnoses are possible when using 5 and 4 VOCs, respectively. 100 200 100 (%) data points all of Fraction 10095 200 max ACC 100 (%) data points all of Fraction 200 max TPRACC 9095 max TNR 160 max TPR 75 8590 max TNR 8085 160 75 120 7580 120 50

Rate(%) 7075 max ACC 80 50 (mean ± ± s.d.) (mean

Rate(%) 70 65 TPRmax (max ACC ACC) 80

TNR (max ACC) ± s.d.) (mean 25 6065 TPR (max ACC) 40 maxTNR TPR(max ACC) 25 5560 max TNRTPR Number of support vectors 40

5055 max TNR Number of support vectors 0 0 12345678910 12345678910 50 0 0 12345678910Number of trained VOCs 12345678910Number of trained VOCs Number( ofa) trained VOCs Number( ofb) trained VOCs (a) (b) Figure 6. Dependency of the performance of SVM diagnosis on the number of trained VOCs of the data Figure 6. Dependency of the performance of SVM diagnosis on the number of trained VOCs of the data Figureset (lung 6. cancerDependency patients, ofn = the 107; performance healthy individual of SVMs, n diagnosis= 29, oversampling on the number healthy samples, of trained n = VOCs 78). (a of) the set (lung cancer patients, n = 107; healthy individuals, n = 29, oversampling healthy samples, n = 78). (a) dataBest accuracy set (lung (ACC, cancer blue patients, line) withn =the 107; correspondin healthy individuals,g true positiven =rate 29, (TPR, oversampling solid red line) healthy and true samples, Best accuracy (ACC, blue line) with the corresponding true positive rate (TPR, solid red line) and true nnegative= 78). (ratea) Best (TNR, accuracy solid green (ACC, line) blue within line) all withcombin theations corresponding of each number true of positive trained VOCs rate (TPR, (from solid1 to red 10).negative The dashedrate (TNR, red solidand green line)lines withinrepresen allt combinthe bestations TPR ofand each TNR, number respectively. of trained (b )VOCs The number (from 1 ofto line) and true negative rate (TNR, solid green line) within all combinations of each number of trained support10). The vectorsdashed thatred areand used green in linesthe classifier represen int the(a) forbest the TPR best and ACC TNR, (blue), respectively. TPR (red), ( band) The TNR number (green). of VOCssupport (from vectors 1 tothat 10). are The used dashed in the classifier red and in green (a) for lines the best represent ACC (blue), the best TPR TPR (red), and and TNR, TNR respectively;(green). (b) The number of support vectors that are used in the classiﬁer in (a) for the best ACC (blue), TPR (red), and TNR (green). Left and right y-axes represent the actual number of data points and fraction of all data points, respectively. Sensors 2017, 17, 287 7 of 12

3.2. Effective VOC Combinations for Diagnosing Lung Cancer We determined effective VOC combinations for diagnosing lung cancer with high values for ACC (Table2), TPR (Table3), and TNR (Table4) by ﬁxing the number of trained VOCs at the value that provided a small number of support vectors (n = 5 for ACC, n = 5 for TPR, and n = 4 for TNR; Figure6b). The VOC combinations are sorted by ACC, TPR, and TNR. The most important VOC combinations could not be determined because the differences between the rates in these tables are not large, and the VOC concentrations may contain noises caused by the sensor or sampling process. However, certain VOCs were common in each combination while some VOCs (e.g., nonanal and toluene) were rarely used for diagnosis.

Table 2. Top 10 VOC combinations sorted by ACC (%) (5 VOCs were trained, MC = methylcyclohexane; boldface: most frequent VOC).

Rank 1 2 2 2 2 6 6 6 9 9 ACC 89.0 88.2 88.2 88.2 88.2 86.8 86.8 86.8 86.0 86.0 TPR 92.5 91.6 93.5 91.6 92.5 91.6 89.7 92.5 91.6 87.9 TNR 75.9 75.9 69.0 75.9 72.4 69.0 75.9 65.5 65.5 79.3 CHN CHN Methanol Methanol Butane CHN CHN C2H3CN CHN CHN Methanol CH3CN Acetone Isoprene CH3CN Methanol Butane Isoprene Ethanol CH3CN VOCs CH3CN C2H3CN C2H3CN Xylene Isoprene CH3CN CH3CN 1-Propanol Isoprene Isoprene Isoprene Isoprene Isoprene Unknown-1 1-Propanol 1-Propanol Isoprene Unknown-1 1-Propanol CHCl3 1-Propanol CHCl3 1-Propanol C8H17OH Xylene MC CHCl3 C8H17OH Toluene Xylene

Table 3. Top 10 VOC combinations sorted by TPR (%) (5 VOCs were trained, MC = methylcyclohexane; boldface: most frequent VOC).

Rank 1 2 3 3 3 3 3 3 3 10 ACC 84.6 88.2 86.0 89.0 84.6 88.2 85.3 85.3 86.8 86.8 TPR 94.4 93.5 92.5 92.5 92.5 92.5 92.5 92.5 92.5 91.6 TNR 48.3 69.0 62.1 75.9 55.2 72.4 58.6 58.6 65.5 69.0 Butane Methanol Ethanol CHN CHN Butane Ethanol Ethanol C2H3CN CHN Ethanol Acetone CH3CN Methanol Ethanol CH3CN CH3CN CH3CN Isoprene Methanol VOCs Acetone C2H3CN C2H3CN CH3CN C2H3CN Isoprene Acetone MC 1-Propanol CH3CN C2H3CN Isoprene Isoprene Isoprene 1-Propanol 1-Propanol 2-Propanol Unknown-1 Unknown-1 1-Propanol Toluene 1-Propanol 1-Propanol 1-Propanol CHCl3 Xylene C2H3CN C8H17OH C8H17OH MC

Table 4. Top 10 VOC combinations sorted by TNR (%) (4 VOCs were trained, MC = methylcyclohexane; boldface: most frequent VOCs).

Rank 1 2 2 2 5 5 5 5 5 10 ACC 84.6 89.0 88.2 84.6 88.2 85.3 86.0 85.3 86.8 86.8 TPR 82.2 86.9 76.6 78.5 77.6 79.4 72.9 78.5 71.0 83.2 TNR 89.7 86.2 86.2 86.2 82.8 82.8 82.8 82.8 82.8 79.3 CHN CHN CHN CHN CHN CHN Methanol CH3CN CH3CN CHN Isoprene CH3CN Methanol CH3CN Methanol Methanol CH3CN Acetone Isoprene Methanol VOCs Xylene Isoprene 2-Propanol CHCl3 CH3CN Isoprene Isoprene Unknown-1 MC CH3CN Limonene CHCl3 Nonanal Dichlorobenzene C2H3CN Limonene Limonene C8H17OH Nonanal CHCl3

The variation of VOC combinations in the top diagnosis above was probably caused by the noise in the detected concentration, cancer type, or cancer stage. We extracted the VOCs that were frequently present in the top 10 combinations (Table5). The results show that: (1) CH 3CN and isoprene are commonly used for all diagnoses; (2) the frequently used combination in ACC is the same as the best combination in Table2; (3) 1-propanol, C 2H3CN, and ethanol are specific to the TPR-based diagnosis; (4) the frequently used combination in TPR is same as the third combination in Table3; (5) CHN, CH3CN, and methanol are specific to the TNR-based diagnosis; (6) the frequently used combination in TNR, except for methanol, is the same as the second combination in Table5; and (7) the group of VOCs in the ACC-based diagnosis (“Top ACC” in Table5) contains a mixture of VOCs from the TPR- and TNR-based diagnoses and justifies the definition of ACC (i.e., the indicator merging TPR and TNR). If the eight VOCs listed in Table5 (two from ACC, three from TPR, and three from TNR) were used, the SVM diagnosis would show a performance of 84.6% for ACC, 91.6% for TPR, and 58.6% for TNR, with 54.2 ± 5.47 for the number of support vectors. This is not a particularly bad performance, Sensors 2017, 17, 287 8 of 12 Sensors 2017, 17, 287 8 of 12

ACC,but TNR, 91.6% in particular,for TPR, and would 58.6% provide for TNR, a low with performance. 54.2 ± 5.47 Thisfor the is probably number causedof support by the vectors. small numberThis is notof originala particularly healthy bad controls. performance, The extracted but TNR, VOCs in fromparticular, the VOCs would in Tableprovide1 are a differentlow performance. from the result This isof probably our previous caused study by [the33], small where number the selection of original of target healthy VOCs controls. is different. The extracted VOCs from the VOCs in Table 1 are different from the result of our previous study [33], where the selection of targetTable VOCs 5. isVOCs different. that are frequently used in the top 10 ACC (Table2), TPR (Table3), and TNR (Table4) combinations. The VOCs written in boldface are the same as those in Tables2–4 and represent the Table VOCs5. VOCs that that are usedare frequently most frequently used in thethe ACC-,top 10 TPR-, ACC or (Table TNR-based 2), TPR diagnoses. (Table 3), VOCs and commonlyTNR (Table 4) combinations.used inevery The VOCs diagnosis written and in those boldface speciﬁcally are the used same for as TPR- those and in TNR-basedTables 2–4 and diagnoses represent are coloredthe VOCs by that are usedblue, most red, andfrequently green, respectively.in the ACC-, TPR-, or TNR-based diagnoses. VOCs commonly used in every diagnosis and those specifically used for TPR- and TNR-based diagnoses are colored by blue, red, and green, respectively. ACC TPR TNR Rank VOC Count Rank VOC Count Rank VOC Count 1 IsopreneACC 9 1 1-PropanolTPR 7 1 TNR CHN 7 Rank VOC Count Rank VOC Count Rank VOC Count 2 1 CHNIsoprene 6 9 12 1-ProC2Hpanol3CN 7 6 1 1 CHN CH3CN7 7 2 1-Propanol2 CHN 6 6 2 CCH2H33CNCN 6 6 1 3 CH3CNMethanol 7 5 2 2 CH3CN1-Propanol 6 6 24 CHEthanol3CN 6 5 3 3 MethanolIsoprene 5 5 2 CH3CN 6 4 Ethanol 5 3 Isoprene 5 5 Methanol 4 4 Isoprene 5 5 CHCl3 3 5 Methanol 4 4 Isoprene 5 5 CHCl3 3

Furthermore,Furthermore, to examineexamine the the discriminability discriminability between between the the cancer cancer and healthyand healthy samples, samples, the scatter the scatterdiagrams diagrams for six combinationsfor six combinations of three of VOCs three in VO TableCs 5in were Table plotted 5 were on plotted 3D coordinates on 3D coordinates (Figure7). (FigureThe results 7). representThe results much represent smaller much overlaps smaller between overlaps the lung between cancer andthe healthylung cancer groups and than healthy the 1D groupsrepresentation than the in 1D Figure representation2, and suggest in Figure a high 2, discriminability and suggest a high between discriminability cancer patients between and healthy cancer patientssubjects and using healthy the VOC subjects combination using the rather VOC than combination single VOCs. rather than single VOCs.

8 150 7 8 100 7 6 6 5 50 5 4 (ppb) Isoprene 4 0 3 3 5 Propanol (ppb)

2 0 − 2 4 Propanol (ppb) C 1

− H

1 1 1 N 3 50 ) b ( 0 p p 100 p 0 p 2 0 5 ( 100 ) b 4 e 0 1 pb ) 50 3 150 n 2 50 l (p 1 b) C 2 re 3 no 100 (pp HN ( 1 p CH 4 ha 0 nol ppb) 0 o N (ppb 5 0 et etha Is (a) ) M (b) M (c)

8 8 7 7 6 6 5 4 150 5 3 100 4

Propanol (ppb) Propanol 2 − 3 Propanol (ppb) Propanol 1

1 50 − 0 1 2 0 (ppb) Isoprene 0 Is 1 o 0 p 50 0 r 0 e 0.1 20 n 100 C 40 ) 150 0.2 0.3 0.4 e H b 100 50 0 0.1 3 60 p Isop ) ( C 0.2 (p rene (ppb) H3CN (ppb p 150 0 N 80 l C p 40 20 ( 100 no (f) b 80 60 pp 0.3 a ) 120100 b 120 h 140 ol (ppb) ) et Methan (d) 0.4 140 M (e)

FigureFigure 7. VOC 7. VOCdistributions distributions on a on3D arepresentation 3D representation for the for thelist listof top of top accuracy accuracy combinations combinations in in Table Table 5.5. CHN, CHN,isoprene, isoprene, 1-propanol 1-propanol (a); CHN, (a); CHN, methanol, methanol, 1-propanol 1-propanol (b); ( bCHN,); CHN, methanol, methanol, isoprene isoprene (c ();c );isoprene, isoprene, methanol,methanol, 1-propanol 1-propanol (d); CH (d3CN,); CH methanol,3CN, methanol, isoprene isoprene (e); and (isoprene,e); and isoprene, CH3CN, CH1-propanol3CN, 1-propanol (f). The red (f). and greenThe redcircles and represent green circles lung represent cancer lungpatients cancer and patients healthy and controls, healthy respectively, controls, respectively, and the andblue thecircles blue indicatecircles the oversampling indicate the oversampling data. The oversampling data. The oversampling data are more data widely are more spread widely than spread the original than the healthy original sampleshealthy in this samples range because in this rangesome becauseof the healthy some ofsamples the healthy exist samplesoutside of exist the outsideaxis range. of the axis range.

3.3.3.3. Correlation Correlation between between Canc Cancerer Stage Stage and and Distance Distance from from the the Classification Classification Boundary Boundary DataData points points near near the the classification classification boundary boundary contain contain a a property property of of each each class class because because SVM SVM providesprovides a a boundary boundary between between the the two-class two-class data data points. points. In In other other words, words, the the data data points points that that are are far far fromfrom the the boundary boundary have the specificspecific propertyproperty ofof theirtheir class. class. In In the the SVM SVM diagnosis, diagnosis, the the data data samples samples of of low cancer stages are located near the boundary and those of high stages are far from the boundary (Figure 8a). Such a distance-based feature extraction has been theoretically studied

Sensors 2017, 17, 287 9 of 12

Sensorslow cancer 2017, 17 stages, 287 are located near the boundary and those of high stages are far from the boundary9 of 12 (Figure8a). Such a distance-based feature extraction has been theoretically studied [ 47,48] and applied [47,48]to MRI and images applied of the to brainMRI images [49]. Thus, of the we brain computed [49]. Thus, the distanceswe computed of the the cancer distances samples of the from cancer the samplesboundary from using the the boundary best VOC using combination the best VOC in the combination TPR rank within the the TPR LOO rank fashion; with the the LOO test fashion; sample thedistances test sample are computed distances by are the computed classiﬁer developed by the classifier by the remainingdeveloped learning by the samples.remaining The learning higher samples.cancer stage The samples higher cancer are located stage relatively samples are far fromlocate thed relatively boundary far (Figure from 8theb). boundary This suggests (Figure that 8b). the ThisSVM suggests diagnosis that could the beSVM used diagnosis for estimating could thebe used cancer for stage estimating of a patient. the cancer The ﬁrst-stage stage of a patients patient.have The first-stagerelatively longpatients distances. have Thisrelatively may be long caused distances. by noise This in VOCs, may mislabelingbe caused ofby stage, noise or in nonlinear VOCs, mislabelingtransformation of stage, of the or VOCs nonlinear near thetransf boundary.ormation of the VOCs near the boundary.

2 (mean ± S.E.M.) ± (mean

Distanceboundary from 1

0 1234 Cancer stage (a) (b)

Figure 8. ((aa)) SchematicSchematic illustration illustration of of the the hypothesis hypothesis that th theat the cancer cancer stage stage correlates correlates with with distance distance from fromthe SVM the boundarySVM boundary in the in transformed the transformed coordinates coordinates space; (space.b) The (yb-axis) The indicates y-axis indicates the distance the distance from the fromSVM the boundary. SVM boundary. The learning The learning VOC combination VOC combinatio of the bestn of TPRthe best in Table TPR 3in (butane, Table 3 ethanol,(butane, acetone,ethanol, acetone, C2H3CN, and toluene) was used for computing the test sample distance. C2H3CN, and toluene) was used for computing the test sample distance.

4. Summary Summary and and Conclusions Conclusions We have applied nonlinear SVM classification to the detection of VOCs for lung cancer We have applied nonlinear SVM classification to the detection of VOCs for lung cancer diagnosis diagnosis with leave-one-out cross-validation, and have determined the optimal VOC patterns for with leave-one-out cross-validation, and have determined the optimal VOC patterns for the diagnosis. the diagnosis. Optimal combinations of VOCs depend on ACC, TPR, or TNR (Tables 2–4). The TPR- Optimal combinations of VOCs depend on ACC, TPR, or TNR (Tables2–4). The TPR- and TNR-based and TNR-based optimal combinations are useful for biologically investigating why cancer patients optimal combinations are useful for biologically investigating why cancer patients and healthy people and healthy people are characterized by these VOCs. The ACC-based optimal combinations will be are characterized by these VOCs. The ACC-based optimal combinations will be used for diagnosing used for diagnosing subjects. The TPR-based diagnosis is better for avoiding a risk of false negative. subjects. The TPR-based diagnosis is better for avoiding a risk of false negative. The efficient strategy The efficient strategy is to develop a diagnosis tool based on the ACC-based diagnosis while the is to develop a diagnosis tool based on the ACC-based diagnosis while the TPR-based diagnoses is TPR-based diagnoses is used in hospitals because improving the ACC diagnosis also improves the used in hospitals because improving the ACC diagnosis also improves the TPR diagnosis. TPR diagnosis. The TNR results were lower than that for TPR for all VOC combinations. This is possibly caused by The TNR results were lower than that for TPR for all VOC combinations. This is possibly the oversampling of the healthy controls and will be improved by collecting VOCs from more healthy caused by the oversampling of the healthy controls and will be improved by collecting VOCs from individuals. For the correlation between the SVM distance and cancer stage, a possible alternative more healthy individuals. For the correlation between the SVM distance and cancer stage, a application would be to classify samples into 5 classes (stages 1–4 and healthy) by SVM. This work may possible alternative application would be to classify samples into 5 classes (stages 1–4 and healthy) be performed in the future, because we cannot currently obtain a high accuracy using a multiclass SVM. by SVM. This work may be performed in the future, because we cannot currently obtain a high The optimal VOC set was selected based on the VOC concentrations, each of which was detected accuracy using a multiclass SVM. by the same GC/MS. There is little quantitative variation added by the GC/MS. If we use a different The optimal VOC set was selected based on the VOC concentrations, each of which was GC/MS that has different VOC sensitivities, some VOCs will have different measured concentrations. detected by the same GC/MS. There is little quantitative variation added by the GC/MS. If we use a Even in this case, the SVM diagnosis will select effective VOC combinations similar to those in this different GC/MS that has different VOC sensitivities, some VOCs will have different measured work, because the VOC concentrations are normalized in the SVM; a relative concentration correlation concentrations. Even in this case, the SVM diagnosis will select effective VOC combinations similar between samples is important for classification. However, this argument does not hold, and different to those in this work, because the VOC concentrations are normalized in the SVM; a relative VOC combinations are possibly selected, in the case that the detected VOCs differ depending on the concentration correlation between samples is important for classification. However, this argument GC/MS because of VOC sensitivity. does not hold, and different VOC combinations are possibly selected, in the case that the detected We showed that a diagnosis with 89.0% accuracy can be performed using five VOCs. This highly VOCs differ depending on the GC/MS because of VOC sensitivity. efficient SVM classification will be integrated into the prototype breath analyzer for lung cancer We showed that a diagnosis with 89.0% accuracy can be performed using five VOCs. This highly efficient SVM classification will be integrated into the prototype breath analyzer for lung cancer screening utilizing double GC columns and sensors [33], and a new breath test in Aichi

Sensors 2017, 17, 287 10 of 12 screening utilizing double GC columns and sensors [33], and a new breath test in Aichi Cancer Center is going to start. It must be confirmed that the selected VOCs are also optimal sets when we diagnose with the prototype analyzer specialized to these optimal VOCs in future. We promote this integration and prototype analyzer, expecting that the SVM classifier can be used for the further development of a desktop GC-sensor analysis system for lung cancer. Furthermore, if the precise VOC composition of five VOC mixtures is measured, the cancer stage can be predicted, as it is correlated with the distance of a cancer sample from the SVM classification boundary.

Supplementary Materials: The following are available online at http://www.mdpi.com/1424-8220/17/2/287/ s1, Figure S1: VOC concentration distributions from lung cancer (red, n = 107) and healthy (green, n = 29) controls’ breath. Acknowledgments: This work was supported by the Knowledge Hub Aichi (the priority research project: P3-G3-S1) of Aichi Prefecture and Special Research Aid of the President of Aichi Prefectural University, Japan. The authors also thank Profs. Takaharu Kondo (Chubu University, Japan), Kazushi Ikeda (Nara Institute of Science and Technology, Japan), Junichiro Yoshimoto (Nara Institute of Science and Technology, Japan), and Thomas N. Sato (ERATO Sato Live Bio-Forecasting Project, Japan) for their comments regarding the breath analysis. Author Contributions: Y.S. designed the data analysis and wrote the paper; Y.K. and H.T. developed the MATLAB code; T.H. collected breath gases; K.S. designed the study; T.I. performed the gas analysis experiments; T.A. contributed the experimental methods; W.S. conceived of and designed the study of breath analysis, and wrote the paper. All authors discussed the results and the implications of this manuscript. Conﬂicts of Interest: The authors declare no conﬂict of interest.

References

1. Gordon, S.; Szidon, J.; Krotoszynski, B.; Gibbons, R.; O’Neill, H. Volatile organic compounds in exhaled air from patients with lung cancer. Clin. Chem. 1985, 31, 1278–1282. [PubMed] 2. Kharitonov, S.A.; Barnes, P.J. Biomarkers of some pulmonary diseases in exhaled breath. Biomarkers 2002, 7, 1–32. [CrossRef][PubMed] 3. Bach, P.B.; Kelley, M.J.; Tate, R.C.; McCrory, D.C. Screening for lung cancer: A review of the current literature. Chest 2003, 123, 72S–82S. [CrossRef][PubMed] 4. Corazza, G.; Menozzi, M.; Strocchi, A.; Rasciti, L.; Vaira, D.; Lecchini, R.; Avanzini, P.; Chezzi, C.; Gasbarrini, G. The diagnosis of small bowel bacterial overgrowth. Reliability of jejunal culture and inadequacy of breath hydrogen testing. Gastroenterology 1990, 98, 302–309. [CrossRef] 5. Phillips, M.; Gleeson, K.; Hughes, J.M.B.; Greenberg, J.; Cataneo, R.N.; Baker, L.; McVay, W.P. Volatile organic compounds in breath as markers of lung cancer: A cross-sectional study. Lancet 1999, 353, 1930–1933. [CrossRef] 6. Amann, A.; Poupart, G.; Telser, S.; Ledochowski, M.; Schmid, A.; Mechtcheriakov, S. Applications of breath gas analysis in medicine. Int. J. Mass Spectrom. 2004, 239, 227–233. [CrossRef] 7. Amann, A.; Smith, D. Breath Analysis for Clinical Diagnosis and Therapeutic Monitoring: (With CD-ROM); World Scientiﬁc: Singapore, 2005. 8. Machado, R.F.; Laskowski, D.; Deffenderfer, O.; Burch, T.; Zheng, S.; Mazzone, P.J.; Mekhail, T.; Jennings, C.; Stoller, J.K.; Pyle, J. Detection of lung cancer by sensor array analyses of exhaled breath. Am. J. Respir. Crit. Care Med. 2005, 171, 1286–1291. [CrossRef][PubMed] 9. Wehinger, A.; Schmid, A.; Mechtcheriakov, S.; Ledochowski, M.; Grabmer, C.; Gastl, G.A.; Amann, A. Lung cancer detection by proton transfer reaction mass-spectrometric analysis of human breath gas. Int. J. Mass Spectrom. 2007, 265, 49–59. [CrossRef] 10. Mazzone, P.J. Analysis of volatile organic compounds in the exhaled breath for the diagnosis of lung cancer. J. Thorac. Oncol. 2008, 3, 774–780. [CrossRef][PubMed] 11. Fuchs, P.; Loeseken, C.; Schubert, J.K.; Miekisch, W. Breath gas aldehydes as biomarkers of lung cancer. Int. J. Cancer 2010, 126, 2663–2670. [CrossRef][PubMed] 12. Lourenco, C.; Turner, C. Breath analysis in disease diagnosis: Methodological considerations and applications. Metabolites 2014, 4, 465–498. [CrossRef][PubMed] Sensors 2017, 17, 287 11 of 12

13. Phillips, M.; Cataneo, R.N.; Cummin, A.R.; Gagliardi, A.J.; Gleeson, K.; Greenberg, J.; Maxfield, R.A.; Rom, W.N. Detection of lung cancer with volatile markers in the breath. Chest J. 2003, 123, 2115–2123. [CrossRef] 14. Di Natale, C.; Macagnano, A.; Martinelli, E.; Paolesse, R.; D’Arcangelo, G.; Roscioni, C.; Finazzi-Agrò, A.; D’Amico, A. Lung cancer identification by the analysis of breath by means of an array of non-selective gas sensors. Biosens. Bioelectron. 2003, 18, 1209–1218. [CrossRef] 15. Phillips, M.; Altorki, N.; Austin, J.H.; Cameron, R.B.; Cataneo, R.N.; Greenberg, J.; Kloss, R.; Maxfield, R.A.; Munawar, M.I.; Pass, H.I. Prediction of lung cancer using volatile biomarkers in breath. Cancer Biomark. 2007, 3, 95–109. [CrossRef][PubMed] 16. Peng, G.; Tisch, U.; Adams, O.; Hakim, M.; Shehada, N.; Broza, Y.Y.; Billan, S.; Abdah-Bortnyak, R.; Kuten, A.; Haick, H. Diagnosing lung cancer in exhaled breath using gold nanoparticles. Nat. Nanotechnol. 2009, 4, 669–673. [CrossRef][PubMed] 17. Dragonieri, S.; Annema, J.T.; Schot, R.; van der Schee, M.P.; Spanevello, A.; Carratú, P.; Resta, O.; Rabe, K.F.; Sterk, P.J. An electronic nose in the discrimination of patients with non-small cell lung cancer and COPD. Lung Cancer 2009, 64, 166–170. [CrossRef][PubMed] 18. Mazzone, P.J.; Wang, X.-F.; Xu, Y.; Mekhail, T.; Beukemann, M.C.; Na, J.; Kemling, J.W.; Suslick, K.S.; Sasidhar, M. Exhaled breath analysis with a colorimetric sensor array for the identification and characterization of lung cancer. J. Thorac. Oncol. 2012, 7, 137–142. [CrossRef][PubMed] 19. Pennazza, G.; Santonico, M.; Martinelli, E.; D’Amico, A.; Di Natale, C. Interpretation of exhaled volatile organic compounds. In Exhaled Biomarkers; European Respiratory Society: Lausanne, Switzerland, 2010. 20. Mazzone, P.J.; Hammel, J.; Dweik, R.; Na, J.; Czich, C.; Laskowski, D.; Mekhail, T. Diagnosis of lung cancer by the analysis of exhaled breath with a colorimetric sensor array. Thorax 2007, 62, 565–568. [CrossRef] [PubMed] 21. Phillips, M.; Altorki, N.; Austin, J.H.; Cameron, R.B.; Cataneo, R.N.; Kloss, R.; Maxfield, R.A.; Munawar, M.I.; Pass, H.I.; Rashid, A. Detection of lung cancer using weighted digital analysis of breath biomarkers. Clin. Chim. Acta 2008, 393, 76–84. [CrossRef][PubMed] 22. Vapnik, V.; Lerner, A. Pattern recognition using generalized portrait method. Autom. Remote Control 1963, 24, 774–780. 23. Ruan, S.; Lebonvallet, S.; Merabet, A.; Constans, J.-M. Tumor segmentation from a multispectral MRI images by using support vector machine classification. In Proceedings of the 2007 4th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Arlington, VA, USA, 12–15 April 2007; pp. 1236–1239. 24. Zacharaki, E.I.; Wang, S.; Chawla, S.; Soo Yoo, D.; Wolf, R.; Melhem, E.R.; Davatzikos, C. Classification of brain tumor type and grade using MRI texture and shape in a machine learning scheme. Magn. Reson. Med. 2009, 62, 1609–1618. [CrossRef][PubMed] 25. Klöppel, S.; Stonnington, C.M.; Chu, C.; Draganski, B.; Scahill, R.I.; Rohrer, J.D.; Fox, N.C.; Jack, C.R.; Ashburner, J.; Frackowiak, R.S. Automatic classification of MR scans in Alzheimer’s disease. Brain 2008, 131, 681–689. [CrossRef][PubMed] 26. Ortiz, A.; Górriz, J.M.; Ramírez, J.; Martínez-Murcia, F.J. LVQ-SVM based CAD tool applied to structural MRI for the diagnosis of the Alzheimer’s disease. Pattern Recognit. Lett. 2013, 34, 1725–1733. [CrossRef] 27. Marquand, A.F.; Mourão-Miranda, J.; Brammer, M.J.; Cleare, A.J.; Fu, C.H. Neuroanatomy of verbal working memory as a diagnostic biomarker for depression. Neuroreport 2008, 19, 1507–1511. [CrossRef][PubMed] 28. Nouretdinov, I.; Costafreda, S.G.; Gammerman, A.; Chervonenkis, A.; Vovk, V.; Vapnik, V.; Fu, C.H. Machine learning classification with confidence: Application of transductive conformal predictors to MRI-based diagnostic and prognostic markers in depression. Neuroimage 2011, 56, 809–813. [CrossRef][PubMed] 29. Barash, O.; Peled, N.; Tisch, U.; Bunn, P.A., Jr.; Hirsch, F.R.; Haick, H. Classification of lung cancer histology by gold nanoparticle sensors. Nanomedicine 2012, 8, 580–589. [CrossRef][PubMed] 30. Van Berkel, J.J.; Dallinga, J.W.; Moller, G.M.; Godschalk, R.W.; Moonen, E.; Wouters, E.F.; van Schooten, F.J. Development of accurate classification method based on the analysis of volatile organic compounds from human exhaled air. J. Chromatogr. B 2008, 861, 101–107. [CrossRef][PubMed] 31. Van Berkel, J.J.; Dallinga, J.W.; Moller, G.M.; Godschalk, R.W.; Moonen, E.J.; Wouters, E.F.; van Schooten, F.J. A profile of volatile organic compounds in breath discriminates COPD patients from controls. Respir. Med. 2010, 104, 557–563. [CrossRef][PubMed] Sensors 2017, 17, 287 12 of 12

32. Hakim, M.; Billan, S.; Tisch, U.; Peng, G.; Dvrokind, I.; Marom, O.; Abdah-Bortnyak, R.; Kuten, A.; Haick, H. Diagnosis of head-and-neck cancer from exhaled breath. Br. J. Cancer 2011, 104, 1649–1655. [CrossRef] [PubMed] 33. Itoh, T.; Miwa, T.; Tsuruta, A.; Akamatsu, T.; Izu, N.; Shin, W.; Park, J.; Hida, T.; Eda, T.; Setoguchi, Y. Development of an Exhaled Breath Monitoring System with Semiconductive Gas Sensors, a Gas Condenser Unit, and Gas Chromatograph Columns. Sensors 2016, 16, 1891–1906. [CrossRef][PubMed] 34. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. 35. Akbani, R.; Kwek, S.; Japkowicz, N. Applying support vector machines to imbalanced datasets. In Machine Learning: ECML 2004, Proceedings of the 15th European Conference on Machine Learning, Pisa, Italy, 20–24 September 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 39–50. 36. Chawla, N.V. Data mining for imbalanced datasets: An overview. In Data Mining and Knowledge Discovery Handbook; Springer: Berlin/Heidelberg, Germany, 2005; pp. 853–867. 37. Bernhard, E.B.; Isabelle, M.G.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992; pp. 144–152. 38. Furey, T.S.; Cristianini, N.; Duffy, N.; Bednarski, D.W.; Schummer, M.; Haussler, D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16, 906–914. [CrossRef][PubMed] 39. Brown, M.P.; Grundy, W.N.; Lin, D.; Cristianini, N.; Sugnet, C.W.; Furey, T.S.; Ares, M.; Haussler, D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA 2000, 97, 262–267. [CrossRef][PubMed] 40. Chai, H.; Domeniconi, C. An evaluation of gene selection methods for multi-class microarray data classification. In Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics, Pisa, Italy, 20–24 September 2004; pp. 3–10. 41. McHardy, A.C.; Martin, H.G.; Tsirigos, A.; Hugenholtz, P.; Rigoutsos, I. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods 2007, 4, 63–72. [CrossRef][PubMed] 42. Yin, Z.; Sadok, A.; Sailem, H.; McCarthy, A.; Xia, X.; Li, F.; Garcia, M.A.; Evans, L.; Barr, A.R.; Perrimon, N. A screen for morphological complexity identifies regulators of switch-like transitions between discrete cell shapes. Nat. Cell Biol. 2013, 15, 860–871. [CrossRef][PubMed] 43. Van’t Veer, L.J.; Dai, H.; van de Vijver, M.J.; He, Y.D.; Hart, A.A.; Mao, M.; Peterse, H.L.; van der Kooy, K.; Marton, M.J.; Witteveen, A.T. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415, 530–536. [CrossRef][PubMed] 44. Ambroise, C.; McLachlan, G.J. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. USA 2002, 99, 6562–6566. [CrossRef][PubMed] 45. Phillips, M.; Cataneo, R.N.; Condos, R.; Erickson, G.A.R.; Greenberg, J.; La Bombardi, V.; Munawar, M.I.; Tietje, O. Volatile biomarkers of pulmonary tuberculosis in the breath. Tuberculosis 2007, 87, 44–52. [CrossRef] [PubMed] 46. Fens, N.; Zwinderman, A.H.; van der Schee, M.P.; de Nijs, S.B.; Dijkers, E.; Roldaan, A.C.; Cheung, D.; Bel, E.H.; Sterk, P.J. Exhaled breath profiling enables discrimination of chronic obstructive pulmonary disease and asthma. Am. J. Respir. Crit. Care Med. 2009, 180, 1076–1082. [CrossRef][PubMed] 47. Gilad-Bachrach, R.; Navot, A.; Tishby, N. Margin based feature selection-theory and algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004; pp. 144–152. 48. Navot, A.; Shpigelman, L.; Tishby, N.; Vaadia, E. Nearest neighbor based feature selection for regression and its application to neural activity. Adv. Neural Inf. Process. Syst. 2006, 18, 995–1002. 49. Klöppel, S.; Abdulkadir, A.; Jack, C.R.; Koutsouleris, N.; Mourão-Miranda, J.; Vemuri, P. Diagnostic neuroimaging across diseases. Neuroimage 2012, 61, 457–463. [CrossRef][PubMed]

© 2017 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).