Clinical Chemistry 63:12 Informatics and Statistics 1847–1855 (2017)

Very Deep Convolutional Neural Networks for Morphologic Classification of Erythrocytes Thomas J.S. Durant,1† Eben M. Olson,1† Wade L. Schulz,1 and Richard Torres1*

BACKGROUND: Morphologic profiling of the erythrocyte population is a widely used and clinically valuable diagnostic modality, but one that relies on a slow manual process associated with significant labor cost and limited reproducibility. Automated profiling of erythrocytes from digital images by capable machine learning approaches would augment the throughput and value of morphologic analysis. To this end, we sought to evaluate the performance of leading implementation strategies for convolutional neural networks (CNNs) when applied to classification of erythrocytes based on morphology.

METHODS: Erythrocytes were manually classified into 1 of 10 classes using a custom-developed Web application. Using recent literature to guide architectural considerations for neural network design, we implemented a "very deep" CNN, consisting of >150 layers, with dense shortcut connections.

RESULTS: The final database comprised 3737 labeled cells. Ensemble model predictions on unseen data demonstrated a harmonic mean of recall and precision metrics of 92.70% and 89.39%, respectively. Of the 748 cells in the test set, 23 misclassification errors were made, with a correct classification frequency of 90.60%, represented as a harmonic mean across the 10 morphologic classes.

CONCLUSIONS: These findings indicate that erythrocyte morphology profiles could be measured with a high degree of accuracy with "very deep" CNNs. Further, these data support future efforts to expand classes and optimize practical performance in a clinical environment as a prelude to full implementation as a clinical tool.
© 2017 American Association for Clinical Chemistry

Microscopic examination of peripheral blood is standard practice in clinical medicine and serves an important role in the diagnosis of both hematologic and nonhematologic disease. It is uniquely capable of discerning clinically relevant morphologic features of hematopoietic cells, including abnormal leukocytes in lymphoma, leukemia, and dysplasia; intracellular parasites such as malaria and Anaplasma; and platelet changes characteristic of specific causes of thrombocytopenia and myeloproliferative disease (1, 2). Similarly, aberrant erythrocytic forms can be associated with renal and liver disease, hemoglobinopathies, toxins, and dysplasia, and are critically important for the detection of hemolytic disorders such as thrombotic thrombocytopenic purpura (2, 3). Automated analyzers typically rely on a combination of laser-light scatter, fluorescence, and impedance, as well as other flow-based physical or cytochemical properties that are insensitive to these morphologic changes, making visual microscopy frequently necessary (4).

Unfortunately, gold-standard morphologic profiling of blood cells relies heavily on manual smear processing techniques and visual inspection, with limitations in quality control and economic scalability. Blood smear preparation and interpretation are thought to be negatively affected by observer bias, slide distribution errors, statistical sampling error, and recording errors, and also involve labor-intensive processes that require highly trained individuals, rendering them time- and cost-prohibitive (5–8). Although the process remains in universal use, it is typically performed under the stewardship of institutional guidelines, to limit requests in the interest of conserving resources (9). As such, there has been considerable interest in improving the sensitivity and specificity of automated analyzers for morphologic abnormalities, and in developing automated classification of images of peripheral blood smears (10).

Historically, efforts to automate morphologic classification have used statistical models that rely on input that is derived in a way similar to analysis by morphologists (11). Known as feature engineering, such attempts quantify predetermined morphologic features from digital images to serve as input to prediction algorithms. Although these models have demonstrated the potential to discern between basic morphologic classes, they generally distinguish only a small number of categories and

1 Department of Laboratory Medicine, Yale University School of Medicine, New Haven, CT.
* Address correspondence to this author at: Department of Laboratory Medicine, 55 Park St., PS345D, New Haven, CT 06511. Fax 203-937-4746; e-mail [email protected].
† T.J.S. Durant and E.M. Olson contributed equally to this work.
Received May 18, 2017; accepted August 11, 2017.
Previously published online at DOI: 10.1373/clinchem.2017.276345
© 2017 American Association for Clinical Chemistry

show limited accuracy, particularly for red blood cells (11–13).

Implementation of feature engineering has also been studied as input for modern machine learning (ML)2 algorithms, such as artificial neural networks (ANNs), which have been an effective combination for leukocyte classification (14). A commercial system that relies on ANNs for leukocyte classification, CellaVision (CellaVision AB), was first Food and Drug Administration-approved for automated image analysis in 2001 and is widely available as an add-on for several automated hematology analyzer systems (15). However, the Food and Drug Administration-approved implementation is limited to human-assisted classification, as it requires confirmation of leukocyte classifications by a skilled operator. Such reliance on human confirmation can limit the potential benefits of ML technology in clinical medicine, and future applications may benefit from modern and more capable algorithms to pursue unassisted classification.

CellaVision also offers a non-Food and Drug Administration-approved image analysis solution for erythrocytes that is similarly based on ANNs, using 80 predetermined object features to classify cells from 17 morphologic classes, directed into 4 qualitative categories. However, published reports for CellaVision red blood cell classification demonstrate limited specificity and variable accuracy without reclassification by human operators (16, 17). Also, the intended use is described as qualitative rather than a quantitative metric, without reliable single-cell resolution, an aspect that limits its practical and clinical utility (16). Hence, there is considerable need for more robust analytic approaches for erythrocyte classification, capable of accurate quantitative analysis.

In recent years, important advancements in image classification performance have been made through the implementation of learned feature representations, with convolutional neural networks (CNNs) as a prominent example (18–21). Determination of which ML method to use for a given problem is a decision that may be guided by results from annual benchmark challenges in the ML community. In 2012, work from Krizhevsky et al. (18) demonstrated a significant improvement in performance metrics relative to previous methods adapted to the LSVRC-2010 ImageNet data set. Since then, the literature has shown that CNNs have continued to improve and outperform previous implementations through advancements in network structure and depth (21–24).

CNNs rely on input from patterns that mimic human visual recognition fields, known as filters, which are mathematically convolved with the image of interest. In contrast to traditional ANNs taking input across an entire image simultaneously, convolutional filters operate on limited neighborhoods to produce "feature maps" of local patterns, which are combined by subsequent layers into more abstract features (20). This strategy of local hierarchical feature extraction allows models to be less sensitive to perturbations in image orientation, size, or rotation and reduces the number of examples needed to train a model, leading to model generalizability and better performance (18). Although CNNs are currently a predominant ML technique for complex image recognition tasks, their performance as applied to the morphologic classification of erythrocytes has not been described.

New data have demonstrated that the performance of CNNs can be improved by increasing their "depth," or number of layers. Early implementations of multilayer neural networks consisted of 5 to 10 layers of learned features (24, 25). Modifications in training techniques, connection patterns, and improved computational power have gradually allowed increasing network depth with the potential for improved performance. However, as the depth of CNNs increases, information used to train the network (i.e., adjustment of feature weights) can diminish, causing gradient-based methods to remain within local error minima and fail to converge (26, 27). To address this issue, recent methods have evolved in which "shortcut" connections are added to aid information flow between shallow and deep layers, allowing gradients to persist throughout the network and yielding improvements in performance on a variety of benchmark tasks (28).

Given these recent developments in ML, we hypothesized that CNNs could be used to develop high-performance classification models and serve to quantify erythrocyte morphologic profiles with high single-cell resolution accuracy. Automated classification of erythrocytes from digital images by capable ML approaches would augment the throughput, precision, and clinical value of erythrocyte morphologic analysis. With this in mind, we sought to determine the "off-the-shelf" performance metrics of a recent approach to neural networks applied to morphologic classification of erythrocytes. In particular, we evaluated single erythrocyte subtype recognition in digitized blood smears through the implementation of a newly published architecture with "very deep" structure and dense shortcut connections known as DenseNet (28).

Materials and Methods

HARDWARE AND OPERATING SYSTEMS

Training CNNs requires a considerable amount of computation during the development phase, even if performed on high-performance central processing units.

2 Nonstandard abbreviations: ML, machine learning; ANN, artificial neural network; CNN, convolutional neural network; TP, true positive; TN, true negative; FP, false positive; FN, false negative; CCF, correct classification frequency; NLL, negative log loss.


Fig. 1. Process diagram. Labeled erythrocytes were separated into training and test sets (A). The training data set was further subdivided into training and validation sets (B). DenseNet was trained on 3 instances using training data and evaluated on validation data (C). Best performing models were selected for predictions on unseen data (D). Predicted probabilities were averaged across replicates for ensemble predictions (E).
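The 80:20 split into training and test sets, and the further 80:20 split of training data into training and validation sets (Fig. 1, A and B), can be sketched as follows. The helper function, seed, and integer stand-ins for labeled cell images are our illustrative assumptions, not the study's code; only the total of 3737 labeled cells comes from the paper.

```python
import random

# Sketch of the 80:20 / 80:20 splits in Fig. 1 (illustrative; the seed and
# helper are our own, and integers stand in for labeled cell images).

def split(items, frac, seed=42):
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * frac)
    return items[:cut], items[cut:]

cells = list(range(3737))                # stand-ins for the labeled cells
train_full, test = split(cells, 0.8)     # 80:20 train/test
train, val = split(train_full, 0.8)      # further 80:20 train/validation

print(len(train_full), len(test))        # -> 2989 748
print(len(train), len(val))              # -> 2391 598
```

Note that with this arithmetic the train/test sizes (2989 and 748) match those reported in the Results; the exact validation count depends on how the second split is rounded.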

To accelerate development, it is standard practice to train deep neural networks on graphics processing units. Accordingly, the computation for training DenseNet for this study was performed on a local Linux server (Intel Core i7, 8 processing cores) running Ubuntu (version 14.04) with an NVidia Titan X graphics processing unit, CUDA Toolkit 7.5, and cuDNN RC5.

DATA SET CURATION

Supervised ML relies on accurately labeled data to learn which feature representations produce the most successful prediction of the target. The work flow for our study involved the curation of a training data set followed by multiple preprocessing steps. First, specimens were selected during routine clinical work flow based on the prevalence of 1 or more types of preselected morphologies. Wide-field images were obtained by extracting standardized JPEG-compressed images of peripheral blood smears at 100× magnification from a CellaVision Slide Scanning Unit and were uploaded to a custom-built cloud-based Web application for labeling (29). Images of individual erythrocytes were labeled as 1 of 10 possible morphologic classes.

The classification of erythrocytes was performed by a laboratory medicine resident and an attending hematopathologist. Because of the large number of class examples that are required to achieve good performance with CNNs, smears identified on routine clinical slide review with a high prevalence of abnormal morphologies were preferentially selected for use. For any given sample image, cells were labeled randomly for training and validation. The 10 class labels chosen were based on clinical significance and availability: normal, echinocytes, dacrocytes (teardrop cells), schistocytes, elliptocytes, acanthocytes, codocytes (target cells), stomatocytes, spherocytes, and overlap. "Overlap" refers to 2 erythrocytes with no evidence of physical separation between them, included owing to their high prevalence on peripheral blood smears.

A custom graphical user interface was developed for the labeling process, which allows annotators to pan and zoom within the wide-field images and to mark central x–y coordinates of erythrocytes with a class label. These coordinates were used to automatically crop individual erythrocytes into 70 × 70, 3-channel byte-arrays. The Web application was created in Python (version 2.7) using the open-source Web framework Django (30). The application was deployed and accessed within a Docker (Docker, Inc.) container running on a single cloud-based Linux application server (Amazon Web Services).

IMAGE PREPROCESSING

The initial data set was created by combining the labels provided by the 2 annotators. Any cells with discordant labels were discarded. Labeled erythrocyte images were then randomly divided into training and test sets with an 80:20 split. Training data were further subdivided 80:20 into training and validation data sets (Fig. 1).
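The cropping step described above (a 70 × 70, 3-channel patch centered on the annotator's x–y click) might look like the following sketch; the function name, the clamping behavior at image borders, and the placeholder smear image are our assumptions, not the authors' implementation.

```python
import numpy as np

# Sketch of excising a 70x70, 3-channel byte-array centered on a labeled
# x-y coordinate (helper name and border handling are our assumptions).

def crop_cell(image, x, y, size=70):
    half = size // 2
    h, w, _ = image.shape
    # clamp so the window stays fully inside the wide-field image
    top = min(max(y - half, 0), h - size)
    left = min(max(x - half, 0), w - size)
    return image[top:top + size, left:left + size, :].astype(np.uint8)

wide_field = np.zeros((600, 800, 3), dtype=np.uint8)  # placeholder smear image
patch = crop_cell(wide_field, x=15, y=590)            # click near a corner
print(patch.shape)                                    # -> (70, 70, 3)
```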
Performance measured on data used for parameter optimization may be overly optimistic, and as a result, a model that is selected based on such performance metrics may generalize poorly to new data. Therefore, a validation set is used to perform model selection while reserving the test data for final performance measurement.

Lastly, label-preserving transformations, e.g., rotation and mirroring, were performed on the training set. This data augmentation process effectively increases the number of images available during training and, importantly, reduces orientation-specific learning, thereby reducing overfitting for morphologies with limited training data.

NETWORK IMPLEMENTATION

Implementation of DenseNet was done using the open-source Python packages Theano (31) and Lasagne (32). The architecture of DenseNet is based on the work described by Huang et al. (28) and contains 39 convolutional layers. Layers are grouped into "blocks" in which each layer receives input not only from the previous layer but also from all previous layers within the same block (dense shortcut connections). All convolutional kernels within the blocks are 3 × 3, and the blocks are joined using "transition layers" consisting of a 1 × 1 convolution followed by an average-pooling down-sampling layer. Finally, a global averaging layer followed by a fully connected layer with a softmax nonlinearity produces a single output probability distribution for the predicted class (33).

TRAINING PROTOCOL

Training was performed using minibatch stochastic gradient descent (34, 35). Performance during training was monitored on a validation set created by a random 80:20 split of the training set. The network was trained for 300 epochs (iterations over the full training set) with criteria to decrease the learning rate if validation metrics did not improve after 30 iterations. Model parameters were saved as a "snapshot" after each reduction of validation error. The snapshot that performed best on the validation set was used on unseen data. Network hyperparameters were based on commonly used settings and were not extensively tuned for this task (see Table 1 in the Data Supplement that accompanies the online version of this article at http://www.clinchem.org/content/vol63/issue12).

Lastly, 3 replicates of DenseNet were trained using unique random seed initializers to evaluate reproducibility and calculate ensemble predictions as described below.

TEST PROTOCOL

It is common practice with ML to use a combination, or ensemble, of model predictions to improve performance metrics. As model errors are expected to be somewhat uncorrelated, averaging or otherwise combining predictions can create a consensus model more reliable than its components. Following completion of training for each replicate, the model with the lowest error rate on validation data was selected to perform final predictions of morphologic class on the test data set. For each erythrocyte image, a probability distribution was generated across the 10 possible classes, and the class with the highest probability was selected. A simple average of predicted probabilities from the 3 replicates was used to produce an ensemble prediction.

Ensemble predictions on test data were analyzed for true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). These parameters were used to calculate a variety of performance metrics, including specificity, recall (equivalent to "sensitivity"), precision (equivalent to "positive predictive value"), and the correct classification frequency (CCF):

Specificity = TN / (TN + FP)

Recall = TP / (TP + FN)

Precision = TP / (TP + FP)

CCF = (TP + TN) / (TP + FP + TN + FN)

The F1 score represents the harmonic mean of precision and recall. This metric is thought to offer a more conservative view of model performance relative to CCF when the class distribution is unequal (36). Cross-entropy, or negative log loss (NLL), was calculated for training and validation predictions and was plotted as a function of training iterations to visually assess for overfitting. F1 score and NLL are defined below:

F1 score = 2 · (precision · recall) / (precision + recall)

NLL = −log p_y

Here p_y corresponds to the predicted probability assigned to a sample's true label y, and the NLL is averaged over all samples in the set. This metric is minimized by correct predictions and particularly penalizes predictions that are confident but incorrect.

Lastly, probability distributions for each prediction on test data are represented as a scalar value between 0 and 1. As a probability is assigned to each class, this value can be viewed as a measure of certainty regarding each prediction. It is standard practice to evaluate the performance of a model with consideration of the second and third most likely predictions, as this can provide further insight into model performance. Accordingly, CCF, including top-2 and top-3 model predictions, was also calculated for ensemble predictions on unseen test data.

Results

DATA SET

Peripheral smears from 97 specimens were uploaded to the custom Web application for labeling. The initial data set consisted of 4032 labeled cells, with the number of cells labeled per slide ranging from 1 to 329 and a median value of 18. As previously noted, cells with labels discordant between reviewers were removed, reducing the initial data set to 3737. This data set was divided 80:20 into training and test sets that consisted of 2989 and 748 cells, respectively (Fig. 1). The distribution of classes was skewed toward normal, schistocyte, and target cell morphologies, which was primarily a function of prevalence among smear specimens selected for inclusion (Table 1).
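The per-class metrics defined in the test protocol can be computed one-vs-rest from a confusion matrix whose rows are actual classes and columns are predicted classes. The sketch below is our own helper, not the study's code; the 2-class confusion matrix and probabilities are toy numbers.

```python
import numpy as np

# One-vs-rest TP/FP/FN/TN per class, then the metrics from the test
# protocol (sketch with toy numbers, not the study's code).

def per_class_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as class c, actually another
    fn = cm.sum(axis=1) - tp          # actually class c, predicted another
    tn = cm.sum() - tp - fp - fn
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "ccf": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),   # algebraically 2PR/(P+R)
    }

def nll(probs, labels):
    # mean negative log probability assigned to each sample's true label
    probs = np.asarray(probs, dtype=float)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# 2-class toy confusion matrix: rows = actual, columns = predicted
m = per_class_metrics([[8, 2], [1, 9]])
print(m["recall"])     # -> [0.8 0.9]
print(m["precision"])  # per-class positive predictive value
print(nll([[0.9, 0.1], [0.2, 0.8]], [0, 1]))   # small: confident and correct
```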


Table 1. Raw number of morphologic classes relative to each respective data set.a

Data set       Training  Validation  Test
Normal              728          81   203
Echinocyte          226          25    63
Dacrocyte            64           7    18
Schistocyte         546          61   152
Elliptocyte          66           7    18
Acanthocyte         120          13    33
Target cell         523          58   145
Stomatocyte          79           9    22
Spherocyte          173          19    48
Overlap             166          18    46

a Cells were randomly distributed 80:20 between training and test sets, respectively. Training cells were further subdivided 80:20 into training and validation sets, respectively.

Fig. 2. The geometric average of negative log loss for training and test predictions across the 3 training replicates of DenseNet is depicted on the left y axis, and the harmonic mean of CCF on the test set on the opposite y axis. Both are plotted as a function of training epochs for DenseNet.

TRAINING

DenseNet was trained 3 times, for 300 epochs each, and took an average of 227 (±2.5) min to complete, with an average of 45 s per iteration. The best validation performance was observed after the 288th, 295th, and 292nd epochs for each training replicate, with validation accuracies of 97.5%, 96.6%, and 97.19%, respectively, and an average NLL of 0.096. Subsequently, there was no measurable decrease of validation loss, indicating no empirical evidence of model improvement. Therefore, model parameters from these iterations were used to perform predictions on the test data set following completion of training, with subsequent averaging of predicted probabilities for calculation of ensemble predictions.

Because of known issues with overtraining of CNNs leading to poor generalization, average NLL and CCF rates were plotted as a function of training epochs to evaluate visually for overfitting (Fig. 2). The minor divergence between validation and training NLL noted in the later iterations suggested negligible overfitting; therefore, overfitting was deemed unlikely to have had an important effect on test set performance.

TEST PERFORMANCE METRICS

Ensemble model predictions on unseen data resulted in 23 misclassification errors out of the 748 cells in the test set (96.92% CCF). The CCF ranged from 66.1% to 100.0% across the 10 morphologic classes, with a harmonic mean of 90.60%. The highest rate of misclassification occurred with dacrocytes and acanthocytes, both of which were predominantly misclassified as schistocytes (Table 2). Analysis with consideration of top-2 and top-3 model suggestions revealed 99.20% and 99.60% CCF, respectively.

Recall and precision varied across morphologic classes and ranged from 61% to 100% and 85% to 100%, with harmonic means of 90.52% and 95.08%, respectively. Dacrocytes demonstrated the lowest recall (61%) and the lowest precision (85%). Specificity was less variable, demonstrating consistently high values ranging from 98% to 100% across the 10 morphologic classes. F1 score ranged from 71% to 100%, with dacrocytes also yielding the lowest performance (Table 3). Poor performance of the dacrocyte class is most likely attributable to a small number of labeled training examples (n = 64), although, as discussed below (see Discussion), intrinsic ambiguity in human classification can compound this effect. Increasing the number of labeled cells for underrepresented morphologies would most likely result in improved performance.

Misclassification errors were noted to be consistent across the 3 DenseNet model replicates. Errors were extracted with original training labels and top-3 ensemble predictions with corresponding probabilities (Fig. 3). Three of these ensemble predictions failed to produce a prediction in the top-3 suggestions to match the training label (Fig. 3, E, H, and I). On post hoc review, 3 of the cells were noted to have training labels that were deemed to be inaccurate (Fig. 3, E, H, and M). An accurate suggestion from the model was seen in the top-3 predictions for 2 of the mislabeled cells (Fig. 3, H and M). The cell denoted in Fig. 3E would have been retrospectively classified as an acanthocyte, which was not a class label that was found in the top-3 model suggestions.

Although subjective, visual inspection of results can provide insight into sources of model error. Inspection of

Table 2. Ensemble model predictions on unseen test data.a

Predicted classes (columns, left to right): Normal, Echinocyte, Dacrocyte, Schistocyte, Elliptocyte, Acanthocyte, Target cell, Stomatocyte, Spherocyte, Overlap.

Actual class, with nonzero counts listed in column order:
  Normal        202, 1
  Echinocyte    1, 62
  Dacrocyte     11, 5, 1, 1
  Schistocyte   1, 148, 2, 1
  Elliptocyte   1, 17
  Acanthocyte   4, 28, 1
  Target cell   145
  Stomatocyte   1, 21
  Spherocyte    1, 1, 1, 45
  Overlap       46

a Columns indicate ensemble predictions of DenseNet, and rows represent the human reference standard. Elements on the main diagonal represent agreement in predicted classifications. Elements off the main diagonal represent disagreements with the human reference standard. Highlighted cells represent errors consistently made across all 3 training replicates of DenseNet.
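A confusion matrix like Table 2 is tallied from paired actual/predicted labels. A minimal sketch with toy integer labels (our own helper, not the study's code):

```python
import numpy as np

# Tally a confusion matrix like Table 2: rows index the human reference
# ("actual") label, columns index the model prediction (toy labels here).

def confusion_matrix(actual, predicted, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

actual    = [0, 0, 1, 2, 2, 2]
predicted = [0, 1, 1, 2, 2, 0]
cm = confusion_matrix(actual, predicted, 3)
print(cm.trace(), "of", cm.sum(), "correct")   # -> 4 of 6 correct
```

The diagonal sum over the total is the overall correct classification rate (96.92% for the 748-cell test set in the text above).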

consistently misclassified cells suggests that a proportion of model failure could be attributed to the variable accuracy of human labeling. In addition, overdependence of the network on local features is also suggested. In Fig. 3K, the lack of central pallor, in combination with the radially symmetrical nature of the cell and the lack of the sharp cytoplasmic extension of the dacrocyte, may reasonably produce a top-3 prediction profile similar to the one we see. Also, the lateral projections of a cell, as seen in Fig. 3H, may activate neurons that favor a dacrocyte prediction. Lastly, as seen with Fig. 3C, a normal cell was classified as a target cell, which is likely attributable to an overlying platelet.

Table 3. F1 score, precision, and recall of ensemble predictions of DenseNet following 3 training iterations.

               F1 score  Precision  Recall
Normal             1.00       1.00    1.00
Echinocyte         0.99       1.00    0.98
Dacrocyte          0.71       0.85    0.61
Schistocyte        0.95       0.93    0.97
Elliptocyte        0.94       0.94    0.94
Acanthocyte        0.89       0.93    0.85
Target cell        0.99       0.99    1.00
Stomatocyte        0.95       0.95    0.95
Spherocyte         0.94       0.94    0.94
Overlap            1.00       1.00    1.00

Discussion

In 1976, Bacus et al. published one of the earliest papers using modeling of morphometric features to discriminate between morphologic classes of erythrocytes by a priori feature engineering (13). In their study, 5 feature sets were used to demonstrate separation between possible classes, with the caveat that performance metrics were derived from the training data, which results in an overestimation of model accuracy. In addition, their model required that 11 class types be condensed into 6 for their feature set to provide adequate separation between classes.

Implementation of feature engineering was also explored by Albertini et al., who created a model based on 4 preselected erythrocyte indices to discern between 7 morphologic classes. Results from their study demonstrated sensitivities that ranged from 32% to 90% with a geometric mean of 61%, indicating it was not fully optimized with regard to statistical accuracy (12). More recently, published analyses of performance by ANN-based morphologic classification of erythrocytes have shown preclassification sensitivity and specificity ranging from 17.6% to 100.0% and 46.3% to 100.0%, respectively, based on qualitative categorization without reference to single-cell classification accuracy (17).

In this study, we have shown that implementations of modern very deep CNNs offer a powerful approach for automating quantitative single-cell analysis of erythrocyte populations on peripheral blood smears. Without significant modification, DenseNet was capable of high


Fig. 3. Misclassification predictions on unseen test data representing errors that were made consistently across all 3 training replicates of DenseNet. “True” labels are indicated at the top of each image. Top-3 ensemble model predictions with associated probability scores are listed below each image.
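The top-3 ensemble predictions shown in Fig. 3 combine two steps from the test protocol: the replicates' predicted probability distributions are averaged, and the k classes with the highest ensemble probability are read off. A minimal sketch with toy numbers (not the study's data):

```python
import numpy as np

# Sketch of ensemble top-k prediction: average the replicates' predicted
# probabilities, then keep the k highest-probability classes per cell.

replicate_probs = np.array([
    [[0.6, 0.3, 0.1]],           # replicate 1: one cell, 3 classes
    [[0.5, 0.4, 0.1]],           # replicate 2
    [[0.7, 0.2, 0.1]],           # replicate 3
])
ensemble = replicate_probs.mean(axis=0)   # shape (n_cells, n_classes)

def top_k(probs, k):
    # class indices sorted by descending ensemble probability, first k kept
    return np.argsort(probs, axis=1)[:, ::-1][:, :k]

print(ensemble.argmax(axis=1))   # -> [0]  (top-1 prediction)
print(top_k(ensemble, 2))        # -> [[0 1]]  (top-2 predictions)
```

A prediction counts toward the top-2 or top-3 CCF whenever the reference label appears anywhere in that index list.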

performance metrics when applied to morphologic classification of erythrocytes. Low FP and FN rates indicate that DenseNet is capable of accurate interpretation of previously unseen data. Moreover, given that this implementation of DenseNet underwent minimal hyperparameter tuning for this task, opportunities for optimization of predictive performance are likely, making expected achievable accuracy even greater.

Pertaining to the lowest CCF results, studies have shown that increasing the number of training examples for computer vision tasks improves performance metrics, so the success of image classification tasks can be largely dependent on the availability of labeled data (18). Our study relied on a custom Web application to facilitate efficient labeling of cells, which allowed curation of 2989 cells for our training set. However, because of the variable prevalence of cell types, the distribution of labels was not homogeneous, and 4 of the 10 classes (dacrocytes, elliptocytes, acanthocytes, and stomatocytes) had <150 cells in their training set. Consequently, these classes demonstrated the lowest F1 measures, ranging from 0.71 to 0.95.

It is expected that the metrics for these classes would be substantially improved with an expanded training set. Nonetheless, our results demonstrate that even with limited quantities of training examples, high performance can still be achieved by very deep CNNs. These results are an important consideration for deep learning implementations in hematopathology and for applications in clinical medicine, where training examples of particular class types may be few for rare entities. However, high accuracies in this setting should be interpreted with caution, as models trained with few class examples may not manifest inaccuracies until tested against future data sets.

Another prominent driver of inaccuracies in deep neural networks is overfitting, which may decrease model generalizability to future data (37). Overfitting can be suspected when the training error continues to decline throughout training while the error on validation data incrementally increases. In these instances, it is believed that the model has memorized patterns of the training data but has failed to generalize its predictions to new input data. In our study, visual inspection of error plots suggests negligible overfitting with this network, which was unlikely to have had a significant effect on test set performance. Overfitting would be suspected in the case of a positive slope of the validation error curve with a concurrent negative or zero slope of the training error, increasing the gap between the 2. Although we did not observe this in our study, some overfitting is often unavoidable with deep learning models, and its impact on generalizability can only be elucidated by testing against alternative data sets.

Although the need for further optimization persists, these results remain superior to previously published efforts at automated erythrocyte classification (16, 17). Notably, the results indicate performance could be improved through larger training sets, avoidance of human classification errors during training, and optimization of tunable network parameters. However, the fact that DenseNet relies on learned feature representations rather than the computational classification of individual features, i.e., feature engineering, eliminates reliance on fallible human determination of the feature representations relevant for a given task. In principle, feature learning may help overcome a significant limitation of comparable approaches to automated erythrocyte classification.

ML is a rapidly moving field with constantly emerging methodologies and technologies that may be readily adaptable to high-performance automation of current tests. Our results demonstrate that implementations of very deep CNNs with dense shortcut connections between layers may be a powerful tool for automated erythrocyte classification. The high degree of accuracy indicates the potential for improved standardization and intrarater reliability for subjective image classification tasks in the clinical laboratory. Although ML methods are commonly outperformed within relatively short time frames, performance appears to have reached a threshold of practical implementation as a clinical diagnostic aid. Future work will seek to expand classes and optimize practical performance in a clinical environment as a prelude to full implementation as a clinical tool.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Authors' Disclosures or Potential Conflicts of Interest: No authors declared any potential conflicts of interest.

Role of Sponsor: No sponsor was declared.

References

1. Bain BJ. Diagnosis from the blood smear. N Engl J Med 2005;353:498–507.
2. Gallagher PG. Red cell membrane disorders. ASH Education Program Book 2005;2005:13–8.
3. Ford J. Red blood cell morphology. Int J Lab Hematol 2013;35:351–7.
4. Ceelie H, Dinkelaar RB, van Gelder W. Examination of peripheral blood films using automated microscopy; evaluation of Diffmaster Octavia and Cellavision DM96. J Clin Pathol 2007;60:72–9.
5. Rumke CL. Imprecision of ratio-derived differential leukocyte counts. Blood Cells 1985;11:311–4.
6. Kratz A, Bengtsson H-I, Casey JE, Keefe JM, Beatrice GH, Grzybek DY, et al. Performance evaluation of the Cellavision DM96 system. Am J Clin Pathol 2005;124:770–81.
7. Rümke C. Statistical reflections on finding atypical cells. Blood Cells 1984;11:141–4.
8. Koepke J, Dotson M, Shifman M. A critical evaluation of the manual/visual differential leukocyte counting method. Blood Cells 1984;11:173–86.
9. Barnes PW, McFadden SL, Machin SJ, Simson E. The international consensus group for hematology review: suggested criteria for action following automated CBC and WBC differential analysis. Lab Hematol 2005;11:83–90.
10. Horn CL, Mansoor A, Wood B, Nelson H, Higa D, Lee LH, Naugler C. Performance of the Cellavision(®) DM96 system for detecting red blood cell morphologic abnormalities. J Pathol Inform 2015;6:11.
11. Prewitt J, Mendelsohn ML. The analysis of cell images. Ann N Y Acad Sci 1966;128:1035–53.
12. Albertini MC, Teodori L, Piatti E, Piacentini MP, Accorsi A, Rocchi MB. Automated analysis of morphometric parameters for accurate definition of erythrocyte cell shape. Cytometry A 2003;52:12–8.
13. Bacus J, Belanger M, Aggarwal R, Trobaugh F. Image processing for automated erythrocyte classification. J Histochem Cytochem 1976;24:195–201.
14. Lee LH, Mansoor A, Wood B, Nelson H, Higa D, Naugler C. Performance of Cellavision DM96 in leukocyte classification. J Pathol Inform 2013;4:14.
15. 510(k) premarket notification: Diffmaster Octavia automatic hematology analyzer. 2000; 510(k) Number: K003301.
16. Criel M, Godefroid M, Deckers B, Devos H, Cauwelier B, Emmerechts J. Evaluation of the red blood cell advanced software application on the Cellavision DM96. Int J Lab Hematol 2016;38:366–74.
17. Egelé A, Stouten K, Heul-Nieuwenhuijsen L, Bruin L, Teuns R, Gelder W, Riedl J. Classification of several morphological abnormalities by DM96 digital imaging. Int J Lab Hematol 2016;38:e98–101.
18. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Adv Neural Inform Proc Sys 2012;25:1097–105.
19. Iandola F, Moskewicz M, Karayev S, Girshick R, Darrell T, Keutzer K. Densenet: implementing efficient convnet descriptor pyramids. arXiv 2014;1404.1869.
20. Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. European Conference on Computer Vision. Zurich: Springer; 2014:818–33.
21. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition 2016;770–8.
22. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. Imagenet large scale visual recognition challenge. Int J Comput Vis 2015;115:211–52.
23. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 1998;86:2278–324.


24. Srivastava RK, Greff K, Schmidhuber J. Training very deep networks. Adv Neural Inf Process Syst 2015;28:2377–85.
25. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv 2015;1409.1556v6.
26. Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. Aistats 2010;9:249–56.
27. Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Networks 1994;5:157–66.
28. Huang G, Liu Z, Weinberger KQ, van der Maaten L. Densely connected convolutional networks. arXiv 2016;1608.06993.
29. Briggs C, Longair I, Slavik M, Thwaite K, Mills R, Thavaraja V, et al. Can automated blood film analysis replace the manual differential? An evaluation of the Cellavision DM96 automated image analysis system. Int J Lab Hematol 2009;31:48–60.
30. Holovaty A, Kaplan-Moss J. The definitive guide to Django: web development done right. http://www.springer.com/in/book/9781430203315 (Accessed August 2017).
31. Al-Rfou R, Alain G, Almahairi A, Angermueller C, Bahdanau D, Ballas N, et al. Theano: a Python framework for fast computation of mathematical expressions. arXiv 2016;1605.02688.
32. Dieleman S, Schlüter J, Raffel C, Olson E, Sønderby SK, Nouri D, et al. Lasagne: first release. Geneva: Zenodo; 2015.
33. Janowczyk A, Madabhushi A. Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. J Pathol Inform 2016;7:29.
34. Robbins H, Monro S. A stochastic approximation method. Ann Math Stat 1951:400–7.
35. Le Q, Ngiam J, Coates A, Lahiri A, Prochnow B, Le QV, Ng AY. On optimization methods for deep learning. Proceedings of the 28th International Conference on Machine Learning, 2011;ICML-11:265–72.
36. Powers DM. Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation. J Machine Learn Tech 2011;2:37–63.
37. Lawrence S, Giles CL, Tsoi AC. Lessons in neural network training: overfitting may be harder than expected. Proceedings of the Fourteenth National Conference on Artificial Intelligence, AAAI-97. Menlo Park (CA): AAAI Press; 1997:540–5.
