2017 IEEE 13th International Conference on eScience

Towards a Fully Automated Diagnostic System for Orthodontic Treatment in Dentistry

Seiya Murata∗, Chonho Lee†, Chihiro Tanikawa‡ and Susumu Date†

∗Graduate School of Information Science and Technology, Osaka University. Email: [email protected]
†Cybermedia Center, Osaka University. Email: {leech, date}@cmc.osaka-u.ac.jp
‡Graduate School of Dentistry, Osaka University. Email: [email protected]

Abstract—A deep learning technique has emerged as a successful approach for diagnostic imaging. Along with the increasing demands for dental healthcare, the automation of diagnostic imaging is increasingly desired in the field of orthodontics for many reasons (e.g., remote assessment, cost reduction, etc.). However, orthodontic diagnoses generally require dental and medical scientists to diagnose a patient from a comprehensive perspective, by looking at the mouth and face from different angles and assessing various features. This assessment process takes a great deal of time even for a single patient, and tends to generate variation in the diagnosis among dental and medical scientists. In this paper, the authors propose a deep learning model to automate diagnostic imaging, which provides an objective morphological assessment of facial features for orthodontic treatment. The automated diagnostic imaging system dramatically reduces the time needed for the assessment process. It also helps provide an objective diagnosis that is important for dental and medical scientists as well as their patients because the diagnosis directly affects the treatment plan, treatment priorities, and even insurance coverage. The proposed deep learning model outperforms a conventional convolutional neural network model in its assessment accuracy. Additionally, the authors present a work-in-progress development of a data science platform with a secure data staging mechanism, which supports computation for training our proposed deep learning model. The platform is expected to allow users (e.g., dental and medical scientists) to securely share data and flexibly conduct their data analytics by running advanced machine learning algorithms (e.g., deep learning) on high performance computing resources (e.g., a GPU cluster).

I. INTRODUCTION

The rapidly increasing availability of medical and dental data is becoming a driving force for the adoption of data-driven approaches, which generates the motivation to automate medical tasks such as diagnostic imaging, disease progression modelling, and cohort analysis. In particular, diagnostic imaging has been achieving success in diagnosing the presence of tuberculosis in chest x-rays [1], detecting diabetic retinopathy from retinal photographs [2] as well as locating breast cancer in pathology images [3], all by utilizing deep learning techniques [4].

Automated diagnostic imaging is eagerly desired in the field of orthodontics as well, along with the increasing demands for dental healthcare, becoming one of the factors for all forms of healthcare. For example, such imaging enables anyone to self-check the degree of abnormality from oral and facial images, which are the causes of masticatory dysfunction, apnea syndromes, pyorrhea, etc. Moreover, it leads to an objective diagnosis that is important for dental and medical scientists and their patients because the diagnosis directly affects the treatment plan, treatment priorities, and insurance coverage.

However, dental and medical scientists are struggling against time and various limits on accuracy when assessing their patients. In an orthodontic clinic, a dentist generally carries out the examination, consultation, and treatment. In addition, other dentists and specialists meet and spend a great deal of time assessing patients and creating their medical records. For example, in Osaka University Dental Hospital1, more than a hundred patients visit every day, and dentists regularly spend two to three hours creating a medical record for just one patient.

Furthermore, an orthodontic diagnosis has a specific difficulty in that dentists must diagnose a patient comprehensively by looking at the entire face while assessing multiple parts of the face from different angles, rather than simply targeting one part of the mouth and face. For instance, a dentist must first look at the frontal face of a patient and examine the patient for drooping eyelid and/or distortion of the nose; the dentist must confirm maxillary protrusion and/or mandibular protrusion from the side of the face; check the alignment of the teeth in the oral exam; and then, finally, give the patient a severity score. This complex assessment process consequently brings variation into the diagnosis among different dentists.

For this reason, and in order to improve the assessment speed and accuracy, the authors propose a deep learning model to automate diagnostic imaging. This model provides an objective morphological assessment of facial features. To the best of our knowledge, this is the first attempt to apply a deep learning technique to an orthodontic assessment. This automation dramatically reduces the assessment workload for the dentists and also prevents variation in the diagnosis.

1 http://hospital.dent.osaka-u.ac.jp

To train the proposed model, computational support for a workload on a large-sized dataset is necessary. Therefore, the authors are developing a data science platform with a secure data staging mechanism via a VPN connection between the Dental Hospital1 and the Cybermedia Center2 at Osaka University. The platform is expected to allow us to securely transfer training data (i.e., patients' facial images) to compute nodes, then to process such data on a GPU cluster, and finally, to obtain the trained model. After the completion of training, the data is deleted and the network session is closed. While the promise of big data analytics is materializing, there is still a non-negligible gap between its potential and its usability in practice due to various factors that are inherent in the data itself, such as scale, heterogeneity and privacy. The authors aim to fill this gap and make it possible for authorized users (e.g., dentists, clinical experts and researchers) to securely share data among themselves and to flexibly conduct their data analytics by running advanced machine learning (ML) algorithms on high performance computing (HPC) resources.

2 http://www.cmc.osaka-u.ac.jp

In this paper, we present the proposed deep learning-based diagnostic imaging system, and also introduce the envisioned data science platform. The remainder of this paper is organized as follows. Section II introduces some preliminaries and related work. Section III describes the deep learning model, and Section IV shows its performance accuracy and discusses its practical use. Finally, we present a data science platform that supports the computation of the automated diagnostic system in Section V, followed by our conclusions.

II. PRELIMINARIES AND RELATED WORK

Deep Learning: In recent years, the deep learning technique has become the most popular and successful approach for problems in machine learning and image recognition. There are two well-known models, called the convolutional neural network (CNN) and the recurrent neural network (RNN). The CNN is generally composed of convolutional layers, pooling layers and fully-connected layers. The convolution of images and filters makes it possible to extract the target features and to preserve the spatial relationship between pixels under image distortion. The pooling layer is normally operated in-between successive convolution layers to reduce the spatial size of the features and the number of model parameters. The fully-connected layer is a traditional neural network layer where the features of the next layer are a linear combination of the features of the previous layer. The RNN is a neural network model where neurons (i.e., nodes in the network) have recurrent connections. It is widely used for modelling time-series data by capturing the short and long term temporal dependencies in the data. Thus, the prediction of the next time step is affected by that of the previous time step.

A variety of works have been proposed in diagnostic imaging [1]–[3] using a CNN model. In this paper, we propose a hybrid model of the CNN and RNN, which takes images (e.g., patients' facial images) and generates multi-labels (e.g., the morphological assessment of facial features). To the best of our knowledge, this is the first attempt to apply a deep learning technique to an orthodontic assessment.

Transfer Learning: In practice, it is very difficult to collect and prepare a large number of medical images (e.g., X-rays, CT scans and MRIs) to train a deep learning model from scratch. An insufficient amount of training data causes low performance of analytics, e.g., classification and prediction accuracy. In such a case, a transfer learning approach may help to improve performance [5], [6]. Transfer learning attempts to gain knowledge from one or more source tasks and apply the knowledge to a target task. There are a few CNN-based models pretrained on a very large dataset, e.g., ImageNet, for image classification. The size of our dataset is relatively small compared to those models. We initialize our model with the ImageNet weights and fine-tune layers (especially later layers) using our dataset.

Multi-label Image Classification: In general, medical images may contain multiple regions of interest to be evaluated. As described above, an orthodontist diagnoses a patient based on the assessment results of various facial parts from several different facial images. Thus, for automated diagnostic imaging, a typical single-label (binary or multi-class) image classification model should be extended to solve a multi-label image classification problem. The multi-labels correspond to the assessment results of facial parts.

A common approach to multi-label image classification is to extend the CNN so that it handles multiple single-label image classification problems [7]. However, this approach has drawbacks in its computational efficiency as follows. First, the number of parameters to be learned keeps increasing, which is caused by the large number of label combinations; the label space with m labels is exponentially expanded from O(2) to O(2^m). Secondly, the approach fails to learn the dependency between multiple labels. It does not consider two relevant labels, for example, "drooping of the right eyelid" and "nose distortion to the right", that can be seen in the same patient.

To overcome these drawbacks, [8], [9] divided an image into sub-images or regions that include candidate targets and applied multiple single-label image classification models to these sub-images. This approach preliminarily requires an extra process to crop such regions. Also, because multiple models are independently run for the corresponding targets, there is no clue for obtaining label relationship information. Inspired by [10], which proposes an image captioning model, our model with an RNN sequentially classifies (or assesses) multiple regions without pre-processing such as cropping, and then predicts labels (at later steps in the RNN) based on the previously predicted labels (at earlier steps in the RNN) for the other regions.

Attention Mechanism: For the deep learning model, we employ an RNN with an attention mechanism. The motivation for using the attention mechanism is that a large part of the information is irrelevant to predicting a particular label, so we do not need to use all available information. The mechanism finds the most relevant piece of information (i.e., the subarea of an image) to predict the corresponding label.
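As a concrete illustration of the transfer-learning setup described above, a minimal PyTorch sketch might load an ImageNet-pretrained VGG-19 and unfreeze only its last three convolutional layers. This is our illustration, not the authors' code; the layer indices assume torchvision's VGG-19 layout.

```python
import torch
import torchvision.models as models

# Load VGG-19 pretrained on ImageNet (the source task).
vgg = models.vgg19(pretrained=True)

# Freeze every parameter first, then unfreeze only the last three
# convolutional layers (indices 30, 32, 34 in torchvision's layout)
# so that fine-tuning updates just the later layers.
for param in vgg.parameters():
    param.requires_grad = False
for idx in (30, 32, 34):
    for param in vgg.features[idx].parameters():
        param.requires_grad = True

# The optimizer only receives the parameters left trainable.
optimizer = torch.optim.Adam(
    (p for p in vgg.parameters() if p.requires_grad), lr=1e-4)
```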

In our model, the RNN generates multi-labels one by one, which correspond to the assessment results of facial parts (i.e., subareas of an image). The attention mechanism in the RNN lets the network sequentially focus on a subarea of the image, evaluate it, and then change its focus to some other area of the image. Thus, we know which part of an image affects the corresponding label prediction. This mimics how doctors assess a patient by focusing on multiple parts of the face and then zooming out to the entire face step by step.

End-to-end Training: Shin et al. [11] propose a combined model of CNN and RNN for image annotation of chest X-ray images. The authors first use a publicly available radiology dataset to train the CNNs to classify seventeen disease names. Based on the trained CNN features, the RNNs are trained to describe the contexts of a detected disease. Eventually, they use the weights of a trained pair of CNN/RNN and compute the average feature of all hidden layers in what they call the joint image/text context for composite image labelling. Unlike [11] with its multi-step learning, our model uses end-to-end learning that sequentially generates labels in order from the corresponding parts of the face to the entire face. From feature extraction to learning the desired result, deep learning algorithms act as full pipelines for solving the tasks at hand.

III. A DEEP LEARNING-BASED DIAGNOSTIC IMAGING SYSTEM FOR ORTHODONTIC TREATMENT

This section describes in detail the proposed deep learning model, a key engine for an automated diagnostic imaging system for orthodontic treatment. As mentioned above, we consider an orthodontic diagnosis a multi-label image classification problem. The model is trained on patients' facial images to predict a set of assessments of facial parts such as the eye, nose, mouth, jaw, and so on. In practice, the assessments of different facial parts have a mutual dependency, so we need to design a model that learns this dependency. Thanks to the RNN, each unit of the RNN predicts a label by taking into account previously predicted labels. Inspired by [10], which describes the content of an image, we propose a hybrid model using CNN and RNN with an attention mechanism.

Specifically, the CNN takes an image and extracts the image features. Using the features (a vector representation) obtained at the last convolution layer, the RNN produces a sequence of multi-labels, each of which corresponds to the assessment of one facial part. The attention mechanism in the RNN tells the network which sub-area of the image impacts the prediction of the particular labels. This helps reduce the computational cost by selecting and learning the most relevant parts of the image for the predicted labels. As illustrated in Fig. 1, the subsequent subsections describe the model step by step.

A. Extracting image feature vectors

Our model uses a CNN to extract image feature vectors. For the CNN, we use a VGG-19 model [12], which was trained for an image classification task and published at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). ImageNet is a dataset of general object images, but our dataset is composed of facial images. Thus, we fine-tune the last three convolutional layers when training the entire model on our dataset.

We extract the output of the convolutional layer as image feature vectors, denoted by

A = {a_1, ..., a_L}, a_i ∈ R^D.  (1)

Each image feature a_i is a D-dimensional vector (D is the number of channels) corresponding to the i-th part of the input image divided into L grids.

B. Predicting labels

We use Long Short Term Memory (LSTM) [13], [14] as the implementation of the RNN for predicting labels. The LSTM architecture consists of LSTM blocks as shown in Fig. 2. Each LSTM block at time step t computes a hidden state h_t using three inputs: the previous hidden state h_{t-1}, the previously predicted label y_{t-1}, and a context vector z_t that captures the visual information associated with a particular image location. The context vector is computed by

z_t = φ({α_{i,t}}, {a_i})  (2)

where a_i is an image feature vector extracted for a particular image location i, and α_{i,t} is its corresponding weight at t, representing the relative importance to give to the location i. Note that there are a few ways to interpret the weight [10]. How to model the function φ depends on the interpretation. The next subsection explains in detail how to compute z_t and α_{i,t}.

The initial hidden state h_0 and the memory state c_0 of the LSTM block affect the prediction of the first label. These variables are initially set to the average of the image feature vectors over all locations, computed by separate multilayer perceptrons.

The predicted labels include the assessments of facial parts (e.g., mouth, jaw) and of the entire face. We train the RNN to predict the labels in the order of the mouth, the jaw, and then the entire face. Because each LSTM of the RNN predicts a label by taking into account previously predicted labels, the mutual dependency between the labels is preserved.

C. Attention mechanism

The motivation for using the attention mechanism is that a large part of the information is irrelevant to predicting a particular label, so we do not need to use all available information. The mechanism finds the most relevant piece of information (e.g., the subarea of an image) to lead the model to predict the correct label.

Two types of attention mechanisms are described in [10]: stochastic Hard attention and deterministic Soft attention. Stochastic Hard attention is trained by reinforcement learning, while deterministic Soft attention is trained by standard backpropagation so that it can be trained in an end-to-end manner. Thus, we focus on the Soft attention.
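A minimal sketch of the feature extraction in Section III-A: torchvision's VGG-19, truncated before the fifth max-pooling layer so that a 304 × 224 input yields the grid of Eq. (1). This is an illustration under our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Keep VGG-19 up to (but excluding) the 5th max-pooling layer, so the
# input passes through four max-pools: 304 x 224 -> 19 x 14 spatially.
vgg = models.vgg19(pretrained=True)
feature_extractor = nn.Sequential(*list(vgg.features.children())[:-1])

image = torch.randn(1, 3, 304, 224)       # one (dummy) facial image
fmap = feature_extractor(image)           # shape: (1, 512, 19, 14)

# Flatten the grid into L = 19 * 14 = 266 location vectors a_i of
# dimension D = 512, i.e., the set A = {a_1, ..., a_L} of Eq. (1).
B, D, H, W = fmap.shape
A = fmap.view(B, D, H * W).permute(0, 2, 1)   # shape: (1, 266, 512)
```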

Fig. 1. An illustration of the proposed end-to-end deep learning model. Note that a vector mark is omitted for h_t and z_t for simplicity.
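The end-to-end flow of Fig. 1 can be sketched as a per-step decoding loop. Here, A is the feature grid from the extraction sketch above, and the attention module is the one sketched after the formulas in Section III-C below; all sizes and names are our illustrative assumptions.

```python
import torch
import torch.nn as nn

class LabelDecoder(nn.Module):
    """One LSTM step per label: mouth -> jaw -> entire face (Section III-B)."""

    def __init__(self, attention, feat_dim=512, hidden=512,
                 n_classes=3, n_steps=3):
        super().__init__()
        self.attention = attention                      # e.g., SoftAttention below
        self.n_steps = n_steps
        self.embed = nn.Embedding(n_classes, hidden)    # embeds y_{t-1}
        self.lstm = nn.LSTMCell(feat_dim + hidden, hidden)
        self.classifier = nn.Linear(hidden, n_classes)  # three-grade label
        self.init_h = nn.Linear(feat_dim, hidden)       # MLPs computing h_0, c_0
        self.init_c = nn.Linear(feat_dim, hidden)       # from the mean feature

    def forward(self, A):
        # h_0 and c_0 come from the average image feature (Section III-B).
        mean_feat = A.mean(dim=1)
        h, c = self.init_h(mean_feat), self.init_c(mean_feat)
        y_prev = torch.zeros(A.size(0), dtype=torch.long)  # dummy start token
        logits = []
        for _ in range(self.n_steps):
            z, alpha = self.attention(A, h)         # context vector z_t, Eq. (2)
            h, c = self.lstm(torch.cat([z, self.embed(y_prev)], dim=1), (h, c))
            step_logits = self.classifier(h)        # predict y_t from h_t
            y_prev = step_logits.argmax(dim=1)      # y_t feeds step t+1
            logits.append(step_logits)
        return torch.stack(logits, dim=1)           # (B, 3 steps, 3 classes)
```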

Fig. 2. An LSTM block: y_{t-1} is the previously predicted label, z_t is a context vector, and h_{t-1} is the previous hidden state. σ is a sigmoid function and tanh is a hyperbolic tangent function. ⊕ is an addition and ⊙ is a Hadamard product.

As mentioned above, α_{i,t} is an attention weight of the corresponding image feature a_i at time step t. The Soft attention interprets the weight as the relative importance to give to location i. It is computed as the normalized importance score over all locations by

α_{i,t} = exp(e_{i,t}) / Σ_{k=1}^{L} exp(e_{k,t}).

The importance score of the i-th location is computed from a_i and h_{t-1}, the previous hidden state of the RNN, by

e_{i,t} = MLP(a_i, h_{t-1})

where MLP is a multilayer perceptron that computes a score according to how well a_i and h_{t-1} are matched. Considering the importance scores, or attention weights, Soft attention computes a context vector by

z_t = Σ_{i=1}^{L} α_{i,t} a_i.

This weighted sum over all locations is the expected image feature under the attention distribution; thus, each α_{i,t} is interpreted as the probability that the target label can be predicted from location i of the source image.
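The three formulas above map directly to a small module; a sketch under the same assumptions as the previous listings:

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """e_{i,t} = MLP(a_i, h_{t-1}); alpha = softmax_i(e); z_t = sum_i alpha_{i,t} a_i."""

    def __init__(self, feat_dim=512, hidden=512, attn_dim=256):
        super().__init__()
        self.proj_a = nn.Linear(feat_dim, attn_dim)  # scores how well a_i ...
        self.proj_h = nn.Linear(hidden, attn_dim)    # ... matches h_{t-1}
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, A, h_prev):
        # A: (B, L, D) feature grid; h_prev: (B, hidden) previous hidden state.
        e = self.score(torch.tanh(
            self.proj_a(A) + self.proj_h(h_prev).unsqueeze(1))).squeeze(-1)
        alpha = torch.softmax(e, dim=1)           # normalized over L locations
        z = (alpha.unsqueeze(-1) * A).sum(dim=1)  # context vector z_t: (B, D)
        return z, alpha
```

Because every operation here is differentiable, the whole CNN/RNN/attention stack can be trained end-to-end by backpropagation, which is exactly why the paper prefers Soft over Hard attention.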

IV. EVALUATION

This section explains the dataset and experimental setting used to train the proposed model, and then shows the experimental results to evaluate the accuracy performance of the proposed model.

A. Dataset and Experimental Setting

The training dataset includes a set of patients' front facial images, stored in the Department of Orthodontics at Osaka University Dental Hospital and labelled by dentists and their students. Figure 3 shows a sample patient's image and a list of assessments (i.e., labels) including the facial parts, distortion directions, severity, etc. In this paper, we focus on the mouth and jaw, which are more important than the other parts from the viewpoint of orthodontic treatment, as well as the entire face. The labels are three-grade distortions of the mouth, jaw and entire face. For example, the mouth and jaw labels are: "deviation to the left", "normal", and "deviation to the right". The entire face labels are "mild distortion or normal", "moderate distortion" and "severe distortion". The size of the input image is 304 pixels × 224 pixels.

We were only given 352 patients' images. Because we thought this number was too small, we augmented the dataset with horizontally inverted copies of the available images. The additional images are labelled in the opposite direction to the images in the original assessment. In total, the dataset contains 704 images.

In this experiment, we measure the classification accuracy of each of the facial parts by a 10-fold 90/10 cross validation. Then, we compare the proposed model with multiple CNNs, each of which is independently trained to produce the assessment of one facial part. That CNN is likewise a VGG-19 pretrained on ImageNet.
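The horizontal-flip augmentation described above must also mirror the direction labels; a minimal sketch (the 0/1/2 encoding of left/normal/right is our assumption, not the paper's):

```python
from PIL import Image

# Assumed three-grade encoding for mouth and jaw: 0 = deviation to the
# left, 1 = normal, 2 = deviation to the right. Flipping the image swaps
# left and right; the entire-face severity grade is unchanged.
MIRROR = {0: 2, 1: 1, 2: 0}

def augment_with_flip(path, labels):
    """labels = (mouth, jaw, face); returns the flipped image and labels."""
    flipped = Image.open(path).transpose(Image.FLIP_LEFT_RIGHT)
    mouth, jaw, face = labels
    return flipped, (MIRROR[mouth], MIRROR[jaw], face)
```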

TABLE II
THE CLASSIFICATION ACCURACY (%) OF THE PROPOSED MODEL, TRAINED ON DIFFERENT SIZES OF A DATASET.

# of images for training   Mouth   Jaw    Face   Average
200                        64.3    42.9   64.3   57.1
400                        68.6    45.7   65.7   60.0
634                        74.2    48.6   65.7   62.9
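The 10-fold 90/10 protocol of Section IV-A could be driven by, for example, scikit-learn's KFold; train_model and evaluate below are placeholders for the training and per-part accuracy measurement described above, not the authors' code:

```python
import numpy as np
from sklearn.model_selection import KFold

def train_model(train_idx):
    return None            # placeholder: train the CNN/RNN on one fold

def evaluate(model, test_idx):
    return np.zeros(3)     # placeholder: (mouth, jaw, face) accuracies

indices = np.arange(704)   # the 704 images after augmentation
fold_acc = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(indices):
    model = train_model(train_idx)              # 90% of the data
    fold_acc.append(evaluate(model, test_idx))  # held-out 10%
print(np.mean(fold_acc, axis=0), np.std(fold_acc, axis=0))
```

One subtlety: since each flipped copy duplicates a patient, the folds would ideally be split by patient before augmentation so that no patient appears in both the training and the test set.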

Fig. 3. A sample patient's image and a list of sample assessments (i.e., labels) including the facial part, distortion direction and severity, e.g., "drooping the corner of left eye: mild", "straight nose, toward left: mild", "mouth distorted to upper left: moderate", "distortion of jaw to the left: mild", "protruding the lower lip: mild", and "distortion of face outline: moderate".

B. Experimental Results

Table I shows the classification accuracy (%) for each facial part, averaged over the results of a 10-fold cross validation. The number in parentheses indicates the standard deviation. The proposed model slightly improves the accuracy with a lower standard deviation.

During the cross validation experiment, we observed a large difference between the accuracy of our model and that of the other model for certain pairs of training and testing datasets. For example, the accuracy difference for the jaw is relatively higher than that for the other parts. This improvement leads to an improvement in the overall accuracy for the entire face. Even though the accuracy itself is still low, the proposed model has better mechanisms to learn visual attention and label dependency, which contributes to increasing the accuracy.

Fig. 4. Images with visual attention of Patient A, who has heavy distortions around his mouth and jaw. The model correctly predicts the labels.

Fig. 5. Images with visual attention of Patient B, who does not have any severe problems. The model correctly predicts the labels.

TABLE I
THE AVERAGE CLASSIFICATION ACCURACY (%) OF THE RESULTS OF A 10-FOLD CROSS VALIDATION. THE NUMBER IN PARENTHESES INDICATES THE STANDARD DEVIATION.

             Multiple CNNs    Our Model
Mouth        64.0 (±7.5)      65.7 (±7.4)
Jaw          57.9 (±16.5)     61.3 (±12.7)
Face         67.1 (±9.7)      67.4 (±9.1)
Average      63.0 (±9.6)      64.8 (±7.7)
Worst, Best  40.0, 74.3       49.5, 74.3

Fig. 6. Images with visual attention of Patient C, who has mild distortions around his mouth and jaw. The model predicts the wrong labels.

One reason for the low accuracy might be the size of the training data. To learn the image features relevant to producing correct labels, we need more data. Table II shows the accuracy results when the proposed model is trained on different dataset sizes: 200, 400, and 634 patients' images. We have observed a trend of increasing performance; therefore, we believe that the accuracy will increase as the number of samples increases. We are now collecting more samples for training, and also performing data cleansing to correctly label the samples.

C. Visual Attention

As explained in Section III, attention is a set of L weights attached to local regions of an input image. In our experiment, L = 19 × 14 = 266. Each of these regions is extracted through four max pooling layers, so each region is considered the part of the image to which the convolution layers strongly react. Images with the attention are obtained by upsampling the corresponding weights (α_i) and superimposing them on the original images. Since the image feature vectors for one image do not change, the attention is determined by the hidden state of the LSTM at each step. The relations between the image feature vectors and the hidden states are learned by a multilayer perceptron. Since the order of predicting the labels is fixed, first the mouth, then the jaw, and then the entire face, we expected the model to successfully determine the next attention from the hidden state.

Figures 4, 5 and 6 show some of the input images (i.e., patients' facial images) with the visual attention. The white regions of the images represent the attention, interpreted as the image pixels that have a relatively stronger influence in predicting the labels.
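A sketch of this upsample-and-superimpose step (PIL-based; the grid shape follows the L = 19 × 14 setup above, while the blending scheme is our assumption):

```python
import numpy as np
from PIL import Image

def attention_overlay(image_path, alpha, grid=(19, 14)):
    """Upsample a length-266 attention vector and whiten the attended pixels."""
    img = Image.open(image_path).convert("RGB")
    heat = np.asarray(alpha, dtype=np.float32).reshape(grid)
    heat = heat / heat.max()                   # normalize to [0, 1]
    mask = Image.fromarray(np.uint8(heat * 255)).resize(
        img.size, resample=Image.BILINEAR)     # upsample 19x14 -> image size
    white = Image.new("RGB", img.size, (255, 255, 255))
    # Blend toward white where attention is strong (the white regions
    # visible in Figs. 4-6); halve the mask so the face stays visible.
    return Image.composite(white, img, mask.point(lambda v: v // 2))
```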

The images in Figure 4 are of a sample patient who has heavy distortions around his mouth and jaw. The images in Figure 5 are of a sample patient who does not have any severe problems. These two sets of figures are obtained when the model correctly predicts the assessment of the mouth (left), jaw (middle) and the entire face (right). We have observed that, when the model predicts labels for the mouth, jaw and the entire face, the regions of attention are around the corresponding facial parts. Whereas attention is located around the jaw in Figure 4, attention is along the outline of the face in Figure 5. It seems that jaw labels can be predicted from both the facial parts and the outline of the face. When the model assesses the entire face label, the region of attention is smoothly distributed over the entire face.

The images in Figure 6 are of a sample patient who has mild distortions around his mouth and jaw, but the model predicts the wrong labels. In contrast to the cases of Patients A and B, the attention is cluttered.

The accuracy of this experiment is still insufficient for practical use in image diagnosis. The attention model is relatively complex, so the amount of data in the dataset is not satisfactory for training. Generally, when using machine learning, the bigger the dataset, the more the accuracy of the prediction improves. Therefore, by improving the model with more data, we will improve the accuracy.

V. TOWARDS BUILDING A DATA SCIENCE PLATFORM

The deep learning approach is promising for diagnostic imaging; however, to achieve better accuracy, the model requires a large dataset and high performance computing resources to support the computational workload of the training. In addition, while the promise of big data analytics is materializing, there is still a non-negligible gap between its potential and its usability in practice for medical professionals due to various factors that are inherent in the data itself, such as scale and privacy.

In order to fill this gap, we are working on the development of a data science platform with a secure data staging mechanism between the Dental Hospital and the Cybermedia Center (CMC), a supercomputing center at Osaka University. The platform is expected to allow any authorized medical and dental scientists to securely share data among themselves and to conduct their data analytics on high performance computing resources.

Fig. 7. An overview of the data science platform with a secure data staging mechanism over a VPN connection between the Dental Hospital1 and the Cybermedia Center2, Osaka University.

A. Secure Data Staging

In the daily practice of medical and dental services, privacy-rich and confidential data such as facial images and medical records have been proliferating and accumulating. This situation has been raising the needs and demands for supercomputing systems in medical and dental scientists' practices and research. An example of such needs and demands is the automated diagnostic imaging discussed in this paper.

In reality, however, medical and dental scientists have great difficulty using the supercomputer systems at the Cybermedia Center even though the Cybermedia Center and the dental hospital are located on the same campus. The primary reason for the difficulty is the government's regulations and guidelines as set forth by the Ministry of Health, Labour and Welfare, Japan. Based on these regulations and guidelines, the dental hospital has set up its own strict security policy and rules in terms of data treatment. In addition, there is a lack of technical solutions that allow computing and networking environments to be isolated and independent from third parties on the campus.

Figure 7 illustrates the data science platform which we have envisioned and have been prototyping to alleviate this lack of technical solutions. The platform is expected to allow dental doctors and scientists to use the supercomputing systems for their scientific research. In this environment, it is assumed that security-sensitive data is located at the hospital and moved to the supercomputing systems at the Cybermedia Center.

To embody this plan, we are now working on prototyping an on-demand secure staging mechanism that enables security-sensitive data to be securely moved between the hospital and the Cybermedia Center, by interlocking Software Defined Networking (SDN) with the job management system deployed on the supercomputing systems. SDN is a networking concept that allows the dynamic control of packet flows on a network. In this research, by making use of the network programmability brought by SDN, the job management system can establish a virtual private network that isolates the networking environment only when data located at the dental hospital is required. For the isolation of the computing environment, the job management system sets up the set of computing nodes scheduled to be used for the computation so that no third party can access this information.

In our plan, the proposed platform works as follows. First, when a computing job is about to start, the job management system sets up a virtual private network through the use of SDN functionalities and then moves the data from the dental hospital to the supercomputing systems. Simultaneously, the job management system sets up a set of computing nodes. After the data has been staged to the supercomputing systems, the system disconnects the VPN so that third parties cannot access either the supercomputing systems or the data. After the computation is finished, the job management system sets up a virtual private network again and then moves the data back to the dental hospital. Finally, the job management system removes the data on the supercomputer system and then releases the set of computing nodes used for the computation.
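The staging lifecycle just described can be summarized as pseudo-code; every function here is a hypothetical stand-in for an SDN or job-manager operation, not an existing API:

```python
# Hypothetical step functions; each stands in for an SDN controller or
# job management system operation from Section V-A.
def establish_vpn():  print("SDN: virtual private network up")
def teardown_vpn():   print("SDN: virtual private network down")
def reserve_nodes():  print("job manager: isolated compute nodes reserved")
def stage_in(d):      print(f"staging {d}: hospital -> supercomputer")
def stage_out(r):     print(f"staging {r}: supercomputer -> hospital")
def run_job(job):     print(f"running {job} on the reserved nodes"); return "trained model"
def cleanup(d):       print(f"wiping {d}; releasing the nodes")

def run_secure_job(dataset, job):
    establish_vpn(); reserve_nodes(); stage_in(dataset); teardown_vpn()
    result = run_job(job)                 # no network path to third parties
    establish_vpn(); stage_out(result); teardown_vpn()
    cleanup(dataset)
    return result
```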
B. Data Analytics Pipeline

Supported by this data science platform, medical and dental scientists can efficiently conduct their scientific research with data analytics. To make the best of analytics, all the information must be collected, cleaned, integrated, stored, analyzed and interpreted in a suitable manner. The whole process is a data analytics pipeline where different algorithms or systems focus on different specific targets and are coupled together to deliver an end-to-end solution. This process can also be viewed as a software stack where at each phase there are multiple solutions and the actual choice depends on the data type (e.g., image, sound, text and sensor data) or the application requirements (e.g., predictive analysis, cohort analysis and image recognition).

Figure 8 shows an example data analytics pipeline consisting of a data curation phase for cleansing, annotation and integration, and a data analytics phase with analysis methods and visualization tools. Before available data is directly processed for analysis, the data needs to go through several steps that refine it according to the application requirements.

Fig. 8. An example data analytics pipeline consisting of pre-processing for data analytics.

First, the data needs to be acquired and extracted from various data sources (e.g., different departments or labs), and shared among users. Second, the obtained raw data is probably heterogeneous, composed of structured, unstructured, and sensor data, and is also typically noisy due to inaccuracies, missing values, biased evaluations, etc. Hence, data cleansing is required to remove such data inconsistencies and errors. Third, medical experts will perform data annotation to gather meta-information and resolve the incompleteness of the data. This will contribute to the effectiveness and efficiency of the whole process. Fourth, data integration will combine various sources of data to enrich the information for further analysis. Finally, the processed data will be modelled and analyzed, and then the analytics results will be visualized and interpreted.

In this paper, we presented automated diagnostic imaging as a practical example of data analytics in the medical and dental scientific field. In this way, a medical and dental expert can follow the aforementioned processes to collect a patient's facial images from a few different sources (extraction), remove the incorrect data (cleansing), assess them for correct labelling (annotation), and then train the proposed deep learning model (modelling) by utilizing a GPU machine. Using the trained model, the expert automates the assessment process for a new patient's data and visualizes how the model decides to predict the assessments.

This data analytics pipeline can also be fully or semi-automated by using crowdsourcing systems [15], [16], active learning [17] and transfer learning [18]. We leave this automation for our future work [19].
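As a toy illustration of chaining the phases above, each function below is a stand-in for the real curation or analytics component, operating on made-up records:

```python
def extract(sources):  return [rec for src in sources for rec in src]
def cleanse(records):  return [r for r in records if r["label"] is not None]
def annotate(records): return [{**r, "meta": "expert-reviewed"} for r in records]
def model_and_visualize(records):
    print(f"training on {len(records)} curated records")

# Integration: combine records from two (made-up) sources, then curate.
dept_a = [{"image": "p1.png", "label": 1}, {"image": "p2.png", "label": None}]
dept_b = [{"image": "p3.png", "label": 0}]
curated = annotate(cleanse(extract([dept_a, dept_b])))
model_and_visualize(curated)
```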

VI. CONCLUSION

In this paper, we proposed a deep CNN/RNN model with an attention mechanism to automate diagnostic imaging for orthodontic treatment. We compared the proposed model with a model composed of multiple CNNs and showed the performance improvement in terms of assessment accuracy. The proposed model makes it possible for doctors to reduce their assessment workload. In future work, we will increase the amount of accurately labelled data to improve the accuracy, and also increase the number of facial parts to be assessed, e.g., the eye and nose.

We also presented a work-in-progress, i.e., a data science platform with a secure data staging mechanism. The proposed platform will allow scientists to securely share privacy-rich data and easily perform data analytics using supercomputing resources to efficiently conduct their scientific research. Eventually, we also plan to build a dental healthcare application that fully or semi-automatically generates diagnoses and treatment plans using a smartphone or mobile devices. Successful remote and automated diagnostic imaging can also be expanded to other fields such as otolaryngology (ear and nose) and ophthalmology (eye).

ACKNOWLEDGMENT

This work was supported by JSPS KAKENHI Grant Number JP17KT0083, and partly supported by JSPS KAKENHI Grant Numbers JP16H02802 and JP17K00168. The authors would like to thank Prof. Takashi Yamashiro and Assistant Prof. Kazunori Nozaki, Osaka University Dental Hospital, for setting up the environment and the medical dataset for the experiments. This research accomplishment was partly achieved through the use of the supercomputer and the PC cluster for large-scale visualization (VCC) at the CMC.

REFERENCES

[1] P. Lakhani and B. Sundaram, "Deep learning at chest radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks," Radiology, 2017.
[2] V. Gulshan, L. Peng, M. Coram et al., "Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs," JAMA, vol. 316, 2016.
[3] A. Janowczyk and A. Madabhushi, "Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases," Journal of Pathology Informatics, vol. 7, 2016.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 1, 2012.
[5] J. Antony, K. McGuinness, N. E. Connor, and K. Moran, "Quantifying radiographic knee osteoarthritis severity using deep convolutional neural networks," arXiv preprint arXiv:1609.02469, 2016.
[6] E. Kim, M. Corte-Real, and Z. Baloch, "A deep semantic mobile application for thyroid cytopathology," in SPIE Medical Imaging. International Society for Optics and Photonics, 2016, pp. 97890A–97890A.
[7] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe, "Deep convolutional ranking for multilabel image annotation," arXiv preprint arXiv:1312.4894, 2013.
[8] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[9] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[10] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048–2057.
[11] H.-C. Shin, K. Roberts, L. Lu, D. Demner-Fushman, J. Yao, and R. M. Summers, "Learning to read chest X-rays: Recurrent neural cascade model for automated image annotation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2497–2506.
[12] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[13] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[14] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.
[15] C. Ye, H. Wang, K. Li, Q. Chen, J. Chen, J. Song, and W. Yuan, CrowdCleaner: A Data Cleaning System Based on Crowdsourcing. Springer International Publishing, 2014, pp. 657–661.
[16] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye, "Katara: A data cleaning system powered by knowledge bases and crowdsourcing," in Proc. of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, pp. 1247–1261.
[17] M. Sharma and M. Bilgic, "Evidence-based uncertainty sampling for active learning," Data Mining and Knowledge Discovery, 2017.
[18] Z. Ma, Y. Yang, F. Nie, N. Sebe, S. Yan, and A. Hauptmann, "Harnessing lab knowledge for real-world action recognition," International Journal of Computer Vision, 2014.
[19] C. Lee, S. Murata, K. Ishigaki, and S. Date, "A data analytics pipeline for smart healthcare applications," in Proc. of the Workshop on Sustained Simulation Performance, 2017.
