
Article

Posture Recognition Using Ensemble Deep Models under Various Home Environments

Yeong-Hyeon Byeon 1, Jae-Yeon Lee 2, Do-Hyung Kim 2 and Keun-Chang Kwak 1,*

1 Department of Control and Instrumentation Engineering, Chosun University, Gwangju 61452, Korea; [email protected]
2 Intelligent Robotics Research Division, Electronics and Telecommunications Research Institute, Daejeon 61452, Korea; [email protected] (J.-Y.L.); [email protected] (D.-H.K.)
* Correspondence: [email protected]; Tel.: +82-062-230-6086

 Received: 26 December 2019; Accepted: 11 February 2020; Published: 14 February 2020 

Abstract: This paper is concerned with posture recognition using ensemble convolutional neural networks (CNNs) in home environments. With the increasing number of elderly people living alone at home, posture recognition is very important for helping elderly people cope with sudden danger. Traditionally, to recognize posture, it was necessary to obtain the coordinates of the body points, depth, frame information of video, and so on. In conventional machine learning, there is a limitation in recognizing posture directly using only an image. However, with advancements in the latest deep learning methods, it is possible to achieve good performance in posture recognition using only an image. Thus, we performed experiments based on VGGNet, ResNet, DenseNet, InceptionResNet, and Xception as pre-trained CNNs using five types of preprocessing. On the basis of these deep learning methods, we finally present ensemble deep models combined by majority vote and score average methods. The experiments were performed on a posture database constructed at the Electronics and Telecommunications Research Institute (ETRI), Korea. This database consists of 51,000 images with 10 postures from 51 home environments. The experimental results reveal that the ensemble system of InceptionResNetV2s with five types of preprocessing shows good performance in comparison to other combination methods and the pre-trained CNNs themselves.

Keywords: ensemble deep models; convolutional neural network; posture recognition; preconfigured CNNs; posture database; home environments

1. Introduction

Posture recognition is a technology that classifies and identifies the posture of a person and has received considerable attention. Posture is the arrangement of the body skeleton that arises naturally or necessarily through the motion of a person. Posture recognition helps detect crimes, such as kidnapping or assault, using camera input [1] and also provides a service robot with important information for judging a situation to perform advanced functions in an automatic system. Posture recognition also helps rehabilitate posture correction in the medical field, is used as game content in the entertainment field, and provides suggestions to athletes for maximizing their ability in sports [2,3]. Additionally, it helps elderly people who live alone and have difficulty performing certain activities, by determining sudden danger from their posture in home environments. Thus, there are several studies on posture recognition because it is an important technique in our society.

Several studies have been performed on posture analysis, involving recognition, estimation, etc. Chan [4] proposed the scalable feedforward neural network-action mixture model to estimate three-dimensional (3D) human poses using viewpoint and shape feature histogram features extracted from a 3D point-cloud input. This model is based on a mapping that converts a Bayesian network to a feedforward neural network. Veges [5] performed 3D human pose estimation with Siamese equivariant embedding. Two-dimensional (2D) positions were detected, and then the detection was lifted into 3D coordinates. A rotation equivariant hidden representation was learned by the Siamese architecture to compensate for the lack of . Stommel [6] proposed the spatiotemporal segmentation of keypoints given by a skeletonization of depth contours. Here, the Kinect generates both a color image and a depth map. After the depth map is filtered, a 2D skeletonization of the contour points is utilized as a keypoint detector. The extracted keypoints simplify human detection to a 2D clustering problem. For all poses, the distances to other poses were calculated and arranged by similarity. Shum [7] proposed a method for reconstructing a valid movement from the deficient, noisy posture provided by Kinect. Kinect localizes the positions of the body parts of a person. However, when some body parts are occluded, the accuracy decreases, because Kinect uses a single depth camera. Thus, the measurements are objectively evaluated to obtain a reliability score for each body part. By fusing the reliability scores into a query of the motion database, kinematically valid similar postures are obtained.

Posture recognition is also commonly studied using inertial sensors in smartphones or other wearable devices. Lee [8] performed automatic classification of squat posture using inertial sensors via deep learning. One correct and five wrong squat postures were defined and classified using inertial data from five inertial sensors attached to the lower body, a random forest, and a convolutional neural network long short-term memory. Chowdhury [9] studied detailed activity recognition with a smartphone using trees and support vector machines. The data from the accelerometer in the smartphone were used to recognize detailed activity, such as sitting on a chair rather than simply sitting. Wu [10] studied yoga posture recognition with wearable inertial sensors based on a two-stage classifier. An artificial neural network and fuzzy C-means were used to classify yoga postures. Idris [11] studied human posture recognition using an Android smartphone and an artificial neural network. The gyroscope data from two smartphones attached to the arm were used to classify four gestures. To acquire data from inertial sensors or smartphones, a sensor usually needs to be attached to the body, which is inconvenient for elderly people at home.

The development of deep learning has greatly exceeded the performance of conventional machine learning; thus, deep learning is actively studied. Various models and learning methods have been developed. The depth of deep learning networks has expanded from tens to hundreds of layers [12], in contrast to conventional neural networks with a depth of two to three. Deep learning networks abstract data to a high level through a combination of nonlinear transforms. There are many deep learning methods, such as deep belief networks [13], deep Q networks [14], deep neural networks [15], recurrent neural networks [16], and convolutional neural networks (CNNs) [17]. CNNs are designed by integrating a feature extractor and a classifier into a network to automatically train them through data and exhibit optimal performance for image processing [18].
There are many pre-trained deep models based on the CNN, such as VGGNet [19], ResNet [20], DenseNet [21], InceptionResNet [22], and Xception [23]. These pre-trained models can be used for transfer learning. Transfer learning reuses the parameters of the models, which are trained from a large-scale database using high-performance processing units for a long time. Transfer learning can be effectively used in cases where a neural network needs to be trained with a small database and there is insufficient time to train the network with new data [24]. There are many studies on posture estimation based on deep learning. Thompson [25] used a CNN to extract several scale features of body parts. These features include a high-order spatial model sourced from a Markov random field and indicate the structural constraint of the domain between joints. Careira [26] proposed a feedback algorithm employing a top-down strategy called iterative error feedback. It carries the learning hierarchical representation of the CNN from the input to the output with a self-modifying model. Pishchulin [27] concentrated on local information by reducing the receptive field size of the fast R-CNN (Region-based CNN) [28]. Thus, partial detection was converted into multi-label classification and combined with DeepCut to perform bottom-up inference. Insafutdinov [29] proposed residual learning [20] that includes more context by increasing the size of the receptive field and the depth of the network. Georgakopoulos [30] proposed a methodology for classifying poses from binary human silhouettes using a CNN, and the method was improved by image features based on modified Zernike moments for fisheye images. The training set is composed of a composite image created from a 3D human model using the calibration model of the fisheye camera. The test set is the actual image acquired by the fisheye camera [31]. To increase performance, many studies have used ensemble deep models in various applications. Lee [32] designed an ensemble stacked auto-encoder based on sum and product for classifying horse gaits using wavelet packets from motion data of the rider. Maguolo [33] studied an ensemble of convolutional neural networks trained with different activation functions using the sum rule to improve the performance on small- or medium-sized biomedical datasets. Kim [34] studied deep learning based on 1D ensemble networks using an electrocardiogram for user recognition. The ensemble network is composed of three CNN models with different parameters, and their outputs are combined into a single output. Traditionally, to recognize posture, it was necessary to obtain the coordinates of the body points or inertial data. This was achieved using a depth camera such as Kinect, image processing through a body model, or devices for capturing motion connected to the body; regarding the latter, it is a nuisance to carefully wear these sensors in everyday life. Posture recognition using images, which does not require a sensor attached to the body, does not have this problem. Since posture recognition is performed using images, it can be applied to inexpensive cameras, and the device used for acquiring the experimental data is also inexpensive even though it supports a depth camera. In conventional machine learning, there is a limitation to recognizing posture directly using only an image [12,35–37]. However, owing to advancements in deep learning, good performance in posture recognition can be achieved using only one image.
In the present study, several pre-trained CNNs were employed for recognizing the posture of a person using various preprocessed 2D images. To train the deep neural network for posture recognition, a large number of posture images was required. Therefore, a daily domestic posture database was directly constructed for posture recognition under the assumption of an environment of domestic service robots. The database includes ten postures: “standing normal”, “standing bent”, “sitting sofa”, “sitting chair”, “sitting floor”, “sitting squat”, “lying face”, “lying back”, “lying side”, and “lying crouched”. The training, validation, and test sets were real captured images, not synthetic images. Moreover, ensemble CNN models were studied to improve performance. The posture-recognition performance of ensemble CNNs with various types of preprocessing, which has not been studied thus far, was compared. In the case of single models, type 2 exhibited 13.63%, 3.12%, 2.79%, and 0.76% higher accuracy than types 0, 1, 3, and 4, respectively, under transfer learning; and VGG19 exhibited 15.78%, 0.78%, 3.53%, 4.11%, 10.61%, and 16.02% higher accuracy than the simple CNN, VGG16 [19], ResNet50 [20], DenseNet201 [21], InceptionResNetV2 [22], and Xception [23], respectively, under transfer learning. In the case of ensemble systems, the ensemble system combining InceptionResNetV2s using the average of scores achieved 2.02% higher accuracy than the best single model in our experiments under nontransfer learning. We performed posture recognition that can be applied to general security cameras to detect and respond to human risks. To do so, we acquired a large amount of data for training a neural network. Because the existing CNN models have a fixed input form, various preprocessing methods were applied to fit this input form. We compared the performance of posture recognition by applying various existing CNN models, and proposed ensemble models over input forms and CNN models. We performed posture recognition using ensemble preconfigured deep models in home environments. Section 2 describes the deep models of the CNN. Section 3 discusses the database and experimental methods. Section 4 presents the experimental results, and Section 5 concludes the paper.

2. Deep Models of CNN

There are many pre-trained deep models based on the CNN, such as VGGNet [19], ResNet [20], DenseNet [21], InceptionResNet [22], and Xception [23]. These pre-trained models can be used for transfer learning. Transfer learning reuses the parameters of the models, which are trained from a large-scale database using high-performance processing units for a long time. Transfer learning can be effectively used in cases where a neural network needs to be trained with a small database and there is insufficient time to train the network with new data [24].
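The transfer-learning setup described here (a frozen pre-trained feature extractor with new FC and Softmax layers) can be sketched as follows. This is a minimal illustration using the Keras applications API; the 224 × 224 input size and the 11-class output of Section 3 are assumptions, and it is not the authors' exact code.

```python
# Minimal transfer-learning sketch (assumed setup, not the authors' exact code):
# reuse pre-trained convolutional weights and train only the new FC/Softmax head.
import tensorflow as tf

NUM_CLASSES = 11  # 10 postures + "other images" (see Section 3)

base = tf.keras.applications.VGG19(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the feature-extraction parameters

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation="relu"),            # new FC layer
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # new Softmax layer
])
model.compile(optimizer=tf.keras.optimizers.Adadelta(),
              loss="categorical_crossentropy", metrics=["accuracy"])
```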

2.1. VGGNet

VGGNet is ranked second in the ILSVRC 2014, following GoogLeNet, but it is more widely used than GoogLeNet because it has a significantly simpler structure. VGGNet uses a relatively small 3 × 3 convolution (conv) filter and a 1 × 1 convolution filter, in contrast to AlexNet's 11 × 11 filters in the first layer and ZFNet's 7 × 7 filters. When the nonlinear rectified linear unit (ReLU) activation function is applied after the 1 × 1 convolution, the model becomes more discriminative. Additionally, a smaller filter requires fewer parameters to be learned and results in higher processing speed. In VGGNet, a maxpool with a 2 × 2 kernel size and 2 strides, 2 fully connected (FC) layers with 4096 nodes, 1 FC layer with 1000 nodes, and 1 Softmax layer are used. Two 3 × 3 convolution layers and three 3 × 3 convolution layers have effective receptive fields of 5 × 5 and 7 × 7, respectively. By doubling the number of filters after each maxpool layer, the spatial dimensions are reduced, but the network depth is increased. Originally, VGGNet was designed to investigate how errors are affected by the network depth. There are models with layer depths of 8, 11, 13, 16, and 19 in VGGNet. As the network depth increases, the error decreases, but the error increases if the layer depth exceeds 19. VGGNet uses data augmentation of scale jittering for training and is trained with batch gradient descent using four Nvidia Titan Black graphics processing units for approximately three weeks [19]. Figure 1 shows the structure of VGGNet. In an expression with the form L@M × N, L, M, and N represent the sizes of the map, the row of the kernel, and the column of the kernel, respectively. The @ is a symbol used to separate the map size and the filter size here.
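The stacked small-filter design can be sketched as a VGG-style block: two 3 × 3 convolutions (an effective 5 × 5 receptive field) followed by a 2 × 2 max pool with stride 2, with the number of filters doubled from one block to the next. The filter counts below are illustrative, not the full VGG19 configuration.

```python
# VGG-style block sketch: two stacked 3x3 convolutions (effective 5x5 receptive
# field) followed by 2x2 max pooling with stride 2; filters double per block.
import tensorflow as tf

def vgg_block(x, filters):
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)(x)

inputs = tf.keras.Input(shape=(224, 224, 3))
x = vgg_block(inputs, 64)   # spatial size 224 -> 112
x = vgg_block(x, 128)       # spatial size 112 -> 56, filter count doubled
```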

Figure 1. Structure of VGGNet. FC, fully connected.

2.2. ResNet

Deep layers of neural networks result in a vanishing gradient, an exploding gradient, and degradation. The vanishing gradient refers to when a propagated gradient becomes too small, and the exploding gradient refers to when a propagated gradient becomes too large to train. Degradation indicates that the deep neural network has worse performance than the shallow neural network even though there is no overfitting. ResNet attempts to solve these problems by reusing the input features of the previous layer. Figure 2 shows the structure of ResNet. The output of Y is calculated using the input of X, and the input of X is reused by being added to the output of Y. This is called a skip connection. Then, learning is performed so that ReLU(W × X) converges to 0, indicating that the output of Y is almost equal to X. This reduces the vanishing gradient, and even small changes in the input are delivered to the output. The number of intermediate layers in the section of the skip connection can be set arbitrarily, and ResNet uses this method to stack layers deeply. ResNet is structured according to VGGNet [19] and uses the convolution of 3 × 3 filters, but not pooling or dropout. The pooling is replaced with the convolution of 2 strides. After every two convolutions, the input layer is added at the output [20].
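A minimal sketch of the skip connection described above follows; it assumes the shortcut and the residual branch have the same number of channels, and the block arrangement is illustrative rather than the exact ResNet50 layout.

```python
# Residual block sketch: the input X bypasses two 3x3 convolutions and is added
# to their output, so the block only needs to learn a small correction to X.
import tensorflow as tf

def residual_block(x, filters):
    # assumes x already has `filters` channels so the shapes match for Add()
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.Add()([y, shortcut])  # skip connection
    return tf.keras.layers.ReLU()(y)
```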


Figure 2. Structure of ResNet.

2.3. DenseNet

A typical network structure is a sequential combination of convolution, activation, and pooling. In contrast to the typical network, DenseNet solves the degradation problem by introducing a new concept called dense connectivity. DenseNet has approximately 12 filters per layer and uses the dense connectivity pattern to continuously pile up the feature maps of previous layers, effectively conveying the information in the early layers to the latter layers. This allows entire feature maps to enter the last classifier evenly within the network, while simultaneously reducing the total number of parameters, making the network sufficiently learnable. The dense connection also functions as regularization, which reduces the overfitting even for small datasets. Dense connectivity is expressed by Equation (1) and shown in Figure 3. DenseNet divides the entire network into several dense blocks and groups layers with the same size of the feature map into the same dense block. The part of pooling and convolution between dense blocks is called the transition layer. This layer comprises a batch normalization (BN) layer, a 1 × 1 convolution layer for adjusting the dimensions of the feature map, and a 2 × 2 average pooling layer. A bottleneck structure (i.e., BN-ReLU-conv(1)-BN-ReLU-conv(3)) is employed to reduce the computational complexity. Usually, global average pooling is used instead of FC, which most networks have in the last layer. The network is trained via stochastic gradient descent [21]. Figure 4 shows the structure of DenseNet.

x_m = H_m([x_0, x_1, ..., x_{m-1}])    (1)

Figure 3. Dense connectivity.



Figure 4. Structure of DenseNet.
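A minimal sketch of the dense connectivity of Equation (1) and Figure 3 follows: each layer operates on the concatenation of all preceding feature maps. The growth rate of 12 filters per layer matches the text, while the number of layers is illustrative.

```python
# Dense-block sketch of Equation (1): layer H_m operates on the concatenation
# of all preceding feature maps [x_0, x_1, ..., x_{m-1}].
import tensorflow as tf

def dense_block(x, num_layers=4, growth_rate=12):
    features = [x]
    for _ in range(num_layers):
        h = features[0] if len(features) == 1 else tf.keras.layers.Concatenate()(features)
        h = tf.keras.layers.BatchNormalization()(h)
        h = tf.keras.layers.ReLU()(h)
        h = tf.keras.layers.Conv2D(growth_rate, 3, padding="same")(h)
        features.append(h)          # every later layer sees this feature map
    return tf.keras.layers.Concatenate()(features)
```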

2.4. InceptionResNetV2

The objective of the inception module is to cover a wide area of the image while maintaining the resolution for smaller information. Thus, convolutions of different sizes from 1 × 1 to 5 × 5 are performed in parallel. The inception module first performs convolution with a 1 × 1 filter and then performs convolutions with filters of different sizes, because the convolution with a 1 × 1 filter reduces the number of feature maps and thus the computational cost. The results of the convolutions performed in parallel are concatenated at the output layer of the inception module. InceptionResNetV2 consists of a stem layer, three types of inception modules (A, B, and C), and two types of reduction modules (A and B). The stem layer is the frontal layer, and the reduction module is used to reduce the size of the feature map in InceptionResNetV2. The inception modules of InceptionResNetV2 are based on the integration of the inception module and the skip connection of ResNet [22]. The stem layer, the three types of inception modules (A, B, and C), the two types of reduction modules (A and B), and the structure of InceptionResNetV2 are shown in Figures 5–8, respectively. In an expression with the form L@M × N:O, L, M, N, and O represent the sizes of the map, the row of the kernel, the column of the kernel, and the stride, respectively.

Figure 5. Stem layer of InceptionResNetV2. BN, batch normalization; ReLU, rectified linear unit.
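The parallel-branch idea underlying the inception modules can be sketched as follows: 1 × 1, 3 × 3, and 5 × 5 branches, each preceded by a 1 × 1 convolution that reduces the number of feature maps, concatenated at the output. This is a simplified inception-style module with illustrative branch widths, not the exact Inception-ResNet-v2 blocks.

```python
# Simplified inception-style module sketch: parallel 1x1, 3x3, and 5x5 branches,
# each preceded by a 1x1 convolution that reduces the number of feature maps,
# concatenated at the output (branch widths are illustrative).
import tensorflow as tf

def inception_module(x, f1=32, f3=64, f5=32):
    b1 = tf.keras.layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = tf.keras.layers.Conv2D(f3 // 2, 1, padding="same", activation="relu")(x)
    b3 = tf.keras.layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    b5 = tf.keras.layers.Conv2D(f5 // 2, 1, padding="same", activation="relu")(x)
    b5 = tf.keras.layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    return tf.keras.layers.Concatenate()([b1, b3, b5])
```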

2.5. Xception

Xception, which is based on inception, seeks to completely separate the search for relationships between channels from the search for regional information on images. In Xception, depth-wise separable convolution is performed for each channel, and the result is projected to the new channel space via 1 × 1 convolution. Whereas the existing convolution creates a feature map considering all the channels and local information, the depth-wise convolution creates one feature map for each channel, and then 1 × 1 convolution is performed to adjust the number of feature maps. The 1 × 1 convolution is called the pointwise convolution (point-conv). In inception, each convolution is followed by the nonlinearity of the ReLU; however, in a depth-wise separable convolution, the first convolution is not followed by nonlinearity. Xception has 36 convolution layers in 14 modules for feature extraction. Except for the beginning and end layers, each module has a linear residual connection. In summary, Xception is formed by linearly stacking depth-wise separable convolution layers with residual connections [23]. Figure 6 shows the structure of Xception.
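A minimal sketch of the depth-wise separable convolution described above follows: a per-channel 3 × 3 depth-wise convolution and a point-wise 1 × 1 convolution, with no nonlinearity between the two; the residual connection around a full Xception module is omitted for brevity.

```python
# Depth-wise separable convolution sketch: a per-channel (depth-wise) 3x3
# convolution followed by a point-wise 1x1 convolution that mixes channels;
# no ReLU is applied between the two, as described in the text.
import tensorflow as tf

def separable_module(x, filters):
    y = tf.keras.layers.DepthwiseConv2D(3, padding="same")(x)  # spatial, per channel
    y = tf.keras.layers.Conv2D(filters, 1, padding="same")(y)  # point-conv across channels
    return tf.keras.layers.ReLU()(y)
```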

Figure 6. Structure of Xception.

3. Database and Experiment Method

3.1. Construction of Database for Posture Recognition

Training a deep neural network for posture recognition requires a large number of posture images. Thus, the daily domestic posture database was constructed for posture recognition under the assumption of an environment of domestic service robots. Astra (i.e., an assembly of sensors for color images (red, green, and blue (RGB)), depth images (infrared rays), and sound sources (microphones), made by Orbbec) was used to construct the daily domestic posture database. It senses the depth from 0.6 to 8 m and the color inside a field of view of 60° × 49.5° × 73°. The resolution and frame rate are 640 × 480 pixels and 30 fps, respectively, for the RGB and depth images. The size of Astra is 165 × 30 × 40 mm, its power consumption is 2.4 W (through Universal Serial Bus (USB) 2.0), and it has two microphones. Figure 7 shows the Astra camera for capturing posture images. It can be developed using the Astra software development kit (SDK) and OpenNI in several operating systems (e.g., Android, Linux, and Windows 7, 8, and 10). The SDK also supports body tracking with a skeleton [38]. To construct the database, a graphical user interface (GUI) for the capture tool was developed using the Microsoft Foundation Class library and the Astra SDK in Windows 7. Figure 8 shows the developed GUI for constructing the posture database.

Figure 7. Astra camera for capturing posture images. RGB, red, green, blue; USB, Universal Serial Bus.

Figure 8. Graphical user interface (GUI) for constructing the posture database.

A total of 51 homes participated in the construction of the daily domestic posture database. These homes had a living room, a kitchen, a small room, a dinner table, chairs, a bed, and a sofa. More than two persons in each home equally contributed as subjects for the posture images. Ten postures were defined to construct the daily domestic posture database. The postures were “standing normal”, “standing bent”, “sitting sofa”, “sitting chair”, “sitting floor”, “sitting squat”, “lying face”, “lying back”, “lying side”, and “lying crouched”. Each home generated 100 images per posture, and each image included one subject. Each home generated 1000 posture images (10 postures × 100 images per posture). Thus, the total number of images was 51,000 (1000 posture images × 51 homes). Each image was captured by varying the position, type of room, prop (such as clothes), furniture (such as a chair), pose of the sensor, pose of the person, and small movement. Figure 9 shows the environment for capturing the posture images using Astra. Figure 10 shows examples of posture images captured using Astra.


Figure 9. Environment for capturing posture images using Astra.

(a) (b)

(c) (d)

(e) (f)


(g) (h)

(i) (j)

Figure 10. Examples of posture images captured from Astra camera in various home environments: (a) standing normal; (b) standing bent; (c) sitting sofa; (d) sitting chair; (e) sitting floor; (f) sitting squat; (g) lying face; (h) lying back; (i) lying side; (j) lying crouched.

3.2. Preprocessing Types

To segment the person images, you only look once (YOLO)-v3 was used as a person detector [39]. After segmenting the person, preprocessing was performed to consider various input methods, which have different pros and cons, because the neural networks are limited to a fixed square input. The pros and cons concern the tightness of the crop around the person and the distortion of the person caused by stretching the image. Five types of preprocessing for cropping images were defined to input posture images into the neural network. In type 0, the original image was resized to the size of the input layer. This changed the original ratio of the image. In type 1, the person image was segmented from the original image while satisfying the size of the input layer. The original ratio of the image was maintained. In type 2, the person image was tightly segmented from the original image regardless of the size of the input layer. Then, the person image was resized while maintaining the ratio of the original image until the size of the row or column of the person was equal to the size of the row or column of the input layer, respectively. The extra border due to the difference in the image ratio between the person and the input layer was zero-padded. In type 3, the person image was tightly segmented from the original image regardless of the size of the input layer. Then, the person image was resized to the size of the input layer. This changed the original ratio of the image. In type 4, the person image was tightly segmented from the original image regardless of the size of the input layer. Then, the person image was resized while maintaining the ratio of the original image until the size of the row or column of the person was equal to the size of the row or column of the input layer, respectively. The extra border due to the difference in the image ratio between the person and the input layer was replicated from the edge of the original image. Figure 11 shows examples of the five types of original and preprocessed images.
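The five preprocessing types can be sketched as follows with OpenCV. The person bounding box from the YOLO-v3 detector is assumed to be given as (x, y, w, h), type 1 is interpreted here as cropping a square region around the person, and edge handling is simplified; this is an illustration of the described variants, not the authors' implementation.

```python
# Sketch of the five preprocessing types; the person bounding box from the
# YOLO-v3 detector is assumed to be given as (x, y, w, h) in pixels.
import cv2

def preprocess(img, box, ptype, size=224):
    x, y, w, h = box
    person = img[y:y + h, x:x + w]              # tight crop around the person
    if ptype == 0:                              # resize whole image, ratio changes
        return cv2.resize(img, (size, size))
    if ptype == 1:                              # square region around the person,
        side = max(w, h)                        # ratio preserved (edge cases omitted)
        cx, cy = x + w // 2, y + h // 2
        x0, y0 = max(cx - side // 2, 0), max(cy - side // 2, 0)
        return cv2.resize(img[y0:y0 + side, x0:x0 + side], (size, size))
    if ptype == 3:                              # tight crop, resize, ratio changes
        return cv2.resize(person, (size, size))
    # types 2 and 4: tight crop, keep ratio, pad the shorter side
    scale = size / max(w, h)
    resized = cv2.resize(person, (int(w * scale), int(h * scale)))
    pad_w, pad_h = size - resized.shape[1], size - resized.shape[0]
    border = cv2.BORDER_CONSTANT if ptype == 2 else cv2.BORDER_REPLICATE
    return cv2.copyMakeBorder(resized, pad_h // 2, pad_h - pad_h // 2,
                              pad_w // 2, pad_w - pad_w // 2, border, value=0)
```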


(a) (b) (c)

(d) (e) (f)

Figure 11. Examples of five types of original and preprocessed images: (a) original image; (b) type 0; (c) type 1; (d) type 2; (e) type 3; (f) type 4.

3.3. Posture Recognition Using Ensemble CNN Models

As more people live alone, there is a need for ways to cope with crime or dying alone. Posture recognition emerged as a solution because it contains some information about the person's situation. However, in posture recognition using a conventional inertial sensor, the sensor is cumbersome to attach to the body. For the elderly, it is difficult to wear the sensor all the time, and extra help is needed for them to do so. In contrast, image-based posture recognition through a camera can be applied to existing camera systems, and there is no need to attach any sensors. In addition, low-cost webcams can also be applied because the posture is recognized only from the image. It has been difficult to recognize posture by using only two-dimensional images, but recently, deep learning has proved excellent in various fields. We applied this technique to 2D-image-based posture recognition. First, a daily domestic posture database was constructed. Then, the posture images were processed with different types of preprocessing. The CNN models were VGG16, VGG19 [19], ResNet50 [20], DenseNet201 [21], InceptionResNetV2 [22], and Xception [23]. Transfer learning is effectively used in cases where a neural network needs to be trained with a small database and there is insufficient time to train the network with new data. In this study, posture recognition was performed by training the CNNs using transfer learning. Most of the parameters used to extract the feature maps were fixed, and only the final FC layers and Softmax were trained. In addition, CNNs were trained by updating all the parameters (nontransfer learning). Figure 12 shows the posture recognition using the preprocessed images for different CNN models. Then, the CNNs of the deep models and the CNNs of the preprocessing types are combined using the majority of outputs and the average of scores as ensemble methods. The majority of outputs decides the most voted class among the outputs of the CNNs. The average of scores decides the class with the maximum score from the averaged scores of the CNNs. Tables 1 and 2 show the ensemble methods of majority vote and score average. Figure 13 shows the various ensemble systems by input types and CNN models.


Table 1. Ensemble method of majority vote.

1. The network outputs the scores of each class as a 1 × n matrix.

s_c = [x_1 x_2 x_3 ... x_n]

2. Get the output of each classifier by picking the class with the max value in the scores as a one-hot matrix.

o_c = [0 0 1 ... 0]

3. Sum the one-hot matrices of all C classifiers.

v = Σ_{c=1}^{C} o_c

4. Get the final output by picking the class with the max value in the sum of the one-hot matrices.

o_v = [1 0 0 ... 0]

Table 2. Ensemble method of score average.

1. The network outputs the scores of each class as a 1 × n matrix.

s_c = [x_1 x_2 x_3 ... x_n]

2. Calculate the average of the scores of all C classifiers.

a = (1/C) Σ_{c=1}^{C} s_c

3. Get the final output by picking the class with the max value in the average of the scores.

o_a = [1 0 0 ... 0]
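The two combination rules of Tables 1 and 2 can be written compactly as follows; here `scores` is a C × n array holding one score vector per classifier, and the example values are arbitrary.

```python
# Ensemble sketch of Tables 1 and 2: `scores` holds one 1 x n score vector per
# classifier (shape C x n). Majority vote sums one-hot outputs; score average
# averages the raw scores; both pick the class with the maximum value.
import numpy as np

def majority_vote(scores):
    one_hot = np.eye(scores.shape[1])[np.argmax(scores, axis=1)]  # o_c per classifier
    return int(np.argmax(one_hot.sum(axis=0)))                    # most voted class

def score_average(scores):
    return int(np.argmax(scores.mean(axis=0)))                    # class with max mean score

scores = np.array([[0.1, 0.7, 0.2],   # classifier 1
                   [0.3, 0.4, 0.3],   # classifier 2
                   [0.6, 0.2, 0.2]])  # classifier 3
print(majority_vote(scores), score_average(scores))
```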

The number of postures is the number of classes. The number of defined postures was 10: “standing normal”, “standing bent”, “sitting sofa”, “sitting chair”, “sitting floor”, “sitting squat”, “lying face”, “lying back”, “lying side”, and “lying crouched”. The 10 posture classes were divided into four categories: standing, sitting, lying, and lying crouched. Instead of 10 and 4 classes, there were 11 and 5 classes when other images, such as the background, were considered as a separate class. The results of posture recognition under 10, 4, 11, and 5 classes were obtained. The other images were configured by programmatically cropping the original posture images randomly while avoiding the person to the greatest extent possible. These various numbers of classes have the advantage of extracting context information. For example, a person does not sleep in a standing posture. Figure 14 shows the four categories.



Figure 12. Ensemble deep models using the preprocessed images with different convolutional neural network (CNN) models.

Figure 13. Various ensemble deep models by input types and the pre-trained CNNs.

Figure 14. Examples of four categories with 10 postures.

The posture-recognition performance was evaluated with regard to accuracy, which was defined as the number of correct classifications (CC) divided by the number of total classifications (i.e., the sum of CC and the number of wrong classifications (WC)) [40]:

Accuracy = CC / (CC + WC)    (2)
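As a small worked example of Equation (2), with arbitrary counts:

```python
# Equation (2): accuracy = CC / (CC + WC), shown with arbitrary example counts.
def accuracy(cc, wc):
    return cc / (cc + wc)

print(accuracy(cc=90, wc=10))  # 0.90, i.e., 90% of classifications are correct
```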

4. Experimental Results

4.1. Database

The daily domestic posture database was directly constructed from 51 homes using Astra. All homes had a living room, a kitchen, a small room, a dinner table, chairs, a bed, and a sofa. The postures included “standing normal”, “standing bent”, “sitting sofa”, “sitting chair”, “sitting floor”, “sitting squat”, “lying face”, “lying back”, “lying side”, and “lying crouched”. Each home had 100 images per posture, and each image included one subject. Each home had 1000 posture images (10 postures × 100 images per posture). The total number of images was 51,000 (1000 posture images × 51 homes). The images were captured by varying the position, type of room, prop (such as clothes), furniture (such as a chair), pose of the sensor, pose of the person, and small movement. The posture database was constructed with men and women ranging from 19 to 68 years of age. Figure 15 shows the data configuration for the training, validation, and testing. The posture images were subjected to different types of preprocessing. To segment the person image, YOLO-v3 was used as a person detector [39], and the segmented person images were filtered manually. The number of preprocessed posture images for the training was 3910 for standing normal, 3802 for standing bent, 3899 for sitting sofa, 3899 for sitting chair, 3896 for sitting floor, 3859 for sitting squat, 3752 for lying face, 3777 for lying back, 3773 for lying side, 3505 for lying crouched, and 6802 for other images. The number of preprocessed posture images for the validation was 689 for standing normal, 670 for standing bent, 688 for sitting sofa, 687 for sitting chair, 687 for sitting floor, 681 for sitting squat, 662 for lying face, 666 for lying back, 665 for lying side, 618 for lying crouched, and 1200 for other images. The number of preprocessed posture images for the testing was 500 for standing normal, 497 for standing bent, 500 for sitting sofa, 500 for sitting chair, 497 for sitting floor, 498 for sitting squat, 489 for lying face, 489 for lying back, 482 for lying side, 481 for lying crouched, and 2001 for other images. The total number of images for the training, validation, and testing was 44,874, 7913, and 6934, respectively.

Figure 15. Data configuration for the training, validation, and testing data sets.

4.2. Experimental Results

The computer used in the experiment had the following specifications: Intel i7-6850K central processing unit at 3.60 GHz, Nvidia GeForce GTX 1080 Ti, 64 GB of random-access memory, and the Windows 10 64-bit operating system. The CNN models used for posture recognition in this study were VGG16, VGG19 [19], ResNet50 [20], DenseNet201 [21], InceptionResNetV2 [22], and Xception [23], which are pre-trained CNNs. A pre-trained CNN can be used for transfer learning when new data need to be classified efficiently. Most of the parameters used to extract the feature maps were fixed, and only the final FC layers and Softmax were trained. A simple CNN (the conventional method) was added to the experiment for performance comparison. The structure of the simple CNN is depicted in Figure 16. The simple CNN was trained by updating all of its parameters, because it was not a pre-trained model. Here, the number of postures was the number of classes. The number of defined postures was 10: “standing normal”, “standing bent”, “sitting sofa”, “sitting chair”, “sitting floor”, “sitting squat”, “lying face”, “lying back”, “lying side”, and “lying crouched”. If a class of “other images” (e.g., background) was added to the 10 posture classes, then the total number of classes was 11. The training, validation, and testing data sets for the 11 classes consisted of 44,874, 7913, and 6934 images, respectively. Additionally, the 10 posture classes were divided into four categories: standing, sitting, lying, and lying crouched. If a class for “other images” was added to these four posture categories, the total number of classes was five. The performance for the other numbers of classes was measured through simple mapping of the model trained with the data of 11 classes. The simple CNN was trained with a batch size of 128, 50 epochs, and the RMSProp optimizer [41]. The other pre-trained models were trained via transfer learning with a batch size of 128, 30 epochs, and the Adadelta optimizer [42].
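The transfer-learning setup described above, with a frozen feature extractor and newly trained FC and Softmax layers, can be sketched as follows. This is a minimal sketch assuming TensorFlow/Keras; the 224 × 224 input size, the 256-unit FC layer, and the data generators are assumptions rather than details given in the paper.

```python
import tensorflow as tf

NUM_CLASSES = 11  # 10 postures + "other images"

# Pre-trained feature extractor with frozen weights (transfer learning).
base = tf.keras.applications.VGG19(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # only the new FC/Softmax layers below are trained

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),            # assumed FC size
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # Softmax output
])

model.compile(optimizer=tf.keras.optimizers.Adadelta(),
              loss="categorical_crossentropy", metrics=["accuracy"])

# model.fit(train_generator, validation_data=val_generator, epochs=30)
# (the reported batch size of 128 would be set in the data generators)
```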
The highest accuracies without data augmentation under transfer learning were 78.35% for VGG19 and type 2 preprocessing with the posture data of 11 classes, 69.89% for VGG19 and type 2 preprocessing with the posture data of 10 classes, 88.86% for VGG19 and type 2 preprocessing with the posture data of five classes, and 88.50% for ResNet50 and type 1 preprocessing with the posture data of four classes. Table 3 presents the classification performance of the CNNs without data augmentation under transfer learning. Figure 17 shows the confusion matrix for VGG19 with type 2 preprocessing, which exhibited the best performance for the posture data of 11 classes without data augmentation under transfer learning. Tables 4–6 present the accuracies of the CNNs for the posture data of 10, 5, and 4 classes, respectively, without data augmentation under transfer learning.

Figure 16. Structure of the simple CNN.

Table 3. Classification performance of the CNNs without data augmentation under transfer learning (11 classes).

Single Model (%)                 Type 0  Type 1  Type 2  Type 3  Type 4
SimpleCNN           Training      91.50   70.82   81.12   85.42   68.87
                    Validation    75.65   58.70   63.74   57.98   59.34
                    Testing       45.80   61.03   67.18   60.85   63.48
VGG16               Training      89.40   88.44   91.72   85.87   85.76
                    Validation    74.50   74.62   76.92   72.65   74.59
                    Testing       63.44   70.12   76.64   70.78   75.31
VGG19               Training      90.09   91.17   89.56   83.06   86.18
                    Validation    74.51   74.22   77.18   73.12   74.93
                    Testing       64.87   69.82   78.35   71.46   74.94
ResNet50            Training      60.03   71.56   60.67   67.48   68.05
                    Validation    58.44   69.10   59.92   65.61   65.64
                    Testing       55.60   66.83   57.92   65.55   61.81
DenseNet201         Training      62.17   70.52   62.53   70.65   59.44
                    Validation    60.68   67.56   61.27   68.44   58.34
                    Testing       54.51   64.16   55.88   66.51   50.68
InceptionResNetV2   Training      44.98   55.63   55.10   59.68   56.01
                    Validation    43.86   53.67   54.71   59.39   55.90
                    Testing       40.28   49.35   53.20   59.61   52.48
Xception            Training      43.56   57.12   51.18   57.21   56.52
                    Validation    42.34   54.73   49.62   55.98   54.15
                    Testing       34.74   50.03   46.09   54.66   52.32


Figure 17. Confusion matrix for VGG19 with preprocessing of type 2 without data augmentation.

Table 4. Classification performance of the CNNs without data augmentation under transfer learning (10 classes).

Single Model (%)      Type 0  Type 1  Type 2  Type 3  Type 4
SimpleCNN              25.36   48.16   53.88   51.63   51.54
VGG16                  48.95   62.63   67.34   63.65   66.79
VGG19                  51.18   61.94   69.89   64.71   66.95
ResNet50               53.12   66.30   62.76   60.36   64.90
DenseNet201            55.01   62.59   61.03   61.37   57.10
InceptionResNetV2      43.79   56.24   55.60   54.68   58.26
Xception               36.16   52.27   53.69   50.03   55.27

Table 5. Classification performance of the CNNs without data augmentation under transfer learning (5 classes).

Single Model (%)      Type 0  Type 1  Type 2  Type 3  Type 4
Raw                    56.29   74.40   79.53   72.92   78.46
VGG16                  75.52   79.82   88.05   83.36   85.84
VGG19                  76.23   80.67   88.86   83.67   86.47
ResNet50               74.91   84.35   78.73   82.39   80.43
DenseNet201            70.87   79.73   77.08   81.73   73.60
InceptionResNetV2      57.96   68.76   72.99   74.71   74.08
Xception               52.08   66.72   66.19   67.19   69.41


Table 6. Classification performance of the CNNs without data augmentation under transfer learning (4 classes).

Single Model (%)      Type 0  Type 1  Type 2  Type 3  Type 4
Raw                    46.40   69.94   74.51   70.31   74.99
VGG16                  69.71   77.73   85.25   82.17   83.32
VGG19                  70.71   78.64   86.34   82.63   84.51
ResNet50               78.25   88.50   86.99   83.46   87.08
DenseNet201            75.36   82.73   85.59   82.44   83.35
InceptionResNetV2      64.67   77.96   79.46   75.49   83.11
Xception               57.41   72.38   76.00   67.58   75.63

Next, the experiment was performed in the same way, but data augmentation was added for the training. The data were augmented with a rotation range of 10, a shear range of 10, a zoom range of 0.2, and horizontal flipping enabled, as sketched below. The highest accuracies with data augmentation under transfer learning were 76.54% for VGG19 and type 2 preprocessing with the posture data of 11 classes, 67.47% for VGG19 and type 2 preprocessing with the posture data of 10 classes, 87.92% for VGG19 and type 2 preprocessing with the posture data of five classes, and 85.44% for DenseNet201 and type 2 preprocessing with the posture data of four classes. Table 7 presents the classification performance of the CNNs with data augmentation under transfer learning. Figure 18 shows the confusion matrix for VGG19 and type 2 preprocessing, which exhibited the best performance for the posture data of 11 classes with data augmentation under transfer learning. Tables 8–10 present the accuracies of the CNNs for the posture data of 10, 5, and 4 classes, respectively, with data augmentation under transfer learning.
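The stated augmentation settings map naturally onto Keras-style image augmentation; the following is a minimal sketch under that assumption, where the rescaling factor, the target size, and the "data/train" directory are placeholders rather than details from the paper.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation settings reported in the text.
datagen = ImageDataGenerator(rotation_range=10,
                             shear_range=10,
                             zoom_range=0.2,
                             horizontal_flip=True,
                             rescale=1.0 / 255)  # rescaling is an assumption

# "data/train" is a placeholder path; one subfolder per class (11 classes).
train_generator = datagen.flow_from_directory("data/train",
                                              target_size=(224, 224),
                                              batch_size=128,
                                              class_mode="categorical")
```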

Table 7. Classification performance of the CNNs with data augmentation under transfer learning (11 classes).

Single Model (%)                 Type 0  Type 1  Type 2  Type 3  Type 4
SimpleCNN           Training      69.89   66.35   65.54   62.11   67.34
                    Validation    62.92   59.47   63.02   43.42   61.97
                    Testing       42.70   52.63   64.35   50.00   60.15
VGG16               Training      82.07   82.77   86.78   81.40   82.16
                    Validation    73.73   73.02   76.40   72.37   74.29
                    Testing       61.62   67.35   76.05   68.84   72.01
VGG19               Training      82.69   83.98   85.57   82.80   83.27
                    Validation    73.13   73.21   76.74   72.64   74.14
                    Testing       62.86   67.39   76.54   69.71   73.67
ResNet50            Training      58.33   67.89   57.76   66.10   62.82
                    Validation    56.74   64.85   55.84   63.07   61.36
                    Testing       53.43   60.67   49.63   63.04   54.15
DenseNet201         Training      68.33   71.84   73.12   67.96   67.00
                    Validation    66.35   17.27   70.03   18.34   63.61
                    Testing       57.04   66.41   64.25   61.08   54.24
InceptionResNetV2   Training      50.22   59.04   58.68   58.99   60.40
                    Validation    48.10   15.05   56.78   15.15   58.42
                    Testing       44.58   52.55   52.91   56.69   57.27
Xception            Training      44.31   57.97   54.08   55.36   57.00
                    Validation    42.85   15.09   09.51   14.30   55.22
                    Testing       35.15   52.54   45.63   53.53   50.94


Figure 18. Confusion matrix for VGG19 with preprocessing of type 2.

Table 8. Classification performance of the CNNs with data augmentation under transfer learning (10 classes).

Single Model (%)      Type 0  Type 1  Type 2  Type 3  Type 4
SimpleCNN              24.20   38.35   49.95   40.17   46.41
VGG16                  46.99   61.19   66.78   61.65   63.88
VGG19                  48.94   60.45   67.47   62.40   65.12
ResNet50               48.85   64.00   58.69   58.22   59.42
DenseNet201            54.66   63.75   64.24   59.67   60.11
InceptionResNetV2      46.22   55.96   53.21   53.93   59.68
Xception               38.02   54.36   50.43   47.67   53.82

Table 9. Classification performance of the CNNs with data augmentation under transfer learning (5 classes).

Single Model (%)      Type 0  Type 1  Type 2  Type 3  Type 4
Raw                    51.43   60.43   78.71   61.96   73.37
VGG16                  72.37   78.44   87.52   82.19   83.76
VGG19                  74.44   78.19   87.92   82.49   84.83
ResNet50               69.54   78.70   73.47   78.49   75.94
DenseNet201            71.00   80.70   81.17   79.79   75.83
InceptionResNetV2      61.67   71.71   73.74   72.34   77.13
Xception               52.50   68.65   66.38   66.55   68.93


Table 10. Classification performance of the CNNs with data augmentation under transfer learning (4 classes).

Single Model (%)      Type 0  Type 1  Type 2  Type 3  Type 4
Raw                    42.31   53.72   73.55   58.97   68.33
VGG16                  66.13   77.50   84.75   81.15   81.78
VGG19                  68.83   76.71   85.24   81.22   82.46
ResNet50               70.85   85.37   85.05   79.46   84.69
DenseNet201            73.14   82.72   85.44   83.68   84.92
InceptionResNetV2      67.04   78.69   79.21   74.61   83.64
Xception               58.73   73.89   73.76   66.28   75.34

Next, the CNNs were trained via nontransfer learning. VGG19, ResNet50, ResNet101, and InceptionResNetV2 were considered for nontransfer learning. The first three models were trained with a batch size of 60, three epochs, and the Adam optimizer [43]; the last model was trained with a batch size of 20, one epoch, and the Adam optimizer. The highest accuracy was 93.32%, obtained by InceptionResNetV2 with type 2 preprocessing for the posture data of 11 classes without data augmentation under nontransfer learning. Table 11 presents the classification performance for 11 classes without augmentation under nontransfer learning. Table 12 presents the training time of a single model for 11 classes without augmentation under nontransfer learning. Figure 19 shows the training process for InceptionResNetV2 and type 2 preprocessing, which exhibited the best performance for the posture data of 11 classes without data augmentation under nontransfer learning. Figure 20 shows the activations for InceptionResNetV2 and type 2 preprocessing. Figure 21 shows the activations of the last ReLU for ResNet50 and type 2 preprocessing.
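Nontransfer learning differs from the earlier setup in that every layer is updated. A minimal Keras-style sketch under that assumption is given below; whether the weights were initialized randomly or from ImageNet is not stated here, so random initialization is assumed, and the 224 × 224 input size and data generators are placeholders.

```python
import tensorflow as tf

NUM_CLASSES = 11

# Randomly initialized backbone; every layer is trainable (nontransfer learning).
base = tf.keras.applications.ResNet50(weights=None, include_top=False,
                                      input_shape=(224, 224, 3))
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="categorical_crossentropy", metrics=["accuracy"])

# model.fit(train_generator, validation_data=val_generator, epochs=3)
# (the reported batch size of 60 would be set in the data generators)
```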

Table 11. Classification performance for 11 classes without augmentation under nontransfer learning.

Single Model (%)                 Type 0  Type 1  Type 2  Type 3  Type 4
VGG19               Training      91.67   90.99   91.20   89.88   87.37
                    Validation    90.42   90.19   89.76   88.79   86.75
                    Testing       88.40   87.76   90.22   88.29   87.06
ResNet50            Training      95.47   96.37   96.16   95.60   95.62
                    Validation    92.70   92.30   93.54   92.04   92.54
                    Testing       89.99   88.77   91.78   88.43   90.48
ResNet101           Training      96.81   96.02   96.11   95.16   95.58
                    Validation    93.86   92.44   92.90   91.97   91.94
                    Testing       90.93   88.04   91.29   88.32   89.17
InceptionResNetV2   Training      95.15   95.25   95.65   92.80   94.31
                    Validation    93.78   91.89   94.01   92.04   91.90
                    Testing       92.54   88.98   93.32   89.50   91.09

Table 12. Training time of a single model for 11 classes without augmentation under nontransfer learning.

Single Model (hh:mm)             Type 0  Type 1  Type 2  Type 3  Type 4
VGG19               Training      26:35   28:38   27:04   27:37   28:10
ResNet50            Training      10:39   12:22   12:09   08:02   08:45
ResNet101           Training      25:46   27:05   27:30   26:09   27:01
InceptionResNetV2   Training      53:23   57:54   59:09   58:25   61:44

Figure 19. Training process for InceptionResNetV2 with preprocessing of type 2: (a) accuracy; (b) loss.

Figure 20. Activations for InceptionResNetV2 with preprocessing of type 2, which exhibited the best performance for the posture data of 11 classes without data augmentation under nontransfer learning: (a) activations of the first convolution; (b) activations of the last convolution.

Figure 21. Activations of the last ReLU for InceptionResNetV2 with preprocessing of type 2, which exhibited the best performance for the posture data of 11 classes without data augmentation under nontransfer learning.


The trained CNN models from Table 11 were combined using the majority of the predicted outputs and the average of the softmax scores. First, the five CNNs trained with the different input types were combined under VGG19, ResNet50, ResNet101, and InceptionResNetV2 as EV19TNet, ER50TNet, ER101TNet, and EIR2TNet, respectively. Table 13 describes the classification performance of the ensemble deep models designed by input types. Second, the four CNNs trained with the different models were combined under input types 0 to 4 as ET0MNet, ET1MNet, ET2MNet, ET3MNet, and ET4MNet, respectively. Table 14 presents the classification performance of the ensemble deep models designed by the pre-trained CNNs. Table 15 presents the training time of the ensemble systems for 11 classes without augmentation under nontransfer learning. Among the ensemble systems, the highest accuracy was 95.34%, achieved by the ensemble of InceptionResNetV2s using the average of scores.
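Both combination rules are simple to express: majority voting counts the arg-max class predicted by each member CNN, while the average method takes the arg-max of the mean softmax scores. The NumPy sketch below illustrates the two rules on hypothetical score arrays; it is an illustration of the combination step only, not the authors' code.

```python
import numpy as np

def ensemble_majority(scores):
    """scores: array of shape (n_models, n_samples, n_classes).
    Each model votes with its arg-max class; ties go to the smallest label."""
    votes = scores.argmax(axis=2)                      # (n_models, n_samples)
    n_classes = scores.shape[2]
    counts = np.apply_along_axis(
        lambda v: np.bincount(v, minlength=n_classes), 0, votes)
    return counts.argmax(axis=0)                       # (n_samples,)

def ensemble_average(scores):
    """Average the softmax scores over models, then take the arg-max class."""
    return scores.mean(axis=0).argmax(axis=1)          # (n_samples,)

# Hypothetical example: 5 member CNNs, 3 samples, 11 classes.
rng = np.random.default_rng(0)
scores = rng.random((5, 3, 11))
scores /= scores.sum(axis=2, keepdims=True)            # normalize like softmax scores
print(ensemble_majority(scores), ensemble_average(scores))
```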

Table 13. Classification performance of ensemble deep models designed by input types.

Ensemble System           Majority  Average
EV19TNet     Testing        92.77    93.42
ER50TNet     Testing        93.70    94.12
ER101TNet    Testing        93.94    94.51
EIR2TNet     Testing        94.76    95.34

Table 14. Classification performance of ensemble deep models designed by the pre-trained CNNs.

Ensemble System           Majority  Average
ET0MNet      Testing        93.08    94.07
ET1MNet      Testing        91.53    92.77
ET2MNet      Testing        93.94    94.58
ET3MNet      Testing        91.30    92.31
ET4MNet      Testing        92.86    93.67

Table 15. Training time of ensemble deep models for 11 classes without augmentation under nontransfer learning.

Ensemble System (hh:mm)   Majority
EV19TNet     Training      138:04
ER50TNet     Training       51:57
ER101TNet    Training      133:31
EIR2TNet     Training      290:35
ET0MNet      Training      116:23
ET1MNet      Training      125:59
ET2MNet      Training      125:52
ET3MNet      Training      120:13
ET4MNet      Training      125:40

The classification performances listed in Tables 3–10 were averaged for comparison. Figure 22 shows the average values of the classification performance under transfer learning for the different types of preprocessing. The experimental results for preprocessing of type 2 showed a 13.63%, 3.12%, 2.79%, and 0.76% higher classification rate than types 0, 1, 3, and 4, respectively. Figure 23 shows a comparison of the average total accuracies for the different models under transfer learning. Here, ‘SimCNN’, ‘DenseNet’, and ‘IncResNet’ denote the simple CNN, DenseNet201, and InceptionResNetV2, respectively. The experimental results for VGG19 showed 15.78%, 0.78%, 3.53%, 4.11%, 10.61%, and 16.02% higher classification performance than the simple CNN, VGG16 [19], ResNet50 [20], DenseNet201 [21], InceptionResNetV2 [22], and Xception [23], respectively.


Figure 22. Comparison of the average total accuracies for the different types of preprocessing under transfer learning.

Figure 23. Comparison of the average total accuracies for the different models under transfer learning.

Figure 24 visualizes the classification performance of the ensemble deep models from the results listed in Tables 13 and 14. Here, ‘BestSingle’ is the best result of the single models listed in Table 11. As shown in Figure 24, the ensemble deep model by InceptionResNetV2s with the average method showed the best classification performance in comparison to the other models.


Figure 24. Performance comparison by various ensemble deep models.

5. Conclusions

We present ensemble deep models designed by pre-trained CNNs and various types of preprocessing for posture recognition under various home environments. The pre-trained CNNs used in this paper were based on VGGNet, ResNet, DenseNet, InceptionResNet, and Xception, which are frequently used in conjunction with deep learning. Finally, we performed systematic experiments on a large posture database constructed by the Electronics and Telecommunications Research Institute (ETRI). The experimental results reveal that the ensemble deep model shows good performance in comparison with the pre-trained CNN itself. Thus, we expect that the presented method will be an important technique for helping elderly people cope with sudden danger in home environments. In future research, we shall study the behavior recognition of elderly people using a large 3D database for behavior recognition constructed under home service robot environments.

Author Contributions: Conceptualization, Y.-H.B., J.-Y.L., and K.-C.K.; methodology, Y.-H.B., J.-Y.L., and K.-C.K.; software, Y.-H.B., J.-Y.L., D.-H.K., and K.-C.K.; validation, Y.-H.B., J.-Y.L., D.-H.K., and K.-C.K.; formal analysis, Y.-H.B. and K.-C.K.; investigation, Y.-H.B., J.-Y.L., and K.-C.K.; resources, J.-Y.L., D.-H.K., and K.-C.K.; data curation, Y.-H.B., J.-Y.L., D.-H.K., and K.-C.K.; writing—original draft preparation, Y.-H.B.; writing—review and editing, J.-Y.L. and K.-C.K.; visualization, Y.-H.B., J.-Y.L., and K.-C.K.; supervision, K.-C.K.; project administration, J.-Y.L.; funding acquisition, J.-Y.L. All authors have read and agreed to the published version of the manuscript.

Funding: This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2017R1A6A1A03015496). This work was supported by the ICT R&D program of MSIT/IITP (2017-0-00162, Development of Human-care Robot Technology for Aging Society).

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Park, J.H.; Song, K.H.; Kim, Y.S. A kidnapping detection using human pose estimation in intelligent video surveillance systems. J. Korea Soc. Comput. Inf. 2018, 23, 9–16.
2. Qiang, B.; Zhang, S.; Zhan, Y.; Xie, W.; Zhao, T. Improved convolutional pose machines for human pose estimation using image sensor data. Sensors 2019, 19, 718.
3. Huang, Z.; Liu, Y.; Fang, Y.; Horn, B.K.P. Video-Based Fall Detection for Seniors with Human Pose Estimation. In Proceedings of the 4th International Conference on Universal Village, Boston, MA, USA, 21–24 October 2018; pp. 1–4.
4. Chan, K.C.; Koh, C.K.; Lee, C.S.G. An automatic design of factors in a human-pose estimation system using neural networks. IEEE Trans. Syst. Man Cybern. Syst. 2016, 46, 875–887.
5. Veges, M.; Varga, V.; Lorincz, A. 3D human pose estimation with siamese equivariant embedding. Neurocomputing 2019, 339, 194–201.
6. Stommel, M.; Beetz, M.; Wu, W. Model-free detection, encoding, retrieval, and visualization of human poses from Kinect data. IEEE Asme Trans. Mechatron. 2015, 20, 865–875.


7. Shum, H.P.H.; Ho, E.S.L.; Jiang, Y.; Takagi, S. Real-time posture reconstruction for Microsoft Kinect. IEEE Trans. Cybern. 2013, 43, 1357–1369.
8. Lee, J.; Joo, H.; Lee, J.; Chee, Y. Automatic classification of squat posture using inertial sensors: Deep learning approach. Sensors 2020, 20, 361.
9. Chowdhury, I.R.; Saha, J.; Chowdhury, C. Detailed Activity Recognition with Smartphones. In Proceedings of the Fifth International Conference on Emerging Applications of Information Technology, Kolkata, India, 12–13 January 2018; pp. 1–4.
10. Wu, Z.; Zhang, J.; Chen, K.; Fu, C. Yoga posture recognition and quantitative evaluation with wearable sensors based on two-stage classifier and prior Bayesian network. Sensors 2019, 19, 5129.
11. Idris, M.I.; Zabidi, A.; Yassun, I.M.; Ali, M.S.A.M. Human Posture Recognition Using Android Smartphone and Artificial Neural Network. In Proceedings of the IEEE Control and System Graduate Research Colloquium, Shah Alam, Malaysia, 10–11 August 2015; pp. 120–124.
12. Pak, M.S.; Kim, S.H. A Review of Deep Learning in Image Recognition. In Proceedings of the International Conference on Computer Applications and Information Processing Technology, Kuta Bali, Indonesia, 8–10 August 2017; pp. 1–3.
13. Diao, W.; Sun, X.; Zheng, X.; Dou, F.; Wang, H.; Fu, K. Efficient saliency-based object detection in remote sensing images using deep belief networks. IEEE Geosci. Remote Sens. Lett. 2016, 13, 137–141.
14. Sasaki, H.; Horiuchi, T.; Kato, S. A Study on Vision-Based Mobile Robot Learning by Deep Q-Network. In Proceedings of the Annual Conference of the Society of Instrument and Control Engineers, Kanazawa, Japan, 19–22 September 2017; pp. 799–804.
15. Chang, C.H. Deep and shallow architecture of multilayer neural networks. IEEE Neural Netw. Learn. Syst. 2015, 26, 2477–2486.
16. Mou, L.; Ghamisi, P.; Zhu, X.X. Deep recurrent neural networks for hyperspectral image classification. IEEE Geosci. Remote Sens. 2017, 55, 3639–3655.
17. Callet, P.L.; Viard-Gaudin, C.; Barba, D. A convolutional neural network approach for objective video quality assessment. IEEE Neural Netw. 2006, 17, 1316–1327.
18. Hou, J.C.; Wang, S.S.; Lai, Y.H.; Tsao, Y.; Chang, H.W.; Wang, H.M. Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Emerg. Top. Comput. Intell. 2018, 2, 117–128.
19. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2015, arXiv:1409.1556v6. Available online: https://arxiv.org/abs/1409.1556 (accessed on 5 December 2019).
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
21. Huang, G.; Liu, Z.; Maaten, L.V.D.; Weinberger, K.Q. Densely connected convolutional networks. arXiv 2018, arXiv:1608.06993v5. Available online: https://arxiv.org/abs/1608.06993 (accessed on 5 December 2019).
22. Szegedy, C.; Ioffe, S.; Vanhoucke, V. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv 2016, arXiv:1602.07261v2. Available online: https://arxiv.org/abs/1602.07261 (accessed on 5 December 2019).
23. Chollet, F. Xception: Deep learning with depthwise separable convolutions. arXiv 2017, arXiv:1610.02357v3. Available online: https://arxiv.org/abs/1610.02357 (accessed on 8 December 2019).
24. Shao, L.; Zhu, F.; Li, X. Transfer learning for visual categorization: A survey. IEEE Neural Netw. Learn. Syst. 2015, 26, 1019–1034.
25. Tompson, J.; Jain, A.; LeCun, Y.; Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. arXiv 2014, arXiv:1406.2984v2. Available online: https://arxiv.org/abs/1406.2984 (accessed on 8 December 2019).
26. Carreira, J.; Agrawal, P.; Fragkiadaki, K.; Malik, J. Human pose estimation with iterative error feedback. arXiv 2016, arXiv:1507.06550v3. Available online: https://arxiv.org/abs/1507.06550 (accessed on 8 December 2019).
27. Pishchulin, L.; Insafutdinov, E.; Tang, S.; Andres, B.; Andriluka, M.; Gehler, P.; Schiele, B. DeepCut: Joint subset partition and labeling for multi person pose estimation. arXiv 2016, arXiv:1511.06645v2. Available online: https://arxiv.org/abs/1511.06645 (accessed on 8 December 2019).
28. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083v2. Available online: https://arxiv.org/abs/1504.08083 (accessed on 8 December 2019).

29. Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. Adv. Concepts Intell. Vis. Syst. 2016, 9910, 34–50.
30. Georgakopoulos, S.V.; Kottari, K.; Delibasis, K.; Plagianakos, V.P.; Maglogiannis, I. Pose recognition using convolutional neural networks on omni-directional images. Neurocomputing 2018, 280, 23–31.
31. Liu, Y.; Xu, Y.; Li, S.B. 2-D Human Pose Estimation from Images Based on Deep Learning: A Review. In Proceedings of the 2nd IEEE Advanced Information Management, Communicates, Electronic and Automation Control Conference, Xi’an, China, 25–27 May 2018; pp. 462–465.
32. Lee, J.N.; Byeon, Y.H.; Kwak, K.C. Design of ensemble stacked auto-encoder for classification of horse gaits with MEMS inertial sensor technology. Micromachines 2018, 9, 411.
33. Maguolo, G.; Nanni, L.; Ghidoni, S. Ensemble of convolutional neural networks trained with different activation features. arXiv 2019, arXiv:1905.02473v4. Available online: https://arxiv.org/abs/1905.02473 (accessed on 10 December 2019).
34. Kim, M.G.; Pan, S.B. Deep learning based on 1-D ensemble networks using ECG for real-time user recognition. IEEE Trans. Ind. Inform. 2019, 15, 5656–5663.
35. Kahlouche, S.; Ouadah, N.; Belhocine, M.; Boukandoura, M. Human Pose Recognition and Tracking Using RGB-D Camera. In Proceedings of the 8th International Conference on Modelling, Identification and Control, Algiers, Algeria, 15–17 November 2016; pp. 520–525.
36. Na, Y.J.; Wang, C.W.; Jung, H.Y.; Ho, J.G.; Choi, Y.K.; Min, S.D. Real-Time Sleep Posture Recognition Algorithm Using Kinect System. In Proceedings of the Korean Institute of Electrical Engineers Conference on Biomedical System, Hoengseong, Korea, 15–17 February 2016; pp. 27–30.
37. Kim, S.C.; Cha, J.H. Posture Recognition and Spatial Cognition with Hybrid Sensor. In Proceedings of the Korean Society of Precision Engineering Conference, Jeju, Korea, 29–31 May 2013; pp. 971–972.
38. Giancola, S.; Valenti, M.; Sala, R. A survey on 3D cameras: Metrological comparison of time-of-flight, structured-light and active stereoscopy technologies. In Springer Briefs in Computer Science; Zdonik, S., Shekhar, S., Wu, X., Jain, L.C., Padua, D., Shen, X.S., Furht, B., Subrahmanian, V.S., Hebert, M., Ikeuchi, K., et al., Eds.; Springer Nature: Cham, Switzerland, 2018; pp. 29–39. ISBN 9783319917603.
39. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767v1. Available online: https://arxiv.org/abs/1804.02767 (accessed on 10 December 2019).
40. Choi, H.S.; Lee, B.H.; Yoon, S.R. Biometric authentication using noisy electrocardiograms acquired by mobile sensors. IEEE Access 2016, 4, 1266–1273.
41. RMSprop Optimization Algorithm for Gradient Descent with Neural Networks. Available online: https://insidebigdata.com/2017/09/24/rmsprop-optimization-algorithm-gradient-descent-neural-networks/ (accessed on 16 May 2019).
42. Zeiler, M.D. Adadelta: An adaptive learning rate method. arXiv 2012, arXiv:1212.5701v1. Available online: https://arxiv.org/abs/1212.5701 (accessed on 13 December 2019).
43. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2017, arXiv:1412.6980v9. Available online: https://arxiv.org/abs/1412.6980 (accessed on 13 December 2019).

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).