Journal of Imaging

Article

Image Processing Technique and Hidden Markov Model for an Elderly Care Monitoring System

Swe Nwe Nwe Htun 1,* , Thi Thi Zin 2 and Pyke Tin 3

1 Interdisciplinary Graduate School of Agriculture and Engineering, University of Miyazaki, Miyazaki 889-2192, Japan
2 Graduate School of Engineering, University of Miyazaki, Miyazaki 889-2192, Japan; [email protected]
3 International Relation Center, University of Miyazaki, Miyazaki 889-2192, Japan; [email protected]
* Correspondence: [email protected]

 Received: 19 May 2020; Accepted: 10 June 2020; Published: 13 June 2020 

Abstract: Advances in image processing technologies have provided more precise views in medical and health care management systems. Among many other topics, this paper focuses on several aspects of video-based monitoring systems for elderly people living independently. Major concerns are patients with chronic diseases and adults with a decline in physical fitness, as well as falling among elderly people, which is a source of life-threatening injuries and a leading cause of death. Therefore, in this paper, we propose a vision-based monitoring system that uses image processing technology and a Hidden Markov Model to differentiate falls from normal states. Specifically, the proposed system is composed of four modules: (1) object detection; (2) feature extraction; (3) analysis for differentiating normal states from falls; and (4) a decision-making process for sequential abnormal and normal states using a Hidden Markov Model. In the object detection module, background and foreground segmentation is performed by applying the Mixture of Gaussians model, and a graph cut is applied for foreground refinement. In the feature extraction module, the postures and positions of detected objects are estimated by applying hybrid features based on the virtual grounding point, together with the related area and the aspect ratio of the object. In the analysis module, statistical computations called the moving average and the modified difference are conducted to differentiate normal, abnormal, and falling states; both are employed to estimate the points and periods of falls. Then, the local maximum or local minimum and the half width value are determined from the observed modified difference to estimate the period of a falling state more precisely. Finally, the decision-making process is conducted by developing a Hidden Markov Model. Experiments were conducted on the Le2i fall detection dataset, and the results show that our proposed system is robust and reliable and has a high detection rate.

Keywords: normal and abnormal states; fall detection; Mixture of Gaussians; graph cut; virtual grounding point; Hidden Markov Model

1. Introduction

The risk of falling is a huge issue, not only for older adults in aging societies, but also for younger adults with drug and alcohol problems and for patients with chronic diseases. Falling has serious health consequences and is a leading cause of death. According to a survey by the World Health Organization [1], falls occur in 28–35% of individuals between the ages of 65 and 70 and in 42% of those over 70 years old. Falling is particularly dangerous for persons who live alone in an indoor environment, because much time can pass before they receive assistance. In this situation, many countries are adopting policies to increase life expectancy by providing extra care to people living independently. For this reason, much research has focused on developing a robust fall-detection process in smart home systems using their respective specialized technologies. Progress in developing such intelligent technologies holds the promise of improving the quality of life for the aged and infirm. Popular fall detection and prediction systems have emerged based on wearable monitoring devices, ambient devices, vision-based devices, and portable devices. The most commonly used wearable devices feature accelerometers and gyroscopes embedded in belts, watches, or pendants that are reasonably comfortable to wear. However, some of these devices have disadvantages that limit their usability, such as excessive power consumption. Some of these devices produce false alarms when triggered by normal body movements, and some require the manual activation of an alarm after a fall. In addition, the elderly often forget to wear the devices. Nevertheless, wearable devices based on machine learning and intelligent systems are effective for monitoring people in both indoor and outdoor environments. Various interesting approaches for fall detection systems using wearable devices are discussed in [2–5], in which accelerometers and gyroscopes are employed to collect data on the emotional state and body movements of subjects. The processing of data collected using ambient devices provides information without demanding user intervention. Commercially available devices using advanced technologies include presence sensors, motion sensors, and vibration sensors, which can be embedded in an indoor environment, such as on furniture. Presence sensors can detect the tiniest movements of subjects, such as the movement of a finger, with high resolution and precision. They can easily be set on high ceilings, as well as the floor, to detect the smallest movement. Compared to presence sensors, motion sensors are useful in perceiving the arm movements involved in walking in detection zones. Such zones are selected as high-traffic areas for detecting moving objects in busy indoor or outdoor areas. Vibration sensors are useful for detecting falling events, and can distinguish activities through vibrations. Vibration sensors installed on floors or beds acquire vibration data that can be analyzed by determining the fall detection ratio [6]. Although such sensors do not disturb the people involved, they can generate false alarms. In recent years, the development of smart phone technologies incorporating accelerometers has provided some very interesting approaches for healthcare monitoring systems, including fall detection. Mobile phone-based fall detection algorithms have been presented in [7–10]. However, these devices are not useful if the monitored subject does not always have them at hand. For these reasons, video monitoring systems based on computer vision and machine learning are potentially more beneficial and reliable for fall detection. Vision-based devices utilizing cameras have some of the same limitations as ambient devices, such as the fact that the devices must be installed in several places to provide full coverage of the required areas over a long period of time.
However, video surveillance systems can effectively predict specific human activities, such as walking, sitting down, going to bed, and getting up from bed [11,12], as well as detecting fall events [13]. Moreover, a large amount of visual information is captured in the video record of falling events. The various definitions of falling events, as well as the reasons and circumstances behind such events and other abnormal occurrences in real-world environments, should be visually analyzed in monitoring systems. The vision-based detection of abnormal or fall events could become an important tool in elderly care systems. In developing better vision-based video monitoring systems, abnormal event detection holds great promise, but faces many challenges in real-world environments. In developing applications for this purpose, establishing a fall detection system presents difficulties because the dynamic conditions involved in falls are not well-understood. As an important focus, researchers must investigate the universal features common to all falling events. Therefore, we must differentiate falls from normal activities in a comprehensive way that reflects the scene of falling scenarios. Any system developed must be robust in the face of changing postures and positions, such as a loss of balance and abrupt changes in direction. These conditions must be considered in reliably assessing abnormal events.

However, research trends and current best practices indicate that the prospects are great for using vision-based monitoring to improve the quality of health care. Therefore, we propose a vision-based system for fall detection that differentiates abnormal behavior or falls from normal states. This system combines simple feature extraction with effective statistical analyses. Key to detecting abnormal or falling events is the recognition that they involve a loss of balance. The three research objectives of this proposed system are as follows:

• Develop a monitoring system that provides a visual understanding of a person’s situation and can judge whether the state is abnormal or normal based on video data acquired using a simple and affordable RGB camera;
• Develop an individualized and modified statistical analysis on each of the extracted features, providing trustworthy information, not only on the definite moment of a fall, but also on the period of a fall;
• Develop an efficient way of using a Hidden Markov Model (HMM) for the detailed detection of sequential abnormal and normal states for the person being monitored.

Our proposed system intends to establish a long-term monitoring system for facilitating independent living. Major steps in the system include (1) feature extraction for estimating the positions and postures of the person by utilizing the virtual grounding point (VGP) concept and related visual features inspired by our previous research [13]; (2) modified statistical analysis to estimate the time interval or period for falls and normal states through extensive feature observation; and (3) the establishment of a Hidden Markov Model (HMM) to detect the sequential normal and fall states of the person. The rest of the paper is organized as follows: Section 2 presents related works; Section 3 presents the theoretical analysis and methodologies of the proposed system; Section 4 presents and evaluates experimental results, comparing the robustness and limitations of the proposed system with existing algorithms; and finally, Section 5 presents the conclusions of this work.

2. Related Works

Related state-of-the-art fall detection systems will be discussed in this section. In order to effectively define the postures assumed by a human object, it is very important to perform feature extraction or selection, analyze the selected features, and set detection rules in a vision-based video monitoring system, especially for monitoring in health care. The most common feature extraction methods used in fall detection systems involve the human shape and motion-history images. The fall detection system presented in [14] is a method based on motion history images (MHI) and changes in the human shape. In this system, information on the history of motion occurring over a fixed interval can be obtained from the MHI. Then, the human shape is obtained by constructing a blob using an approximate ellipse. Finally, fall detection is achieved by considering three factors: motion quantification, analysis of the human shape, and the lack of motion after a fall. Motion quantification allows sudden motion to be detected when a person falls. The approximated ellipse constructed on the object can provide information about changes in the human shape, more precisely, changes in orientation. The final analysis provides a moving ellipse, which indicates whether the person is moving after a fall. The decision confirming a fall is made when the ellipse stops moving for five seconds after a fall. The video sequences are captured using wall-mounted cameras to cover wide areas, and the results are presented as 2D motion and shape information. Extended work on the system presented in [15] involves both 2D and 3D information for fall detection, as the researchers intended to recover localization information on the person relative to the ground. This extended process of feature extraction involves computing the 3D head trajectories of a person, and a fall is detected if the velocity of the head exceeds a certain value and the position of the head is too close to the floor or the ground. Similarly, a process of detecting unnatural falls has been developed [16], which includes background subtraction using the frame difference method, feature extraction using MHI and a change in human shape, and classification using a support vector machine (SVM). Compared with the previous approach [15], the system involves constructing three specific features on the human shape, namely, the orientation of the approximated ellipse, its aspect ratio, and silhouette detection below a threshold line. The aspect ratio of the ellipse describes changes in the major and minor axes, and differentiates a fall from normal activities. After a fall, the previously moving object lies on the ground. For this reason, the threshold line is set by considering a suitable height from the ground. After that, fall and non-fall objects can be differentiated according to the height of the silhouette object. Finally, classifications are performed using the support vector machine (SVM), k-nearest neighbor (KNN) classifier, Stochastic Gradient Descent (SGD), Decision Tree (DT), and Gradient Boosting (GB). The DT algorithm consistently outperforms the rest, with a high detection rate, as confirmed using the Le2i fall detection dataset. Another fall detection system for elderly care is proposed in [17]. This system firstly conducts background subtraction to segment out the silhouette of the moving object, and then tracks the object to determine its trajectory.
Secondly, a timed motion history image (tMHI) is constructed to detect high velocities. Then, the motion is quantified to acquire pixel values for the tMHI, which are divided by the number of pixels in the detected silhouette object. Similar to the approach presented in [16], this system provides a useful definition of the human body posture using the following combined features as the input: the ratio, orientation, and major and minor semi-axes of the fitted ellipse. In this system, both MHI and a projection histogram are applied to confirm that a falling event has occurred. In addition, the position of the head can be tracked in sequential frames in order to obtain useful information, since the trajectory of the head is visible most of the time. Finally, a multilayer perceptron (MLP) neural network is applied on the extracted features to classify falls and non-falls, with an accuracy of 99.24%, and a precision of 99.60%, as confirmed using the UR fall detection dataset. In addition, an extensive automated human fall-recognition process is proposed in [18] to support independent living for the elderly in indoor environments using the Le2i fall-detection dataset. This system is based on motion, orientation, and histogram features, and achieves an overall accuracy of 99.82%. The approaches in [14–18] discussed above focus on distinguishing falls from normal activities. However, any monitoring system should take into account consecutive daily activities in a real-world environment. Such activities include walking, standing, and sitting, as well as transitioning between these activities. In this regard, our previous work [19] proposed human action analysis based on motion history and the orientation of the human shape. As its main purpose, this system takes into consideration a prediction of the degree of mobility for the elderly in daily activities, such as getting into and out of bed. These activities include the following consecutive actions: sitting, transitioning from sitting to lying, lying, transitioning from lying to sitting, transitioning from sitting to standing, and walking. We firstly conducted background subtraction [20,21] for the proper separation of foreground and background objects. Among the features used in this system are tMHI and the orientation of the approximated ellipse. However, the construction of one ellipse is not enough for detecting the human object region. As a key part of our design, two approximated ellipses are constructed using horizontal and vertical histogram values. The vertical histogram is employed for the whole body region, and the horizontal histogram is employed for the upper body region. Then, motion quantification is used to analyze these human activities. In considering detailed sequential actions, multiple threshold values are observed, which depend on the shape orientation and the coefficient of motion. Rather than using fall scenes, the experiments were conducted using the normal scenes in the 14 videos of the Le2i dataset. After that, we extended our system [22] to analyze not only normal activities, but also falls. In this system, the virtual grounding point (VGP) is introduced for feature extraction, and analyses are performed by the combined features of tMHI and VGP. The overall accuracy of the proposed system for the detection of falling events was 82.18%, as experimentally confirmed using 15 videos from the Le2i fall detection dataset. 
To improve the accuracy of detecting falls in a given video sequence, we extended our research [13] by incorporating feature selection using the VGP concept with its related features and statistical analysis for estimating the falling period, as well as two classification methods, namely the support vector machine (SVM) and period detection (PD).

A comparison with existing approaches using the Le2i dataset showed that our SVM approach outperforms the rest, with a precision of 93%, a recall of 100%, and an accuracy of 100%. However, our previous system exclusively concentrated on differentiating falling events from normal events; in other words, it classified whole videos as containing either abnormal or normal events. A long-term monitoring system for home care requires the effective detection of sequential normal and abnormal states. Another vision-based fall detection system, using a convolutional neural network, has been developed in [23]. This system was designed to work on human motion, avoiding any dependence on the appearance of the image. An optical flow image generator is utilized to efficiently represent human motion. A set of optical flow images is stacked and then used as input to a convolutional neural network (CNN) that can learn longer time-related features. A fully connected neural network (FCNN) receives these features as inputs and produces fall and non-fall outputs. The best overall accuracy for this system is 97%, as confirmed using the Le2i fall-detection dataset. Moreover, the approach used in [24] proposes fall detection based on body keypoints and a sequence-to-sequence architecture. In this system, a skeleton framework is modeled to receive a sequence of observed frames. Then, the coordinates of the keypoints of the object are extracted from the observed frames. The bounding boxes of the detected object are given to the tracking algorithm for clustering body keypoints belonging to the same person across different video sequences. A keypoint-vectorization method is exploited to extract salient features from the associated coordinates. Next, the pose prediction phase is conducted, predicting the vectors of future keypoints for the person. Finally, falls are classified using the Le2i dataset, achieving an accuracy of 97%, a precision of 90.8%, a recall of 98.3%, and an F1-score of 0.944. A related approach is presented in [25]: an automatic fall-detection system with an RGB-D camera using a Hidden Markov Model (HMM). Background subtraction is firstly performed by averaging the depth map to learn the background. The factory-calibrated Kinect optical parameters are employed to obtain a real-world coordinate system, and Open Natural Interaction (OpenNI) is used for the Kinect-to-real-world transformation. After that, the center of mass of the person is extracted to calculate the vertical speed from that point. The standard deviation for all points belonging to the person is then calculated. After that, these three features are used as inputs to calculate the probability of the HMM. Finally, the forward-backward and Viterbi algorithms are applied to classify the states of normal activities and falls. The experiments were conducted on young and healthy subjects, and occlusions were not included. In future research, this system will be tested using real-life unhealthy subjects with occlusions. In addition, another fall detection system, for shop floor applications [26], was modeled using an HMM based on the vertical velocity, area variance, and height of a person, using cameras positioned to provide a top view. This incident detection method focuses on two things: detecting people in restricted areas and detecting falls. Potential fall events are analyzed based on specific features and on circumstances such as whether the person can get up or not.
Analysis is also based on an allocation of status according to the location: whether the event occurs in an area that is restricted to all personnel, where work is ongoing, or where maintenance is being performed. Such considerations are valuable in identifying health and safety issues. In future work, this system will be improved by incorporating more incidents, such as collisions. Vision-based monitoring systems for detecting falls or abnormal events can be powerful tools in various applications. These new technologies have great potential as intelligent monitoring systems. Therefore, we propose a detection system for indoor environments that covers normal activities as well as consecutive abnormal or fall states, using image processing techniques and a Hidden Markov Model. This detection system is well suited to elderly care monitoring.

3. Proposed Architecture of the System

The proposed monitoring system provides a way to detect normal, abnormal, and falling states for persons being monitored, such as seniors with limited mobility, patients with chronic diseases, and those recovering from surgical procedures. The proposed system is composed of the following four main modules: (1) object detection; (2) feature extraction using the virtual grounding point (VGP) concept; (3) analysis of normal, abnormal, or falling events; and (4) establishment of decision-making rules utilizing a Hidden Markov Model (HMM). The major work flow of our proposed system is illustrated in Figure 1, and specialized theoretical analyses for each module are described in the following sections.

Figure 1. The major work flow of our proposed system.

3.1. Object Detection

This module features an object detection procedure similar to that used in our previous work [13], and is briefly described here. We select a specific Mixture of Gaussians (MoG) distribution [20] for modeling the foreground, which is updated frame by frame. We also use specific low-rank subspace learning for modeling the background for each successive frame in the video sequences. After that, an expectation and maximization (EM) algorithm is used for updating the foreground and background parameters of each new frame. However, the quality of the foregrounds is unsatisfactory because ghost effects are included in the resultant foregrounds. In real-life video sequences, as well as in simulated video sequences, much redundant data occurs, such as when a person moves very slowly or stays in place for a long period of time. In such situations, most systems have trouble recognizing a person as the foreground. To overcome this problem, a graph cut algorithm is applied to refine the foreground [21]. A set of the resulting foreground and background frames is given as input for the video sequences. Then, we seek binary labels regarded as the foreground, which is set to 1, and the background, which is set to 0. These foreground and background labels are calculated by constructing a graph structure, G = (v, ε), where v is the set of vertices or pixels and ε is the set of edges linking the closest four-connected pixels [21]. The maximum-flow and minimum-cut method is used to find the vertex labeling with a minimum energy function. Most traditional graph cut algorithms work on manually drawn scribbles and a region of interest for the targeted object in every frame. In applying graph cut theory for refining the foreground, we use the associated MoG foreground mask in every 100th frame, instead of repeatedly re-drawing the scribbles and region of interest. The combination of these two effective methods optimizes the solution for object detection.
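As a rough illustration of this module, the following sketch uses OpenCV's MOG2 background subtractor and GrabCut as stand-ins for the specific MoG model with low-rank background learning and the graph-cut refinement described above; the video path, window sizes, and parameter values are illustrative assumptions rather than the authors' implementation.

```python
import cv2
import numpy as np

# Illustrative stand-in: MOG2 for the MoG foreground model, GrabCut for the
# graph-cut refinement seeded from the MoG mask instead of manual scribbles.
cap = cv2.VideoCapture("video.avi")                      # hypothetical input file
mog = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = mog.apply(frame)                                # raw MoG foreground mask
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))

    if frame_idx % 100 == 0 and fg.any():
        # Re-seed the refinement with the MoG mask every 100th frame.
        gc_mask = np.where(fg > 0, cv2.GC_PR_FGD, cv2.GC_PR_BGD).astype(np.uint8)
        bgd = np.zeros((1, 65), np.float64)
        fgd = np.zeros((1, 65), np.float64)
        cv2.grabCut(frame, gc_mask, None, bgd, fgd, 3, cv2.GC_INIT_WITH_MASK)
        refined = np.where((gc_mask == cv2.GC_FGD) | (gc_mask == cv2.GC_PR_FGD),
                           255, 0).astype(np.uint8)      # refined silhouette mask
    frame_idx += 1
cap.release()
```

Seeding the refinement from the MoG mask every 100th frame mirrors the idea of avoiding manual scribbles, although the exact energy formulation in [21] differs from OpenCV's GrabCut.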
3.2. Feature Extraction

In this module, feature extraction is performed by applying the virtual grounding point (VGP), as well as the area and aspect ratio of the object [13]. The computation strategies for feature extraction are simple and effective. Using these strategies, we adapted our previously proposed method of feature extraction [13] in this extension of our work. Here, we briefly describe our previous research using the VGP concept.


VGP concept. A virtual grounding point is a point which is constructed for exploring the position and posture of humans. Specialized technical details are described as follows. Let p be the position at time t on the detected silhouette object. Then, the centroid of the object C(t) is extracted as defined in Equation (1):

C(t) = (xc(t), yc(t)), (1)

where t represents p(t) = (x(t), y(t)), and xc and yc denote xc = Σ_{i=1}^{N} xi/N and yc = Σ_{i=1}^{N} yi/N, respectively. Then, a vertical line from the top-most to the bottom-most row corresponding to the x axis of C(t) is constructed on the detected object. Similarly, a horizontal line from the left to the right column corresponding to the y axis of C(t) is constructed. Finally, the central point where the vertical line along the x axis crosses the horizontal line along the y axis is marked as the VGP, which can be defined as in Equation (2):

VGP(t) = (xVGP(t), yVGP(t)), (2)

where the VGP is the virtual grounding point at time t. Then, the point distance (d) between the centroid C(t) and its corresponding virtual grounding point VGP(t) is formulated as shown in Equation (3):

d(t) = yVGP(t) − yc(t), (3)

where d(t) denotes the point distance between the centroid C(t) and its corresponding virtual grounding point VGP(t) on the detected object at time t, which is used to extract features for the position and posture of the person. After that, the area of the object shape (a) can be derived to obtain knowledge of the shape regularity of the moving object, as described in Equation (4):

a(t) = Σ_{i=1}^{N} Σ_{j=1}^{N} I(i, j, t), (4)

where a(t) is the measurement of the area indicating the relative size of the object, obtained by summing all the pixels in the object at time t. Finally, the aspect ratio (r) with respect to the object is simply calculated to estimate the human posture, as shown in Equation (5):

r(t) = w(t)/h(t), (5)

where r(t) is the aspect ratio of the object, found by dividing the width (w) by the height (h) of the object at time t. The concepts for deriving the individualized features are illustrated in Figure 2.


Figure 2. Feature extraction: (a) point distance (d); (b) area of the object (a); (c) aspect ratio of the object (r).
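For a single binary silhouette mask, the three features can be computed with a few lines of NumPy. The sketch below is illustrative and reads the VGP as the lowest silhouette pixel on the centroid column, which is one plausible interpretation of the construction above rather than the authors' exact definition.

```python
import numpy as np

def vgp_features(mask):
    """Illustrative computation of d, a, r (Equations (1)-(5)) from one binary
    silhouette mask (H x W array of 0/1)."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return 0.0, 0.0, 0.0
    xc, yc = xs.mean(), ys.mean()                  # centroid C(t), Eq. (1)
    col = ys[xs == int(round(xc))]                 # vertical line through C(t)
    y_vgp = col.max() if col.size else ys.max()    # VGP y-coordinate, Eq. (2)
    d = y_vgp - yc                                 # point distance, Eq. (3)
    a = float(mask.sum())                          # area of the object shape, Eq. (4)
    w = xs.max() - xs.min() + 1                    # bounding-box width
    h = ys.max() - ys.min() + 1                    # bounding-box height
    r = w / h                                      # aspect ratio, Eq. (5)
    return d, a, r
```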



3.3. Analysis of Abnormal and Normal Events

For detecting abnormal or falling events, we first computed the moving average (MA) [13] using features extracted from sequential data. In our previous research [13], we used two features (point distance (d) and aspect ratio (r)) for analyzing the data points of the moving object. In this extension of our work, statistical analysis of the area of the object shape on the observable data series is used as one additional feature. As a result, our proposed formulation uses the following three features: (1) the point distance (d) between the centroid and the corresponding virtual grounding point; (2) the area of the object shape (a); and (3) the aspect ratio of the object (r), as shown in Equation (6):

MA(t, F, N) = (1/(2N + 1)) Σ_{f = t−N}^{t+N} F(f), F(t) = d(t), a(t), r(t), (6)

where MA(t, F, N) is an output value for the average period N at time t, N represents the number of periods, and F(t) is the calculation on the three extracted features: the point distance (d), the area of the object shape (a), and the aspect ratio (r) at time t. The optimized threshold (Th) value is considered in detecting abnormal or falling events. Here, we set the threshold value of N at Th(MA(t,d,Th)), Th(MA(t,a,Th)), and Th(MA(t,r,Th)), where Th = 51, which depends on the video frame rate. The process for selecting the optimal threshold is similar to that used in our observations in previous research [13]. Calculating the moving average for the sequential data approximates irregular and falling events by observing the crossing point of the primary and averaged feature points. After that, we define the difference calculation, specifically the modified difference (MD), for more precisely estimating abnormal and falling states at the crossing point of the moving average (MA), as shown in Equation (7):

MD(t, F, N0, N1) = MA(t, F(t + N0 + N1), N1) − MA(t, F(t − N0 − N1), N1), F(t) = d(t), a(t), r(t), (7)

where MD(t, F, N0, N1) denotes the modified difference for the three selected features (d, a, and r) at time t, and the selected optimal threshold values are N0 = 0 and N1 = 51. In analyzing a series of data based on the point distance (d), the modified difference (MD) reaches a minimum point, which indicates a high possibility of a fall. The modified difference observed for the aspect ratio (r) of the person indicates a high possibility of a fall with a maximum point. Here, we say that the highest point is the local maximum (lmax) and the lowest point is the local minimum (lmin). The concepts of discovering a point which is important in indicating an abnormal state or a fall by utilizing r and d are illustrated in Figures 3 and 4, respectively. When we observe stationary points of the modified difference on the area of the object shape (a), we make an interesting discovery: both lmax and lmin occur, depending on the falling posture and direction of the object. The observations for lmax and lmin which correspond to the different falling postures and positions (sideways falls, backward falls, and forward falls) are described in Figures 5–7, respectively. This observation and analysis of the area of the object shape is part of the extension of our previous work [13], and can be properly used in further observations to develop decision-making rules. Then, we analyze the interval or period of the abnormal event by computing the half width value (vhw) [13] on the observed curve of the modified difference (MD), as formulated in Equation (8). The half width value (vhw) more accurately estimates the period of abnormal and normal events. At the halfway point on the largest curve obtained from the computation of MD, the starting point (f1) and ending point (f2) are set to represent irregular fall events. After that, the periods of events are observed by calculating the distance between the two points f1 and f2.

vhw = |f1 − f2|, (8)

where vhw is the estimated period of a fall assumed by the half width of the curve including lmax or lmin, and f1 and f2 represent the starting and ending points of a fall, respectively.
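As an illustration of Equations (6)–(8), the following NumPy sketch computes the moving average, the modified difference, and a half width estimate for one feature series. The padding at the edges and the peak-based half width search are simplified assumptions, not the authors' exact procedure.

```python
import numpy as np

def moving_average(x, n=51):
    """Centered moving average over 2n+1 frames, in the spirit of Eq. (6)."""
    xp = np.pad(np.asarray(x, dtype=float), n, mode="edge")
    kernel = np.ones(2 * n + 1) / (2 * n + 1)
    return np.convolve(xp, kernel, mode="valid")

def modified_difference(x, n0=0, n1=51):
    """Shifted difference of moving averages, in the spirit of Eq. (7)."""
    ma = moving_average(x, n1)
    s = n0 + n1
    md = np.zeros_like(ma)
    if s > 0:
        md[s:-s] = ma[2 * s:] - ma[:-2 * s]
    return md

def half_width_period(md):
    """Half width vhw around the dominant extremum of MD, Eq. (8) (simplified)."""
    peak = int(np.argmax(np.abs(md)))
    half = abs(md[peak]) / 2.0
    f1 = peak
    while f1 > 0 and abs(md[f1 - 1]) >= half:
        f1 -= 1
    f2 = peak
    while f2 < len(md) - 1 and abs(md[f2 + 1]) >= half:
        f2 += 1
    return f2 - f1, (f1, f2)
```

With the per-video half widths in hand, the thresholds α1 and α2 and the period detection values PD(d), PD(a), and PD(r) of Equations (9)–(12) reduce to simple minima, maxima, and midpoint comparisons.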

Illustrations for developing a better understanding of vhw are embedded in Figures 3–7. In Figure 3, the analysis for a high possibility of abnormal (falling) points is illustrated based on the aspect ratio (r) with respect to the crossing point of the moving average (MA), the local maximum (lmax), and the half width value (vhw) obtained from the distance between f1 and f2. Figure 4 describes a high possibility of abnormal points based on the point distance (d) with respect to the crossing point of the moving average (MA), the local minimum (lmin), and the half width value (vhw) obtained from the distance between f1 and f2. By observing the area of the object shape (a), we noticed that the local minimum (lmin) occurs for the postures involved in falling sideways and backwards, as demonstrated in Figures 5 and 6, respectively. In Figure 7, the local maximum (lmax) occurs for the posture involved in falling forward when analyzing a high possibility of falling points based on the area of the object shape.

Figure 3. Analysis for a high possibility of abnormal (falling) points based on the aspect ratio (r).

Figure 4. Analysis for a high possibility of abnormal (falling) points based on the point distance (d).

Figure 5. Analysis for a high possibility of abnormal (falling) points based on the area of the object shape (a) by falling sideways.

Figure 6. Analysis for a high possibility of abnormal (falling) points based on the area of the object shape (a) by falling backwards.

Figure 7. Analysis for a high possibility of abnormal (falling) points based on the area of the object shape (a) by falling forwards.

Finally, we consider the threshold values of the time intervals for abnormal and normal events based on the half width value (vhw) in a given video sequence. Let A = {a1, a2, ..., ak} represent the videos that include abnormal events, and N = {n1, n2, ..., nk} represent the videos composed of normal events. After that, we calculate

α1 = min(vhw), vhw ∈ A; α2 = max(vhw), vhw ∈ N. (9)

Then, the mid value or falling period detection (PD) is calculated for the three features, namely, the point distance (d) between the centroid and its corresponding virtual grounding point, the area of the object shape (a) as an extensive feature observation, and the aspect ratio (r), as formulated in Equations (10)–(12), respectively:

PD(d) = (α1(d) + α2(d))/2, with class label l1 if vhw ≥ PD(d) and l2 otherwise, (10)

PD(a) = (α1(a) + α2(a))/2, with class label l1 if vhw ≥ PD(a) and l2 otherwise, (11)

PD(r) = (α1(r) + α2(r))/2, with class label l1 if vhw ≥ PD(r) and l2 otherwise, (12)

where α1 and α2 represent the minimum and maximum threshold values used to estimate the period of abnormal and normal events, respectively; l1 represents the class label for a given video that includes an abnormal event; and l2 represents the class label which indicates that the given video does not include an abnormal event.

3.4. Decision-Making Rules

As the main contribution of this research, we developed decision-making rules. Abnormal and normal actions of the person being monitored are detected for every detailed moment by employing a Hidden Markov Model (HMM).

In order to describe the Hidden Markov Chain, we define two states: S = {S1, S2}. Let S1 be an abnormal state, which includes “falling,” and let S2 be a normal state, including the set of actions {“lying,” “lying to sitting,” “sitting to standing,” “walking,” “walking to sitting,” “sitting,” and “sitting to lying”}. The process starts in one of these two states and proceeds smoothly from one state to another. If the chain is presently in state S1, then it proceeds to the next state S2 with a specific probability. However, it should be noted that this probability does not depend on which state the chain was in before the current state.
In order to form the Markov transition matrix, we firstly observe each of the given videos by developing fixed intervals. The video sequence then becomes a sequence of intervals, such as t, t + d, t + 2d, t + 3d, ..., t + nd. Here, d is the length of the interval to be predefined, in which only one abnormal or normal state occurs. The video sequences are manually observed to obtain the occurrence state symbol for every interval. These video sequences provide the sequences of the occurrences of the states, S = {S1..S2}, as illustrated in Figure 8. In this figure, S1 represents the abnormal state and S2 represents the normal state.

Figure 8. Observed sequence of the occurrences of states.

Then, the co-occurrence matrix M is constructed as shown below:

M = [ C11  C12 ]
    [ C21  C22 ],   (13)

where C11 represents the number of pairs (S1, S1), C12 represents the number of pairs (S1, S2), C21 represents the number of pairs (S2, S1), and C22 represents the number of pairs (S2, S2). After that, each row of matrix M is summed as shown in the following:

C1 = C11 + C12, C2 = C21 + C22. (14)

To form the Markov transition matrix, we define the following:

a11 = C11/C1, a12 = C12/C1, a21 = C21/C2, a22 = C22/C2. (15)

Then, the state transition matrix of the Markov Chain for the Hidden Markov Model (HMM) is obtained as shown in Equation (16):

A = [ a11  a12 ]
    [ a21  a22 ],   (16)

where the elements in the first row of matrix A represent the probabilities for the actions following an abnormal action or fall. Similarly, the elements in the second row represent the probabilities for the actions following normal actions.
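A minimal sketch of the counting scheme in Equations (13)–(16), assuming the manually observed interval labels are available as a list of state indices (0 for the abnormal state S1, 1 for the normal state S2); the example sequence at the end is purely hypothetical.

```python
import numpy as np

def transition_matrix(states):
    """Estimate the 2x2 transition matrix A from a sequence of interval states,
    following the counting scheme of Equations (13)-(16)."""
    counts = np.zeros((2, 2))                     # co-occurrence matrix M, Eq. (13)
    for prev, curr in zip(states[:-1], states[1:]):
        counts[prev, curr] += 1
    row_sums = counts.sum(axis=1, keepdims=True)  # C1 and C2, Eq. (14)
    row_sums[row_sums == 0] = 1                   # avoid division by zero
    return counts / row_sums                      # transition matrix A, Eqs. (15)-(16)

# Hypothetical observed interval sequence (1 = normal S2, 0 = abnormal S1):
A = transition_matrix([1, 1, 1, 0, 0, 1, 1, 0, 1, 1])
```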


For each of the possible states, emission probabilities are calculated. We define the observable features with respect to the distance between the centroid and the corresponding virtual grounding point (d), the area of the object shape (a), and the aspect ratio (r). To define the observable symbols, a suitable threshold value for each of the features is selected. In doing so, we firstly calculate the features for extracting the falling period of a person, as discussed in Section 3.3. These falling periods determine whether each of the observed videos includes an abnormal or a normal event. The frame index numbers through the video sequences indicate that a fall can be extracted between starting point f1 and ending point f2, as illustrated in Figure 9.

In the upper part of Figure 9, the frame numbers marked in green represent the normal activities of a person, such as walking, standing, sitting, and lying. The frame numbers marked in black represent abnormal or falling events. The period of a fall might be obtained when the half width value (vhw), referring to the distance between a starting point (f1) and an ending point (f2), is greater than or equal to the period detection (PD), as formulated in Section 3.3. Similarly, in the lower part of Figure 9, the frame numbers represent the normal activities of a person. When the person is involved in activities such as walking, walking to sitting, and sitting to standing, the analyzed half width value (vhw) is less than the detected period.

Figure 9. Observations in the period of abnormal and normal states. Upper part: falling; lower part: walking to sitting.

After that, it is determined whether the action is abnormal or normal by observing the six possible feature values defined below. Satisfying the criteria for this determination involves passing the selected threshold value F in the consecutive frames from the starting to the ending points for abnormal and normal events, based on the observed half width value, which determines the period of events.

F1 = {f1 .. f2} where vhw ≥ PD(d),    F2 = {f1 .. f2} where vhw < PD(d),

F3 = {f1 .. f2} where vhw ≥ PD(a),    F4 = {f1 .. f2} where vhw < PD(a),
F5 = {f1 .. f2} where vhw ≥ PD(r),    F6 = {f1 .. f2} where vhw < PD(r),


where the symbol F denotes the observed features for the HMM, and the set {f1 .. f2} represents the sequential falling frames from starting point f1 to ending point f2. PD(d), PD(a), and PD(r) denote the period detection for a fall or an abnormal event with respect to the distance between the centroid and the corresponding VGP, the area of the shape, and the aspect ratio of the person, respectively, and vhw is the half width value employed to determine the period of the abnormal event. After obtaining thresholds for the features, calculations of emission probabilities are performed for developing the HMM model. The hierarchical tree structure used to form the features for the independent observable symbols O is illustrated in Figure 10.

O1 = (F1, F3, F5), O2 = (F1, F3, F6), O3 = (F1, F4, F5), O4 = (F1, F4, F6), O5 = (F2, F3, F5), O6 = (F2, F3, F6), O7 = (F2, F4, F5), O8 = (F2, F4, F6)

Figure 10. Hierarchical tree structure of the independent observable symbols O.
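A small sketch of how the per-interval half width value and the three period-detection thresholds could be mapped to the binary outcomes F1–F6 and then to one of the eight symbols O1–O8 (returned here as an index 0–7); the function and the example values are illustrative, not the authors' code.

```python
def observation_symbol(v_hw, pd_d, pd_a, pd_r):
    """Map the half width value and the three PD thresholds to an observation
    symbol index 0..7, mirroring the tree in Figure 10 (O1..O8)."""
    f_d = v_hw >= pd_d   # True -> F1, False -> F2
    f_a = v_hw >= pd_a   # True -> F3, False -> F4
    f_r = v_hw >= pd_r   # True -> F5, False -> F6
    # O1 = (F1, F3, F5), ..., O8 = (F2, F4, F6): treat each outcome as one bit.
    return ((not f_d) << 2) | ((not f_a) << 1) | (not f_r)

# Example: a half width exceeding all three thresholds gives symbol 0 (O1).
symbol = observation_symbol(v_hw=60, pd_d=40, pd_a=45, pd_r=38)
```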

When studying the observable symbols, we can see that F1 and F2, F3 and F4, and F5 and F6 are all disjoint. The occurrence of an observable symbol is then labeled based on the eight observable symbols O = {o1, o2, o3, o4, o5, o6, o7, o8} for the video sequence. The emission probability of observed symbol ok (k = 1, 2, ..., 8) in state Sj (j = 1, 2) is denoted as in Equation (17):

bj(k) = (Expected Number of Occurrences of ok ∈ State Sj) / (Expected Number of Occurrences of State Sj),   (17)

The emission probability matrix B is then obtained by using the following:

B = [ b1(1)  b1(2)  b1(3)  b1(4)  b1(5)  b1(6)  b1(7)  b1(8)
      b2(1)  b2(2)  b2(3)  b2(4)  b2(5)  b2(6)  b2(7)  b2(8) ],   (18)
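The emission matrix of Equation (18) is a row-normalized count of how often each observable symbol co-occurs with each state. A minimal sketch of this estimation is shown below; the per-frame state and symbol sequences are hypothetical examples, not data from the experiments.

```python
import numpy as np

def emission_matrix(states, symbols, n_states=2, n_symbols=8):
    """Estimate b_j(k) of Equation (17): occurrences of symbol o_k in
    state S_j divided by the total occurrences of state S_j."""
    counts = np.zeros((n_states, n_symbols))
    for s, o in zip(states, symbols):          # per-frame (state, symbol) pairs
        counts[s, o] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_totals, 1)  # avoid division by zero

# Hypothetical per-frame labels: state 0 = abnormal (S1), 1 = normal (S2);
# symbols 0..7 correspond to O1..O8.
states  = [1, 1, 1, 0, 0, 1, 1, 1]
symbols = [7, 7, 5, 0, 0, 7, 6, 7]
print(emission_matrix(states, symbols))
```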

Then, the initial probability states are assumed to be Prob(S1) = 0.8 and Prob(S2) = 0.2. Denoting the initial probability vector as π gives the Hidden Markov Model (HMM), as shown in Equation (19):

λ = (A, B, π),   (19)

where π is the initial probability vector.

After that, the Viterbi algorithm [27] is applied to solve the HMM model to detect consecutive moments in the actions. The idea is to compute the best hidden state sequence (abnormal and normal states) in the given observation sequences and in an HMM.

4. Experiments

4.1. The Dataset

In order to demonstrate a sub-sequential experiment, we used the publicly available Le2i fall detection dataset [28], which simulates the activities of the elderly in a home environment, as well as in an office. This dataset was developed for detecting falls using artificial vision, which can be useful in helping the elderly. These simulated videos illustrate the difficulties in creating realistic video sequences that include variable illumination, occlusions, and cluttered or textured backgrounds.


The actors performed various normal daily activities and falls, in an attempt to simulate elderly behavior. In addition, the actors did not perform acts that would be unnatural for the elderly, such as running or walking quickly. The video sequences were captured using a single RGB camera, in which the frame rate was 25 frames per second, and the resolution was 320 × 240 pixels. From this dataset, 20 videos taken in an office were selected, representing normal daily activities and falls as performed by four different actors.

Acquiring useful qualitative data for surveillance systems is very important in measuring distances, positions, velocities, and directions for the objects in the scenes, and is a necessary consideration in developing reliable systems [29]. As an important contribution of the current work, we have developed guidelines for setting up and calibrating cameras. First of all, it is essential to confirm whether or not the camera height and angle are approximately the same for all videos. We can derive an approximate linear equation for camera calibration. To do so, two extracted features for the object are employed, namely, the aspect ratio (r) and yVGP, corresponding to the y coordinate of the virtual grounding point. These two features can provide information on steps in the walking pattern for a person, as described in Figure 11. In Figure 11, the person exhibits the same posture corresponding to yVGP and r, as illustrated by the blue and orange dashed lines. The interval of a step can be obtained from the place between where the same postures occur. Then, we can observe the index numbers for frames and the corresponding values for the point distance (d), the area of the object shape (a), and the aspect ratio (r), with respect to the starting and ending frames in the step intervals. In order to formulate a linear equation for camera calibration, we assume three kinds of points: (yVGP, d), (yVGP, a), and (yVGP, r), as shown in Figure 12. Then, by observing the testing points, Figure 12 shows that no significant differences exist between the variation within a scene and the variation between all scenes.
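Assuming the (yVGP, d), (yVGP, a), and (yVGP, r) points collected at the step intervals are available as arrays, an approximate linear calibration equation of the kind plotted in Figure 12 can be obtained with a simple least-squares fit. The sketch below is an illustration only; the sample points are hypothetical, not values read from the dataset.

```python
import numpy as np

def fit_linear(y_vgp, feature):
    """Least-squares fit of feature = slope * y_VGP + intercept,
    giving an approximate linear camera-calibration equation."""
    slope, intercept = np.polyfit(y_vgp, feature, deg=1)
    return slope, intercept

# Hypothetical testing points (y coordinate of the VGP against the point
# distance d); the real values are read off at the step intervals of Figure 11.
y_vgp = np.array([60.0, 100.0, 150.0, 200.0, 260.0])
d     = np.array([51.0, 60.0, 71.0, 82.0, 95.0])
slope, intercept = fit_linear(y_vgp, d)
print(f"d = {slope:.4f} * y_VGP + {intercept:.3f}")
```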


Figure 11. Two extracted features, yVGP (the y coordinate of the virtual grounding point (VGP)) and the aspect ratio (r), which can provide information on the number of steps for a person.

(Figure 12 plots the point distance d, the area a, and the aspect ratio r against yVGP for all scenes, with linear fits y = 0.2208x + 38.061, y = 0.2105x + 29.558, and y = 0.0001x + 0.3001.)

Figure 12. Analysis of camera calibration for all scenes.

4.2. Experimental Results

The given video was firstly analyzed by setting a fixed interval. Here, we made the interval every frame in which only one state occurs (S1: abnormal or S2: normal). Manual observations of the video sequences were performed to provide the occurrence state symbol for every interval or frame. Such observations for a video sequence used in these experiments are illustrated in Figure 13. For effectively demonstrating the consequences of the sub-sequential experimental results, here, we denote the name of the person in the video sequence as "Mr. ONE."



Figure 13. Observations of the occurrences of states for Mr. ONE (Video 1).

After that, as discussed in Equation (16), the state transition matrix of the Markov Chain for the HMM was obtained by applying Equations (13) through (15). After making observations of the occurrence state symbols (S1, S2), the co-occurrence matrix was constructed by applying Equation (13), as shown below:

M = [ 900      11550
      11550    148225 ]

Then, the following values were obtained using Equations (14) and (15):

a11 = c11/c1 = 0.07229,  a12 = c12/c1 = 0.92771,  a21 = c21/c2 = 0.07229,  a22 = c22/c2 = 0.92771.
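As a sketch of this step, the row normalization of the co-occurrence matrix M into transition probabilities (Equations (13) through (15)) can be reproduced as follows; the matrix entries are the ones reported above, and the code is an illustration rather than the authors' implementation.

```python
import numpy as np

# Co-occurrence counts of successive state symbols (S1, S2) for Mr. ONE,
# as reported above: c11, c12 in the first row and c21, c22 in the second.
M = np.array([[900.0, 11550.0],
              [11550.0, 148225.0]])

# a_ij = c_ij / c_i, i.e., each row is divided by its own sum
# (Equations (14) and (15)), giving the transition matrix A of Equation (16).
A = M / M.sum(axis=1, keepdims=True)
print(np.round(A, 5))   # [[0.07229 0.92771]
                        #  [0.07229 0.92771]]
```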

After that, we obtained the transition probability matrix A(aij), which represents the probability of proceeding from state Si to state Sj, by computing Equation (16):

A = [ 0.07229   0.92771
      0.07229   0.92771 ]

By employing Equations (17) and (18), the independent observable symbols O were obtained as shown in Figure 14. After that, the occurrence of an observable symbol was identified using one of the labels for the eight observable symbols {o1, o2, o3, o4, o5, o6, o7, o8} for the video sequence. The emission probability matrix B was then obtained as shown below:

B = [ 0.75       0          0.25       0   0   0          0          0
      0.020672   0.062016   0.015504   0   0   0.069767   0.054264   0.777778 ]

Matrix B represents a sequence of observation likelihoods, in which each element expresses the probability of an observation ok (k = 1, 2, ..., 8) being generated from a state Sj (j = 1, 2).

Finally, the Viterbi algorithm [27] was employed for solving the HMM model to recognize sequential abnormal and normal states for "Mr. ONE." The input parameters for implementing the Viterbi algorithm were summarized as two states (abnormal and normal), including eight observable action symbols {"falling," "lying," "lying to sitting," "sitting to standing," "walking," "walking to sitting," "sitting," and "sitting to lying"}, a state transition probability matrix (A), a sequence of observation likelihoods or emission probability matrix (B), and an initial probability distribution over states. The final outcome results for "Mr. ONE" are illustrated in Figure 15. Additional experimental results for sequential abnormal and normal states in each of the video sequences simulated by Ms. TWO are demonstrated in Figures 16–18, respectively.
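A compact sketch of the decoding step is given below. It implements the standard Viterbi recursion in log space with the transition matrix A, emission matrix B, and initial distribution π reported above; the observation sequence is a hypothetical list of symbol indices, and the function is meant only to illustrate how the most likely abnormal/normal state sequence would be recovered.

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most likely hidden-state sequence for an observation sequence `obs`
    (symbol indices 0..7), given transition matrix A, emission matrix B,
    and initial distribution pi. Standard Viterbi recursion in log space."""
    eps = 1e-12                      # guard against log(0) for zero entries
    logA, logB, logpi = np.log(A + eps), np.log(B + eps), np.log(pi + eps)
    n_states, T = A.shape[0], len(obs)
    delta = np.zeros((T, n_states))  # best log-probability ending in each state
    psi = np.zeros((T, n_states), dtype=int)
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA          # (from-state, to-state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    # Backtrack from the best final state.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

# Parameters reported above (state 0 = abnormal S1, state 1 = normal S2).
A = np.array([[0.07229, 0.92771], [0.07229, 0.92771]])
B = np.array([[0.75, 0, 0.25, 0, 0, 0, 0, 0],
              [0.020672, 0.062016, 0.015504, 0, 0, 0.069767, 0.054264, 0.777778]])
pi = np.array([0.8, 0.2])

# Hypothetical symbol sequence (indices of O1..O8) for illustration only.
obs = [7, 7, 5, 0, 0, 2, 7, 7]
print(viterbi(obs, A, B, pi))   # sequence of 0 (abnormal) / 1 (normal) states
```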


Figure 14. Observations for independent action symbol O for Mr. ONE (Video 1).

Figure 15. Results for deciding on the sequential abnormal and normal states for Mr. ONE (Video 1).

Figure 16. Results for deciding on the sequential abnormal and normal states for Ms. TWO (Video 1).


Figure 17. Results for deciding on the sequential abnormal and normal states for Ms. TWO (Video 2).

Figure 18. Results for deciding on the sequential abnormal and normal states for Ms. TWO (Video 3).

4.3. Performance Evaluation of the Proposed Approach

In order to evaluate the performance of our proposed system, we firstly used the detection process for falling and normal events using three-fold cross-validation, which presents the swapped variables for learning and testing used in our previous work [13]. The given videos were classified as either falling or non-falling videos. Our proposed period detection (PD) achieved a precision of 93.33% and a recall of 93.33%. In the current work, we enhanced our detection system "from video to consecutive video sequences" by utilizing the analyzed features working for the PD. The features were used as inputs to calculate the probability for the HMM.
There are four possible outcomes in discriminating abnormal and normal states for the consecutive states of a person, and the definitions and relative symbols are described in the following:

• Detected Abnormal State (As1): A video frame represents an abnormal state, and is correctly classified as "Positive Abnormal";
• Undetected Abnormal State (As2): A video frame represents an abnormal state, and is incorrectly classified as "Negative Normal";
• Normal State (Ns1): A video frame does not represent an abnormal state, and is correctly classified as "Negative Normal";
• Mis-detected Normal State (Ns2): A video frame does not represent an abnormal state, and is incorrectly classified as "Positive Abnormal".

The following Equations (20) through (24) describe, as a performance evaluation, the calculated precision, recall, accuracy, specificity rate, and negative predictive value, respectively.

In evaluating each of the given videos, an "undecided class" is nominated in order to set aside the unnecessary failed states. In other words, where we consider the risk to be low for a given video, we ignore it when a normal case is misclassified as abnormal.

Precision = As1 / (As1 + Ns2) × 100,   (20)

Recall = As1 / (As1 + As2) × 100,   (21)

Accuracy = (As1 + Ns1) / (As1 + As2 + Ns1 + Ns2) × 100,   (22)

Specificity Rate = Ns1 / (Ns1 + Ns2) × 100,   (23)

NPV = Ns1 / (Ns1 + As2) × 100,   (24)

Table 1 provides the precision, recall, accuracy, specificity rate, and negative predictive value (NPV) for 13 videos, including both abnormal and normal states. The remaining seven videos that include a normal state provide a precision of 100% and a recall of 100%. Figure 19 shows receiver operating characteristic (ROC) curve analyses for the results of the true positive and false positive fractions for the 13 videos that include abnormal events. The results of the AUC (area under the ROC curve) show that our proposed method is quite capable of distinguishing between the classes. A comparison of the performance of the proposed system for each video and related existing methods is shown in Table 2.

Our proposed system was implemented in MATLAB 2019a on an academic license. A series of experiments was performed on Microsoft Windows 10 Pro with an Intel (R) Core (TM) i7-4790 CPU @ 3.60 GHz and 8 GB RAM. The overall computation time in this system is 0.93 s per frame. In order to achieve real-time monitoring, we expect that a GPU implementation could be further fine-tuned to our system.
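Given per-frame counts of the four outcomes defined above, Equations (20) through (24) reduce to a few lines of arithmetic. The sketch below is only illustrative; the counts used in the example call are hypothetical, not values from Table 1.

```python
def evaluate(as1, as2, ns1, ns2):
    """Precision, recall, accuracy, specificity, and NPV (Equations (20)-(24))
    from the counts of detected/undetected abnormal frames (As1, As2) and
    correctly/incorrectly classified normal frames (Ns1, Ns2)."""
    precision   = as1 / (as1 + ns2) * 100
    recall      = as1 / (as1 + as2) * 100
    accuracy    = (as1 + ns1) / (as1 + as2 + ns1 + ns2) * 100
    specificity = ns1 / (ns1 + ns2) * 100
    npv         = ns1 / (ns1 + as2) * 100
    return precision, recall, accuracy, specificity, npv

# Hypothetical frame counts for one video.
print([round(v, 2) for v in evaluate(as1=52, as2=8, ns1=880, ns2=10)])
```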

Table 1. Performance evaluation for 13 fall video sequences.

Videos   Precision (%)   Recall (%)   Accuracy (%)   Specificity (%)   NPV (%)
1        100             100          100            96.88             96.88
2        100             100          100            96.07             96.09
3        100             86.66        98.32          99.04             97.17
4        100             100          100            97.07             97.06
5        100             100          100            90.40             90.40
6        82.98           97.50        99.76          97.89             97.63
7        98.04           100          100            99.36             99.36
8        100             100          100            94.70             94.70
9        100             83.33        97.41          90.80             99.09
10       100             100          100            97.72             97.72
11       100             100          100            98.54             98.54
12       100             100          100            96.38             96.38
13       100             100          100            88.48             88.48



Figure 19. Receiver operating characteristic (ROC) curve analysis for the results of the true positive fraction (As1/(As1 + As2)) and the false positive fraction (Ns2/(Ns2 + Ns1)), for the 13 videos including abnormal and normal events. The results of the AUC (area under the ROC curve) with respect to the true positive and false positive fractions show that our proposed method is quite capable of distinguishing between the abnormal and normal states.
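An ROC analysis of this kind can be reproduced for any per-frame abnormality score with standard tooling. The sketch below uses scikit-learn on hypothetical labels and scores and is not the authors' evaluation code; it simply shows how the true/false positive fractions and the AUC would be computed.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical per-frame ground truth (1 = abnormal, 0 = normal) and a
# continuous abnormality score (e.g., a posterior from the HMM decoding).
y_true  = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.9, 0.4, 0.1, 0.6, 0.2])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # false/true positive fractions
print("AUC =", auc(fpr, tpr))
```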


Table 2. Performance of our system compared with existing methods using the same dataset.

Methods             Precision (%)   Recall (%)   Accuracy (%)
Kishanprasad [16]   94              95           —¹
Suad [18]           100             95.27        99.82
Adrián [23]         —¹              99.00        97.00
Minjie [24]         90.8            98.3         97.8
Ours                99.05           98.37        99.8

Note: —¹ means the provided data is not available.

4.4. Comparative Studies of the Merits and Demerits of Our Proposed System

Kishanprasad presented a fall-detection system [16] providing a precision of 94% and a recall of 95%, as tested in an office environment from the Le2i dataset. The merits of this system were described as robust, fast, and computationally efficient, and the system could be enhanced using deep learning models in future work. Demerits of the system were not discussed in detail. However, our proposed system achieved higher precision and recall rates, as shown in Table 2.

Suad [18] proposed a fall detection approach using threshold-based methods and a neural network. The system achieved a high detection rate in monitoring a single person in a home environment. However, their proposed system can only detect events in high-intensity light. They mentioned ways to enhance their proposed system, such as using it to detect multiple people and installing multiple cameras to make sure that the object is visible in at least one camera view. Moreover, they mentioned adopting enhanced background subtraction for better object segmentation. Other improvements could be achieved by detecting additional features using sensors such as an accelerometer, and capturing 3D information using a depth camera could also improve the detection rate. Using simpler algorithms, we achieved an accuracy similar to that of Suad's proposed system, although their approach does not attain the recall of our proposed system.

Adrián [23] presented a vision-based fall detection system using a CNN architecture. As one drawback of the system, using optical flow images involves a heavy computational burden in generating consecutive sequences. Furthermore, the system is not robust when lighting changes. Therefore, a more reliable network could be modeled to gain a better representation of hierarchical motion from the images. In addition, more research involving multiple people in falling events would be useful before implementation in public areas. Compared to this method, our proposed system gives more accurate rates.

Minjie's work [24] investigated an approach for predicting falling events in advance using monocular videos. The system has credibility, as it provided excellent performance results using a complex system architecture. However, they described two failures due to the non-detection of keypoints, which occurred when the person's upper body was not visible from the perspective of the camera. Moreover, falls cannot be predicted in advance if the precursor of the fall is not detected. In comparison with our proposed system, we achieved a higher precision, recall, and overall accuracy, as described in Table 2.

In this paper, we have proposed a system for detecting normal, abnormal, and falling states using video sequences for people being monitored. In detecting normal states, our system outperformed other approaches, with an accuracy of 99.8%, using experimental videos containing non-abnormal events. In addition, our proposed system proved to be reliable and effective, with a precision of 99.05% and a recall of 98.37%, as confirmed using 20 video sequences that contained abnormal and normal events. Though the system was adapted to a real-world environment, the scope of the research was limited to detecting a single person, and the computation time of the object detection module is too slow for a real-time monitoring system. Using multiple cameras to view the human body from different perspectives would improve feature extraction.
Currently, the system is only effective during the day, and should be improved for night-time use in the real world of elderly care. Moreover, the system functionality should be expanded by exploring feature analyses such as pre-fall states, dangerous post-fall situations, and abnormal gaits (walking patterns). We expect that smart image processing technologies have great potential for improving the quality of life of the targeted population.

5. Discussion and Conclusions

The research discussed in this paper is concentrated on applied statistical analysis for developing an advanced image-processing technology in vision-based elderly care monitoring systems, facilitating health care for elderly people living independently. In brief, our proposed approach involves the following. Background subtraction is first used to detect moving and motionless objects. Simple and effective feature extraction is then performed using the virtual grounding point (VGP) concept, as well as the area of the shape and the aspect ratio of the detected object. In addition, these three features are observed in detailed considerations of the moving average and modified difference calculations. Careful observations of the critical features of the local maximum or local minimum are made to predict the high possibility of an abnormal state for the point distance of the VGP, the area, and the aspect ratio. Moreover, the optimal threshold values are exploited by calculating the mid value or period detection for the probability distribution of the Hidden Markov Model. Finally, the consecutive abnormal and normal states are differentiated using the Viterbi algorithm.

Contributing modules include an analysis of the state with an extensive observable area of shape features, and decision rules made using the Hidden Markov Model. Moreover, a simple analysis of camera calibration is performed using our observed VGP and its inclusive features. Experimental results indicate that our proposed system is reliable and effective for detection. In order to confirm the validity of our proposed methods, we have drawn comparisons with existing methods described in the literature. Specifically, comparisons were made between our approach and three similar methods: (1) threshold-based fall detection; (2) neural network algorithms [16,18]; and (3) CNN approaches [23,24]. Compared to these methods, our proposed method achieved a higher performance, as confirmed using the public dataset of videos taken in simulated environments that closely approximate real life.

This paper focuses on detecting falls, which are common among elderly people. In future research, we will improve our analysis by applying it to behavioral features, such as timing, posture, and gait. Using the results of surveys, we will also take into consideration behavioral differences between healthy young adults and the elderly. Though human behavior is complex and inconsistent between different persons, situations, and environments, our system can improve the interpretation of human behavior through an analysis of its visual manifestations in different situations. However, further testing using large amounts of data is required for system optimization. Furthermore, we expect that our proposed system can be improved by applying techniques related to modern deep learning. In fact, future experiments involving simulations of abnormal events should prove a fruitful area for applying such techniques.

Author Contributions: Conceptualization, S.N.N.H. and P.T.; methodology, S.N.N.H.; software, S.N.N.H.; validation, T.T.Z. and P.T.; formal analysis, S.N.N.H.; investigation, T.T.Z.; resources, S.N.N.H., T.T.Z., and P.T.; data curation, S.N.N.H.; writing—original draft preparation, S.N.N.H.; writing—review and editing, S.N.N.H., T.T.Z., and P.T.; visualization, S.N.N.H.; supervision, T.T.Z.; S.N.N.H. conducted a series of experiments and all three authors analyzed the results. All authors have read and agreed to the published version of the manuscript. Funding: This research received no external funding. Acknowledgments: The authors would like to express sincere thanks to Hiromitsu Hama who is an emeritus Professor at Osaka City University, Japan, for his kind and valuable suggestions and advice during this research work. The authors also would like to thank reviewers and editors of this special issue for all valuable suggestions and comments for improving this research paper. Conflicts of Interest: The authors declare no conflict of interest.

References

1. World Health Organization Regional Office for Europe. Healthy Ageing: Policy. Available online: http://www.euro.who.int/en/health-topics/Life-stages/healthy-ageing/policy (accessed on 13 December 2018).

2. Lindemann, U.; Hock, A.; Stuber, M.; Keck, W.; Becker, C. Evaluation of a fall detector based on accelerometers: A pilot study. Med. Biol. Eng. 2005, 43, 548–551. [CrossRef][PubMed]
3. Wang, C.-C.; Chiang, C.-Y.; Lin, P.-Y.; Chou, Y.-C.; Kuo, I.-T.; Huang, C.-N.; Chan, C.-T. Development of a Fall Detecting System for the Elderly Residents. In Proceedings of the 2008 2nd International Conference on Bioinformatics and Biomedical Engineering, Shanghai, China, 16–18 May 2008; pp. 1359–1362.
4. Bagalà, F.; Becker, C.; Cappello, A.; Chiari, L.; Aminian, K.; Hausdorff, J.M.; Zijlstra, W.; Klenk, J. Evaluation of Accelerometer-Based Fall Detection Algorithms on Real-World Falls. PLoS ONE 2012, 7, e37062. [CrossRef]
5. Mathie, M.J.; Coster, A.C.F.; Lovell, N.H.; Celler, B.G. Accelerometry: Providing an integrated, practical method for long-term, ambulatory monitoring of human movement. Physiol. Meas. 2004, 25, R1–R20. [CrossRef][PubMed]
6. Rimminen, H.; Lindström, J.; Linnavuo, M.; Sepponen, R. Detection of falls among the elderly by a floor sensor using the electric near field. IEEE Trans. Inf. Technol. Biomed. 2010, 14, 1475–1476. [CrossRef][PubMed]
7. Abbate, S.; Avvenuti, M.; Bonatesta, F.; Cola, G.; Corsini, P.; Vecchio, A. A smartphone-based fall detection system. Pervasive Mob. Comput. 2012, 8, 883–899. [CrossRef]
8. Albert, M.V.; Kording, K.; Herrmann, M.; Jayaraman, A. Fall Classification by Machine Learning Using Mobile Phones. PLoS ONE 2012, 7, e36556. [CrossRef][PubMed]
9. Mao, A.; Ma, X.; He, Y.; Luo, J. Highly Portable, Sensor-Based System for Human Fall Monitoring. Sensors 2017, 17, 2096. [CrossRef]
10. Chaccour, K.; Darazi, R.; El Hassani, A.H.; Andres, E. From Fall Detection to Fall Prevention: A Generic Classification of Fall-Related Systems. IEEE Sens. J. 2017, 17, 812–822. [CrossRef]
11. Sugimoto, M.; Zin, T.T.; Takashi, T.; Shigeyoshi, N. Robust Rule-Based Method for Human Activity Recognition. IJCSNS Int. J. Comput. Sci. Netw. Secur. 2011, 11, 37–43.
12. De Miguel, K.; Brunete, A.; Hernando, M.; Gambao, E. Home Camera-Based Fall Detection System for the Elderly. Sensors 2017, 17, 2864. [CrossRef]
13. Htun, S.N.N.; Zin, T.T.; Hama, H. Virtual Grounding Point Concept for Detecting Abnormal and Normal Events in Home Care Monitoring Systems. Appl. Sci. 2020, 10, 3005. [CrossRef]
14. Rougier, C.; Meunier, J.; St-Arnaud, A.; Rousseau, J. Fall Detection from Human Shape and Motion History Using Video Surveillance. In Proceedings of the 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07), Niagara Falls, ON, Canada, 21–23 May 2007; pp. 875–880. [CrossRef]
15. Rougier, C.; St-Arnaud, A.; Rousseau, J.; Meunier, J. Video Surveillance for Fall Detection. In Video Surveillance; IntechOpen: London, UK, 3 February 2011. [CrossRef]
16. Kishanprasad, G.; Prachi, M. Indoor Human Fall Detection System Based on Automatic Vision Using Computer Vision and Machine Learning Algorithms. J. Eng. Sci. Technol. 2018, 13, 2587–2605.
17. Lotfi, A.; Albawendi, S.; Powell, H.; Appiah, K.; Langensiepen, C. Supporting Independent Living for Older Adults; Employing a Visual Based Fall Detection Through Analysing the Motion and Shape of the Human Body. IEEE Access 2018, 6, 70272–70282. [CrossRef]
18. Suad, G.A. Automated Human Fall Recognition from Visual Data. Ph.D. Thesis, Nottingham Trent University, Nottingham, UK, February 2019; pp. 1–160.
19. Htun, S.N.N.; Zin, T.T. Motion History and Shape Orientation Based Human Action Analysis.
In Proceedings of the 2019 IEEE 8th Global Conference on Consumer Electronics (GCCE), Osaka, Japan, 15–18 October 2019; pp. 754–755.
20. Yong, H.; Meng, D.; Zuo, W.; Zhang, K. Robust Online Matrix Factorization for Dynamic Background Subtraction. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1726–1740. [CrossRef][PubMed]
21. Marki, N.; Perazzi, F.; Wang, O.; Sorkine-Hornung, A. Bilateral Space Video Segmentation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 743–751.
22. Htun, S.N.N.; Zin, T.T.; Hama, H. Human Action Analysis Using Virtual Grounding Point and Motion History. In Proceedings of the 2020 IEEE 2nd Global Conference on Life Sciences and Technologies (LifeTech), Kyoto, Japan, 10–12 March 2020; pp. 249–250.
23. Núñez-Marcos, A.; Azkune, G.; Arganda-Carreras, I. Vision-Based Fall Detection with Convolutional Neural Networks. Wirel. Commun. Mob. Comput. 2017, 2017, 1–16. [CrossRef]

24. Minjie, H.; Yibing, N.; Shiguo, L. Falls Prediction Based on Body Keypoints and Seq2Seq Architecture. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshop), Seoul, Korea, 27–28 October 2019; pp. 1251–1259. [CrossRef]
25. Amandine, D.; François, C. Automatic Fall Detection System with a RGB-D Camera using a Hidden Markov Model. In Proceedings of the ICOST-11th International Conference on Smart Homes and Health Telematics, Singapore, 19–21 June 2013; pp. 259–266. [CrossRef]
26. Triantafyllou, D.; Krinidis, S.; Ioannidis, D.; Metaxa, I.; Ziazios, C.; Tzovaras, D. A Real-time Fall Detection System for Maintenance Activities in Indoor Environments—This work has been partially supported by the European Commission through the project HORIZON 2020-INNOVATION ACTIONS (IA)-636302-SATISFACTORY. IFAC-PapersOnLine 2016, 49, 286–290. [CrossRef]
27. Daniel, J.; James, H.M. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd ed.; Stanford University: Stanford, CA, USA, 16 October 2019.
28. Le2i Fall Detection Dataset. Available online: http://le2i.cnrs.fr/Fall-detection-Dataset?lang=fr (accessed on 27 February 2013).
29. Cattaneo, C.; Mainetti, G.; Sala, R. The Importance of Camera Calibration and Distortion Correction to Obtain Measurements with Video Surveillance Systems. J. Phys. Conf. Ser. 2015, 658, 012009. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).