Multimodal Interaction with Internet of Things and Augmented Reality

Foundations, Systems and Challenges

Joo Chan Kim

Multimodal Interaction with Internet of Things and Augmented Reality
Foundations, Systems and Challenges

Author: Joo Chan Kim

Supervisors: Teemu H. Laine and Christer Åhlund

Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering
Division of Computer Science
ISSN 1402-1536
ISBN 978-91-7790-562-2 (pdf)

Luleå 2020
www.ltu.se

Abstract

The development of technology has enabled diverse modalities that can be used by humans or machines to interact with computer systems. In particular, the Internet of Things (IoT) and Augmented Reality (AR) are explored in this report due to the new modalities offered by these two innovations, which could be used to build multimodal interaction systems. Researchers have utilized multiple modalities in interaction systems to provide better usability. However, the employment of multiple modalities introduces some challenges that need to be considered in the development of multimodal interaction systems to achieve high usability. In order to identify a number of remaining challenges in the research area of multimodal interaction systems with IoT and AR, we analyzed a body of literature on multimodal interaction systems from the perspectives of system architecture, input and output modalities, data processing methodology and use cases. The identified challenges concern (i) multidisciplinary knowledge, (ii) reusability, scalability and security of multimodal interaction system architecture, (iii) usability of the multimodal interaction interface, (iv) adaptivity of multimodal interface design, (v) limitations of current technology, and (vi) the advent of new modalities. We expect that the findings of this report and future research can be used to nurture the multimodal interaction system research area, which is still in its infancy.

Table of Contents

1 Human-computer Interaction

2 Foundations
  2.1 Multimodal Interaction
  2.2 Internet of Things
  2.3 Augmented Reality

3 Multimodal Interaction - Modality
  3.1 Input (Human → Computer)
    3.1.1 Visual signal
    3.1.2 Sound
    3.1.3 Biosignals
    3.1.4 Inertia & Location
    3.1.5 Tangible objects
  3.2 Output (Computer → Human)
    3.2.1 Visual representation
    3.2.2 Sound
    3.2.3 Haptics
    3.2.4 Others

4 Multimodal Interaction - System Modeling
  4.1 Integration (Fusion)
    4.1.1 Data level integration
    4.1.2 Feature level integration
    4.1.3 Decision level integration
  4.2 Presentation (Fission)

5 Multimodal Interaction using Internet of Things & Augmented Reality
  5.1 Internet of Things
    5.1.1 Visual signal
    5.1.2 Sound
    5.1.3 Biosignal
    5.1.4 Inertia & Location
  5.2 Augmented Reality
  5.3 IoT with AR
    5.3.1 AR for user interaction
    5.3.2 AR for interactive data representation

6 Discussion

7 Conclusion

List of Figures

1 Multimodal interaction framework
2 Internal framework of interaction system
3 Visual input signal types
4 Sound categories
5 Biosignals and corresponding body positions
6 The architecture of three integration types
7 Interaction types in ARIoT

List of Tables

1 Interaction type classification
2 Challenges and Research questions

Abbreviations

AR Augmented Reality
BCI Brain-Computer Interface
ECG Electrocardiogram
EDA Electrodermal Activity
EEG Electroencephalography
EMG Electromyography
EOG Electrooculography
FPS First-person Shooter
HCI Human-Computer Interaction
HMD Head-Mounted Display
IMU Inertial Measurement Unit
IoT Internet of Things
ISO International Organization for Standardization
ITU International Telecommunication Union
MI Multimodal Interaction
MR Mixed Reality
PPG Photoplethysmography
RFID Radio-frequency Identification
SCR Skin Conductance Response

1 Human-computer Interaction

Human-Computer Interaction (HCI) is a research field that mainly focuses on designing methods for humans to interact with computers. The discipline started to grow in the 1980s [1], and the term HCI was popularized by Stuart K. Card [2]. Since then, apart from the ordinary interaction devices (i.e., mouse and keyboard), researchers have started to design new interaction systems based on multimodal interaction by combining more than one interaction method, such as speech and hand gestures [3]. Nowadays, the development of technology enables ubiquitous computing, and this development makes it possible to utilize many new interaction technologies, such as head-mounted displays [4], sensors [5], brain-computer interfaces [6], augmented reality [7], and smart objects [8]. In this environment, an increase in the complexity, as well as the potential, of multimodal interaction is inevitable. However, improving the usability of multimodal interaction for the user remains a challenge in HCI research [9], [10]. A related challenge is that the designer of multimodal interaction systems must have multidisciplinary knowledge in diverse fields to understand the user, the system and the interaction in order to achieve high usability [9], [11]. The term 'usability' is used when a study aims to evaluate an interaction system from the user's perspective. According to the ISO 9241-11:2018 standard [12], usability consists of effectiveness, efficiency, and user satisfaction, and these aspects are typically covered by usability measurement instruments.

In this report, we give an overview of multimodal interaction and focus on two of its aspects: the system and the interaction. In particular, this report provides an overview of state-of-the-art research on two innovations, Internet of Things (IoT) [8] and Augmented Reality (AR) [13], due to the new modalities offered by these innovations, which could be used to build multimodal interaction systems. In this report, multimodal interaction refers to multiple inputs and/or outputs from the system's perspective. We review state-of-the-art research on multimodal interaction systems with AR and/or IoT technologies published from 2014 to 2019. Through this comprehensive review, the report gives general knowledge of multimodal interaction to parties interested in the subject and thereby helps to identify challenges for future research and development.

2 Foundations

In this section, we explain the definitions of the terms Multimodal Interaction (MI), Internet of Things (IoT) and Augmented Reality (AR), and elaborate upon them in relation to previous research. The goals of this section are to: (i) show the ambiguity of the terms due to multiple existing definitions, and (ii) formulate the definitions to be used in this study.

2.1 Multimodal Interaction

One of the key terms in multimodal interaction is modality. The sender (as well as the receiver) of the data can either be a human user or a machine. Consequently, the delivery method can incorporate both analogue (e.g., human body parts) and digital (e.g., a digital image) components. We distinguish state from intent because the response created from the received data depends on whether the data represents the state or the intent of the sender. For example, a system can understand the user's intent when the user uses a finger as a tool of modality to point out (i.e., gesture) or select an object (i.e., touch), whereas another modality, the heartbeat, can be used to interpret the user's state. In this report, modality is defined as follows:

Definition 2.1. Modality is a delivery method to transfer data which can be used for interpretation of the sender's state [9] or intent [11].

Consequently, modalities in a multimodal interaction system are often connected to voluntary actions (e.g., touch, spoken words), but they can also represent signals of an involuntary or autonomous nature (e.g., heartbeat, sweating, external monitoring by a third party). The modality for the interaction system can vary depending on the devices or technology available to the system and the purpose of the system. Not only a mouse and a keyboard, but also computer vision, sound, biosignals, and human behaviour can be utilized as modalities. In this sense, when an interaction system employs more than one modality for input or output, it is referred to as a multimodal interaction system. Figure 1 depicts our framework of multimodal interaction, which represents the relationship between the agent and the system through modalities. In this report, we use 'agent' to refer to both humans and machines. While Figure 1 illustrates the fundamental case where one agent interacts with a system, it can also be extended to cover multi-agent interaction systems by adding more agents. In Section 3, the various modalities that have been used for transmitting input or output data between an agent and the interaction system are described in detail.

Figure 1: Multimodal interaction framework

The term 'multimodal interaction' has been used and defined in other studies [6], [9]. However, we make our own definition based on our framework:

Definition 2.2. Multimodal interaction is the process between the agent and the interaction system that allows the agent to send data by combining and organizing more than one modality (see "INPUT" in Figure 1), while output data can also be provided by the interaction system in multiple modalities (see "OUTPUT" in Figure 1).

The primary purpose of multimodal interaction is to provide better usability and user experience by using multiple modalities rather than a single modality. As an example, Kefi et al. [14] compared user satisfaction with two different input modality sets for controlling a three-dimensional (3D) virtual environment: (i) mouse only, and (ii) mouse with voice. The results of this study showed that the use of two input modalities (mouse with voice) could provide better user satisfaction than one (mouse) input modality. To provide a detailed analysis of the process of multimodal interaction and of how input and output data are handled, we divide it into two steps: integration and presentation. In the integration step, the data from different modalities are merged and interpreted for the presentation step. Based on this interpretation, a presentation is made to the agent through one or more output modalities. Figure 2 visualizes the internal framework of the interaction system. Each step has a unique process for data handling, and the details of both steps are explained in Section 4.

Figure 2: Internal framework of interaction system

2.2 Internet of Things

The Internet of Things (IoT) is a technology powered by ubiquitous sensing, which was realized through the advancement of networks and computing power [15]. This advancement enables the creation of small connected sensor devices that have enough power to gather real-time data from users or the environment and transfer them through a network. With these sensors, the IoT becomes a notable technology in multimodal interaction due to its ability to provide valuable data for understanding the context of users or the environment. The International Telecommunication Union (ITU) defines IoT as follows [16]:

Definition 2.3. Internet of Things (IoT) is a global infrastructure for the information society, enabling advanced services by interconnecting (physical and virtual) things based on existing and evolving interoperable information and communication technologies.

However, our interest in this report is the type of data that IoT devices can collect, and the ITU definition lacks a description of 'interoperable information'. We therefore define the term IoT in our own words.

Definition 2.4. Internet of Things (IoT) is the collection of interconnected objects equipped with sensors which are able to receive data from a real-world entity, such as a human body, the environment, and other physical entities, and transfer them to a destination through a network.

As an example, Jo et al. [10] showed that, by utilizing IoT, a system could provide the current state of merchandise in a shop to the user through a mobile device to improve the shopping experience. More IoT use cases in multimodal interaction are presented in Section 5.
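To make Definition 2.4 concrete, the following minimal sketch shows an IoT sensor node that reads a value from the real world and transfers it to a destination over the network. The endpoint URL, the device name, and the read_temperature() function are hypothetical placeholders, not part of any cited system.

```python
# Minimal sketch of Definition 2.4 (assumed endpoint and sensor driver).
import json
import time
import urllib.request

ENDPOINT = "http://example.com/iot/ingest"  # hypothetical collection server


def read_temperature() -> float:
    """Placeholder for a real sensor driver (e.g., an attached thermometer)."""
    return 21.5


def publish(reading: dict) -> None:
    """Serialize one reading and transfer it to the destination over the network."""
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(reading).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=5)


if __name__ == "__main__":
    while True:
        publish({"sensor": "temp-01", "value": read_temperature(), "ts": time.time()})
        time.sleep(60)  # report once per minute
```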

2.3 Augmented Reality

Augmented Reality (AR) is another notable technology that helps the user to interact with the system or the environment. AR is a technology that visualizes computer-generated graphical objects on a view of the real world, where the view is typically represented by a real-time camera feed on a device or a head-mounted display (HMD). According to our definition of modality, AR is a user interface rather than a modality. In this report, we use the well-established definition of AR by Paul Milgram and Fumio Kishino [17]:

Definition 2.5. As an operational definition of Augmented Reality, we take the term to refer to any case in which an otherwise real environment is 'augmented' by means of virtual (computer graphic) objects.

The input modalities for interacting with AR can be hand gestures, voice, face, and body gestures [13], which are often combined in multimodal interaction AR systems. For example, Lee et al. [7] built a multimodal interaction system that combines hand gesture recognition with an additional input modality to control an AR object.

The result showed that the multimodal interaction method produced better usability than a single-modality interaction method. AR is implementable on devices that have a camera to capture a view of the real world and enough computing power to present virtual objects on the screen. Additionally, there is a similar technology called Mixed Reality. Mixed Reality (MR) is an advanced form of AR which diminishes the gap between the real world and the virtual world by creating virtual objects that are aware of the properties of real-world objects, and can thereby act as if they were part of the real world. Microsoft's HoloLens is a well-known HMD that supports MR. Section 5 presents more use cases of AR in multimodal interaction, which are also applicable to MR technology.

3 Multimodal Interaction - Modality

In this section, we present the feasible input and output modalities that have been used in other studies. Each modality is categorized based on its type, and each category includes information on (i) instruments which can record/capture data for the modality, (ii) the type of modality, (iii) technology for processing the captured data, and (iv) identified limitations and open challenges.

3.1 Input (Human → Computer)

3.1.1 Visual signal

Vision is one of the key functions for humans to obtain information, and interaction systems can also accept visual information. In order to get input from an agent via a visual signal, the interaction system commonly uses a camera to receive the input in the form of a single image, a sequence of images, or a video clip. There are two types of visual signals that can be used as a modality by an agent. One type is a feasible input modality when the agent is capable of expressing humanlike gestures or has a humanlike form, whereas the other type is available when the agent has a non-humanlike form. Figure 3 illustrates the categories of visual input signal types.

Figure 3: Visual input signal types

When the agent has a humanlike form, the agent's body can be used as a visual signal by posing or moving it. In this case, the interaction system recognizes body gestures by interpreting them according to their purpose. For example, Kinect from Microsoft is a motion-sensing device for reading human body movement as input data. 'Just Dance' is one of the well-known game franchises developed by Ubisoft that utilizes Kinect to obtain the player's body gestures as an input modality [18]. Players have to follow given dancing gestures by moving their bodies to beat the level. Kinect has been widely used not only in games, but also in research [19], physical training [20]–[22], and smart environments [23] in order to receive body gestures as an input modality.

While the interaction system can recognize the entire body of an agent, a specific part of the body can also be considered as a tool for input modality depending on the purpose. We divide the body into two parts based on the waist position. Above the waist is the 'upper body' part (e.g., arms, hands [4], [5], [24]), and beneath the waist is the 'lower body' part (e.g., legs, feet). In the upper body, three parts have drawn attention from researchers as sources of a visual input signal: the head, the arms, and the hands.

The head is the part that shows facial expressions composed of eyes, eyebrows, lips, and nose. Facial expressions have been used for detecting emotion [25], whereas facial features have been used for identifying a person [26], [27]. Facial recognition systems use several facial features for comparison against models in a database. For example, iris recognition systems, such as the one proposed by Elrafaei et al. [28], work with a built-in camera in a smartphone that reads the iris to identify the user. Additionally, these facial features are not only used for identifying the user, but also for accessing the controls of the system or for other purposes. For example, the eyes have features, such as gaze movement and eye-blink, which can be used to manipulate a system [29] or imply the intentions of a user [30].

The arms are one of the upper body parts that are used to display most humanlike behaviours. For example, Luo et al. [31] built a robot that uses Kinect to imitate human arm movements with its mechanical arms. A Kinect camera can thus serve as the interface between human arms and mechanical arms, as in human-robot interaction systems [32]. The hands are the last upper body part used as a tool in order to achieve natural and intuitive communication in HCI [33]. With fingers and hands, an agent is capable of providing various forms of visual signals to a camera, from static gestures to dynamic gestures [34]. For example, sign language is an example of static gestures that can be captured through a camera [35], while hand tracking for manipulating a system is a use case of dynamic gestures [36], [37]. For more information regarding vision-based hand gesture recognition, see the comprehensive survey by Rautaray and Agrawal [5].

Whereas the upper body is composed of three distinct sources of visual input signals, the lower body has two distinct parts for visual input signals: the feet and the legs.

According to our study, the feet, as the lowermost part of the human body, are less commonly detected by vision than other body parts. Although many studies proposed systems that use motion-detecting sensors attached to the feet or ankles [38], [39] (described in detail in Section 3.1.4), some studies have tried to detect the feet from visual information. For example, Hashem and Ghali [40] developed a system that extracts foot features to identify the user, and Lv et al. [41] used the foot's shape, tracked from an orthogonal view through a smartphone's built-in camera, to control a soccer ball in a game. The legs are another lower body part used in multimodal interaction systems, and they have mainly been used to understand human activity, such as gait for identification of the user [42], leg movement for character control [43], and operating a humanoid robot by imitating the user's dancing movement [44].

When the interaction system is set up to use a form other than the humanlike one, an image captured through an optical device can be used to trigger interactions between the agent and the interaction system. According to our findings, there are two types of visual input signals to be used when the agent has a non-humanlike form. One type is the marker, which is a plain 2D image that is pre-stored in a system. When the system recognizes the printed image scanned by an optical device, an interaction happens. There are various use cases for this type of visual input signal, such as barcodes [45], [46], QR codes [47], [48], AR markers [49], [50], and plain printed images which need to be scanned in order to activate an interaction [49], [51]. The other type of visual signal that belongs to the non-humanlike category is the markerless signal, whereby the interaction system can identify the target by using object detection or image recognition techniques. Unlike the marker type, the markerless type does not need a plain 2D image in order to trigger an interaction. In this type, the interaction system uses physical objects (e.g., buildings [52], the environment [53] or a car's registration plate [54]) instead of plain printed images.

There are several factors that can influence the performance of interaction between an interaction system and an agent when the system uses a visual signal as the input modality. These factors include, but are not limited to, objects overlaying markers, different light conditions (especially in the dark), and target placement outside the line of sight [24], [50].
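As a rough illustration of the marker type of visual input described above, the sketch below maps pre-stored marker identifiers to interactions and ignores unknown markers; the marker IDs and actions are invented for illustration only.

```python
# Hypothetical marker registry: recognized marker ID -> interaction to trigger.
MARKER_ACTIONS = {
    "qr:door-01": "unlock_door",
    "ar:poster-42": "show_ar_overlay",
    "barcode:4006381333931": "show_product_info",
}


def handle_marker(marker_id: str) -> str:
    """Trigger an interaction only for markers that are pre-stored in the system."""
    action = MARKER_ACTIONS.get(marker_id)
    if action is None:
        return "ignored: unknown marker"
    return f"triggered: {action}"


print(handle_marker("ar:poster-42"))   # triggered: show_ar_overlay
print(handle_marker("qr:unknown-99"))  # ignored: unknown marker
```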

3.1.2 Sound

Sound is another input modality that is used for interaction between a system and an agent [55]. It is also one of the natural and intuitive communication methods for sending and receiving information. Sound is a wave that passes through a transmission medium such as air, liquid, or a solid object, and devices that can capture this wave and turn it into an electrical signal are used to record sound. Sound is divided into two categories depending on whether it is speech or not. Figure 4 describes these two categories.

Figure 4: Sound categories (sneezing in ‘vocal sound’ by Luis Prado from the Noun Project)

According to our research, many studies used speech to communicate with a system by giving verbal commands. In this case, the system needs to understand what the agent says; thus, the system employs speech recognition in order to form a response based on the agent's spoken request. Modern mobile devices such as the HoloLens and smartphones have speech recognition systems installed, and they are used in various forms. For example, Microsoft's Cognitive Speech Services enables the use of voice commands in Microsoft's products [4], [56], and Apple's Speech API is used in their products to add the capability of providing a natural-language user interface [57]. Moreover, when speech recognition is combined with machine learning technologies, it enables humanlike virtual assistants such as Cortana by Microsoft [58], Siri by Apple [59] and Google Assistant by Google [60]. Additionally, just as facial expressions are used to recognize emotions, voices are used as well [61]. For example, Amandine et al. [62] validated over 2,000 emotional voice stimuli through an evaluation with 1,739 participants. The participants scored the presented emotional voice stimuli based on voice range, valence, and arousal. The validation results provide 20 emotions and a neutral state in three different languages (English, Swedish, and Hebrew), which can be used in other voice emotion recognition studies.
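As a simplified illustration of how a system can act on recognized speech, the following sketch maps a transcript (assumed to come from an external speech recognition service) to system commands through keyword matching; the commands and keywords are illustrative assumptions.

```python
from typing import Optional

# Hypothetical command vocabulary: keywords that must all appear in the transcript.
COMMANDS = {
    ("turn", "on", "light"): "light_on",
    ("turn", "off", "light"): "light_off",
    ("open", "door"): "door_open",
}


def interpret(transcript: str) -> Optional[str]:
    """Return the first command whose keywords all occur in the transcribed speech."""
    words = set(transcript.lower().split())
    for keywords, command in COMMANDS.items():
        if all(keyword in words for keyword in keywords):
            return command
    return None


print(interpret("Please turn on the light"))   # light_on
print(interpret("How is the weather today?"))  # None (no matching command)
```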

The other category of sound is nonspeech, which is any sound that is not speech. Nonspeech sounds include vocal sounds that do not represent anything in a language (see sneezing in the 'vocal sound' area in Figure 4). Researchers have used nonspeech sounds to understand events [63] or environments [64], or even to unveil hidden problems. As an example of the latter, Joyanta et al. [65] developed a system that listens to cardiac sounds to identify abnormal conditions that imply possible diseases.

3.1.3 Biosignals

In this report, we categorize a biosignal as an input modality when it is collected through body-attached sensors. There are many different biosignals of the human body that could be used for interaction; however, we present only the biosignals that have been used in interaction systems. In our analysis, we identified brainwaves, heart rate, muscle activity (Electromyography: EMG), skin conductance (Electrodermal Activity: EDA), skin temperature, and blood pressure. Each biosignal and the corresponding body position for capturing it are illustrated in Figure 5.

Figure 5: Biosignals and corresponding body positions

Firstly, a brainwave is an electrical signal of brain activity that can be captured by Electroencephalography (EEG). In brain-computer interface (BCI) studies, researchers have used brainwaves to infer the user's state, intentions, or even emotions in order to understand and predict the user's needs [6]. There are six frequency bands (i.e., delta, theta, alpha, beta, gamma, mu) in EEG that represent different types of brainwaves. An EEG sensor can collect these brainwaves from different positions on the head. These frequency bands should be collected and analyzed carefully based on the purpose of the research, because each frequency band implies different information about the brain activity of users [66], [67]. Due to the necessity of stable contact of the EEG sensor on a specific part of the head for better data quality, the movements of the user wearing the EEG sensor are usually restricted during data collection [68]; however, the deep level of understanding of the user is an attractive property of EEG for researchers. Zao et al. [69] developed an HMD to collect the six frequency bands and an additional two channels for the user's eye activity by combining EEG and Electrooculography (EOG) sensors. To reduce the interference of the sensors' cables with the user's behaviour, the collected data were sent to the system through a wireless network. The authors verified that an HMD with EEG/EOG sensors is capable of monitoring users' brain activity when they are exposed to visual stimulation. More use cases of EEG are documented in a study by Gang et al. [70], which focused on people with functional disabilities.

Secondly, the heart is one of the internal organs that directly relate to human life. Therefore, it is not a surprise that heart activity has become one of the biosignals attracting researchers' academic curiosity about understanding humans' bodily states. A commonly used method to observe heart activity is dedicated sensor hardware, such as an Electrocardiogram (ECG), which is attached to a human body and can capture the electrical activity and heart rate. There are, however, several studies that detected heart activity without attached sensors in order to overcome the intrusiveness of attached sensors. Examples include a smartphone camera that reads blood pressure from a fingertip by analyzing the captured image of the finger [71] and a system that uses a web camera to recognize blood circulation from facial skin color [72]. Due to the strong relationship between health condition and the heart's state, heart rate has been used for monitoring people's health state [73], for detecting critical health issues [74], and even for measuring mental stress level [75]. Moreover, studies have demonstrated the effectiveness of heart rate as a method for engaging users in interaction with games [76], [77].

Thirdly, since most parts of the human body are composed of muscles, skin, and bones, muscle activity is another biosignal source. The most common method to acquire data on muscular activity is the use of EMG [78]. Unlike other biosignals that are collected through attached sensors, such as EEG, EMG is less prone to contamination of the collected data by noise. This is because of how the attached sensors work to detect the biosignal.

For example, an EEG sensor reads brainwaves that are propagated from the brain through the pericranial muscles to the skin, where the sensor can read them. EEG sensors amplify the signal due to the low amplitude of the original signal in order to successfully capture the data. During the propagation and amplification of brainwave signals, other muscular activity may add unwanted noise to the EEG signals. In contrast, the EMG signal has a relatively high amplitude compared to the EEG signal, which means that the EMG signal is less likely to be contaminated by noise [79], [80]. With this characteristic and the capability of understanding human physical activity, EMG has been used for various purposes, such as controlling a robotic finger through muscle activity of the lower arm [81], controlling a game character through muscle activity of specific body parts (e.g., hand gestures [82], mouth movement [83]), and controlling a serious game as rehabilitation for people who have disabilities [84], [85].

Fourthly, the skin, like the muscles, is another part of the body that is present throughout the entire human body. The skin conductance response (SCR) is electrical activity on the skin caused by internal or external stimuli, which can be interpreted as a clue to the cognitive state in order to identify stress level [86] and emotions [87]. Owing to this characteristic, SCR has been used to interact with systems. For example, Shigeru et al. [88] developed a 2D game where the number of obstacles, which the player's game character should avoid, is controlled by the player's SCR. Yi et al. [89] also made a game, but in virtual reality (VR), with a different composition of biosignals, such as SCR and heart rate variability, to detect emotions in order to adjust the game environment depending on the player's emotional state.

Fifthly, not only the electrical activity but also the temperature of the skin can be a biosignal for the interaction system. The variability of skin temperature during physical activity is a useful biosignal for understanding which body part was used [90]. In that sense, skin temperature has been proposed as a cue to achieve better efficiency of physical exercise for rehabilitation [91]. Additionally, skin temperature has also been used for identifying people's health state [92] and mental state (e.g., stress level [93], emotions [94], [95]) due to its strong relation to bodily condition. In an interactive system, this identified health or mental state can be used as an input. For example, Chang et al. [96] developed a game that monitors the player's facial skin temperature in order to detect the player's stress status. The game automatically adjusts the difficulty level based on the player's stress status to improve the gaming experience. To capture skin temperature, an attachable sensor [97], [98] or an infrared camera [99] is used.

Lastly, blood pressure, which is another basis of life activity along with heart rate, is a biosignal that can be measured by a sphygmomanometer, an attachable device/sensor on the body. It reads the pressure of the blood, which is caused by the heart pumping blood into the body through the vessels. Since blood pressure is generated by the heart, it is another source of biosignals for monitoring people's health and mental state and for identifying critical issues, such as cardiovascular diseases [100], hypertension [101] and stress [102]. Furthermore, systolic and diastolic blood pressures have been used for emotion detection, whereby participants' reactions were categorized into positive and negative [103], [104].
Additionally, games are a well-studied subject in combination with blood pressure measurement for finding correlations between players and games, such as the effects of games on health [105] and stress [106], cardiovascular reactivity depending on gender [107] and type of game (e.g., M-rated versus E-rated) [108], and the efficiency of games for controlling blood pressure [109]. A study by Gbenga and Thomas [110] provides a well-documented account of other measurement methods for blood pressure, whereas a study by William et al. [111] gives a literature survey on home blood pressure measurement in order to verify its applicability from the perspective of clinical practice.
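As a concrete example of processing one of the biosignals discussed above, the sketch below estimates EEG band power with a plain FFT, assuming a single-channel signal sampled at 256 Hz and the common delta-gamma band limits; it omits the filtering and artifact removal that a real BCI pipeline would need.

```python
import numpy as np

FS = 256  # assumed sampling rate in Hz
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}


def band_powers(signal: np.ndarray) -> dict:
    """Average spectral power per EEG frequency band (no filtering or artifact removal)."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / FS)
    power = np.abs(np.fft.rfft(signal)) ** 2
    return {name: float(power[(freqs >= lo) & (freqs < hi)].mean())
            for name, (lo, hi) in BANDS.items()}


# Synthetic 4-second example: a 10 Hz (alpha-range) oscillation plus noise.
t = np.arange(0, 4, 1.0 / FS)
eeg = np.sin(2 * np.pi * 10 * t) + 0.2 * np.random.randn(len(t))
print(band_powers(eeg))  # the 'alpha' entry dominates
```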

3.1.4 Inertia & Location

Regarding other types of sensors for input modalities, an Inertial Measurement Unit (IMU) is an electronic device composed of accelerometers, gyroscopes, and magnetometers that measures force, angular velocity, and even orientation [112], [113]. By using an IMU on the human body, various human motions become recognizable by systems, producing useful information for people. Examples include a fall detector for preventing injuries [114], a swimming motion analyzer for helping swimmers achieve better performance [115], and a gait recognizer for identifying people in security-related applications [113]. Since an IMU can be installed on any target object, not only human motion but also the motion of other physical objects can be measured. As an example, Wu et al. [116] attached an IMU to a broomstick in order to control the direction of a virtual character in their game. In general, smartphones are commonly used devices with integrated IMUs that provide inertial information to applications, and IMUs are used as part of controllers for interaction in gaming gear such as the Wii and the HTC Vive [117], [118]. Moreover, an IMU can be used to provide positioning as an alternative to satellite-based positioning devices in places where a satellite signal is unavailable [119], [120].

Geolocation is another type of input modality that refers to information about the geographical position of a user or a system entity. The Global Positioning System (GPS) is a satellite-based positioning system developed by the United States that provides geolocation data from a network of satellites. There are also a number of other satellite-based navigation systems run by other countries throughout the world, such as the GLObalnaya NAvigatsionnaya Sputnikovaya Sistema (GLONASS) by Russia, the Indian Regional Navigational Satellite System (IRNSS) by India, Galileo by the European Union, the Quasi-Zenith Satellite System (QZSS) by Japan, and the BeiDou Navigation Satellite System (BDS) by China. Geolocation data is used in diverse ways, from pathfinding services (i.e., navigation) to physical activity analysis. For example, Ehrmann et al. [121] measured various data from soccer players regarding the distance they covered during matches. Using the measured data, they analyzed the relationship between soft tissue injuries and the intensity of the players' physical activity and identified some variables that can be used to predict player injuries.

Geolocation data is also used in systems for interaction, such as in Pokémon GO for accessing game content at specific locations [122], and in monitoring systems for tracking a specific target in order to give alerts about potential threats to safety [123].
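To illustrate how inertial data can be turned into an interaction event, the following sketch flags a fall when the total acceleration magnitude exceeds a threshold; the 2.5 g threshold and the sample values are illustrative assumptions rather than values from the cited studies.

```python
import math

FALL_THRESHOLD_G = 2.5  # assumed impact threshold in units of g


def is_fall(ax: float, ay: float, az: float) -> bool:
    """Flag a sample whose total acceleration magnitude exceeds the threshold."""
    return math.sqrt(ax ** 2 + ay ** 2 + az ** 2) > FALL_THRESHOLD_G


samples = [(0.0, 0.0, 1.0),   # resting (gravity only)
           (0.3, 0.2, 1.1),   # walking
           (2.1, 1.9, 2.4)]   # sudden impact
for sample in samples:
    print(sample, "fall detected" if is_fall(*sample) else "ok")
```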

3.1.5 Tangible objects

From the perspective of input modalities, a tangible object refers to any object that can be touched and used by agents to provide input data in order to manipulate digital information [124]. The mouse and the keyboard are commonly known tangible objects for providing input data. When any object is combined with other sensors or devices so that it is capable of creating input signals, the object becomes a potential tool for input modality. For example, Jacques et al. [125] developed a glove combined with an IMU to read the user's hand gestures for controlling a virtual object. Andrea et al. [126] created a small wheel-shaped controller that can be used to enter a password by rotating the controller. Additionally, while some tangible objects create input data by using sensors attached to the objects, other tangible objects rely on an external device to provide input data. For example, Varsha et al. [127] created a block programming platform using tangible blocks, each of which has an image that refers to a specific programming construct. When users align the blocks to form a certain sentence, the system reads the blocks' images with a camera and publishes the result in the form of narration.

3.2 Output (Computer → Human)

3.2.1 Visual representation

Since vision is one of the key methods of receiving information for both human and machine agents, interaction systems can utilise visual representation as an output modality in order to react to inputs captured from agents. As Filsecker et al. [128] and Hwang et al. [129] demonstrated in their applications for presenting information, still images, animation, text, and 2D/3D objects are commonly used forms of visual representation on a screen. Video is another form that can be displayed on a screen as a visual representation [130]. AR and MR, which combine video and 2D/3D objects, are also included in the visual representation category [131], [132]. Additionally, visual representation is not limited to information presented on a screen. For example, Horeman et al. [133] developed a training device that indicates the result of a suture in real time by using red or green light in order to improve suturing skill. The strength of visual representation is its intuitiveness in delivering information, whereas its weakness is the increased time cost of creating high-quality content.

3.2.2 Sound

Similar to the input modality described in Section 3.1.2, there are two fundamental choices that a system can utilise when sound is used as an output modality: speech and non-speech. An interaction system can use speech output that is created by two different techniques: either a human voice or a speech synthesizer [134]. The human voice technique can convey the emotional state when the speaker modifies the way of speaking according to the experienced emotion [135]. This is one of the reasons why the human voice is commonly used for virtual characters [136], [137], thus making them seem alive. However, the cost of developing human voice output is high when the system requires a large number of recorded voice lines. Therefore, the speech synthesizer arose as an alternative. A speech synthesizer is an artificial voice creator that can produce speech at a relatively low price in a short time [134]. However, it is still infeasible to convey emotional cues in synthesized speech as a human voice does; thus a mixture of these two techniques is sometimes used in order to compensate for this shortcoming [138]. As examples of the use of speech synthesizers, Elouali et al. [55] and Bhargava et al. [139] developed applications that can read out text on a device's screen by using speech synthesis.

In contrast to speech, any type of sound that does not represent anything in a human language is regarded as non-speech. Similar to speech, non-speech sound can be composed by using two different techniques: either a digital sound synthesizer or recording real-world sounds with a microphone. Many devices, applications, and games widely use sound effects created by digital sound synthesizers. For example, Nguyen et al. [140] developed a drowsiness detection system that plays an alarm sound when the driver of a car looks sleepy, and Koons and Haungs [141] implemented every sound effect in their game by using a digital sound synthesizer. Additionally, there are various types of synthesizers for musical instruments that amplify and modify the style of the output sound. In some cases, recorded real-world sounds are used to provide more immersive experiences. EA Digital Illusions CE AB (DICE) used real gunshot sounds in a first-person shooter (FPS) game, Battlefield 3, in order to provide a realistic game environment to players [142]–[144]. Foley sound effects are another example of recorded real-world sounds that are usually used in films. However, a Foley sound effect is a re-creation of a sound that models a realistic impression of an event by using various objects and skills, rather than using the original objects to reproduce the same sound [145]. Additionally, some vocal sounds are examples of non-speech sounds, such as the groans and screams of virtual characters in a game [146].
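As a small illustration of non-speech output produced by a digital sound synthesizer, the sketch below generates a short alarm beep as a WAV file using only the Python standard library; the frequency, duration, and file name are arbitrary choices.

```python
import math
import struct
import wave

RATE, FREQ, SECONDS = 44100, 880.0, 0.5  # sample rate, tone frequency, duration


def write_beep(path: str) -> None:
    """Synthesize a 16-bit mono sine tone and store it as a WAV file."""
    frames = bytearray()
    for n in range(int(RATE * SECONDS)):
        value = int(32767 * 0.5 * math.sin(2 * math.pi * FREQ * n / RATE))
        frames += struct.pack("<h", value)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(RATE)
        wav.writeframes(bytes(frames))


write_beep("alarm.wav")  # play the file with any audio player to hear the cue
```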

3.2.3 Haptics

The sense of touch is one of the methods by which the human body perceives stimulation. From the perspective of output modalities, the term haptics refers to any method that can provide an experience of touch by exerting force [147]. Foottit et al. [148] stated two categories for the types of feedback that mainly use haptics: (i) providing an experience of touching real-world objects, and (ii) producing a stimulus for conveying information that is not related to real-world forces. For example, representations of material texture [149], [150], weight [151], and the shape of objects [152] are examples of the first category, whereas a vibrotactile sensation for notification of specific events [153]–[155] is an example of the second category. More use cases of tactile interaction involving fingers and hands are documented in the work by Pacchierotti et al. [156].

3.2.4 Others

We have presented examples related to three of the senses that humans use to detect diverse stimuli: visual representation through the eyes, sound that can be heard by the ears, and haptics for tactile sensations. Several studies have aimed at the remaining two senses that a human uses to obtain information: gustation and olfaction. The tongue is the organ through which humans taste (gustation), and the nose is the organ humans use to smell (olfaction). Therefore, when an interaction system is able to provide gustatory and olfactory outputs to agents, there are more chances of developing novel interaction methods [157]. For example, Risso et al. [158] developed a portable device that can deliver odours by combining up to eight fragrances, whereas Ranasinghe and Do [159] created a device, the 'Digital Lollipop', that can produce four different tastes (i.e., sweet, sour, bitter, salty) by evoking electrical stimulation on an agent's tongue. Additionally, thermoception is another sense that humans use to recognise the temperature of an object or an environment. From the perspective of output modality, the sense of temperature for an agent can be achieved by utilising either a heating system or a thermoelectric cooler. There are two ways to perceive temperature: one from a device that is attached to the human body [160], and another from an ambient temperature that is controlled by a heating system or a device [161]. As a last type of output modality, there are several cases where the system controls physical objects or the environment as a reaction to the agent's input. For example, Jiang et al. [162] implemented a robotic arm control system that uses hand gestures and voice commands while the agent is sitting in a wheelchair, and Khan et al. [163] built a smart home management system that adjusts the light level of a room depending on the intensity of ambient light in order to reduce energy consumption.

4 Multimodal Interaction - System Modeling

4.1 Integration (Fusion)

Since the interaction system receives input data through various modalities, it requires a step to process the input data. This process is called 'integration' or 'fusion' [6], [9], [55], [164]. During this process, the input data are synchronized and/or combined with other data based on the models or algorithms that the system uses in order to produce an output. In this report, we use the three integration types that have been mentioned in several studies [6], [165]–[167]. Figure 6 depicts the architecture of the three integration types as proposed by Sharma, Pavlovic and Huang [167], who based their work on the original design of Hall and Llinas [165]. The three integration types are explored further in the following sections.

(a) Data level integration

(b) Feature level integration

(c) Decision level integration

Figure 6: The architecture of three integration types redesigned by Sharma, Pavlovic and Huang [167]; the original architecture was designed by Hall and Llinas [165]

4.1.1 Data level integration

Data level integration is a type of fusion that happens when raw data from multiple modalities are merged. The raw data must be based on observations of the same object delivered by the same type of modality, such as audio streams (i.e., sound) from a camera and a microphone [6] or images (i.e., visual signals) from multiple cameras [167]. As depicted in Figure 6a, data level integration is accomplished right after obtaining data from the input modalities. Thus, the collected data contain abundant information due to the absence of preprocessing, but this absence leads to potential data reliability issues such as vulnerability to noise and data loss on sensor failure [6], [166], [167]. After the integration, the data are processed to extract features that are used for making a final decision and forming a result. Data level integration has been used in cases such as pattern recognition [165] and biosignal processing [6]. However, it is not commonly used in multimodal interaction systems, since these typically use several different modalities. Thus, other multimodal interaction studies have employed the two integration types presented next [9], [11], [13], [168], [169].
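A minimal sketch of data level integration is shown below: two raw frames of the same scene (stand-ins for images from two cameras) are merged pixel-wise before any feature extraction; the synthetic arrays and the averaging rule are illustrative assumptions.

```python
import numpy as np


def data_level_fusion(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Merge two raw observations of the same object by pixel-wise averaging."""
    if frame_a.shape != frame_b.shape:
        raise ValueError("raw inputs must observe the same object in the same format")
    return (frame_a.astype(np.float32) + frame_b.astype(np.float32)) / 2.0


# Synthetic stand-ins for two grayscale camera frames of the same scene.
cam1 = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
cam2 = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
fused = data_level_fusion(cam1, cam2)  # feature extraction would follow this step
print(fused.shape, fused.dtype)
```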

4.1.2 Feature level integration

When the input data from each sensor provide extracted features, feature level integration can take place (Figure 6b). Feature level integration is also called 'early fusion' [9], [11]. In this integration process, features are obtained from the input data when the data go through data processing designed for each sensor [165], [167]. Features do not need to come from the same type of modality: features extracted from closely coupled and synchronized modalities (e.g., the sound of speech and images of lip movement) can be integrated in order to produce another feature [167], [168].

Due to the data processing that occurs before integration, feature level integration has a relatively high data loss rate compared to data level integration, while it is less vulnerable to noise [6], [167]. However, features from sensor data can result in a large quantity of data, which requires a high computational cost in order to obtain results [6], [166], [167], [169]. Feature level integration has been used, for example, for feature extraction from biosignals for affective computing [6]. Additionally, Dasarathy categorized two additional types of integration in a refined version of the data integration categorization, one of which is the 'data in-feature out' type [166]. In this integration type, raw data from input sensors are preprocessed before they are merged into one data set for feature extraction. Depending on the viewpoint, this integration type can be labelled as either 'data level integration' or 'feature level integration'. In other studies, this integration type is called 'intermediate fusion' (mid-level integration), and some probabilistic graphical models for pattern recognition are categorized into this type [9], [11].
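The following sketch illustrates feature level (early) fusion for the speech-and-lip-movement example above: each modality has its own simplistic feature extractor, and the resulting vectors are concatenated into one joint vector for a downstream classifier; the extractors are stand-ins, not methods from the cited works.

```python
import numpy as np


def audio_features(audio: np.ndarray) -> np.ndarray:
    """Toy speech features: frame energy and zero-crossing rate."""
    zcr = np.mean(np.abs(np.diff(np.sign(audio)))) / 2.0
    return np.array([np.mean(audio ** 2), zcr])


def lip_features(lip_points: np.ndarray) -> np.ndarray:
    """Toy lip features: mouth width and height from tracked (x, y) landmarks."""
    return np.array([np.ptp(lip_points[:, 0]), np.ptp(lip_points[:, 1])])


def feature_level_fusion(audio: np.ndarray, lips: np.ndarray) -> np.ndarray:
    """Concatenate per-modality feature vectors into one joint feature vector."""
    return np.concatenate([audio_features(audio), lip_features(lips)])


joint = feature_level_fusion(np.random.randn(400), np.random.rand(20, 2))
print(joint)  # a single vector that a downstream classifier would consume
```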

4.1.3 Decision level integration

Decision level integration is the last type; it takes a number of alternative decisions as inputs and produces a final decision as an output (Figure 6c). This integration type is also called 'late fusion' [9], [11] or 'semantic fusion' [168], [169]. In decision level integration, data derived from different modalities are processed independently until the respective decisions have been made. A final decision or interpretation regarding the agent's input is made when all the decisions are ready to be merged [9], [55]. Decision level integration is used in multimodal systems in order to integrate multiple modalities that are not tightly coupled but have complementary information [6], [169]. Due to the individual decision-making process for each modality, the computational cost at this stage is relatively lower than that of feature level integration [11], [167]. Dasarathy's categorization also has the 'feature in-decision out' type of integration, which can be labelled as either 'feature level integration' or 'decision level integration' [166]. In this integration type, the system classifies input features based on trained knowledge to obtain an output decision. Some pattern recognition systems have utilized this type of integration process [6], [9], [11].
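A minimal sketch of decision level (late) fusion is given below: each modality produces its own class probabilities independently, and the final decision is a weighted combination; the modalities, labels, and weights are illustrative assumptions.

```python
def fuse_decisions(per_modality: dict, weights: dict) -> str:
    """per_modality maps modality -> {label: probability}; return the fused label."""
    labels = {label for probs in per_modality.values() for label in probs}
    scores = {
        label: sum(weights[m] * probs.get(label, 0.0) for m, probs in per_modality.items())
        for label in labels
    }
    return max(scores, key=scores.get)


# Independent decisions from two loosely coupled modalities (assumed values).
decisions = {
    "speech":  {"open_door": 0.7, "close_door": 0.3},
    "gesture": {"open_door": 0.4, "close_door": 0.6},
}
print(fuse_decisions(decisions, weights={"speech": 0.6, "gesture": 0.4}))  # open_door
```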

4.2 Presentation (Fission)

After the integration of input data is completed, a decision or interpretation regarding the input data needs to be delivered to the agent through one or more modalities. This stage is called 'presentation' or 'fission'. The output from the multimodal interaction system becomes another stimulus that can affect the agent's behaviour [6]. Consequently, the outcome of the interaction depends on how the multimodal interaction system chooses to present the decision to the agent. Thus, the way in which output is presented to agents through modalities should be carefully considered when designing a multimodal interaction system. To mitigate this challenge, Foster [170] proposed three important tasks in the presentation process (content selection and structuring, modality selection, and output coordination), whereas Rousseau et al. [171] presented a model consisting of four questions (What-Which-How-Then) that must be taken into account in design processes for multimodal presentation: (i) What is the information to present? (ii) Which modalities should we use to present this information? (iii) How should the information be presented using these modalities? and (iv) Then, how should the evolution of the resulting presentation be handled? Some studies have employed these models to improve the efficiency of the presentation. For example, Grifoni [172] used Foster's three tasks for analyzing the features of the visual, auditory, and haptic modalities when these are used as output channels. Costa and Duarte [173] used the WWHT model to design an adaptive user interface based on user profiles, with a focus on older adults. Another example of adaptive systems was provided by Honold et al. [174], who designed an adaptive presentation system that provides visual information to agents based on their contexts.
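As a rough sketch of the presentation (fission) step, the code below selects output modalities for a message based on the agent's context, loosely following the What-Which-How questions of the WWHT model; the context flags and modality choices are assumptions for illustration only.

```python
def select_output(message: str, context: dict) -> list:
    """Return (modality, rendering) pairs suited to the agent's current context."""
    outputs = []
    if not context.get("screen_busy", False):      # Which: is the visual channel free?
        outputs.append(("visual", f"Show text: '{message}'"))
    if context.get("noisy_environment", False):    # How: prefer a non-audio cue in noise
        outputs.append(("haptic", "Short vibration pattern"))
    else:
        outputs.append(("speech", f"Say: '{message}'"))
    return outputs


context = {"screen_busy": False, "noisy_environment": True}
for modality, rendering in select_output("Door unlocked", context):
    print(modality, "->", rendering)
```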

5 Multimodal Interaction using Internet of Things & Augmented Reality

Before we dive into use cases of multimodal interaction with both AR and IoT, we present several studies that built multimodal interaction with AR or IoT individually. In order to identify the modalities that have been used in AR/IoT multimodal interaction systems, we analysed the studies based on the modalities that each multimodal interaction system used to provide its main functionality to the agent, which could be a human or another system.

5.1 Internet of Things

The data collected through IoT devices have been used by multimodal interaction systems to understand the agent's state, intent or context. As technology advances and more types of data are collected, systems become more capable of understanding and helping the agent. Furthermore, when a multimodal interaction system is combined with other technologies rather than relying on a single sensor, it allows the collection and analysis of large quantities of data from diverse sources for a better understanding of the agent. In this section, we analyze studies that developed multimodal interaction using IoT. The analyzed studies are categorized by the primary modality used. Some studies were excluded from in-depth analysis due to a lack of information regarding technical aspects, although they exemplify use cases of IoT. As an example, Farhan et al. [130] developed a Learning Management System (LMS) that uses attention scoring assessment. The students start the learning process through a video lecture shown on a computer, and the students are recorded by a webcam. The students' attention level is evaluated from the images taken in real time according to the location of the face and whether the eyes are closed or not. This assessment can be used by instructors to understand the status of their students and to improve the quality of their learning experiences. The system gathers some information (e.g., student location) from the students via an IoT infrastructure to improve the learning experience; however, the IoT infrastructure was not described in detail. We therefore exclude such studies with major gaps in technical details from our analysis. Additionally, it is important to acknowledge the existence of important challenges and knowledge on different research topics within IoT regardless of their use of multimodality. Therefore, we point to several survey papers that focus on specific issues and use cases, such as IoT systems in mobile phone computing [175], IoT for medical applications [176], [177], occupancy monitoring for smart buildings [178], data fusion methods for smart environments [179] and the brain-computer interface [6].

5.1.1 Visual signal

The visual signal is defined here as anything that can be captured as a set of images by an optical instrument. The captured images can be processed with computer vision technology to recognize the movements of a target (e.g., hand gestures, body gestures, eye gaze, facial expressions) or physical objects. This is a powerful modality that allows a system to perceive the visual richness of the surrounding world. As an example, Kubler et al. [180] used this approach when they developed and tested a smart parking manager using IoT. The user can reserve a parking spot through an online booking system, and when the user tries to enter the parking lot, the car's plate number is optically detected at the gateway. The interaction system manipulates the actuator to open the gate when it finds the matching number in the booking list. After the user enters the parking lot, the user's location is saved in the interaction system. In the case of an accident, the interaction system gives an emergency vehicle the right to access the parking lot based on the latest updated user locations.
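The gate-control logic described for the smart parking example can be sketched as follows; the plate numbers, the booking list, and the actuator call are hypothetical placeholders rather than details from the cited system.

```python
BOOKINGS = {"ABC123": "spot-04", "XYZ789": "spot-11"}  # hypothetical reservation list


def open_gate() -> None:
    print("actuator: gate opened")  # placeholder for the real actuator command


def handle_vehicle(detected_plate: str) -> None:
    """Open the gate only if the optically read plate matches a reservation."""
    spot = BOOKINGS.get(detected_plate.upper())
    if spot:
        open_gate()
        print(f"assigned parking spot: {spot}")
    else:
        print("access denied: no reservation found")


handle_vehicle("abc123")   # gate opens, spot-04 assigned
handle_vehicle("NOPLATE")  # access denied
```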

5.1.2 Sound

Sound is a wave signal that can be captured by a sensor device. Kranz et al. [181] used the sound modality when they developed a context-aware kitchen that consists of a microphone, a camera and several IoT sensors on a knife and a cutting board. The interaction system can recognize the type of ingredient being cut by analyzing the sound of the cut. While the microphone detects the type of ingredient, the three-axis force and three-axis torque sensors on the knife also collect data that are used to determine the type of food. Another use case of the sound modality is Saha et al. [182]'s air and sound pollution monitoring system, which uses a sound sensor and a gas sensor to detect air quality and sound intensity in a specific area. When the system detects abnormal levels of noise or air pollution, an alarm is raised in order to make agents take action. As an extension of such monitoring systems, cardiac sounds have been used to monitor an agent's health condition [183], [184]. When the monitoring system detects an abnormality in the cardiac sound stream, appropriate services can be offered to the agent [185].

5.1.3 Biosignal

The human body is a source of various biosignals which can be used for understanding the physical and psychological state of a person. This offers diverse ways of interacting with a system by using body signals as a medium, thereby promoting natural interaction. Biosignal sensors are attached to the human body to record data and send them to an interaction system. For instance, Wai et al. [186] used EEG, photoplethysmography (PPG) and an eye tracker to understand the user's neural, physical and physiological states. The EEG and PPG sensors were attached to a headband, and the sensors sent the recorded data to a cloud server for real-time processing and access.

Whereas the EEG sensors were used for recording the user's brainwave signals, the PPG sensors were used for recording the heart rate and heart rate variability. With these data, including also gaze data, the authors tried to detect a state of fatigue in a driving scenario, and their results show that the combination of the EEG and PPG modalities increases the detection accuracy compared to using only one modality.

5.1.4 Inertia & Location

An accelerometer is a sensor which can be used for determining the user's physical activity from the acceleration of their body parts. Most recent smartphones have an accelerometer, and many smart accident detectors have been implemented using the built-in accelerometers of smartphones [187]–[189]. However, in the case of smartphone-based accident detectors, the smartphone's built-in accelerometer may not sense the user's movement in the desired manner. To overcome this issue, Chandran et al. [190] developed a helmet that detects an accident and sends notifications for user safety. A three-axis accelerometer was attached to the helmet, and it collected data on the acceleration of the user's head. The data were sent to a cloud server to detect an accident by analysing rapid changes in acceleration. When the system detects an accident, it places a phone call and sends a message that the user is supposed to answer within a certain amount of time. If the system does not receive an answer from the user, it places an emergency call and sends a message to another number which the user has registered beforehand. In another example, He et al. [191] attached a gyroscope and an accelerometer to a vest to get more accurate results about the user's physical activity for fall detection. The sensors send data to a fall detection app on a smartphone, and when the app detects a fall, it notifies registered users through different modalities, such as a message, a phone call, an alarm, and vibration.

Location is another type of information that can be given as an input to an interaction system. Nazari Shirehjini and Semsar [192] created an app that shows information about controllable objects around the user on their smartphone depending on the room that the user has entered. The objects, which are identified through the IoT devices, are presented on the screen as 3D objects. The app presents the status of the objects, such as power and location, and the user can turn them on and off by tapping the screen with their fingers.
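The escalation logic described for the helmet-based accident detector can be sketched as follows; the notification functions, the 30-second response window, and the polling placeholder are illustrative assumptions, not details from the cited work.

```python
RESPONSE_WINDOW_S = 30  # assumed time the user has to confirm they are fine


def notify_user() -> None:
    print("calling and messaging the user...")  # placeholder notification


def notify_emergency_contact() -> None:
    print("calling and messaging the registered emergency number...")  # placeholder


def user_responded(timeout_s: int) -> bool:
    """Placeholder: a real system would wait up to timeout_s seconds for a reply."""
    return False


def on_impact_detected() -> None:
    notify_user()
    if not user_responded(RESPONSE_WINDOW_S):
        notify_emergency_contact()  # escalate only when the user does not answer


on_impact_detected()
```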

5.2 Augmented Reality According to our literature survey, all the studies that utilize AR in a multimodal interaction system use the visual signal as the main input modality due to the visualization of virtual content. One of the known advantages of AR is that it attracts the user's attention and motivates the user to focus on the content, as demonstrated in particular by a study on education [193]. Due to these advantages, AR has been used in different studies as a means of information visualization or as part of a user interface that provides interactable objects for controlling the system. In order to present AR content on the screen, the AR device needs to know the position at which the AR content will be placed in the real-world view. Visual markers are a commonly used method of providing positional anchoring to the AR device for displaying AR content. However, this method has the limitation that a marker must be present on the target object or place. Some researchers have combined AR with different computer vision techniques, such as object recognition, image recognition and object detection, to overcome this limitation of marker-based AR visualization. Bhargava et al. [139] used AR for text recognition and translation between English and Hindi, and vice versa. The mobile device displays the translated text on the screen and also speaks it out loud. In 2015, when Bhargava's team developed their application, Google Translate was able to translate text only from English to Hindi [60]. Multiple input and output modalities were applied in the study by Zhu et al. [194], who developed an AR mentoring system on an HMD that helps with inspection and repair tasks. When the user asks a question by voice, the mentor system responds by voice with answers drawn from stored data. The mentor system also uses a camera, an IMU, and object recognition to identify the target object and the direction of the user's gaze, and to present the tasks on the user's screen in AR. Al-Jabi and Sammaneh [195] used AR to present the information a user needs when parking their car. Their system uses deep learning to interpret the camera view while the user navigates the space. When the system recognizes the user's location from the camera view, an AR arrow appears on the screen to guide the user to the closest available parking slot. If the system cannot find any available parking slot, it visualizes the remaining parking time of the other cars on the roof of each car.
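To illustrate the anchoring step discussed above, the following Python sketch prefers marker-based anchoring and falls back to markerless pose estimation when no marker is found; the Anchor type and the detect_marker() and estimate_pose_from_features() stubs are hypothetical helpers, not part of any AR SDK used in the cited studies.

# Hypothetical anchoring step: use a visual marker when one is detected,
# otherwise fall back to feature-based (markerless) pose estimation.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Anchor:
    position: Tuple[float, float, float]  # position in the camera coordinate frame
    source: str                           # "marker" or "markerless"

def detect_marker(frame) -> Optional[Tuple[float, float, float]]:
    # Stub: a real detector would return the marker's estimated position,
    # or None when no marker is visible in the frame.
    return None

def estimate_pose_from_features(frame) -> Optional[Tuple[float, float, float]]:
    # Stub: a real estimator would derive a position from image features.
    return None

def anchor_content(frame) -> Optional[Anchor]:
    # Prefer the marker-based anchor; fall back to markerless estimation.
    pos = detect_marker(frame)
    if pos is not None:
        return Anchor(position=pos, source="marker")
    pos = estimate_pose_from_features(frame)
    if pos is not None:
        return Anchor(position=pos, source="markerless")
    return None  # tracking lost: no anchor for this frame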

5.3 IoT with AR Researchers have focused on establishing efficient interaction methods that use both AR and IoT to increase usability. This combination of the two technologies is sometimes called "ARIoT" [196]. According to our literature study, there are four types of interaction in ARIoT systems; Figure 7 describes each of them. The first two types concern interaction through AR interfaces (Figure 7 (a) and (b)). In these types, the agent can control a real-world object by manipulating an AR interface on a mobile device such as a smartphone, tablet PC or HMD. Thus, AR is used not only to present information but also to provide inputs for interaction. The difference between Figure 7 (a) and Figure 7 (b) is whether the marker or tracked object has an IoT sensor attached to it or not. Conversely, the other two types, depicted in Figure 7 (c) and (d), use AR only for displaying data on the screen. The agent can interact with the augmented objects, but this interaction only selects which data are presented. As with Figure 7 (a) and (b), the only difference between these two types is the presence or absence of an IoT sensor on the marker or object. However, in the case of

Figure 7 (c) and (d), the agent manipulates the real-world object directly, and AR is only used to display the data. Based on the four interaction types illustrated in Figure 7, we categorize them into two groups: AR for interface and AR for data presentation.

Figure 7: Interaction types in ARIoT. The four types form a 2×2 grid: the rows distinguish AR for interface ((a), (b)) from AR for data presentation ((c), (d)), and the columns distinguish marker on IoT ((a), (c)) from marker on object ((b), (d)).
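To make the taxonomy explicit, the short Python sketch below maps the two binary design choices onto the four interaction types of Figure 7; the function and argument names are our own illustrative choices.

# Illustrative mapping from the two design choices to the ARIoT interaction
# types (a)-(d) of Figure 7.
def ariot_interaction_type(ar_used_as_interface: bool, marker_on_iot: bool) -> str:
    # Rows of Figure 7: AR for interface vs. AR for data presentation.
    # Columns of Figure 7: marker on IoT vs. marker on object.
    if ar_used_as_interface:
        return "(a)" if marker_on_iot else "(b)"
    return "(c)" if marker_on_iot else "(d)"

# Example: AR only visualizes data and the marker sits on a plain object -> (d)
print(ariot_interaction_type(ar_used_as_interface=False, marker_on_iot=False))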

Table 1 presents a summary of ARIoT studies with information on input and output modalities, features of the implemented systems, and interaction types based on Figure 7. ‘Visual graphics’ in the output modality column denotes any form of computer-generated graphical object that can be displayed on the screen, such as an image, text, video, animation or 2D/3D object. Unless the system uses a fixed-point camera, as in [197], [198], ‘camera’ in Table 1 denotes a tool for the visual signal modality, since the agent is free to move it when searching for objects.

Table 1: Interaction type classification

Input modality (tool) | Output modality (tool) | Feature | Interaction type
Touch (finger), vision (camera), sensor value (IoT sensor: power state) | Visual graphics (screen), sensor command (IoT sensor: power control) | Manipulate the merchandise in a shop [10] | (a)
Touch (finger), vision (camera), sensor value (IoT sensor: microcontroller) | Visual graphics (screen), sensor command (IoT sensor: power control) | Present the status and manipulate real world objects [196] | (a)
Touch (finger), vision (camera), sensor value (IoT sensor: RFID) | Visual graphics (screen) | Check and order the books from smart shelf [199] | (a)
Touch (finger), vision (camera), sensor value (IoT sensor: power state) | Visual graphics (screen), sensor command (IoT sensor: power control) | In-house energy management application [200] | (b)
Touch (finger), vision (camera), sensor value (IoT sensor: temperature) | Visual graphics (screen) | Sensor data visualization [201] | (c)
Vision (camera), sensor value (IoT sensors: microcontroller, battery level, electrical connection, power consumption) | Visual graphics (screen) | Present the current status of an electrical panel [202] | (d)
Vision (camera), sensor value (IoT sensors: acceleration, O2, CO2, humidity, pressure, temperature) | Visual graphics (screen), sound (audio) | Task procedure visualization [203] | (c)
Click (mouse), sensor value (IoT sensors: soil moisture, temperature, water level) | Visual graphics (screen) | Farm manager that presents information about planted crops [197] | (d)
Vision (camera), sensor value (IoT sensor: location) | Visual graphics (screen) | Locate IoT devices by using azimuth and elevation angles between a wireless transmitter and a mobile device [204] | (c)
Vision (camera), sensor value (IoT sensor: air pollution) | Visual graphics (screen) | Markerless AR game for informing the agent about air pollution issues [205] | (b)
Click (mouse), vision (camera), sensor value (IoT sensors: temperature, humidity) | Visual graphics (screen) | Present information about monuments in AR [206] | (d)
Vision (camera), sensor value (IoT sensors: temperature, humidity, soil moisture) | Visual graphics (screen) | Measure the response time for getting data from a cloud server to an AR application [207] | (d)
Touch (finger), vision (camera), sensor value (IoT sensors: temperature, power state, location) | Visual graphics (screen), sensor command (IoT sensor: command) | Application that runs within a smart city with smart objects [208] | (b)
Vision (camera), sensor value (IoT sensors: temperature, dust, gas concentration, carbon monoxide) | Visual graphics (screen) | Visualize IoT sensor data in AR for miner safety [209] | (d)
Vision (camera), sensor value (IoT sensors: temperature, battery level) | Visual graphics (screen) | Visualize IoT sensor data related to Quality of Service in AR [210] | (c)
Vision (camera), sensor value (IoT sensor: power state) | Visual graphics (screen) | Web-based AR application for monitoring the power state of the IoT device [211] | (d)
Touch (finger), vision (camera), sensor value (IoT sensor: location) | Visual graphics (screen) | Public transport application that provides information about the state of a bus [212] | (d)
Touch (finger), vision (camera), sensor value (IoT sensors: temperature, CO2) | Visual graphics (screen) | Serious game to raise awareness on air pollution issues [213] | (b)
Sensor value (IoT sensors: light level, orientation, position), physical object (cube) | Visual graphics (screen) | Evaluation system for upper limb function by using a physical cube on an IoT board [198] | (c)
Touch (finger), vision (camera), sound (voice), sensor value (IoT sensor: power state) | Visual graphics (screen), sensor command (IoT sensor: power control) | Programming a smart environment by using AR bricks [214] | (b)
Touch (finger), vision (camera), sensor value (IoT sensors: temperature, power consumption) | Visual graphics (screen), sensor command (IoT sensor: power control) | Application that helps changing people's behaviour for better energy efficiency at schools [215] | (b)
Vision (camera), sensor value (IoT sensor: heartbeat rate) | Visual graphics (screen), sound (audio) | Visualize a virtual heart that pulsates with an actual heartbeat [216] | (d)
Vision (camera), gesture (hand), sensor value (IoT sensors: power state, air pressure, temperature, humidity) | Visual graphics (screen), sensor command (IoT sensor: command) | Application that can build reaction rules between IoT devices through an AR interface on an HMD [217] | (a)
Touch (finger), vision (camera), sensor value (IoT sensors: power voltage, gas, PM2.5, PM10, temperature, pressure, humidity) | Visual graphics (screen), sensor command (IoT sensor: power control) | Visualize energy consumption data of a smart plug [218] | (a)
Vision (camera), sensor value (IoT sensors: strain, voltage) | Visual graphics (screen) | Monitor the stress value of metal shelving based on strain gauges [219] | (c)
Vision (camera), gesture (hand) | Visual graphics (screen), sensor command (IoT sensors: sound, light) | Control IoT devices by hand gestures [220] | (a)
Touch (finger), vision (camera), sensor value (IoT sensors: battery, moisture) | Visual graphics (screen), sensor command (IoT sensor: robot movement) | Design robot actions through an AR interface [221] | (b)
Vision (camera), sensor value (IoT sensors: acceleration, distance) | Visual graphics (screen) | Guide the agent to a destination by using virtual text and virtual arrows [222] | (d)
Touch (finger), vision (camera), sensor value (IoT sensor: motor value) | Visual graphics (screen), sensor command (IoT sensor: motor control) | Control the LEGO centrifuge [223] | (a)
Vision (camera), sensor value (IoT sensors: temperature, humidity, light, noise, presence of movement data) | Visual graphics (screen) | Present real-time data from IoT sensors by using AR [224] | (d)
Vision (camera), sensor value (IoT sensors: distance, microcontroller) | Visual graphics (screen), sensor command (IoT sensor: command) | Spatial mapping of a smart object's location and display onto the screen based on the distance between the AR device and the smart object [225] | (a)
Touch (haptic device), vision (camera), sensor value (IoT sensor: city data), gesture (hand), sound (voice) | Visual graphics (screen) | Visualize a city model by AR onto HoloLens to navigate in the city data [226] | (b)

5.3.1 AR for user interaction An AR interface visualizing IoT data can be used to inform the agent about real-world objects. Pokrić et al. [205] developed an AR game which presents the air quality level collected from IoT sensors in real time. In order to raise awareness of air pollution, the game provides a feature where the user can guess and then check the actual air quality level of their city. The authors of [199] developed an app that assists people with motor disabilities in selecting desired books from shelves. Through the AR interface, the user can check the status of the listed books on shelves that have IoT sensors attached to each section. When the user selects a book by touching the screen, a real-world assistant brings the selected books from the shelves to the user. In addition, actuation affecting real-world objects can also be carried out using AR. Jo et al. [10], [196] used AR to present, on a mobile device, a virtual object that referred to a real-world object (e.g., a lamp). IoT was used to enable the manipulation of the real-world object by the user's finger touch on the AR object, such as turning the lamp on and off. However, the AR interface can be used not only for direct control of a real-world object but also for indirect control by programming the behaviours of each device depending on the situation and the relations between devices. Stefanidi et al. [214] developed a platform for programming the behaviours of in-house IoT products. The user can set up the interaction between the products and human activity by combining virtual bricks on an AR interface. For example, if the user programs an alarm to ring when the door opens, this will happen in the real world, and real-world behaviours (e.g., turning the alarm off) will conversely be reflected in the AR interface. As an example of utilizing an AR interface as a remote controller of IoT objects beyond a single room, Cho et al. [200] tested their energy management system in a larger space, a building; the system enables the control of real-world entities through an AR interface. Their system collected the overall status of energy consumption in a building and represented it on an AR map. In order to manage the energy consumption, the user can turn real-world entities on and off by pressing the corresponding buttons on the AR map with their finger.
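The trigger-action programming enabled by the AR-brick platform can be illustrated with the following minimal Python sketch; the rule structure, device names and dispatch logic are our own illustrative assumptions, not the implementation of Stefanidi et al. [214].

# Illustrative trigger-action rules of the kind composed through AR bricks.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    trigger_device: str            # device that emits the triggering event
    trigger_event: str             # e.g., "opened"
    action: Callable[[], None]     # actuation to perform when triggered

def ring_alarm():
    # Placeholder for an IoT actuator command.
    print("alarm: ringing")

rules: List[Rule] = [
    Rule(trigger_device="front_door", trigger_event="opened", action=ring_alarm),
]

def dispatch(device: str, event: str, rules: List[Rule]) -> None:
    # Run the action of every rule matching the incoming device event.
    for rule in rules:
        if rule.trigger_device == device and rule.trigger_event == event:
            rule.action()

# Example: the door sensor reports that the door was opened.
dispatch("front_door", "opened", rules)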

5.3.2 AR for interactive data representation Unlike the aforementioned cases, here AR is used only as a tool to visualize data; with interactive data representation, the agent can nonetheless select which data to visualize. Seitz et al. [201] combined AR with IoT devices attached to machines in industrial factories. The main purpose of AR was the visualization of the machines' states through interactive representation of the data from the IoT devices. Chaves-Diéguez et al. [202] used interactive data representation on an electrical panel. The authors of [203] developed an HMD system that uses AR to guide staff in carrying out maintenance tasks. In order to identify the object, detection techniques were used instead of markers. Finally, in order to achieve precision farming, Phupattanasilp and Tong [197] developed a farm manager that presents information about crops through AR. The crop-related information, such as soil moisture, temperature and water level, was gathered by IoT sensors.

6 Discussion

From the literature survey, we found that 36 studies have been conducted on multimodal interaction with AR and IoT in various domains, such as smart environments, energy management, assistance for people with special needs, tourism, mobile gaming, agriculture, shopping, and more. Despite this diversity, most of the studies used a unique framework and/or architecture during the development process, designed for each research purpose with different requirements on modality. Thus, only a few studies proposed a complete or partial framework and/or architecture that others could use to create multimodal interaction systems with AR and IoT. This rarity is due to the following challenges that previous research on multimodal interaction systems has identified.

Multidisciplinary Because multimodal interaction systems utilize various modalities in many ways depending on their purpose, each modality requires an expert to implement the respective part of the system with the intended level of usability and performance [201], [214]. Otherwise, developers must possess multidisciplinary knowledge across diverse fields to achieve a certain degree of usability and performance in the multimodal interaction system [9], [11]. This requirement increases the difficulty of system development and becomes an obstacle to the establishment of a general framework/architecture for multimodal interaction systems.

Reusability According to our literature survey, most of the proposed multimodal interaction systems were developed for a specific purpose and only evaluated within dedicated testbeds rather than implemented and evaluated in practical environments. In these systems, it is necessary to configure all devices beforehand in order to provide the intended services. Different multimodal interaction systems may have varying requirements regarding the types of devices and interaction modalities, and new device types may need to be added later; however, the previous systems are not able to provide a reusable framework/architecture to cater for these requirements. Additionally, a multimodal interaction system should be evaluated in a practical environment, with preferably different use cases, in order to verify its user experience [200], [201], [206], [211]. Most of the reviewed systems failed to do this because they were only evaluated in testbeds.

Technology Some features are not served effectively due to the limits of current technology. For example, a voice recognition system can be used effectively by English speakers, while non-English speakers might have issues when using it. This can be solved by using another voice recognition system designed for a specific language; however, the use of a dedicated language recognition system may lead to a lack of reusability. Another example is sensors for the biosignal modality. For instance, a BCI delivers brainwaves as input signals through EEG; however, EEG has some issues regarding ease of use and data reliability [6], [227]. An EEG device requires many setup steps, and several precautions must be taken while wearing it to ensure reliable data collection. As a further aspect of this challenge, AR technology requires improvement of its target recognition and tracking features. In marker-based tracking, when the camera cannot read enough information about the target (e.g., a marker) while it is moving rapidly, the AR system fails to track the target on which the AR content should be placed. As an example, Cao et al. [221] encountered this challenge and planned, as future work, to employ a new protocol in their system for recovering from lost-tracking situations. In markerless tracking, when the multimodal interaction system analyzes a camera image to estimate the location of the target, there is a high chance of incorrect coordinates due to the inability to read precise z-axis values [197].

Scalability The establishment of a reusable framework and/or architecture capable of achieving high scalability is one of the most significant challenges for multimodal interaction system developers. The reason for this difficulty is that a multimodal interaction system needs to cover numerous types of input/output modalities and IoT devices [167], [208], as well as cater for increased performance requirements [207], [228]. In other words, this challenge could be solved if the multimodal interaction system can be easily extended to manage current and future input/output modalities and IoT devices efficiently. Several partial solutions to the scalability challenge have been proposed, such as distributed computing that reduces the burden on a central server by using metadata [196], managing the number of connected devices based on agent profiles or the distance between agents and devices [208], and a hierarchical cloud service that reduces the response time for delivering IoT sensor data from a server to an AR application for display [207]. However, most of these solutions were only proposed and partially verified in testbeds rather than in real environments.

Security Assuring a proper level of security between agents and a multimodal interaction system is another challenge. Due to the diverse connections between various agents and a number of devices through multiple modalities, any violation

of privacy and security would have a huge impact on the credibility of the multimodal interaction system. Thus, it is important to provide reliable security to every connected agent and device of multimodal interaction systems.

New modalities The development of new technologies and devices that are feasible to adopt in a multimodal interaction system increases the chance of discovering new modalities [229]. For example, further improvement of foldable screens on mobile devices might bring a new type of modality to multimodal interaction, such as a screen attachable to human skin or interactive clothing made of foldable screens. Not only the emergence of new technology but also improvements to existing technology, enabling more complicated and precise tasks at performance levels sufficient for multimodal interaction systems, could introduce new modalities.

Interface design In order to provide a sufficient level of usability to agents while managing diverse input/output modalities and various types of IoT devices, a multimodal interaction system needs a properly designed interface. The usability requirements depend on several factors, such as the purpose of the multimodal interaction system and the available modalities [217], [225]. In particular, the design of the interface is affected not only by what is implementable but also by the performance level of each modality; developers should therefore consider the performance level of each available modality before merging them into an interface [11], [167]. Moreover, the multiple modalities used in an interface should not simply be combined without considering an agent's cognitive aspects. Since there is no evidence that an agent, especially a human, processes incoming information separately for each modality, all modalities should be integrated in a complementary manner to provide fully contextualized information [9], [149]. This is an important issue to consider also when the agent is a machine: human-robot interaction research is a field that focuses on this cognitive aspect when building a robot's system, in order to understand human actions and to make the robot behave like a human [230]. Another challenge in designing an interface is flexibility in modality selection and combination [11]. Since a multimodal interaction system may serve various types of agents, an interface benefits from high flexibility that allows different combinations of modalities depending on an agent's preference, while the selected combination should be provided in a natural manner with a proper level of intuitiveness [167], [208].

Adaptivity of Multimodal Interface Design As an extension of the aforementioned interface design challenge, consideration of cognitive aspects and flexibility in modality selection could enable adaptivity of an interface according to an agent's affective/cognitive state [9], [172]. The development of adaptive interfaces is one of the demanding challenges; meeting it could realize a personalized experience for agents by providing customized interaction modalities depending on the situation [167], [229], [231]. The situation could, for example, comprise an agent's preferences or the type of information to be presented to the agent. Additionally, regarding interface adaptivity, there is a challenge in producing personalized rather than uniform feedback for agents in order to achieve better usability [214].

Usability of Multimodal Interaction Interface The last challenge is related to the properties of multimodality. One of the key reasons for adding multiple modalities to an interaction system is to improve usability compared to a unimodal interaction system. However, multimodality might not always provide better usability. Instead, multimodality could provide a flexible interface whereby an agent can choose different modalities depending on their needs [232]. This lack of certainty about usability improvement could be caused by missing knowledge of the relationships between agents' cognitive processes and each modality [9], [11]. Therefore, in order to resolve this challenge, an in-depth study on usability analysis across a number of multimodal interaction systems with heterogeneous modalities would be needed. From our findings, we identified research questions that have the potential to resolve the known challenges. These questions are listed in Table 2 together with their respective challenges.

Table 2: Challenges and Research questions

Challenge | Research question
Multidisciplinary | Why has the combination of IoT and AR been relatively less employed than the individual use of each technology in multimodal interaction systems?
Multidisciplinary | How to facilitate the use of IoT and AR in multimodal interaction systems?
New modalities | What modalities are missing from the four identified modality categories that are feasible to use but not implemented yet?
Reusability | What design of framework/architecture could be applied to a multimodal interaction system that uses any combination of the four modalities (visual signal, audio, inertia, and location) with IoT and AR in order to improve reusability?
Scalability | What design could be applied to increase scalability of a multimodal interaction system that uses any combination of the four modalities (visual signal, audio, inertia, and location) with AR and IoT, in terms of both device management (modalities, agents) and system performance?
Reusability | How can the reusability of a multimodal interaction system framework/architecture be verified?
Interface design | What design of framework/architecture could be used for applying newly discovered modalities together with existing modalities?
Adaptivity of Multimodal Interface Design | In the design of an adaptive/personalized interface for a multimodal interaction system, what aspects (agent-centric vs information-centric) are important for selecting input/output modalities in a given context?
Adaptivity of Multimodal Interface Design | How can the adaptivity of a designed multimodal interaction system framework/architecture be verified in terms of usability (satisfaction, efficiency, effectiveness)?
Usability of Multimodal Interaction Interface | How to ensure usability of an increasingly complex multimodal interaction environment?

Since it is not easy to create a general architecture/framework that covers every possible modality in multimodal interaction with AR and IoT, we propose instead a reusable architecture/framework that uses combinations of some of the four modalities (visual signal, audio, inertia, and location). The reason for choosing these four modalities is that they are the ones commonly used in multimodal interaction systems that utilize AR and IoT. AR and IoT are usually implemented using visual signals (e.g., markers, image/object recognition) and location (e.g., geolocation delivered through a wireless network) as input modalities, whereas visual representation (e.g., on a screen), sound (e.g., audio) and haptic feedback (e.g., vibration) are delivered to agents as output modalities. The main objective of this research is to establish a reusable framework/architecture that can be applied during the development process of multimodal interaction systems with IoT and AR. This framework/architecture would be verified through a number of practical development projects outside of testbed environments.

7 Conclusion

In this report, we described the state-of-the-art knowledge regarding multimodal interaction systems that utilize IoT and AR. We classified available input and output modalities based on their types. These modalities could convey information between agents and a multimodal interaction system. We listed modalities that are not only already used in multimodal interaction systems but also used in unimodal interaction systems that have the potential to be used in multimodal interaction systems. Since multimodal interaction systems utilize various modalities, there should be a process for handling the incoming or outgoing data between agents and a multimodal interaction system. We summarized the proposed theories and ideas regarding the architecture of data management in multimodal interaction systems into two categories: the integration process that merges the data sent by agents or acquired from other sources, and the presentation process that delivers the outcome of interaction to an agent through multiple modalities. We also investigated how IoT and AR are used in multimodal interaction systems, respectively. After that, we searched multimodal interaction system studies that utilized IoT and AR and thereby enabled new forms of multimodal interaction. From the results of this literature survey, we found four interaction types that were formed by the combination of IoT and AR:

1. AR for interface and marker on IoT, Figure 7 (a) 2. AR for interface and marker on object, Figure 7 (b)

3. AR for data presentation and marker on IoT, Figure 7 (c) 4. AR for data presentation and marker on object, Figure 7 (d)

Moreover, we listed the input and output modalities that have been used in multimodal interaction systems with IoT and AR in other studies to identify what kinds of modalities are employed for enabling interaction between agents and a multimodal interaction system. In conclusion, we have identified a number of remaining challenges in the research area of multimodal interaction systems with IoT and AR. The employment of multiple modalities requires knowledge from a variety of fields in order to combine the selected modalities into an interface that has a sufficient level of usability for a wide range of agents. To overcome the obstacles in the development of multimodal interaction systems with IoT and AR, some researchers have built frameworks/architectures for multimodal interaction systems. However, research on this subject is still in its infancy due to several challenges that made previously proposed frameworks/architectures case-specific and thereby hard to reuse in other situations. Examples of these development challenges are the diversity of available modalities, which raises the scalability issue, and the requirement for situation-aware heterogeneous interfaces with appropriate levels of usability. Furthermore, some researchers have stated the importance of a personalized experience for agents while they are using a multimodal interaction system. The adaptivity of an interface, enabling the provision of customized interaction modalities depending on the agent's situation, is one of the potential solutions for achieving personalized experiences. Making the system aware of the cognitive state of an agent is one of the elements that can realize adaptive multimodal interfaces; developers should therefore take this into account in order to build a multimodal interaction system with appropriate levels of personalization and usability. Based on the findings of the state-of-the-art research documented in this report, we foresee that important contributions to the research and development of multimodal interaction with IoT and AR will be (i) to design novel multimodal interaction with good usability, and (ii) to create a reusable framework/architecture for multimodal interaction systems that use IoT and AR. We believe that this will be a valuable contribution to the field of multimodal interaction research and development, which is still in its infancy.

References

[1] L. Liu and M. T. Özsu, Eds., Encyclopedia of database systems, ser. Springer reference. New York: Springer, 2009. [2] S. K. Card, T. P. Moran, and A. Newell, The psychology of human-computer interaction. 2018. [3] R. A. Bolt, “’’Put-that-there”: Voice and Gesture at the Graphics Interface”, in Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH ’80, New York, NY, USA: ACM, 1980. [4] C. Zimmer, M. Bertram, F. Büntig, D. Drochtert, and C. Geiger, “Mobile augmented reality illustrations that entertain and inform: Design and implementation issues with the hololens”, in SIGGRAPH Asia 2017 Mobile Graphics & Interactive Applications on - SA ’17, Bangkok, Thailand: ACM Press, 2017. [5] S. S. Rautaray and A. Agrawal, “Vision based hand gesture recognition for human computer interaction: A survey”, Artificial Intelligence Review, vol. 43, no. 1, Jan. 2015. [6] H. Gürkök and A. Nijholt, “Brain–Computer Interfaces for Multimodal Interaction: A Survey and Principles”, International Journal of Human-Computer Interaction, vol. 28, no. 5, May 2012. [7] M. Lee, M. Billinghurst, W. Baek, R. Green, and W. Woo, “A usability study of multimodal input in an augmented reality environment”, Virtual Reality, vol. 17, no. 4, Nov. 2013. [8] M.-A. Nguyen, “Designing Smart Interactions for Smart Objects”, in Human Computer Interaction in the Internet of Things Era, University of Munich Department of Computer Science Media Informatics Group, 2015. [9] A. Jaimes and N. Sebe, “Multimodal human–computer interaction: A survey”, Computer Vision and Image Understanding, vol. 108, no. 1-2, Oct. 2007. [10] D. Jo and G. J. Kim, “IoT + AR: Pervasive and augmented environments for “Digi-log” shopping experience”, Human-centric Computing and Information Sciences, vol. 9, no. 1, Dec. 2019. [11] M. Turk, “Multimodal interaction: A review”, Pattern Recognition Letters, vol. 36, Jan. 2014. [12] ISO 9241-11:2018(en), Ergonomics of human-system interaction — Part 11: Usability: Definitions and concepts. [Online]. Available: https://www.iso.org/obp/ui/#iso:std:iso:9241:-11:ed-2:v1:en. [13] S. S. M. Nizam, R. Z. Abidin, N. C. Hashim, M. Chun, H. Arshad, and N. A. A. Majid, “A Review of Multimodal Interaction Technique in Augmented Reality Environment”, [14] M. Kefi, T. N. Hoang, P. Richard, and E. Verhulst, “An evaluation of multimodal interaction techniques for 3d layout constraint solver in a desktop-based virtual environment”, Virtual Reality, vol. 22, no. 4, Nov. 2018. [15] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, “Internet of Things (IoT): A vision, architectural elements, and future directions”, Future Generation Computer Systems, vol. 29, no. 7, Sep. 2013. [16] ITU, ITU-T Y.4000/Y.2060, Jun. 2012. [Online]. Available: http://handle.itu.int/11.1002/1000/11559. [17] M. Paul and F. Kishino, “A taxonomy of mixed reality visual displays”, vol. 77, no. 12, 1994. [18] Just Dance 2019. [Online]. Available: https://www.ubisoft.com/en-us/game/just-dance-2019. [19] S. Piana, A. Staglianò, F. Odone, and A. Camurri, “Adaptive Body Gesture Representation for Automatic Emotion Recognition”, ACM Transactions on Interactive Intelligent Systems, vol. 6, no. 1, Mar. 2016. [20] J. E. Pompeu, C. Torriani-Pasin, F. Doná, F. F. Ganança, K. G. da Silva, and H. B. 
Ferraz, “Effect of Kinect games on postural control of patients with Parkinson’s disease”, in Proceedings of the 3rd 2015 Workshop on ICTs for improving Patients Rehabilitation Research Techniques - REHAB ’15, Lisbon, Portugal: ACM Press, 2015. [21] N. Vernadakis, V. Derri, E. Tsitskari, and P. Antoniou, “The effect of Xbox Kinect intervention on balance ability for previously injured young competitive male athletes: A preliminary study”, Physical Therapy in Sport, vol. 15, no. 3, Aug. 2014. [22] H. Kayama, K. Okamoto, S. Nishiguchi, M. Yamada, T. Kuroda, and T. Aoyama, “Effect of a Kinect-Based Exercise Game on Improving Executive Cognitive Performance in Community-Dwelling Elderly: Case Control Study”, Journal of Medical Internet Research, vol. 16, no. 2, Feb. 2014. [23] T. Mizumoto, A. Fornaser, H. Suwa, K. Yasumoto, and M. De Cecco, “Kinect-Based Micro-Behavior Sensing System for Learning the Smart Assistance with Human Subjects Inside Their Homes”, in 2018 Workshop on Metrology for Industry 4.0 and IoT, Brescia: IEEE, Apr. 2018. [24] S. Deng, N. Jiang, J. Chang, S. Guo, and J. J. Zhang, “Understanding the impact of multimodal interaction using gaze informed mid-air gesture control in 3d virtual objects manipulation”, International Journal of Human-Computer Studies, vol. 105, Sep. 2017. [25] N. Jain, S. Kumar, A. Kumar, P. Shamsolmoali, and M. Zareapoor, “Hybrid deep neural networks for face emotion recognition”, Pattern Recognition Letters, vol. 115, Nov. 2018.

23 [26] M. Galterio, S. Shavit, and T. Hayajneh, “A Review of Facial Biometrics Security for Smart Devices”, Computers, vol. 7, no. 3, Jun. 2018. [27] P. B. Balla and K. T. Jadhao, “IoT Based Facial Recognition Security System”, in 2018 International Conference on Smart City and Emerging Technology (ICSCET), Mumbai: IEEE, Jan. 2018. [28] L. A. Elrefaei, D. H. Hamid, A. A. Bayazed, S. S. Bushnak, and S. Y. Maasher, “Developing Iris Recognition System for Smartphone Security”, Multimedia Tools and Applications, vol. 77, no. 12, Jun. 2018. [29] Universitas Brawijaya, G. Pangestu, F. Utaminingrum, Universitas Brawijaya, F. Bachtiar, and Universitas Brawijaya, “Eye State Recognition Using Multiple Methods for Applied to Control Smart Wheelchair”, International Journal of Intelligent Engineering and Systems, vol. 12, no. 1, Feb. 2019. [30] F. Koochaki and L. Najafizadeh, “Predicting Intention Through Eye Gaze Patterns”, in 2018 IEEE Biomedical Circuits and Systems Conference (BioCAS), Cleveland, OH: IEEE, Oct. 2018. [31] R. C. Luo, B.-H. Shih, and T.-W. Lin, “Real time human motion imitation of anthropomorphic dual arm robot based on Cartesian impedance control”, in 2013 IEEE International Symposium on Robotic and Sensors Environments (ROSE), Washington, DC, USA: IEEE, Oct. 2013. [32] K. Qian, J. Niu, and H. Yang, “Developing a Gesture Based Remote Human-Robot Interaction System Using Kinect”, International Journal of Smart Home, vol. 7, no. 4, 2013. [33] J. P. Wachs, M. Kölsch, H. Stern, and Y. Edan, “Vision-based hand-gesture applications”, Communications of the ACM, vol. 54, no. 2, Feb. 2011. [34] M. Kaâniche, “Gesture recognition from video sequences”, [35] J. R. Pansare, S. H. Gawande, and M. Ingle, “Real-Time Static Hand Gesture Recognition for American Sign Language (ASL) in Complex Background”, Journal of Signal and Information Processing, vol. 03, no. 03, 2012. [36] J. Wachs, H. Stern, Y. Edan, M. Gillam, C. Feied, M. Smithd, and J. Handler, “Real-Time Hand Gesture Interface for Browsing Medical Images”, International Journal of Intelligent Computing in Medical Sciences & Image Processing, vol. 2, no. 1, Jan. 2008. [37] Y. A. Yusoff, A. H. Basori, and F. Mohamed, “Interactive Hand and Arm Gesture Control for 2d Medical Image and 3d Volumetric Medical Visualization”, Procedia - Social and Behavioral Sciences, vol. 97, Nov. 2013. [38] Zelun Zhang and S. Poslad, “Improved Use of Foot Force Sensors and Mobile Phone GPS for Mobility Activity Recognition”, IEEE Sensors Journal, vol. 14, no. 12, Dec. 2014. [39] J. Scott, D. Dearman, K. Yatani, and K. N. Truong, “Sensing foot gestures from the pocket”, in Proceedings of the 23nd annual ACM symposium on User interface and technology - UIST ’10, New York, New York, USA: ACM Press, 2010. [40] Department of computer science University of Thi-Qar, Thi-Qar, Iraq, K. M.Hashem, and F. Ghali, “Human Identification Using Foot Features”, International Journal of Engineering and Manufacturing, vol. 6, no. 4, Jul. 2016. [41] Z. Lv, A. Halawani, S. Feng, H. Li, and S. U. Réhman, “Multimodal Hand and Foot Gesture Interaction for Handheld Devices”, ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 11, no. 1s, Oct. 2014. [42] F. Tafazzoli and R. Safabakhsh, “Model-based human gait recognition using leg and arm movements”, Engineering Applications of Artificial Intelligence, vol. 23, no. 8, Dec. 2010. 
[43] Chin-Chun Chang and Wen-Hsiang Tsai, “Vision-based tracking and interpretation of human leg movement for virtual reality applications”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 1, Jan. 2001. [44] S. Nakaoka, A. Nakazawa, K. Yokoi, and K. Ikeuchi, “Leg motion primitives for a dancing humanoid robot”, in IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA ’04. 2004, New Orleans, LA, USA: IEEE, 2004. [45] R. Puri and V. Jain, “Barcode Detection using OpenCV-Python”, vol. 4, no. 1, [46] J. Gao, V. Kulkarni, H. Ranavat, L. Chang, and H. Mei, “A 2d Barcode-Based Mobile Payment System”, in 2009 Third International Conference on Multimedia and Ubiquitous Engineering, Qingdao, China: IEEE, Jun. 2009. [47] S. L. Fong, D. C. W. Yung, F. Y. H. Ahmed, and A. Jamal, “Smart City Bus Application with Quick Response (QR) Code Payment”, in Proceedings of the 2019 8th International Conference on Software and Computer Applications - ICSCA ’19, Penang, Malaysia: ACM Press, 2019. [48] Z. Ayop, C. Yee, S. Anawar, E. Hamid, and M. Syahrul, “Location-aware Event Attendance System using QR Code and GPS Technology”, International Journal of Advanced Computer Science and Applications, vol. 9, no. 9, 2018.

24 [49] S. Ćuković, M. Gattullo, F. Pankratz, G. Devedžić, E. Carrabba, and K. Baizid, “Marker Based vs. Natural Feature Tracking Augmented Reality Visualization of the 3d Foot Phantom”, [50] P. Q. Brito and J. Stoyanova, “Marker versus Markerless Augmented Reality. Which Has More Impact on Users?”, International Journal of Human–Computer Interaction, vol. 34, no. 9, Sep. 2018. [51] T. Frantz, B. Jansen, J. Duerinck, and J. Vandemeulebroucke, “Augmenting Microsoft’s HoloLens with vuforia tracking for neuronavigation”, Healthcare Technology Letters, vol. 5, no. 5, Oct. 2018. [52] S. Blanco-Pons, B. Carrión-Ruiz, M. Duong, J. Chartrand, S. Fai, and J. L. Lerma, “Augmented Reality Markerless Multi-Image Outdoor Tracking System for the Historical Buildings on Parliament Hill”, Sustainability, vol. 11, no. 16, Aug. 2019. [53] Z. Balint, B. Kiss, B. Magyari, and K. Simon, “Augmented reality and image recognition based framework for treasure hunt games”, in 2012 IEEE 10th Jubilee International Symposium on Intelligent Systems and Informatics, Subotica, Serbia: IEEE, Sep. 2012. [54] S. H. Kasaei, S. M. Kasaei, and S. A. Kasaei, “New Morphology-Based Method for RobustIranian Car Plate Detection and Recognition”, International Journal of Computer Theory and Engineering, 2010. [55] N. Elouali, J. Rouillard, X. Le Pallec, and J.-C. Tarby, “Multimodal interaction: A survey from model driven engineering and mobile perspectives”, Journal on Multimodal User Interfaces, vol. 7, no. 4, Dec. 2013. [56] Cognitive Speech Services | Microsoft Azure. [Online]. Available: https://azure.microsoft.com/en-us/services/cognitive-services/speech-services/. [57] Speech | Apple Developer Documentation. [Online]. Available: https://developer.apple.com/documentation/speech. [58] Cortana - Your personal productivity assistant. [Online]. Available: https://www.microsoft.com/en-us/cortana. [59] Siri - Apple. [Online]. Available: https://www.apple.com/siri/. [60] Google Assistant is now available on Android and iPhone mobiles. [Online]. Available: https://assistant.google.com/platforms/phones/. [61] M. F. Alghifari, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi, and Z. Janin, “On the use of voice activity detection in speech emotion recognition”, vol. 8, no. 4, 2019. [62] A. Lassalle, D. Pigat, H. O’Reilly, S. Berggen, S. Fridenson-Hayo, S. Tal, S. Elfström, A. Råde, O. Golan, S. Bölte, S. Baron-Cohen, and D. Lundqvist, “The EU-Emotion Voice Database”, Behavior Research Methods, vol. 51, no. 2, Apr. 2019. [63] A. Mesaros, A. Diment, B. Elizalde, T. Heittola, E. Vincent, B. Raj, and T. Virtanen, “Sound Event Detection in the DCASE 2017 Challenge”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 6, Jun. 2019. [64] S. Boutamine, D. Istrate, J. Boudy, and H. Tannous, “Smart Sound Sensor to Detect the Number of People in a Room”, in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany: IEEE, Jul. 2019. [65] J. K. Roy, T. S. Roy, and S. C. Mukhopadhyay, “Heart Sound: Detection and Analytical Approach Towards Diseases”, in Modern Sensing Technologies, S. C. Mukhopadhyay, K. P. Jayasundera, and O. A. Postolache, Eds., vol. 29, Cham: Springer International Publishing, 2019. [66] J. N. Demos, Getting started with neurofeedback, 1st ed. New York: W.W. Norton, 2005. [67] D. P. Subha, P. K. Joseph, R. Acharya U, and C. M. Lim, “EEG Signal Analysis: A Survey”, Journal of Medical Systems, vol. 34, no. 2, Apr. 2010. [68] M. R. Lakshmi, V. P. 
T, and C. P. V, “Survey on EEG signal processing methods”, International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, no. 1, 2014. [69] J. K. Zao, T.-P. Jung, H.-M. Chang, T.-T. Gan, Y.-T. Wang, Y.-P. Lin, W.-H. Liu, G.-Y. Zheng, C.-K. Lin, C.-H. Lin, Y.-Y. Chien, F.-C. Lin, Y.-P. Huang, S. J. Rodríguez Méndez, and F. A. Medeiros, “Augmenting VR/AR Applications with EEG/EOG Monitoring and Oculo-Vestibular Recoupling”, in Foundations of Augmented Cognition: Neuroergonomics and Operational Neuroscience, D. D. Schmorrow and C. M. Fidopiastis, Eds., vol. 9743, Cham: Springer International Publishing, 2016. [70] P. Gang, J. Hui, S. Stirenko, Y. Gordienko, T. Shemsedinov, O. Alienin, Y. Kochura, N. Gordienko, A. Rojbi, J. R. López Benito, and E. Artetxe González, “User-Driven Intelligent Interface on the Basis of Multimodal Augmented Reality and Brain-Computer Interaction for People with Functional Disabilities”, in Advances in Information and Communication Networks, K. Arai, S. Kapoor, and R. Bhatia, Eds., vol. 886, Cham: Springer International Publishing, 2019. [71] P. Pelegris, K. Banitsas, T. Orbach, and K. Marias, “A novel method to detect Heart Beat Rate using a mobile phone”, in 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, Buenos Aires: IEEE, Aug. 2010.

25 [72] S. L. Fernandes, V. P. Gurupur, N. R. Sunder, N. Arunkumar, and S. Kadry, “A novel nonintrusive decision support approach for heart rate measurement”, Pattern Recognition Letters, Jul. 2017. [73] Z. Yang, Q. Zhou, L. Lei, K. Zheng, and W. Xiang, “An IoT-cloud Based Wearable ECG Monitoring System for Smart Healthcare”, Journal of Medical Systems, vol. 40, no. 12, Dec. 2016. [74] P. Leijdekkers and V. Gay, “Personal Heart Monitoring System Using Smart Phones To Detect Life Threatening Arrhythmias”, in 19th IEEE Symposium on Computer-Based Medical Systems (CBMS’06), Salt Lake City, UT: IEEE, 2006. [75] J. Choi and R. Gutierrez-Osuna, “Using Heart Rate Monitors to Detect Mental Stress”, in 2009 Sixth International Workshop on Wearable and Implantable Body Sensor Networks, Berkeley, CA: IEEE, Jun. 2009. [76] V. Nenonen, A. Lindblad, V. Häkkinen, T. Laitinen, M. Jouhtio, and P. Hämäläinen, “Using heart rate to control an interactive game”, in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems - CHI ’07, San Jose, California, USA: ACM Press, 2007. [77] R. Magielse and P. Markopoulos, “HeartBeat: An outdoor for children”, in Proceedings of the 27th international conference on Human factors in computing systems - CHI 09, Boston, MA, USA: ACM Press, 2009. [78] A. Tanaka, “Embodied Musical Interaction: Body Physiology, Cross Modality, and Sonic Experience”, in New Directions in Music and Human-Computer Interaction, S. Holland, T. Mudd, K. Wilkie-McKenna, A. McPherson, and M. M. Wanderley, Eds., Cham: Springer International Publishing, 2019. [79] R. Ahsan, M. I. Ibrahimy, and O. O. Khalifa, “EMG Signal Classification for Human Computer Interaction: A Review”, 2009. [80] C. Vidal, A. Philominraj, and C. del, “A DSP Practical Application: Working on ECG Signal”, in Applications of Digital Signal Processing, C. Cuadrado-Laborde, Ed., InTech, Nov. 2011. [81] S. Bitzer and P. van der Smagt, “Learning EMG control of a robotic hand: Towards active prostheses”, in Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., Orlando, FL, USA: IEEE, 2006. [82] S. Gorzkowski and G. Sarwas, “Exploitation of EMG Signals for Video Game Control”, in 2019 20th International Carpathian Control Conference (ICCC), Krakow-Wieliczka, Poland: IEEE, May 2019. [83] S.-C. Liao, F.-G. Wu, and S.-H. Feng, “Playing games with your mouth: Improving gaming experience with EMG supportive input device”, in International Association of Societies of Design Research Conference 2019, 2019. [84] M. Ghassemi, K. Triandafilou, A. Barry, M. E. Stoykov, E. Roth, F. A. Mussa-Ivaldi, D. G. Kamper, and R. Ranganathan, “Development of an EMG-Controlled Serious Game for Rehabilitation”, IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 27, no. 2, Feb. 2019. [85] M. Lyu, W.-H. Chen, X. Ding, J. Wang, Z. Pei, and B. Zhang, “Development of an EMG-Controlled Knee Exoskeleton to Assist Home Rehabilitation in a Game Context”, Frontiers in Neurorobotics, vol. 13, Aug. 2019. [86] R. Hinrichs, S. J. H. van Rooij, V. Michopoulos, K. Schultebraucks, S. Winters, J. Maples-Keller, A. O. Rothbaum, J. S. Stevens, I. Galatzer-Levy, B. O. Rothbaum, K. J. Ressler, and T. Jovanovic, “Increased Skin Conductance Response in the Immediate Aftermath of Trauma Predicts PTSD Risk”, Chronic Stress, vol. 3, Jan. 2019. [87] G. I. Christopoulos, M. A. Uy, and W. J. 
Yap, “The Body and the Brain: Measuring Skin Conductance Responses to Understand the Emotional Experience”, Organizational Research Methods, vol. 22, no. 1, Jan. 2019. [88] S. Sakurazawa, N. Yoshida, and N. Munekata, “Entertainment feature of a game using skin conductance response”, in Proceedings of the 2004 ACM SIGCHI International Conference on Advances in computer entertainment technology - ACE ’04, Singapore: ACM Press, 2004. [89] Y. Li, A. S. Elmaghraby, A. El-Baz, and E. M. Sokhadze, “Using physiological signal analysis to design affective VR games”, in 2015 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Abu Dhabi, United Arab Emirates: IEEE, Dec. 2015. [90] G. Tanda, “The use of infrared thermography to detect the skin temperature response to physical activity”, Journal of Physics: Conference Series, vol. 655, Nov. 2015. [91] O. Postolache, F. Lourenco, J. M. Dias Pereira, and P. S. Girao, “Serious game for physical rehabilitation: Measuring the effectiveness of virtual and real training environments”, in 2017 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Torino, Italy: IEEE, May 2017. [92] H. Jin, Y. S. Abu-Raya, and H. Haick, “Advanced Materials for Health Monitoring with Skin-Based Wearable Devices”, Advanced Healthcare Materials, vol. 6, no. 11, Jun. 2017. [93] K. A. Herborn, J. L. Graves, P. Jerem, N. P. Evans, R. Nager, D. J. McCafferty, and D. E. McKeegan, “Skin temperature reveals the intensity of acute stress”, Physiology & Behavior, vol. 152, Dec. 2015.

26 [94] V. C. R. Appel, V. L. Belini, D. H. Jong, D. V. Magalhaes, and G. A. P. Caurin, “Classifying emotions in rehabilitation robotics based on facial skin temperature”, in 5th IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics, Sao Paulo, Brazil: IEEE, Aug. 2014. [95] E. Salazar-López, E. Domínguez, V. Juárez Ramos, J. de la Fuente, A. Meins, O. Iborra, G. Gálvez, M. Rodríguez-Artacho, and E. Gómez-Milán, “The mental and subjective skin: Emotion, empathy, feelings and thermography”, Consciousness and Cognition, vol. 34, Jul. 2015. [96] C. Yun, D. Shastri, I. Pavlidis, and Z. Deng, “O’ game, can you feel my frustration?: Improving user’s gaming experience via stresscam”, in Proceedings of the 27th international conference on Human factors in computing systems - CHI 09, Boston, MA, USA: ACM Press, 2009. [97] W. Vanmarkenlichtenbelt, H. Daanen, L. Wouters, R. Fronczek, R. Raymann, N. Severens, and E. Vansomeren, “Evaluation of wireless determination of skin temperature using iButtons”, Physiology & Behavior, vol. 88, no. 4-5, Jul. 2006. [98] Y. Yamamoto, D. Yamamoto, M. Takada, H. Naito, T. Arie, S. Akita, and K. Takei, “Efficient Skin Temperature Sensor and Stable Gel-Less Sticky ECG Sensor for a Wearable Flexible Healthcare Patch”, Advanced Healthcare Materials, vol. 6, no. 17, Sep. 2017. [99] V. Bernard, E. Staffa, V. Mornstein, and A. Bourek, “Infrared camera assessment of skin surface temperature – Effect of emissivity”, Physica Medica, vol. 29, no. 6, Nov. 2013. [100] D. Ettehad, C. A. Emdin, A. Kiran, S. G. Anderson, T. Callender, J. Emberson, J. Chalmers, A. Rodgers, and K. Rahimi, “Blood pressure lowering for prevention of cardiovascular disease and death: A systematic review and meta-analysis”, The Lancet, vol. 387, no. 10022, Mar. 2016. [101] on behalf of ESH Working Group on Blood Pressure Monitoring, G. Parati, G. S. Stergiou, R. Asmar, G. Bilo, P. de Leeuw, Y. Imai, K. Kario, E. Lurbe, A. Manolis, T. Mengden, E. O’Brien, T. Ohkubo, P. Padfield, P. Palatini, T. G. Pickering, J. Redon, M. Revera, L. M. Ruilope, A. Shennan, J. A. Staessen, A. Tisler, B. Waeber, A. Zanchetti, and G. Mancia, “European Society of Hypertension Practice Guidelines for home blood pressure monitoring”, Journal of Human Hypertension, vol. 24, no. 12, Dec. 2010. [102] D. Gasperin, G. Netuveli, J. S. Dias-da-Costa, and M. P. Pattussi, “Effect of psychological stress on blood pressure increase: A meta-analysis of cohort studies”, Cadernos de Saúde Pública, vol. 25, no. 4, Apr. 2009. [103] E. A. Butler, T. L. Lee, and J. J. Gross, “Does Expressing Your Emotions Raise or Lower Your Blood Pressure?: The Answer Depends on Cultural Context”, Journal of Cross-Cultural Psychology, vol. 40, no. 3, May 2009. [104] J. A. McCubbin, M. M. Merritt, J. J. Sollers, M. K. Evans, A. B. Zonderman, R. D. Lane, and J. F. Thayer, “Cardiovascular-Emotional Dampening: The Relationship Between Blood Pressure and Recognition of Emotion”, Psychosomatic Medicine, vol. 73, no. 9, Nov. 2011. [105] D. E. Warburton, S. S. Bredin, L. T. Horita, D. Zbogar, J. M. Scott, B. T. Esch, and R. E. Rhodes, “The health benefits of interactive video game exercise”, Applied Physiology, Nutrition, and Metabolism, vol. 32, no. 4, Aug. 2007. [106] A. M. Porter and P. Goolkasian, “Video Games and Stress: How Stress Appraisals and Game Content Affect Cardiovascular and Emotion Outcomes”, Frontiers in Psychology, vol. 10, May 2019. [107] R. J. 