<<

2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR)

A Comparative Analysis of 3D User Interaction: How to Move Virtual Objects in Mixed Reality

Hyo Jeong Kang* (University of Florida), Jung-hye Shin† (University of Wisconsin-Madison), Kevin Ponto‡ (University of Wisconsin-Madison)

Figure 1: Example scenes of three interaction techniques

ABSTRACT

Using one's hands can be a natural and intuitive method for interacting with 3D objects in a mixed reality environment. This study explores three hand-interaction techniques, including the gaze and pinch, touch and grab, and worlds-in-miniature interactions, for selecting and moving virtual furniture in a 3D scene. Overall, a comparative analysis reveals that the worlds-in-miniature provided better usability and task performance than the other studied techniques. We also conducted in-depth interviews and analyzed participants' hand gestures in order to identify desired attributes for 3D hand interaction. Findings from the interviews suggest that, when it comes to enjoyment and discoverability, users prefer directly manipulating the virtual furniture to interacting with objects remotely or using indirect interactions such as gaze. Another insight this study provides is the critical role of the virtual object's visual appearance in designing natural hand interaction. Gesture analysis reveals that the shapes of furniture, as well as its perceived features such as weight, largely determined the participant's instinctive form of hand interaction (i.e., lift, grab, push). Based on these findings, we present design suggestions that can aid 3D interaction designers in developing natural and intuitive hand interaction for mixed reality.

Index Terms: Human-centered computing—Human computer interaction (HCI)—Interaction paradigms—Virtual reality; Human-centered computing—Human computer interaction (HCI)—HCI design and evaluation methods—User studies

*e-mail: [email protected]fl.edu
†e-mail: [email protected]
‡e-mail: [email protected]

1 INTRODUCTION

The use of hand input has received increasing attention in the virtual reality (VR) community. There is a growing body of research that has begun to cast light on hand interaction; yet, much prior VR research predominantly focused on 3D interaction using controllers. Using bare hands offers a distinctive user experience as opposed to using controllers. Compared to a controller, where a single button is often mapped to a specific action, human hands are capable of various gestures by nature. For example, to move a virtual object, the standard design practice for VR controllers is to hold a grip button. The dynamic nature of human hands, however, makes it challenging to find and pin down the most natural hand input among possible gestures—lift, grab, push, point, pinch, and more. The fundamental difference lies in our familiarity. Unlike a controller, which typically requires extra effort to learn, we are more likely to adopt our habitual actions of using hands when interacting with a virtual object. Consequently, identifying the most natural hand interaction is complicated, since the idea of "natural" hand interaction can vary depending on the context, and it requires an understanding of how we interact with objects in everyday life.

Prior 3DUI research has developed novel metaphors for hand gestures, such as finger-pointing that works as a mouse cursor [10, 15, 27, 39], a multimodal interaction that integrates finger-pointing and gaze [21, 32, 42], and multiple gestures such as grabbing and pinching for complex tasks [16, 20]. While these streams of research provide valuable insights into designing innovative hand interactions, relatively little is known about the tradeoffs among existing techniques. Furthermore, the majority of prior works focus on the usability aspects of hand input. For instance, 3DUI research abounds with discussions on efficient and accurate hand inputs that address technical concerns such as object occlusions and a small field of view [13, 17, 21, 32]. The question of which hand gesture would render a natural interaction, on the other hand, remains less explored.

To this end, this study aims to provide a comprehensive review of the tradeoffs among well-known 3D hand interaction designs. We compare three 3D interactions: 1) multimodal interaction using gaze and pinch gesture, 2) direct touch and grab interaction, and 3) worlds-in-miniature. In comparing these three interactions, we first observed participants' behavioral responses without giving instructions. This approach helps us examine the discoverability of each interaction design, thereby providing insights into natural hand interaction. We particularly focused on an interaction that involves selecting and arranging furniture items. Architecture and interior design are among the most promising application areas in MR; thus, findings from this study can provide immediate real-world implications. In comparing these three interactions, we address the following research questions.

• Discoverability: which interaction design requires a greater time investment to understand the mechanism without instructions? In addition to the three studied interactions, are there other gestures participants try to use?

• Accuracy and Efficiency: which interaction mechanism offers the user increased efficiency and accuracy to complete the task?

• Naturalism and usability: which interaction mechanism is more intuitive and easier to use?

The main contributions of this paper are twofold. First, our user study can help clarify the advantages and disadvantages of these three interactions, providing practical implications for designers to combine the best features of these existing methods. Second, it is particularly instructive to examine the discoverability of widely-known interaction designs. Discoverability is a critical concern for user experience, especially for users who are not proficient in the use of modern technology and have difficulty learning a new interface. In this study, we observed how novice MR users responded to each interaction when there was no instruction. The results of this gesture analysis can help designers identify gestures that would render more natural interactions than existing techniques.

2 RELATED WORKS

2.1 Naturalism and Discoverability in 3D Interaction

The definition of the term naturalism varies widely depending on the field of study. In 3DUI research, the term describes user interaction that mimics its real-world counterpart [5]. Discoverability, coined by Don Norman [25], is a measure of how easily users can find and understand a new interface with a minimum level of instruction. Previous human-computer interaction and VR research shows that the major benefits of discoverability include minimizing the learning curve [5], increasing user satisfaction [33], and enhancing users' intention to use the system [8].

Discoverability and naturalism are closely related concepts; however, the mere fact that an interaction is easily discoverable does not always mean that the interaction is natural. For example, using a swipe gesture is not how we navigate pages in the real world, but most smartphone users can easily discover the swiping feature without instructions. Discoverability reflects a user's prior knowledge and mental model of how the system should work [25]. As such, examining the discoverability of existing interfaces can be a key first step toward understanding users' expectations and belief systems about how hand-based mixed reality interaction should work. In this paper, we investigate the naturalism and discoverability of interaction designs by observing users' behavioral responses. An observational study with existing systems can help designers fill the gap between users' mental models and existing interaction methods.

2.2 Three Existing Techniques

This study compares three 3D interaction techniques: 1) multimodal interaction using gaze and pinch gesture, 2) direct touch and grab interaction, and 3) worlds-in-miniature. These three interactions were chosen as the subject of the study based on their popularity as well as the comparative advantage each interaction provides.

First, a gaze and pinch interaction is the most widely used interaction for mixed reality headsets such as a Microsoft Hololens. This multimodal interaction integrates eye gaze or head pose with a gesture, enabling users to interact with remote objects. Using a gaze input allows users to avoid false occlusion cues when hand tracking is not available, and it also deals with the constraints of a limited field of view. With a gaze and gesture interaction, users can select the target object with their head movement or eye gaze and move the object by holding a pinch gesture.

Second, a direct touch and grab interaction is the default interaction mechanism for many VR interfaces. The classical design approach for VR controllers (e.g., Touch, Vive) is using a grip button for grabbing the object. When using hands, users can utilize their natural hand movements by directly touching an object with their bare hands and physically moving the object with their arms' movement.

The relative merits and demerits of these two interaction techniques make it worthwhile to compare the gaze and pinch and the direct touch and grab techniques. While considered the most natural, the primary issue with a direct touch-grab interaction is that users can only interact with objects within their arms' reach. A gaze and pinch interaction can solve such issues as it enables users to interact with remote objects; however, a gaze-based interaction may seem less natural or intuitive.

Worlds-in-miniature (WIM) interaction entails the advantages of both a gaze-pinch and a direct grab-touch interaction [37]. While still allowing users to have a natural and direct interaction, the WIM provides users with miniature copies of the actual objects so that users can interact with remote objects indirectly. There are other 3DUI techniques, such as the go-go interaction [30] or the ray-casting technique [3], that also support direct manipulation from a distance; yet, we examined the worlds-in-miniature given its growing popularity in VR home design applications (e.g., TrueScale, Home VR).

2.3 Comparative Studies on 3D User Interaction

A 3D interaction technique refers to an interaction mechanism for how users can execute different types of tasks in the 3D environment [4]. In this study, we mainly address selection and manipulation tasks. Selection refers to the task of picking one or more objects from the environment, and manipulation refers to the task of changing a virtual object's orientation and position [5].

In evaluating 3D user interaction, accuracy and efficiency have been a fertile research topic in previous HCI studies. As early as 1999, Bowman and Hodges [2] compared six different 3D manipulation techniques to find a more accurate and efficient way to interact with a remote object in a VR environment. In their comparative study, conventional raycasting offered more precise control than using a virtual arm. More recently, Kytö et al. [21] examined selection techniques for a head-worn AR device using a Microsoft Hololens. Their study findings indicated that using gaze was generally faster than a hand gesture or a hand-held device. In a similar vein, research reported gaze input being faster than hand-pointing on a large stereoscopic projected display [35]. Perceived usability is another critical dimension in evaluating 3D user interaction techniques. Bernardos, Gómez, and Casar [1] compared head-directed selection to hand-based pointing on a projection screen. In their study, users found hand-based pointing far more intuitive and usable than head-directed selection. McMahan et al. [23] compared three interaction techniques on a projected wall and found hand-based object manipulation more natural than using controllers.

In summary, much of the prior research showed that more naturalistic interaction—using hands rather than gaze or ray-casting—provides better usability. With respect to accuracy and efficiency, however, several works indicated that perceived naturalism is not a necessary component for efficient and precise task performance. While these prior study findings provide insights into evaluating existing 3D user interactions, the differences across experiment settings preclude any generalizable conclusion. Prior works on 3D user interaction have predominantly focused on interaction with a projected wall and head-mounted VR displays, and relatively little is known for the wearable mixed reality display context. Furthermore, most of the previous comparative studies tend to focus on a specific interaction mechanism for a single task, for example, comparing head tracking to eye tracking for a selection task [21], different ray-casting mechanisms for a pointing task [7, 9], or different finger-based projection techniques for a manipulation task [22].

In this study, we aim to provide a more comprehensive understanding of 3D user interaction by examining three interaction techniques in accomplishing a series of tasks—selecting a menu, selecting an object, and positioning the object. We further examined the discoverability of the three interaction designs by observing participants' hand gestures. The contribution of this study lies in providing a broader comparison of widely-known 3DUI techniques, thereby offering insights for designers to improve current 3D user interaction designs.

3 DESIGN IMPLEMENTATION

3.1 Device Setup

Mixed Reality Setup: In order to create a mixed reality environment, we used an Oculus Rift and a ZED Mini camera. A ZED Mini is an attachable depth camera that features dual 2K image sensors, providing a mixed reality experience for head-worn VR devices with real-virtual occlusion. The purpose of using a ZED Mini is to achieve a wider field of view: the ZED Mini offers a 110-degree field of view, which makes it easier to interact with large furniture objects. The ZED Mini supports spatial mapping with depth occlusion, which is useful for collision detection, as it allows users to place virtual objects in the real environment. The ZED Mini camera was mounted on an Oculus Rift headset. The tracking area was set up with two Oculus sensors, and the size of the tracking area was approximately 3 meters by 2.5 meters, the optimal play area for an Oculus.

Hand Tracking: We used a Leap Motion for hand tracking. The Leap Motion uses image processing with infrared stereoscopic video [36]. Unlike other devices such as the Kinect, a Leap Motion does not return a complete depth map but only a set of relevant hand points and some hand pose features; thus, we used the preset hand pose features of the Leap Motion—grabbing, pinching, and pointing. A virtual hand was always rendered in the scene, letting users know their hands were tracked well. The Leap Motion was attached to the front of the Oculus Rift.

Figure 2: Example scenes: Gaze and pinch (top), Direct touch and grab (middle), and worlds-in-miniature (bottom).
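Because the system relies only on preset pose features, recognition essentially reduces to thresholding per-hand pose strengths. The sketch below illustrates this idea in Python; the `HandState` fields are illustrative stand-ins for the kind of normalized pinch/grab strengths a Leap-style tracker reports, not the study's actual code.

```python
from dataclasses import dataclass

@dataclass
class HandState:
    """Per-frame hand data, as a Leap-style tracker might report it (assumed)."""
    pinch_strength: float  # 0.0 (open) .. 1.0 (full pinch)
    grab_strength: float   # 0.0 (open) .. 1.0 (closed fist)
    index_extended: bool   # whether the index finger is extended

def classify_pose(hand: HandState,
                  pinch_thresh: float = 0.8,
                  grab_thresh: float = 0.8) -> str:
    """Map raw pose strengths to the three preset gestures used in the study."""
    if hand.pinch_strength >= pinch_thresh and hand.grab_strength < grab_thresh:
        return "pinch"
    if hand.grab_strength >= grab_thresh:
        return "grab"
    if hand.index_extended:
        return "point"
    return "neutral"

# Example: a strong pinch without a closed fist is classified as "pinch".
print(classify_pose(HandState(pinch_strength=0.9, grab_strength=0.2,
                              index_extended=False)))  # -> "pinch"
```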

3.2 Gaze and Pinch Interaction

The gaze and pinch interaction uses head movement as a proxy for eye gaze. We use the term "gaze" instead of head position, which would be more accurate, because gaze is the more generally accepted word for describing this multimodal interaction [21]. The three tasks are implemented in the following ways: 1) Menu: a menu was always displayed in front of the user at a distance of 2 meters. The gaze cursor moves along with the user's head movement; when the cursor was pointing at an object, the object was highlighted in blue. 2) Selection: selections are triggered explicitly using a pinch or "air-tap" gesture. When the user selects the virtual furniture, the furniture item appears in front of the user within the reachable area, a range between 0.6 and 0.8 meters. 3) Manipulation: users could move the furniture by holding a pinch gesture. The virtual object moved along with the hand's movement, as seen in Figure 2 (top).
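To make the mechanics above concrete, the following is a minimal per-frame sketch of one plausible implementation: cast a ray from the head pose, highlight the hit object, and translate the selected object with the tracked hand while the pinch is held. The names (`scene.raycast`, `state`, etc.) are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

PINCH_ON = 0.8   # pinch strength needed to start a grab
PINCH_OFF = 0.5  # lower release threshold gives hysteresis

def update_gaze_pinch(head_pos, head_dir, hand_pos, pinch_strength,
                      scene, state):
    """One frame of a gaze-and-pinch technique (illustrative sketch)."""
    # 1) Gaze cursor: intersect the head ray with the scene and highlight.
    target = scene.raycast(origin=head_pos, direction=head_dir)  # assumed API
    for obj in scene.objects:
        obj.highlighted = (obj is target)

    held = state.get("held")
    if held is None:
        # 2) Selection: an explicit pinch while gazing at an object grabs it.
        if target is not None and pinch_strength >= PINCH_ON:
            state["held"] = target
            # Remember the object's offset from the hand so it doesn't jump.
            state["offset"] = np.asarray(target.position) - np.asarray(hand_pos)
    else:
        # 3) Manipulation: the object follows the hand while the pinch holds.
        if pinch_strength > PINCH_OFF:
            held.position = np.asarray(hand_pos) + state["offset"]
        else:
            state["held"] = None  # releasing the pinch drops the object
```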
3.3 Direct Touch and Grab Interaction

The grab and touch interaction has been a standard way to interact with a virtual object on head-mounted VR devices such as the Oculus Rift and HTC Vive [18]. The three tasks are implemented in the following ways: 1) Menu: the attached-menu metaphor was used. A menu was attached to the user's left hand, and users could evoke the menu by flipping their left hand (seen in Figure 2, middle). We used this attached-menu metaphor for two reasons: first, this menu selection mechanism corresponds well to direct touch and grab interaction because the menu must be directly touched to be invoked. Second, the attached menu minimizes unnecessary physical movement involved in menu selection, such as walking toward the menu, and thus enables better comparisons with the other conditions. 2) Selection: users could select the menu by directly touching the button with their index finger. When users select the virtual furniture, the furniture item appears in front of them, within a range between 0.6 and 0.8 meters; this range reflects the average length of human arms. For object selection, users could select an object by directly touching it with their bare hands and could then grab it by making a fist. 3) Manipulation: users could move the object with their arms' movement. Users had to physically walk around the room in order to place the virtual furniture.

3.4 Worlds-in-Miniature Interaction

The last interaction is the worlds-in-miniature (WIM) metaphor [37]. Unlike the previous two interactions, where users directly select and move the actual-scale object, the WIM interaction provides users with a hand-held miniature representation of the objects. In order to create the miniature-size room, users were first asked to scan the surrounding environment. Using the ZED Mini's depth camera, the scanned room environment was brought into the miniature environment. After the generation of the miniature floor, the three tasks are implemented in the following ways: 1) Menu: the menu was displayed in front of the user at a distance of 0.6-0.8 meters, which reflects the reachable length for an average human. 2) Selection: when users select the menu by pressing the button with their index finger, the miniature virtual objects appear. 3) Manipulation: by moving a miniature object, users move the corresponding actual-size object, as seen in Figure 2 (bottom). Users were able to select and move the miniature object by directly manipulating it with the touch and grab interaction.
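The core of the WIM technique is a fixed mapping between the miniature and room coordinate frames: a displacement of a miniature object maps to that displacement divided by the miniature's scale in the full-size room. A minimal sketch of that mapping, assuming a uniform scale factor and illustrative frame origins:

```python
import numpy as np

def miniature_to_room(mini_pos, mini_origin, room_origin, scale):
    """Map a miniature object's position to its full-scale counterpart.

    `scale` is the miniature's size relative to the room (e.g. 0.05 for a
    1:20 model); the origins anchor the two coordinate frames. Illustrative
    only; the study's actual implementation is not described at this level.
    """
    offset = (np.asarray(mini_pos) - np.asarray(mini_origin)) / scale
    return np.asarray(room_origin) + offset

# Moving a miniature chair 5 cm in a 1:20 model moves the real chair 1 m.
print(miniature_to_room([0.05, 0.0, 0.0], [0, 0, 0], [0, 0, 0], 0.05))
```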

Figure 3: Overview of experimental procedure

4 USER STUDY

4.1 Participants

21 participants were recruited through mailing lists and flyers on campus. Participants' ages ranged from 18 to 44 with a mean of 24.2 (SD = 7.8), and the majority of them were university students. We specifically sought participants who had no prior experience with head-worn mixed reality devices and hand tracking devices. None of the participants had previous experience with mixed reality headsets or hand tracking devices, but nine participants had prior experience with head-worn VR devices such as an Oculus Rift or HTC Vive. On average, participants spend a significant amount of time working on a computer on a daily basis (5-8 hours a day: 47%, more than 8 hours: 28%).

4.2 Tasks

With each interaction condition, participants were asked to complete three tasks: 1) select the menu to choose a virtual object; 2) move the object to the designated area; and 3) arrange multiple furniture items. Figure 3 shows an overview of the three tasks and the experiment procedure. A virtual object always appeared within a reachable area from the user, a range between 0.6 and 0.8 meters, which is the average length of a human arm.

The primary purpose of tasks 1 and 2 is to examine the discoverability of each interaction technique. Thus, a minimum level of instruction was provided to participants. For example, for the gaze-pinch interaction, participants were told to "use a hand-gesture to move the object," and no further information about what gesture users should use was provided. For the touch-grab interaction, participants were told to "use hands to move objects"; no further information about how participants could move the virtual furniture was provided. For the worlds-in-miniature interaction, participants received the same instruction as in the touch-grab interaction: "use hands to move objects."

After completing tasks 1 and 2, detailed instructions on how to interact with the virtual objects were provided before task 3. The goal of task 3 is to measure the accuracy and efficiency of each interaction. After a brief demonstration and trials, participants were asked to select two red stools, two black stools, and one table, then arrange the four chairs around the table so that each chair is perpendicular to each other. The total time it took to complete the task was measured as an indicator of efficiency. The angle between each chair was measured as an indicator of accuracy.

4.3 Experiment Procedure

In order to examine participants' preference and compare the three techniques, this study employed a within-subject study design. Each interaction condition was fully counter-balanced to minimize the learning effect. Participants experienced all three interactions in a randomized order and were asked to pick the one interaction design they preferred the most. Participants evaluated each interaction by answering questionnaires. Finally, they were asked to complete a final survey that included questions on demographic information and their average computer usage. After the completion of the survey, participants were invited for a follow-up interview. The entire process took approximately 45 minutes.

4.4 Measures

4.4.1 Evaluation of three interaction designs

Discoverability: Discoverability was examined by measuring the time it took to complete task 1 and task 2 [29]. In our pilot study with 7 graduate students, we asked them to notify us when they could not figure out the interaction and wanted to give up the tasks. The pilot study indicated that people feel a great level of frustration when they fail to interact with the virtual objects after 1.5 minutes. Based on this result, when participants could not figure out the interaction for longer than 2 minutes, we marked the task as "failed" and asked participants to proceed to the next task.

Efficiency and Accuracy: The efficiency of each interaction design was examined by measuring the time it took to complete task 3. The accuracy of each interaction design was examined using position data of five furniture items: four stools and one table. Participants were asked to arrange the four stools so that they were perpendicular to each other. The average deviation from 90 degrees was calculated to examine accuracy.

Usability: The perceived usability of each interaction design was evaluated using the System Usability Scale (SUS) [6]. The SUS includes 10 items with five responses that range from strongly agree to strongly disagree. Example items include: "I found the system was easy to use," and "I would imagine that most people would learn to use this system very quickly." In order to examine perceived task loads, the NASA Task Load Index (NASA-TLX) questionnaire was also included in the survey [14].

Naturalism and Enjoyment: To measure the perceived level of naturalism, we used five-point Likert scale questionnaires. The questionnaire includes three items: "interaction with a virtual object was similar to how I would interact with an object in the real world," "interaction with a virtual object felt familiar," and "interaction with a virtual object seems intuitive and natural." The questionnaire for enjoyment was adopted from previous literature on the extended Technology Acceptance Model [11]. The enjoyment questionnaire includes two items: "It was enjoyable to use this system," and "It was fun to use this system." Cronbach's alpha for both variables was above 0.7 (α_naturalism = 0.78; α_enjoyment = 0.81).

Control variables: Responsiveness of the system was included as a control variable. The questionnaire asked participants whether they experienced system lag when they experienced each interaction design. Participants' demographic information, as well as their previous experience with VR and their average daily computer use, were also measured.
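The SUS used above yields a 0-100 score from the ten 1-5 responses via Brooke's standard rescaling [6]: odd (positively worded) items contribute response minus 1, even (negatively worded) items contribute 5 minus response, and the sum is multiplied by 2.5. A minimal scoring sketch:

```python
def sus_score(responses):
    """Compute the System Usability Scale score from ten 1-5 responses."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    odd = sum(r - 1 for r in responses[0::2])   # items 1, 3, 5, 7, 9
    even = sum(5 - r for r in responses[1::2])  # items 2, 4, 6, 8, 10
    return 2.5 * (odd + even)                   # 0 (worst) .. 100 (best)

print(sus_score([4, 2, 4, 2, 5, 1, 4, 2, 5, 2]))  # -> 82.5
```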
After the completion of all three tasks, participants evaluated the interaction design. These three tasks were repeated for all three interactions.

4.4.2 Gesture analysis and follow-up interview

Gestures: Studying gestures can help 3D UI designers find suitable interface features that correspond well to users' expectations [31]. Following Kendon's gesture analysis method [19], each session was video-recorded, and changes of gesture were encoded with a description of the movement change. The purpose of this analysis is to identify the gestures participants use when they first encounter a virtual object; thus, the video recordings of the task 1 and task 2 sessions were mainly used for the analysis. Each hand movement is described with three main elements: the number of hands used, the specific gesture used, and a target object. Examples of descriptions include: pressing the button with an index finger, lifting the table using both hands, pushing the chair with one hand, and gripping the top of the chair.
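Given codes of the form (hands, gesture, target), the statistic reported later in Table 1—the percentage of participants who used each coded gesture at least once—is a simple set computation. A sketch with made-up example data:

```python
from collections import defaultdict

# Hypothetical coded observations: participant id -> list of
# (hands, gesture, target) descriptions, following the coding scheme above.
codes = {
    "P01": [("one hand", "press", "menu button"),
            ("both hands", "lift", "table")],
    "P02": [("one hand", "press", "menu button"),
            ("one hand", "push", "chair")],
}

def gesture_frequencies(codes):
    """Percentage of participants who used each coded gesture at least once."""
    users = defaultdict(set)
    for pid, observations in codes.items():
        for obs in observations:
            users[obs].add(pid)
    n = len(codes)
    return {obs: 100.0 * len(pids) / n for obs, pids in users.items()}

for obs, pct in gesture_frequencies(codes).items():
    print(obs, f"{pct:.0f}%")  # e.g. ('one hand', 'press', 'menu button') 100%
```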

User experience: In order to better understand participants' experience with the three interaction designs, a follow-up interview was conducted. The interview questions mainly included five items: 1) which interaction did you prefer the most? 2) why do you prefer interaction X over the others? 3) what do you like or dislike about the other interactions? 4) what is the hardest part about the interaction you experienced? 5) how would you provide a solution to make an improvement? In addition to these five questions, we also asked questions based on our observations, for instance, about what specific gesture participants used and why.

5 RESULTS

5.1 Evaluation of Three Interaction Designs

5.1.1 Discoverability

In order to examine how easy each interaction is to discover, the time it took for participants to complete the first task (menu selection) and the second task (object selection and manipulation) was compared. ANOVA analysis shows a significant main effect for both tasks (menu selection: F(2, 51) = 11.13, p < .001, ηp² = .303; object selection and manipulation: F(2, 50) = 9.67, p < .001, ηp² = .283). Tukey's HSD post hoc test was conducted for pair-wise comparisons. The result of the analysis is shown in Figure 4.

Figure 4: Discoverability (top), Accuracy and Efficiency (bottom)

For the menu selection task, the gaze-pinch interaction took the longest amount of time (M gaze-pinch = 54.88, SD gaze-pinch = 20.47; M touch-grab = 42.83, SD touch-grab = 25.57; M WIM = 23.69, SD WIM = 12.27). With the worlds-in-miniature interaction, it took significantly less time to select the menu than with the gaze-pinch interaction (p < .001) and the touch-grab interaction (p < .05). For the object selection and manipulation task, the gaze-pinch interaction took a significantly longer amount of time to complete the task (M gaze-pinch = 49.20, SD gaze-pinch = 38.33; M touch-grab = 21.57, SD touch-grab = 14.50; M WIM = 15.92, SD WIM = 11.04). Six participants failed to figure out the gaze and pinch interaction within 2 minutes. There was no significant difference between the grab-touch interaction and the worlds-in-miniature interaction (p = .76).
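The paper does not state which statistical software was used; one common way to run an equivalent analysis in Python—a repeated-measures ANOVA (the design is within-subject) followed by Tukey HSD pairwise comparisons—is sketched below with hypothetical column names.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical long-format data: one completion time per participant
# and technique ("gaze-pinch", "touch-grab", "wim").
df = pd.read_csv("menu_selection_times.csv")  # columns: pid, technique, time

# Within-subject one-way ANOVA (each participant tried every technique).
print(AnovaRM(data=df, depvar="time", subject="pid",
              within=["technique"]).fit())

# Tukey HSD pairwise comparisons between the three techniques.
print(pairwise_tukeyhsd(endog=df["time"], groups=df["technique"], alpha=0.05))
```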
5.1.2 Accuracy and Efficiency

The object manipulation task involved arranging four stools around the table so that each stool is perpendicular to each other. Accuracy of manipulation was calculated by creating vectors between stools across from each other and comparing the angle between these two vectors to a right angle. The mean absolute angle deviation was used to represent the accuracy of each interaction design. ANOVA analysis shows a significant main effect on accuracy for the different interaction designs (F(2, 54) = 8.58, p < .001, ηp² = .241). The result of the analysis is shown in Figure 5.

Figure 5: Users' rating of NASA TLX (top); System Usability Scale, naturalism, and enjoyment of each interaction design (bottom)

The pair-wise comparison results show that mean manipulation error was significantly lower for the worlds-in-miniature interaction design than for the other two interaction techniques (M gaze-pinch = 20.36, SD gaze-pinch = 7.31; M touch-grab = 18.42, SD touch-grab = 6.85; M WIM = 12.01, SD WIM = 5.16). The efficiency of each interaction was measured by the total time it took to finish the object selection and manipulation task (task 3). ANOVA analysis shows a significant main effect on efficiency for the different interaction designs (F(2, 54) = 29.23, p < .001, ηp² = .51). Tukey's HSD pair-wise comparison test shows that worlds-in-miniature was significantly faster for completing the task than the other two interaction techniques (M gaze-pinch = 5.90, SD gaze-pinch = 2.08; M touch-grab = 3.98, SD touch-grab = 1.56; M WIM = 2.44, SD WIM = 0.51).
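The accuracy metric above—the angle between the two diagonals connecting opposing stools, compared to 90 degrees—can be computed directly from the logged positions. A minimal sketch, assuming the stools are ordered so that indices 0/2 and 1/3 face each other:

```python
import numpy as np

def perpendicularity_error(stools):
    """Absolute deviation (degrees) of the stool layout from a right angle.

    `stools` is a 4x2 (or 4x3) array of positions, ordered so that stools
    0 and 2, and stools 1 and 3, sit across from each other.
    """
    s = np.asarray(stools, dtype=float)
    v1 = s[2] - s[0]          # vector between one opposing pair
    v2 = s[3] - s[1]          # vector between the other pair
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return abs(angle - 90.0)

# A perfect square layout deviates by 0 degrees.
print(perpendicularity_error([[0, 0], [1, 0], [1, 1], [0, 1]]))  # -> 0.0
```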

Table 1: Findings from the gesture analysis. * indicates the percentage of participants who used the gesture.

Frequency of the gesture participants used (%)*

Menu selection:
  Press the menu button with an index finger (100%)
  Press the menu button with the entire hand (82%)
  Grabbing gesture with one hand (52%)
  Pinch or air-tap gesture with one hand (5%)

Object selection and manipulation (large size):
  Lift the table with both hands (88%)
  Grab the side of the chair with both hands (84%)
  Lift the chair with both hands (52%)
  Push the table using one hand (47%)
  Grabbing gesture with one hand (47%)
  Pinch or air-tap gesture with one hand (5%)

Object selection and manipulation (miniature):
  Grabbing gesture with one hand (82%)
  Push the object using one hand (29%)
  Grab the side of the object using both hands (29%)

Figure 6: Gestures used by participants for different tasks: interacting with a menu (top), a large object (middle), and a miniature object (bottom)

5.1.3 Usability, Naturalism, and Enjoyment

Usability of the interface was measured with two indexes: the NASA Task Load Index (NASA-TLX) and the System Usability Scale (SUS). The ANOVA results for the NASA-TLX are shown in Figure 6. There was a significant main effect for mental load (F(2, 53) = 10.04, p < .001, ηp² = .27), physical load (F(2, 54) = 4.84, p < .05, ηp² = .15), performance (F(2, 54) = 6.97, p < .01, ηp² = .20), effort (F(2, 54) = 38.29, p < .001, ηp² = .58), and frustration (F(2, 54) = 23.09, p < .001, ηp² = .46). The results from pair-wise comparisons between the three interaction designs indicated that the gaze-pinch interaction is significantly more demanding: it required a significantly higher level of effort than the touch-grab interaction (p < .05) and the worlds-in-miniature interaction (p < .001).

Significant main effects were reported for system usability (F(2, 54) = 20.13, p < .001, ηp² = .43) and perceived naturalism (F(2, 51) = 13.35, p < .001, ηp² = .34), yet no significant effect was found for enjoyment (F(2, 54) = 2.41, p = .09, ηp² = .08). The pairwise Tukey HSD test shows that the worlds-in-miniature interaction provides significantly better usability than the other two interactions (M gaze-pinch = 45.97, SD gaze-pinch = 18.39; M touch-grab = 57.61, SD touch-grab = 19.22; M WIM = 81.31, SD WIM = 13.92). In terms of perceived level of naturalism, both the touch-grab and worlds-in-miniature interactions were rated significantly more natural than the gaze-pinch interaction (M gaze-pinch = 2.25, SD gaze-pinch = 0.92; M touch-grab = 3.77, SD touch-grab = 0.97; M WIM = 3.41, SD WIM = 0.87), as shown in Figure 6.

5.2 Gesture Analysis and Follow-up Interview

5.2.1 Gesture Analysis

We analyzed video recordings of participants' hand movements by systematically categorizing and labeling the gestures they used. Due to a video recording error, videos of 17 participants were used for the analysis. Table 1 shows a summary of the findings from the gesture analysis. When participants first encountered the gaze-pinch interface, in which the menu was displayed far away from the user, the first action all participants tried was to press the menu button with an index finger. When the pressing gesture failed to evoke the menu selection, 14 participants walked or leaned toward the menu and tried to press the button again using the entire hand. The next action participants tried was a grabbing gesture. In the follow-up interviews, many participants noted that they would not have been able to figure out the pinch gesture unless they were provided with the instruction.

When participants first encountered the large-scale furniture items, either in the gaze-pinch or the touch-grab interaction condition, grabbing was not the first gesture participants tried to use. The majority of participants tried to move the virtual table by "lifting" it. For the virtual stool, the first action many participants tried was "grabbing" the side of the stool using both hands. 88% of participants used both hands when they were interacting with large-scale objects such as a table. On the other hand, for the miniature-size objects, participants frequently used the grabbing gesture to pick up the virtual object. The majority of participants used only one hand when they were interacting with a miniature-size object.

5.2.2 Follow-Up Interview: User Experience

The follow-up interviews reveal the strengths and weaknesses of each interaction technique in greater depth. The majority of participants preferred the worlds-in-miniature interaction over the other two for ease of use. Participants frequently described the worlds-in-miniature interaction as the "easiest" interaction. Although no statistically significant difference was reported in the level of enjoyment, about one third of participants indicated that they would enjoy directly interacting with the real-size object rather than manipulating a miniature object. Furthermore, more than half of the interviewees said they preferred walking around the room to standing still when interacting with virtual objects.

When it comes to discoverability and ease of use, the gaze and pinch interaction was participants' least favorite design amongst the three. Many noted that the gaze and pinch interaction was difficult to figure out without instruction. Three interviewees mentioned that they neither noticed the gaze cursor nor understood the mechanism of gaze input. While the majority of participants preferred the other two interaction designs, two participants said that they would like to use the gaze and pinch interaction in the future because it allows easier interaction once they learn how to use it. These two participants indicated that they especially liked the gaze-pinch interaction as it requires less physical movement and users can still directly interact with a life-scale object instead of moving miniature items.

"Even if it's not real, you think it's furniture. It feels like it should be heavy, and I should be using both hands."

More than 80 percent of participants used both hands when trying to lift a table or chairs. Interviews with participants reveal that perceived features of furniture, such as its weight, influence their instinctive form of hand interaction. Even after we provided the instruction and participants learned that they could move the object using only one hand, they preferred using both hands when moving a large object.

Participants also shared their opinions on desired attributes for interaction design and made suggestions on how to make improvements. Frequently mentioned topics include 1) dynamic interactions, 2) natural interaction, and 3) direct interaction. Based on these findings, we discuss design recommendations in the remainder of this paper. Figure 7 provides a summary of the follow-up interviews with sample verbatims.

User experience with each interaction design

Worlds in Miniature
Advantages: ease of use; efficient; more precise; natural
Disadvantages: not as fun as the other two; limited view
Sample verbatims:
"If you are a perfectionist and you want everything in the right position, this is a good way to move furniture"
"You don't really see the large one when you are arranging the small one"
"It was easier, but it's more fun to interact with large one"

Touch and Grab
Advantages: direct interaction; fun to use; natural; physical movement
Disadvantages: difficult to figure out the grip gesture; physical movement
Sample verbatims:
"It feels more real when you move and drag the chair around the floor"
"I think moving and walking is more intuitive. I don't mind having to walk. It's better that way"
"Interaction was actually not intuitive, I didn't know what to do to make it move. I kept pushing it at first"

Gaze and Pinch
Advantages: efficient once you figure it out; requires less physical movement
Disadvantages: less intuitive; have to gaze first; difficult to select when multiple objects are aligned in the same direction
Sample verbatims:
"If someone is not familiar with VR, using eye-gaze won't be intuitive. I noticed it (cursor) after a while, but I prefer the other one that you can just use your hands. You don't have to focus on things"
"Once you figure out, it is not difficult. I liked it because you can simply look at the object"
"It was not easy to use at first. I think you definitely need some instruction for this one"

Desired attributes for 3D UI interaction design

Dynamic interactions:
"I hope there are more you can do. Like you can pull, push, or lift, basically do everything that you would do with the real chair and table."
"You can combine all these three somehow or give options to choose. You can pinch, grab and have the mini-size one too."

Natural interaction:
"You don't grab the table. That's not how you move the table in a real-world. When you see a table like this you try to lift it. You put your hands under the top or you just push it. I wish I can do all that in AR too."
"Pinching worked fine when moving the chair, but not for the menu. Can you click it with the finger? That's what we do on our phone"

Direct and physical:
"I liked when you can directly touch and click the menu. When you see the sci-fi movie, you have all these windows and a virtual control center in front of you. You can still touch it"

Figure 7: A summary of findings from interviews

6 DISCUSSION

This study examines hand interaction techniques for a head-worn MR device. The primary goal of this study is to understand how to design 3D user interaction that is intuitive and easy to learn. Through a series of user studies, we compared three widely known interaction techniques: gaze and pinch interaction, touch and grab interaction, and worlds-in-miniature. The findings from the user study suggest two key implications.

First, the worlds-in-miniature interaction stands out amongst the three for providing better usability. Compared to the other two studied interactions, the worlds-in-miniature interaction allowed faster and more accurate interaction in manipulating multiple objects. Unlike the gaze-pinch and touch-grab interactions, which only provide perspective views of the objects, having an additional elevated view from above makes it easier to see the entire layout and understand the spatial configuration [34].

Second, the findings from the gesture analysis suggest the critical role of the visual appearance of a virtual object or, more specifically, the affordance of an object in enhancing discoverability [25]. That is, a visual characteristic of a virtual menu that makes it look "pushable" and the shape of a virtual table that affords "lifting" largely determined participants' form of interaction. The perceptual properties of an object, such as weight, influence how users interact with the object. This is well represented in one participant's comment: "It's furniture. It feels like it should be heavy."

7 DESIGN IMPLICATIONS

7.1 Natural Interaction is Dynamic and Complex

Designers often strive to create a simple design. Especially for an innovative interface such as a mixed reality system, one common design approach is to find one single input system that can work across different platforms [26]. Assigning a specific interaction to each input system—for example, simulating mouse movement with gaze or using finger pointing as a mouse cursor—has been a classical design approach. This one-to-one mapping approach may be easier to implement; however, such an interaction mechanism is difficult to figure out without instructions.

Natural interaction is complex and dynamic. In our in-depth interviews, when we asked participants to choose the single interaction they liked the most, many indicated that they wanted to combine all three interactions rather than picking one. The desire for dynamic interaction is well expressed in one participant's comment: "I hope there are more you can do. Like you can pull, push, or lift, basically do everything you would with the real chair."

7.2 Natural Interaction is Direct and Physical

Natural interaction is direct and physical, which means it involves "touching" objects. Mid-air gestures have gained popularity with the release of motion sensing devices [40]. Research defines a mid-air gesture as an interaction that involves touchless manipulation of digital content [38]. The term itself presumes indirect interaction.

Figure 8: Findings from gesture analysis and design suggestions based on the study findings

In our user study, participants frequently noted that a pinch gesture felt unnatural when they were selecting a menu item and manipulating an object. In the in-depth interviews, participants indicated that they would prefer a pushable menu and grabbable virtual objects with which they can have more direct interaction. Wobbrock et al. and Piumsomboon et al. reported similar findings [29, 41].

Gestures are often referred to as a critical component of the natural user interface [24, 28]. Holding a hand in the air may help users manipulate remote objects easily; however, the lack of physical contact can make the entire experience feel virtual. In order to make an interaction more natural and intuitive, the interface should allow direct interaction with the object. Providing digital buttons and icons within reachable distance, so that users can virtually touch the object rather than using a mid-air gesture, can simulate more realistic interaction.

7.3 Affordance and visual guidance

Coined by Gibson [12], the concept of affordance has become a core design principle in interaction design. Our gesture analysis confirmed that understanding a virtual object's affordance is vital for developing a more natural and intuitive 3D user interaction. Depending on the virtual object's visual appearance, the user's form of interaction can vary greatly. A button should be pushed, a table should be lifted, and a miniature object should afford being picked up.

Designers can take visual affordance into account and create more realistic interactions by developing physics-based interaction. However, creating accurate mesh colliders for individual objects and simulating realistic collision detection can often be time-consuming and computationally expensive. Providing interaction guidance, such as visually presenting the possible forms of gestures and interactions, can be a simple solution that helps users quickly learn the interaction. A great example of visual guidance comes from the bounding box and app bar in the Microsoft Hololens 2. The bounding box shown in the Hololens 2 informs the user that the object is currently adjustable while also responding to the proximity of the user's finger. This interactive visual feedback can communicate the possible interactions to users.
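Such proximity-responsive feedback can be approximated without full physics simulation: simply ramp a highlight as the fingertip nears an object. A minimal sketch of this idea, with illustrative names and thresholds:

```python
import numpy as np

def update_affordance_highlight(objects, fingertip, near=0.15, touch=0.02):
    """Ramp each object's highlight from 0 to 1 as a fingertip approaches.

    `near` and `touch` are distances in meters (assumed values): anything
    farther than `near` shows no feedback; anything within `touch` is
    fully highlighted.
    """
    tip = np.asarray(fingertip)
    for obj in objects:
        d = np.linalg.norm(tip - np.asarray(obj.position))  # center distance
        # Linear ramp: 0 beyond `near`, 1 at `touch` or closer.
        obj.highlight = float(np.clip((near - d) / (near - touch), 0.0, 1.0))
```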
8 LIMITATION

The focus of our study was discoverability. Discoverability can be an important concern for users who have difficulties learning a new interface; however, such pure novice users are only a small part of the picture. The results of the study would likely have been very different had participants had enough time to learn each interaction. Future work should address the progressive learning effect and further examine the usability of the varying interaction techniques over the longer term.

With the worlds-in-miniature interaction, it is challenging to pick up a virtual object that is smaller than the user's hand. Prior 3DUI works [16, 17] have tackled these issues and examined selection techniques in cluttered or dense VR environments. It would be interesting to explore how to design a WIM interface with these selection techniques.

Technical limitations are another important concern that has to be addressed in interpreting the study results. In creating a mixed reality setting, we used a ZED Mini camera's video see-through features rather than an optical head-mounted display such as a Microsoft Hololens. Video see-through images can suffer from image lag and delayed responses, causing motion sickness for users. The Leap Motion sensor does not provide the most accurate hand tracking nor support realistic object occlusions, which can result in a negative user experience.

Furthermore, we did not consider real-virtual occlusion in a larger space. In our study, we scanned the environment using a ZED Mini camera, but it was limited to the small room space. In a larger, cluttered environment, gaze-pinch might be able to outperform touch-grab and WIM: compared to the other two, it would be easier to navigate the larger space and move objects in a cluttered environment with the gaze-pinch interaction. Future studies can explore various scenarios to address such concerns.

9 CONCLUSION

This study compares three widely known interaction techniques: gaze and pinch interaction, touch and grab interaction, and worlds-in-miniature. Overall, the majority of participants found the gaze and pinch interaction the most challenging to use, while the worlds-in-miniature interaction provided users with the most accurate, efficient, and easiest interaction in arranging multiple furniture items. The results of the gesture analysis and in-depth interviews reveal the critical role of the visual affordance of virtual objects in determining participants' form of interaction in mixed reality. We further provide design suggestions that can aid in developing more intuitive interaction for a wearable mixed reality headset.

REFERENCES

[1] A. M. Bernardos, D. Gómez, and J. R. Casar. A comparison of head pose and deictic pointing interaction methods for smart environments. International Journal of Human–Computer Interaction, 32(4):325–351, 2016. doi: 10.1080/10447318.2016.1142054
[2] D. A. Bowman. Interaction Techniques for Common Tasks in Immersive Virtual Environments: Design, Evaluation, and Application. PhD thesis, Atlanta, GA, USA, 1999. AAI9953819.
[3] D. A. Bowman, J. L. Gabbard, and D. Hix. A survey of usability evaluation in virtual environments: Classification and comparison of methods. Presence: Teleoperators and Virtual Environments, 11(4):404–424, Aug. 2002. doi: 10.1162/105474602760204309
[4] D. A. Bowman and L. F. Hodges. User interface constraints for immersive virtual environment applications. 1995.
[5] D. A. Bowman, R. P. McMahan, and E. D. Ragan. Questioning naturalism in 3d user interfaces. Commun. ACM, 55(9):78–88, Sept. 2012. doi: 10.1145/2330667.2330687
[6] J. Brooke. System usability scale (SUS): A quick-and-dirty method of system evaluation user information. Digital Equipment Co Ltd., Reading, UK, pp. 1–7, 1986.
[7] V. Buchmann, S. Violich, M. Billinghurst, and A. Cockburn. Fingartips: Gesture based direct manipulation in augmented reality. In Proceedings of the 2nd International Conference on Computer Graphics and Interactive Techniques in Australasia and South East Asia, GRAPHITE '04, pp. 212–221. ACM, New York, NY, USA, 2004. doi: 10.1145/988834.988871
[8] F. Cafaro, L. Lyons, and A. N. Antle. Framed guessability: Improving the discoverability of gestures and body movements for full-body interaction. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–12, 2018.
[9] A. Cockburn, P. Quinn, C. Gutwin, G. Ramos, and J. Looser. Air pointing: Design and evaluation of spatial target acquisition with and without visual feedback. Int. J. Hum.-Comput. Stud., 69(6):401–414, June 2011. doi: 10.1016/j.ijhcs.2011.02.005
[10] C. Colombo, A. Del Bimbo, and A. Valli. Visual capture and understanding of hand pointing actions in a 3-d environment. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 33(4):677–686, 2003.
[11] F. D. Davis, R. P. Bagozzi, and P. R. Warshaw. User acceptance of computer technology: a comparison of two theoretical models. Management Science, 35(8):982–1003, 1989.
[12] J. J. Gibson. The theory of affordances. Hilldale, USA, 1(2), 1977.
[13] J. Hales, D. Rozado, and D. Mardanbegi. Interacting with objects in the environment by gaze and hand gestures. In Proceedings of the 3rd International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction, pp. 1–9, 2013.
[14] S. G. Hart and L. E. Staveland. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in Psychology, vol. 52, pp. 139–183. Elsevier, 1988.
[15] K. Hu, S. Canavan, and L. Yin. Hand pointing estimation for human computer interaction based on two orthogonal-views. In 2010 20th International Conference on Pattern Recognition, pp. 3760–3763. IEEE, 2010.
[16] Y. Jang, I. Jeong, T. Kim, and W. Woo. Metaphoric hand gestures for orientation-aware vr object manipulation with an egocentric viewpoint. IEEE Transactions on Human-Machine Systems, 47(1):113–127, Feb. 2017. doi: 10.1109/THMS.2016.2611824
[17] Y. Jang, H. Noh, T. Chang, W. Kim, and W. Woo. 3d finger cape: Clicking action and position estimation under self-occlusions in egocentric viewpoint. IEEE Transactions on Visualization and Computer Graphics, 21(4):501–510, April 2015. doi: 10.1109/TVCG.2015.2391860
[18] J. Jerald. The VR Book: Human-Centered Design for Virtual Reality. Morgan & Claypool, 2015.
[19] A. Kendon. Some relationships between body motion and speech. Studies in Dyadic Communication, 7(177):90, 1972.
[20] H. Kharoub, M. Lataifeh, and N. Ahmed. 3d user interface design and usability for immersive vr. Applied Sciences, 9(22):4861, 2019. doi: 10.3390/app9224861
[21] M. Kytö, B. Ens, T. Piumsomboon, G. A. Lee, and M. Billinghurst. Pinpointing: Precise head- and eye-based target selection for augmented reality. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18, pp. 1–14. ACM, 2018. doi: 10.1145/3173574.3173655
[22] F. Matulic and D. Vogel. Multiray: Multi-finger raycasting for large displays. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18, pp. 245:1–245:13. ACM, New York, NY, USA, 2018. doi: 10.1145/3173574.3173819
[23] R. P. McMahan, D. Gorton, J. Gresock, W. McConnell, and D. A. Bowman. Separating the effects of level of immersion and 3d interaction techniques. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology, VRST '06, pp. 108–111. ACM, New York, NY, USA, 2006. doi: 10.1145/1180495.1180518
[24] D. Mortensen. Natural user interfaces: What are they and how do you design user interfaces that feel natural. Interaction Design Foundation, 2019.
[25] D. A. Norman. Affordance, conventions, and design. Interactions, 6(3):38–43, 1999.
[26] T. Page. Skeuomorphism or flat design: future directions in mobile device user interface (ui) design. International Journal of Mobile Learning and Organisation, (2):130–142, 2014.
[27] M. Pfeiffer and W. Stuerzlinger. 3d virtual hand pointing with EMS and vibration feedback. In 2015 IEEE Symposium on 3D User Interfaces (3DUI), pp. 117–120. IEEE, 2015.
[28] T. Piumsomboon, D. Altimira, H. Kim, A. Clark, G. Lee, and M. Billinghurst. Grasp-shell vs gesture-speech: A comparison of direct and indirect natural interaction techniques in augmented reality. In 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 73–82, Sep. 2014. doi: 10.1109/ISMAR.2014.6948411
[29] T. Piumsomboon, A. Clark, M. Billinghurst, and A. Cockburn. User-defined gestures for augmented reality. In CHI '13 Extended Abstracts on Human Factors in Computing Systems, CHI EA '13, pp. 955–960. ACM, New York, NY, USA, 2013. doi: 10.1145/2468356.2468527
[30] I. Poupyrev, M. Billinghurst, S. Weghorst, and T. Ichikawa. The go-go interaction technique: Non-linear mapping for direct manipulation in vr. In Proceedings of the 9th Annual ACM Symposium on User Interface Software and Technology, UIST '96, pp. 79–80. ACM, New York, NY, USA, 1996. doi: 10.1145/237091.237102
[31] S. S. Rautaray and A. Agrawal. Vision based hand gesture recognition for human computer interaction: A survey. Artificial Intelligence Review, 43(1):1–54, Jan. 2015. doi: 10.1007/s10462-012-9356-9
[32] M. J. Reale, S. Canavan, L. Yin, K. Hu, and T. Hung. A multi-gesture interaction system using a 3-d iris disk model for gaze estimation and an active appearance model for 3-d hand pointing. IEEE Transactions on Multimedia, 13(3):474–486, June 2011. doi: 10.1109/TMM.2011.2120600
[33] M. Seto, S. Scott, and M. Hancock. Investigating menu discoverability on a digital tabletop in a public setting. In Proceedings of the 2012 ACM International Conference on Interactive Tabletops and Surfaces, pp. 71–80, 2012.
[34] A. Shashua. Projective depth: A geometric invariant for 3d reconstruction from two perspective/orthographic views and for visual recognition. In 1993 (4th) International Conference on Computer Vision, pp. 583–590. IEEE, 1993.
[35] L. E. Sibert and R. J. K. Jacob. Evaluation of eye gaze interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '00, pp. 281–288. ACM, New York, NY, USA, 2000. doi: 10.1145/332040.332445
[36] S. Sridhar, F. Mueller, M. Zollhöfer, D. Casas, A. Oulasvirta, and C. Theobalt. Real-time joint tracking of a hand manipulating an object from RGB-D input. In European Conference on Computer Vision, pp. 294–310. Springer, 2016.
[37] R. Stoakley, M. J. Conway, and R. Pausch. Virtual reality on a WIM: interactive worlds in miniature. In CHI '95, pp. 265–272, 1995.
[38] P. Vogiatzidakis and P. Koutsabasis. Gesture elicitation studies for mid-air interaction: A review. Multimodal Technologies and Interaction, 2(4):65, 2018.
[39] C. von Hardenberg and F. Bérard. Bare-hand human-computer interaction. In Proceedings of the 2001 Workshop on Perceptive User Interfaces, PUI '01, pp. 1–8. Association for Computing Machinery, New York, NY, USA, 2001. doi: 10.1145/971478.971513
[40] I. Wang, P. Narayana, D. Patil, G. Mulay, R. Bangar, B. Draper, R. Beveridge, and J. Ruiz. Exploring the use of gesture in collaborative tasks. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, CHI EA '17, pp. 2990–2997. ACM, New York, NY, USA, 2017. doi: 10.1145/3027063.3053239
[41] J. O. Wobbrock, M. R. Morris, and A. D. Wilson. User-defined gestures for surface computing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '09, pp. 1083–1092. ACM, New York, NY, USA, 2009. doi: 10.1145/1518701.1518866
[42] Y. Zhang, S. Stellmach, A. Sellen, and A. Blake. The costs and benefits of combining gaze and hand gestures for remote interaction. In IFIP Conference on Human-Computer Interaction, pp. 570–577. Springer, 2015.