Creating Novel Mixed Reality Experiences using existing sensors and peripherals by Fernando Trujano S.B., Massachusetts Institute of Technology (2017) Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY February 2018 © Fernando Trujano, 2017. All rights reserved. The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author...... Department of Electrical Engineering and Computer Science December 18, 2017 Certified by...... Pattie Maes Professor Thesis Supervisor Accepted by ...... Christopher J. Terman Chairman, Master of Engineering Thesis Committee

Creating Novel Mixed Reality Experiences using existing sensors and peripherals by Fernando Trujano

Submitted to the Department of Electrical Engineering and Computer Science on December 18, 2017, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract Mixed Reality (MR) technology allows us to seamlessly merge real and virtual worlds to create new experiences and environments. MR includes technologies such as Virtual Reality and Augmented Reality. Even though this technology has been around for several years, it is just starting to become available to the average consumer with the introduction of devices such as the HTC Vive and the Microsoft Hololens. This thesis consists of the design, implementation and evaluation of three projects that highlight a novel experience in mixed reality using existing sensors and peripherals. VRoom describes a simple and novel approach to 3D room scanning using only an HTC Vive and a 360 camera. Mathland is a mixed reality application that enables users to interact with math and physics using a stretch sensor, IMU and camera. ARPiano is an augmented MIDI keyboard that offers a different approach to music learning by adding intuitive visuals and annotations.

Thesis Supervisor: Pattie Maes Title: Professor

Acknowledgments

I would like to thank my advisors, Prof. Pattie Maes and Mina Khan, for their guidance and support throughout this thesis. I started my Master's without a particular project or idea in mind and they were instrumental in introducing me to the potentials of MR. I would also like to thank everyone in the Fluid Interfaces group at the Media Lab for sharing their passions and inspiring me to pursue mine. Mathland, as described in Chapter 3, was a collaborative effort with Mina Khan, Ashris Choudhury and Judith Sirera. They helped build and shape Mathland into the tool it is today, and will continue on with its development after my graduation. Finally, I would like to thank my mom, Marisa Pena-Alfaro, my sister, Linda Trujano, and the rest of my family for always cheering me up and giving me the confidence needed to pursue this project. I would like to thank my girlfriend, Juju Wang, for providing valuable input and constantly putting up with my rants on Mixed Reality technology. I would also like to thank my friends for providing a de-stressing environment during my time at MIT.

Contents

1 Introduction 15

2 VRoom 17 2.1 Introduction ...... 17 2.2 Current Research ...... 18 2.3 System Overview ...... 19 2.4 Implementation ...... 20 2.4.1 Capture Scene ...... 21 2.4.2 Viewer Scene ...... 21 2.4.3 PointViewer Scene ...... 23 2.5 Evaluation ...... 23 2.6 Challenges ...... 25 2.6.1 Synchronization ...... 25 2.6.2 VR Comfort ...... 26 2.7 Future Work ...... 27 2.8 Conclusion ...... 29

3 Mathland 31 3.1 Introduction ...... 31 3.1.1 Goals ...... 31 3.2 Integrated Sensors ...... 33 3.2.1 Hololens Camera ...... 33 3.2.2 F8 Sensor ...... 34

3.3 Design ...... 38 3.4 Sharing ...... 39 3.5 Evaluation ...... 39 3.5.1 Study results ...... 41 3.6 Challenges ...... 42 3.6.1 Hololens Development ...... 43 3.6.2 Hololens Anchor Sharing ...... 44 3.6.3 Showcasing AR Work ...... 45 3.6.4 Hardware Limitations ...... 46 3.7 Future Work ...... 47 3.8 Conclusion ...... 48

4 ARPiano 49 4.1 Introduction ...... 49 4.1.1 Goals ...... 49 4.2 Previous Work ...... 50 4.2.1 SmartMusic ...... 51 4.2.2 Synthesia ...... 51 4.2.3 Andante/Perpetual Canon ...... 51 4.3 Design ...... 52 4.3.1 Integrated Sensors ...... 52 4.3.2 Physical and Augmented Keyboard ...... 53 4.3.3 Keyboard UI/Helper classes ...... 54 4.3.4 Component Menu ...... 55 4.3.5 Sharing ...... 56 4.3.6 Unity Debugging ...... 57 4.4 Keyboard Components ...... 58 4.4.1 Chord Suggestion ...... 59 4.4.2 Keyboard Labels ...... 60 4.4.3 Song Tutorial ...... 60

4.4.4 Note Spawner and Flashy Visualizer ...... 64 4.4.5 Chord Detector ...... 65 4.5 Evaluation ...... 66 4.5.1 Study Results ...... 67 4.6 Challenges ...... 68 4.7 Future Work ...... 69 4.8 Conclusion ...... 71

5 Conclusion 73

List of Figures

2-1 VRoom system diagram. 360 Video and 3D positional data from the 360 cameras and Vive Tracker are fed into Unity in order to produce a VR application...... 19

2-2 The viewer scene finds the best frame to render given the user’s position and renders it on the head mounted display...... 22

2-3 The point viewer scene renders each frame as a sphere in the 3D posi- tion in which it was captured...... 23

3-1 (a) The Arm Controller consists of a compression sleeve that can be worn by the user. There is a figur8 sensor, which consists of an IMU and stretch sensor. (b) and (c) Tracking the user's arm movements using the Arm Controller ...... 35

3-2 Object controller that enables tangible interactions with virtual objects in Mixed Reality...... 36

3-3 The Microsoft Hololens headset (for MR experience) with our Object Controller and Arm Controller ...... 37

3-4 Five virtual items in Mathland’s menu to allow the user to experience different physical forces and interactions...... 38

3-5 Three puzzles created by arranging the virtual checkpoints in MR. The left side images show the puzzle whereas the right side ones show the solution. Each puzzle corresponds to a specific physics concept: a) Circular Motion (using a ball with a rope and velocity vector) b) Linear Motion (using a ball with a downward Force Field, two Ramps, and a Velocity Vector) c) Projectile Motion (using a downward Force Field, and a Velocity Vector) ...... 40

3-6 Results from our survey with 30 participants. The responses were based on a 5-point Likert scale: Strongly agree, Agree, Neutral, Disagree, Strongly Disagree. Charts a) to f) each correspond to one question on our post-study survey...... 42

4-1 Multifunction knob and MIDI keyboard with AR markers on first and last keys...... 52

4-2 Calibration ensures that the Hololens knows the exact location of the physical keyboard. The markers show the perceived location of the first and last keys...... 54

4-3 3D icons representing Keyboard Components are overlaid on top of the keyboard. From left to right, the icons represent: KeyboardLabels, NoteSpawner, ChordSuggestion, ChordDetector and SongTutorial. In this figure, the NoteSpawner component is enabled...... 56

4-4 Non-calibration features, such as the Piano Roll, can be debugged directly in Unity ...... 58

4-5 The Chord Suggestion component highlights keys that can form chords. In this figure, both major (green highlight) and minor (pink highlight) chords are suggested with shared keys being highlighted blue. . . . . 59

4-6 The Keyboard Labels component augments each key by labeling its corresponding note...... 60

4-7 SpectatorView image of AR PianoRoll representation of Beethoven’s Für Elise ...... 61

4-8 The knob can be used to see and control the position of the song being played (left) or the speed at which it is being played (right) ...... 62
4-9 At the end of a Song Tutorial, the keyboard is transformed into a visual representation of the user's performance. In this graph, it is clear that the user made more mistakes in a specific section near the beginning of the song...... 63
4-10 The NoteSpawner component emits cuboids as the user plays the keyboard in order to visualize rhythmic and tonal patterns in the song ...... 64
4-11 The Chord Detector component analyzes the keys as they are pressed in order to label chords when they are played. This figure also shows how multiple components (ChordDetector and KeyboardLabels) can be enabled at once...... 65

Chapter 1

Introduction

Mixed Reality (MR) technology allows us to seamlessly merge real and virtual worlds to create new experiences and environments. The term “Mixed Reality” was defined in 1994 and includes technologies such as Virtual Reality (VR) and Augmented Reality (AR). Even though this technology has been around for several years, it is just starting to become available to the average consumer with the introduction of devices such as the HTC Vive and the Microsoft Hololens. To make these products a reality, researchers had to devise new sensors and technologies, such as Microsoft's spatial understanding sensors and HTC's Lighthouses. Until now, most research has focused on creating new sensors along with their corresponding product, but little research has been published on the use of existing sensors and peripherals in a mixed reality application. As such, there may be lost or unknown potential available in existing sensors when used in an MR application. The Leap Motion, for example, initially started out as a desktop hand and finger tracker, but its true potential was later discovered to be in VR. The goal of this project is to integrate and explore the potential of various sensors in an MR context. Chapter 2 describes VRoom, a simple approach to 3D room scanning using only an HTC Vive and a 360 camera. Chapter 3 describes Mathland, a mixed reality application that enables users to interact with math and physics in order to better learn the underlying concepts. Chapter 4 describes ARPiano, an augmented keyboard that offers a different approach to music learning by adding intuitive visuals and annotations. Finally, Chapter 5 concludes and offers my closing thoughts on MR technology.

Chapter 2

VRoom

2.1 Introduction

Room-scale Virtual Reality was first introduced to general consumers in 2016 by the HTC Vive. Room-scale VR differs from traditional VR (e.g., Google Cardboard, Gear VR) in that it gives the user the ability to physically move within a virtual space with 6 degrees of freedom. This difference creates a much more immersive experience by allowing users to explore their virtual world by physically moving around. This technology has been mainly adopted for gaming, but it has great potential in many other applications. VRoom uses room-scale VR technology to map out an existing room in 3D space in order to allow others to experience that room from anywhere in the world. Before making important decisions, such as the purchase of a home or physical space, people want to make sure they fully understand what they are buying. However, due to a variety of circumstances, it is not always possible or feasible to be in the same physical location when making these decisions. Current solutions to this problem do not allow the user to truly get a feel for the space. For example, single photographs are commonly used for this purpose, but they fail to accurately show the room's size and surroundings. We have recently started to see real estate companies incorporate 360 photography technologies to allow potential buyers to gain an even better sense of the space. These technologies could easily be expanded to a VR

headset, where a user could feel as if they were in the location of interest. While this application of 360 photography in VR does provide a better feeling of the space, all observations are done from a single vantage point. The next “step up” in these technologies would be to allow the users to walk around the room and see it from different perspectives and angles. One way to get a real room into a virtual space is to observe it and digitally recreate all aspects and components of the room using 3D modeling software. While this approach may provide an accurate representation of the space, it misses out on many of the details present in the original space and requires a great amount of manual work to create the models. VRoom allows anyone to “scan” a room using a 360 camera and a Vive tracker in order to experience that room from anywhere at any time.

2.2 Current Research

Current research in this field has focused on synthesizing frames from slightly different vantage points given frames from a single vantage point. This allows the user to better experience a 360 picture because a new frame can be synthesized as the user moves their head, allowing for depth and parallax in the image. Uncorporeal has developed a technology which they call “holographic photography” [1]. They use a series of 20-30 images to reconstruct a three-dimensional environment with depth that can be viewed in VR with parallax. Similarly, Huang et al. present a system that uses a combination of structure-from-motion and dense reconstruction algorithms to reconstruct the 3D geometry of the scene [2]. They then use this scene, along with their warping algorithm, to synthesize new views from a single viewpoint. Both of these systems are able to accurately synthesize realistic frames to create a more immersive viewing experience. However, their scan coverage is limited. For example, in the Uncorporeal example, a user can only move a couple of inches in each direction before the system fades to black. VRoom provides a similar experience, but

allows the user to physically walk throughout a much larger area, up to 300 square feet (the maximum coverage of the HTC Vive tracking system). Another way to experience a physical space in VR is to create a textured 3D model using photogrammetry. This method involves taking several pictures of an object or space and analyzing the movement of various points from image to image. Google Earth, for example, uses photogrammetry to construct 3D models of buildings in major cities [3]. While photogrammetry can produce an accurate representation of an object or space, the resulting model requires significant computation to produce and lacks the realness of a photograph from any given angle. VRoom aims to achieve more realistic results than photogrammetry with a simpler algorithm that requires far less computational power.

2.3 System Overview

The goals of VRoom are twofold - to easily scan a room using consumer hardware and to allow anyone with a VR headset to place themselves in that room. To accomplish this I used two Kodak Pixpro 360s and a Vive tracker. Each Kodak camera has an ultra-wide lens with a 235-degree field of view. By placing two of them back to back, I can create 360 footage. The Vive tracker uses sweeps of infrared light in order to get 3D positional data with sub-millimeter precision. This tracker is generally used

Figure 2-1: VRoom system diagram. 360 Video and 3D positional data from the 360 cameras and Vive Tracker are fed into Unity in order to produce a VR application.

to track hands in a VR game, but for this project I attached the tracker to the 360 camera in order to get 3D positional data corresponding to each captured frame. The 360 frames, along with their positional data, are fed into Unity, which produces a VR application that allows a user to walk along the scanned paths of the room. A sample use case, from start to finish, might look as follows:

1. User sets up the HTC Vive Lighthouses in the room they want to scan

2. User opens the Capture scene [described in Section 2.4.1] in Unity

3. User begins recording with the 360 camera and starts the capture

4. User moves through the room, capturing different frames at different locations of interest

5. User stops the capture, stitches the 360 video, synchronizes it, and splits it into frames

6. User feeds the 3D tracking data from the Capture scene and the 360 frames into Unity to produce a VR app

7. Anyone can now experience the room that the user scanned

2.4 Implementation

VRoom is implemented in C# using Unity and SteamVR. SteamVR provides the interface to the Vive tracker and head-mounted display in order to get positional data for both. The Unity application is divided into three scenes: the Capture scene allows a user to scan a new room, the Viewer scene allows a user to experience the room as it was scanned, and the PointViewer scene allows a user to visualize a particular room scan.

20 2.4.1 Capture Scene

The Capture Scene, as the name suggests, is responsible for capturing 3D positional data as the 360 camera and tracker move through the room. At the end of a scan, a binary file is produced that contains a mapping between frame numbers and their corresponding positions. A scan is started by pressing the trigger on the controller, which also plays a loud beep to inform the user that the scan has started. This beep assists with synchronizing the camera frames to the Unity frames. The videos are then manually extracted from the cameras and stitched together using the included software. Finally, the 360 video is trimmed to start at the same frame as when the scan started (as denoted by the loud beep) and separated into individual frames. The Capture scene also shows the current frame being captured in order to test the synchronization of the system. More details on synchronization can be found in section 2.6.1.
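The core of this capture loop can be sketched as follows (a minimal illustration rather than the exact code used; class and field names are hypothetical): one tracker position is recorded per Unity frame, and the frame-number-to-position mapping is serialized to a binary file when the scan stops.

using System.Collections.Generic;
using System.IO;
using UnityEngine;

public class CaptureRecorder : MonoBehaviour
{
    public Transform viveTracker;                  // tracked object rigidly attached to the 360 camera
    readonly List<Vector3> framePositions = new List<Vector3>();
    bool capturing;

    public void StartCapture() { framePositions.Clear(); capturing = true; }

    void Update()
    {
        if (!capturing) return;
        framePositions.Add(viveTracker.position);  // one entry per Unity frame; the list index is the frame number
    }

    public void StopCapture(string path)
    {
        capturing = false;
        using (var writer = new BinaryWriter(File.Open(path, FileMode.Create)))
        {
            writer.Write(framePositions.Count);
            foreach (var p in framePositions)
            {
                writer.Write(p.x); writer.Write(p.y); writer.Write(p.z);
            }
        }
    }
}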

2.4.2 Viewer Scene

The viewer scene uses the 360 frames from the camera and the 3D positional data from the capture scene to allow a user to experience a scanned room and walk along the defined paths. It does this by placing the user (a Unity camera) inside a sphere with the equirectangular frame projected onto it as a texture. SteamVR controls the user camera and links it to the user's head movements. This way, when the user turns their head, the Unity camera will turn to face a different side of the sphere in order to look at a different part of the room. This technique allows the user to view the 360 picture with all three rotational degrees of freedom. The main innovation of the Viewer scene, and what sets it apart from a traditional 360 degree picture viewer, is that it also takes into account translational motion of the user to render new textures based on the user's position. Upon initialization, the viewer scene loads the 3D positional data for the room from the binary file generated by the Capture scene. It uses this data to construct a KD Tree that maps frame numbers to their corresponding position in 3D space. On each frame update, the

Viewer scene queries SteamVR to get the VR headset's position and uses the KD Tree to get the closest frame to the headset's position. This frame is then applied to the sphere's texture to create the illusion of movement through the room.

During early user studies (described in section 2.5), I quickly discovered that for a smooth VR experience, it is imperative that the system always reacts to the user's movements. This means that if a user moves towards a location where there is no new frame to show, the system still needs to react to the motion in some way. Some systems, like [1], simply fade to black as the user leaves the supported area, but I felt that this would detract from the room experience. Instead, I opted for a very primitive form of frame interpolation. If a user starts moving away from a frame, the system allows them to walk inside the sphere in which they reside. This creates an effect of objects getting closer as you walk towards them and works well as long as the user remains far enough away from the surface of the sphere. To achieve this, on every frame, the user's distance and direction from the closest frame are calculated and an offset is applied to their position relative to the center of the sphere. This technique gives the user a lot more freedom to move around the room without adding too much complexity to the system.

Figure 2-2: The viewer scene finds the best frame to render given the user’s position and renders it on the head mounted display.
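A minimal sketch of this per-frame update is shown below (illustrative only; the naming is hypothetical). A linear scan is used here for clarity, whereas the actual implementation replaces it with a KD tree query to stay fast enough for VR (see section 2.6.2). Keeping the sphere centred on the chosen frame's capture position is one way to realise the offset described above: as the headset moves away from that position, nearby surfaces of the projected frame appear closer.

using UnityEngine;

public class RoomViewer : MonoBehaviour
{
    public Transform headset;           // VR camera transform, driven by SteamVR
    public Transform sphere;            // sphere the equirectangular frame is projected onto
    public Renderer sphereRenderer;
    public Texture2D[] frameTextures;   // one texture per captured 360 frame
    public Vector3[] framePositions;    // loaded from the Capture scene's binary file

    void Update()
    {
        // Pick the captured frame closest to the headset's current position.
        int best = ClosestFrame(headset.position);
        sphereRenderer.material.mainTexture = frameTextures[best];

        // Primitive "interpolation": keep the sphere centred on the frame's
        // capture position so that walking away from it moves the user inside
        // the sphere relative to its centre.
        sphere.position = framePositions[best];
    }

    int ClosestFrame(Vector3 p)
    {
        // Linear scan for clarity; the real system uses a KD tree for O(log n) lookups.
        int best = 0; float bestDist = float.MaxValue;
        for (int i = 0; i < framePositions.Length; i++)
        {
            float d = (framePositions[i] - p).sqrMagnitude;
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }
}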

2.4.3 PointViewer Scene

The point viewer scene allows a user to visualize the scanned path for a particular room. The scene takes the tracking data produced by the Capture scene and renders each frame as a small sphere in its true 3D position. The user can walk around this scene to get an idea of which areas of the room are scanned. Additionally, the closest frame to the user - the one that would be shown in the Viewer scene - is highlighted in green. This feature was especially useful while debugging the application, as bugs can be extremely disorienting to track down in the Viewer scene. For example, at one point during development, the closest-point algorithm had a bug that caused random frames to be switched rapidly. The only way to see what was going on was to wear the headset, which was both uncomfortable and impractical.

Figure 2-3: The point viewer scene renders each frame as a sphere in the 3D position in which it was captured.
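A sketch of the PointViewer logic might look like the following (illustrative only; the marker size and naming are assumptions): one small sphere is spawned per captured frame position, and on each update the frame that the Viewer scene would currently select is recoloured green.

using UnityEngine;

public class PointViewer : MonoBehaviour
{
    public Transform headset;
    public Vector3[] framePositions;   // loaded from the Capture scene's binary file
    Renderer[] markers;

    void Start()
    {
        markers = new Renderer[framePositions.Length];
        for (int i = 0; i < framePositions.Length; i++)
        {
            var marker = GameObject.CreatePrimitive(PrimitiveType.Sphere);
            marker.transform.position = framePositions[i];
            marker.transform.localScale = Vector3.one * 0.05f;   // small marker sphere
            markers[i] = marker.GetComponent<Renderer>();
        }
    }

    void Update()
    {
        int best = 0; float bestDist = float.MaxValue;
        for (int i = 0; i < framePositions.Length; i++)
        {
            float d = (framePositions[i] - headset.position).sqrMagnitude;
            if (d < bestDist) { bestDist = d; best = i; }
            markers[i].material.color = Color.white;
        }
        markers[best].material.color = Color.green;   // frame the Viewer scene would render
    }
}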

2.5 Evaluation

I evaluated each of the three scenes described in section 2.4 in terms of their comfort level, performance, and best practices. I evaluated the system informally myself at several stages of the project and through a formal user study near its completion.

I ran the system on a desktop PC with an Intel Core i5 7600K CPU, 16GB of RAM, and an Nvidia GTX 1080 GPU with 8GB of VRAM. Scans ranged from 500 frames to 2000 frames. Performance was not an issue with this particular build, as the system was able to fetch and render new frames at the required 90 frames per second. Additionally, thanks to the sub-millimeter tracking provided by the HTC Vive, the scans themselves were accurate and to scale. For example, when viewing my room virtually while in the same physical room, I was able to feel different pieces of furniture as I moved around in the virtual scene.

Throughout the project, I evaluated and tweaked the Capture Scene in order to get sharper, more accurate scans of a room. I experimented with different scanning paths, including waving the camera up and down as I walked through the room and straight lines at different speeds. Waving the camera allowed me to get multiple frames at different heights in a single sweep, but the captured frames were blurry because of the constant motion of the camera. Straight lines worked well when viewed, with slower lines allowing for finer fidelity when walking since more frames were captured. However, a single straight line limits the overall scan coverage of the room. The best scan method was a combination of straight lines around the room, as it allowed for a smooth experience when walking through each line while maintaining a greater scan coverage of the room.

The user study focused on evaluating the viewing aspect of VRoom, that is, the Viewer Scene and the PointViewer Scene. I guided participants through several scans while changing variables such as frame interpolation and sphere size. I then let users roam freely through a variety of scans, encouraging them to comment aloud as they explored the room. The point viewer feature felt natural to all participants and allowed them to quickly see which parts of the room were "supported" in the current scan. Additionally, every participant reported a much more comfortable experience when frame interpolation was turned on. For the scanned room, a sphere radius of 5 (Unity units) seemed to provide the most realistic frame interpolation.

As expected, users reported discomfort when the view jumped between two non-adjacent frames, but they were surprisingly pleased with the comfort level and realism when walking along the defined path. An interesting observation from this part of the study is that more frames do not necessarily mean a smoother experience. My initial assumption was that the more frames there are, the more likely the system would be to find a good frame to display for the user's current position. However, due to slight variations in height, orientation and lighting inherent to the capturing phase, the transition between close frames that were shot at different times was less smooth than the transition between sequential frames.

2.6 Challenges

Working with Unity and SteamVR was relatively straightforward. Documentation for both was easy to find online and I did not notice any major discrepancies between the code and documentation. However, there were a couple of other aspects of the project that proved to be more challenging, namely synchronization and VR comfort.

2.6.1 Synchronization

VRoom consists of several detached components that must be perfectly synchronized in order to provide a smooth experience. First, the two 360 cameras must be synchronized to each other with frame accuracy. If the footage of one camera is a couple of frames off from the other, the resulting stitched image will not be accurate or smooth. The Kodak Pixpro cameras include a remote that is supposed to start and stop both cameras at the exact same time. However, in my tests, the cameras started at the same time but did not end together. This confused the stitching software and produced inaccurate scans. To fix this, I relied on the loud beeps produced by the capture scene to make sure both cameras were synced before stitching. The stitched 360 video and the captured 3D positional data are also initially detached and must be synchronized with each other to ensure that the positions for each frame actually match up with where they were taken. To make sure both the camera and the Unity tracker started capturing data at the same time, I began the

recording on the camera first, and trimmed the stitched footage to start on the same frame as the beep from the Capture Scene. The main challenge to ensure synchronization between the Unity Capture Scene and the camera recordings had to do with frame rates. The cameras record at 29.97 frames per second while Unity in VR runs at 90 frames per second. Unity usually allows developers to change the native frame rate, but this setting is ignored for all VR applications, presumably to ensure that users have a consistently smooth VR experience. I found a couple of “hacks” online that force Unity to run at a given frame rate. While these hacks did indeed change the frame rate, they were not precise enough for my use case. Instead of staying at each frame for .03336 seconds (the amount of time required to achieve a frame rate of 29.97), Unity would often overshoot to numbers as high as .034 seconds. While this small difference may seem insignificant, it quickly adds up over many frames. For example, in one test case, by the time Unity thought it was capturing frame 1000, the cameras were actually shooting frame 1020. This frame irregularity essentially rendered the scan useless since the captured frames no longer had the correct positions assigned to them. I dealt with this issue by modifying the frame rate hacks to account for each frame overshoot and offset the next frame accordingly. These changes were tweaked over time as I did more synchronization tests. To test synchronization, I set the Capture Scene to show the current frame on the screen as it was capturing. I then started recording the screen while a scan was in progress. After stitching and syncing the videos, I was able to manually check that the frame shown on the screen matched the actual frame number shot by the camera. While a discrepancy of a couple of frames every so often is acceptable, recorded frames should not progressively deviate from scanned frames (which would indicate that one medium is recording at a slower speed than the other).
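One way to keep a camera frame counter aligned despite Unity's frame-time overshoot is to advance it against an ideal clock instead of counting Unity frames, so that individual overshoots never accumulate. The sketch below illustrates this idea; it is not the implementation used here, which instead modified the frame rate hacks to compensate for each overshoot directly.

using UnityEngine;

public class CaptureFrameClock : MonoBehaviour
{
    const float CameraFrameTime = 1f / 29.97f;    // target duration of one camera frame
    float nextCaptureTime;                        // when the next camera frame is due
    public int CameraFrameIndex { get; private set; }

    void Start()
    {
        nextCaptureTime = Time.realtimeSinceStartup + CameraFrameTime;
    }

    void Update()
    {
        // Advance the camera frame index based on real elapsed time. Even if a
        // Unity frame overshoots its target duration, the schedule is always
        // computed relative to the ideal clock, so the error does not add up.
        while (Time.realtimeSinceStartup >= nextCaptureTime)
        {
            CameraFrameIndex++;
            nextCaptureTime += CameraFrameTime;
        }
    }
}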

2.6.2 VR Comfort

The second challenge had to do with comfort while using the VR application. Several studies, such as [4] and [5] have concluded that motion sickness in VR is caused by conflicts between the visual and vestibular systems. Therefore, it is extremely

important that a user's motion gets accurately reflected in the virtual environment as quickly as possible. The most expensive operations when viewing a room scan are the closest-neighbor calculation and the frame fetch. I realized that in order to support scans consisting of thousands of frames while still maintaining the speed needed to have a smooth VR experience, a simple linear scan through all the frames would not work. To overcome this issue, and gain significant performance, I looped over all frames once upon initialization in order to construct a KD tree. The KD tree allows me to search for the closest point to the headset in O(log n) time, compared to the O(n) time a linear search would take. This is a significant improvement considering this operation must occur 90 times a second. For frame fetching, I decided to split the video into individual JPEG frames since my initial tests showed improved random fetch performance compared to seeking frames from a video. With these two improvements implemented, VRoom is able to offer a comfortable experience, though more improvements can certainly still be made, as described in section 2.7.
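For reference, fetching one pre-split JPEG frame in Unity can be as simple as the sketch below (the file naming scheme is an assumption). Decoding a JPEG on every update would not sustain 90 frames per second on its own, so in practice frames around the user's position would need to be cached or preloaded.

using System.IO;
using UnityEngine;

public static class FrameLoader
{
    // Loads frame `frameNumber` from a directory of pre-split JPEGs,
    // e.g. "frames/00042.jpg" (naming is illustrative).
    public static Texture2D LoadFrame(string frameDirectory, int frameNumber)
    {
        string path = Path.Combine(frameDirectory, frameNumber.ToString("D5") + ".jpg");
        byte[] bytes = File.ReadAllBytes(path);
        var texture = new Texture2D(2, 2);   // size is replaced by LoadImage
        texture.LoadImage(bytes);            // decodes the JPEG into the texture
        return texture;
    }
}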

2.7 Future Work

VRoom is noticeably more immersive compared to a single 360 picture and can cover a much greater space than current research. However, there are several areas that could improve the overall experience. First, as described in section 2.4.2, the current implementation of frame interpolation is very basic. It would be interesting to apply current frame synthesis techniques to each frame, such as the ones described in [1] and [2]. The key challenge here is being able to synthesize new frames quickly enough to keep up with the 90 frames per second required for VR. This improvement would provide a more immersive and comfortable experience when deviating from the scanned path. To maximize comfort, transitions between synthesized frames and real frames should be as smooth as possible as well. Currently, as described in 2.6.1, synchronization must be perfect in order to ensure

a smooth experience. The main barrier to synchronization with the current implementation is that the captured 360 frames are initially detached from the captured positional data. These are then manually reattached and synced for each room. A better solution to this problem would remove this detachment and capture both the video and positional data from the same device. This could be achieved by hooking up the HDMI out port of the cameras to a USB capture card. Unity could interface with this card in order to record the frames and position data at exactly the same time. This would provide better synchronization while making the capturing process easier.

A main advantage of VR is the ability to show each eye a different image. This is used by many games and applications to create the illusion of depth. VRoom currently shows the same image to both eyes, but a more immersive experience would show a slightly different image to each eye in order to make each frame appear 3D. While regular (non-360) 3D photos require two cameras, 3D 360 photos require a rig of four 360 cameras in order to get 3D information in all directions. This addition would further complicate the capture process and double the number of frames that need to be fetched, but it could greatly increase the immersiveness of the application.

During the user study, some users noticed that the view seemed shaky even when they walked straight. This is expected given that the scans are currently done by hand. A smoother experience could be achieved by applying video stabilization to the captured frames. This would require stabilizing 360 frames, which is much more complicated than stabilizing non-360 frames, as explained in [6] and [7]. It is also important to apply the same, or a similar, smoothing algorithm to the captured positional data itself, since not doing so could reintroduce the shakiness into the final view.

The current scanning method relies on a person manually scanning the room. This method introduces many slight variations in height, roll, pitch and yaw while the camera moves through the room. Also, the person scanning the room appears in each of the frames, which detracts from the room experience. The current scan method could be greatly improved by using small autonomous drones. The drones would be small enough to be automatically removed from each 360 frame and would

28 be much more accurate at maintaining a steady camera than a human.

2.8 Conclusion

VRoom is able to leverage room-scale VR technology and a 360 camera in order to map out different frames of a room in three-dimensional space. A user can then move through this space in order to see the room from different vantage points, creating a clearer sense of scale than a single 360 picture. VRoom allows anyone to experience a real room in virtual space from anywhere in the world. This technology could be used in fields such as real estate and tourism in order to provide a better sense of a space without being physically present.

Chapter 3

Mathland

3.1 Introduction

Mathland is a mixed-reality application that augments the physical world with interactive mathematical concepts and annotations. Through the use of an IMU, stretch sensor, haptic motor and camera, Mathland opens up new opportunities for mathematical learning in an immersive environment that affords situated learning, embodied interaction and playful cognition. Mathland was created as a joint effort with Mina Khan, Ashris Choudhury and Judith Sirera, with Mina Khan having initially conceived the project.

3.1.1 Goals

It is believed that by confining mathematics education to textbooks and timed tests, schools not only take the fun and creativity out of mathematics, but also limit the educational value of mathematics [13] and contribute to the widespread problem of mathematical anxiety [11, 12]. Science in particular is difficult for students to understand due to the lack of real-life visuals such as forces acting on an object [14]. We use Mixed Reality to render 3D visualizations and utilize our integrated sensors to control objects and instantly see the results of different forces acting on an object. Mathland aims to make learning mathematics fun by enabling users to

31 explore and experiment with mathematical objects in their real world. Mathland accomplishes these goals by combining situated learning, embodied interaction and playful cognition.

Situated Learning

Simply put, situated learning refers to learning in one's own environment. In traditional schooling, students learn from a textbook, which does not match their real three-dimensional environment. We use MR to bring the visualizations into the real world while the integrated sensors allow users to intuitively change parameters. Furthermore, we use the spatial mapping capability of the Hololens to bring the real world environment into the virtual world so that virtual objects may interact with the real environment. For example, a virtual ball can bounce off a real wall. Bringing the real world environment into the virtual world opens up unique opportunities as the virtual world does not have to be bound by real physical laws. Using the real environment in the virtual world, a user can, for example, experience what it would be like if their room was on Mars. Research suggests that cognition is intrinsically linked to people's physical, social and cultural context [15, 16, 17]. Situated learning ties a person's learning to their context such that learning is inseparable from a user's activity [18]. Mathland allows for a unique form of situated learning as the users learn in their real environment using real objects while not being limited to only real-world possibilities.

Embodied Interaction

Cognitive processes are tied not only to people's context, but also to their interaction with their physical environment [19, 20]. Body movements such as gestures are considered cross-modal primes that facilitate the retrieval of mental or lexical items [21]. Mathematical ideas, in particular, have been connected to bodily experiences [30], and research suggests that at least some mathematical thinking is embodied [22]. Students show improved math performance when their gestures are in line with their mathematical thought process [23]. Mathland supports embodied interaction through

32 the camera-based Object Controller and the wearable IMU-based Arm Controller (see Section 3.2.2).

Playful Learning

Research has shown that playful activities can support and enhance learning [24, 25]. Gee writes that kids experience a more powerful form of learning when they play games compared to when they are in the classroom [26]. In computer games, learning occurs as a result of the tasks involved in games, and users' skills develop as they progress through the content of the game [27]. It has also been shown that students who learn using games in math class outperform the ones who do not [28]. Mathland enables playful learning by including a Puzzle experience in which users must make the ball travel through a series of ordered checkpoints using the available objects (described in Section 3.3). Both solving and creating Mathland puzzles challenge users to think carefully and creatively about the menu objects available and thus encourage constructionist learning in a playful and interactive mathematical environment.

3.2 Integrated Sensors

Mathland allows the users to move, rotate, and resize objects in the virtual world. Additionally, Mathland can track a user's arm to support embodied interactions, such as throwing a ball. In order to support the interactions required to create Mathland, we had to integrate several existing sensors and peripherals. We used the Hololens camera and AR markers for 3D positioning, an IMU and stretch sensor for rotating, resizing and arm tracking, and a haptic motor for feedback.

3.2.1 Hololens Camera

The Hololens comes equipped with a built-in 720p camera that allows a user to capture POV videos and pictures of the holographic world. In Mathland, we use this camera to detect a Multi Target Vuforia marker in 3D space. This allows the user to position

33 any object in their virtual world by simply moving the tracker to the desired location in the physical world. Additionally, we used the Hololens camera to detect and track simple objects using OpenCV. This allows us to track a real ball in real-time and trace its speed and trajectory in the real world.

3.2.2 F8 Sensor

Figur8 is a startup based in Boston that creates Bluetooth Low Energy (BLE) trackers for use in body analysis in physical therapy [31]. The device works by combining a 6-axis IMU with a capacitive stretch sensor and a BLE radio for transmitting the data to a phone. We partnered with figur8 for this project and were able to integrate their trackers with Mathland.

While figur8 provided the hardware necessary to get rotation and stretch data, they only had software written to interface their trackers with iOS. We needed the BLE device to interface with the Universal Windows Platform (UWP), which the Hololens uses. To do this, I researched the BLE specifications and got the relevant GUIDs for the services and characteristics of the sensor. Integrating BLE with a UWP application was not straightforward, as the Windows Bluetooth libraries are fairly new and not well documented. Additionally, Unity and the Hololens run different versions of the .NET platform, which further complicates the integration as the Bluetooth library is only supported on newer .NET platforms.

When connected, the figur8 sensor begins transmitting the rotation data (as quaternions) and the stretch data (as an integer) over BLE. The rotation data needs to be further processed in order for the integrated IMU to match the orientation of the device itself. I did this by multiplying the raw quaternion with multiple quaternion offsets, which were obtained through trial and error, and with the help of figur8 engineers. The stretch data ranges from 0 to 100 depending on how much the sensor is stretched. This integer did not need any further processing and was passed on to the application as is.

Figure 3-1: (a) The Arm Controller consists of a compression sleeve that can be worn by the user. There is a figur8 sensor, which consists of an IMU and stretch sensor. (b) and (c) Tracking the user's arm movements using the Arm Controller

Arm Controller

I was able to bring a user’s real arm into the virtual world by attaching one figur8 sensor at the elbow joint using a compression sleeve (Figure 3-1a) such that the sensor gets stretched as the user bends their arm (Figure 3-1b). I further processed the IMU data and mapped it to the rotation of the shoulder of a virtual arm and mapped the stretch data to the bend of the elbow. This allows the virtual arm to precisely follow the movements of the user’s arm.

For calibration with users of different heights, the arm starts extended downwards in a fixed position. The user is then instructed to “walk into” the virtual arm so that it matches up with their real arm while they turn on the sensor. This permanently attaches the arm to the user by keeping the distance between the Hololens and the arm fixed as the user walks around the space. A correctly calibrated arm allows the user to walk around the space while using their virtual arm to interact with virtual objects. This calibration technique also allows the user to fix the arm to different locations relative to their head, effectively “extending” their arm in the virtual world.

For Mathland, we used the arm controller to bridge the gap between the real and virtual worlds and allow for embodied learning. The arm is able to interact with the virtual objects and throw balls, for example. Furthermore, since the arm is virtual, it does not have to be constrained by human or physical constraints. For example, the arm could be made much larger and stronger so that the user can pick up much larger virtual objects (such as a car) using their real arm.

35 The Arm Controller is novel in this field as unlike traditional VR controllers, it does not rely on external trackers or require the user to hold a bulky controller in their hand. This allows the user to forget about the controller and interact with the virtual objects in a more natural way.

Tangible Object Controller

The Tangible Object Controller allows users to move, rotate and resize objects in an intuitive, tangible way. It uses a figur8 sensor housed inside an AR marker cube (Figure 3-2). The Object Controller can be “attached” to any object in the scene using voice commands. After attaching the controller to an object, any modifications made to the controller are reflected on the virtual object. As mentioned above, I used the Hololens camera and multi-target Vuforia marker to detect the three-dimensional position of the Object Controller as the user moves it around 3D space. Furthermore, I mapped the IMU and stretch data from the figur8 sensor to the rotation and scale of the object, respectively. Since there is no correct way to hold the Object Controller, additional processing of the IMU data needs to be done in order to ensure smooth rotations. When the Controller is first attached to an object, the Quaternion rotation of the Controller is queried and its inverse is calculated and stored. Then, any future sensor Quaternion rotations are multiplied by the stored inverse before being applied to the virtual object.

Figure 3-2: Object controller that enables tangible interactions with virtual objects in Mixed Reality.

This simple calculation makes the rotation steady when the controller is first attached (since mappedRotation = currentRotation * Quaternion.Inverse(initialRotation) equals Quaternion.identity when currentRotation = initialRotation) and ensures all rotations are relative to the initial rotation of the controller.
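The mapping can be captured in a few lines (a sketch with illustrative names): the inverse of the controller's rotation at attach time is stored once, and every subsequent sensor rotation is expressed relative to it.

using UnityEngine;

public class AttachedObjectRotation
{
    Quaternion inverseInitialRotation = Quaternion.identity;

    // Called once, when the controller is attached to a virtual object.
    public void Attach(Quaternion controllerRotation)
    {
        inverseInitialRotation = Quaternion.Inverse(controllerRotation);
    }

    // Called for every new sensor reading; the result is applied to the virtual object.
    public Quaternion Map(Quaternion currentControllerRotation)
    {
        // Identity when the controller is still in its attach-time orientation,
        // so the virtual object does not jump when the controller is attached.
        return currentControllerRotation * inverseInitialRotation;
    }
}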

The Object Controller is novel as it does not require any additional sensors for positioning and allows for very intuitive interactions with objects - to rotate an object you rotate the controller, to stretch an object you stretch the controller, and so on. Because the output of the controller (the moving, stretching and rotating of the virtual object) is so closely aligned with the input to the controller (the moving, stretching and rotating of the controller), these interactions also provide much better haptics and feedback than other solutions on the market today.

Controller Haptic Feedback and Vibration Motor

The Object Controller also contains a force sensitive resistor (FSR) and a vibration motor. The FSR is used to register when the user grasps the controller, allowing them to select an object without using voice commands. The vibration motor is powered by Adafruit's Haptic Motor Controller and can provide feedback in response to the FSR input. The FSR and haptic motor are connected to a Feather BLE module, which can interface with the Hololens using code similar to that used for the figur8 sensor.

Figure 3-3: The Microsoft Hololens headset (for MR experience) with our Object Controller and Arm Controller

37 3.3 Design

The Arm and Tangible Object Controllers were used to create Mathland (Figure 3-3). Using MR on the Microsoft Hololens, we allow the users to see the world through a mathematical lens and immerse themselves in an interactive mathematical playground. Mathland is implemented as a Universal Windows Platform application using Unity 5.6.2 and Visual Studio 2017. Additionally, we used Holotoolkit-Unity, OpenCV, Vuforia, and the Windows Bluetooth library to support our interactions. Currently, Mathland allows users to place five different types of objects (Figure 3-4). All of the objects can be rotated, resized and repositioned using the Object Controller. The detailed behavior of the objects is as follows:

Force Field: The Force Field is a cube with arrows that applies a constant linear force in the direction of the arrows on all objects within its boundaries. A user can rotate the arrows to change the direction of the applied force, or resize them to change the magnitude. The Force Field can also be made full scale so that it applies a constant linear force on all objects in the virtual world. A full-scale Force Field pointing downwards, for example, can be used to simulate gravity.

Velocity Vector: The Velocity Vector is an arrow that applies an instantaneous force to an object. The Velocity Vector adds to the instantaneous velocity of any object that comes in contact with it. Like the Force Field, rotating or resizing the Velocity Vector will change the direction and magnitude of the applied force, respectively.

Ramp: The Ramp is a simple solid object that can be used to direct the motion of the ball along a path.

Figure 3-4: Five virtual items in Mathland’s menu to allow the user to experience different physical forces and interactions.

Cube: The Cube is a simple solid object that can act as a barrier or building block.

Rope: The Rope consists of a fixed end, which can only be moved by the user, and an un-fixed end to which other objects can be attached. The Rope can be used to model circular motion by applying a Velocity Vector to a ball attached to the end of the rope.

The user can use these virtual objects with the real world environment to explore their mathematical curiosities. The benefit of virtual objects is that they can respond to both virtual and real world objects, compared to real objects, which are influenced only by real objects and forces. As a result, the user can experiment with physically difficult or impossible scenarios.

3.4 Sharing

Sharing allows multiple people to experience Mathland together. This means that everyone sees the same objects and interactions in the same physical space. This allows for collaboration in the learning process as users can reference the same virtual objects by pointing and gesturing in their real space. Sharing in Mathland was implemented with the Holotoolkit Sharing Service, which provides an interface to broadcast custom messages to all devices in the session. The sharing component makes sure all user interactions are broadcast to all other users so that everyone sees the same modifications to virtual objects. The Shared Anchor (described in section 3.6.2) provides a common point in the physical space so that all users see the virtual objects in the same physical location.

3.5 Evaluation

We designed a preliminary experiment to evaluate the user experience in Mathland. We used the Puzzle experience in Mathland to build three puzzles that targeted different motions, which are taught in basic Newtonian physics: Circular Motion

Figure 3-5: Three puzzles created by arranging the virtual checkpoints in MR. The left side images show the puzzle whereas the right side ones show the solution. Each puzzle corresponds to a specific physics concept: a) Circular Motion (using a ball with a rope and velocity vector) b) Linear Motion (using a ball with a downward Force Field, two Ramps, and a Velocity Vector) c) Projectile Motion (using a downward Force Field, and a Velocity Vector)

(Puzzle 1), Linear Motion (Puzzle 2) and Projectile Motion (Puzzle 3). Figure 3-5 shows the relative positions of the checkpoints for each puzzle. In order to solve each of the puzzles, the user had to understand the mathematics/physics concepts associated with the five virtual items in Mathland’s menu. For Puzzle 1 (Circular Motion), the user had to notice that since the checkpoints are arranged in a circular orbit, the ball needs a centripetal force. This centripetal force can be created by attaching the ball to a Rope object, and giving the ball an initial velocity that is perpendicular to the rope so that the virtual ball starts swinging in a circular motion. To solve Puzzle 2 (Linear Motion), the user could place inclined ramps between the checkpoints and create a full-scale downward Force Field to resemble gravity (since there is no gravity by default in the world, and without a downward force, the ball

40 does not move down the ramps). For Puzzle 3 (Projectile Motion), in addition to the downward gravity-like Force Field, the user needed a Velocity Vector to launch the ball at an angle such that it follows a parabolic path, i.e. projectile motion. The three puzzles and their respective solutions are shown in figure 3-5. We conducted the experiment with two participants in one trial to evaluate the collaborative learning experience in Mathland. The two participants shared a common Mixed Reality world, and had a total of 30 minutes to solve all the puzzles. We gave only one Object Controller to each participant pair to encourage more collaboration and interaction between participants.

3.5.1 Study results

We had a total of 30 participants (22 female and 8 male; 8 in the age group 16-20, 21 in the age group 20-35, and 1 above 35). 28 participants performed the study in pairs and the remaining 2 performed the study individually because of scheduling conflicts. The two people in each trial were paired based on their time availability. As we were observing the study, we noticed that despite some problems with the Hololens, most participants were having a lot of fun. They were solving puzzles lying down on the floor, playing around with the different elements, and high-fiving each other when completing a puzzle. We presented our participants with a post-study survey, which had six five-point Likert-scale questions and three open-ended questions. Figure 3-6 shows the results for the six Likert-scale questions, which can be categorized as follows:

i. Engagement: Our users were thoroughly engaged in Mathland; 28 (93.3%) out of the 30 participants agreed (20%) or strongly agreed (73.3%) that they “would like to use Mathland with someone else they know” (figure 3-6a). Moreover, 26 (86.6%) of our participants agreed (25%) or strongly agreed (70%) that they “liked being in a 3D Mathland” (figure 3-6b).

ii. Learning: The participants found the puzzles to have educational value, as 27 (90%) of them agreed (23.3%) or strongly agreed (66.7%) that “people unfamiliar with much mathematics/physics can learn something in Mathland” (figure 3-6c).

41 Figure 3-6: Results from our survey with 30 participants. The responses were based on a 5-point Likert scale: Strongly agree, Agree, Neutral, Disagree, Strongly Disagree. Charts a) to f) each correspond to one question on our post-study survey.

iii. Problem-solving: All of our participants not only solved all three puzzles, but also enjoyed solving those puzzles. 29 (96.6%) of them agreed (13.3%) or strongly agreed (83.3%) that “Mathland can make problem-solving fun” (figure 3-6d), and 25 (83.3%) agreed (6.7%) or strongly agreed (76.7%) that they “would like to solve more puzzles in Mathland” (figure 3-6e).

iv. Collaboration: During the study we observed that partners collaborated effectively to problem-solve. In the post-study survey, out of the 28 participants who did the study with another partner, 26 (92.8%) agreed (14%) or strongly agreed (78.6%) that “having a partner made Mathland fun” (figure 3-6f).

3.6 Challenges

Developing Mathland posed different challenges - mostly related to developing on a still-evolving environment like the Hololens. The first hurdle involved getting used to a new environment with unusual deployment and debugging methods. Another challenge was making sure that all players share the same coordinate system so that they can see the virtual objects in the same location. Lastly, we had to build a separate rig to be able to showcase Mathland in video form.

3.6.1 Hololens Development

The Hololens development environment is constantly evolving to better fit the needs of developers. The Hololens runs applications on the Universal Windows Platform, which can be built with Microsoft Visual Studio. Unity is used as the main game engine, and packages such as Holotoolkit-Unity greatly help with managing the Hololens essentials such as Spatial Mapping and Sharing. However, many other libraries, such as Obi Solver (for physics simulations), were not supported on this platform. Since this environment is still relatively new, it suffers from some issues not present in more established environments like Android and iOS.

Deployment, for example, consists of a two-step process where a Visual Studio solution must first be generated from Unity before it can be deployed to the Hololens. This extra step slows down development, as two build processes must complete for each test iteration. Additionally, there are times where deployment to the Hololens will inexplicably fail and the build process must be started over. This two-step deployment process also makes it harder to debug, as debug/crash statements don't have an associated filename or line number.

The Hololens environment is also rapidly changing. For example, the tools we used to build the application were deprecated by the time we ran our user studies. This rapid evolution also meant that the documentation did not always keep up with the current version of the code. Such mismatches were time-consuming to find and work around, but we were able to share our discoveries with the community through pull requests and forum answers.

Additionally, since the Hololens is aware of its surroundings via spatial mapping and wireless radios, the same code can behave differently in different physical spaces. For example, Anchor Sharing (see section 3.6.2) requires the 3D spatial mapping to be serialized and shared, but this process can fail depending on the surrounding location, since objects like screens or glass can produce very different spatial maps on each Hololens, causing them to be out of sync in one room but not another.

We also ran into a particularly difficult bug where all five of our Hololenses suddenly kept crashing on initialization while running the same code that had worked hours earlier in the same room. We reverted several commits to no avail and even tried factory-resetting some of the Hololenses. The next day, we launched the app in a different location and had no issues with any Hololens. After several hours of debugging and frustration, we discovered that the crash was caused by a rogue Bluetooth device that was sending corrupt advertisement packets. The packets would make the Windows Bluetooth library crash as soon as the name of the device was queried. This explains why all Hololenses suddenly stopped working (the rogue device was turned on) and why everything seemed fine in a different location, outside the range of the rogue device. We patched this issue by blacklisting the BLE addresses of any device that had previously caused the app to crash.
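A minimal sketch of what such a blacklist can look like is shown below, assuming the scan is done with the UWP BluetoothLEAdvertisementWatcher API (the class name and the way the bad addresses are supplied are illustrative; persisting the blacklist is omitted): advertisements from blacklisted addresses are dropped before any of their fields, including the device name, are touched.

using System.Collections.Generic;
using Windows.Devices.Bluetooth.Advertisement;

public class SafeBleScanner
{
    readonly HashSet<ulong> blacklist = new HashSet<ulong>();   // addresses known to crash the app
    readonly BluetoothLEAdvertisementWatcher watcher = new BluetoothLEAdvertisementWatcher();

    public SafeBleScanner(IEnumerable<ulong> knownBadAddresses)
    {
        foreach (ulong address in knownBadAddresses) blacklist.Add(address);
        watcher.Received += OnAdvertisementReceived;
    }

    public void Start() => watcher.Start();

    void OnAdvertisementReceived(BluetoothLEAdvertisementWatcher sender,
                                 BluetoothLEAdvertisementReceivedEventArgs args)
    {
        if (blacklist.Contains(args.BluetoothAddress))
            return;   // ignore devices that previously caused the crash

        string name = args.Advertisement.LocalName;   // only read for non-blacklisted devices
        // ... continue with normal figur8 device discovery ...
    }
}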

3.6.2 Hololens Anchor Sharing

Collaboration is at the core of Mathland, as solving problems with another person has been shown to help both users learn the material better. In order to ensure smooth collaboration between two or more users, we needed to make sure virtual objects appear in the same location for all users. It is important to note the difference between the sharing component and anchor sharing. The sharing component (described in Section 3.4) guarantees that updates from one player (i.e., rotating a ramp) are propagated to all users, while anchor sharing guarantees that all users see the ramp in the same physical location. The first player to launch the app is prompted to place an anchor in their world. Any other players that join afterwards should be able to see the anchor in the same location. From there, the anchor library can be used to make sure objects appear at the correct coordinates for each user. However, getting anchor sharing to work as expected was not as straightforward as it appeared. To understand why, it is useful to know how spatial sharing actually works. Internally, the player that places the anchor collects spatial information about the anchor’s location and serializes it before sending it to the server.

Any new player that joins then downloads this information (about 13 MB), deserializes it, and tries to use its own spatial mapping data to “match” the anchor location as described by the first Hololens. If a good match is detected, the anchor is placed; otherwise the app waits until it can collect enough information to get a match. Since each Hololens creates its spatial map independently, the maps are not guaranteed to match, especially when dealing with tricky objects like glass and with people walking around. If the first Hololens describes the location of the anchor using features not present in the other Hololens’s map, the two will never find a match and the app will hang forever.

To minimize the number of failed matches, it was helpful to understand how the system works. First, it is important to make sure the spatial map is accurate on all Hololenses before matching. This ensures that the description of the anchor location will be good enough for the other Hololens to find a match. Second, the anchor must be placed near well-defined features, as these are what gets sent to the other Hololenses. For example, placing the anchor in the middle of a wide open space makes it much harder for the other Hololens to find the right location, since there are not as many features surrounding the anchor. Similarly, placing the anchor near glass yields the same unacceptable results. Finally, during our testing, we noticed better success rates when the apps were started facing the direction where the anchor was placed.

When it works, the entire anchor sharing process takes between one and five minutes, but sometimes multiple tries are necessary before getting a successful match. This process must complete before the core of the app can launch, which dramatically decreases the speed of development. To overcome this, we initially developed components independently so that we could test them without going through the long anchor sharing process.
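For reference, the sketch below shows the general shape of this export/match flow using Unity’s WorldAnchorTransferBatch API (HoloToolkit’s sharing code wraps a similar flow); the anchor name, the SendToServer call, and the retry handling are assumptions for illustration rather than Mathland’s actual implementation.

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.XR.WSA;
using UnityEngine.XR.WSA.Sharing;

public class AnchorSharingSketch : MonoBehaviour
{
    private readonly List<byte> exportedData = new List<byte>();

    // Player 1: serialize the anchor's spatial description and ship it to the server.
    public void ExportAnchor(GameObject anchorObject)
    {
        var anchor = anchorObject.AddComponent<WorldAnchor>();
        var batch = new WorldAnchorTransferBatch();
        batch.AddWorldAnchor("shared-anchor", anchor);

        WorldAnchorTransferBatch.ExportAsync(
            batch,
            data => exportedData.AddRange(data),              // serialized chunks arrive incrementally
            reason =>
            {
                if (reason == SerializationCompletionReason.Succeeded)
                    SendToServer(exportedData.ToArray());     // hypothetical network call
            });
    }

    // Players 2+: deserialize the data and try to match it against the local spatial map.
    public void ImportAnchor(byte[] downloadedData, GameObject anchorObject)
    {
        WorldAnchorTransferBatch.ImportAsync(downloadedData, (reason, batch) =>
        {
            if (reason == SerializationCompletionReason.Succeeded)
                batch.LockObject("shared-anchor", anchorObject);   // a local match was found
            else
                Debug.Log("Anchor match failed; more spatial data is needed before retrying.");
        });
    }

    private void SendToServer(byte[] data) { /* transport-specific, omitted */ }
}
```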

3.6.3 Showcasing AR Work

Since the Hololens is still a new and expensive piece of technology, its availability to consumers is very limited. As such, it is difficult to explain Mathland to people who do not have access to the same technology.

While the Hololens does include a camera that can overlay virtual objects on top of the camera feed, the resulting quality is poor and not suitable for presentation. We used Microsoft’s Spectator View to record higher quality videos. Spectator View consists of a hardware camera rig and compositor software. It works by having the camera and the Hololens record the video feed and the holograms, respectively. The compositor software then takes these two inputs, applies the appropriate mathematical transformations, and outputs a final video. The hardware rig attaches a Hololens on top of a camera and was fairly straightforward to build using Microsoft’s guide [29]. The compositor software requires that all app updates are visible in the Unity editor, which required some restructuring of our code. With Spectator View working, we were able to record high quality videos of Mathland [8].

3.6.4 Hardware Limitations

My initial tests revealed that, unlike VR, where 3D-tracked controllers provide smooth interaction, the Hololens relies purely on air gestures for input and does not provide 3D tracking capabilities. This limitation makes it difficult to support simple interactions in an application. Additionally, air gestures are difficult to perform and lack the tangibility and haptics of a physical controller. To get around this limitation, we used the Hololens’s camera, computer vision, an IMU, and a stretch sensor to create a Tangible Object Controller (see Section 3.2.2).

The Hololens display does an impressive job of rendering objects in the scene at 720p resolution. However, the field of view of the display is very limited, at around 40 degrees. This means that the user has to look around in order to see the entire scene. I purposely did not address this hardware limitation, as I believe the field of view will be greatly improved in future augmented reality devices. For example, the Meta headset, which is already available to consumers, doubles the field of view of the Hololens.

3.7 Future Work

In this chapter, we integrated the three key components we believe are necessary for enabling a Mixed Reality Mathland: embodied interaction (Object and Arm controllers), situated learning (OpenCV and spatial analysis), and constructionist methodologies (Newtonian puzzles and visualizations). We look forward to exploring these three areas further in the following ways:

(i) Conduct user studies to evaluate the Object and Arm controllers: We evaluated the overall learning experience of users in the study described in this chapter, but we plan to conduct more studies to evaluate the interaction design of the Object and Arm controllers in the context of Mathland as well as other MR applications.

(ii) Explore computer vision approaches to recognize 3D objects: The Hololens can do generic spatial mapping for walls, ceilings, etc., but cannot detect the 3D structure of individual objects. To make Mathland more deeply connected to the real world, we plan to recognize the 3D shapes of real-world objects so that the user can create 3D virtual objects by scanning real-world objects.

(iii) Improve the field of view of the existing cameras: Mathland’s computer vision capabilities are currently limited to the Hololens’s camera, and thus objects outside the user’s view are not tracked. We hope to be able to track more objects outside the user’s field of view without using external trackers so that, for example, a user can bounce a real ball anywhere in the room and the full trajectory of the ball will be tracked in Mathland even when it goes outside the user’s field of view.

(iv) Add more learning tools: Mathland currently supports Newtonian physics concepts, but we plan to add concepts related to Flatland [10] and higher dimensions in mathematics, the space-time fabric and black holes, and electromagnetic fields and the speed of light.

3.8 Conclusion

Technological advances often usher in educational innovations as people leverage new technologies for educational use cases. Mixed Reality has previously been used to create new educational environments, but, similar to traditional learning environments, most of them have treated the learner as a passive consumer rather than an active creator of knowledge. In Mathland, we allow learners to unleash their creativity and curiosity in a MR physics playground, where learners can visualize, experience, and explore mathematical concepts in their real world.

In order to allow for more natural interactions with virtual objects, we used existing sensors to create custom controllers, namely an Object controller and an Arm controller, which allow for more tangible and embodied interactions in Mixed Reality. We also used spatial mapping and computer vision to situate learning in the user’s real environment. Our user studies suggest that Mathland could be conducive to creative learning, collaboration, and playful problem-solving. Using constructionist, embodied, and situated learning in MR, Mathland allows users to view the world through a mathematical lens and to naturally interact and experiment with mathematical concepts, so that learners may develop an intuitive understanding of mathematics, as envisioned by Papert in his vision for Mathland [9].

Chapter 4

ARPiano

4.1 Introduction

This chapter describes ARPiano, a mixed reality application that augments a physical keyboard. ARPiano uses a MIDI (Musical Instrument Digital Interface) keyboard and a multi function knob to create a novel mixed reality experience that supports visual music learning, music visualizations and music understanding. ARPiano is able to precisely locate a physical keyboard in order to overlay various objects around the keyboard and on individual keys. These augmented objects are then used for music learning, visualization and understanding. Furthermore, ARPiano demonstrates a novel way to utilize the keys in a piano as an interface to interact with various augmented objects.

4.1.1 Goals

ARPiano is designed to allow for better visual music learning and music understanding.

Music Learning

The main goal of ARPiano is to facilitate visual music learning using augmented reality. Traditional music learning focuses on reading symbols in the form of sheet music.

However, these symbols are abstract and there is no intuitive correlation between a symbol and the note that it represents. For example, notes of different lengths are differentiated by whether the symbols are filled in or not (quarter notes and half notes) or whether they have an extra tick attached to them (eighth notes and sixteenth notes). This notation is not immediately intuitive and requires that the user learn the symbols in order to make sense of them. Current theories of learning show that using multiple modalities is an effective strategy to form mental constructs and schemas [36, 37, 38]. The visual aspect of traditional music learning depends on these abstract symbols and could be stronger.

ARPiano aims to facilitate visual music learning by providing a deeper connection between the song the user is learning and what they are seeing. This is accomplished through intuitive and spatially accurate visuals, real-time feedback, and performance statistics. ARPiano visually renders a song as a sequence of cuboids, where the length of each cuboid represents the length of the note and its position relative to the keyboard represents the note value.

Music Visualization and Understanding

ARPiano also allows for music visualization and music understanding. Music visualization is accomplished by emitting a sprite from the physical key location whenever a key is pressed. This provides the user with a visual counterpart to what they are hearing and allows them to better see patterns in the music. Additionally, while the user is playing, ARPiano is able to augment the interface to point out specific musical artifacts, such as chords, which aids in music understanding.

4.2 Previous Work

Music learning software is not new, and there are many applications that try to facilitate music learning by creating visuals or providing performance statistics. However, none of these applications provide spatially accurate visuals.

4.2.1 SmartMusic

SmartMusic [32] is a music learning software that provides helpful feedback on a player’s performance. The user is shown the sheet music as well as a bar indicating the current measure. While a song is being played, SmartMusic uses a microphone to determine whether the user is playing the correct notes. While SmartMusic provides helpful statistics to the user, such as which sections had the most mistakes, it still relies on sheet music and does not offer a more intuitive visual representation of a song.

4.2.2 Synthesia

Synthesia [33] is a piano learning software that, like ARPiano, renders songs in the form of a piano roll. Each note in the song is represented by a rectangle that slowly approaches its corresponding key on a virtual keyboard on the display. This representation is more intuitive than sheet music and allows the user to see patterns in the song. Synthesia is also able to track a player’s performance and provide statistics to help guide their practice. While Synthesia provides an intuitive visual representation of the song, it does so on a traditional 2D display and requires that the user look back and forth between the virtual keyboard on the display and the physical keyboard they are playing on.

4.2.3 Andante/Perpetual Canon

Andante [34] and Perpetual Canon [35] are part of a collection of research projects developed by Xiao Xiao from the MIT Media Lab. They augment a piano with a projector and are able to provide visualizations of pre-recorded or live music. These projects did not, however, focus on music learning or on analyzing what is being played.

Figure 4-1: Multifunction knob and MIDI keyboard with AR markers on first and last keys.

4.3 Design

ARPiano runs on a Hololens that wirelessly receives messages from a server running on a PC. The application augments the physical keyboard with various sprites and annotations and receives input from the keyboard in order to support various interactions. ARPiano adopts a modular design in the form of KeyboardComponents (see Section 4.4). The design of ARPiano was influenced by the Hololens development challenges encountered in Mathland. As such, sharing is done without a spatial anchor (Section 4.3.5) and a separate Unity debugging workflow was built along the way to speed up development (Section 4.3.6).

4.3.1 Integrated Sensors

ARPiano uses a MIDI keyboard and a multi-function knob for input (Figure 4-1). Both of these devices communicate over USB. Since the Hololens does not have any USB ports, additional steps had to be taken to integrate these devices into a Mixed Reality experience. The application also uses the built-in Hololens camera to get the initial position of the keyboard in three-dimensional space. The keyboard sends MIDI commands to the computer over USB and is the main interface for ARPiano. When a key is pressed, the keyboard emits the corresponding sound and sends a MIDI message to the connected computer. MIDI is well established and documented, so many libraries exist to parse these messages. I used the open source MidiJack Unity plugin [42] to parse each incoming MIDI message into its corresponding channel, note, and velocity.

The computer then sends each event (either noteOn or noteOff) to the Hololens using the HoloToolkit sharing service. These messages are then used by the Hololens to support various interactions. The multi-function knob serves as an additional controller to change various parameters of the keyboard. An augmented interface can be rendered on top of the controller to show which parameter is being changed (see Section 4.4.3). I used the Griffin Powermate, which can map knob rotations and presses to different keyboard events. These events are then picked up by the computer using Unity’s Input library and sent to the Hololens via the Sharing Service. The Hololens then decides what to do depending on the current state of the keyboard. Lastly, ARPiano uses the built-in Hololens camera to detect the three-dimensional position of the first and last keyboard keys, as described in the following subsection.
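To make the server side of this pipeline concrete, the sketch below subscribes to MidiJack’s note delegates and forwards each event; the ForwardToHololens helper is hypothetical and stands in for the HoloToolkit sharing service used in ARPiano.

```csharp
using UnityEngine;
using MidiJack;

// Runs on the PC: listens for MIDI events via MidiJack and forwards them to the Hololens.
public class MidiForwarder : MonoBehaviour
{
    void OnEnable()
    {
        MidiMaster.noteOnDelegate  += OnNoteOn;
        MidiMaster.noteOffDelegate += OnNoteOff;
    }

    void OnDisable()
    {
        MidiMaster.noteOnDelegate  -= OnNoteOn;
        MidiMaster.noteOffDelegate -= OnNoteOff;
    }

    void OnNoteOn(MidiChannel channel, int note, float velocity)
    {
        // Forward the pressed note and its velocity to every connected Hololens.
        ForwardToHololens("noteOn", note, velocity);
    }

    void OnNoteOff(MidiChannel channel, int note)
    {
        ForwardToHololens("noteOff", note, 0f);
    }

    // Hypothetical broadcast helper; ARPiano uses the HoloToolkit sharing service here.
    void ForwardToHololens(string eventName, int note, float velocity)
    {
        Debug.Log(eventName + ": note=" + note + ", velocity=" + velocity);
    }
}
```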

4.3.2 Physical and Augmented Keyboard

At its core, ARPiano consists of a main Keyboard class that handles incoming messages from the server and is responsible for the initial calibration. The MIDI keyboard is connected via USB to a PC running the sharing server. The computer then forwards all MIDI messages to the Hololens. The Keyboard class is responsible for parsing these messages and making them available to Keyboard Components (see Section 4.4). Each message consists of the pressed note (as an integer) and the velocity of the pressed note (as a float).

Calibration of the physical keyboard allows the Hololens to know precisely where each key on the keyboard is in three-dimensional space. This process is done by placing two key-sized Vuforia markers on the first and last keys of the keyboard. When the application is first opened, the user is instructed to look at the first and last keys of the keyboard until the keys turn green, indicating successful detection of the markers (Figure 4-2). In order to accommodate keyboards with different numbers of keys, the user then presses the first and last keys. This finalizes the calibration procedure and allows the Keyboard class to accurately determine the position and rotation of the keyboard and each individual key therein.

Figure 4-2: Calibration ensures that the Hololens knows the exact location of the physical keyboard. The markers show the perceived location of the first and last keys.

The Keyboard class exposes several methods, such as GetNotePosition, which returns the position of a key in physical space given the corresponding MIDI number. This allows components like NoteSpawner to place sprites at the position of a corresponding key (see Section 4.4.4). The Keyboard class also allows KeyboardComponents to subscribe to three different events: calibrationComplete, noteOn, and noteOff.
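The sketch below shows one way a method like GetNotePosition could be derived from the calibration data; the linear interpolation by MIDI number and the black-key offsets are simplifying assumptions, not the thesis’s exact implementation.

```csharp
using UnityEngine;

// Sketch of key-position lookup after calibration. firstKeyPos/lastKeyPos come from
// the Vuforia markers, firstMidi/lastMidi from the user pressing the first and last keys.
public class KeyboardSketch : MonoBehaviour
{
    public Vector3 firstKeyPos, lastKeyPos;   // world positions of the two calibrated keys
    public int firstMidi, lastMidi;           // MIDI numbers of the first and last keys

    // Pitch classes that correspond to black keys (C# through A#).
    static readonly bool[] isBlack = { false, true, false, true, false, false,
                                       true, false, true, false, true, false };

    public Vector3 GetNotePosition(int midi)
    {
        // Interpolate along the keyboard axis by MIDI number (an approximation).
        float t = (midi - firstMidi) / (float)(lastMidi - firstMidi);
        Vector3 pos = Vector3.Lerp(firstKeyPos, lastKeyPos, t);

        // Push black keys slightly back and up relative to the keyboard's orientation.
        if (isBlack[midi % 12])
            pos += transform.forward * 0.02f + transform.up * 0.01f;

        return pos;
    }
}
```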

4.3.3 Keyboard UI/Helper classes

The Keyboard class handles the position of and interactions with the physical keyboard. All UI elements are handled by their corresponding Keyboard Components and helper classes. Shared UI elements are implemented as helper classes to the Keyboard class. These include the Annotation Manager, UIRings, the Keyboard Highlighter, and the Component Menu.

Annotation Manager

The Annotation Manager creates animated text annotations at specified locations on the keyboard. It is responsible for spawning and destroying annotations at the correct location and rotation so that the user can see them. The Annotation Manager exposes one method, showAnotation, which takes in a string representing the annotation to show and either a Vector3 or a MIDI note representing the position where the annotation should be rendered. Currently, the Annotation Manager is used by the Chord Detector (see Section 4.4.5) and the Song Tutorial Results graphs (see Section 4.4.3).


UIRing

UIRings create a visual representation of a variable in the form of a ring. They are created by spawning a defined number of small cuboids in a circular pattern. The UIRing constructor takes in a function that returns the current value of the variable to represent, as well as the maximum value that the variable can take. Using this information, the UIRing can be set up and forgotten, since its state is updated automatically without needing to be explicitly refreshed. The UIRing is used by the Song Tutorial component to visualize the song speed and progress (see Section 4.4.3) and by the Component Menu to indicate when the menu will open (see Section 4.3.4).
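A minimal sketch of such a helper is shown below, assuming a Func<float> value getter; the segment count, sizes, and colors are illustrative choices rather than the actual UIRing implementation.

```csharp
using System;
using UnityEngine;

// Sketch of a UIRing-style helper: cuboid segments arranged in a circle light up
// in proportion to valueGetter() / maxValue.
public class UIRingSketch : MonoBehaviour
{
    Func<float> valueGetter;
    float maxValue;
    GameObject[] segments;

    public void Init(Func<float> getter, float max, int segmentCount = 24, float radius = 0.05f)
    {
        valueGetter = getter;
        maxValue = max;
        segments = new GameObject[segmentCount];

        for (int i = 0; i < segmentCount; i++)
        {
            float angle = i * Mathf.PI * 2f / segmentCount;
            var seg = GameObject.CreatePrimitive(PrimitiveType.Cube);
            seg.transform.SetParent(transform, false);
            seg.transform.localPosition = new Vector3(Mathf.Cos(angle), Mathf.Sin(angle), 0f) * radius;
            seg.transform.localScale = Vector3.one * 0.005f;
            segments[i] = seg;
        }
    }

    void Update()
    {
        if (segments == null) return;
        // Light up the fraction of segments corresponding to the current value.
        int lit = Mathf.RoundToInt(Mathf.Clamp01(valueGetter() / maxValue) * segments.Length);
        for (int i = 0; i < segments.Length; i++)
            segments[i].GetComponent<Renderer>().material.color = i < lit ? Color.green : Color.gray;
    }
}
```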

Keyboard Highlighter

The Keyboard Highlighter helper class can highlight a specified key on the physical keyboard in any given color. It does this by creating a thin cuboid with the position and size of the given key and a transparent shader set to the specified color. This helper class consults the main Keyboard class to determine the position of the key as well as the type of key (white or black) in order to determine its size. The Keyboard Highlighter is used by the Chord Suggestion component (see Section 4.4.1) and the Song Tutorial Results (see Section 4.4.3).

4.3.4 Component Menu

The Component Menu utilizes the keyboard keys as an interface to enable or disable individual Keyboard Components. To open the menu, the user must press and hold the first key on the keyboard. While the key is being held, a UIRing appears on top of it to indicate how much longer until the menu opens. ARPiano avoids accidental opening of the menu by requiring the user to hold the lowest key for a couple of seconds, which does not happen often in regular music.

Figure 4-3: 3D icons representing Keyboard Components are overlayed on top of the keyboard. From left to right, the icons represent: KeyboardLabels, NoteSpawner, ChordSuggestion, ChordDetector, and SongTutorial. In this figure, the NoteSpawner component is enabled.

Once the menu is opened, all Keyboard Components are paused and icons representing them are rendered on the keyboard. Each icon consists of a colored rectangle that spans four white keys on the keyboard along with a three-dimensional representation of the Keyboard Component, as seen in Figure 4-3. The rectangle appears green or white depending on whether the corresponding component is enabled or disabled. To change the state of a component, the user simply needs to press any of the keys under the icon. Once the state of a component changes, the menu closes. Alternatively, the user can press the first key again to close the menu.

4.3.5 Sharing

Sharing allows two or more Hololenses to see the same virtual objects in the same physical space. Sharing on the Hololens is usually accomplished by using an anchor. Mathland, for example, utilized an anchor to create a shared coordinate system between all users. Once it is set up, the anchor provides an accurate way to share the location of virtual objects. The anchor setup, however, is far from perfect and greatly increases the setup time while decreasing the speed of development. The anchor setup can take anywhere from two to five minutes and does not always work, for reasons described in the Mathland chapter. Since the core of the application depends on this shared coordinate system, setting up the anchor is required before any functionality can be tested, which has an adverse effect on the speed at which new features can be developed.

ARPiano was designed in a way that still allows sharing, but does away with the anchor. Since all of the virtual objects in ARPiano are positioned relative to the physical keyboard, a shared coordinate system between Hololenses is not required. Each Hololens simply scans the Vuforia markers to get the location of the physical keyboard in its own coordinate system and uses that to render all objects, regardless of any other Hololenses present in the scene. Since there is no other form of input coming from the Hololens, this method provides the same benefits as the anchor with considerably quicker setup time (5 seconds instead of 3 minutes). In Mathland, the user’s location was shared with everyone else in the session, so the anchor was necessary. With this method, all of the input sources (MIDI keyboard and multi-function knob) are connected to the master computer, which then broadcasts all events to all Hololenses in the session. From there, each Hololens can receive the events and act in its own coordinate system.

4.3.6 Unity Debugging

ARPiano was designed to overcome some of the development challenges encountered during Mathland. The main challenge with developing Mathland had to do with the setup time required by the anchor, as explained above. The other challenge was that all new features had to be tested on the Hololens, which requires a slower double-building process and does not yet have a robust debugger. On the Hololens, for example, it is not possible to set breakpoints or see line numbers when an error is encountered. The lack of debugging tools, in combination with the extra setup required to deploy the application to the Hololens and wear the device, makes it harder and slower to debug and develop new features.

To speed up development, ARPiano was designed to maximize the amount of functionality that can be run and tested from the Unity Editor. Network messages and keyboard calibration parameters were mocked, and a virtual keyboard is rendered in the Unity Editor. This allows developing and debugging of non-calibration features without a Hololens, as shown in Figure 4-4. Additionally, Unity’s debugging capabilities are far greater than those of the Visual Studio Hololens counterpart.

Figure 4-4: Non-calibration features, such as the Piano Roll, can be debugged directly in Unity.

For example, file names and line numbers are correctly identified, breakpoints can be set, and variables can be inspected. These additional debugging features significantly speed up development, as most debugging can be done within Unity and the Hololens is only required once most bugs have been addressed.
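As an illustration of this editor-first setup, the sketch below mocks the calibration values and note events inside Unity’s UNITY_EDITOR conditional compilation blocks; the field names and mocked values are hypothetical rather than ARPiano’s actual mocking code.

```csharp
using UnityEngine;

// Sketch of editor-only mocking: in the Unity Editor, MIDI input and calibration are
// simulated so non-calibration features can be tested without a Hololens.
public class EditorMockBootstrap : MonoBehaviour
{
    public Vector3 firstKeyPos, lastKeyPos;  // normally set by Vuforia calibration
    public int firstMidi, lastMidi;          // normally set by pressing the end keys

    public event System.Action<int, float> NoteOn;   // forwarded to components

    void Start()
    {
#if UNITY_EDITOR
        // Fake calibration: place a 61-key keyboard one meter in front of the camera.
        firstKeyPos = new Vector3(-0.45f, 0f, 1f);
        lastKeyPos  = new Vector3( 0.45f, 0f, 1f);
        firstMidi = 36;   // C2
        lastMidi  = 96;   // C7
#endif
    }

#if UNITY_EDITOR
    void Update()
    {
        // Map a couple of editor keys to fake noteOn events so components can be exercised.
        if (Input.GetKeyDown(KeyCode.A)) NoteOn?.Invoke(60, 0.8f);  // Middle C
        if (Input.GetKeyDown(KeyCode.S)) NoteOn?.Invoke(62, 0.8f);  // D
    }
#endif
}
```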

4.4 Keyboard Components

ARPiano is designed to allow for easy addition of new features in the form of Keyboard Components. A Keyboard Component subscribes to messages from the main Keyboard class and can be enabled or disabled independently of other Keyboard Components. Keyboard Components use helper methods from the Keyboard class to correctly position objects relative to the keyboard or the desired key. At this time, ARPiano consists of six different Keyboard Components that enable music learning, understanding, and visualization. The main goal of ARPiano is to facilitate visual music learning through augmentations on the physical keyboard.

This is accomplished by the Chord Suggestion, Keyboard Labels, and Song Tutorial components. Music visualization complements the music listening experience by providing a visual counterpart to the song being played. It can also provide a history of each note and allow for easier recognition of patterns in a song. In ARPiano, the Note Spawner and Flashy Visualizer components visualize the notes as they are being played. Music understanding reinforces the music learning aspect by creating deeper connections between what is being played and what is being learned. ARPiano allows users to better understand music by pointing out interesting note intervals through different colors and annotations. The Note Spawner and Chord Detector components aid in music understanding.
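A minimal sketch of what such a component contract could look like is shown below; the class and method names are illustrative rather than the thesis’s exact code.

```csharp
using UnityEngine;

// Sketch of a Keyboard Component base class: each component reacts to note events
// forwarded by the main Keyboard class and can be toggled independently
// (e.g., from the Component Menu).
public abstract class KeyboardComponentSketch : MonoBehaviour
{
    public bool Enabled { get; private set; }

    public void SetEnabled(bool enabled)
    {
        Enabled = enabled;
        gameObject.SetActive(enabled);
    }

    // Called by the Keyboard when MIDI events arrive from the server.
    public abstract void OnNoteOn(int midiNote, float velocity);
    public abstract void OnNoteOff(int midiNote);
}
```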

4.4.1 Chord Suggestion

The Chord Suggestion component gives hints to a user as they are learning different chords. When a key is pressed, other keys next to it are highlighted to indicate that those keys, together with the pressed key, form part of a chord (Figure 4-5). More specifically, the pressed key is treated as the root of the chord and the third and fifth notes of the triad are highlighted. Different highlight colors indicate different chord qualities, though at this time only major and minor chords are supported. The Chord Suggestion component works by doing MIDI math on the pressed MIDI notes. Since the intervals between the notes in the triads are fixed regardless of the root note, this is a straightforward calculation. The component then uses the Keyboard Highlighter helper class to highlight the corresponding keys.
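The MIDI math itself is just fixed offsets from the root, as the sketch below illustrates; the highlight callback stands in for the Keyboard Highlighter, and the colors only loosely match those in the figure.

```csharp
// Sketch of the "MIDI math" behind chord suggestion: triad intervals are fixed
// offsets from the root, so the keys to highlight are simple additions.
public static class ChordSuggestionMath
{
    // Semitone offsets from the root note.
    const int MinorThird = 3, MajorThird = 4, PerfectFifth = 7;

    public static void SuggestTriads(int rootMidi, System.Action<int, UnityEngine.Color> highlight)
    {
        // Major triad: root, major third, perfect fifth.
        highlight(rootMidi + MajorThird, UnityEngine.Color.green);
        highlight(rootMidi + PerfectFifth, UnityEngine.Color.green);

        // Minor triad: root, minor third, perfect fifth (the fifth is shared with the major triad).
        highlight(rootMidi + MinorThird, UnityEngine.Color.magenta);
    }
}
```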

Figure 4-5: The Chord Suggestion component highlights keys that can form chords. In this figure, both major (green highlight) and minor (pink highlight) chords are suggested, with shared keys highlighted blue.

Figure 4-6: The Keyboard Labels component augments each key by labeling its corresponding note.

4.4.2 Keyboard Labels

The Keyboard Labels component adds note labels to each key of the keyboard to assist the user in recognizing the position of notes on the keyboard (Figure 4-6). The component loops from the first key of the keyboard (as determined during the calibration procedure) to the last key and adds the corresponding note label. In MIDI, each note is represented as an integer, with Middle C being 60 and each subsequent integer corresponding to the next semitone (so 61 corresponds to C#, for example). Knowing this, it is possible to label all of the keys on the keyboard.
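A minimal sketch of this mapping is shown below; the octave naming convention (MIDI 60 as C4) is a common one and is assumed here.

```csharp
// Sketch of MIDI-number-to-label conversion used to label each key.
public static class NoteLabels
{
    static readonly string[] Names =
        { "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B" };

    public static string ForMidi(int midi)
    {
        string name = Names[midi % 12];
        int octave = midi / 12 - 1;   // MIDI 60 -> C4
        return name + octave;
    }
}

// Example: NoteLabels.ForMidi(60) returns "C4" and NoteLabels.ForMidi(61) returns "C#4".
```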

4.4.3 Song Tutorial

The Song Tutorial component is the most complex Keyboard Component in ARPiano. It enables music learning by allowing the user to play any MIDI song in a visual and intuitive way. The component works by rendering a piano roll of the song that precisely matches the real piano. That way, as the piano roll moves down, the user can press the indicated keys, in a way similar to popular music games. The speed and current position of the song can be adjusted using the multi-function knob, and interactive performance statistics are displayed at the end in order to increase practice efficiency.

PianoRoll

The piano roll consists of a series of TutorialNotes, depicted as cuboids whose position on the left/right axis represents the key to be played, while the position on the vertical axis represents the time when the key should be played.

Furthermore, the length of the cuboid represents the amount of time the note should be held for, and black keys are differentiated from white keys by their position and color (Figure 4-7). The piano roll starts above the keyboard and slowly moves down, causing TutorialNotes to eventually lie directly on top of the keys of the keyboard. Each time a TutorialNote does this, the user is instructed to press the corresponding key on the keyboard. By following these simple instructions, a melody can be played.

The piano roll is rendered by taking any MIDI song and translating it to a Tone.js-compatible JSON file. Tone.js is a JavaScript library for parsing and creating MIDI files, and it includes a script to convert a MIDI file to a JSON file [43]. It was chosen specifically to avoid working with raw MIDI files and to speed up development, as I was already familiar with parsing JSON files in Unity. From the MIDI file, the JSON file provides a list of notes, each consisting of a time, a duration, and a MIDI integer. These can be easily mapped into their corresponding TutorialNote cuboids using the main Keyboard class.
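The sketch below shows how such a note list could be deserialized in Unity; the note fields (time, duration, midi) follow the description above, while the wrapper layout and class names are assumptions.

```csharp
using System;
using UnityEngine;

// Sketch of loading the converted song data in Unity.
[Serializable]
public class SongNote
{
    public float time;      // seconds from the start of the song
    public float duration;  // how long the note is held
    public int midi;        // MIDI note number
}

[Serializable]
public class SongData
{
    public SongNote[] notes;
}

public static class SongLoader
{
    public static SongData FromJson(string json)
    {
        // JsonUtility handles simple [Serializable] classes like the ones above.
        return JsonUtility.FromJson<SongData>(json);
    }
}

// Each SongNote can then become a TutorialNote cuboid: the midi number gives the
// horizontal position (via GetNotePosition), time gives the starting height, and
// duration gives the cuboid's length.
```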

As the piano roll moves down, the user must press the correct keys in order to play the melody.

Figure 4-7: SpectatorView image of AR PianoRoll representation of Beethoven’s Für Elise

In addition to the auditory feedback the user gets from the piano, TutorialNotes change color to indicate whether they were played correctly or not. Note verification, as described above, is accomplished by maintaining a list of notes that are currently touching the keyboard and thus should be played. If a note in this list is played, it is marked as correct and turns green; otherwise, if the note leaves the keyboard without being played, it is marked as incorrect and turns red. This visual feedback provides another way for the user to gauge their performance in real time as they play the song.
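The bookkeeping behind this verification can be summarized by the sketch below; the class and method names are illustrative.

```csharp
using System.Collections.Generic;

// Sketch of the note-verification bookkeeping: notes currently overlapping the
// keyboard are "due"; a matching key press marks them correct, and leaving the
// keyboard unplayed marks them incorrect.
public class NoteVerifier
{
    readonly HashSet<int> dueNotes = new HashSet<int>();   // MIDI numbers touching the keys
    public int Correct { get; private set; }
    public int Missed  { get; private set; }

    public void NoteReachedKeyboard(int midi) => dueNotes.Add(midi);

    public void KeyPressed(int midi)
    {
        if (dueNotes.Remove(midi))
            Correct++;            // played while the cuboid overlapped its key: turns green
    }

    public void NoteLeftKeyboard(int midi)
    {
        if (dueNotes.Remove(midi))
            Missed++;             // left the keyboard without being played: turns red
    }
}
```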

Song Adjustments

The Song Tutorial component allows the user to adjust the current position of the piano roll and the speed at which it moves down by turning the multi-function knob. The song can also be paused and resumed by pressing the knob. More specifically, I connected a Griffin Powermate to the server computer and mapped various types of rotations and presses to different keyboard presses that get forwarded to the Hololens through the sharing service. The knob supports two kinds of rotations: a regular rotation and a pressed rotation. Both are sent to the Hololens. The knob is augmented with two sets of half UIRings around it that represent two song parameters. The outer ring depicts the current position of the song and the inner ring represents the speed at which it is being played.

Figure 4-8: The knob can be used to see and control the position of the song being played (left) or the speed at which it is being played (right)

When the knob is turned, the corresponding ring value changes and its exact value is shown on top of the ring (Figure 4-8). The knob UI is an interesting demonstration of how augmented reality can turn a single device into a multi-function interface. The knob is kept close by on the keyboard so that it can be reached easily. During music practice, a user can use the visual feedback to quickly “scroll” back to a difficult section in order to practice it at a slower tempo, for example.

Song Tutorial Results

After a song is played, the Song Tutorial component displays helpful statistics about the performance in the form of Song Tutorial Results. These results use the keyboard as an interactive canvas to visually show different statistics. Each key is highlighted using the Keyboard Highlighter described in Section 4.3.3. An arrow is rendered on the last key to signify that pressing it will render the next available result. Currently, ARPiano supports two types of results: mistakes over time and mistakes per key.

The first graph maps each key on the keyboard to a section of the song and color codes the keys depending on the number of mistakes that occurred during that section. This allows the user to quickly see which sections of the song need more practice, as seen in Figure 4-9. Additionally, pressing a key will automatically scroll to that section of the song so the user can practice it immediately.

Figure 4-9: At the end of a Song Tutorial, the keyboard is transformed into a visual representation of the user’s performance. In this graph, it is clear that the user made more mistakes in a specific section near the beginning of the song.

The second graph color codes each key according to the ratio of times that key was pressed correctly. This allows the user to quickly identify which hand needs more practice. Pressing a key will reveal the exact number of times that key was supposed to be played and the number of times it was actually played. Song Tutorial Results demonstrate a novel way to utilize the keyboard as an interactive canvas while providing useful statistics to help maximize practice efficiency.
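A minimal sketch of how the mistakes-over-time graph could bucket mistakes into key-sized song sections and color the keys is shown below; the bucketing scheme and the green-to-red ramp are assumptions, not the exact implementation.

```csharp
using UnityEngine;

// Sketch of the mistakes-over-time graph: the song is split into as many sections
// as there are keys, and each key is colored by the mistake count of its section.
public static class ResultsGraphSketch
{
    public static void ShowMistakesOverTime(
        float[] mistakeTimes, float songLength, int keyCount,
        System.Action<int, Color> highlightKey)   // keyIndex -> color (Keyboard Highlighter)
    {
        int[] mistakesPerSection = new int[keyCount];
        foreach (float t in mistakeTimes)
        {
            int section = Mathf.Clamp(Mathf.FloorToInt(t / songLength * keyCount), 0, keyCount - 1);
            mistakesPerSection[section]++;
        }

        int worst = Mathf.Max(1, Mathf.Max(mistakesPerSection));
        for (int key = 0; key < keyCount; key++)
        {
            // More mistakes -> closer to red; no mistakes -> green.
            float severity = mistakesPerSection[key] / (float)worst;
            highlightKey(key, Color.Lerp(Color.green, Color.red, severity));
        }
    }
}
```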

4.4.4 Note Spawner and Flashy Visualizer

The Note Spawner and Flashy Visualizer both emit sprites from the location of a pressed key whenever it is pressed. These components subscribe to the noteOn and noteOff events and use the Keyboard’s helper methods to position the sprites at the correct location. The Flashy Visualizer, as the name implies, aims to complement a live music performance by providing colorful visuals corresponding to the notes being played. This component renders various sprites when a key is pressed, or when a combination of keys (such as a chord) is pressed, in order to provide a visual counterpart to the notes being played. The NoteSpawner emits a cuboid from the pressed key that slowly travels up.

Figure 4-10: The NoteSpawner component emits cuboids as the user plays the keyboard in order to visualize rhythmic and tonal patterns in the song.

The length of the emitted sprite represents the length of time the note was held for, while the speed at which the sprite moves up represents the velocity of the corresponding key press. The NoteSpawner adds a temporal dimension to the music being played while allowing the user to see patterns within the song (Figure 4-10).

The NoteSpawner component can be configured to spawn notes whose color corresponds to that of a color wheel overlaid on top of the circle of fifths. This harmonic coloring provides an interesting visualization of the note intervals, so a close relationship in sound is described visually by a close relationship in color [40]. Using these colors, tonal patterns can be seen in many songs, such as Chopin’s Prelude, Op. 28, No. 3 in G major, which remains entirely in a single tonal area [39]. The note positions and lengths, in combination with the harmonic coloring, provide a way to observe the tonal and rhythmic patterns of a song in a way that sheet music simply cannot capture.
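One common way to implement this kind of coloring is sketched below; mapping pitch class to hue via multiplication by 7 (a perfect fifth) is an assumption about the exact scheme, not necessarily the wheel used in ARPiano.

```csharp
using UnityEngine;

// Sketch of circle-of-fifths ("harmonic") coloring: walking the chromatic scale in
// steps of 7 semitones (a perfect fifth) visits every pitch class once, so mapping
// (midi * 7) mod 12 onto the hue wheel places harmonically close notes at similar colors.
public static class HarmonicColor
{
    public static Color ForMidi(int midi)
    {
        int fifthsIndex = (midi * 7) % 12;   // position around the circle of fifths
        float hue = fifthsIndex / 12f;       // spread pitch classes around the hue wheel
        return Color.HSVToRGB(hue, 0.8f, 1f);
    }
}

// Example: C (60) and G (67) map to adjacent hues, while C and F# end up on
// opposite sides of the wheel, mirroring their harmonic distance.
```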

4.4.5 Chord Detector

The Chord Detector component makes the user more aware of chords by creating an annotation of the chord name every time a chord is played (Figure 4-11).

Figure 4-11: The Chord Detector component analyzes the keys as they are pressed in order to label chords when they are played. This figure also shows how multiple components (ChordDetector and KeyboardLabels) can be enabled at once.

By reminding the user of the chord name whenever a chord is played, a deeper connection can be formed between the notes and the corresponding chord. I hypothesize that this aids in music understanding by making the user more aware of common chord intervals. Similarly to the Chord Suggestion component, the Chord Detector works by analyzing the intervals of all notes being pressed. On each key press, the Chord Detector analyzes every subset of three notes and tries to match it to a major or minor chord. Once a chord is detected, an annotation with the chord name is rendered on top of the root note. The positioning of the annotation allows the user to see which inversion of the chord was played.
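The sketch below illustrates this subset matching over pitch classes; the names and return format are illustrative, and only major and minor triads are considered, as in the current component.

```csharp
using System.Collections.Generic;
using System.Linq;

// Sketch of triad detection over the currently pressed notes: every 3-note subset is
// reduced to intervals from a candidate root and tested against the fixed
// major/minor patterns. Trying each note as the root also covers inversions.
public static class ChordDetectorSketch
{
    public static string DetectTriad(IList<int> pressedMidi)
    {
        var notes = pressedMidi.OrderBy(n => n).ToList();

        for (int a = 0; a < notes.Count; a++)
            for (int b = a + 1; b < notes.Count; b++)
                for (int c = b + 1; c < notes.Count; c++)
                {
                    var subset = new[] { notes[a], notes[b], notes[c] };
                    foreach (int root in subset)
                    {
                        var intervals = subset
                            .Select(n => ((n - root) % 12 + 12) % 12)
                            .OrderBy(i => i)
                            .ToArray();

                        if (intervals.SequenceEqual(new[] { 0, 4, 7 }))
                            return NoteName(root) + " major";
                        if (intervals.SequenceEqual(new[] { 0, 3, 7 }))
                            return NoteName(root) + " minor";
                    }
                }
        return null;   // no major or minor triad found
    }

    static string NoteName(int midi)
    {
        string[] names = { "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B" };
        return names[midi % 12];
    }
}
```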

4.5 Evaluation

I designed a preliminary experiment to evaluate the effectiveness of the visual music learning aspect of ARPiano. More specifically, I wanted to analyze the strengths and weaknesses of ARPiano by asking non-piano players to try to play various melodies without practicing them ahead of time. As a baseline comparison, I used the Synthesia software described in Section 4.2.2. Each participant was asked to play four songs in total: two using ARPiano and two using Synthesia. I chose the following songs: Twinkle Twinkle (easy difficulty, both parts), Beethoven’s Ode to Joy (easy difficulty, both parts), Beethoven’s Für Elise (medium difficulty, melody only), and Bach’s Minuet in G (medium difficulty, melody only). The songs were chosen because of their popularity and varying degrees of difficulty. For each participant, the starting method (ARPiano or Synthesia) and song pairing (one from each difficulty level) were randomly selected such that each song was played an equal number of times with each method.

Users were briefly instructed on how to use each method and played a short chromatic exercise with it in order to make sure they understood how to use it correctly. After the method tutorial, each user played two songs using the selected method, filling out a quick survey in between songs. After the first method was completed, the same exercise was conducted with the other method and the other two songs.

4.5.1 Study Results

I conducted this study with 14 participants (11 Female, 2 Male, 1 Agender) aged between 18 and 22. 57% of participants had played instruments before (35.7% had played one instrument, 21.4% had played two), and exactly 50% of participants could read sheet music. None of the participants reported having played the piano before.

All participants agreed (7.1%) or strongly agreed (92.9%) that they “found ARPiano fun to use”. In comparison, 35.7% of participants agreed (28.6%) or strongly agreed (7.1%) that they “found Synthesia fun to use”. 35.5% of participants agreed that they would use Synthesia again to learn piano, while 85.7% of participants agreed (21.4%) or strongly agreed (64.3%) that they would use ARPiano again. 85.7% of participants agreed (35.7%) or strongly agreed (50%) that “ARPiano was intuitive to use”, while only 42.8% of participants agreed (21.4%) or strongly agreed (21.4%) that “Synthesia was intuitive to use”.

After each performance, participants were asked to rate their performance against their initial expectations. 78.57% of ARPiano players reported performing about the same as (42.85%), better than (25%), or much better than (10.72%) what they originally expected. In comparison, only 21.4% of Synthesia players reported performing about the same as (14.3%) or better than (7.1%) what they originally expected, with 46.4% of them performing “much worse than” what they expected.

On average, users performed better using ARPiano and commented on the convenience of having the note sprites land on the corresponding physical key instead of on a virtual keyboard on the monitor. One participant was even able to play both songs (Ode to Joy and Bach’s Minuet) with fewer than 2 errors in each. Interestingly, for song sections with consecutive notes (key-wise), participants were often able to play the melodies correctly with both methods. However, ARPiano performed much better in sections where the notes were not consecutive. With Synthesia, users often missed jumps in the notes because they were unable to determine how far to move their hand to hit the correct note and there was not enough time to look back and forth between the keyboard and the monitor. With ARPiano, users were able to see exactly where their hand should move without having to look away from the keyboard.

It should be noted that the limited field of view of the Hololens made it difficult to play notes that were more than an octave and a half away from each other. For example, there is a section of Für Elise where a single high note must be played. All ARPiano players missed the jump since they could not see the note coming onto the keyboard. Synthesia players saw the jump, but usually hit the wrong note. These types of mistakes should decrease as AR technology improves.

4.6 Challenges

Many of the Hololens-specific challenges encountered with Mathland were addressed in the design of ARPiano. However, there were still different challenges to overcome in order to make ARPiano a reality. These include UI-specific challenges when working with AR, imperfect Vuforia tracking, and my lack of a music theory background.

In ARPiano, all UI elements must be positioned relative to the keyboard. This means that to move something to the right it is not sufficient to move it along its x direction, as the definition of “right” depends on the initial position and rotation of the keyboard. I solved this by exposing the position, rotation, and directional vectors (forward, up, right) of the physical keyboard to other classes, and making all UI elements relative to those. However, this was not always sufficient, as the Vuforia markers provided imperfect tracking.

For the most part, the Vuforia marker positioning was accurate to a centimeter, which was good enough for positioning objects relative to the keyboard. The rotation of the markers, however, was not as accurate and caused issues with some UI elements, such as the piano roll in the Song Tutorial component. To support tilting of the piano roll, I initially rendered the piano roll vertically and rotated it around the right vector of the keyboard. I then moved the entire piano roll along a vector such that the first note would hit its correct position. In theory, this should make all notes hit the correct position. However, due to small imperfections in the axis of rotation (i.e., the rotation of the keyboard as determined by the Vuforia markers), a small drift was introduced and the TutorialNotes quickly became misplaced, making it impossible to continue playing the song.

I solved this challenge by expanding the TutorialNote class and making each note move towards its end location independently of all other notes. This effectively cancels out the imperfections in the rotation and guarantees that all notes always hit the correct spot on the keyboard.

This project required a surprising amount of knowledge of music theory. I had to familiarize myself with the underlying music theory to be able to implement components such as the Chord Detector. Some helper functions, particularly those in the Keyboard class pertaining to the structure of notes on the keyboard, are written naively and inefficiently. While these do not have an adverse effect on performance, I am sure a stronger music theory background would have led to more elegant implementations.
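The sketch below illustrates the idea behind the drift fix: each note interpolates toward its own precomputed target on the keyboard, so small rotational errors cannot accumulate across the whole roll. The field names and timing are illustrative.

```csharp
using UnityEngine;

// Sketch of the drift fix: instead of moving the whole piano roll as one rotated block,
// each TutorialNote moves toward its own target on the key (from GetNotePosition).
public class TutorialNoteSketch : MonoBehaviour
{
    public Vector3 startPosition;   // where the note is spawned above the keyboard
    public Vector3 targetPosition;  // the key position it must land on
    public float travelTime = 3f;   // seconds until the note reaches its key

    float elapsed;

    void Update()
    {
        elapsed += Time.deltaTime;
        // Each note interpolates independently toward its own target.
        transform.position = Vector3.Lerp(startPosition, targetPosition,
                                          Mathf.Clamp01(elapsed / travelTime));
    }
}
```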

4.7 Future Work

ARPiano explores the different applications and benefits that an augmented instrument can provide. These applications range from music learning to music visualization and understanding. As it stands, with its various Keyboard Components, ARPiano is able to show the potential of an augmented keyboard for the above applications. However, the possibilities of an augmented instrument are far greater than those implemented in the current version of ARPiano.

Keyboards are not the only type of instrument that can benefit from augmentation. I hypothesize that many other instruments, such as string or brass instruments, can benefit in their own way from various augmentations. These instruments would be trickier to implement because, unlike a keyboard, they can move around while they are being played, which would require constant, real-time tracking from the headset.

On the music visualization end, augmented instruments have direct applications to live performances, where each instrument can be visualized independently in order for the audience to get a better feel and understanding of what each instrument is contributing to the overall song.

For easier adoption, a smartphone version using current AR technologies like ARCore and ARKit could be created.

More research is needed to better understand the benefits of an augmented instrument with regard to visual music learning. For example, it would be interesting to run a study similar to the one described in Section 4.5 over a longer time period to determine whether the augmented keyboard allows users to master a song more efficiently than other methods.

With an augmented instrument, music learning can take a more traditional approach or a completely different, gamified approach. With a traditional approach, the system could present the user with each song section and allow them to practice it independently. When the user has shown mastery of the current section, the next section can be rendered and practiced. Using devices such as the Muse [41], the system could detect when the user is ready to move on to the next section by measuring the level of concentration of the user as they play. High brain activity would signify that the user is still struggling with the piece, while low activity would signify that the section is likely committed to muscle memory, since the user is more comfortable playing it. This allows the system to take full control of the learning space in an attempt to maximize the efficiency of the learning process.

An augmented keyboard also opens the possibility of a completely different approach to music learning. For example, the learning experience could be masked in the form of various games where the correct combination of keys needs to be pressed in order to pass each level. This experience would be especially appealing to kids, who may find it difficult to get started with a piano due to the lack of visual stimulation. Without even being aware of it, kids could be learning about chord structures while playing a fun and entertaining game.

4.8 Conclusion

ARPiano uses the Microsoft Hololens to augment a physical MIDI keyboard in order to facilitate visual music learning, understanding, and visualization. This novel application is accomplished by combining existing peripherals, such as a MIDI keyboard and a multi-function knob, with augmented reality technology. ARPiano provides a way to track the position of each key on the keyboard in order to render various sprites and annotations on the physical keyboard depending on what the user is playing. With the Component Menu and Song Tutorial Results, ARPiano also demonstrates a way to use an existing physical peripheral as an intuitive interface for various supported interactions. Lastly, ARPiano provides a helpful way to visualize music in which both tonal and rhythmic patterns can be observed.

Before augmented reality, adding visual learning functionality to instruments required expensive hardware, and the visuals were limited to that hardware. With augmented reality, there is no limit to the type and position of objects that can be rendered on the keyboard. Furthermore, adding new functionality does not require any additional hardware, as all new functionality can be virtually rendered into the user’s field of view. I believe this opens the possibility to rethink the way we approach music learning.

Chapter 5

Conclusion

This thesis shows the broad applications and potential of MR devices when used in combination with existing sensors and peripherals. The Mixed Reality technology of today still has a long way to go, but as MR devices get smaller and more affordable, this technology will have the potential to change the way we see the world and approach learning in general.

Using simple sensors, VRoom allows anyone to experience a real room in a virtual space from anywhere in the world. Mathland offers a glimpse into the potential of using MR technology in education. Everything from models to simulations can be shown in an intuitive way without being constrained to the physical laws of the real world. For example, in a VR world, scale and time can be precisely controlled, allowing us to see and experience things in ways that were previously unimaginable. ARPiano shows how augmented reality can add visual modalities to skills that otherwise lacked them in order to make learning more efficient.

The potential of MR is still being discovered, and we as developers have an opportunity to explore the strengths and weaknesses of this new technology. We must continue to show the potential of MR while steering the technology in a direction that most benefits society. I believe this technology will fundamentally change the way we learn, how we interact with people, and how we see the world. I’m thrilled to be in this field and am excited to see what’s coming next.

Bibliography

[1] O’Hara, R., Rankin, O., & Marino, S. Sharing Reality: Holographic Photography. Uncorporeal Systems, Medium. Retrieved from https://medium.com/@uncorporeal/holographic-photography-1-8cd190958751

[2] Huang, J., Chen, Z., Ceylan, D., & Jin, H. (n.d.). 6-DOF VR Videos with a Single 360-Camera. Retrieved from http://stanford.edu/~jingweih/papers/6dof.pdf

[3] Editor, S. H. (2017, November 29). What Tech Does Google Earth Use to Generate Its 3D Imagery? Retrieved December 12, 2017, from https://www.spar3d.com/blogs/the-other-dimension/tech-google-earth-use-generate-3d-imagery/

[4] DiZio, P., & Lackner, J. R. Circumventing Side Effects of Immersive Virtual Environments. Retrieved from http://www.brandeis.edu/graybiel/publications/docs/158_circumventing_side_eff.pdf

[5] LaViola, J. A Discussion of Cybersickness in Virtual Environments. Retrieved from http://www.eecs.ucf.edu/~jjl/pubs/cybersick.pdf

[6] Kopf, J. (2016, August 31). 360 video stabilization: A new algorithm for smoother 360 video viewing. Retrieved from https://code.facebook.com/posts/697469023742261/360-video-stabilization-a-new-algorithm-for-smoother-360-video-viewing/

[7] Guan, H., & Smith, W. A. (2017, February). Structure-From-Motion in Spherical Video Using the von Mises-Fisher Distribution. Retrieved from http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7707455

[8] Khan, M., Trujano, F., Choudhury, A., Sirera, J., & Maes, P., (2017, December 12). Mathland: Play with Math in Mixed Reality. Retrieved December 12, 2017, from https://vimeo.com/237324812

[9] Jim Cash. 2016. Mathland - Seymour Papert. YouTube. Retrieved September 15, 2017 from https://www.youtube.com/watch?v=UgE05-3SToc

[10] Edwin A. Abbott. 2006. Flatland: A Romance of Many Dimensions. OUP Oxford. Retrieved from https://market.android.com/details?id=book-c_oXCi3ISd0C

[11] Mark H. Ashcraft. 2002. Math Anxiety: Personal, Educational, and Cognitive Consequences. Current directions in psychological science 11, 5: 181-185. https://doi.org/10.1111/1467-8721.00196

[12] Sian L. Beilock and Daniel T. Willingham. 2014. Math Anxiety: Can Teachers Help Students Reduce It? Ask the Cognitive Scientist. American educator 38, 2: 28. Retrieved from https://eric.ed.gov/?id=EJ1043398

[13] Jo Boaler. 1998. Open and Closed Mathematics: Student Experiences and Understandings. Journal for Research in Mathematics Education 29, 1: 41-62. https://doi.org/10.2307/749717

[14] Michelene T. H. Chi, Paul J. Feltovich, and Robert Glaser. 1981. Categorization and representation of physics problems by experts and novices. Cognitive science 5, 2: 121-152.

[15] John R. Anderson, Lynne M. Reder, and Herbert A. Simon. 1996. Situated Learning and Education. Educational researcher 25, 4: 5-11.

[16] Paul Cobb and Janet Bowers. 1999. Cognitive and Situated Learning Perspectives in Theory and Practice. Educational researcher 28, 2: 4-15.

[17] James G. Greeno. 1998. The situativity of knowing, learning, and research. The American psychologist 53, 1: 5. Retrieved from http://psycnet.apa.org/journals/amp/53/1/5/

[18] John Seely Brown, Allan Collins, and Paul Duguid. 1989. Situated cognition and the culture of learning. Educational researcher 18, 1: 32-42.

[19] Shaun Gallagher. 2006. How the body shapes the mind . Clarendon Press.

[20] Arthur M. Glenberg, Tiana Gutierrez, Joel R. Levin, Sandra Japuntich, and Michael P. Kaschak. 2004. Activity and imagined activity can enhance young children’s reading comprehension. Journal of educational psychology 96, 3: 424.

[21] Autumn B. Hostetter and Martha W. Alibali. 2008. Visible embodiment: gestures as simulated action. Psychonomic bulletin & review 15, 3: 495-514. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/18567247

[22] Martha W. Alibali and Mitchell J. Nathan. 2012. Embodiment in mathematics teaching and learning: Evidence from learners’ and teachers’ gestures. Journal of the Learning Sciences 21, 2: 247-286.

[23] Susan Goldin-Meadow, Susan Wagner Cook, and Zachary A. Mitchell. 2009. Gesturing gives children new ideas about math. Psychological science 20, 3: 267-272. https://doi.org/10.1111/j.1467-9280.2009.02297.x

[24] Mitchel Resnick, Amy Bruckman, and Fred Martin. 1998. Constructional design: creating new construction kits for kids. In The design of children’s technology, 149-168.

[25] Mitchel Resnick, Fred Martin, Robert Berg, Rick Borovoy, Vanessa Colella, Kwin Kramer, and Brian Silverman. 1998. Digital manipulatives: new toys to think with. In Proceedings of the SIGCHI conference

[26] James Paul Gee. 2003. What video games have to teach us about learning and literacy. Computers in Entertainment (CIE) 1, 1: 20-20.

[27] Angela McFarlane, Anne Sparrowhawk, Ysanne Heald, and Others. 2002. Report on the educational use of games.

[28] Diana I. Cordova and Mark R. Lepper. 1996. Intrinsic motivation and the process of learning: Beneficial effects of contextualization, personalization, and choice. Journal of educational psychology 88, 4: 715.

[29] Spectator view. (n.d.). Retrieved December 11, 2017, from https://developer.microsoft.com/en-us/windows/mixed-reality/spectator_view

[30] George Lakoff and Rafael Núñez. 2000. Where mathematics comes from: How the embodied mind brings mathematics into being. Basic Books.

[31] Home. figur8 . Retrieved September 17, 2017 from https://www.figur8.me/

[32] Music Learning Software for Educators & Students. (n.d.). Retrieved December 11, 2017, from https://www.smartmusic.com/

[33] Synthesia, Piano for Everyone. (n.d.). Retrieved December 11, 2017, from https://www.synthesiagame.com/

[34] Xiao, X. (n.d.). Andante: A walking tempo. Retrieved December 11, 2017, from http://portfolio.xiaosquared.com/Andante

[35] Xiao, X. (n.d.). PerpetualCanon. Retrieved December 11, 2017, from http://portfolio.xiaosquared.com/Perpetual-Canon

[36] Campbell, P. S., Scott-Kassner, C., & Kassner, K. (2014). Music in childhood: from preschool through the elementary grades. Boston, MA: Schirmer Cengage Learning.

[37] Gault, B. (2005). Music Learning Through All the Channels: Combining Aural, Visual, and Kinesthetic Strategies to Develop Musical Understanding. General Music Today, 19(1), 7-9. doi:10.1177/10483713050190010103

[38] Morris, C. (2016). Making sense of education: sensory ethnography and visual impairment. Ethnography and Education, 12(1), 1-16. doi:10.1080/17457823.2015.1130639

[39] Harmonic Coloring: A method for indicating pitch class. (n.d.). Retrieved December 11, 2017, from http://www.musanim.com/mam/circle.html

[40] Color Music Theory. (n.d.). Retrieved December 11, 2017, from https://www.facebook.com/Virtuosoism/

[41] How does Muse work? (n.d.). Retrieved December 11, 2017, from http://www.choosemuse.com/how-does-muse-work/

[42] Takahashi, K. (2016, November 05). MidiJack. Retrieved December 11, 2017, from https://github.com/keijiro/MidiJack

[43] Tone.js. (n.d.). Retrieved December 11, 2017, from https://tonejs.github.io/
