Field Service Support with Google Glass and WebRTC

Support av Fälttekniker med Google Glass och WebRTC

PATRIK OLDSBERG

Degree Project in Computer Engineering, First Cycle, 15 credits. Supervisor at KTH: Reine Bergström. Examiner: Ibrahim Orhan. TRITA-STH 2014:68

KTH School of Technology and Health, 136 40 Handen, Sweden

Abstract

The Internet is dramatically changing the way we communicate, and it is becoming increasingly important for communication services to adapt to the context in which they are used. The goal of this thesis was to research how Google Glass and WebRTC can be used to create a communication system tailored for field service support. A prototype was created where an expert is able to provide guidance for a field technician who is wearing Google Glass. A live video feed is sent from Glass to the expert, from which the expert can select individual images. When a still image is selected it is displayed to the technician through Glass, and the expert is able to provide instructions using real time annotations. An algorithm that divides the selected image into segments was implemented using WebGL. This made it possible for the expert to highlight objects in the image by clicking on them. The thesis also investigates different options for accessing the hardware video encoder on Google Glass.

Sammanfattning

Internet har dramatiskt ändrat hur vi kommunicerar, och det blir allt viktigare för kommunikationssystem att kunna anpassa sig till kontexten som de används i. Målet med det här examensarbetet var att undersöka hur Google Glass och WebRTC kan användas för att skapa ett kommunikationssystem som är skräddarsytt för support av fälttekniker. En prototyp skapades som låter en expert ge väg- ledning åt en fälttekniker som använder Google Glass. En videoström skickas från Glass till experten, och den- ne kan sedan välja ut enstaka bilder ur videon. När en stillbild väljs så visas den upp på Glass för teknikern, och experten kan sedan ge instruktioner med hjälp av realtidsannoteringar. En algoritm som delar upp den utvalda bilden i seg- ment implementerades med WebGL. Den gjorde det möj- ligt för experten att markera objekt i bilden genom att klicka på dem. Examensarbetet undersöker också olika sätt att få tillgång till hårdvarukodaren för video i Google Glass.

Contents

1 Introduction
  1.1 Background
  1.2 Problem Definition
  1.3 Research Goals and Contributions
  1.4 Limitations

2 Literature Review
  2.1 Wearable Computers
    2.1.1 History of Wearable Computers
  2.2 Augmented Reality
    2.2.1 Direct vs. Indirect Augmented Reality
    2.2.2 Video vs. Optical See-Through Display
    2.2.3 Effect on Health
    2.2.4 Positioning
  2.3 Collaborative Communication
  2.4 Image Processing
    2.4.1 Edge Detection
    2.4.2 Noise Reduction Filters
    2.4.3 Hough Transform
    2.4.4 Image Region Labeling

3 Technology
  3.1 Google Glass
    3.1.1 Timeline
    3.1.2 Interaction
    3.1.3 Microinteractions
    3.1.4 Hardware
    3.1.5 Development
    3.1.6 Display
  3.2 Alternative Devices
    3.2.1 Meta Pro
    3.2.2 Vuzix M100
    3.2.3 Oculus Rift Development Kit 1/2
    3.2.4 Recon Jet
    3.2.5 XMExpert
  3.3 Web Technologies
    3.3.1 WebRTC
    3.3.2 WebGL
  3.4 Mario
    3.4.1 GStreamer
  3.5 Video Encoding
    3.5.1 Hardware Accelerated Video Encoding

4 Implementation of Prototype
  4.1 Baseline Implementation
  4.2 Hardware Accelerated Video Encoding
    4.2.1 gst-omx on Google Glass
  4.3 Ideas to Implement
    4.3.1 Annotating the Technician's View
    4.3.2 Aligning Annotations and Video
  4.4 Still Image Annotation
    4.4.1 WebRTC Data Channel
    4.4.2 Out of Order Messages
  4.5 Image Processing
    4.5.1 Image Processing using WebGL
    4.5.2 Image Segmentation using Hough Transform
    4.5.3 Image Segmentation using Median Filters
    4.5.4 Region Labeling
  4.6 Glass Application
    4.6.1 Configuration
    4.6.2 OpenGL ES 2.0
    4.6.3 Photographs by Technician
  4.7 Signaling Server
    4.7.1 Sessions and Users
    4.7.2 Server-Sent Events
    4.7.3 Image Upload

5 Result
  5.1 Web Application
  5.2 Google Glass Application

6 Discussion
  6.1 Analysis of Method and the Result
    6.1.1 Live Video Annotations
    6.1.2 Early Prototype
    6.1.3 WebGL
    6.1.4 OpenMAX
    6.1.5 Audio
  6.2 Further Improvements
    6.2.1 Image Segmentation
    6.2.2 Video Annotations
    6.2.3 UX Evaluation
    6.2.4 Gesture and Voice Input
    6.2.5 More Annotation Options
    6.2.6 Logging
  6.3 Reusability
  6.4 Capabilities of Google Glass
  6.5 Effects on Human Health and the Environment
    6.5.1 Environmental Impacts
    6.5.2 Health Concerns

7 Conclusion

Bibliography

Introduction

1.1 Background

The Internet is dramatically changing the way we communicate, and it is becoming increasingly important for communication services to be adaptable to the context in which they are used. They also need to be flexible enough to be able to integrate into new contexts without excessive effort. Using a wearable device allows a communication system to be tailored for the context to a greater extent. With additional information available, such as movement, heart rate or the perspective of the user, a richer user experience can be achieved. Wearable devices have huge potential in many different business fields. A rapidly emerging form of wearable device is the head-mounted display (HMD). HMDs have the advantage of being able to display information in a hands-free format, which has huge potential for businesses such as medicine and field service. Perhaps the most recognized wearable device at the moment is Google Glass, which for brevity will sometimes be referred to simply as ‘Glass’.

1.2 Problem Definition

The focus of the thesis involved a generalized use case where a field service technician has traveled to a remote site to solve a problem. The technician is equipped with Google Glass, or any equivalent HMD. While on his or her way to the site, the technician has information such as the location of the site and the support ticket available. The technician can look up information such as the hardware on the site, the expected spare parts to resolve the issue, and recent tickets for the same site. Once the technician has arrived at the site, the back office support will be notified. When at the site, the technician can view manuals and use the device to document the work. If the problem is more complicated than expected, or the technician is unable to resolve the issue for some other reason, the device can be used to call an expert in the back office support. The purpose of this thesis was to research how a contextual communication system can be tailored for this use case. The part that was in focus was the call to

the back office support, which is made after the technician has arrived on site and requires assistance to resolve the issue.

1.3 Research Goals and Contributions

The goal was to research collaborative communication and find ways to tailor a communication system for this specialized kind of call. Different ways to give instructions and display information to the wearer of an HMD were investigated, as well as how these could be implemented. An experimental prototype using some of the ideas was then constructed. The prototype was implemented using web real-time communication (WebRTC). At the time when the prototype was built there was no implementation of WebRTC available on Google Glass. A framework called Mario, developed internally at Ericsson Research, was used as the WebRTC implementation, as it runs on Android among other platforms. The prototype comprised several different subsystems that were all implemented from scratch:

• A web application built with HTML5, WebGL and WebRTC technology.

• A NodeJS server acting as web server and signaling server (a minimal sketch of this role is shown after the list).

• A Google Glass application using the Glass Development Kit (GDK).
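To make the signaling role of the server concrete, the sketch below shows one way a NodeJS process could relay WebRTC signaling messages between the participants of a session using server-sent events. It is only an illustration, not the thesis implementation; the endpoint names (/events, /signal) and the session query parameter are hypothetical.

```javascript
// Minimal signaling-relay sketch (assumed design, not the thesis code).
// Clients subscribe to a session via Server-Sent Events and POST JSON
// signaling messages that are forwarded to all subscribers in the session.
const http = require('http');
const url = require('url');

const sessions = {}; // sessionId -> list of open SSE responses

http.createServer((req, res) => {
  const { pathname, query } = url.parse(req.url, true);
  const session = query.session || 'default';

  if (req.method === 'GET' && pathname === '/events') {
    // Register this client for server-sent events.
    res.writeHead(200, {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive',
    });
    (sessions[session] = sessions[session] || []).push(res);
    req.on('close', () => {
      sessions[session] = sessions[session].filter((r) => r !== res);
    });
  } else if (req.method === 'POST' && pathname === '/signal') {
    // Relay the posted signaling message (SDP offer/answer or ICE
    // candidate) to every subscriber in the session.
    let body = '';
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      (sessions[session] || []).forEach((client) => {
        client.write('data: ' + body + '\n\n');
      });
      res.writeHead(204);
      res.end();
    });
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);
```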

An evaluation of the prototype was done with focus on how it could be further improved, and whether any of the implemented ideas can be applied to similar use cases and devices. An evaluation of the capabilities of Google Glass with regard to media processing and the performance of Mario was also performed.

1.4 Limitations

The time limit of the thesis is ten weeks, therefore a number of limitations were made so that it would be completable within this limit. No in-depth evaluation of the user experience (UX) of the prototype would be done. The prototype was to be designed with UX in mind, using the result of the initial research, but no further evaluation would be done. A broad enough UX evaluation was not considered possible within the limited time frame. The prototype would not be optimized for battery life and bandwidth usage. These restrictions are of course important and the limitations imposed by them would be taken into consideration, but no analysis or optimization would be done to find optimal solutions with regard to these issues.

Literature Review

2.1 Wearable Computers

A wearable computer is an electronic device that is worn by the user. It often takes the form of a watch or head-mounted display, but can also be worn on other parts of the body or e.g. be sewn into the fabric of clothing.

“An important distinction between wearable computers and portable computers (handheld and laptop computers for example) is that the goal of wearable computing is to position or contextualize the computer in such a way that the human and computer are inextricably intertwined, so as to achieve Humanistic Intelligence —i.e. intelligence that arises by having the human being in the feedback loop of the computational process.” —Steve Mann [1]

The definition of what a wearable computer is has changed over time. By some definitions a digital clock with an alarm from the 90s is a wearable computer, but that is not what we think of as a wearable computer today.

2.1.1 History of Wearable Computers

The first wearable computer was invented by Edward O. Thorp and Claude Shannon [2]. They built a device which helped them cheat at roulette, with an expected gain of 44%. It was a cigarette-pack-sized analog device that could predict in which octant of the table the ball was most likely to stay. The next step in wearable computing was the first version of Steve Mann's EyeTap from 1981 [3]. It consisted of a computer in a backpack wired up to a camera and its viewfinder, which were mounted on a helmet. EyeTap has since then been significantly reduced in size and the only visible part is an eyepiece used to record and display video. The eyepiece uses a beam splitter to reflect some of the incoming light to a camera. The camera then records the incoming image and sends it to a computer, which in turn processes the image and sends it to a projector. The projector is then able to overlay images on top of the user's normal view by projecting them onto the other side of the beam splitter. In the mid 90s the ‘wrist computer’ was invented by Edgar Matias and Mike Ruicci, and it had a very different interaction method. The entire device was

strapped to the wrist and was marketed as a recording device, although it was significantly larger than today's smartwatches [4]. Even though there had been a number of great inventions, wearable devices had not gained any commercial traction at the turn of the millennium [5]. In the following years a few new devices were created, but it took another decade until wearable devices started getting public recognition. In 2009 Glacier Computers designed the W200, a wrist computer that could run both Linux and Windows CE and had a 3.5 inch color touch screen. It was intended for use in security, defence, emergency services, and field logistics and could be equipped with Bluetooth, Wi-Fi and GPS [6]. In 2010 and the following years there was a newfound interest in the idea of smart wearable technology. A more practical alternative to the W200 became available with the optional attachment for the 6th generation iPod Nano which turned it into a wrist computer. In April 2012, Pebble Technology launched a Kickstarter campaign for their Pebble smartwatch [7]. The initial fundraising target was set to $100,000, and it was reached within two hours of going live. It took six days for the project to become the highest funded Kickstarter project so far. The fundraising closed after a little over a month with $10,266,844 raised, more than a hundred times the initial goal. The success of the Pebble smartwatch was one of the indicators that wearable devices were getting some widespread recognition. Another example is the Oculus Rift, which is a head-mounted display [8]. Just like Pebble, the Oculus Rift had a successful Kickstarter campaign, raising over $2.4M with a $250,000 target. Another device unveiled in April 2012 was Google Glass, which we will look at in depth in the next chapter (3.1). These devices were only a few of many that had reached the market or were about to be released. Some other smartwatches are the Sony SmartWatch [9], Samsung's Galaxy Gear [10], and a rumored smartwatch from Apple [11, 12]. More important to this thesis were HMDs that were being developed or already released, such as the Meta Pro [13], Vuzix M100 [14], and Recon Jet [15].

2.2 Augmented Reality

Augmented reality (AR) is a variation of virtual reality (VR) where the real world is mixed with virtual objects [16]. VR completely immerses the user in a virtual world; the user is unable to see the real world. AR, on the other hand, keeps the user in the real world and superimposes virtual objects onto it. VR replaces reality whereas AR supplements it, enabling it to be used as a tool for communication and collaboration when solving problems in the real world. A similar concept is augmented virtuality (AV), which is a middle point between AR and VR. It allows the user to still be present in the real world, but it

is mixed with a virtual world to a further extent than in AR. Physical objects are dynamically integrated into the virtual world and shown to the user through their virtual representation [17]. AR can be divided into two categories, direct AR and indirect AR [18]. There are also multiple ways in which AR can be achieved; for this thesis the focus was on video and optical see-through displays.

2.2.1 Direct vs. Indirect Augmented Reality

Indirect AR is when the superimposed AR elements are not seen with the same alignment or perspective as the user's normal view [18]. Indirect AR is achieved by displaying an image of reality to the user, and overlaying elements on this image. It is possible to use this technique to achieve direct AR, but the image then has to be taken from the user's perspective, and displayed in alignment with the user's eyes. An example of indirect AR is the iOS and Android application Word Lens [19]. The application translates text by superimposing a translation on top of the original text. The text is recorded using the back facing camera on the device, and the result is displayed on the device's screen. This is indirect AR, because the perspective and alignment of the screen is different from the user's. Direct AR can be emulated using this application by holding the device in alignment with one's eyes. An example of direct AR is the system used to help workers heat up a precise line on a plate presented in [20]. The system uses AR markers to show a line where the worker is supposed to heat the plate. It uses a video see-through system that is aligned with the user's field of view and is mounted directly in front of one's eye. Even though Google Glass is often referred to as an AR device, and has a forward facing camera and a transparent display, it is not capable of direct AR. It is perhaps possible to calibrate an application to allow overlaying objects on the area that is behind the display, but this is an ineffective method as the display only takes up a very small area of the user's field of view. It would also be difficult, if not impossible, for the calibration to remain accurate as the display's position relative to the user's eyes will change when moving. Developing for direct AR devices provides more possibilities, but also places higher demands on the application. It is possible for an indirect AR application to show video with a high latency, or even still images. When using direct AR however, the latency is crucial, especially with a video see-through display [21]. Direct AR also requires precise calibration of the display, and sometimes a camera, in relation to the user's normal view [22].

2.2.2 Video vs. Optical See-Through Display

Video see-through can be used for both direct and indirect AR, while optical see-through can only be used for direct AR.


Optical see-through uses a transparent display that is able to show images independently on different parts of the screen. This makes it possible to display only the elements that are overlaid on reality, as the user is able to see the world through the transparent display [23]. Video see-through uses an opaque display, and the user's view is recorded with one or more cameras. The elements are superimposed on the recorded video, and the augmented video is displayed to the user. This technique requires very low latency and a precise alignment of the cameras with the user's usual field of view [21].

2.2.3 Effect on Health

When using a video see-through display for direct AR, it is important that the display is correctly calibrated for the user's normal field of view. When the perspective of the video displayed to the user differs heavily from the expected view, it takes a long time for the brain to adapt to the new perspective. But when the user takes off the display, it will only take a short time to restore the view. On the other hand, if the perspective is almost correct, but differs slightly, it will take the brain a short time to adjust, but an extended period of time to restore the view. During this period the user might experience motion sickness with symptoms such as nausea. Using an HMD for an extended period of time may cause symptoms such as nausea and eye strain. Age has some significance for the eye strain, with older users experiencing more apparent symptoms. It is also possible that females have a higher susceptibility to motion sickness caused by HMDs, although this might be the result of males being less likely to accurately report their symptoms [24].

2.2.4 Positioning

In many AR applications it is important to have knowledge of the device's position and orientation in relation to the surrounding space or objects. An AR system that has an accurate understanding of a user's position will be able to create a far more immersive experience, and often it is required for the system to be of any use at all. Some sensors that can be used to help determine the position and orientation of a device are:

• Gyroscope

• Compass

• Accelerometer

• Global Positioning System (GPS)

• Camera


Sensing Orientation

An accurate, responsive and normalized measurement of the device's orientation in relation to the earth can be acquired using combined data from the gyroscope, accelerometer and compass. The accelerometer is used to find the direction of gravity. It provides a noisy but accurate measurement of the device's tilt and roll in relation to the earth [25]. In order to remove the noise, a low pass filter has to be used. The downside is that a low pass filter will add latency to the sensor data, but this is solved using the gyroscope. The gyroscope provides a responsive measurement of the device's rotation, but it cannot be used to reliably determine the rotation over time since that would cause the perceived angle to drift. It also has no sense of the direction of gravity or north. The drift is a result of how the rotation is calculated using the angular speed received from the gyroscope. In order to calculate the angle of the device, the angular speed has to be integrated. With a perfect gyroscope, this would not be an issue, but all gyroscopes will add at least a small amount of noise. When noise is integrated it is converted to drift, see Figure 2.1. The drift is compensated for using the accelerometer and compass. The compass cannot be used on its own because the output is noisy, and it needs to be tilt compensated, i.e. aligned with the horizontal plane of the earth [26]. In the end, the gyroscope is used to calculate the orientation of the device, which is then corrected with regard to the direction of gravity by the accelerometer, and the direction of the magnetic north pole by the compass.

Figure 2.1: Noise vs. Drift
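The thesis does not list its fusion code, but the idea above (integrate the responsive gyroscope and continuously correct it with the drift-free but noisy accelerometer) can be illustrated with a single-axis complementary filter. The sketch below is written in JavaScript for consistency with the other examples in this document; the 0.98/0.02 weighting and the helper names are illustrative assumptions, not values from the prototype.

```javascript
// Complementary-filter sketch for one tilt axis (pitch): integrate the
// responsive but drifting gyroscope, and continuously nudge the result
// toward the noisy but drift-free angle derived from the accelerometer.
function createTiltEstimator(alpha = 0.98) {
  let angle = 0; // estimated pitch in radians

  return function update(gyroRate, accel, dt) {
    // gyroRate: angular speed around the pitch axis (rad/s)
    // accel:    { x, y, z } accelerometer sample (m/s^2)
    // dt:       time since the previous sample (s)
    const accelAngle = Math.atan2(accel.y, accel.z); // gravity direction
    const gyroAngle = angle + gyroRate * dt;         // integrated rotation
    angle = alpha * gyroAngle + (1 - alpha) * accelAngle;
    return angle;
  };
}

// Usage: const update = createTiltEstimator();
//        const pitch = update(0.01, { x: 0, y: 0.4, z: 9.7 }, 0.02);
```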

Sensing Position

Finding the position is a far greater challenge than determining the orientation of a device. GPS is only usable outdoors and provides a position with an error of a few meters [27]. This makes it unusable for any AR system that is not viewed at a large scale, e.g. navigation. The output from the accelerometer is acceleration, which can be integrated once to calculate speed, and twice to calculate position. Like the gyroscope, the output from the accelerometer is noisy, and as explained, integrating noise causes drift. This effect is amplified when noise is integrated twice; the perceived position of the

17 CHAPTER 2. LITERATURE REVIEW device can drift several inches in a second [28]. The largest problem is not the double integration of noise, but small errors in the computed orientation of the device. When calculating the position of the device, the linear acceleration needs to be used, i.e. the observed acceleration minus gravity. The value of the linear acceleration is obtained by subtracting gravity from the observed acceleration using the orientation calculated with the method described in the previous section. The dependency on a correct value of the orientation of the device turns out to be a huge problem when trying to calculate the position of the device. A small error in the orientation will lead to a very large error in the calculated position. If the computed orientation is o by 1%, the device will be perceived to move at 8 1 ms≠ [28]. This means that the only way to accurately determine the position of the device using the listed sensors is with the camera, using image processing. A common method of achieving this is by using markers, figures that are easily recognizable by a camera [29]. This technique will often only determine the markers’ position in relation to the device, and not the devices position in relation to the world. It is also possible to use data from the accelerometer to assist in marker detection with methods such as the one described in [30].

2.3 Collaborative Communication

A central part of the research was the interaction between the expert and the technician. The expert should have a better way of giving instructions than simply providing voice guidance while viewing a live video feed. The study presented in [31] explores different methods for remote collaboration. Two pilot studies were conducted that give an understanding of the difference in effectiveness between the different methods. The study did not come to a clear conclusion as to which method was the most effective, but it did rule out some solutions. The study compared the effectiveness of instructions given using a single pointer (a red dot controlled by the instructor) and real time annotations by drawing. It also compared the use of a live video feed and a shared still image. In total, the compared methods were:

1. Pointer on a Still Image

2. Pointer on Live Video

3. Annotation on a Still Image

4. Annotation on Live Video

The tests were carried out using a pair of software applications. The person that was receiving the instructions used a tablet to record video or capture images,

while the instructor used a PC application. The instructor was given an image of a number of objects arranged in a specific manner, while the other person had the same objects laying in front of them in a different arrangement. The instructor then used the remote collaboration tool to tell the other person how the objects should be arranged. The data presented was: task completion time, number of errors made, and total distance the mouse was moved by the instructor. The study clearly showed that method 1 was ineffective when considering the time taken to complete the task. It took on average 25% longer than any other method. The other results were less clear; methods 3 and 4 had a slight advantage over 2, while 3 and 4 were equivalent to each other. In a second test, where only methods 2 and 4 were included, it was shown that it was important to have a method of erasing the annotations. If the drawings were not removed, the old drawings would confuse the user. The tests also showed that the pointer based method caused the instructor to move the mouse roughly five times as far compared to the annotation based method.

2.4 Image Processing

As a part of the collaborative communication interface, one of the ideas chosen to be implemented in the prototype was highlighting of objects in an image. Therefore, different methods for detecting objects in an image were investigated, as well as methods of representing the region that constituted the object in a way that is simple to transmit over the network.

2.4.1 Edge Detection

Edge detection is a method for identifying points in an image where a given property of the components has discontinuities. The method can be used to detect changes in e.g. color or brightness and is an important tool in image processing [32]. An example of edge detection by color is displayed in figure 2.2. Figure 2.2a shows an arbitrary image, while figure 2.2b shows a possible result of an edge detection.

Figure 2.2: Edge detection by color. (a) Before. (b) After.


There are many different methods that can be used for edge detection. For this thesis, a very simple implementation was chosen, which is found in [33]. The method provides a baseline edge detection implementation which can be improved upon if required by the application.

The method is as follows:

For each pixel Ix, the average color difference of the opposing surrounding pixels is computed. Consider the arrangement of pixels surrounding any pixel Ix.

I2 I5 I8

I1 Ix I7

I0 I3 I6

The following formula is used to calculate the value of Ix.

$$I_x = \frac{|I_1 - I_7| + |I_5 - I_3| + |I_0 - I_8| + |I_2 - I_6|}{4} \tag{2.1}$$

The result from (2.1) is then multiplied by a scale factor s and compared to two thresholds, Tmin and Tmax, using the following formulas:

$$I'_x = s I_x \tag{2.2}$$

$$I''_x = \begin{cases} 0 & \text{if } I'_x < T_{\min} \text{ or } I'_x > T_{\max} \\ I'_x & \text{if } T_{\min} \le I'_x \le T_{\max} \end{cases} \tag{2.3}$$

The scale factor s and the thresholds Tmin and Tmax are all parameters to the filter, and will give varying results depending on their values.
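To make the filter concrete, the sketch below applies equations (2.1) to (2.3) to a single grayscale channel on the CPU. The prototype does its image processing in WebGL (see section 4.5.1); this plain JavaScript version is only an illustration, and the function and parameter names are assumptions.

```javascript
// Edge measure from equations (2.1)-(2.3) on a grayscale image stored
// row-major in a Float32Array. Border pixels are left at 0.
function edgeFilter(src, width, height, s, tMin, tMax) {
  const dst = new Float32Array(width * height);
  const at = (x, y) => src[y * width + x];

  for (let y = 1; y < height - 1; y++) {
    for (let x = 1; x < width - 1; x++) {
      // Average absolute difference of the four opposing pixel pairs.
      const ix = (
        Math.abs(at(x - 1, y) - at(x + 1, y)) +         // |I1 - I7|
        Math.abs(at(x, y - 1) - at(x, y + 1)) +         // |I5 - I3|
        Math.abs(at(x - 1, y + 1) - at(x + 1, y - 1)) + // |I0 - I8|
        Math.abs(at(x - 1, y - 1) - at(x + 1, y + 1))   // |I2 - I6|
      ) / 4;
      const scaled = s * ix;
      // Keep values inside [tMin, tMax], suppress everything else.
      dst[y * width + x] =
        (scaled >= tMin && scaled <= tMax) ? scaled : 0;
    }
  }
  return dst;
}
```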

2.4.2 Noise Reduction Filters

An important part of the image processing was to reduce the noise in the image, improving the result of the edge detection. A few methods of noise reduction were explored:

Gaussian Blur

Gaussian blur is a low-pass filter that blurs the image using a Gaussian function. It calculates the sum of all pixels in an area, after applying a weight to each pixel.


The weights are computed using the two-dimensional Gaussian function 2.5, a composition of the one-dimensional Gaussian function 2.4.

$$G(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \tag{2.4}$$

$$G(x, y) = G(x) \cdot G(y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}} \tag{2.5}$$

When implementing a Gaussian filter, a matrix is computed using the Gaussian function and subsequently normalized. A new value of every pixel in the image is then calculated by multiplying it and the surrounding pixels by the corresponding matrix component. This operation can be divided into two steps, horizontal blur and then vertical, or vice versa. Doing this yields the same result, but fewer calculations are needed.
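The separable form described above can be sketched as follows. This is an illustrative JavaScript implementation, not code from the prototype; the kernel radius and sigma are parameters chosen by the caller.

```javascript
// Build a normalized 1-D Gaussian kernel from equation (2.4).
function gaussianKernel(radius, sigma) {
  const kernel = [];
  let sum = 0;
  for (let i = -radius; i <= radius; i++) {
    const w = Math.exp(-(i * i) / (2 * sigma * sigma));
    kernel.push(w);
    sum += w;
  }
  return kernel.map((w) => w / sum); // weights sum to 1
}

// Convolve in one direction with the 1-D kernel; running it again in the
// other direction completes the separable blur, equivalent to using (2.5).
function blur1D(src, width, height, kernel, horizontal) {
  const dst = new Float32Array(src.length);
  const r = (kernel.length - 1) / 2;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      let acc = 0;
      for (let k = -r; k <= r; k++) {
        const sx = horizontal ? Math.min(width - 1, Math.max(0, x + k)) : x;
        const sy = horizontal ? y : Math.min(height - 1, Math.max(0, y + k));
        acc += kernel[k + r] * src[sy * width + sx];
      }
      dst[y * width + x] = acc;
    }
  }
  return dst;
}
```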

Median Filter

A median filter is a nonlinear filtering technique which is often used to remove noise from images. It preserves edges in an image, which makes it well suited for segmenting an image. A filter such as Gaussian blur will remove noise, but also blur edges, which can make a subsequent edge detection filter less effective [34]. A median filter is similar to Gaussian blur in that each pixel's value is calculated using the surrounding pixels, but instead of calculating a weighted average, the median of the surrounding pixels is used [35]. This causes common colors in an area to become more dominant, while removing noise.
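A straightforward, unoptimized 3x3 median filter can be sketched as below. The prototype's WebGL version differs; this is only meant to show the operation itself.

```javascript
// 3x3 median filter on a grayscale image: each output pixel is the median
// of its neighborhood, which removes noise while keeping edges sharper
// than a Gaussian blur would. Border pixels are left at 0.
function medianFilter3x3(src, width, height) {
  const dst = new Float32Array(src.length);
  for (let y = 1; y < height - 1; y++) {
    for (let x = 1; x < width - 1; x++) {
      const window = [];
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++) {
          window.push(src[(y + dy) * width + (x + dx)]);
        }
      }
      window.sort((a, b) => a - b);
      dst[y * width + x] = window[4]; // middle of the 9 sorted values
    }
  }
  return dst;
}
```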

2.4.3 Hough Transform

The Hough transform is a technique which can be used to isolate features of a particular shape within an image [36]. It is applied to a binary image, typically after applying an edge detection method. The simplest form is the Hough line transform, used to detect lines. The idea of the Hough transform is that each point in the image that meets a certain criterion is selected, e.g. all white pixels. Each point then votes for every line that could pass through that point, which in practice is a discrete number of lines selected depending on the desired performance and resolution. When every point has voted, the votes are counted and lines are selected using the result of the vote. The lines that are selected span the entire image, which means that a second algorithm has to be employed in order to find the line segments that were present in the image. An efficient implementation of the Hough transform requires the possibility to represent a line using two variables. For example, in the Cartesian coordinate system a line can be represented with the parameters (m, b), using y = mx + b. In the polar coordinate system a line is represented with the parameters (r, θ), where r = x cos θ + y sin θ. Because it is impossible to represent a line which is parallel

to the y-axis using y = mx + b, polar coordinates are used to represent a line in a Hough transform. Polar coordinates were not used initially; this improvement was suggested by Duda and Hart in [37]. The voting step of the Hough transform is typically implemented using a two-dimensional array called an accumulator. Each of the two dimensions represents all possible values of r and θ respectively. All fields in the accumulator are initialized to 0. When a vote is cast for a line represented by two distinct values of r and θ, the corresponding field in the accumulator is incremented. Once all votes are counted, the field in the accumulator with the highest value gives the r and θ values of the most prominent line in the image. Several lines can be found using various techniques such as finding local maxima, selecting a set number of lines, or finding all values above a threshold.
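A minimal sketch of the voting step and line selection might look as follows. The accumulator resolution and the vote threshold are illustrative parameters, and real implementations usually add local-maximum suppression.

```javascript
// Hough line transform sketch: every foreground pixel votes for all
// (r, theta) pairs describing lines through it, using r = x*cos(theta)
// + y*sin(theta). Lines with more votes than `threshold` are returned.
function houghLines(binary, width, height, thetaSteps = 180, threshold = 100) {
  const maxR = Math.ceil(Math.hypot(width, height));
  // Accumulator indexed by [thetaIndex][r + maxR] (r can be negative).
  const acc = Array.from({ length: thetaSteps }, () =>
    new Uint32Array(2 * maxR + 1));

  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      if (!binary[y * width + x]) continue; // only foreground pixels vote
      for (let t = 0; t < thetaSteps; t++) {
        const theta = (t * Math.PI) / thetaSteps;
        const r = Math.round(x * Math.cos(theta) + y * Math.sin(theta));
        acc[t][r + maxR]++;
      }
    }
  }

  // Collect every (r, theta) whose vote count exceeds the threshold.
  const lines = [];
  for (let t = 0; t < thetaSteps; t++) {
    for (let ri = 0; ri < acc[t].length; ri++) {
      if (acc[t][ri] > threshold) {
        lines.push({ r: ri - maxR, theta: (t * Math.PI) / thetaSteps });
      }
    }
  }
  return lines;
}
```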

2.4.4 Image Region Labeling

Region labeling, or connected-component labeling, is used to connect and label regions in an image. What connects the different regions can vary, but a binary image such as the one in figure 2.3 is typically used. In this case all the white regions are labeled [38]. An algorithm that connects the pixels in an image based on 4-way connectivity is presented in [39]. The algorithm is split into two passes: the first assigns temporary labels and remembers which labels are connected, while the second pass resolves all connected labels so that each region only has one label. The pixels are iterated through in row-major order, from the top left to the bottom right (see figure 2.3 for an example of labeled regions). Each pixel is checked for connectivity to the north and west, the pixels marked in table 2.1. For every pixel that is encountered that should be labeled, i.e. is white, the following steps are carried out:


N: the pixel directly above (north of) the current pixel
W: the pixel directly to the left (west of) the current pixel

Table 2.1: Connected Pixels

1. If neither the North nor the West pixel is labeled, assign a new label to the pixel and move on to the next pixel. Otherwise, move to the next step.

2. If only one of the North or West pixels is labeled, assign that label to the pixel and move on to the next pixel. Otherwise, move to the next step.

3. If both the North and West pixels are labeled, take the following steps.

   a) If the North and West pixels have the same label, assign that label to the pixel and move on to the next pixel. Otherwise, move to the next step.

   b) The North and West pixels must have different labels; record that the two labels are equivalent, assign the lower label of the two to the pixel, and move on to the next pixel.

The second pass consists of iterating through every pixel in the image once more. Any pixel whose label has an equivalent label with a lower value has its label reassigned to the lower value. After the second pass, all pixels are guaranteed to have the same label if and only if they are connected to each other.
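The two-pass algorithm can be sketched as below. The equivalence table is kept with a small union-find structure, which is one common way to implement the "record that the two labels are equivalent" step; the thesis does not prescribe this particular bookkeeping.

```javascript
// Two-pass, 4-connectivity region labeling on a binary image.
function labelRegions(binary, width, height) {
  const labels = new Int32Array(width * height); // 0 = background
  const parent = [0];                            // equivalence table
  const find = (l) => (parent[l] === l ? l : (parent[l] = find(parent[l])));
  let next = 1;

  // First pass: assign provisional labels and record equivalences.
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      if (!binary[y * width + x]) continue;
      const west = x > 0 ? labels[y * width + x - 1] : 0;
      const north = y > 0 ? labels[(y - 1) * width + x] : 0;
      if (!west && !north) {
        parent[next] = next;               // brand new label
        labels[y * width + x] = next++;
      } else if (west && north && west !== north) {
        const a = find(west), b = find(north);
        const lo = Math.min(a, b), hi = Math.max(a, b);
        parent[hi] = lo;                   // remember the labels touch
        labels[y * width + x] = lo;
      } else {
        labels[y * width + x] = west || north;
      }
    }
  }

  // Second pass: replace every label with the representative of its set.
  for (let i = 0; i < labels.length; i++) {
    if (labels[i]) labels[i] = find(labels[i]);
  }
  return labels;
}
```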


Technology

This chapter describes the hardware and software involved in the thesis.

3.1 Google Glass

Google Glass is a wearable computer in the form of an HMD. It has a camera, a horizontal touchpad, and an optical see-through display. Audio is received by the user through a bone conduction transducer, and there are one or two internal microphones [40]. The battery capacity is enough for a day of normal use, but will drain rapidly when e.g. recording a video [41]. The device is delivered with a pair of attachable shades, nose pads of varying size, and a Micro USB mono headphone. Google Glass runs a modified version of Android. At the beginning of the work on this thesis, the base Android version was 4.0.4 (Ice Cream Sandwich), but this was later changed to version 4.4 (KitKat) with the XE16 update. The device is configured using either a website or a companion application. The configuration is done either using Bluetooth, if using the smartphone application, or with a QR code, if done through the web page.

3.1.1 Timeline

The timeline is the central part of the interaction with Google Glass. It consists of cards, with each card being some content attached to a point in time. Cards can include simple static content such as a URL or an image. They can also be more complex and display dynamic content; such cards are called live cards. Live cards are used e.g. in the compass or navigation applications. The centerpiece of the timeline is the clock screen, which is the default screen to be shown when the display is activated. From this screen, the timeline can be scrolled in two directions, left and right. Scrolling to the right represents going back in time. When new cards are inserted in the timeline, they are placed next to the center. As newer cards are inserted, the older cards are gradually pushed to the right and eventually disappear once they are too old. Scrolling to the left reveals cards that are happening “now”. Some of the cards shown here are permanent, such as the weather, calendar, and settings cards. They

are mixed with cards that are temporarily inserted by the user, e.g. the navigation, stopwatch, and compass cards.

3.1.2 Interaction

The main methods of interaction with Google Glass are the touchpad and voice commands. The multi-touch touchpad is located on the outside of the device along the bearer's temple. The touchpad is currently the only way to navigate through the timeline, although newly inserted cards are shown if the screen is activated soon after they are received. Some of the cards in the timeline have a text at the bottom reading “ok glass”; the user can either tap the touchpad or say “ok glass” to activate these cards. If the card is activated using voice input, a list of all voice commands that are available to the user will be shown. If it is activated through touch input, the user can scroll through the same commands and activate them by tapping once more. The clock card also has the “ok glass” text at the bottom. When this card is activated it shows a list of all the voice commands that can be used to start new interactions with the device, such as “take a picture” and “get directions”. Developers can add their own applications to this list, as long as they adhere to a predetermined set of voice commands. There are two intended ways of activating the screen while wearing Google Glass: tilting the head backwards, or tapping the touchpad. Since the display can be activated by simply tilting one's head, it is possible to perform most common tasks using only voice input. This allows the device to be used completely hands-free. This has great potential for the field support use case, but if voice is used as the only input method, the application will be vulnerable to noisy environments, or crowded environments where the technician would prefer not to use voice input. A less apparent way of interacting with the device is an inward facing IR sensor. The sensor is able to detect when the user blinks. This is used by functionality built into the system which lets the user take pictures by winking [42].

3.1.3 Microinteractions

The design of Google Glass is focused around microinteractions [43]. The goal is to allow the user to go from intent to action in a short time. When a user takes a note there should only be a few seconds between when the user realizes that a note needs to be written down and when the note has been completed. When a wearer of Google Glass receives an email, it will take only a few seconds before the user is looking at the email and determining whether it is important, provided the user is not preoccupied. This reduction in time for simple interactions results in the removal of the barrier towards the world that technology creates around the user. When using a smartphone, if the user has invested 20 seconds to retrieve the device and navigate to the appropriate application, she is bound to devote even more time while it is in

her hand. An incoming SMS leads to a quick scan of the inbox, and why not check the email, and social media as well? By limiting the time it takes to do these small tasks, the time in which technology is out of the way can be increased [43].

3.1.4 Hardware

The hardware in Google Glass has not been made official, but a teardown and an analysis with ADB have revealed some information about what components are inside [40]:

• OMAP4430 SoC

• 1 GB RAM, 682 MB unreserved

• 16 GB Flash storage, 4 GB reserved by system

• 640x360 pixel display

• 5 MP camera capable of recording up to 720p

• Bone conduction transducer

• One or two internal microphones

• 802.11b/g Wi-Fi

• Bluetooth 4.0

• 3 axis gyroscope, accelerometer, magnetometer

• Ambient light sensor

• Proximity sensor

The CPU and RAM are comparable to a low-end smart phone such as the Motorola Moto G [44]. A notable component that is absent is a 3G or 4G modem. This is because Google Glass is not intended to be used as the only device carried by the user.

3.1.5 Development

Initially, the only way to develop applications for Google Glass that could run without rooting the device was through the Mirror API [45]. It allows developers to build web services, called Glassware, that interact with the users' timelines. Static cards can be added, removed and updated using the provided REST API. In late November 2013 the Glass Development Kit (GDK) was released. It was built on top of Android 4.0.4 and provides functionality such as live cards, simplified

touch gestures, user interface elements, and other components that streamline application development for Google Glass. In April 2014, the XE16 update was released. This update brought the Android version up to 4.4, which includes the new WebView and MediaRecorder APIs.

3.1.6 Display

The display in Google Glass is a 640 by 360 pixel display and is roughly the equivalent of looking at a 25 inch screen eight feet away. It is a transparent optical display, and it takes up a very small portion of the user's field of view, approximately 14° [46]. The display is positioned above the user's natural field of view, which means that in order to look at the display the user needs to look up. This keeps the display from obscuring the user's vision, even while it is active.
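As a rough check of the stated equivalence, assuming the 25 inch figure refers to the screen's diagonal and taking eight feet as 96 inches, the subtended angle is

$$\theta = 2\arctan\left(\frac{25/2}{96}\right) \approx 14.8^{\circ},$$

which is close to the reported 14° field of view.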

3.2 Alternative Devices

Other devices than Google Glass were explored to gain a broader understanding of wearable devices and to provide alternatives. The type of wearable computer that was the focus of this thesis was the head-mounted display, which meant that only such devices were evaluated. It is possible to use e.g. a smartwatch with a camera, or a chest-mounted projector such as SixthSense [47], for the use case, but HMDs were prioritized as it is more likely that a solution for Google Glass is reusable on another HMD. The list was also limited to devices that were commercially available or soon to be commercially available. The devices were compared to Google Glass with a few main aspects in mind:

• How similar is the hardware?

• What display technology is used, and how does it differ?

• How do the development platforms compare?

3.2.1 Meta Pro

Although the Meta Pro [13] might seem similar to Google Glass, using these would significantly change the user interaction. They provide direct AR through optical see-through displays as well as motion detection using depth sensors. They have significantly more powerful hardware directly available, as all the processing is done on a separate high-performance pocket computer capable of running desktop operating systems [48]. Applications are developed either using Meta's internally developed SDK or with Unity3D [49]. It is unlikely that any non-native Android application that is developed for Google Glass will be easily portable to this device. In the end this does not matter, as the user interaction is so fundamentally different that any

application designed with Google Glass in mind would have to be redesigned from the ground up.

3.2.2 Vuzix M100

The Vuzix M100 [14] is very similar to Google Glass; both have an OMAP 4 CPU, although the M100 has a slightly faster OMAP4460. The M100 also runs Android 4.0.4, has a similar display which only makes it capable of indirect AR, and shares many other hardware features. Any application developed for Google Glass would likely be easy to port to the M100. Differences between the two are that the M100 has an opaque display and multiple buttons instead of a touchpad.

3.2.3 Oculus Rift Development Kit 1/2

The Oculus Rift [8] is primarily intended for use as an alternative to a conventional display when playing computer games. The available development version does not have a transparent display or a front-facing camera, which makes it incapable of performing AR. This can be solved using custom modifications such as attaching a front-facing camera, which is potentially going to be included in the consumer version [50]. The Oculus Rift has to be connected to a control box, which in turn receives images from an external device. This makes it very impractical to use and unlikely to fit the use case.

3.2.4 Recon Jet

The Recon Jet [15] has a lot in common with Google Glass: it has a small optical HMD and is also incapable of direct AR. Since it also runs Android, but with an unknown CPU, it would likely be easy to port any application from Glass to run on the Recon Jet. This device has an advantage over Glass in that the touch sensor works in all weather conditions, and also while wearing gloves [51]. This can be a big advantage depending on the typical work environment of the field technician.

3.2.5 XMExpert

The XMExpert [52] is a complete solution for field service support, and is included here as a parallel solution rather than an alternative device. It includes a portable workstation for the expert, and a helmet-mounted AR system for the field technician. Instructions are given by the expert using his hands, which are overlaid on the technician's view in a similar way to the Vipaar system [53]. Just like the Oculus Rift, the AR helmet system was in itself quite large, and the expert also has to carry a backing unit when moving around. This makes the system

less suited for the entire use case, where the technician is able to use the same device to display information when traveling to the location as well as when working.

3.3 Web Technologies

The number of technologies available in web browsers has grown significantly over the past years; the days when browsers could only view static HTML pages are long gone. HTML5 is the latest version of HTML, the previous version being 4.01, which was released in 1999 [54]. It adds multiple features that can be used to allow rich content in the browser without the need for plugins.

3.3.1 WebRTC

RTC stands for Real-Time Communication, technology that enables peer-to-peer audio, video, and data streaming. WebRTC provides browsers with RTC capabilities, allowing browser-to-browser teleconferencing and data sharing without requiring plug-ins or third-party software [55]. WebRTC is currently supported by Google Chrome, Mozilla Firefox, Opera, and Opera Mobile. The implementation of WebRTC varies between the different browsers; not all features are implemented everywhere and some lack interoperability. This is partly because the WebRTC standard is still under debate [56]. The WebRTC API has three important interfaces: getUserMedia, RTCPeerConnection and RTCDataChannel [57]. getUserMedia is used to gain access to the user's video and audio recording devices through a LocalMediaStream object. When getUserMedia is called, the user is prompted to allow the use of the recording devices. The user is able to deny the request, protecting against malicious websites and recording at inconvenient times. The RTCPeerConnection interface handles the connection to another peer. The LocalMediaStream object retrieved with getUserMedia can be added to an RTCPeerConnection in order to allow the remote peer to receive a media stream from the user. Likewise, the RTCPeerConnection will receive remote streams when these are added by the peer. An RTCPeerConnection instance will emit events with signaling data that needs to be received by the other peer. How these signals should be transmitted is not included in the WebRTC specification, and needs a separate implementation. The RTCDataChannel interface provides a real-time data channel which transmits text over the Stream Control Transmission Protocol (SCTP) [58]. SCTP supports multiple channels that can be configured depending on what kind of reliability is required. By default the channels will retransmit lost packets and guarantee that packets arrive in order, but both of these behaviors can be switched off.
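A minimal browser-side sketch of how the three interfaces fit together is shown below, using the callback-style API that browsers exposed at the time (vendor prefixes omitted). sendToPeer and attachToVideoElement are assumed application helpers, and the handling of the remote peer's offer/answer is left out.

```javascript
// Sketch only: one side of a WebRTC call with a data channel.
const pc = new RTCPeerConnection({ iceServers: [] });

// Data channel, e.g. for annotations; ordered and reliable by default.
const channel = pc.createDataChannel('annotations');
channel.onmessage = (event) => console.log('annotation:', event.data);

// Forward locally gathered ICE candidates to the peer via the
// application's own signaling transport.
pc.onicecandidate = (event) => {
  if (event.candidate) sendToPeer({ candidate: event.candidate });
};
// Remote media streams arrive as the peer adds them.
pc.onaddstream = (event) => attachToVideoElement(event.stream);

// Capture camera and microphone, then send an SDP offer to the peer.
navigator.getUserMedia({ video: true, audio: true }, (stream) => {
  pc.addStream(stream);
  pc.createOffer((offer) => {
    pc.setLocalDescription(offer);
    sendToPeer({ sdp: offer });
  }, console.error);
}, console.error);
```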


3.3.2 WebGL

WebGL (Web Graphics Library) enables browsers to access graphics hardware on the host machine in order to render accelerated real time graphics. It is primarily used to render 3D scenes, but can be used for other purposes as well, such as image processing. The WebGL API is based on OpenGL ES 2.0 [59]. OpenGL ES is in turn a subset of OpenGL [60], aimed at embedded devices. The API uses an HTML5 canvas element [61] as its drawing surface.

Figure 3.1: Simplified WebGL Pipeline

WebGL uses a pipeline to do its work; a simplified version of the WebGL pipeline can be seen in figure 3.1. The blue elements are set with the WebGL API, while the red elements are fixed, although in some cases their properties can be changed. The green elements are called shaders, and it is through these that the developer can program the GPU. The shader programs are written in the OpenGL Shading Language (GLSL) [62]. WebGL is essentially a 2D API, and not a 3D API [63]. The output of a WebGL pipeline is a 2D surface; the developer is responsible for creating the 3D effect. A program that wishes to display an object that is specified in 3D coordinates has to project the object onto a 2D plane called clipspace. The coordinates of the clipspace plane are always -1 to +1 along both the x-axis and y-axis. The vertex shader is responsible for converting 3D points (vertices) to clipspace coordinates. It takes a single vertex as input and outputs the clipspace coordinates of the vertex. Each vertex is a part of a primitive shape, such as a triangle or line. When the vertex shader has computed the position for all points in a primitive, it is rasterized. Rasterization is the process of converting a shape specified by coordinates to a raster image (pixels). An illustration of this is shown in figure 3.2. When the rasterization is complete, every pixel that is produced is sent to the fragment shader (this is not always the case, e.g. if multisampling is used [64]).


Figure 3.2: Illustration of Rasterization

The fragment shader is responsible for setting the color of each pixel it receives and can also discard the pixel. When OpenGL is used for image processing, this is commonly where all the processing is done. The end of the pipeline, the framebuffer [65], is by default the buffer that is displayed to the user once the rendering is complete. It is however possible to replace the default framebuffer with one specified by the programmer. Using this feature it is possible to render a scene to a texture, which then can be used when rendering the next scene.
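A minimal sketch of using WebGL for image processing is shown below: a full-screen quad is drawn, and the per-pixel work happens in the fragment shader (here a simple grayscale conversion stands in for e.g. the edge filter from section 2.4.1). It assumes a canvas element is present; texture upload, framebuffer setup and error handling are omitted, and the code is an illustration rather than the prototype's implementation.

```javascript
const gl = document.querySelector('canvas').getContext('webgl');

// Pass-through vertex shader: positions are already in clipspace.
const vertexSrc = `
  attribute vec2 position;
  varying vec2 texCoord;
  void main() {
    texCoord = position * 0.5 + 0.5;        // clipspace -> [0,1] texture space
    gl_Position = vec4(position, 0.0, 1.0);
  }`;

// Fragment shader: per-pixel processing (grayscale as a placeholder).
const fragmentSrc = `
  precision mediump float;
  uniform sampler2D image;
  varying vec2 texCoord;
  void main() {
    vec4 c = texture2D(image, texCoord);
    float gray = dot(c.rgb, vec3(0.299, 0.587, 0.114));
    gl_FragColor = vec4(vec3(gray), 1.0);
  }`;

function compile(type, source) {
  const shader = gl.createShader(type);
  gl.shaderSource(shader, source);
  gl.compileShader(shader);
  return shader;
}

const program = gl.createProgram();
gl.attachShader(program, compile(gl.VERTEX_SHADER, vertexSrc));
gl.attachShader(program, compile(gl.FRAGMENT_SHADER, fragmentSrc));
gl.linkProgram(program);
gl.useProgram(program);

// Two triangles covering the whole clipspace square.
const quad = new Float32Array([-1, -1, 1, -1, -1, 1, -1, 1, 1, -1, 1, 1]);
gl.bindBuffer(gl.ARRAY_BUFFER, gl.createBuffer());
gl.bufferData(gl.ARRAY_BUFFER, quad, gl.STATIC_DRAW);
const position = gl.getAttribLocation(program, 'position');
gl.enableVertexAttribArray(position);
gl.vertexAttribPointer(position, 2, gl.FLOAT, false, 0, 0);

gl.drawArrays(gl.TRIANGLES, 0, 6); // runs the fragment shader per pixel
```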

3.4 Mario

Mario is an implementation of WebRTC. It is built using the GStreamer library and runs on OS X, Android, iOS and Linux. Mario can be used to write both web-based and native applications. A C API is used for native applications, while web-based applications use a JavaScript API. The JavaScript API uses remote procedure calls (RPC) [66] to send commands to a server running Mario in the background.

3.4.1 GStreamer

GStreamer [67] is an open source library that runs on Linux, OS X, iOS, Windows, and many other platforms. GStreamer relies on GObject [68] to provide object oriented abstractions. The main purpose of GStreamer is to construct media processing pipelines using a large number of provided elements. GStreamer is designed with a plugin architecture which allows plugins to be dynamically loaded upon request. Each plugin contains one or more elements that can be linked together. An example pipeline of a simple MP3 player with an audio visualizer can be seen in figure 3.3. The pipeline does not function unless a queue element is inserted, which is left as an exercise for the reader.


Each element in the pipeline has its own function:

filesrc: reads a file and passes on a raw byte stream.
mpegaudioparse: parses the raw data into an encoded MP3 stream.
mpg123audiodec: decodes the MP3 stream and passes on raw audio data.
tee: splits the stream.
autoaudiosink: automatically selects a suitable method for audio playback and plays the received audio stream.
wavescope: creates a wave visualization of the audio stream.
autovideosink: automatically selects a suitable method for video playback and shows the received video stream.

Figure 3.3: Example GStreamer Pipeline

This pipeline architecture makes it simple to swap out parts of the pipeline while keeping the rest intact. For example, if the previous MP3 player wanted to play back a Vorbis stream instead, this could be achieved by replacing mpegaudioparse and mpg123audiodec with vorbisparse and vorbisdec respectively. This is used in Mario to be able to encode and decode different video and audio formats.

3.5 Video Encoding

The current implementation of Mario on Android uses software encoders, some of which are NEON [69] optimized. A high-end consumer smartphone has enough processing power to encode video in real time using a software encoder. However, the OMAP 4430 processor in Google Glass does not have enough processing power to encode video at an acceptable resolution and frame rate. The desired resolution was VGA (640 x 480 pixels), with a frame rate of 25 frames per second. When using the default software encoder with these settings, no video was received at all. The NEON optimized encoder was able to transmit video at a few frames per second with a very low video quality.


While this was enough for initial testing, the performance of the video encoder would have to be improved significantly if used in a real application.

3.5.1 Hardware Accelerated Video Encoding

The most computationally intensive task in a real time multimedia communication application is normally video encoding [70]. Therefore, many mobile devices have dedicated hardware that can be used for video (and audio) encoding. Google Glass is no exception; its OMAP 4430 SoC has an IVA3 multimedia hardware accelerator. If hardware video encoding was to be used, it would need to be incorporated in the GStreamer pipeline in Mario. The only sensible way to enable hardware encoded video was by using an existing GStreamer plugin. Three possible alternatives were found: gst-ducati, gst-omx and the androidmedia plugin. All the listed plugins access the hardware at different layers: gst-ducati uses libdce (distributed codec engine), which is the lowest level; gst-omx uses OpenMAX IL, a middle layer; and androidmedia uses the Android MediaCodec API, the highest layer. The layer that is chosen has an effect on where the application will be able to run. Using gst-ducati will limit the use to devices with an OMAP processor, while gst-omx and androidmedia should theoretically allow the application to run on any Android device [71].

androidmedia

This plugin uses the Android MediaCodec API, which was added in Android 4.1. The API is built on top of OpenMAX IL [71], and makes it easier to provide cross compatibility. Since Google Glass initially ran Android 4.0.4, it could not be used.

gst-ducati

gst-ducati can be found at [72]; it depends on the following libraries: libdce, libdrm, a custom branch of gst-plugins-bad, and libpthreadstub, which were all also available at [72]. The Distributed Codec Engine (DCE) which gst-ducati uses is written specifically for the IVA-HD subsystem in the OMAP4 processor. It interfaces directly with the co-processor without the need for OpenMAX.

gst-omx

OMX is an abbreviation of OpenMAX (Open Media Acceleration), which is a cross-platform C-language interface that provides abstractions for audio, video, and image processing. OpenMAX provides three different interface layers: application layer (AL), integration layer (IL) and development layer (DL). A number of different GStreamer plugins exist for OMX, which all use OpenMAX IL. The most actively maintained alternative was chosen, which can be found at [73].

Implementation of Prototype

The research goals of the thesis were achieved by reviewing existing research, and creating and evaluating a prototype. The information gathered from the literature review was complemented with original ideas and used in the construction of the prototype. This chapter describes the process of implementing the prototype, the problems encountered, and how they were solved.

4.1 Baseline Implementation

The work began by implementing a basic version of the prototype, with minimal functionality. A web server was built using NodeJS [74]. It served a static web page and also acted as signaling server for the WebRTC call. The web page had controls to initiate a simple two-way audio and video WebRTC call. Basic session and user handling was added where any participant could freely choose to join any session using any username. A basic Android application for Google Glass was created, which provided a minimal implementation of WebRTC using Mario. The application could join a session on the server and set up a WebRTC call. The early prototype revealed several issues:

• The encoded video on Glass had to have a very low resolution, or no video would be received.

• The Glass application became sluggish over time and would eventually freeze completely.

• Glass became very warm and displayed a warning message.

These problems all hinted that the CPU in Google Glass was not powerful enough to allow software encoding of video. Different alternatives for hardware encoded video on Google Glass were therefore researched.


4.2 Hardware Accelerated Video Encoding

Mario was built using GStreamer version 1.0, so any plugin that provided hardware encoding would also have to be built on top of GStreamer 1.0. This was a significant issue; most of the libraries discovered were aimed at GStreamer 0.10. Three plugins were, however, found that used GStreamer 1.0: gst-omx, gst-ducati, and androidmedia, which were described in the previous chapter. Each of the plugins was built and tested, but none of them worked without modifications. It was possible to build androidmedia and load it into Mario, but the plugin used the Android media API, which was only available from Android 4.1. Since Glass at the time ran Android 4.0.4, the plugin could not be used. gst-omx could be built and loaded, but the plugin did not work and printed a long list of error messages. It was possible to build gst-ducati and all of its dependencies, but trying to link [75] it with Mario caused a large number of symbol conflicts. Out of the three plugins tested, it was concluded that gst-omx was the one that most likely could be adapted and integrated into Mario and Google Glass. androidmedia was limited by the Android version, and gst-omx was still actively developed and seemed to have a larger community than gst-ducati.

4.2.1 gst-omx on Google Glass

The TI E2E community [76] was a valuable source of information. A post by a TI employee explained how to debug OpenMAX on devices with an OMAP4 CPU: a log on the device, located at /d/remoteproc/remoteproc0/trace1, logged errors from the remote processor, including errors generated by OpenMAX. This was a vital tool as it reported errors that were not shown in the GStreamer log. A number of issues were found and resolved, enabling the use of hardware encoded video through gst-omx. Using hardware accelerated encoding, VGA video could be streamed at 25 frames per second, which was adequate for the prototype.

4.3 Ideas to implement

Ideas for how to continue the development of the prototype were explored. An aspect which was apparent from the beginning was that a remote view of the expert was not very useful to the technician. A solution similar to XMReality and VIPAAR was considered, where an augmented view of the expert is shown to the technician. This idea was abandoned, as it would require a more sophisticated setup on the expert side, and it would be difficult to provide something new to the solution. The idea that was chosen instead was that the expert would use simple annotation tools to annotate the view of the technician.


4.3.1 Annotating the Technician’s View

The idea was to allow the expert to draw on top of a still image or a video feed, and that the annotations would be sent to the technician. It was undecided whether this should be done on a video feed or on still images. In the collaborative communication study (see 2.3), it was concluded that annotating live video and annotating still images are roughly equally efficient. An important aspect of the test conducted in that research was that the person receiving the instructions had the same viewpoint all the time. Annotating live video allowed the person to align the annotations with the view, solving the problem of keeping the annotations stable. In the field technician use case, it would be beneficial if the technician is able to do work that requires moving around while receiving instructions. This meant that if live video annotations were to be used, some form of alignment of the annotations and the video would have to be done.

4.3.2 Aligning Annotations and Video

If live video was to be annotated, there would have to be a way to track annotations so that e.g. a circle around an object would stay around the object even as the technician moves around. An approach considered was to track the technician’s position. It does not solve the problem of finding the annotations’ position in relation to the real work objects, but it simplifies the problem. The concept was tested with a simple prototype application that aimed at showing a 3D cube in the real world using AR. A combination of the gyroscope, accelerometer and compass was used to track the rotation of the device, as described earlier (see 2.2.4). This worked well; the cube stayed in the same position when rotating one’s head, without any noticeable delay. The challenge proved to be to track the position of the device. As was also concluded earlier (see 2.2.4), an accelerometer cannot be used to accurately track the position of an object. This meant that image processing would have to be used to track the location of the technician, as well as to find the annotations’ location relative to the real world objects. A number of different approaches to this issue were considered:

1. Do all video processing on the expert side and send annotated video to the technician.

2. Recognize points in the video that can be used as anchors, send annotations relative to these points.

3. Same approach as 1, but transfer the annotations from the incoming video to a live feed on Glass.


When the video processing is said to be done on the expert or technician side, it is not required to be done on the actual device. It could be offloaded to another device or server, as long as it can be done with very low latency.

Alternative 1 would cause a big delay between the technician’s real view and his or her view in Glass. It would also require twice the bandwidth, as video is sent in both directions, and use more processing power on Glass to decode the incoming video. Alternative 2 was to synchronize the annotations using image recognition. This could be done by recognizing points in the video on both the expert and technician side, and then sending annotation coordinates that are relative to the recognized points. This is perhaps the most realistic and optimal solution, if it is possible. It would most likely require the image processing on Glass to be offloaded to a nearby device. Alternative 3 solves the latency issue present in alternative 1, but it increases the complexity significantly. It would likely require the computation to be offloaded to another device because of the limited processing power of Google Glass.

In the end none of the alternatives were implemented. Alternatives 1 and 3 require the ability to modify an incoming video stream and to transmit that stream to another peer, which is not supported by WebRTC. Alternative 2 might be possible to implement, but it was considered too complex to implement within the time frame. It was also not thought of from the beginning and is left as a possible path for future improvement. It was decided that still image annotation would be used instead. According to the research found, it should be at least as efficient as annotated video. The earlier mentioned experiment had also shown that humans are very good at correlating objects in an image with the real world, even if the image is from a different perspective, which would be the case when using still images.

4.4 Still Image Annotation

The implementation of still image annotation presented a few questions and challenges: how does the expert select the image, what types of annotations should be possible, and how are the annotations transmitted to the technician?

The problem of selecting an image was solved by allowing the expert to select a video frame from a few seconds in the past. The live video was shown in a section of the web application, with each shown video frame being pushed onto a queue. Whenever the user pressed a button on the page, the queued video was displayed in a separate section. The user could then seek through the queued video and select a suitable frame. The solution made it possible to always display the live video feed, while also being able to go back in time to select an optimal video frame. How the images were sent to the technician is described later in this chapter.

The first annotation method to be implemented was free drawing. The drawing was done using a 2D canvas in both the Glass application and the web application. The Android Canvas API [77] is very similar to the HTML5 Canvas API [78], which made it very simple to implement a drawing system with a consistent result for the same data. Both canvases also supported drawing text, which was added at a later stage.
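A minimal sketch of the frame queue described above is shown here. The element ID, queue length, and frame rate are assumptions used for illustration; they are not taken from the prototype.

```javascript
// Keep the last few seconds of rendered frames so the expert can seek back in time.
const remoteVideo = document.querySelector('#remote-video');
const frameQueue = [];
const MAX_FRAMES = 5 * 25;            // e.g. five seconds at 25 frames per second

function grabFrame() {
  const canvas = document.createElement('canvas');
  canvas.width = remoteVideo.videoWidth;
  canvas.height = remoteVideo.videoHeight;
  canvas.getContext('2d').drawImage(remoteVideo, 0, 0);
  frameQueue.push(canvas);
  if (frameQueue.length > MAX_FRAMES) frameQueue.shift();   // drop the oldest frame
}
setInterval(grabFrame, 1000 / 25);

// The slider in the web application simply indexes into the queue; the returned
// canvas is shown in the preview area and can be selected for annotation.
function frameAt(sliderValue) {
  return frameQueue[Math.min(sliderValue, frameQueue.length - 1)];
}
```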

4.4.1 WebRTC Data Channel

WebRTC data channels were chosen as the transport for the annotation data. This assured the lowest possible latency. It also meant that the annotation data was transmitted by the same means as the audio data, ensuring that the audio and annotations were synchronized. Mario had not implemented WebRTC data channels at the time; however, this would be added in the future. A simple solution that allowed data to be sent directly over UDP was implemented instead. The transport of the annotation data was designed with the assumption that SCTP would later be used. A protocol was created that assumed that all data was eventually delivered, but could handle out of order delivery. This reduces the latency, and more importantly it removes the spikes caused when a message is lost and has to be retransmitted. This was also a necessity as the messages temporarily had to be transported over UDP, although no retransmission of messages was implemented.
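Once data channels are available over SCTP, the desired semantics (reliable but unordered delivery) can be requested directly when the channel is created. The snippet below is a sketch of that configuration, assuming pc is the established RTCPeerConnection; the channel label and the message handler are hypothetical.

```javascript
// Reliable but unordered: every message is eventually delivered, possibly out
// of order, which avoids the latency spikes caused by in-order retransmission.
const annotationChannel = pc.createDataChannel('annotations', { ordered: false });

annotationChannel.onmessage = e => handleAnnotationMessage(JSON.parse(e.data));

function sendAnnotation(msg) {
  // Messages are small JSON objects (line, point, clear); see section 4.4.2.
  annotationChannel.send(JSON.stringify(msg));
}
```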

4.4.2 Out of Order Messages

A method that allowed the annotation data to be received out of order was devised. Each time the expert started drawing, a message was sent marking the beginning of a new line and any properties of the line, such as color. When the expert had moved the cursor far enough (10 px), a message was sent with the coordinates of the point the cursor had been moved to. This limited the amount of data sent by dividing the drawing into segments instead of pixels. In order to support out of order delivery, each line and point was given an index, with the point indices starting at 0 for each line being drawn. Every start of line message included the index of the line, while every point message contained the line index and its own index. A point message buffer was also added in the Glass application that buffered any point message that was received before the associated line message. As shown in the research on collaborative communication, it was important that the expert was able to remove old annotations to avoid confusing the technician by cluttering the scene. This was achieved by sending a clear message. The clear message supported out of order delivery by using the same incrementing index as was used by the line messages.


Figure 4.1: Out of Order Delivery of Messages

A method for ensuring that the correct annotations were cleared had to be implemented in the Glass application. This was achieved by keeping track of the minimum allowed index, as illustrated in figure 4.1. Whenever a clear message is received, the minimum index is set to the index of the clear message. At the same time, any annotation with an index lower than the new minimum is removed. Any message that is received with an index below the minimum is ignored. This solution was later expanded to allow clearing of individual colors by keeping track of a minimum index for each color.
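The receiver-side bookkeeping can be summarized as in the sketch below. The message field names and the per-color clear semantics are assumptions made for illustration; the actual protocol used in the prototype is only described in prose above.

```javascript
const lines = new Map();             // line index -> { color, points: [] }
const pendingPoints = [];            // points that arrived before their line message
const minIndexPerColor = new Map();  // lines below this index have been cleared

const minIndexFor = color => minIndexPerColor.get(color) || 0;

function handleAnnotationMessage(msg) {
  if (msg.type === 'clear') {
    // Raise the minimum index and drop every older line of that color.
    minIndexPerColor.set(msg.color, msg.index);
    for (const [idx, line] of lines) {
      if (line.color === msg.color && idx < msg.index) lines.delete(idx);
    }
  } else if (msg.type === 'line') {
    if (msg.index < minIndexFor(msg.color)) return;        // already cleared, ignore
    const line = { color: msg.color, points: [] };
    lines.set(msg.index, line);
    // Attach any points that were buffered while waiting for this line message.
    for (const p of pendingPoints.filter(q => q.line === msg.index)) {
      line.points[p.index] = p;
    }
  } else if (msg.type === 'point') {
    const line = lines.get(msg.line);
    if (line) line.points[msg.index] = msg;
    else pendingPoints.push(msg);                          // line message not seen yet
  }
}
```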

4.5 Image Processing

An idea that was brought up during the development of the prototype was the ability to highlight objects in the selected image by clicking on them. This was chosen as the feature that would be implemented after annotation by free drawing. Multiple different methods for selecting objects were considered:

1. The user draws a rough outline around the object, and an algorithm then refines the edge.

2. Recognize objects in the image and allow the user to select them.

3. The image is split into different regions. When a user clicks inside a region it is selected.

Out of the options listed, number 1 would provide the most accurate outline, but the goal of the object selection method was to provide a quick way for the expert to select an object. Alternative 1 defeats the purpose of the object highlighting, as the rough drawing would signal the object of interest just as well as a precise outline. Alternatives 2 and 3 were both chosen and methods for implementing them were researched.


The image processing had to be done either directly in the web application or offloaded to a server. A web based solution was preferred, as this would minimize the delay and the server load. Different libraries were found that could provide face detection or hand detection, but no library was found that provided either detection of objects that stood out in an image or some method of segmenting an image by color. An attempt was instead made to implement alternative 2 or 3 using WebGL.

4.5.1 Image Processing using WebGL

The image processing was implemented using fragment shaders. While they only output one color, fragment shaders can read multiple points from many different textures that are loaded onto the GPU. A series of fragment shaders can be composited with the help of framebuffers. The image is first loaded into a texture. The texture is then rendered to a framebuffer with an attached texture using the first shader. The new texture is then rendered to another framebuffer using the second shader, and so on. The result obtained in the final step can then be rendered to the screen or read into main memory.

Figure 4.2: Ping-Pong Framebuffers

Because a framebuffer cannot render to itself, at least two framebuffers had to be used. The render target alternated between the two framebuffers, using the unbound framebuffer’s attached texture as input to the fragment shader. An illustration of this is shown in figure 4.2. The figure also shows how new framebuffers can be created to save the intermediate steps in the process. This method was used as a debugging tool to be able to look through the different steps afterwards.
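The ping-pong structure can be expressed roughly as below. The helpers createFramebufferWithTexture() and drawQuad() stand in for the usual WebGL boilerplate (framebuffer and texture creation, quad geometry, attribute setup) and are not part of any library used in the thesis.

```javascript
// Run a chain of fragment-shader passes, alternating between two framebuffers.
function runPasses(gl, sourceTexture, shaderPrograms, width, height) {
  const ping = createFramebufferWithTexture(gl, width, height);
  const pong = createFramebufferWithTexture(gl, width, height);
  let input = sourceTexture;
  let target = ping;

  shaderPrograms.forEach(program => {
    gl.bindFramebuffer(gl.FRAMEBUFFER, target.framebuffer);
    gl.useProgram(program);
    gl.bindTexture(gl.TEXTURE_2D, input);
    drawQuad(gl);                                   // render a full-screen quad

    // The texture just rendered to becomes the input of the next pass,
    // and the other framebuffer becomes the new render target.
    input = target.texture;
    target = (target === ping) ? pong : ping;
  });

  gl.bindFramebuffer(gl.FRAMEBUFFER, null);
  return input;                                     // texture holding the final result
}
```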

4.5.2 Image Segmentation using Hough Transform

A challenge faced when implementing the image segmentation was how the selection of a region would be shared with the technician. Even if the image could be divided into suitable regions, the preferred way of sending the annotations to the technician was through the text based WebRTC data channel. For this reason, the first method that was tried was a form of object recognition using a Hough line transform (see 2.4.3). The intention was that a Hough transform would find line segments in the image that could then be stitched together, creating a graph. With some basic collision detection, the coordinates of a mouse click could be used to find the smallest subgraph that encompassed the click. The coordinates of the line segments could then be sent to the expert, and it would be trivial to render the outline.

The image was preprocessed by applying a Gaussian blur and then the simple edge detection algorithm described in the literature review (see 2.4.1). The Hough line transform was then performed, revealing any lines in the image. The lines were then supposed to be split into the actual line segments in the image, and the segments would then be combined into a graph.

This method proved to have some big disadvantages. The biggest one was that a Hough transform can only reveal objects with a predetermined shape; in this case it would only be able to detect objects with straight and sharp edges. This was known from the beginning, but it became apparent how limiting it was once the algorithm was run on some sample images. It also proved to be difficult to pick out the lines in the image by finding relevant local maximum points. Many of the sample images used had lines that were similar to each other but did not originate from the same object, e.g. two tables standing next to each other at slightly different angles.

4.5.3 Image Segmentation using Median Filters

While experimenting with the Hough transform, a median filter was tested to see if it would be a more suitable preprocessing step than the Gaussian blur filter. It turned out that a decent segmentation of the image could be achieved by applying the median filter several times and then running an edge detection shader. This method was further improved upon by creating a second type of median filter that sampled the surrounding pixels in a circular pattern instead of a square. The two different median filters were applied after each other a number of times, and by trial and error a pattern that yielded an acceptable result was found. The method provided a decent segmentation of the image. It was not able to properly identify objects when an edge was blurred, e.g. if out of focus.
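The edge detection pass run after the filtering can be illustrated with a fragment shader along the lines of the one below. The neighbour pattern and threshold are generic choices made for illustration; they are not necessarily the exact algorithm referenced in the literature review.

```javascript
// WebGL (GLSL ES 1.0) fragment shader: white inside regions, black on borders.
const edgeShaderSource = `
  precision mediump float;
  uniform sampler2D u_image;
  uniform vec2 u_texelSize;        // 1.0 / image resolution
  varying vec2 v_texCoord;

  void main() {
    vec3 c  = texture2D(u_image, v_texCoord).rgb;
    vec3 dx = texture2D(u_image, v_texCoord + vec2(u_texelSize.x, 0.0)).rgb - c;
    vec3 dy = texture2D(u_image, v_texCoord + vec2(0.0, u_texelSize.y)).rgb - c;
    float edge = step(0.1, length(dx) + length(dy));
    gl_FragColor = vec4(vec3(1.0 - edge), 1.0);
  }
`;
```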


Metropolis-Hastings Algorithm

An attempt was made to implement the Metropolis-Hastings algorithm [79]. Good results were achieved with a slow cooling process and an iteration count of several thousand, but this resulted in a computation time far above what was acceptable for the application. The shortcut method with a fast cooling process that results in domain fragmentation was never achieved, most likely due to bad parameters or errors in the implementation.

Figure 4.3: Result of Median Filters and Edge Detection — (a) before, (b) after

4.5.4 Region Labeling

An example of a resulting image after the median filters and the edge detection have been applied can be seen in figure 4.3. The regions found in the image are white, while the borders between the regions are black. A way to select a region and to share this selection with the technician had to be implemented. This was achieved with the help of the labeling algorithm described earlier (see 2.4.4). Initially another, more naive algorithm was used, which took roughly 1.3 s to label an image with VGA resolution. When using the new algorithm the time was reduced to below 1 s. This was still too slow to be an acceptable solution, especially since the web browser is frozen while the algorithm is running.

The solution was to use a web worker [80], which can be thought of as a thread in JavaScript. Web workers are more closely related to processes, however, as they run in a separate context and can only share immutable data with other contexts. When moving the computation to a web worker, the goal was to allow the main thread in the browser to keep running while the computation was being executed. It turned out to have a side effect: the execution time went down to approximately 30 ms. It was guessed that isolating the code in an otherwise empty context allowed better optimizations to be made, but this was never studied further.

Once the regions had been labeled, an outline around a single region could be rendered using the region label and an edge detection shader. The same method was also used to send the label to the technician.
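The division of work between the main thread and the web worker can be sketched as below. The worker file name and message format are made up, and the labeling shown is a simple stack-based flood fill rather than the exact algorithm from the literature review; it assumes the edge image has white region pixels and black border pixels.

```javascript
// Main thread: hand the pixel data (e.g. from gl.readPixels or getImageData)
// to the worker; the underlying buffer is transferred rather than copied.
const worker = new Worker('labeling-worker.js');
worker.onmessage = e => renderOutline(e.data.labels);       // Int32Array of labels
worker.postMessage({ pixels: imageData.data.buffer, width, height },
                   [imageData.data.buffer]);

// labeling-worker.js
onmessage = e => {
  const { pixels, width, height } = e.data;
  const data = new Uint8ClampedArray(pixels);
  const labels = new Int32Array(width * height);            // 0 = unlabeled
  let nextLabel = 1;

  for (let start = 0; start < labels.length; start++) {
    if (labels[start] !== 0 || data[start * 4] === 0) continue;   // skip border pixels
    labels[start] = nextLabel;
    const stack = [start];
    while (stack.length > 0) {
      const p = stack.pop();
      const x = p % width, y = (p / width) | 0;
      for (const [nx, ny] of [[x - 1, y], [x + 1, y], [x, y - 1], [x, y + 1]]) {
        if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
        const n = ny * width + nx;
        if (labels[n] === 0 && data[n * 4] !== 0) {          // unlabeled region pixel
          labels[n] = nextLabel;
          stack.push(n);
        }
      }
    }
    nextLabel++;
  }
  postMessage({ labels }, [labels.buffer]);
};
```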

4.6 Glass Application

The development of the Glass application was very straightforward, as it did not have much of a graphical user interface (GUI) to speak of. In most cases the implementation simply mimicked the web application.

4.6.1 Configuration

The address of the server was initially hardcoded in the application. This meant that every time a different network was used, the address in the code had to be changed. This quickly became a tedious task, so a method of configuring the address was added. QR codes are used to configure the Wi-Fi network on Glass, and this idea was copied when adding the address configuration option. A library to generate QR codes was added to the web application. When clicking a button, a QR code is shown that contains the current address of the server. Since Android does not provide a QR code reader, a library called ZXing [81] was used to provide that functionality. The QR code scanner could be accessed through a context menu in the Glass application. When scanning the QR code provided by the web application, the received address was stored in the shared preferences of the application. The next time the application tried to connect to the server it would use the updated address.

4.6.2 OpenGL ES 2.0

The OpenGL version used in the Glass application was OpenGL ES 2.0, which WebGL is based upon. This meant that all the methods used for the rendering could be reused, and the shaders could be reused without any modification. The Glass application did not mirror the image processing functionality; only the methods used to render an outline around an image region were implemented.

4.6.3 Photographs by Technician

When testing the system it was realized that it would be useful for the technician to be able to take photographs and send them to the expert. This would allow the expert, using annotations or voice guidance, to instruct the technician to e.g. take a close-up picture of an object. With the solution at the time, the technician would have to simply look at the object for some time so that the expert could then select a suitable frame.

This proved to be a technical challenge, as the camera has to be acquired before a photograph can be taken, and it is already in use by Mario. A possible solution would be to end the call, take the picture, and then resume the call. This seemed like a solution that would make the feature pointless, as it would take much longer to take the picture than to simply look at the object for some time. An alternative solution was found that allowed Mario to keep running while a photograph is taken. GStreamer provides methods for intercepting buffers that are being sent through the pipeline. This was used to add a function to Mario which copied the next video frame sent from the camera. This meant that the picture had the same resolution as the video, but it had better quality as it was extracted before the video was encoded. This solution also allowed photographs to be taken with a very small delay, as the camera is already recording video.

4.7 Signaling Server

This section describes the implementation of the signaling server. The server had several tasks to handle:

• Keeping track of sessions and users.

• Providing a signaling channel between the users for WebRTC.

• Allowing images and photographs to be uploaded to a session.

• Sending notifications to users on events such as image uploads or users joining and leaving.

4.7.1 Sessions and users

The session and user handling was very simple. No authentication was implemented, as this was not a needed feature in the prototype. Session IDs and usernames are chosen by the users when connecting to the server. Users receive events when another user joins or leaves their session, and are able to send messages to any user in the same session.
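On the server this amounts to little more than a map of sessions. The sketch below is hypothetical (the names and structure are not taken from the prototype); sendEvent() is shown in the server-sent events sketch in the next section.

```javascript
const sessions = new Map();   // session id -> { users, photos, expertImage }

function getSession(id) {
  if (!sessions.has(id)) {
    sessions.set(id, { users: new Map(), photos: new Map(), expertImage: null });
  }
  return sessions.get(id);
}

function join(sessionId, username, response) {
  // The response is kept open and used as the user's server-sent event stream.
  getSession(sessionId).users.set(username, { response });
  broadcast(sessionId, 'user-joined', { username });
}

function broadcast(sessionId, event, data) {
  for (const [, user] of getSession(sessionId).users) {
    sendEvent(user.response, event, data);
  }
}
```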

4.7.2 Server-sent events

The signaling channel was to be implemented using either web sockets [82] or server-sent events [83]. Server-sent events are implemented in the browser using the EventSource API, while web sockets use the WebSocket API. The biggest difference between server-sent events and web sockets is that web sockets are bidirectional, while server-sent events only allow data to be sent from the server to the client. Data can still be sent to the server when using server-sent events, but this has to be done separately using regular HTTP requests. This causes a large overhead compared to web sockets if small chunks of data are sent. The server implementation of server-sent events would also be simpler than the web socket implementation, since server-sent events simply keep an HTTP request open.


Before the XE16 update to Google Glass, which brought the new Android 4.4 WebView, neither of these technologies was supported by the WebView in the GDK. This could be solved using a polyfill, which is a drop-in script that implements some missing functionality in the browser [84]. Using a polyfill caused the browser to fall back to long polling. Long polling is a form of polling where the browser requests the resource at a much slower rate; if the server does not have any response available when the request is received, it is left open until there is something to send. In the end this issue disappeared when the XE16 update was released.

Server-sent events were chosen because the data sent to the server was sent in large chunks and infrequently; only the notifications from the server to the clients were small and frequent. On the server they were implemented by keeping one request open for each connected user. When the web application had finished loading, a request was sent to the path /join along with the required parameters. Upon receiving the request, the response was simply left open and the content type was set to text/event-stream. This allowed events to be sent using the open response. Server-sent events were used to implement both the server notifications, such as users disconnecting and images being uploaded, and the signaling channel that was used to initiate the WebRTC call.
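A minimal sketch of the /join handler and the event format is shown below, assuming a plain Node HTTP server; routing, parameter parsing, and the other endpoints are omitted, and the event names are illustrative.

```javascript
const http = require('http');

// The text/event-stream format: an optional event name followed by data lines.
function sendEvent(response, event, data) {
  response.write(`event: ${event}\ndata: ${JSON.stringify(data)}\n\n`);
}

http.createServer((req, res) => {
  if (req.url.startsWith('/join')) {
    res.writeHead(200, {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive'
    });
    // The response is left open; events are written to it as they occur.
    sendEvent(res, 'connected', { ok: true });
    // join(sessionId, username, res) from the session sketch would be called here.
  }
  // Static files, signaling messages, and image upload are handled elsewhere.
}).listen(8080);

// Browser side: the EventSource API delivers the named events as they arrive.
// const source = new EventSource('/join?session=1&user=expert');
// source.addEventListener('image-uploaded', e => loadImage(JSON.parse(e.data)));
```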

4.7.3 Image upload

The image upload was implemented by saving images that were sent to a specific path using a PUT request. The images were saved per session and were stored in memory only. The photographs uploaded by the technician were assigned a unique ID, and all uploaded photographs were saved. Of the images uploaded by the expert, only the most recent one was saved. Each time an image was uploaded, an event was sent to all users in the session. Different events were emitted depending on whether the expert had uploaded an image or the technician had uploaded a photograph. This allowed the clients to start downloading images as soon as they had been uploaded. An improvement that was considered was to send the image events as soon as the image upload had started. The image would then be streamed to any involved client while it was still being uploaded, resulting in reduced latency. This was never implemented due to time constraints.
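The upload handling can be sketched as follows, reusing getSession() and broadcast() from the earlier sketches; the paths, event names, and ID scheme are assumptions.

```javascript
// Handle a PUT request with an image body; images are kept in memory per session.
function handleImageUpload(req, res, sessionId, fromTechnician) {
  const chunks = [];
  req.on('data', chunk => chunks.push(chunk));
  req.on('end', () => {
    const image = Buffer.concat(chunks);
    const session = getSession(sessionId);
    if (fromTechnician) {
      const id = session.photos.size + 1;          // unique id; every photograph is kept
      session.photos.set(id, image);
      broadcast(sessionId, 'photo-uploaded', { id });
    } else {
      session.expertImage = image;                 // only the most recent expert image is kept
      broadcast(sessionId, 'image-uploaded', {});
    }
    res.writeHead(204);
    res.end();
  });
}
```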

Result

This chapter provides a summary of the final implementation of the prototype from a user point of view.

5.1 Web Application

Figure 5.1 shows a view of the web application when it has just loaded. The configure button (1) displays a QR code which can be scanned by the technician to set the server address to the address that the expert is currently connected to. Once the technician has connected to the same session, an “Initiate Call” button will be displayed in area A. Area A is where the remote view of the technician is displayed. When button 2 is clicked, a scan through the last few seconds of video is initiated. While scanning through the video with the slider in area 3, the frames are displayed in area B. When the technician takes a photograph it will appear in area D. The expert is able to preview the pictures by hovering over them.

At any point during a call, images can be dragged from areas A, B, and D to area C to select an image for annotation. The selected image is immediately displayed to the technician through the display on Google Glass. Images can also be dragged from other sources such as other websites or the file system.

When the expert has selected an image, it can be annotated using any of the methods seen in area 5: text, highlighting, or free drawing. The color of the annotations is selected with the color picker (4). In free drawing mode the expert annotates the image by drawing with the mouse on top of the image. Highlight mode allows the expert to select regions in the image that have been found using an image segmentation algorithm. When the expert hovers over the regions of the image, a preview of the highlighting is shown. If the expert clicks the region, it is selected and the highlighting is sent to the technician. When in text mode the expert can click anywhere within the image to display a box where text can be inserted. If the expert hits enter, the text is converted to an annotation and displayed to the technician, while escape aborts the text input.

The erase buttons (6) can be used to remove the annotations. The “Erase Color” button only removes annotations of the currently selected color, while “Erase All” removes all annotations. All annotations are automatically removed if a new image is selected. Buttons 7 and 8 are debug tools and are used to display sample images and enable debug keybindings. The debug keys can be used to go back and view the different steps of the image segmentation process.

Figure 5.1: Web Application

5.2 Google Glass Application

The user opens the application using a voice command or by navigating to it in the command list. When inside the application, the user is always able to quickly tap the touchpad to bring up the context menu. Through the menu the user is able to stop the application, which alternatively can be done by swiping down on the touchpad. The menu also provides access to a QR code scanner, which is used to configure the server address by scanning the code generated by the web application. Once the application has started, it automatically tries to connect to the server using the configured address. The user gets feedback through a large message in the middle of the screen which shows the connection status. If the application successfully connects to the server, a message saying “Connected” is shown.


If the connection is not successful, the user can tap the touchpad with two fingers to attempt to reconnect. Once the application is successfully connected to the server, the technician has to wait for the expert to initiate the call. When the call is started, the text disappears and the technician is able to speak to the expert. As soon as the call has started, the technician is able to long press the touchpad to send a photograph to the expert. When the expert has selected an image to annotate, the image is displayed in full screen to the technician. As soon as the expert adds or removes any annotations, they are instantly sent to the technician and displayed on top of the image.


Discussion

This chapter analyzes the work and results of the thesis. In addition to implementing a prototype of a collaborative communication system, a research goal was to evaluate the prototype from the viewpoint of further improvements, reusability, and how well suited Google Glass is for the task.

6.1 Analysis of Method and the Result

This section discusses some of the choices made when implementing the prototype and parts that could have been done differently.

6.1.1 Live Video Annotations

The choice of sending still images to the technician instead of video was sensible given the time frame, but it would have been interesting to investigate the use of live video annotations. If the time spent implementing the image segmentation had instead been spent investigating live video annotation, it is possible that the discoveries made would have been more closely related to the goal of the thesis. In either case it would have been a good idea to do a simple implementation of live video annotations, without any compensation for delay or movement, if only to get a better understanding of the issues and benefits of live video annotations.

6.1.2 Early Prototype

Implementing an early prototype with minimal functionality provided valuable information about the performance issues of Google Glass. It allowed the planning to be changed at an early stage to ensure that a solution was found. If the implementation of the prototype had not started until later, it is likely that the unexpected issues would have had a bigger impact on the result.

6.1.3 WebGL

An issue with WebGL on displays with a high pixel density was discovered after the prototype had been completed: it took a very long time, up to several seconds, to read the image data needed by the region labeling algorithm from the GPU. Given the current state of WebGL, which is mostly stable but has performance issues in some edge cases, it would have been favorable to run the image segmentation algorithm on a separate server. Even so, going back with the knowledge of the issues with WebGL, it would still be compelling to use. Implementing an image processing algorithm in WebGL was an interesting example of the possibilities of new web technologies.

6.1.4 OpenMAX

Once the Android MediaCodec API was available on Google Glass, the androidmedia plugin was tested. While simply plugging an androidmedia video encoder into the pipeline did not work, it is possible that using androidmedia would have been a lot easier than gst-omx, while also providing a more portable solution. This was not something that could have been done differently, however, as the MediaCodec API was made available toward the end of the thesis work.

6.1.5 Audio

The visual component of the communication system received far more attention than the audio, which is perhaps justified. Some time should nevertheless have been spent on a small-scale comparison of different audio codecs and settings, to find a combination that works particularly well with the bone conduction transducer used by Google Glass.

6.2 Further Improvements

This section lists possible improvements and some ideas that were never implemented.

6.2.1 Image Segmentation

The most direct improvement to the prototype would be to replace the image segmentation algorithm. The implemented algorithm is good enough to demonstrate the concept, but it has a lot of flaws. It was very vulnerable to bad focus and soft edges on objects, and would often fail to outline an object that was clearly distinguishable from the background. A method that was discovered towards the end of the work was mean shift; it would probably be a big improvement over the current solution. Alternatively, the implementation of the Metropolis-Hastings algorithm could be fixed and/or tuned.

6.2.2 Video Annotations

Something that should be investigated is how the still image solution compares to a live video solution. This would mean implementing live video annotations, which could be considered an entirely new system rather than an improvement to the existing prototype.

6.2.3 UX Evaluation

A thorough evaluation of the user experience would also have to be done. This evaluation should focus on the technician’s work: how easy it is to adapt to the use of Google Glass, and how easy it is to follow the instructions provided.

6.2.4 Gesture and Voice Input

An idea that was considered was adding some form of gesture or voice input for the technician. This would allow the technician to provide input in situations where it might not be possible otherwise.

6.2.5 More Annotation Options

The usability of the annotation interface could be improved by adding features such as drawing shapes, undo and redo, and working with multiple images simultaneously.

6.2.6 Logging

A system that records the technician’s actions could also be implemented. The video and audio would be recorded, along with other data such as facing, position, temperature, and bandwidth usage. This would require a more sophisticated setup where the video is transmitted via an intermediate server where it can be recorded.

6.3 Reusability

The system can easily be adapted for any other device that is similar to Google Glass, such as the Recon Jet or Vuzix M100. Because these devices also run Android, the application could likely be ported with minimal effort, only replacing the GDK specific code with a solution written with their SDKs. It would however be difficult to adapt the prototype to any direct AR device, such as the Meta Pro. It does not make sense to use still images to give instructions to someone wearing a direct AR device. The images could perhaps be displayed on a simulated floating screen just like on the other devices, but this would just cause a loss of detail compared to the other options. The Meta Pro also provides depth sensors and other tools that would make live video annotation a lot simpler to implement.


6.4 Capabilities of Google Glass

It was obvious from the beginning that a big issue would be the processing power available on Google Glass. What was also noticed early was the problem of overheating. When using the built in video recording software, it took only a few minutes for the device to become quite warm. When developing the prototype, Glass had to be removed from time to time because it became too hot to be comfortable to wear. Once the hardware could be accessed to accelerate the video encoding, it was proven that it was possible to use Glass for the use case. The device did not overheat as quickly, although it still happened after 5–10 minutes of use. In its current state it is not suitable for long calls to the back office support, but it can definitely be used for short calls that do not last longer than about 5 minutes.

The battery capacity was most often not an issue, although this was most likely because the device overheated before the battery could be depleted. Battery life is an issue that could quite easily be resolved in the use case. The technician could e.g. charge the device while driving to the location. There are also several products that aim to extend the battery life of Google Glass by attaching an extra battery.

An issue that was quickly recognized was the audio quality. Glass uses a bone conduction transducer, which results in quite bad audio playback. One of the design goals of the device is that it should not shut out the user from his or her surroundings. This feature comes at a cost, as the audio quality suffers.

6.5 Effects on Human Health and the Environment

This section discusses possible effects on the user and the environment.

6.5.1 Environmental Impacts

Any communication system that is able to replace the need to be at a location in person has the potential to reduce both environmental and economic costs. When the problem at a site turns out to be too complicated for the technician to solve alone, he or she can call support instead of having to give up and send someone else to the site. If the issue can be solved remotely using the instructions from an expert, unnecessary pollution and expenses arising from traveling to the site can be avoided. The better adapted the communication system is, the faster and more reliably problems will be solved. Failures are avoided by providing the expert with the tools necessary to give comprehensive instructions to the technician, which is what the proposed system aims to do.

6.5.2 Health Concerns

The effect the prototype has on the health of the technician is minimal. The indirect AR used in the prototype does not cause simulation sickness, which is a risk when using a video see-through direct AR system. The effects are likely limited to eye strain caused by looking at the screen, and some possible discomfort if Google Glass becomes overheated.


Conclusion

In conclusion, the prototype that was created shows that it is possible to create a communication system involving Google Glass that is well adapted to the use case. With some work, the system is likely to be an improvement over any smartphone or tablet based solution. Glass is also well suited for the entire use case, as it can be used for navigation when driving to the remote site, as well as for notifying the technician of any events. There are however some issues with the hardware on Google Glass, the most apparent being the problem of overheating. One should not forget, however, that the version of Google Glass used is the Explorer Edition, and much can change before the device is officially released to the market. It is also the first generation of a new type of device, and it is likely that future versions will fill in the gaps.


Bibliography

[1] Mann S. In: Soegaard M, Dam RF, editors. Wearable Computing. Aarhus, Denmark: The Interaction Design Foundation; 2013.

[2] Thorp EO. The Invention of the First Wearable Computer. In: ISWC ’98 Proceedings of the 2nd IEEE International Symposium on Wearable Computers. Pittsburgh, PA, USA: IEEE Computer Society Press; 1998. p. 4–8.

[3] Peng J, Seymour S. Envisioning the Cyborg in the 21st Century and Beyond; p. 18.

[4] The history of wearable computing [homepage on the Internet]. Dr. Holger Kenn; [cited 2014 May 24]. Available from: http://www.cubeos.org/lectures/W/ln_2.pdf.

[5] The History of Wearable Technology [homepage on the Internet]. Chicago: The Association; c1995-2002 [updated 2001 Aug 23; cited 2002 Aug 12]. Available from: http://217.199.187.63/click2history.com/?p=21.

[6] Glacier Computer – Press Release [homepage on the Internet]. Glacier Computer; [updated 2014 May 24; cited 2014 May 25]. Available from: https://github.com/zxing/zxing.

[7] Pebble: E-Paper Watch for iPhone and Android [homepage on the Internet]. Kickstarter, Inc.; [cited 2014 May 25]. Available from: https://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android.

[8] Oculus Rift [homepage on the Internet]. Oculus VR, Inc.; [cited 2014 May 25]. Available from: http://www.oculusvr.com/rift/.

[9] SmartWatch [homepage on the Internet]. Sony Mobile Communications; [cited 2014 May 25]. Available from: http://www.sonymobile.com/global-en/products/accessories/smartwatch/.

[10] U.S. News Center [homepage on the Internet]. Samsung; [updated 2013 Sep 05; cited 2014 May 25]. Available from: http://www.samsung.com/us/news/newsRead.do?news_seq=21647.


[11] Apple iWatch release date, news and rumors [homepage on the Internet]. TechRadar; [updated 2014 April 09; cited 2014 May 25]. Available from: http://www.techradar.com/news/portable-devices/apple-iwatch-release-date-news-and-rumours-1131043.

[12] Apple iWatch: Price, rumours, release date and leaks [homepage on the Internet]. Beauford Court, 30 Monmouth Street, Bath BA1 2BW: Future Publishing Limited; [updated 2014 Apr 09; cited 2014 May 25]. Available from: http://www.t3.com/news/apple-iwatch-rumours-features-release-date.

[13] Meta – SpaceGlasses [homepage on the Internet]. META; [cited 2014 May 25]. Available from: https://www.spaceglasses.com/.

[14] M100 – Smart Glasses | Vuzix [homepage on the Internet]. Vuzix; [cited 2014 May 25]. Available from: http://www.vuzix.com/consumer/products_m100/.

[15] Recon Jet [homepage on the Internet]; [cited 2014 May 25]. Available from: http://reconinstruments.com/products/jet/.

[16] Azuma R, Baillot Y, Behringer R, Feiner S, Julier S, MacIntyre B. Recent Advances in Augmented Reality. IEEE Comput Graph Appl. 2001 Nov;21(6):34–47.

[17] Milgram P, Takemura H, Utsumi A, Kishino F. Augmented Reality: A Class of Displays on the Reality-Virtuality Continuum; 1994. p. 282–292.

[18] Liestol G, Morrison A. Views, alignment and incongruity in indirect augmented reality. In: Mixed and Augmented Reality - Arts, Media, and Humanities (ISMAR-AMH), 2013 IEEE International Symposium on; 2013. p. 23–28.

[19] Word Lens on App Store on iTunes [homepage on the Internet]. Apple Inc.; [updated 2014 Apr 18; cited 2014 May 25]. Available from: https://itunes.apple.com/en/app/word-lens/id383463868?mt=8.

[20] Iwamoto K, Kizuka Y, Tsujino Y. Plate bending by line heating with interactive support through a monocular video see-through head mounted display. In: Systems Man and Cybernetics (SMC), 2010 IEEE International Conference on; 2010. p. 185–190.

[21] Kijima R, Ojika T. Reflex HMD to compensate lag and correction of derivative deformation. In: Virtual Reality, 2002. Proceedings. IEEE; 2002. p. 172–179.

[22] Genc Y, Sauer F, Wenzel F, Tuceryan M, Navab N. Optical see-through HMD calibration: a stereo method validated with a video see-through system. In: Augmented Reality, 2000. (ISAR 2000). Proceedings. IEEE and ACM International Symposium on; 2000. p. 165–174.


[23] Rolland JP, Fuchs H. Optical Versus Video See-Through Head-Mounted Dis- plays in Medical Visualization. Presence: Teleoper Virtual Environ. 2000 jun;9(3):287–309.

[24] Hakkinen J, Vuori T, Paakka M. Postural stability and sickness symptoms after HMD use. In: Systems, Man and Cybernetics, 2002 IEEE International Conference on. vol. 1; 2002. p. 147–152.

[25] Sachs D. Sensor Fusion on Android Devices: A Revolution in Motion Pro- cessing; 2010. 15:20 - 16:20. [Google Tech Talk]. Available from: https: //www.youtube.com/watch?v=C7JQ7Rpwn2k.

[26] Sachs D. Sensor Fusion on Android Devices: A Revolution in Motion Pro- cessing; 2010. 21:30 - 23:10. [Google Tech Talk]. Available from: https: //www.youtube.com/watch?v=C7JQ7Rpwn2k.

[27] Deak G, Curran K, Condell J. A survey of active and passive indoor localisation systems. Computer Communications. 2012;35(16):1939 – 1954.

[28] Sachs D. Sensor Fusion on Android Devices: A Revolution in Motion Pro- cessing; 2010. 23:10 - 27:40. [Google Tech Talk]. Available from: https: //www.youtube.com/watch?v=C7JQ7Rpwn2k.

[29] Kato H, Billinghurst M. Marker tracking and HMD calibration for a video- based augmented reality conferencing system. In: Augmented Reality, 1999. (IWAR ’99) Proceedings. 2nd IEEE and ACM International Workshop on; 1999. p. 85–94.

[30] Skoczewski M, Maekawa H. Augmented Reality System for Accelerometer Equipped Mobile Devices. In: Computer and Information Science (ICIS), 2010 IEEE/ACIS 9th International Conference on; 2010. p. 209–214.

[31] Kim S, Lee GA, Sakata N. Comparing pointing and drawing for remote collab- oration. In: Mixed and Augmented Reality (ISMAR), 2013 IEEE International Symposium on; 2013. p. 1–6.

[32] Umbaugh SE. Digital Image Processing and Analysis: Human and Computer Vision Applications with CVIPtools, Second Edition. 2nd ed. Boca Raton, FL, USA: CRC Press, Inc.; 2010.

[33] Edge detection Pixel Shader [homepage on the Internet]. Agnius Vasiliauskas; [updated 2010 Jun 3; cited 2014 May 25]. Available from: http://coding-experiments.blogspot.se/2010/06/edge-detection.html.

[34] Pinaki Pratim Acharjya DG Soumya Mukherjee. Digital Image Segmentation Using Median Filtering and Morphological Approach. The Annals of Statistics. 2014 01;4(1):552–557.


[35] Arias-Castro E, Donoho DL. Does median filtering truly preserve edges better than linear filtering? The Annals of Statistics. 2009 06;37(3):1172–1206.

[36] Strzodka R, Ihrke I, Magnor M. A graphics hardware implementation of the generalized Hough transform for fast object recognition, scale, and 3D pose detection. In: Image Analysis and Processing, 2003.Proceedings. 12th Inter- national Conference on; 2003. p. 188–193.

[37] Duda RO, Hart PE. Use of the Hough Transformation to Detect Lines and Curves in Pictures. Commun ACM. 1972 Jan;15(1):11–15.

[38] Dillencourt MB, Samet H, Tamminen M. A General Approach to Connected- component Labeling for Arbitrary Image Representations. J ACM. 1992 Apr;39(2):253–280.

[39] Connected Components Labeling [homepage on the Internet]. R. Fisher, S. Perkins, A. Walker and E. Wolfart; [cited 2014 May 25]. Available from: http://homepages.inf.ed.ac.uk/rbf/HIPR2/label.htm.

[40] Google Glass Teardown [homepage on the Internet]. Scott Torborg, Star Simp- son; [cited 2014 May 25]. Available from: http://www.catwig.com/ google-glass-teardown/.

[41] Tech specs - Google Glass Help [homepage on the Internet]. Google; [cited 2014 May 25]. Available from: https://support.google.com/glass/ answer/3064128?hl=en.

[42] Wink - Google Glass Help [homepage on the Internet]. Google; [cited 2014 May 25]. Available from: https://support.google.com/glass/answer/ 4347178?hl=en.

[43] Starner T. Project Glass: An Extension of the Self. Pervasive Computing, IEEE. 2013 April;12(2):14–16.

[44] Motorola Moto G – Full phone specifications [homepage on the Internet]. GS- MArena.com; [cited 2014 May 25]. Available from: http://www.gsmarena. com/motorola_moto_g-5831.php.

[45] Alain Vongsouvanh JM. Building Glass Services with the Google Mirror API; 2013. 40:00 - 40:30. [Google I/O 2013]. Available from: https://www. youtube.com/watch?v=CxB1DuwGRqk.

[46] They’re No Google Glass, But These Epson Specs Offer A New Look At Smart Eyewear [homepage on the Internet]. Say Media Inc.; [updated 2014 May 20; cited 2014 May 25]. The New Reality; [about 3 screens]. Available from: http://readwrite.com/2014/05/20/augmented-reality-epson-moverio-google-glass-oculus-rift-virtual-reality.


[47] Mistry P, Maes P. SixthSense: A Wearable Gestural Interface. In: ACM SIGGRAPH ASIA 2009 Sketches. SIGGRAPH ASIA ’09. New York, NY, USA: ACM; 2009. p. 11:1–11:1.

[48] Meta Pro AR Goggles Kick Google’s Glass - Yahoo News [homepage on the Internet]. Yahoo News; c1995-2002 [updated 2014 Jan 8; cited 2014 May 25]. Available from: http://news.yahoo.com/meta-pro-ar-goggles-kick-122918255.html.

[49] meta: The Most Advanced Augmented Reality Glasses [homepage on the Internet]. Kickstarter, Inc.; [cited 2014 May 25]. The Tech Specs – Software; [about 2 screens from the bottom]. Avail- able from: https://www.kickstarter.com/projects/551975293/ meta-the-most-advanced-augmented-reality-interface.

[50] The Tantalizing Possibilities of an Oculus Rift Mounted Camera [homepage on the Internet]; [updated 2014 May 14; cited 2014 May 25]. Available from: http://www.roadtovr.com/oculus-rift-camera-mod-lets-you-bring-the-outside-world-in/.

[51] Tech Specs | Recon Jet [homepage on the Internet]. Recon Instruments; [cited 2014 May 25]. Available from: http://www.reconinstruments.com/ products/jet/tech-specs/.

[52] XMReality – XMExpert consists of two parts [homepage on the Internet]. XMReality; [cited 2014 May 25]. Available from: http://xmreality.se/ product/?lang=en.

[53] Mobile Video Platform for Field Service [homepage on the Internet]. VIPAAR; [cited 2014 May 25]. Available from: http://www.vipaar.com/platform.

[54] HTML5 Introduction [homepage on the Internet]. Refsnes Data; [cited 2014 May 25]. Available from: http://www.w3schools.com/html/html5_ intro.asp.

[55] WebRTC 1.0: Real-time Communication Between Browsers [homepage on the Internet]. World Wide Web Consortium; [updated 2014 Apr 10; cited 2014 May 24]. Available from: http://dev.w3.org/2011/webrtc/editor/ webrtc.html.

[56] Is WebRTC ready yet? [homepage on the Internet]; [cited 2014 May 24]. Available from: http://iswebrtcreadyyet.com/.

[57] WebRTC | MDN [homepage on the Internet]. Mozilla Developer Network; [updated 2014 May 15; cited 2014 May 24]. Available from: https:// developer.mozilla.org/en-US/docs/WebRTC.


[58] RFC 4960 — Stream Control Transmission Protocol [homepage on the Inter- net]. Internet Engineering Task Force; [updated 2007 Sep; cited 2014 May 24]. Available from: http://www.ietf.org/rfc/rfc2960.txt.

[59] The Standard for Embedded Accelerated 3D Graphics [homepage on the Inter- net]. Beaverton, OR, USA: The Khronos Group; [cited 2014 May 22]. Available from: http://www.khronos.org/opengles/2_X/.

[60] The Industry’s Foundation for High Performance Graphics [homepage on the Internet]. Beaverton, OR, USA: The Khronos Group; [cited 2014 May 22]. Available from: http://www.khronos.org/opengl/.

[61] HTMLCanvasElement [homepage on the Internet]. Mozilla Developer Network; [updated 2014 Mar 21; cited 2014 May 22]. Avail- able from: https://developer.mozilla.org/en-US/docs/Web/ API/HTMLCanvasElement.

[62] Marroquim R, Maximo A. Introduction to GPU Programming with GLSL. In: Proceedings of the 2009 Tutorials of the XXII Brazilian Symposium on Com- puter Graphics and Image Processing. SIBGRAPI-TUTORIALS ’09. Wash- ington, DC, USA: IEEE Computer Society; 2009. p. 3–16.

[63] WebGL Fundamentals [homepage on the Internet]. Gregg Tavares; [updated 2012 Feb 9; cited 2014 May 22]. Available from: http://www.html5rocks.com/en/tutorials/webgl/webgl_fundamentals/.

[64] Litherum: How Multisampling Works in OpenGL [homepage on the Inter- net]. Litherum; [updated 2014 Jan 6; cited 2014 May 24]. Available from: http://litherum.blogspot.se/2014/01/how-multisampling- works-in-opengl.html.

[65] OpenGL Wiki - Framebuffer [homepage on the Internet]; [cited 2014 May 22]. Available from: http://www.opengl.org/wiki/Framebuffer.

[66] Birrell AD, Nelson BJ. Implementing Remote Procedure Calls. ACM Trans Comput Syst. 1984 Feb;2(1):39–59.

[67] GStreamer [homepage on the Internet]; [cited 2014 May 22]. Available from: http://gstreamer.freedesktop.org/.

[68] GObject [homepage on the Internet]. The GNOME Project; [cited 2014 May 22]. Available from: https://developer.gnome.org/gobject/ stable/.

[69] NEON – ARM [homepage on the Internet]. ARM Ltd.; [cited 2014 May 24]. Available from: http://www.arm.com/products/processors/ technologies/neon.php.


[70] Chandra S, Dey S. Addressing computational and networking constraints to enable video streaming from wireless appliances. In: Embedded Systems for Real-Time Multimedia, 2005. 3rd Workshop on; 2005. p. 27–32.

[71] Android - Media [homepage on the Internet]. Android Open Source Project; [cited 2014 May 22]. Available from: https://source.android.com/devices/media.html.

[72] - [homepage on the Internet]; [cited 2014 May 22]. Available from: https://gitorious.org/gstreamer-omap.

[73] gstreamer/gst-omx [git repository]; [cited 2014 May 22]. Available from: http://cgit.freedesktop.org/gstreamer/gst-omx/.

[74] NodeJS [homepage on the Internet]. Joyent, Inc.; [cited 2014 May 22]. Available from: http://nodejs.org/.

[75] GNU Development Tools — LD [homepage on the Internet]. Panagiotis Christias; 1994 [cited 2014 May 23]. Available from: http://unixhelp.ed.ac.uk/CGI/man-cgi?ld.

[76] TI E2E Community [homepage on the Internet]. Texas Instruments; [cited 2014 May 23]. Available from: http://e2e.ti.com/.

[77] Android API Reference — Canvas [homepage on the Internet]; [updated 2014 May 20; cited 2014 May 24]. Available from: http://developer.android.com/reference/android/graphics/Canvas.html.

[78] Mozilla Developer Network — Canvas [homepage on the Internet]. Mozilla; [updated 2014 May 23; cited 2014 May 24]. Available from: https://developer.mozilla.org/en-US/docs/Web/HTML/Canvas.

[79] Abramov A, Kulvicius T, Wörgötter F, Dellen B. Real-Time Image Segmentation on a GPU. In: Keller R, Kramer D, Weiss JP, editors. Facing the Multicore-Challenge. vol. 6310 of Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2010. p. 131–142.

[80] Mozilla Developer Network — Using Web Workers [homepage on the Internet]. Mozilla; [updated 2014 May 21; cited 2014 May 24]. Available from: https://developer.mozilla.org/en-US/docs/Web/Guide/Performance/Using_web_workers.

[81] Official ZXing (“Zebra Crossing”) project home [homepage on the Internet]. GitHub, Inc.; [updated 2014 May 24; cited 2014 May 24]. Available from: https://github.com/zxing/zxing.

[82] The Web Sockets API [homepage on the Internet]. World Wide Web Consortium; [cited 2014 May 24]. Available from: http://www.w3.org/TR/2009/WD-websockets-20091222/.


[83] Server-Sent Events [homepage on the Internet]. World Wide Web Consortium; [updated 2012 Dec 11; cited 2014 May 24]. Available from: http://www.w3.org/TR/eventsource/.

[84] Where polyfill came from / on coining the term [homepage on the Internet]. Remy Sharp; [updated 2010 Oct 8; cited 2014 May 24]. Available from: http://remysharp.com/2010/10/08/what-is-a-polyfill/.
