Lightweight Algorithms for Depth Sensor Equipped Embedded Devices

Henry Zhong

A thesis in fulfillment of the requirements for the degree of

Doctor of Philosophy

School of Computer Science and Engineering

Faculty of Engineering

The University of New South Wales

May 2017

THE UNIVERSITY OF NEW SOUTH WALES
Thesis/Dissertation Sheet

Surname or Family name: Zhong
First name: Henry
Other name/s:
Abbreviation for degree as given in the University calendar: PhD
School: School of Computer Science and Engineering
Faculty: Faculty of Engineering
Title: Lightweight Algorithms for Depth Sensor Equipped Embedded Devices

Abstract (350 words maximum)

Depth sensors have appeared in a variety of embedded devices. They can be found in tablets, smartphones and web cameras. At the time of writing, research into depth sensor equipped embedded devices is still in its infancy. This has led to two key questions: what kinds of applications can take advantage of depth sensor equipped embedded devices, and how can algorithms be implemented efficiently on resource-constrained embedded devices? The purpose of this thesis is to address these questions. We do so by presenting 3 prototype systems and accompanying lightweight algorithms. The prototypes demonstrate example applications for the use of depth sensors in pervasive computing. The novel algorithms make use of depth data to solve common problems, while being lightweight enough in resource consumption to operate on embedded devices. Our algorithms are lightweight because we use simpler features than existing algorithms, while achieving better results by several metrics compared to the current state of the art. It is hoped the presented work enlightens the reader on the possible applications for depth sensor equipped embedded devices. The 3 prototypes target 3 major areas of research. The first is QuickFind, which contains an algorithm for fast segmentation and object detection using a depth sensor; it is applied to a prototype augmented reality assembly aid. The second is WashInDepth, which contains a fast hand gesture recognition algorithm for monitoring correct handwashing. The third is VeinDeep, for fast vein pattern recognition using a depth sensor; we use it to secure depth sensor equipped embedded devices. It is the first instance where depth sensors have been used for vein pattern recognition.

Declaration relating to disposition of project thesis/dissertation I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

Signature Witness Date

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

Originality Statement

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’

Henry Zhong
January 31, 2017

Copyright Statement

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only).

I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’

Henry Zhong January 31, 2017

Authenticity Statement

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Henry Zhong
January 31, 2017

Abstract

Depth sensors have appeared in a variety of embedded devices. This includes tablets, smartphones and web cameras. This has provided a new mode of sensing, where it is possible to record an image and the distance to everything in the image. Some pervasive computing applications have taken advantage of depth sensors, such as crowd sourced 3D indoor mapping. However, research into this area is still in its infancy, and some questions remain before widespread adoption: what kinds of applications can take advantage of depth sensor equipped embedded devices, and how can algorithms be implemented efficiently on resource-constrained embedded devices?

The purpose of this thesis is to address these questions. We do so by presenting 3 prototype systems and accompanying lightweight algorithms. Each algorithm uses depth sensors to overcome problems in visual pattern matching and is lightweight enough to run on embedded platforms. We do this while achieving better results by several metrics compared to the current state of the art. These metrics include pattern matching accuracy, asymptotic complexity, run time and memory use.

The first contribution of this thesis is QuickFind, a fast segmentation and object detection algorithm, which is applied to a prototype augmented reality assembly aid. We test it against two related algorithms and implement our prototype on a Raspberry Pi. The two related algorithms are: Histogram of Oriented Gradients (HOG), a popular object detection algorithm, and Histogram of Oriented Normal Vectors (HONV), a state of the art algorithm specifically designed for use with depth sensors. Our test data is the RGB-D Scenes v1 dataset consisting of 6 object classes, in 1434 scenes of domestic and office environments. On our test platform QuickFind achieved the best results with 1/18 the run time, 1/18 the power use and 1/3 the memory use compared to HOG, and 1/279 the run time, 1/279 the power use and 1/15 the memory use compared to HONV. QuickFind has a lower asymptotic upper bound and almost double the average precision compared to HOG and HONV.

The second contribution of this thesis is WashInDepth, a fast hand gesture recognition algorithm, which is applied to a prototype that monitors correct hand washing. We test it

against HOG and HONV and implement our prototype on a Compute Stick. WashInDepth is an extension of QuickFind. Segmentation is replaced with a background removal step. QuickFind features are used to perform hand gesture recognition, based on video recorded from a depth sensor. We test with 15 participants with 3 videos each for a total of 45 videos. WashInDepth achieved the best results with an average accuracy of 94% and a run time of 11 ms. HOG achieved 86% average accuracy and 19 ms average run time. HONV achieved 88% average accuracy and 22 ms average run time. All 3 algorithms had average memory usage within 4 KiB of each other.

The third contribution of this thesis is VeinDeep. VeinDeep performs identification using vein pattern recognition. We repurpose depth sensors for this task. As far as we are aware, it is the first instance where depth sensors have been used for this purpose. The prototype application for VeinDeep is designed for securing smartphones with an integrated depth sensor. As such devices are not widely available at the time of writing, the system is simulated on a Compute Stick with an attached depth sensor. We test VeinDeep against two related algorithms: Hausdorff distance, an older but popular algorithm for vein pattern recognition, and Kernel distance, an algorithm more recently applied to vein pattern recognition. We test with 20 participants, 6 images per hand for a total of 240 images. On our embedded platform VeinDeep achieved the best results with 1/6 the run time and 2/3 the memory use compared to Hausdorff distance, and 1/3 the run time and 1/2 the memory use compared to Kernel distance. VeinDeep had precision of 0.98 and recall of 0.83. At the same recall level Hausdorff distance had precision of 0.5 and Kernel distance had precision of 0.9. VeinDeep also had lower average complexity compared to Hausdorff and Kernel distance.

Although the prototypes in this thesis focus on three specific problems, the algorithms accompanying the prototypes are general purpose. We hope that the presented content enlightens the reader and encourages new applications which make use of depth sensor equipped embedded devices.

Acknowledgements

I thank my supervisors A/Prof. Salil Kanhere, and A/Prof. Chun Tung Chou. I have worked with them since my undergraduate days. Over the years both have tirelessly helped improve my work. I have always benefitted from following their wise advice, many times I did not realise the value of taking a particular course of action they suggested until much later.

I would specifically like to thank Salil for handling the many administrative issues over the course of writing this thesis. I would specifically like to thank Chun Tung for asking me questions, which made me pause and think. Many of my ideas were not clear and fully formed until I managed to answer his questions.

I thank my parents. They took care of me over the years and unburdened me so that I may pursue my research.

I thank the many friends who have contributed to this thesis. They volunteered their time to participate in my experiments and gave helpful suggestions.

Abbreviations

CCL Connected Components Labelling

DMMHOG Depth Motion Maps Histograms of Oriented Gradients

FF Flood Fill

HOG Histogram of Oriented Gradients

HON4D Histogram of Oriented 4D Normals

HONV Histogram of Oriented Normal Vectors

IR Infrared

RGB Red/Green/Blue

ROI Region of interest

SIFT Scale Invariant Feature Transform

SURF Speeded Up Robust Features

TOF Time of Flight

Publications

1. H. Zhong, S. S. Kanhere, and C. T. Chou, QuickFind: Fast and contact-free object detection using a depth sensor, in 2016 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops), 5th CoSDEO workshop on Contact-Free Ambient Sensing, Localization and Tracking, pp. 1-6, IEEE, 2016.

2. H. Zhong, S. S. Kanhere, and C. T. Chou, WashInDepth: Lightweight handwash monitor using depth sensor, in Proceedings of the 13th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pp. 28-37, ACM, 2016.

3. H. Zhong, S. S. Kanhere, and C. T. Chou, VeinDeep: Smartphone unlock using vein patterns, in 2017 IEEE International Conference on Pervasive Computing and Communication (PerCom), IEEE, 2017.

Contents

1 Introduction 1

1.1 Overview of Depth Sensors ...... 3

1.2 Motivation and Goals ...... 6

1.3 Contributions ...... 6

1.4 Organisation ...... 9

2 Literature Review 10

2.1 Segmentation and Object Detection ...... 11

2.1.1 Histogram of Oriented Gradient and Normal Vectors . . . . . 12

2.2 Hand Gesture Recognition ...... 13

2.3 Vein Pattern Recognition ...... 15

2.3.1 Kernel and Hausdorff Distance ...... 17

3 QuickFind Fast Segmentation and Object Detection 19

3.1 System ...... 20

3.2 Algorithm ...... 21

3.2.1 Overview of Segmentation ...... 22

3.2.2 Custom Connected Components Labelling (CCL) Rules . . . . 25

3.2.3 Features ...... 28

3.3 Experiments ...... 29

3.3.1 Test Data and Parameters ...... 29

3.3.2 Hardware and Software ...... 32

3.3.3 Segmentation Accuracy ...... 33

3.3.4 Detection Accuracy ...... 34

3.3.5 Speed, Memory and Power Consumption ...... 35

3.3.6 Effect of Changing Parameters ...... 37

3.3.7 Complexity Analysis ...... 39

3.4 Summary ...... 41

4 WashInDepth Fast Hand Gesture Recognition 43

4.1 System ...... 45

4.2 Algorithm ...... 46

4.2.1 Background Removal ...... 46

4.2.2 Features ...... 50

4.2.3 Classification ...... 52

4.2.4 Smoothing ...... 53

4.2.5 Check Gesture Sequence ...... 53

4.2.6 Algorithm Complexity ...... 53

4.3 Experiments ...... 54

4.3.1 Data Collection ...... 54

4.3.2 Test Systems ...... 55

4.3.3 Experiment Setup ...... 58

4.3.4 Experiment Results ...... 59

4.3.5 Comparison to HOG and HONV ...... 66

4.4 Summary ...... 68

5 VeinDeep Fast Vein Pattern Recognition 69

5.1 System ...... 71

5.2 Algorithm ...... 72

5.2.1 Extract Vein Pattern ...... 74

5.2.2 Calculate Derivatives and Angles ...... 76

5.2.3 Rotate to Reduce Distortion ...... 78

5.2.4 Clean Image ...... 80

5.2.5 Effect of Changing Pre-processing Parameters ...... 81

5.2.6 Convert Vein Pattern to Sequence ...... 82

5.2.7 Compare Sequences ...... 82

5.3 Experiments ...... 84

5.3.1 System Setup ...... 84

5.3.2 Data Collection ...... 85

5.3.3 Matching Accuracy ...... 86

5.3.4 Matching Speed ...... 87

5.3.5 Memory Use ...... 89

5.3.6 Changing the Kernel Width ...... 90

5.4 Summary ...... 92

6 Conclusion 93

6.1 Future Work ...... 94

Bibliography 97

List of Figures

1.1 A demonstration of the capabilities of depth sensors. RGB image on the left shows 2 coffee mugs. The left coffee mug is a 2D image displayed on a monitor. The depth map on the right reveals this deception...... 3

1.2 The Kinect V1 (left) and V2 (right)...... 4

1.3 The Kinect V1 structured light IR image (left) and V2 time of flight IR image (right)...... 5

1.4 An example of the 3 images produced by Kinect V1. IR (left), RGB (right) and depth map (bottom)...... 5

2.1 Under HOG and HONV algorithms the ROI is bounded by a rectangular window. The window is divided into blocks, blocks are divided into cells and features are computed from each pixel in each cell. . . . . 13

3.1 The QuickFind prototype is an augmented reality object assembly aid. The green bounding boxes are overlayed when all expected ob- jects are detected. A red bounding box is overlayed when some ex- pected objects are missing...... 21

3.2 QuickFind algorithm workflow...... 22

3.3 The output from segmentation using QuickFind. Clockwise from top left: Each coloured region represents a segment. The image on the top right represents one segment and is divided into blocks for feature computation, the black regions within the bounding box are padded with null/zero value pixels. Lastly below is the input depth map. . . 24

3.4 Rule functions used for segmentation are computed using regression from a series of cropped depth maps. The rules control the threshold for segment height, width, depth and maximum absolute difference between neighbouring depth pixels...... 28

3.5 An example scene from the object detection dataset. Clockwise from top left: Bitmap, depth map, ground truth mask, the substitute objects used for parameter estimation...... 30

3.6 Precision and recall rates for object detection using QuickFind, HOG and HONV...... 36

3.7 Speed, memory and power consumption of QuickFind, HOG and HONV...... 37

3.8 QuickFind detection rate improves with increasing blocks up to 20 × 20. Run time changes little when number of blocks grow...... 38

3.9 Changing parameters has little effect on QuickFind speed. Speed increases linearly when input size is changed...... 38

4.1 WashInDepth detects if a subject has correctly lathered their hands with soap. A correct hand wash involves lathering by performing the 9 gestures in steps 1 - 9, in the order depicted. Detection is triggered using a phone and performed using a depth sensor mounted above the wash basin...... 46

4.2 WashInDepth algorithm workflow. Gestures are detected by computing features from each frame as depicted. The queue of observed and expected gestures are compared to determine if the correct wash gestures have been performed...... 47

4.3 WashInDepth parameter choice. The image shows x1, x2, y1 and y2 are chosen so they bound the basin edges. This represents the region of interest. r, c and o are chosen based on the proportion of zero value pixels represented by the black regions...... 50

4.4 WashInDepth test environment with dimensions...... 56

4.5 WashInDepth test platform using a Compute Stick...... 57

4.6 The bar chart shows the gesture detection accuracy of WashInDepth while testing a range of block values without smoothing. The confusion matrix shows the results of the best performing iteration which was achieved by using 15 × 15 blocks...... 59

4.7 The distribution of detected gestures for every 50 frames of a typical instance of recorded depth video. The colour-coded legend on right indicates gesture class...... 60

4.8 The bar chart shows the gesture detection accuracy while testing a range of smoothing window lengths and by using 15 × 15 blocks. The confusion matrix shows the results of the best performing iteration which was achieved by using g = 50...... 62

4.9 The bar charts show the gesture detection accuracy for several permutations of frame rates and smoothing window lengths...... 64

4.10 The bar charts show the gesture detection accuracy under the person- dependent scenario. Data for each participant was assigned to 3 groups for 3-fold cross validation...... 65

4.11 The mean run time per frame and the proportion of time spent on each task for WashInDepth. Overhead was the time taken for loading libraries, memory allocation and other miscellaneous processes. . . . . 65

4.12 The amount of memory used per frame for WashInDepth. This value was computed by counting every variable used to process each frame multiplied by the number of bytes consumed by each variable. . . . . 66

4.13 The person dependent gesture recognition results for WashInDepth, HONV and HOG...... 67

4.14 The average memory and latency when comparing WashInDepth, HONV and HOG...... 67

5.1 VeinDeep is designed to unlock infrared depth sensor equipped smartphones using vein patterns. The user takes an infrared image and depth map of the back of their hand. This is used to extract vein patterns which identify the user...... 71

5.2 VeinDeep algorithm workflow from raw infrared image and depth map to vein pattern...... 72

5.3 Examples of vein patterns from 3 different participants...... 72

5.4 The series of vein pattern images were taken in succession from the right hand of the same person. However there are small differences in each extracted vein pattern image. We refer to this as Jitter...... 73

5.5 VeinDeep algorithm workflow from vein pattern to sequence of points. 80

5.6 The effect on vein pattern from changing pre-processing parameters are shown in this image...... 81

5.7 We compare the similarity of sequences of key points from 2 vein patterns. First by computing the Euclidean distance between key points in both sequences. Then the Euclidean distance is used to compute a similarity score...... 83

5.8 VeinDeep used a Kinect V2...... 85

5.9 The precision / recall results of vein pattern matching algorithms. VeinDeep with a precision of 0.98, a recall of 0.83, edges out Kernel distance which has a precision of 0.90 at a similar recall level. . . . . 88

5.10 The confusion matrices for VeinDeep, Kernel and Hausdorff distance when precision is at approximately 0.98...... 88

5.11 The mean run time to match a pair of vein pattern images...... 90

5.12 The mean memory usage to match a pair of vein pattern images. . . . 90

5.13 The precision / recall values of VeinDeep after changing σ. σ is the kernel width, which controls the penalty when a point in the reference image does not closely correspond with a point in the test image. Changing this value outside a narrow niche led to non-optimal results. It is the only parameter in VeinDeep which is not related to pre-processing of the vein pattern image...... 91

List of Tables

1.1 Specifications for Kinect V1 and V2...... 4

3.1 Table of QuickFind variable and function definitions...... 23

3.2 QuickFind parameters used in testing...... 31

3.3 HOG and HONV parameters used in testing. The parameter names come from OpenCV...... 31

3.4 SVM and FF parameters used in testing. SVM parameter names come from OpenCV...... 32

3.5 QuickFind test platform specifications...... 32

3.6 Number of object instances successfully segmented by FF and QuickFind...... 34

4.1 The names and number of instances of every gesture in the test data. The 1st column corresponds with the numbered illustrations of Figure 4.1...... 56

4.2 WashInDepth, HOG and HONV parameter values. Bold values led to best results and were used as default values in testing...... 57

4.3 WashInDepth test platform specifications...... 58

5.1 VeinDeep test platform specifications...... 84

5.2 VeinDeep parameter values...... 86

Chapter 1

Introduction

Depth sensors are a type of camera. The release of the Kinect 1 meant the widespread availability of an off-the-shelf and affordable depth sensor. At the time of writing, a number of years have elapsed since the release of the Kinect. A variety of depth sensor equipped embedded devices have recently come to market. Some examples are tablets, smartphones 2 and web cameras 3.

The key advantage of depth sensors over conventional cameras is the recording of distance to objects in captured images. The extra data can be used to improve solutions to existing problems. An example is 3D indoor mapping and localisation using smartphones 4. As depth sensor equipped smartphones become ubiquitous, it becomes possible to crowd source 3D indoor maps from the data recorded by the general public. In turn, the maps provide an indoor localisation service. The maps can be accurately stitched together using the 3D data. Localisation can be achieved by aligning the scene in the sensor’s field of view against the closest match from the maps [1]. This is an improvement because of the low cost. There is no need to dedicate personnel to the task of mapping and it circumvents the need to build additional hard infrastructure for localisation.

1. https://developer.microsoft.com/en-us/windows/kinect
2. https://get.google.com/tango
3. https://www-ssl.intel.com/content/www/au/en/architecture-and-technology/realsense-overview.html
4. https://www.bloomberg.com/news/articles/2016-05-12/google-looks-beyond-maps-to-chart-the-interior-world-in-3-d

There are pervasive computing applications for depth sensors, yet to be discovered. Each released device has primarily targeted one problem. The Kinect is mainly used for skeleton tracking [2] and the aforementioned smartphone is mainly used for indoor mapping. As devices become more ubiquitous, we are likely to find some unexpected uses. One novel example is KinSpace [3], a system for obstruction detection using Kinect. It finds the floor plane in home and office environments, any vertical clutter on the plane is considered an obstruction. This is a task the manufacturer never intended for the device.

Despite its usefulness, there exist few bespoke algorithms which both make use of depth sensors and work with the limited resources of embedded devices. Lightweight algorithms are important because the vast majority of depth sensors will appear in embedded devices. Research projects may make use of depth sensors with impressive results, but these projects offload processing to more powerful hardware. An application running on a smartphone may not be allowed to offload processing due to privacy, bandwidth costs and the real-time nature of some applications. An example is a Correlation Clustering algorithm by Firman et al. [4] with highly accurate image segmentation and object detection, but it takes an average of over 20 seconds per image on current desktop class hardware. The lack of suitable algorithms means the majority of depth sensor equipped embedded devices remain relatively underutilised.


Figure 1.1: A demonstration of the capabilities of depth sensors. RGB image on the left shows 2 coffee mugs. The left coffee mug is a 2D image displayed on a monitor. The depth map on the right reveals this deception.

1.1 Overview of Depth Sensors

Depth sensors record depth maps. A depth map consists of a 2D array of depth pixels. Each depth pixel records the distance from the sensor to a point in the recorded scene. In other properties depth sensors resemble conventional cameras. The field of view corresponds with a sensor size and lens focal length.

Depth sensors have greater resilience to spoofing; we demonstrate this in Figure 1.1. The RGB image on the left shows 2 coffee mugs, the left coffee mug is a 2D image displayed on a monitor. The depth map on the right reveals this deception.

Depth sensors help overcome some of the challenges of the invariance problem [5]. This is the problem of object detection/recognition in the presence of appearance variation caused by changing lighting and background. Depth sensors are able to produce their own illumination and background can be removed using a depth threshold.
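As an illustration of the depth-threshold idea mentioned above, the following is a minimal sketch in Python/NumPy (an illustrative choice, not the implementation used in this thesis). It assumes the depth map is stored as a 2D array of millimetre values with 0 marking sensor errors; the threshold value is purely hypothetical.

```python
import numpy as np

def remove_background(depth_map, max_distance_mm=1200):
    """Zero out pixels further away than a distance threshold.

    depth_map: 2D array of depth values in millimetres, where 0 marks a
    sensor error. max_distance_mm is an illustrative cut-off, not a value
    taken from the thesis.
    """
    foreground = depth_map.copy()
    foreground[foreground > max_distance_mm] = 0   # drop distant background
    return foreground

# Example: a 3x3 depth map with a distant background plane at 2000 mm.
depth = np.array([[2000, 2000, 2000],
                  [2000,  900,  950],
                  [2000,  880,  940]], dtype=np.uint16)
print(remove_background(depth))
```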

While many depth sensors and depth sensor technologies exist, the work conducted in this thesis focuses on Kinect. The devices and specifications are shown in Table 1.1 and Figure 1.2. Kinect is based on two technologies, V1 uses structured light and V2 uses time of flight (TOF) [6]. Both also contain conventional and infrared (IR) cameras. An example of the IR image, Red/Green/Blue (RGB) and depth map are shown in Figure 1.4.

Figure 1.2: The Kinect V1 (left) and V2 (right).

Specification         V1                  V2
Depth map resolution  320 × 240           512 × 424
Frame rate            30 FPS              30 FPS
Focal length          6.497 mm            3.657 mm
Operating range       4 - 0.8 m           4.5 - 0.5 m
Technology            Structured light    Time of flight

Table 1.1: Specifications for Kinect V1 and V2.

Structured light sensors use illumination to cast a pattern. The Kinect V1 uses an IR pattern shown in Figure 1.3. The dot pattern is used to calculate the distance. The further apart the dots, the further the distance. The active IR illumination makes the device capable of measuring depth under any lighting condition in the visible spectrum. However, IR interference or absorbing materials cause sensor errors. Whenever an error occurs the corresponding depth pixel will have a value of zero.

TOF sensors use illumination to measure depth based on the time light takes to reflect off surfaces in the scene. The Kinect V2 uses IR illumination, an example image is shown in Figure 1.3. Kinect V2 produces more accurate results compared to V1. Similar to Kinect V1, V2 is capable of measuring depth under any lighting condition in the visible spectrum. IR interference or absorbing materials cause the same type of sensor error as Kinect V1.

Figure 1.3: The Kinect V1 structured light IR image (left) and V2 time of flight IR image (right).

Figure 1.4: An example of the 3 images produced by Kinect V1. IR (left), RGB (right) and depth map (bottom).

1.2 Motivation and Goals

At the time of writing, there are many depth sensor equipped embedded devices in circulation. No one fully understands the implications of this new mode of sensing for pervasive computing. This raises 2 questions which provide the motivation for this thesis: what kinds of applications can take advantage of depth sensor equipped embedded devices, and how can algorithms be implemented efficiently on resource-constrained embedded devices?

We aim to address these questions by presenting 3 prototype systems and accompanying algorithms. Each one targets a distinct area of research, these being: segmentation and object detection, hand gesture recognition and vein pattern recognition. The prototypes demonstrate novel applications which make use of depth sensors. The accompanying algorithms are lightweight and designed for embedded devices. Our algorithms are lightweight because we use simpler features compared to existing algorithms. We do this while achieving better results by several metrics compared to the current state of the art.

1.3 Contributions

The contributions of this thesis are:

1. The first contribution of this thesis is QuickFind. It is used to address the problem of fast segmentation and object detection using ONLY depth maps.


• We compare QuickFind against two related algorithms: the popular Histogram of Oriented Gradients (HOG) [7], which can be used with data from depth sensors [8], and the state of the art Histogram of Oriented Normal Vectors (HONV) [9], which is specifically designed for use with depth sensors. Our test data is the RGB-D Scenes v1 dataset [8] consisting of 6 object classes, in 1434 scenes of domestic and office environments.

• We find QuickFind achieved almost double the average precision compared to HOG and HONV.

• We port all algorithms to the embedded platform Raspberry Pi. We show QuickFind uses 1/18 the run time, 1/18 the power and 1/3 the memory compared to HOG, and 1/279 the run time, 1/279 the power and 1/15 the memory compared to HONV.

• We show QuickFind has a lower asymptotic upper bound compared to HOG and HONV.

• We show QuickFind has over double the segmentation accuracy compared to a standard Flood Fill (FF) Connected Components Labelling (CCL) algorithm [10].

• We implement QuickFind in novel prototype applications: an augmented reality assembly aid and an object locator using geo-tagged depth sensor data.

2. The second contribution of this thesis is WashInDepth. It is used to address the problem of fast hand gesture recognition.

• WashInDepth is an extension of QuickFind, we compare it against HOG and HONV. We test using hand gesture actions. Our test data contains 9 gestures performed by 15 participants, 3 videos per participant.

• We find WashInDepth achieved 94% average accuracy, compared to 86% for HOG and 88% for HONV.


• We port all algorithms to the Compute Stick. We show WashInDepth was fastest at recognising gestures, with an average run time of 11 ms compared to 19 ms for HOG and 22 ms for HONV. All 3 algorithms had average memory usage within 4 KiB of each other.

• We implement WashInDepth in a novel prototype application for monitoring if a person has correctly washed their hands, which ran on the Compute Stick.

3. The third contribution of this thesis is VeinDeep. VeinDeep performs biometric identification using vein pattern recognition. We repurpose depth sensors for this task. As far as we are aware, it is the first instance where depth sensors have been used for this purpose.

• We test VeinDeep against two related algorithms: Hausdorff distance [11], an older but popular algorithm for vein pattern recognition, and Kernel distance [12], an algorithm more recently applied to vein pattern recognition. We test with 20 participants, 6 images per hand for a total of 240 images.

• We find VeinDeep achieved best results with precision of 0.98, recall of 0.83. At the same recall level Hausdorff distance had precision of 0.5, Kernel distance had precision of 0.9.

• We port all algorithms to the embedded platform Compute Stick. We show VeinDeep uses 1/6 the run time and 2/3 the memory compared to Hausdorff distance, and 1/3 the run time and 1/2 the memory compared to Kernel distance.

• We show VeinDeep has a lower asymptotic upper bound compared to Hausdorff and Kernel distance.

• We implement VeinDeep in a novel prototype application for securing smartphones with integrated depth sensor by using vein patterns.


1.4 Organisation

This thesis is organised into the following chapters. Chapter 2 presents a literature review. Chapter 3 presents QuickFind for fast segmentation and object detection. Chapter 4 presents WashInDepth for fast hand gesture recognition. Chapter 5 presents VeinDeep for vein pattern recognition. Chapter 6 ends with the conclusion.

Chapter 2

Literature Review

This thesis targets 3 broad areas of research with the common theme of depth sensors. We have divided the literature review into 3 corresponding sections. In section 2.1 we present segmentation and object detection algorithms related to QuickFind. There is extensive existing literature and the breadth of work is too great to cover in its entirety, so the focus is on work which utilises only depth maps or sensor fusion incorporating depth maps.

In section 2.2 we present hand gesture recognition algorithms and systems related to WashInDepth. Since WashInDepth is about monitoring correct hand washing, the systems we cover are specific to hand wash monitoring. Additionally we have focused on general gesture recognition algorithms which make use of depth maps in some manner.

In section 2.3 we present vein pattern recognition algorithms related to VeinDeep. As VeinDeep performs identification using veins on the back of the hand, we have focused on hand vein pattern recognition. Finger vein pattern recognition is another major related research area, but is out of the scope of this thesis.


2.1 Segmentation and Object Detection

QuickFind contains a segmentation and object detection algorithm. The standard object detection workflow is to find a region of interest (ROI) in the test image, then attempt to look for the target object in each region. The way to generate a ROI is by dividing the test image into segments or through a sliding window. The following describes some common approaches.

The sliding window represents a rectangular ROI, which moves over a test image. Features are extracted and used to determine if the region contains a target object. To handle objects being closer, the same process is repeated with the image downscaled. The window remains the same size. This is called an image pyramid. Some of the features extracted from a window in depth maps are: Vertices (Spin) [13], image gradients (HOG) [8] and surface normals (HONV) [9]. The advantage of sliding windows is that they are highly parallelisable, but the background is introduced into the ROI during feature extraction.

Segments are non-rectangular ROIs. The edges of segments are computed using a variety of techniques. Silberman et al. [14] computes the ROI by applying Graph Cut to an image with red, green and depth channels. Each segment is classified using Spin and Scale Invariant Feature Transform (SIFT) features. Another work by Silberman et al. [15] performs major surface removal by clustering surface normal on the depth map then removing clusters. The Watershed algorithm and Probability of Boundary algorithm are applied to the bitmap for segmentation. The classifier drops Spin and instead uses the dimensions of the segment, surface normal and colour histogram. Firman et al. [4] use the same techniques for segmentation, but only applies them to depth maps. Detection occurs using Spin and HOG features. Xia et al. [16] uses only depth maps for segmentation as part of human detection. It uses a custom spatial image filter, Canny Edge Detector and CCL for segmentation.


Detection is based on shape matching to locate shapes which match a human head.

There is a third class of algorithms which are based on points of interest. Points in the reference and test ROI are compared for similarity. This can be done irrespective of the ROI shape. Some examples of depth map based points of interest algorithms are: 3D SIFT [17], which compares points based on the surface normal value at each point. 3D SIFT finds points at the local max and min values of a Difference of Gaussian filter. 3D Speeded Up Robust Features (SURF) [18] compares points based on the 3D Haar wavelet values. 3D SURF finds points using a 3D box filter.

The features of QuickFind most closely resemble that of HOG and HONV, we use the surface shape of the object as a feature. However, QuickFind does not use a sliding window. It uses segmentation, so background noise is not introduced during feature extraction. QuickFind performs segmentation and object detection exclusively with depth maps. It is not affected by lighting variations compared to sensor fusion techniques of some of the aforementioned works. QuickFind does not share commonality with the points of interest approach.

2.1.1 Histogram of Oriented Gradient and Normal Vectors

In this section we describe HOG and HONV algorithms in more detail. We compare QuickFind against both HOG and HONV in Chapter 3.

Under HOG and HONV algorithms the ROI is bounded by a rectangular window. Each window is divided into blocks, blocks are divided into cells and features are computed from each pixel in each cell. Neighbouring blocks may have overlapping cells. A histogram of values is computed from each cell, the histograms from each cell within each block are concatenated together to form the final feature vector. The feature vector is passed to a machine learning algorithm for object detection.


Figure 2.1: Under HOG and HONV algorithms the ROI is bounded by a rectangular window. The window is divided into blocks, blocks are divided into cells and features are computed from each pixel in each cell.

The following computations are made for both HOG and HONV before feature extraction. Let d(x, y) represent the depth value at position x, y; then h = ∂d(x, y)/∂x and v = ∂d(x, y)/∂y are computed.

For HOG, at each pixel we compute the features magnitude = √(h² + v²) and angle = arctan(v, h). The magnitude is added to the bins of the histogram. The quantized value of angle is used as the bin index. For HONV we find magnitude = √(h² + v² + 1), azimuth = arctan(v/2, h/2) and zenith = arctan(√((v/2)² + (h/2)²), 1). The magnitude is added to the bins of the histogram, but the bins for HONV are computed from the quantized values of both azimuth and zenith.
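To make the per-pixel quantities concrete, the sketch below evaluates the formulas exactly as written above using NumPy. The use of finite differences for the partial derivatives is an implementation assumption, and the histogram binning over cells and blocks that HOG and HONV perform afterwards is omitted.

```python
import numpy as np

def per_pixel_features(d):
    """Per-pixel quantities used by HOG and HONV, as given in the text.

    d: 2D depth map as a float array. Partial derivatives are approximated
    with finite differences (an assumption). Returns the HOG magnitude and
    angle, and the HONV magnitude, azimuth and zenith. Binning into
    cell/block histograms is not shown here.
    """
    v, h = np.gradient(d)                      # v = dd/dy, h = dd/dx
    hog_magnitude = np.sqrt(h**2 + v**2)
    hog_angle = np.arctan2(v, h)
    honv_magnitude = np.sqrt(h**2 + v**2 + 1)
    azimuth = np.arctan2(v / 2, h / 2)
    zenith = np.arctan2(np.sqrt((v / 2)**2 + (h / 2)**2), 1)
    return hog_magnitude, hog_angle, honv_magnitude, azimuth, zenith
```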

2.2 Hand Gesture Recognition

WashInDepth contains a hand gesture recognition algorithm. From an algorithmic point of view gesture recognition algorithms are divided into the model based and appearance based approaches. The model based approach involves modelling the shape of the hand based on depth data. Ren et al. [19] determined gestures by detecting the position of the fingers according to the distance to the centroid of a region

representing the hand. Kuraki et al. [20] used the same techniques in addition to the number of pixels remaining after background removal. Li [21] found the tips of the fingers based on change in angle of the hand contour. Keskin et al. [22] applied the Kinect skeleton tracking algorithm to a hand. The problem with all of these algorithms is that they are only designed to detect gestures from a single hand. This is not suitable as the hand wash guidelines involve both hands. Oikonomidis et al. [23] proposed a solution which tracks gestures of both hands. The solution matches the input to a vast template of common hand gestures using GPU acceleration. This method stands out for its ability to accurately model interaction of both hands. However, this algorithm requires GPU acceleration and is not lightweight. As such, it would not be suitable for real-time operation on a low power embedded device.

Another group of algorithms is the appearance based approach. This involves extracting features from depth data and passing them to machine learning classifiers. Features used in object detection can be adapted for gesture detection. HOG [7] is such an example which has spawned a family of action recognition algorithms. Histogram of Oriented 4D Normals (HON4D) [24] is a 4 dimensional version of HOG. Depth Motion Maps Histograms of Oriented Gradients (DMMHOG) [25] computes HOG features on absolute difference of pixel values between frames. HON4D, DMMHOG and similar algorithms place restrictions on action duration or feature vector length.

WashInDepth is based on the appearance based approach. Gestures are classified on a frame by frame basis. So there are no restrictions on action duration. The features for gesture classification are based on QuickFind. We replace the image segmentation step in QuickFind with a faster background removal step as WashInDepth is expected to operate in an environment with a fixed background. Another adaptation employed is replacing all instances where the median depth pixel value is computed, with the mean. The median value is less influenced by odd outlier depth

pixels in the recorded scene data. However, in our setting the background is fixed which minimises the possibility of outliers. Hence, we chose to use the mean value which can be computed faster than the median.

2.3 Vein Pattern Recognition

One of the earliest studies of vein pattern recognition was conducted by Fujitsu Laboratories at the beginning of the 2000s. This study [26] involved 70000 participants aged between 5 and 85. According to their research, vein patterns were unique to each individual participant. They achieved a false positive rate of 0.00008%, but their methods remain unpublished. Many publications appeared after the early study. For this section we restrict the review to a selection of papers which use images of the palm or hand dorsum veins. While finger vein pattern recognition exists, solutions require a finger to be placed and scanned inside a receptacle. So it is unlikely to be used to secure a mobile device in its current form.

Research on vein pattern recognition mainly uses IR cameras under IR illumination to acquire vein pattern images. The methods can be broadly categorised into two classes, those based on key points or a similarity score. Key point methods attempt to match corresponding points between reference and test vein patterns. One popular implementation is by Ladoux et al. [27]. An image is captured of the palm, lit by IR emitters able to deeply penetrate tissue. A threshold filter is used to extract vein patterns and a box filter to remove noise. SIFT features are used to identify points in regions with high contrast changes. Another implementation is by Kumar et al. [28]. An IR image is captured of the hand dorsum when the subject makes a fist. A Laplacian of Gaussian filter is used to extract vein patterns, a line thinning algorithm is applied and small isolated clusters of veins are removed to clean the image. Key points are based on vein endings and crossings. Ding et al. [29] uses an


IR image of the hand dorsum laid flat with a mix of techniques from the first two papers. The patterns are extracted using a threshold filter and key points are based on vein endings and crossings.

Methods based on a similarity score measure similarity in shape between two vein patterns. Wang et al. uses thermal [11] and IR [30] images of the hand dorsum. The vein patterns are extracted using a threshold filter and cleaned using a median filter. The Hausdorff distance is computed as a measure of similarity. Finally, their method [31] was modified to make it rotation invariant by aligning against the webbing between fingers. Zhang et al. [12] takes IR images of the hand dorsum. A threshold filter extracts the vein pattern images. The Kernel distance function measures similarity between vein patterns. Both methods use the coordinates of every pixel to compute the score.

VeinDeep makes use of similar filters to those used in the Wang et al. papers. Images are taken of the hand dorsum in the same manner as the Kumar et al. paper to improve vein visibility. VeinDeep compares reference and test points using a similarity function which is implemented in the same manner as Zhang's Kernel distance function. However, VeinDeep is different in that it does not use the location of every pixel in the vein pattern image. Part of the reason is there are far fewer key points representing vein crossings and endings in our vein images. This is because IR depth sensors produce lower resolution images and have less powerful IR emitters compared to the equipment used in the aforementioned papers. However, VeinDeep has access to a depth map, which is used to aid background removal and reduce perspective distortion.


2.3.1 Kernel and Hausdorff Distance

In this section we describe Kernel and Hausdorff distance in more detail. We com- pare VeinDeep against both Kernel and Hausdorff distance in Chapter 5.

Kernel distance has one parameter σ, which dictates the kernel width. The algorithm accepts as input 2 points s and t. s is a point belonging to reference image S and t is a point belonging to test image T. Each point has coordinates x, y, which indicate their position within their respective images. The kernel we use is the Gaussian function KF(s, t, σ) = exp(−‖s − t‖² / σ²), where ‖s − t‖ is the Euclidean distance between points s and t. Kernel distance penalises large distances between points in reference and test images. The exact definition of Kernel distance is the following formula.

\[
\sum_{s \in S} \sum_{s' \in S} KF(s, s', \sigma) \;+\; \sum_{t \in T} \sum_{t' \in T} KF(t, t', \sigma) \;-\; 2 \times \sum_{s \in S} \sum_{t \in T} KF(s, t, \sigma)
\tag{2.1}
\]

This formula has useful properties [32]. When S = T the score is 0. The score increases when S and T are dissimilar. The first 2 parts, KF(s, s′, σ) and KF(t, t′, σ), are known as self-similarity and the last part, KF(s, t, σ), is known as cross-similarity. The sum of the cross-similarity values is smaller when S is dissimilar to T and larger when S is similar to T.
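A direct, unoptimised sketch of Equation 2.1 follows, assuming each vein pattern is represented as an N × 2 array of (x, y) point coordinates. It is a naive O(N²) evaluation for illustration, not the implementation used in the thesis.

```python
import numpy as np

def gaussian_kernel_sum(A, B, sigma):
    """Sum of exp(-||a - b||^2 / sigma^2) over all pairs of rows (a, b)."""
    diff = A[:, None, :] - B[None, :, :]       # pairwise differences
    sq_dist = np.sum(diff**2, axis=-1)         # squared Euclidean distances
    return np.sum(np.exp(-sq_dist / sigma**2))

def kernel_distance(S, T, sigma):
    """Kernel distance of Equation 2.1: the self-similarity of S and T
    minus twice their cross-similarity. Identical point sets score 0."""
    return (gaussian_kernel_sum(S, S, sigma)
            + gaussian_kernel_sum(T, T, sigma)
            - 2 * gaussian_kernel_sum(S, T, sigma))

# Toy example with two small point sets.
S = np.array([[1.0, 2.0], [3.0, 4.0]])
T = np.array([[1.0, 2.0], [3.5, 4.5]])
print(kernel_distance(S, T, sigma=2.0))
```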

Hausdorff distance does not have configurable parameters. For every point s in reference pattern S we find the shortest Euclidean distance to some point t in test pattern T . This gives a set of distances equal in number to non-zero value pixels in S. We then find the maximum value in this set. The exact definition is the following formula.


\[
\max_{s \in S} \left( \min_{t \in T} \| s - t \| \right)
\tag{2.2}
\]
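Under the same point-set representation, Equation 2.2 can be sketched as below; again this is a naive evaluation included only for illustration.

```python
import numpy as np

def hausdorff_distance(S, T):
    """Directed Hausdorff distance of Equation 2.2: for every point in the
    reference pattern S, take the nearest point in the test pattern T, then
    return the largest of those nearest-neighbour distances."""
    diff = S[:, None, :] - T[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=-1))   # pairwise Euclidean distances
    return np.max(np.min(dist, axis=1))

S = np.array([[0.0, 0.0], [5.0, 5.0]])
T = np.array([[0.0, 1.0], [5.0, 4.0]])
print(hausdorff_distance(S, T))  # 1.0
```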

Chapter 3

QuickFind Fast Segmentation and Object Detection

In this chapter we propose QuickFind. It consists of 2 parts that address the motivations of this thesis. One of the motivations is to determine the kinds of applications which can take advantage of depth sensors. We present in Section 3.1 a prototype augmented reality object assembly aid as one such application. The other motivation is to develop algorithms to enable our application. To address this point, we present an accompanying segmentation and object detection algorithm in Section 3.2.

Object detection, when combined with devices such as smartphones and wearables, has enabled a plethora of pervasive computing applications. Examples include lifelogging [33], improving dietary habits [34] and improving driving safety [35], among many others. Since depth sensors can now be found in smartphones and wearables, the aforementioned applications benefit from improvements in object detection, enabled by a depth sensor.


We show these benefits in tests of the QuickFind algorithm in Section 3.3. In Chapter 1 we established that depth maps present an advantage by adding an extra dimension of data. In Section 3.3.3 we show that this advantage combined with our modifications to FF allow better success in separating objects from the background. In Sections 3.3.4, 3.3.5 and 3.3.7 we show that the use of segmentation and the novel features of QuickFind lead to better object detection. This is in comparison to the popular HOG and the state of the art HONV algorithms. QuickFind leads by having superior precision, recall, lower computation time, lower memory use and a lower asymptotic upper bound.

3.1 System

The prototype is an augmented reality object assembly aid. It is designed to aid in the assembly of common household products such as furniture, toys and electronics. Such products often require assembly after purchase. Printed instructions can be lost or difficult to follow. So we propose an augmented reality alternative. The system aims to identify a group of objects, determine if all expected components are present and then overlay helpful assembly instructions. Our proposal is expected to run as an app on depth sensor equipped smartphones, tablets or smart glasses. Such an app could be made available by the product manufacturer.

The prototype uses the algorithm described in Section 3.2 to detect the various components. Figure 3.1 illustrates the operation of such an application for assembling computer components. When all components are detected, green bounding boxes are overlaid. When components are missing, red bounding boxes are overlaid. While our prototype gives fairly simplistic feedback, a complete implementation could overlay animations to depict how components fit together. We envision that a full feature version would operate in a fashion similar to AR4CAD 1, an augmented reality system designed for industrial assembly.

Figure 3.1: The QuickFind prototype is an augmented reality object assembly aid. The green bounding boxes are overlayed when all expected objects are detected. A red bounding box is overlayed when some expected objects are missing.

3.2 Algorithm

The algorithm workflow of QuickFind is shown in Figure 3.2. QuickFind consists of the following components: segmentation and object detection. The aim of segmentation in QuickFind is to use a custom CCL to divide the depth map into regions that contain at most one valid object. Features are extracted from each segment and these are then passed on to a classifier to determine whether it is the target object. We define the variables and functions used in this section in Table 3.1 and begin with an overview of segmentation.

1. http://www.t3lab.it/en/progetti/ar4cad/

Figure 3.2: QuickFind algorithm workflow.

3.2.1 Overview of Segmentation

Segmentation aims to divide a depth map into regions such that each region contains at most one object. Figure 3.3 shows an output of segmentation where each region is marked with one colour. Our segmentation algorithm is similar to a CCL algorithm commonly known as FF, whose aim is to paint a connected region with the same colour. FF has been used in a number of segmentation algorithms covered in Chapter 2.1, e.g. Xia et al. [16], Firman et al. [4]. The key idea behind FF is to start with a region with one pixel and to use connectivity to grow the region. One of the difficulties of image segmentation of depth maps occurs when 2 objects are in close proximity. Both objects can appear as one. So instead of using connectivity alone, our segmentation algorithm uses a set of custom CCL rules to grow a region. The custom CCL rules will be described in Section 3.2.2. We first present an overview of the segmentation algorithm.


Variables/Functions: Description
n, m: A segment is divided into n by m blocks for feature computation.
p_{x,y,z}: A depth pixel located at position x, y with depth value z.
p_{1,z}: The first depth pixel added to a segment.
f_ndiff(p_{1,z}) = exp(n1 + n2 × p_{1,z}): Maximum allowed absolute difference between neighbouring pixels. n1, n2 determined using regression.
f_depth(p_{1,z}) = d1: Maximum segment depth. d1 determined using regression.
f_width(p_{1,z}) = w1 + w2/p_{1,z}: Maximum segment width. w1, w2 determined using regression.
f_height(p_{1,z}) = h1 + h2/p_{1,z}: Maximum segment height. h1, h2 determined using regression.
E: Set of depth pixels which form a segment.
p_{E,z}: A depth pixel which is a member of set E.
Q_i: The ith quartile of the values of a set of depth pixels.
width(), height(), depth(), count(): Functions which return the width, height and depth of a region formed by a set of depth pixels. count() is the number of non-zero depth pixels in a set of pixels.
B: A vector consisting of the median pixel values computed from blocks.
D, D_w, D_h: D is a depth map with dimensions D_w × D_h.

Table 3.1: Table of QuickFind variable and function definitions.


Figure 3.3: The output from segmentation using QuickFind. Clockwise from top left: Each coloured region represents a segment. The image on the top right represents one segment and is divided into blocks for feature computation, the black regions within the bounding box are padded with null/zero value pixels. Lastly below is the input depth map.


The pseudo code for finding a single segment is shown in Algorithm 1. Before describing the algorithm, it is instructive to point out that each pixel in a depth map is characterised by three values (x, y, z) where x and y are co-ordinates on a plane and z is the depth. A depth pixel has 8 neighbours on the x, y plane. Our segmentation algorithm works through the depth map as a breadth-first graph traversal algorithm. Each depth pixel is treated as a vertex. The algorithm iterates over the depth map in a left to right, top to bottom order. On encountering a depth pixel which does not belong to a segment, the code in Algorithm 1 is run. The aforementioned depth pixel which did not belong to a segment is marked and added to a new segment. All 8 neighbours on the x, y plane of the first depth pixel added are considered for inclusion into the segment if certain CCL rules are met. These rules limit the absolute difference between neighbouring pixels, as well as width, height and depth of the segment. The iteration continues and the previous steps are repeated until no more pixels are added. This process keeps the target object within a rectangular prism bounding box and makes for a fast and approximately accurate segmentation. Segmentation is repeated with different parameters for each object class.

3.2.2 Custom Connected Components Labelling (CCL) Rules

The aim of the custom CCL rules is to use prior knowledge on the size of the objects to limit the growth of a segment. We obtain these rules empirically. The first step is to take depth maps of each target object at different distances from the depth sensor. We then cropped these depth maps using as tight a bounding box as possible, see Figure 3.4. By using linear regression, we fit the width, height and depth of the objects against the median depth pixel value of each cropped region.

Let p denote a general pixel in the depth map and p is a vector with 3 elements.


Algorithm 1 Find a single segment.

function Single Segment(p1, D, f_ndiff, f_depth, f_width, f_height)
    ▷ p1: starting pixel. D: input depth map.
    ▷ f_ndiff, f_depth, f_width, f_height: rule functions.
    Q ← ()        ▷ Q holds the queue of pixels considered for inclusion in the segment.
    E ← ∅         ▷ E holds the set of pixels already added to the segment.
    Add p1 to Q
    while Q not empty do
        Remove a pixel px from the queue Q
        E ← E ∪ px
        for each existing 8-connected neighbour py to px do
            if py not in another segment and
               |p_{y,z} − p_{x,z}| ≤ f_ndiff(p_{1,z}) and
               width(py ∪ E) ≤ f_width(p_{1,z}) and
               height(py ∪ E) ≤ f_height(p_{1,z}) and
               depth(py ∪ E) ≤ f_depth(p_{1,z}) then
                Add py to Q
            end if
        end for
    end while
    return E
end function


Let p_z denote the depth component of p. We will denote the three functions from fitting as f_width(p_z), f_height(p_z) and f_depth(p_z). Note that the dependence on depth p_z is important to make the method scale invariant.

We introduce another function to limit the absolute extent of an object in the z-direction. Let Q_i be the ith quantile of the depth of the pixels in an object at a given distance from the sensor. For f_ndiff, we use regression to fit Q_75 − Q_25 against the median depth pixel of each cropped region. Quantiles are used to remove the effect of outliers due to imperfect cropping.
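As an illustration of how the rule-function coefficients could be obtained, the sketch below fits them with ordinary least squares. The measurement arrays are hypothetical placeholders and the use of NumPy's polyfit is an implementation assumption; the thesis does not specify the fitting tool.

```python
import numpy as np

# Hypothetical measurements from cropped depth maps of one object class:
# median depth (mm) of each crop, the crop's width in pixels, and the
# interquartile range of its depth values (Q75 - Q25).
median_depth = np.array([800.0, 1000.0, 1200.0, 1500.0, 2000.0])
crop_width   = np.array([180.0, 150.0, 120.0, 95.0, 70.0])
iqr_depth    = np.array([30.0, 38.0, 47.0, 60.0, 85.0])

# f_width(p) = w1 + w2 / p is linear in 1/p, so fit against 1/median_depth.
w2, w1 = np.polyfit(1.0 / median_depth, crop_width, 1)

# f_ndiff(p) = exp(n1 + n2 * p) becomes linear after taking logarithms.
n2, n1 = np.polyfit(median_depth, np.log(iqr_depth), 1)

def f_width(p):
    return w1 + w2 / p

def f_ndiff(p):
    return np.exp(n1 + n2 * p)

print(f_width(1100.0), f_ndiff(1100.0))
```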

Before stating the custom CCL rules, we need a few more definitions. Let E denote the set of depth pixels which will form a new segment. We use p (a 3-dimensional vector) to denote the pixel being considered for inclusion into E and p1 is first member added to E. Also, p1,z is the z-component (depth) of the pixel p1. The functions width(), height() and depth() return the dimensions of a segment formed by a set of depth pixels.

The following are the rules which should hold for every pixel p added to E:

Rule 1. p does not belong to another segment.

Rule 2. p must be an 8-connected neighbour of some element pE in E.

Rule 3.

|pz − pE,z| ≤ fndiff(p1,z) (3.1)

Rule 4.

width(p ∪ E) ≤ fwidth(p1,z) (3.2)

height(p ∪ E) ≤ fheight(p1,z) (3.3)

depth(p ∪ E) ≤ fdepth(p1,z) (3.4)


Figure 3.4: Rule functions used for segmentation are computed using regression from a series of cropped depth maps. The rules control the threshold for segment height, width, depth and maximum absolute difference between neighbouring depth pixels.

3.2.3 Features

After the segments have been computed, we extract features from them. The depth pixels in E are normalised. Assuming that the segment E has Dw × Dh pixels, we divide the segment into blocks of ⌊Dw/m⌋ pixels by ⌊Dh/n⌋ pixels, where m and n are the number of blocks in the horizontal and vertical directions respectively. Since Dw (resp. Dh) may not be an integral multiple of m (n), the m × n blocks will be extracted from the top-left hand corner of E. If Dw < m or Dh < n, then the segment will be discarded. An example is shown in Figure 3.3. The black regions are gaps in the geometry padded with null/zeros in the example. The median pixel value of each block is used as an attribute of the feature vector. The median value is chosen so a few outliers will not skew the value derived from the block. Dividing a region of interest into blocks is common in existing literature. However, computing the median value of each block in a depth map segment does not seem to have been done before.
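A sketch of the block-median feature computation described above, assuming the segment has been copied into a dense Dw × Dh patch with gaps padded by zeros. Function and variable names are illustrative, not from the QuickFind source.

    #include <algorithm>
    #include <vector>

    // Compute the m x n block-median attributes of a segment patch.
    // patch is a dense Dw x Dh crop of the segment, row-major, with gaps
    // padded by zeros. Returns an empty vector if the patch is too small.
    std::vector<float> blockMedians(const std::vector<float>& patch,
                                    int Dw, int Dh, int m, int n) {
        if (Dw < m || Dh < n) return {};            // segment discarded
        const int bw = Dw / m, bh = Dh / n;         // block width and height
        std::vector<float> features;
        features.reserve(m * n);
        for (int by = 0; by < n; ++by)
            for (int bx = 0; bx < m; ++bx) {
                std::vector<float> vals;
                for (int y = by * bh; y < (by + 1) * bh; ++y)
                    for (int x = bx * bw; x < (bx + 1) * bw; ++x) {
                        float v = patch[y * Dw + x];
                        if (v != 0) vals.push_back(v);   // skip null/zero pixels
                    }
                if (vals.empty()) {                      // all-zero block
                    features.push_back(0.0f);
                    continue;
                }
                // nth_element gives an O(|block|) median selection.
                auto mid = vals.begin() + vals.size() / 2;
                std::nth_element(vals.begin(), mid, vals.end());
                features.push_back(*mid);
            }
        return features;
    }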

Depth sensors can produce errors, so segments consisting entirely of null/zero values are rejected. If a block consists entirely of null/zero value pixels, it is assigned a feature value of zero and no further computations are performed on it, to save overhead.

The other four attributes of the feature vector are count(E), width(E), height(E) and depth(E). count(E) is the number of non-zero depth pixels in E. These four values encode into the feature vector the dimensions of the target object. Let B be the vector of n × m values representing the median pixel value of each block. The final feature vector is of length 4 + n × m and consists of the following.

[count(E), width(E), height(E), depth(E), B] (3.5)

The features are used to train a Linear Support Vector Machine (SVM) classifier.

3.3 Experiments

This section describes some experiments performed using QuickFind. Section 3.3.1 gives an overview of the test data. Section 3.3.2 describes the test setup. QuickFind was tested for segmentation performance against FF in Section 3.3.3. QuickFind was tested against HOG and HONV for detection performance in Section 3.3.4, and for speed, power and memory consumption in Section 3.3.5. The effect of changing parameters is described in Section 3.3.6 and the complexity is derived in Section 3.3.7.

3.3.1 Test Data and Parameters

The depth maps from the RGB-D Scenes 2 dataset were used for testing. It contained scenes of everyday objects in domestic and office environments. There were 6 object classes in 1434 scenes. Scenes can contain all or none of the 6 object classes. The data was recorded at a resolution of 640 × 480 using a sensor with the same specifications as the Kinect V1.

2https://rgbd-dataset.cs.washington.edu/dataset/rgbd-scenes/


Figure 3.5: An example scene from the object detection dataset. Clockwise from top left: Bitmap, depth map, ground truth mask, the substitute objects used for parameter estimation.

The algorithm parameters are shown in Tables 3.2, 3.3 and 3.4. We used these values throughout testing unless otherwise stated. The parameters were computed using regression, by taking depth maps of the target objects from 600 mm to 2000 mm in 100 mm intervals.

The ground truth originally provided by the dataset was corrected for errors by manually creating a set of object masks for every instance of the tested objects in every scene. All subsequent comparisons were performed with the corrected ground truth. An example of the mask and scene data is shown in Figure 3.5.


Parameter            Bowl       Cap        Cereal Box  Coffee Mug  Flashlight  Soda Can
Blocks n×m           5 × 5      5 × 5      5 × 5       5 × 5       5 × 5       5 × 5
n1                   0.798772   0.527755   0.447823    0.795451    0.933697    0.555157
n2                   0.001319   0.001513   0.001513    0.001254    0.001281    0.001458
d1                   365        170        188         115         70          60
w1                   -17.56     24.52      6.944       -2.411      16.5        -0.76
w2                   112878.34  98911.16   130400      62971.275   79830.9     39715.19
h1                   -12.48     -13.4      5.934       -3.292      -9.626      1.386
h2                   71631.75   90275.4    147400      64704.739   45950.747   74984.426
Normalise between    1 - 1000   1 - 1000   1 - 1000    1 - 1000    1 - 1000    1 - 1000

Table 3.2: QuickFind parameters used in testing.

Shared HOG and HONV Parameters    Values
nlevels                           20
scale0                            1.05
threshold L2hys                   0.2
group threshold                   2
Scaling type                      INTER NEAREST
cell size                         8 × 8
block stride                      8 × 8

Parameter        HOG Values    HONV Values
block size       16 × 16       8 × 8
nbins            20            36 × 36

Window Sizes     Values
Bowl             56 × 32
Cap              80 × 72
Cereal Box       80 × 112
Coffee Mug       64 × 72
Flashlight       64 × 32
Soda Can         32 × 64

Table 3.3: HOG and HONV parameters used in testing. The parameter names come from OpenCV.


SVM Parameters    Values
svm type          C SVC
term crit         1000
Cvalue            0.01
kernel type       LINEAR

FF Parameter      Value
Diff              50

Table 3.4: SVM and FF parameters used in testing. SVM parameter names come from OpenCV.

          PC                   Raspberry Pi B
CPU       2.3 GHz i7 2820QM    700 MHz ARM
Cores     4                    1
RAM       8 GiB                512 MiB
OS        amd64 Debian         Raspbian Debian

Table 3.5: QuickFind test platform specifications.

3.3.2 Hardware and Software

The specifications of our test platforms are shown in Table 3.5. The tests were conducted on the Raspberry Pi and the PC. The Raspberry Pi was used to simulate the computing capabilities of resource limited embedded devices. Our algorithms are coded in C++, compiled using GCC 4.9 and linked against the OpenCV 2.4.9 library. The tests used the OpenCV versions of HOG and SVM. Our HONV implementation is a modification of the HOG source code.

Our prototype assembly aid implementation is deployed on the Raspberry Pi. Depth maps are captured by a Kinect V1 which is connected via a USB cable. Our prototype interfaces with the Kinect using the Freenect 0.2 library.


3.3.3 Segmentation Accuracy

This section compares the segmentation accuracy of QuickFind against that of a depth map based FF. This is the same FF used by several algorithms mentioned in Section 2.1, e.g. [16][4]. The segmentation was performed on a PC. The segmentation parameter values were determined using the regression method in Section 3.2.2. Since the RGB-D Scenes dataset was produced by another research group, we did not have the original objects to obtain depth maps at different distances for the regression needed by our algorithm. We therefore used several substitute objects shown in Figure 3.5. These objects had a similar shape to those recorded in the test data.

We first define the criteria for successful segmentation. Let S be the number of pixels in a segment, T the number of pixels in the ground truth mask (an example of a ground truth object mask is in Figure 3.5) and O the number of pixels common to the segment and the ground truth mask. A standard overlap metric [36] which ensures only one dominant segment per object instance is O/(S + T − O). An instance of an object is considered successfully segmented if this metric has a value above 0.5 and the segment does not consist of only null/zero values.
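The overlap criterion can be computed directly from pixel counts; the sketch below assumes the segment and ground truth mask are given as per-pixel flags of equal length (a hypothetical representation).

    #include <cstddef>
    #include <vector>

    // Standard overlap metric O / (S + T - O) between a segment and a ground
    // truth mask, both given as per-pixel flags of the same length.
    // A value above 0.5 counts as a successful segmentation.
    double overlapScore(const std::vector<bool>& segment,
                        const std::vector<bool>& truth) {
        std::size_t S = 0, T = 0, O = 0;
        for (std::size_t i = 0; i < segment.size(); ++i) {
            if (segment[i]) ++S;
            if (truth[i]) ++T;
            if (segment[i] && truth[i]) ++O;
        }
        if (S + T - O == 0) return 0.0;   // guard against an empty comparison
        return static_cast<double>(O) / static_cast<double>(S + T - O);
    }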

Table 3.6 shows the number of successfully segmented objects for QuickFind and FF. We find that QuickFind outperforms FF by a large margin. This is because FF and a number of the CCL-based algorithms mentioned in Section 2.1 require sensor fusion to perform well. We can explain this by the following example. Figure 3.5 shows a colour picture with a cereal box standing on a white table top. The boundary between the two objects is clearly visible in the colour picture, but such boundary information is lost in the depth map. That is why a colour picture is helpful in getting segmentation right with a depth map. For QuickFind, we instead use prior knowledge of the object size to assist segmentation.


Object Class    Total Instances    QuickFind    FF CCL
Bowl            992                771          211
Cap             815                691          23
Cereal Box      1093               1046         89
Coffee Mug      765                630          32
Flashlight      894                301          7
Soda Can        1147               997          65

Table 3.6: Number of object instances successfully segmented by FF and QuickFind.

Of all six object classes, flashlight has the worst segmentation accuracy. This was because the sensor had difficulty with the small non-planar surfaces of the flashlight. These regions were often filled with erroneous null/zero values. Similarly, segmentation errors for the other object classes occurred when objects lay outside the effective operating range of the sensor.

3.3.4 Detection Accuracy

This section compares the detection accuracy of QuickFind, HONV and HOG. The results were computed using 10-fold cross validation on the PC. For HOG and HONV, the depth maps were pre-processed with a 5 × 5 median filter to fill in gaps caused by sensor errors, as gaps cause HOG and HONV to malfunction. The filter overhead was excluded from the speed tests.

We first define the meaning of positive and negative samples for the algorithms. For QuickFind, a positive sample was a non-zero segment with the previously defined overlap metric: O/(S + T − O) > 0.5. One negative sample was chosen from each scene that did not overlap with the ground truth mask. A true positive occurred when a segment met the overlap metric and was classified as the target object.

For a positive sample under HOG and HONV, the bounding box surrounding the ground truth mask is used. The width of the box is increased by 5% on each side.


Likewise for the height. This region is used as a positive sample. This is done to ensure the object edges are included in the feature vector, which improves detection accuracy. One negative sample was chosen from each scene that did not overlap with the ground truth. A true positive occurred when a window was classified as containing the target object with the standard overlap metric [36] O/(W + B − O) > 0.5, where W is the number of depth pixels in the window, B the number of depth pixels in the ground truth bounding box and O the number of overlapping depth pixels.

Figure 3.6 plots the precision-recall rates for the six classes of objects for QuickFind, HONV and HOG. We see that QuickFind is significantly more accurate for all object classes except flashlight. The curves were generated by changing the SVM maximum allowed distance to the hyperplane. Increasing this value reduced false positives at the expense of increased false negatives, and vice-versa. The poor performance on flashlight under QuickFind was because noisy data made segmentation difficult. The noisy data similarly affected HOG and HONV.

3.3.5 Speed, Memory and Power Consumption

Figure 3.7 compares the speed and memory consumption on the PC and the power consumption on the Raspberry Pi for QuickFind, HOG and HONV. Note that on the PC, HOG and HONV ran in multi-threaded mode; all other tests were entirely single threaded. Although the mix of single-threaded and multi-threaded programs may make speed comparison difficult, we can safely conclude from Figure 3.7 that QuickFind was faster and more memory- and power-efficient than HOG and HONV.

On our embedded platform QuickFind processed a depth map using an average of 685 ms, 24 MiB of memory and 2.055 Joules of energy. HOG used an average of 12406 ms, 81 MiB and 37.218 Joules. HONV used an average of 191425 ms, 360 MiB and 574.275 Joules. The shorter computation time of QuickFind contributed to its lower power consumption. Note that energy consumption was estimated by multiplying the load amperage by the 5 V of the USB power source and the time spent per scene. The speed, memory and power improvements of QuickFind make it more suitable for embedded platforms than the existing algorithms. The speed results follow from algorithm complexity, which is discussed in Section 3.3.7.
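As a check on the reported figures, the energy estimate described above can be inverted to recover the implied average current; the 0.6 A value below is not stated in the text, only implied by the numbers.

\[
E = I \times 5\,\mathrm{V} \times t,
\qquad
\frac{2.055\,\mathrm{J}}{0.685\,\mathrm{s}} \approx 3.0\,\mathrm{W}
\;\Rightarrow\;
I \approx \frac{3.0\,\mathrm{W}}{5\,\mathrm{V}} = 0.6\,\mathrm{A}.
\]

The HOG and HONV figures (37.218 J over 12.406 s and 574.275 J over 191.425 s) imply the same roughly 3 W average draw, so the energy differences track the run time differences almost exactly.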


Figure 3.6: Precision and recall rates for object detection using QuickFind, HOG and HONV.


Figure 3.7: Speed, memory and power consumption of QuickFind, HOG and HONV.

3.3.6 Effect of Changing Parameters

Two key sets of parameters for QuickFind are the segment size and the block size. Altering them had some effect on speed and detection rates. The tests were conducted on the PC using the parameters for the bowl object class.


Figure 3.8: QuickFind detection rate improves with an increasing number of blocks up to 20 × 20. Run time changes little as the number of blocks grows.

Figure 3.9: Changing the segmentation parameters has little effect on QuickFind speed. Run time increases linearly as the input size is changed.

Impact of the Number of Blocks

Varying the number of blocks changed the object detection rate and speed. The results are shown in Figure 3.8. Dividing a segment into more divisions along both axes results in more blocks. This made the feature vector longer and more descriptive, which increased precision but reduced recall. At the upper extreme of 50 × 50 blocks, recall noticeably drops: many segments containing the target object were smaller than 50 × 50 pixels and were therefore rejected. Increasing the number of blocks did not significantly increase computation time. The SVM probability threshold was set to 0.5 to ensure balanced precision and recall values.


Segment and Image Size VS Speed

Segment size did not change run time much, but run time scales linearly with input size. The results are shown in Figure 3.9. Segment size limits were increased by multiplying the maximum allowed height, width and depth by a scaling value. Speed does not change noticeably because the algorithm has complexity linear in the number of input depth pixels regardless of the segmentation parameters. We will discuss this in Section 3.3.7. The algorithm scales linearly as the size of the input increases. The input size was increased by duplicating the scene and stitching the copies together one above another.

3.3.7 Complexity Analysis

QuickFind was the fastest at object detection compared to HOG and HONV. The difference can be explained by the asymptotic time complexity. The upper bound of QuickFind is linear. In comparison, only the lower bound of HOG and HONV is linear. The complexity derivations of QuickFind, HOG and HONV are in the following sections.

HOG and HONV Asymptotic Lower Bound

First we begin with HOG and HONV. The image pyramid is defined by 2 parameters: the scale s and the number of layers q. Let Dw × Dh be the dimensions of the depth map. The following is the asymptotic lower bound.

o((Dw × Dh) × (1 − 1/s^(2(q+1))) / (1 − 1/s^2)) (3.6)

HONV has higher overhead than HOG but they share the same complexity. The upper bound is hard to derive because there is no closed form solution. The lower bound occurs when windows, and blocks within windows, do not overlap, so features from each pixel are processed exactly once. Under the best case scenario when q = 0, the complexity of HOG and HONV is also linear. q is normally greater than 0, which scales up the complexity. Scaling accounts for objects appearing larger when closer; q = 0 assumes the target object always appears the same size in the test data, which is unlikely with real world data. The asymptotic lower bound is derived in 3 parts as follows.

Pre-processing Image Pyramid: The total number of pixels generated is Dw×Dh/s^(2×0) + Dw×Dh/s^(2×1) + ... + Dw×Dh/s^(2×q). This value can be simplified as a geometric sum. If it takes constant overhead per pixel, this is O((Dw × Dh) × (1 − 1/s^(2(q+1))) / (1 − 1/s^2)).
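For completeness, the geometric sum expands as follows (the ratio 1/s^2 is below 1 since s > 1):

\[
\sum_{i=0}^{q} \frac{D_w D_h}{s^{2i}}
  = D_w D_h \sum_{i=0}^{q} \left(\frac{1}{s^2}\right)^{i}
  = D_w D_h \cdot \frac{1 - 1/s^{2(q+1)}}{1 - 1/s^{2}}.
\]

Taking, for illustration, s = 1.05 and q = 20 (the scale0 and nlevels values in Table 3.3), the multiplier is roughly 9.4, so the pyramid holds about an order of magnitude more pixels than the original depth map. Since the multiplier does not depend on Dw or Dh, the stage remains linear in the input size.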

Features: When windows or blocks overlap, values derived from each pixel must be normalised once per overlap. The best case occurs when there is no overlap. This is o((Dw × Dh) × (1 − 1/s^(2(q+1))) / (1 − 1/s^2)).

Detection: Detection uses linear SVM. The feature vector can be saved as a skip list to skip attributes with value of zero. With no overlap there are equal number of attributes as pixels. There is no guarantee that all these attributes are non-zero so it is possible to get o(1). This does not change the overall lower bound since computing features has a greater lower bound.

QuickFind Asymptotic Upper Bound

The QuickFind asymptotic upper bound is derived in 3 stages. Each stage is linear, allowing the whole algorithm to be linear in complexity. The following equation is the final total complexity, followed by the derivation.

O(Dw × Dh) (3.7)


Pre-processing Segmentation: CCL is O(|V| + |E|) given a graph G(V, E) [10]. The depth map is treated as a king's graph [37], with |V| = Dw × Dh and |E| = 4 × Dw × Dh − 3 × (Dw + Dh) + 2. |E| is derived from the number of edges in a grid graph [38], which is 2 × Dw × Dh − Dw − Dh. A king's graph has all the edges of a grid graph in addition to (Dw − 1) × (Dh − 1) × 2 edges which form the diagonals of each grid cell. Algorithm 1 lines 11 - 15 are the custom additions, which add O(1) per iteration. So this is still O(Dw × Dh).
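Combining the two edge counts cited above (grid edges plus diagonal edges) gives the expression used for |E|:

\[
|E| = (2 D_w D_h - D_w - D_h) + 2 (D_w - 1)(D_h - 1)
    = 4 D_w D_h - 3 (D_w + D_h) + 2.
\]

Both |V| and |E| are therefore O(Dw × Dh), which is what makes the traversal linear.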

Features: Let bi,j be the set of non-zero value depth pixels belonging to the ith block of the jth segment. Finding the median value of bi,j with k zeros padded on involves selecting the |bi,j|/2 − k smallest number. This is an O(|bi,j|) selection problem [39]. The other features are kept track of during segmentation. Since zero value blocks are never initialised and Σi,j |bi,j| = Dw × Dh, this is O(Dw × Dh).

Detection: Detection uses linear SVM. The feature vector can be saved as a skip list to skip attributes with value of zero. Total number of non-zero attributes derived from blocks cannot exceed Dw × Dh. There are 4 additional non-zero attributes per segment to describe segment dimensions. So at most there are 5×Dw ×Dh non-zero attributes. The testing time of linear SVM is linear to the number of attributes [40].

This is O(Dw × Dh).

3.4 Summary

The goal of this chapter was to address the motivations of this thesis in 2 parts. First we presented an augmented reality object assembly aid to demonstrate the usefulness of depth sensors. Then we described the accompanying segmentation and object detection algorithm that enables this prototype. We have named this entire project QuickFind.


For segmentation accuracy we tested QuickFind against FF. Using exclusively depth data, QuickFind achieved over double the segmentation success rate of FF. One challenge is that when 2 objects are in close proximity, they appear as one. We use our novel segmentation rules to limit the segment size and avoid this problem.

For object detection we tested QuickFind against HOG and the state-of-the-art HONV. QuickFind achieved almost double the average precision of HOG and HONV. We achieved better results by using novel features and by removing the background with segmentation. QuickFind also has a lower asymptotic upper bound than HOG and HONV. When ported to the Raspberry Pi, QuickFind uses 1/18 the run time, 1/18 the power and 1/3 the memory of HOG, and 1/279 the run time, 1/279 the power and 1/15 the memory of HONV.

Chapter 4

WashInDepth Fast Hand Gesture Recognition

In this chapter we propose WashInDepth. Again it consists of 2 parts to address the motivations of this thesis. We show the usefulness of depth sensors by developing a system for monitoring correct hand washing, and we use a custom gesture recognition algorithm.

Maintaining hand hygiene is critical in industries such as health services and food preparation. In these industries, personnel are often trained and required to wash their hands before and after performing their duties. The United States Centers for Disease Control and Prevention (CDC) suggests 1 the use of the World Health Organisation (WHO) guidelines [41] for hand hygiene. These guidelines recommend a series of steps to perform a correct hand wash. The steps include a range of hand gestures for correct lathering of soap.

Several systems have been proposed for both training and monitoring compliance

1http://www.cdc.gov/handhygiene/Guidelines.html

with each step of the guidelines. Sensors have been embedded in soap dispensers [42] and faucets [43] in order to keep track of people who have applied soap and rinsed their hands. Neither of these existing systems is able to monitor correct lathering of soap. Currently, soap lathering can only be monitored using wrist worn devices [44][45]. These systems work by detecting the gestures of the hand when applying soap or hand sanitiser. Hand gestures are detected by classifying gestures based on accelerometer and orientation sensor data. There are flaws with using wristbands: they require recharging, are prone to collecting contaminants from soiled hands [46], and the user may forget to wear or may misplace the device.

Depth sensors provide an alternative way of monitoring hand gestures with several advantages. We avoid all the aforementioned problems of wrist worn devices because monitoring can be performed contactlessly using a fixed piece of infrastructure. Depth sensors also provide an extra dimension of data to distinguish gestures with similar appearance. They are not influenced by changing lighting conditions and allow easy separation of foreground and background [47]. The last 3 points are advantages compared to conventional cameras. A related commercial system, SureWash 2, uses a conventional camera. It is relegated to training kiosks in hospitals [48] because it cannot operate without calibrated lighting and a fixed background.

In Section 4.1 we present the WashInDepth system. In Section 4.2 we present the WashInDepth algorithm. In Section 4.3 we show the benefits of our algorithm by testing it against HOG and HONV. Again we achieve better gesture recognition accuracy and faster performance. Section 4.4 ends the chapter with the summary.

2http://www.surewash.com/


4.1 System

WashInDepth uses a depth sensor placed above the wash basin. An illustration of the intended setup is shown in Figure 4.1 and the test environment is shown in Figure 4.4. The sensor is placed so that the field of view covers the wash basin and maximises the resolution of the region where the hands are likely to be placed.

Our goal is to determine if a person has correctly lathered both their hands with soap, as per the sequence of 9 gestures depicted in Figure 4.1. As per the WHO guidelines, total lathering time is expected to take a minimum of 15 seconds. The condition set in the system is for each gesture to be observed for at least 15/9 seconds. This time frame is sufficient to go through the motions of a single gesture, e.g. when performing a palm to palm rub in a circular motion, 15/9 seconds is sufficient for a single circular motion. In our experiments (outlined in Section 4.3) we observed that all participants were able to complete each gesture at least once within 15/9 seconds. Once a participant has finished washing their hands, it is possible to determine the gesture performed in each recorded frame. Once all soap lathering gestures are detected in the correct order and of sufficient duration, the soap is considered to have been correctly lathered. The system does not detect the use of the soap dispenser or faucet, as these are already handled by existing systems.

A wireless trigger is used to start and stop the recording. It is implemented as a simple Wi-Fi scanner. The Wi-Fi chip on the test systems was made to scan for particular SSIDs and to activate the sensor if the signal strength exceeded a threshold. The recording stops when the signal falls beneath the threshold. The SSID would be provided by a smartphone carried by the user, acting as an identifier. Other potential triggers are RFID cards, motion sensors or cameras.


Figure 4.1: WashInDepth detects if a subject has correctly lathered their hands with soap. A correct hand wash involves lathering by performing the 9 gestures in steps 1 - 9, in the order depicted. Detection is triggered using a phone and performed using a depth sensor mounted above the wash basin.

4.2 Algorithm

The algorithm workflow is depicted in Figure 4.2. We chose to adapt features from QuickFind, the depth based object detection algorithm from Chapter 3. Every other related algorithm from Section 2.2 had one of the following problems: they were not able to track two strongly interacting hands, were too computationally costly for real-time operation on a low power embedded device, or assumed the action has a fixed duration. In contrast, QuickFind features have none of these problems and are quick to compute and easy to implement.

4.2.1 Background Removal

During background removal the goal is to remove the pixel values which do not represent the subject's hands. This is achieved using an image filter and cropping the remaining white space. This works because the sensor is in a fixed position and the scene background remains static. The background removal method is not affected by variability in hand positioning, so the depth pixels representing the hands can be extracted.


Figure 4.2: WashInDepth algorithm workflow. Gestures are detected by computing features from each frame as depicted. The queue of observed and expected gestures are compared to determine if the correct wash gestures have been performed.


Firstly, definitions of background removal parameters and the way parameter values are obtained are necessary before further explanation. Figure 4.3 also illustrates how some of these parameter values are obtained.

• p(x, y) is the depth value at position x, y on the filtered depth map after the background has been removed. Ideally, p(x, y) is non-zero when the pixel comes from the hands.

• d(x, y) is the raw depth pixel value taken from the sensor at position x, y.

• z(x, y) is a depth map representing the background. This background can be obtained by taking a depth map of the basin area when no one is using the basin. In order to reduce the noise in the background, we apply a 5×5 median filter to the raw background depth map to obtain z(x, y).

• Z is used with z(x, y) to compensate for slight changes in pixel value of the same point in succeeding frames. Z is the smallest value which overcomes sensor noise.

• x1, x2, y1, y2 are constants which represent the coordinates of the region of interest which bounds the wash basin edges. See Figure 4.3 for illustration.

The filter works by computing for each input pixel d(x, y), a p(x, y) value in the following manner.

p(x, y) = d(x, y) if d(x, y) < z(x, y) − Z and x1 < x ≤ x2 and y1 < y ≤ y2; otherwise p(x, y) = 0 (4.1)


The above filtering operation essentially says that a pixel is included in the foreground if it is at least Z units above the background and is within the area of the wash basin. After p(x, y) has been computed, the next step is to determine a bounding box to enclose the area covered by the hands. This is done by removing rows and columns of pixels that are almost entirely zero. Note that a bounding box is computed for each frame, which means that different frames may have different bounding boxes. We will refer to the part of the depth map within the bounding box as a segment.

Some definitions are necessary before further explanation.

• r, c are row and column removal threshold values used to crop the region of interest. They can be chosen by manually cropping the region of interest and finding the percentage of pixels along the edges which have non-zero values.

• row(i) and col(j) represent the set of pixels of row i and column j.

• imin and imax represent the indices of the top-most and bottom-most rows that are retained after white space removal.

• jmin and jmax represent the indices of the left-most and right-most columns that are retained after white space removal.

• nonzero() is a function that takes a set of pixels and returns the subset of non-zero value pixels.

A bounding box is computed by cropping the region outside of rows imin, imax and columns jmin, jmax. These values are found in the following manner.

imin, imax such that |nonzero(row(i))| / |row(i)| × 100 > r (4.2)

jmin, jmax such that |nonzero(col(j))| / |col(j)| × 100 > c (4.3)
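A compact sketch of Equation 4.1 together with the row and column tests of Equations 4.2 and 4.3, assuming row-major W × H depth maps; the function and struct names are illustrative, not from the WashInDepth source.

    #include <algorithm>
    #include <vector>

    struct Box { int imin, imax, jmin, jmax; };

    // Apply the background filter of Equation 4.1 and report the bounding box
    // found with the row/column thresholds r and c (Equations 4.2 and 4.3).
    // d and z are row-major W x H depth maps; the result p is the filtered map.
    Box removeBackground(const std::vector<int>& d, const std::vector<int>& z,
                         std::vector<int>& p, int W, int H,
                         int Z, int x1, int x2, int y1, int y2,
                         double r, double c) {
        p.assign(W * H, 0);
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                int i = y * W + x;
                if (d[i] < z[i] - Z && x1 < x && x <= x2 && y1 < y && y <= y2)
                    p[i] = d[i];                      // foreground (hand) pixel
            }
        // Keep rows/columns whose percentage of non-zero pixels exceeds r / c.
        Box b{H, -1, W, -1};
        for (int y = 0; y < H; ++y) {
            int nz = 0;
            for (int x = 0; x < W; ++x) if (p[y * W + x] != 0) ++nz;
            if (100.0 * nz / W > r) { b.imin = std::min(b.imin, y); b.imax = std::max(b.imax, y); }
        }
        for (int x = 0; x < W; ++x) {
            int nz = 0;
            for (int y = 0; y < H; ++y) if (p[y * W + x] != 0) ++nz;
            if (100.0 * nz / H > c) { b.jmin = std::min(b.jmin, x); b.jmax = std::max(b.jmax, x); }
        }
        return b;                                     // the segment is p within b
    }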


Figure 4.3: WashInDepth parameter choice. The image shows x1, x2, y1 and y2 are chosen so they bound the basin edges. This represents the region of interest. r, c and o are chosen based on the proportion of zero value pixels represented by the black regions.

4.2.2 Features

In addition to the QuickFind features, several other features are computed: the frame number, the segment width and height, and the mean value of all non-zero value pixels in the segment. The frame number encodes temporal information into the feature vector to improve detection accuracy, e.g. gesture 1 in Figure 4.1 is likely to appear at the beginning of the sequence, so any detection at the end is likely erroneous. The mean value of the non-zero pixels of the segment accounts for distance variations, as it approximates the distance between the hands and the sensor. Given the distance, it is possible to distinguish some gestures from just the segment width and height, e.g. a segment which represents gesture 5 in Figure 4.1 is noticeably wider than all other gestures.

For the gestures which cannot be distinguished from the distance and dimensions alone, we use the QuickFind features. Each segment comprises pixels representing both hands, so the pixel values in each segment describe the shape of both hands. It is not possible to use the raw pixel values in the classifier, because the number of pixels which form the segment differs from frame to frame and there are simply too many pixels. So these values are condensed by dividing the segment into blocks and computing the mean value in each block; this is the QuickFind feature. This procedure preserves most of the shape information and ensures the same number of features is computed each frame.

Some definitions are necessary before further explanation.

• St is the segment computed from frame number t, with dimensions w × h pixels.

• a, b are used for normalisation; all pixel values are scaled between a and b. Any values of a, b can be chosen so long as 0 < a < b and there is no loss of precision after normalisation.

• mean() computes the mean value of a set of pixels.

• e is the mean value of non-zero pixels from St. e = mean(nonzero(St)).

• m and n are constants such that St will be divided into m by n blocks. Good values for m, n are determined empirically.

• b(k, l) represent the pixels of a block at position k, l.

• f(k, l) represent the feature computed from b(k, l).

• o is the threshold for the percentage of non-zero value pixels allowed in b(k, l). It ensures values computed from blocks contribute meaningful data to classification. A safe value to minimise noise contributing to the feature vector is o = 50.

Pixel values in St are normalised by scaling between a and b. St is divided into m × n blocks. Blocks are rectangular regions of width ⌊w/m⌋ pixels and height ⌊h/n⌋ pixels. We have chosen w, h to be divisible by m, n respectively. If w, h are not divisible by m, n, the remainder pixels are discarded. An example segment is depicted in Figure 4.2. The computed feature at b(k, l) is f(k, l), which is defined in the following equation.

 |nonzero(b(k,l))| mean(b(k, l)) if × 100 > o f(k, l) = |b(k,l)| (4.4) 0

The collection of f(k, l) for every block comprises of the following values.

F = [f(k, l)|k = {1, 2, . . . , n}, l = {1, 2, . . . , m}] (4.5)

A zero vector is returned if w < m or h < n. Otherwise the final feature vector comprises the following values.

[w (segment width), h (segment height), e (mean value of non-zero pixels in the frame), t (frame number), F (mean value of pixels in each block)] (4.6)
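A minimal sketch of Equation 4.4 for a single block, assuming the block's pixels are handed over as a flat array; the helper name is hypothetical.

    #include <cstddef>
    #include <vector>

    // Feature for one block b(k, l) as in Equation 4.4: the mean of the block,
    // or 0 when at most o percent of its pixels are non-zero.
    float blockFeature(const std::vector<float>& block, double o) {
        if (block.empty()) return 0.0f;
        double sum = 0;
        std::size_t nonzero = 0;
        for (float v : block) {
            sum += v;
            if (v != 0.0f) ++nonzero;
        }
        if (100.0 * nonzero / block.size() <= o) return 0.0f;
        return static_cast<float>(sum / block.size());
    }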

4.2.3 Classification

Each feature vector is sent to a decision tree and labelled with a gesture or a null value to represent unrecognised action. The null value is also assigned on input of a zero vector. The decision tree is used as the classifier because it is good for multiclass classification. Our prototype used Weka to generate the decision tree. Then this tree was converted into a series of SAT problems.


4.2.4 Smoothing

The system improves classification accuracy by smoothing the classification results. This is done by taking the classification results over a window of length g. The most frequently occurring gesture class is assigned to the entire sequence. When two or more classes occur equally often, the choice between them is random. This helps reduce errors caused by intermittent misclassification. Reasons for misclassification and values of g which reduce misclassification are discussed in Section 4.3.

4.2.5 Check Gesture Sequence

The sequence of gestures is checked for the correct order and duration. The wash guidelines mandate that the 9 gestures last at least a total of 15 seconds. If q is the sensor frame rate, the number of required observations for each gesture is at least ⌈15 × q / 9⌉. The order is checked by placing the observed gestures in one queue and the sequence of expected gestures in another queue. The heads of both queues are removed if they match; otherwise only the head of the observed queue is removed. This is repeated until either queue is empty. If the expected gesture queue is empty, then the correct sequence has been observed. Otherwise there were insufficient observations of a gesture or they were performed in the wrong order.
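The queue comparison translates almost line for line into code; the sketch below assumes both queues hold gesture class ids and that the expected queue already repeats each gesture ⌈15 × q / 9⌉ times in the mandated order. Illustrative only.

    #include <queue>

    // Check the observed gesture sequence against the expected one.
    // Returns true when the expected queue is exhausted, i.e. every required
    // observation was seen in order.
    bool checkSequence(std::queue<int> observed, std::queue<int> expected) {
        while (!observed.empty() && !expected.empty()) {
            if (observed.front() == expected.front())
                expected.pop();                 // matched: advance both queues
            observed.pop();                     // always consume an observation
        }
        return expected.empty();
    }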

4.2.6 Algorithm Complexity

WashInDepth features can be computed for each image in linear time. We know from Section 3.3.7 that QuickFind has a linear asymptotic upper bound. WashInDepth replaces the segmentation stage in QuickFind with background removal. During background removal, Formula 4.1 is applied to each pixel. Formula 4.1 involves 6 arithmetic operations. If we assume arithmetic operations take constant time, then WashInDepth is linear in the number of input pixels.

The decision tree classifier has a tree height at most equal to the feature length. Since the maximum feature length is 4 + the number of input pixels, a leaf node can be reached in linear time.

The smoothing stage finds the most commonly occurring gesture class in a sequence of g values. If there are n classes, the search takes O(g + n).

4.3 Experiments

This section describes some experiments performed using WashInDepth. Section 4.3.1 gives an overview of the test data. Section 4.3.2 describes the test setup. WashInDepth was tested for gesture recognition accuracy, speed, memory consumption and sensitivity to parameter changes; we present these results in Section 4.3.4. We then ported HOG and HONV to compare against WashInDepth; those results are presented in Section 4.3.5.

4.3.1 Data Collection

Data was recorded from a total of 15 participants, who were recruited from among graduate and undergraduate students. The participants included 12 males and 3 females, all aged between 20 and 35 years. The recordings were conducted in a university common room. The test environment and its dimensions are shown in Figure 4.4. The total instances of each recorded gesture are shown in Table 4.1.

The data recorded for each iteration consisted of a sequence of depth maps of resolution 640 × 480 at 30 FPS. The recordings started before the hands entered the frame and finished as the participants finished rinsing their hands. Each recorded frame was manually labelled with a gesture as the ground truth, or a null value if it did not depict a lathering gesture. The system parameters were determined using the steps outlined in Section 4.2. The parameter values are shown in Table 4.2.

Before recording, participants were shown the images in Figure 4.1 and given a demonstration of the gestures by a researcher. Each subject was tasked with washing their hands three times: one iteration with their hands placed at a height of their choosing, and another two iterations at approximately 5 - 15 cm and 30 - 40 cm above the bottom of the basin. This was done to ensure the classifier could be trained and tested with data from varying heights. A total of 45 distinct video recordings were made.

In 10 recordings the participants had up to a quarter of their hands obscured under the faucet. The algorithm does not specifically incorporate any features to handle occlusion. However, the classifier has an opportunity to learn to compensate if a block consisting of the occluding object is incorporated into the feature vector.

In 5 recordings participants made mistakes or temporarily moved out of the scenes for a few frames. All such occurrences were annotated with a null/unrecognised gesture value in the ground truth. All these scenarios presented a realistic and challenging range of variations.

4.3.2 Test Systems

We implemented WashInDepth on two different compute platforms: a Windows PC and a Windows Compute Stick. The Compute Stick setup is shown in Figure 4.5. The system specifications are shown in Table 4.3. The Compute Stick was used to test the performance on a low power embedded system. A Kinect V1 was used to record hand gestures at 30 FPS. The system was developed in C# and ran on the .Net 4.6 runtime. HOG and HONV were also ported to C#. The decision tree was trained using the Weka 3.8 J48 library and converted into C# code.


Figure 4.4: WashInDepth test environment with dimensions.

No.   Gesture                 Instances
1     Palm to palm rub        9515
2     Left hand dorsal rub    6475
3     Right hand dorsal rub   7468
4     Fingers interlaced      6321
5     Fingers interlocked     7482
6     Left thumb rub          6045
7     Right thumb rub         6975
8     Left palm scrub         6861
9     Right palm scrub        6773
0     Null/No gesture         24128

Table 4.1: The names and number of instances of every gesture in the test data. The 1st column corresponds with the numbered illustrations of Figure 4.1.


System Parameter           Value
Z                          50
x1, x2, y1, y2             140, 500, 20, 250
r, c                       5, 5
o                          50
a, b                       100, 1000
m × n                      5 × 5, 15 × 15, 45 × 45
g                          1, 10, 50, 250
q                          10, 20, 30

Tree Parameter             Value
C - Confidence             0.25
H - Min values per leaf    10

HOG and HONV Parameters    Value
HOG nbins                  20
HONV nbins                 10 × 10
Cells and blocks           1 cell = 1 block

Table 4.2: WashInDepth, HOG and HONV parameter values. Bold values led to the best results and were used as default values in testing.

Figure 4.5: WashInDepth test platform using a Compute Stick.


Platform    PC                    Compute Stick
CPU         2.3 GHz 2820QM        1.33 GHz Z3735F
Cores       4                     4
RAM         8 GiB                 2 GiB
OS          64 bit Windows 10     32 bit Windows

Table 4.3: WashInDepth test platform specifications.

4.3.3 Experiment Setup

The experiments were carried out in 2 main parts. The first set of experiments, outlined in Section 4.3.4, measured changes in gesture detection accuracy while varying the algorithm parameters. A correctly detected gesture occurred when the classifier assigned a gesture value to a depth map which coincided with the ground truth. This was tested under the person-independent scenario, wherein data from the 15 participants was divided into 3 groups of 5. 3-fold cross validation was used, where data from 10 participants was used for training and 5 for testing. This scenario tests system behaviour when encountering gestures of participants whose idiosyncrasies have not yet been learned by the classifier.

The second experiment, in Section 4.3.4, tests gesture detection accuracy under the person-dependent scenario, wherein both the training and test data are from the same person. Each person made 3 recordings. The 3 depth videos were concatenated together and each frame was randomly assigned to one of 3 groups. 3-fold cross validation was used, where data from 2 groups was used for training and one for testing. This scenario was for testing the accuracy achievable when the classifier has learned the idiosyncrasies of each participant. Such a scenario might occur in an environment such as a hospital, where the users are known beforehand. The prototype is capable of identifying the participant using the SSID of the wireless trigger.

All experiments were conducted using the default values listed in bold on Table 4.2 unless otherwise stated.


Figure 4.6: The bar chart shows the gesture detection accuracy of WashInDepth while testing a range of block values without smoothing. The confusion matrix shows the results of the best performing iteration which was achieved by using 15 × 15 blocks.

4.3.4 Experiment Results

The following sections give an in-depth analysis of the gesture detection results. We first present progressively improving gesture detection accuracy results by finding the optimal block and smoothing window length parameters, then the influence of frame rate on accuracy, and finally the latency and memory consumption numbers.

Influence of Block Size on Accuracy

The highest gesture detection accuracy without smoothing was 51%. The gesture detection accuracy and confusion matrix of the best performing iteration are shown in Figure 4.6. Figure 4.7 presents the distribution of detected gestures every 50 frames of a typical instance of recorded depth video. This test served two purposes. The first was to get an idea of the raw frame to frame gesture detection accuracy without smoothing. The second was to empirically determine the number of blocks which give the best results.

This test was performed under the person-independent scenario using the PC platform.


Figure 4.7: The distribution of detected gestures for every 50 frames of a typical instance of recorded depth video. The colour-coded legend on right indicates gesture class.

The range of tested block values was 5 × 5, 15 × 15 and 45 × 45. No smoothing was used, so this was equivalent to g = 1. Accuracy was measured using the formula:

Accuracy = (Correctly Classified Instances / Total Instances) × 100 (4.7)

The highest accuracy was achieved by using 15 × 15 blocks per segment. At 51%, this was higher than the 48% achieved when using 5 × 5 blocks. This was expected, as a greater number of blocks created a longer and more descriptive feature vector. 45 × 45 blocks counter-intuitively achieved a lower accuracy of 46%. The reason was that some of the segments after background removal were smaller than 45 × 45 pixels; a zero vector was returned if w < m or h < n, where w, h are the segment width and height and m, n are the number of blocks.

The majority of errors fall into three categories. Firstly, the confusion matrix in Figure 4.6 indicates a number of misclassifications between gesture classes 1, 2 and 3. This is corroborated by the distribution of gestures in Figure 4.7. The reason seems to be that all three gestures have a very similar appearance: the back of the hand dominates the scene as the palm of one hand rests on top. Secondly, there are a number of gestures erroneously classified with the null/unrecognised value. This occurred during transitions between gestures, where the appearance of the hands does not resemble any particular expected gesture. Thirdly, the tests were conducted under the person-independent scenario, so the idiosyncrasies of individuals in the testing data had not been learned by the classifier.


Figure 4.8: The bar chart shows the gesture detection accuracy while testing a range of smoothing window lengths and by using 15 × 15 blocks. The confusion matrix shows the results of the best performing iteration which was achieved by using g = 50.

Influence of Smoothing Window Length on Accuracy

The highest gesture detection accuracy achieved after smoothing was 58%. The gesture detection accuracy and confusion matrix of the best performing iteration are shown in Figure 4.8. The purpose of this test was to empirically determine the window length which gives the best results and to show that this particular optimisation improves detection accuracy.

This test was performed under the person-independent scenario using the PC platform. A range of smoothing window lengths was tested: g = 10, 50, 250. 15 × 15 blocks were used, as this gave the best results pre-smoothing.

At 10 frames there was a boost in accuracy: 55% compared to 48% with no smoothing. At 50 frames the best accuracy of 58% was achieved. At 250 frames the accuracy drops to 48%. The drop was the result of too large a window, which covered multiple overlapping actions.


Influence of Frame Rate on Accuracy

The highest gesture detection accuracy achieved with a reduced frame rate was 55% at 10 FPS and 54% at 20 FPS. The detection accuracy is shown in Figure 4.9. This test was conducted to determine the performance on low power systems which may not be able to process the full 30 FPS of the Kinect. As shown in Section 4.3.4, the Compute Stick was only able to achieve 20 FPS in real time. The goal was to determine if this had an impact on detection accuracy. This test was run with all 3 previously tested window lengths, as changing the frame rate also means the window lengths used in the previous tests cover a different span of time.

This test was performed under the person-independent scenario using the PC platform. Several permutations of frame rate and smoothing window length were tested. The tested frame rates were q = 10, 20. The tested window lengths were g = 10, 50, 250. 20 FPS was simulated by skipping every 3rd frame. 10 FPS was simulated by skipping every 2nd and 3rd frame.

At 20 FPS with g = 10, 50, 250, the accuracy was 53%, 54% and 41% respectively. At 10 FPS with g = 10, 50, 250, the accuracy was 55%, 54% and 31% respectively. There was an extreme loss in accuracy using g = 250. This was because the window length covered multiple overlapping actions. The results suggest two points: firstly, a lower FPS must coincide with a shorter window length; secondly, a lower FPS caused a slight drop in accuracy.

Person-Dependent Tests

The highest gesture detection accuracy achieved under the person-dependent scenario was 94%. The results are shown in Figure 4.10. This test was conducted to determine the accuracy when the classifier was able to learn the idiosyncrasies of gestures performed by the participant.


Figure 4.9: The bar charts show the gesture detection accuracy for several permutations of frame rates and smoothing window lengths.

This test was performed on the PC platform. The range of tested block values was 5 × 5, 15 × 15 and 45 × 45. No smoothing was used, so this was equivalent to g = 1. No smoothing was used because each frame was randomly assigned to one of 3 groups for 3-fold cross validation. Smoothing improves accuracy when frames are ordered in the correct sequence, because a gesture in one frame is likely to be the same in subsequent frames except when transitioning between gestures. This is not the case when each frame is randomly grouped.

The results indicate gesture detection accuracy was much higher under the person-dependent scenario. So, when possible, the system should be trained with data from its users before the system is used.

Latency

Figure 4.11 presents the latency results for the PC and Compute Stick platforms. The time spent on memory allocation and sensor delay remains fairly constant on both platforms. On the PC the system performed without fault at 30 FPS. The Compute Stick was used to simulate the performance on a low power system. It was only able to operate at 20 FPS; every third frame was skipped due to its less capable processor.


Figure 4.10: The bar charts show the gesture detection accuracy under the person- dependent scenario. Data for each participant was assigned to 3 groups for 3-fold cross validation.

Figure 4.11: The mean run time per frame and the proportion of time spent on each task for WashInDepth. Overhead was the time taken for loading libraries, memory allocation and other miscellaneous processes.

Section 4.3.4 shows there is only a slight drop in accuracy with a drop in frame rate. So it is still feasible to implement the system on a low power system like a Compute Stick.

Memory Use

Figure 4.12 presents the memory usage results. The memory results were computed under the person-independent scenario by recording all the variables used after each frame was processed. This was multiplied by the number of bytes occupied by each variable. A smoothing window length of g = 50 and a frame rate of q = 30 were used. The tests were conducted on the PC platform; the results would be the same on the Compute Stick, as the same codebase runs on both systems.


Figure 4.12: The amount of memory used per frame for WashInDepth. This value was computed by counting every variable used to process each frame, multiplied by the number of bytes consumed by each variable.

There was a constant 25 MiB overhead from the C# runtime. In total the current prototype consumes under 27 MiB of memory when in use. If this system was further developed from the prototype, a more efficient language could be used to reduce overhead.

4.3.5 Comparison to HOG and HONV

In this section we compare WashInDepth against the HOG and HONV algorithms. We show that our features adapted from QuickFind produce superior results by several metrics. Our HOG and HONV parameters are shown in Table 4.2. We tested under the person-dependent case on the PC platform.

Our results are shown in Figures 4.13 and 4.14. WashInDepth had the best overall results. HOG achieved an average accuracy of 86% and HONV achieved 88%, compared to 94% for WashInDepth. WashInDepth also achieved the lowest latency, taking 11 ms per frame, compared to HOG with 19 ms per frame and HONV with 22 ms per frame.


Figure 4.13: The person dependent gesture recognition results for WashInDepth, HONV and HOG.

All 3 algorithms used a similar amount of memory; WashInDepth used about 4 KiB more memory than HOG on average.

The latency and memory results are much closer between all 3 algorithms this time. Unlike object detection, there is only a single ROI per frame. Generating ROIs in object detection caused much of the overhead for HOG and HONV.

Figure 4.14: The average memory and latency when comparing WashInDepth, HONV and HOG.


4.4 Summary

The goal for this chapter was to present WashInDepth. First we presented a prototype for detecting the correct application of soap while a person washes their hands. We developed a lightweight gesture recognition algorithm to accompany this prototype by extending our previous work.

For testing, data was recorded from 15 participants while they washed their hands 3 times each in a university common room. The system achieved a gesture detection accuracy of up to 94%. Furthermore, the system resource consumption was evaluated on two platforms: the PC and the Compute Stick. The total memory consumption of the system is about 27 MiB. Using the resources available on the Compute Stick, the system was lightweight enough to operate at 20 FPS. These results demonstrate the feasibility of deploying WashInDepth for monitoring hand hygiene.

We showed that WashInDepth, with an accuracy of 94%, is better than the HOG and HONV algorithms, which achieved 86% and 88% respectively. WashInDepth was also faster, with a latency of 11 ms per frame, compared to 19 and 22 ms for HOG and HONV respectively. All 3 algorithms used a similar amount of memory. These results demonstrate that WashInDepth is superior to closely related algorithms.

Chapter 5

VeinDeep Fast Vein Pattern Recognition

In this chapter we propose VeinDeep. We implement a vein pattern recognition system using depth sensors. We find it is possible to re-purpose depth sensors for this task. As far as we are aware this is the first time depth sensors have been used for this purpose. Our prototype is designed to secure a depth sensor equipped smartphone from opportunistic access, e.g. a phone left unattended.

The pervasive smartphone has been integrated into many facets of daily life. The portability, connectivity and on-board sensors have enabled a variety of uses in addition to communication. Many applications involve processing potentially sensitive data on the smartphone; examples include social networking [49], health monitoring [50] and banking 1. In response, several methods have been developed to secure the contents of smartphones from opportunistic access.

Authentication methods on smartphones are divided into implicit and explicit methods [51].

1https://play.google.com/store/apps/category/FINANCE


Implicit methods learn from the user's past behaviour by collecting sensor data or metadata and determining whether it matches current behaviour [52][53]. However, there is a vulnerable interval of time before suspicious behaviour triggers a lockdown. Explicit methods do not have this weakness; they identify the user based on the user supplying a specific piece of information. PINs, passwords and pattern unlock screens are explicit methods but are open to shoulder-surfing [54], and user chosen passwords tend to be predictable [55]. Biometrics such as face and fingerprint recognition are resistant to shoulder-surfing and cannot be easily guessed like passwords. However, fingerprints can be left on the very device they are meant to secure 2 and faces can be recorded whenever the user is visible in public 3.

Vein patterns have the advantage of not leaving imprints on surfaces like fingerprints do, and veins lie underneath the skin [29], so they are not easily recorded in public like a face. Our system uses an infrared (IR) depth sensor to record vein pattern images. Veins are visible under IR illumination because IR light penetrates biological tissue, but blood attenuates IR light differently from other tissue [56]. Since this operates outside of the visible spectrum, it works irrespective of skin colour. This method has become feasible on smartphones at the time of writing due to the recent integration of IR depth sensors into smartphones 4. These sensors are active IR devices which take 3D photographs [6]. They are used to facilitate accurate 3D indoor mapping and localisation 5. We re-purposed them for vein pattern recognition.

In Section 5.1 we present the VeinDeep system. In Section 5.2 we present the VeinDeep algorithm. In Section 5.3 we compare VeinDeep against Kernel and Hausdorff distances. Section 5.4 ends the chapter with the summary.

2https://www.ccc.de/en/updates/2013/ccc-breaks-apple-touchid
3http://www.popsci.com/its-not-hard-trick-facial-recognition-security
4https://get.google.com/tango
5https://www.bloomberg.com/news/articles/2016-05-12/google-looks-beyond-maps-to-chart-the-interior-world-in-3-d


Figure 5.1: VeinDeep is designed to unlock an infrared depth sensor equipped smartphone using vein patterns. The user takes an infrared image and depth map of the back of their hand. This is used to extract vein patterns which identify the user.

5.1 System

The use case of VeinDeep is depicted in Figure 5.1. The user is expected to take an IR image and depth map of the back of their hand (hand dorsum) while making a fist. The fist gesture helps make veins more visible by forcing veins closer to the surface. The user can use their left or right hand. The hand used to unlock the device must be the same one registered with the device, as vein patterns are unique to each person and each hand [28][26].

Vein patterns are extracted from the IR image using an adaptive threshold filter. The depth map is used for background removal and to correct perspective distortion. Perspective distortion makes regions of the hand dorsum tilting away from the sensor appear smaller. Given vein pattern images, VeinDeep uses the location where veins intersect with each row in the image and the row index to compare similarity. A change in the number of vein intersections indicates where veins start, end or cross. This discriminating feature is compared using a similarity function.


Figure 5.2: VeinDeep algorithm workflow from raw infrared image and depth map to vein pattern.

Figure 5.3: Examples of vein patterns from 3 different participants.

5.2 Algorithm

The algorithm workflow of VeinDeep is depicted in Figures 5.2, 5.5 and 5.7. As the figures suggest, VeinDeep consists of 3 major components. After the image is recorded, vein patterns are taken from the IR image using an adaptive threshold filter. This finds local intensity changes as veins appear darker than surrounding tissue in the IR image. The background is removed using a depth threshold as we assume the hand is closer to the sensor than the background. We correct the vein pattern image for perspective distortion by using the depth map to rotate the pixels in 3D. An example of the output from 3 participants is shown in Figure 5.3.

Figure 5.4: A series of vein pattern images taken in succession from the right hand of a single person. There are small differences in each extracted vein pattern image. We refer to this as jitter.

The second step is to find the key points in the vein pattern. We do this by finding the locations where veins intersect with each row in the vein pattern image. This feature is important for the following reasons. It is discriminating because it reveals the locations of vein endings and crossings. The relative positions of these points are also resolution independent, even though the vein pattern image appears larger when taken closer to the sensor.

The final step is to compare points in reference and test vein patterns. We compute a similarity score. This is implemented in the same manner as Zhang’s Kernel distance function [12]. However, our input is the previously found key points, whereas Zhang uses every pixel. This makes our comparison much faster.

One of the challenges is that a sequence of vein patterns recorded from the same source may have subtle differences. Henceforth we refer to this as jitter. An example of jitter is shown in Figure 5.4. Sources of jitter include sensor errors, changes in distance between the hand and the sensor, and changes in perspective distortion caused by the hand dorsum being tilted differently in each image. Much of the implementation and the vast majority of system parameters are chosen to deal with this problem. The implementation details and the way we choose parameters are described in the following sections.


5.2.1 Extract Vein Pattern

The workflow for extracting the vein pattern is depicted in the 1st column from the left of Figure 5.2. The first step is to crop the region of interest (ROI) from both the IR image and depth map. The vein patterns are computed by applying an adaptive threshold filter on the ROI. The filter reveals the veins which appear darker than surrounding tissue in the IR image. The background is removed using an image mask computed from the depth map. We first define some parameters and heuristics for parameter choice before the detailed explanation.

• u1, u2, v1, v2 represent the rows and columns of the bounding box of the ROI. We choose the values so the bounding box is large enough to enclose the hand dorsum.

• I, D, i(u, v), d(u, v) represent the IR image and depth map from the ROI. I is the IR image, i(u, v) holds the IR pixel intensity value at coordinates (u, v). D is the depth map, d(u, v) holds the depth pixel value at coordinates (u, v).

• M, m(u, v), z1, z2 are used for background subtraction. M is a binary image mask consisting of binary pixels m(u, v). z1, z2 are thresholds used to determine the value of each m(u, v). We choose z1 so that the hand can be close enough to fill the ROI, or at the closest operating range of the sensor, to maximise the resolution of the recorded data. z2 is the range beyond which recorded data is too far away to be useful.

• N, n(u, v), a, b are used for background subtraction. N is a binary image mask consisting of binary pixels n(u, v). N is equal to M scaled down in width by a and height by b. a, b are chosen to be the largest values which remove the outline of the hand in the IR image.


• O(u, v, o) is a function and not part of the images shown in Figure 5.2. It is used as part of an adaptive threshold filter. It computes the mean value of a block of o × o pixels in width and height, centred on i(u, v). o is chosen to be larger than the width of a vein in pixels, but smaller than the typical distance between veins.

• J, j(u, v) represent the vein patterns. J is a binary image consisting of binary pixels j(u, v). If a pixel value is 1 it is considered vein structure, 0 represents the background.

We extract I and D from the ROI of the raw input bounded by rows u1, u2 and columns v1, v2. A mask M is created from D using a threshold filter. We choose to use a threshold filter because we have made the assumption that objects within a certain distance of the sensor belong to the hand of the user and everything else belongs in the background. The following is the definition of the threshold filter.

m(u, v) = \begin{cases} 1 & \text{if } z_1 < d(u, v) \leq z_2 \\ 0 & \text{otherwise} \end{cases} \quad (5.1)
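As an illustration, the mask in Equation 5.1 maps directly onto OpenCV's inRange function. The following is a minimal sketch, assuming the depth map is a 16-bit single-channel cv::Mat cropped to the ROI and that z1, z2 are in the same integer units as the sensor output; it is not necessarily the exact implementation used in the prototype.

#include <opencv2/opencv.hpp>

// Build the binary mask M of Equation 5.1: m(u, v) = 1 when z1 < d(u, v) <= z2.
// `depth` is assumed to be a CV_16UC1 depth map cropped to the ROI; z1 and z2
// are thresholds in the same units as the sensor output (e.g. millimetres).
cv::Mat depthMask(const cv::Mat& depth, int z1, int z2)
{
    cv::Mat m;
    // inRange is inclusive at both ends, so the lower bound is shifted by one
    // unit to approximate the strict inequality z1 < d(u, v).
    cv::inRange(depth, cv::Scalar(z1 + 1), cv::Scalar(z2), m);
    return m;   // CV_8UC1, non-zero where m(u, v) = 1
}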

In the input image I, veins can be found because they appear darker than skin. Similarly, the edges coinciding with the outline of the hand show a transition from the dark background to the lighter skin. These transitions get misinterpreted as veins, so they must be removed. We do this by using a secondary mask created from M called N. N is M scaled down in width by a and height by b. The value of each pixel in N is chosen by picking the closest pixel by proximity in M. Example images of M and N can be seen in the left most column of Figure 5.2.

A 4-connected components algorithm is used to find segments formed by non-zero pixels in N and M. Both M and N are cleaned by removing all but the largest segment. The pixels removed this way are replaced by zero value pixels.
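A minimal sketch of this clean-up step, assuming the masks are 8-bit binary cv::Mat images and using OpenCV's connectedComponentsWithStats; the thesis does not state which connected components implementation was used.

#include <opencv2/opencv.hpp>

// Keep only the largest 4-connected segment of non-zero pixels in a binary
// mask and zero everything else. Applied to both M and N.
cv::Mat keepLargestSegment(const cv::Mat& mask)
{
    cv::Mat labels, stats, centroids;
    int n = cv::connectedComponentsWithStats(mask, labels, stats, centroids, 4, CV_32S);

    // Label 0 is the background; find the foreground label with the most pixels.
    int best = 0, bestArea = 0;
    for (int i = 1; i < n; ++i) {
        int area = stats.at<int>(i, cv::CC_STAT_AREA);
        if (area > bestArea) { bestArea = area; best = i; }
    }

    cv::Mat cleaned = cv::Mat::zeros(mask.size(), CV_8UC1);
    if (best > 0)
        cleaned.setTo(255, labels == best);   // labels == best yields a binary mask
    return cleaned;
}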


The background is removed from I by using N as a mask and vein patterns are extracted using an adaptive threshold filter. The filter produces result J and is defined in the following formula.

j(u, v) = \begin{cases} 1 & \text{if } i(u, v) > O(u, v, o) \text{ and } n(u, v) = 1 \\ 0 & \text{otherwise} \end{cases} \quad (5.2)
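Equation 5.2 corresponds closely to OpenCV's mean adaptive threshold with a zero offset. The following is a minimal sketch, assuming the IR image has already been converted to an 8-bit single-channel image (adaptiveThreshold requires this) and that the block size o is odd.

#include <opencv2/opencv.hpp>

// Extract the vein pattern J of Equation 5.2 from an IR image of the ROI.
// `ir` is an 8-bit single-channel image, `maskN` is the shrunken binary mask N
// (non-zero where n(u, v) = 1), and `o` is the odd block size of the filter.
cv::Mat extractVeins(const cv::Mat& ir, const cv::Mat& maskN, int o)
{
    cv::Mat j;
    // dst = 255 where ir(u, v) > mean of the o-by-o block centred on (u, v).
    cv::adaptiveThreshold(ir, j, 255, cv::ADAPTIVE_THRESH_MEAN_C,
                          cv::THRESH_BINARY, o, 0);
    // Enforce the n(u, v) = 1 condition: everything outside N is background.
    j.setTo(0, maskN == 0);
    return j;
}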

5.2.2 Calculate Derivatives and Angles

The workflow for calculating derivatives and angles is depicted in the 2nd column from the left of Figure 5.2. J is corrected for perspective distortion caused by the surface of the hand dorsum being tilted in relation to the image plane. We use derivatives of the 3D depth map to compute the angles of tilt at each pixel. We then use these to approximate the average angles of tilt on 3 axes for the entire hand dorsum surface. We define some variables before the detailed explanation.

• α(u, v), β(u, v), γ(u, v) are the angles along the 3 axes at coordinates (u, v). This is computed from the 3D depth map.

• αmean, βmean, γmean are the mean values of all the angles. E.g. αmean is the mean value of all the computed α(u, v).

• αref, βref, γref are the reference angles. We want to rotate the vein patterns in 3D until the average angles on 3 axes equal these values. αref, βref, γref are fixed constants for all implementations.

• φ, θ, ψ are the differences between the mean and reference angles. E.g. φ = αref − αmean.


• φmax, θmax, ψmax are the maximum values allowed for φ, θ, ψ. If exceeded, the input is rejected. φmax, θmax, ψmax are chosen to be values less than 90 degrees, within which it is possible to recover from jitter.

The 3 angles are approximated using the following formulas [9].

\frac{\partial d(u,v)}{\partial u} = \frac{1}{2}\big(d(u-1,v) - d(u+1,v)\big) \quad (5.3)

\frac{\partial d(u,v)}{\partial v} = \frac{1}{2}\big(d(u,v-1) - d(u,v+1)\big) \quad (5.4)

\frac{\partial m(u,v)}{\partial u} = \frac{1}{2}\big(m(u-1,v) - m(u+1,v)\big) \quad (5.5)

\frac{\partial m(u,v)}{\partial v} = \frac{1}{2}\big(m(u,v-1) - m(u,v+1)\big) \quad (5.6)

\alpha(u,v) = \arctan\!\left(\frac{\partial d(u,v)}{\partial u} \Big/ 1\right) \quad (5.7)

\beta(u,v) = \arctan\!\left(\frac{\partial d(u,v)}{\partial v} \Big/ 1\right) \quad (5.8)

\gamma(u,v) = \arctan\!\left(\frac{\partial m(u,v)}{\partial v} \Big/ \frac{\partial m(u,v)}{\partial u}\right) \quad (5.9)
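For illustration, the per-pixel computation can be written directly as central differences over the depth map and mask. The following is a minimal sketch, assuming a single-channel float depth map and an 8-bit mask M, both cropped to the ROI; atan2 is used for Equation 5.9 so the ratio is still defined when the denominator is zero, and the angles come out in radians rather than the degrees used for the reference values later.

#include <opencv2/opencv.hpp>
#include <cmath>

// Per-pixel tilt angles of Equations 5.3-5.9 using central differences.
// `depth` is a CV_32FC1 depth map of the ROI, `maskM` the binary mask M (CV_8UC1).
// Results are returned in radians as three CV_32FC1 images.
void tiltAngles(const cv::Mat& depth, const cv::Mat& maskM,
                cv::Mat& alpha, cv::Mat& beta, cv::Mat& gamma)
{
    alpha = cv::Mat::zeros(depth.size(), CV_32F);
    beta  = cv::Mat::zeros(depth.size(), CV_32F);
    gamma = cv::Mat::zeros(depth.size(), CV_32F);

    // Treat mask pixels as 0 or 1, matching m(u, v) in the text.
    auto m = [&](int u, int v) { return maskM.at<uchar>(u, v) > 0 ? 1.0f : 0.0f; };

    for (int u = 1; u < depth.rows - 1; ++u) {
        for (int v = 1; v < depth.cols - 1; ++v) {
            // Equations 5.3 and 5.4: central differences on the depth map.
            float du = 0.5f * (depth.at<float>(u - 1, v) - depth.at<float>(u + 1, v));
            float dv = 0.5f * (depth.at<float>(u, v - 1) - depth.at<float>(u, v + 1));
            alpha.at<float>(u, v) = std::atan(du);      // Equation 5.7
            beta.at<float>(u, v)  = std::atan(dv);      // Equation 5.8

            // Equations 5.5, 5.6 and 5.9: the in-plane angle from the mask gradients.
            float mu = 0.5f * (m(u - 1, v) - m(u + 1, v));
            float mv = 0.5f * (m(u, v - 1) - m(u, v + 1));
            gamma.at<float>(u, v) = std::atan2(mv, mu);
        }
    }
}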

The values of α(u, v) and β(u, v) are used only at locations where n(u, v) = 1. Angles can only be reliably computed away from the edges of the hand, because depth pixels lying on the edge will have neighbouring pixels with zero value. N is the smaller secondary mask, so when n(u, v) = 1 the pixel is assured to lie away from the edges of the hand. The γ(u, v) values are calculated at locations where m(u, v) = 1. We use the larger mask M because we specifically want to know the value of the gradients along the edges.

We take all the α(u, v) and β(u, v) that have been computed at locations where n(u, v) = 1 and find their mean values. We call these values αmean and βmean respectively. We compute the mean value of γ(u, v) at locations where m(u, v) = 1 and call this γmean. As explained above, this restricts the depth-based angles to pixels away from the edges of the hand, while the mask-based angle is computed along the edges.

αmean, βmean, γmean are angles which represent the average tilt on 3 axes. We want to align J against 3 reference angles which we call αref , βref , γref . To do this we compute the difference between mean and reference angles, which are called φ, θ, ψ. These are defined as follows.

φ = αref − αmean (5.10)

θ = βref − βmean (5.11)

ψ = γref − γmean (5.12)

A high level of tilt can obfuscate the vein pattern image. A combination of perspective distortion and shadows cast by knuckles and depressions on the hand dorsum surface means the vein pattern can become irrecoverable. Since it is possible to detect the tilt, we limit the largest absolute values of φ, θ, ψ. If these limits are exceeded, the input is rejected. We call these thresholds φmax, θmax, ψmax.
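This step can be written directly with OpenCV's masked mean. The following is a minimal sketch, assuming the angle images have already been converted to degrees so they compare against the reference angles and limits; the default arguments simply copy the values later listed in Table 5.2.

#include <opencv2/opencv.hpp>
#include <cmath>

// Mean tilt angles and the rejection test of Section 5.2.2. The angle images
// are assumed to be in degrees. Returns false when any of |phi|, |theta|,
// |psi| exceeds its limit, in which case the input should be rejected.
bool tiltCorrectionAngles(const cv::Mat& alphaDeg, const cv::Mat& betaDeg,
                          const cv::Mat& gammaDeg,
                          const cv::Mat& maskN, const cv::Mat& maskM,
                          double& phi, double& theta, double& psi,
                          double alphaRef = 90, double betaRef = 90, double gammaRef = 180,
                          double phiMax = 25, double thetaMax = 25, double psiMax = 35)
{
    // alpha and beta are averaged over the shrunken mask N, gamma over the full mask M.
    double alphaMean = cv::mean(alphaDeg, maskN)[0];
    double betaMean  = cv::mean(betaDeg,  maskN)[0];
    double gammaMean = cv::mean(gammaDeg, maskM)[0];

    phi   = alphaRef - alphaMean;    // Equation 5.10
    theta = betaRef  - betaMean;     // Equation 5.11
    psi   = gammaRef - gammaMean;    // Equation 5.12

    return std::abs(phi) <= phiMax && std::abs(theta) <= thetaMax && std::abs(psi) <= psiMax;
}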

5.2.3 Rotate to Reduce Distortion

We use the 3 previously computed angles to rotate the vein pattern image J in 3D and reduce perspective distortion. The algorithm workflow is depicted in the 3rd column from the left of Figure 5.2. The properties of projective geometry [57] mean we have to convert each pixel with coordinates (u, v) into a different coordinate system, perform the rotations, then convert it back. We first define some variables.

• (x, y) are the coordinates along the horizontal and vertical axes of a pixel in the new coordinate system used to rotate the vein pattern image J.

• f is the focal length of the sensor. This is a fixed value based on the hardware.

• w, h are width and height of the region of interest. w = u2 − u1, h = v2 − v1.

• H1, H2, H3 are the rotation matrices used to rotate the pixels in the vein pattern image.

The relationship between the two coordinate systems is the following.

d(u,v) \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f & 0 & \lfloor w/2 \rfloor \\ 0 & f & \lfloor h/2 \rfloor \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ d(u,v) \end{pmatrix} \quad (5.13)

Once converted to the new coordinate system, the pixels in J can be rotated using the following 3 rotation matrices.

H_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\phi) & \sin(\phi) \\ 0 & -\sin(\phi) & \cos(\phi) \end{pmatrix} \quad (5.14)

H_2 = \begin{pmatrix} \cos(\theta) & 0 & -\sin(\theta) \\ 0 & 1 & 0 \\ \sin(\theta) & 0 & \cos(\theta) \end{pmatrix} \quad (5.15)

H_3 = \begin{pmatrix} \cos(\psi) & \sin(\psi) & 0 \\ -\sin(\psi) & \cos(\psi) & 0 \\ 0 & 0 & 1 \end{pmatrix} \quad (5.16)

The matrices are applied to (x, y, d(u, v)). The relationship in Equation 5.13 is used to transform the result back into the (u, v) coordinate system.

H_1 H_2 H_3 \begin{pmatrix} x \\ y \\ d(u,v) \end{pmatrix} \quad (5.17)
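A sketch of the per-pixel back-projection, rotation and re-projection follows, using the thesis convention that u indexes rows and v columns with w = u2 − u1 and h = v2 − v1. It assumes a float depth map aligned with J and a focal length already expressed in pixels (converting the 3.657 mm of Table 5.2 into pixels requires the sensor's pixel pitch, which is not given here); holes left by the forward mapping are not filled.

#include <opencv2/opencv.hpp>
#include <cmath>

// Rotate the binary vein image J in 3D to reduce perspective distortion
// (Equations 5.13-5.17). phi, theta, psi are in radians, f is in pixels.
cv::Mat rotateVeinImage(const cv::Mat& j, const cv::Mat& depth,
                        double phi, double theta, double psi, double f)
{
    // Thesis convention: u indexes rows, v indexes columns.
    const int w = j.rows, h = j.cols;

    const cv::Matx33d H1(1, 0, 0,
                         0,  std::cos(phi), std::sin(phi),
                         0, -std::sin(phi), std::cos(phi));
    const cv::Matx33d H2(std::cos(theta), 0, -std::sin(theta),
                         0, 1, 0,
                         std::sin(theta), 0,  std::cos(theta));
    const cv::Matx33d H3( std::cos(psi), std::sin(psi), 0,
                         -std::sin(psi), std::cos(psi), 0,
                          0, 0, 1);
    const cv::Matx33d R = H1 * H2 * H3;

    cv::Mat out = cv::Mat::zeros(j.size(), CV_8UC1);
    for (int u = 0; u < w; ++u) {
        for (int v = 0; v < h; ++v) {
            if (j.at<uchar>(u, v) == 0) continue;
            double d = depth.at<float>(u, v);
            if (d <= 0) continue;

            // Equation 5.13: back-project the pixel into the (x, y, d) frame.
            double x = (u - w / 2) * d / f;
            double y = (v - h / 2) * d / f;

            // Equation 5.17: rotate, then project back with Equation 5.13.
            cv::Vec3d p = R * cv::Vec3d(x, y, d);
            int u2 = static_cast<int>(f * p[0] / p[2]) + w / 2;
            int v2 = static_cast<int>(f * p[1] / p[2]) + h / 2;
            if (u2 >= 0 && u2 < w && v2 >= 0 && v2 < h)
                out.at<uchar>(u2, v2) = 255;
        }
    }
    return out;
}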


Figure 5.5: VeinDeep algorithm workflow from vein pattern to sequence of points.

5.2.4 Clean Image

After J is corrected for distortion, the result is cleaned up. The clean-up is depicted in the 4th column from the left of Figure 5.2. We define some variables for this final step.

• K, k(u, v) represent the final vein pattern image after perspective distortion correction and clean-up. k(u, v) is the binary pixel value at coordinates (u, v).

• g, any segment containing less than g non-zero pixels is removed. g is chosen to be large enough so that the majority of segments containing noise are removed.

A cleaning step is performed to finalise K. First a 4-connected components algorithm is run to determine the segments. All segments smaller than g pixels are removed. Small gaps are then filled: every pixel with k(u, v) = 0 which has 2 non-zero neighbours is replaced with k(u, v) = 1. Finally we find the bounding box which encloses all non-zero pixels and remove the rows and columns outside this bounding box. This is the last step in pre-processing.
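A sketch of this clean-up follows, assuming 8-bit binary images. The text says a zero pixel with 2 non-zero neighbours is filled; at least 2 neighbours in a 4-neighbourhood is assumed here.

#include <opencv2/opencv.hpp>
#include <vector>

// Final clean-up of Section 5.2.4: drop 4-connected segments with fewer than g
// pixels, fill small gaps, then crop to the bounding box of non-zero pixels.
cv::Mat cleanVeinImage(const cv::Mat& j, int g)
{
    cv::Mat labels, stats, centroids;
    int n = cv::connectedComponentsWithStats(j, labels, stats, centroids, 4, CV_32S);

    cv::Mat k = cv::Mat::zeros(j.size(), CV_8UC1);
    for (int i = 1; i < n; ++i)
        if (stats.at<int>(i, cv::CC_STAT_AREA) >= g)
            k.setTo(255, labels == i);

    // Fill gaps: a zero pixel with at least 2 non-zero 4-neighbours becomes 1.
    cv::Mat filled = k.clone();
    for (int u = 1; u < k.rows - 1; ++u)
        for (int v = 1; v < k.cols - 1; ++v)
            if (k.at<uchar>(u, v) == 0) {
                int neighbours = (k.at<uchar>(u - 1, v) > 0) + (k.at<uchar>(u + 1, v) > 0)
                               + (k.at<uchar>(u, v - 1) > 0) + (k.at<uchar>(u, v + 1) > 0);
                if (neighbours >= 2) filled.at<uchar>(u, v) = 255;
            }

    // Crop rows and columns outside the bounding box of the non-zero pixels.
    std::vector<cv::Point> nonZero;
    cv::findNonZero(filled, nonZero);
    if (nonZero.empty()) return filled;
    return filled(cv::boundingRect(nonZero)).clone();
}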


Figure 5.6: The effects of changing pre-processing parameters on the vein pattern are shown in this image.

5.2.5 Effect of Changing Pre-processing Parameters

We have provided a visual guide in Figure 5.6 on the effect of changing several pre-processing parameter values. We determined the optimal values for each parameter empirically. However, once set, these values stay the same for any user so long as the sensor remains the same.

The parameters fall into 3 groups. Parameters a, b crop the vein pattern. o is the size of the adaptive filter. A smaller o preserves more detail, but introduces noise into the vein pattern image. A larger o leads to the removal of thinner veins. g is used to remove segments consisting of fewer than g pixels. A smaller g preserves detail but leaves in noise.


5.2.6 Convert Vein Pattern to Sequence

We convert the final binary vein pattern image into a sequence of points. The algorithm workflow is depicted in Figure 5.5. The points represent the locations where veins intersect with each row in the vein pattern image. We want to find these points because they describe the shape of the vein pattern. Finding these key points is similar to applying a line thinning algorithm. However, our approach is faster as it only requires one pass compared to the multiple passes needed for line thinning. We can use one pass because we are not trying to preserve connectivity after thinning.

We use an example to clarify our approach. Veins span a few contiguous pixels in a row. In row 1 of Figure 5.5, the vein spans pixels 180, 181, ..., 190 (these are the column indices from left to right). We only keep one of the pixels in the span, the pixel at column 180, which is the first one encountered as we process the row in ascending order of index values. In row 2, veins span pixels 115 − 130, 160 − 165 and 180 − 190. We keep the pixels at 115, 160, 180. We do this for every row.

We define the sequence of points using the following variables.

• s(j) = (j, u, v). u, v are the position of the point, j is the index of the point in the sequence.

• S = s(1), s(2), . . . , s(jmax).
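A one-pass sketch of the conversion just described is shown below. The KeyPointRow struct is our own naming for s(j) = (j, u, v), not a name from the thesis.

#include <opencv2/opencv.hpp>
#include <vector>

// s(j) = (j, u, v); the struct name is ours, not from the thesis.
struct KeyPointRow { int j; int u; int v; };

// One-pass conversion of the cleaned vein image K into the sequence S:
// for every row, keep the first column of each run of contiguous non-zero pixels.
std::vector<KeyPointRow> toSequence(const cv::Mat& k)
{
    std::vector<KeyPointRow> s;
    int index = 1;
    for (int u = 0; u < k.rows; ++u) {
        bool inRun = false;
        for (int v = 0; v < k.cols; ++v) {
            bool on = k.at<uchar>(u, v) > 0;
            if (on && !inRun)               // first pixel of a vein crossing this row
                s.push_back({index++, u, v});
            inRun = on;
        }
    }
    return s;
}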

5.2.7 Compare Sequences

Figure 5.7: We compare the similarity of sequences of key points from 2 vein patterns, first by computing the Euclidean distance between key points in both sequences. The Euclidean distance is then used to compute a similarity score.

Our similarity function is the Gaussian function. First we compute the Euclidean distance between each point in the reference and test sequences. The distance values are used to compute the final similarity score. The algorithm workflow is depicted in Figure 5.7. We define some variables to describe this process in detail.

• S is the sequence computed from reference vein pattern. T is the sequence computed from the test vein pattern. s ∈ S, t ∈ T .

• ||s − t|| is the Euclidean distance between points s and t.

• KF(s, t, σ) = exp(−||s − t||² / σ²) is the Gaussian function. σ is the kernel width which penalises mis-alignment between key points in reference and test images. The optimal value must be empirically determined.

The similarity score is calculated using the following formula:

\sum_{s \in S}\sum_{s' \in S} KF(s, s', \sigma) \;+\; \sum_{t \in T}\sum_{t' \in T} KF(t, t', \sigma) \;-\; 2\sum_{s \in S}\sum_{t \in T} KF(s, t, \sigma)

This is the same as Formula 2.1 used in Kernel distance. We have previously described the properties of this formula in Section 2.3.1. The difference this time is our input is a subset of the values used in the formula in Section 2.3.1.
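A sketch of the score computation over two key-point sequences follows, reusing the KeyPointRow struct from the previous sketch. Whether the sequence index j enters the Euclidean distance is not stated, so only the (u, v) coordinates are used here; a lower score indicates a closer match.

#include <cmath>
#include <vector>

// KeyPointRow as in the previous sketch.
struct KeyPointRow { int j; int u; int v; };

// Gaussian kernel KF(s, t, sigma) over the (u, v) coordinates of two key points.
static double kf(const KeyPointRow& a, const KeyPointRow& b, double sigma)
{
    double du = a.u - b.u, dv = a.v - b.v;
    return std::exp(-(du * du + dv * dv) / (sigma * sigma));
}

// Similarity score of Section 5.2.7 between reference sequence S and test
// sequence T; a lower score indicates a closer match.
double similarity(const std::vector<KeyPointRow>& S,
                  const std::vector<KeyPointRow>& T, double sigma)
{
    double ss = 0, tt = 0, st = 0;
    for (const auto& a : S) for (const auto& b : S) ss += kf(a, b, sigma);
    for (const auto& a : T) for (const auto& b : T) tt += kf(a, b, sigma);
    for (const auto& a : S) for (const auto& b : T) st += kf(a, b, sigma);
    return ss + tt - 2.0 * st;
}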


Platform    PC                    Compute Stick
CPU         2.6 GHz i5 3320M      1.33 GHz Z3735F
Cores       2                     4
RAM         8 GiB                 2 GiB
OS          Windows 10 64 bit     Windows 10 32 bit

Table 5.1: VeinDeep test platform specifications.

5.3 Experiments

This section outlines the experiments and implementation details of a proof of concept VeinDeep system. We collected data from several subjects and used this data to benchmark VeinDeep against Hausdorff and Kernel distance. We present benchmarks showing the matching accuracy, mean matching time and memory consumption. Then we test the effect of changing the kernel width.

5.3.1 System Setup

We implemented VeinDeep on two different platforms: a Windows PC and a Windows Compute Stick. The system specifications are shown in Table 5.1. As IR depth sensor equipped smartphones were not yet readily available at the time of writing, we simulated the device using a Compute Stick and Kinect V2 depth sensor. The Compute Stick represents the kind of performance found in low power embedded devices like smartphones. The Kinect V2 was set up on a tripod pointing down as shown in Figure 5.8. The system was developed in C++, using the OpenCV 3.1 library and Visual C++ 2015 compiler.


Figure 5.8: VeinDeep used a Kinect V2.

5.3.2 Data Collection

Data was recorded from a total of 20 participants. The participants included 12 males and 8 females. 17 were graduate students aged between 20 and 35 years, one individual was aged 18 and 2 were aged between 50 and 60. The recordings were conducted in an office environment. 6 instances were recorded per hand, per participant. The right and left hands have different vein patterns so each participant's hands were treated as separate subjects. This gave a total of 240 recorded instances from 40 different hands. Each recorded IR image and depth map is 512 × 424 pixels in resolution. Our ROI is a centre portion 80 × 80 pixels in resolution.

We had each participant make a fist for each recording. This gesture forces the veins on the hand dorsum closer to the surface, making them more visible. The IR illumination on a depth sensor is weak, so this gesture helps improve data collection. We did not use the palm because the thicker tissue in the palm obstructs the vein patterns.

Each recording was made with the hand dorsum approximately in the centre of the ROI, between 50 to 65 cm from the sensor. We had the participant retract their hand in between each recording. This procedure introduces some jitter, in order to create a data set with height and tilt variations.


VeinDeep Parameter          Value
u1, u2, v1, v2, z1, z2      216, 296, 172, 252, 500, 650
a, b                        0.7, 0.8
o, g                        11, 9
αref, βref, γref            90, 90, 180
φmax, θmax, ψmax            25, 25, 35
f                           3.657
σ                           5

Table 5.2: VeinDeep parameter values.

The system parameter values were set using the heuristics described in Section 5.2. The parameter values are shown in Table 5.2. The f parameter is the focal length of the Kinect V2 lens which is 3.657 mm [6]. All tests were performed using these values unless otherwise stated.

5.3.3 Matching Accuracy

Figure 5.9 shows the precision / recall values. Figure 5.10 shows the confusion matrices when precision is at approximately 0.98. VeinDeep edges out Kernel distance in precision. VeinDeep achieves a precision of 0.98 with a recall of 0.83, while Kernel distance has a precision of 0.90 at a similar recall value. Higher precision means fewer false positives at the expense of more false negatives. This is good for authentication systems, as a false positive lets in an intruder, which is disastrous, while a false negative is only a temporary inconvenience.

For testing, we had 6 recordings for each hand. We randomly picked one to use as the reference vein pattern, the 5 others were used as positive samples, and the remaining 234 recordings from other subjects were used as negative samples. We did this for all 40 hands. For all 3 algorithms we pre-process the vein patterns using the procedure depicted in Figure 5.2. Then we pass the cleaned vein pattern images through the 3 algorithms to get similarity scores. The Kernel distance function worked best with a kernel width of 1, while VeinDeep worked best with a kernel width of 5.

To generate Figure 5.9, we ran the 3 algorithms using the procedure described previously. On each iteration we recorded the similarity score. If this score was below a threshold we considered it a positive. If it matched the ground truth then it was considered a true positive, otherwise a false positive. If the similarity score was above the threshold then the result is considered a negative. If it matched the ground truth it was considered a true negative, otherwise a false negative. We adjust the thresholds to get our precision/recall curves. The threshold values are 0 - 350 for VeinDeep, 0 - 15 for Hausdorff and 0 - 2000 for Kernel Distance.
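For illustration, one point on the curves in Figure 5.9 can be computed from a set of similarity scores as sketched below; sweeping the threshold over the ranges quoted above yields each curve. The Comparison struct and function names are our own, not from the thesis.

#include <vector>

// One comparison between a reference and a test vein pattern: its similarity
// score and whether the two recordings come from the same hand (ground truth).
struct Comparison { double score; bool genuine; };

// A score below the threshold is treated as a positive, following Section 5.3.3.
void precisionRecall(const std::vector<Comparison>& comparisons, double threshold,
                     double& precision, double& recall)
{
    int tp = 0, fp = 0, fn = 0;
    for (const auto& c : comparisons) {
        bool positive = c.score < threshold;
        if (positive && c.genuine) ++tp;
        else if (positive && !c.genuine) ++fp;
        else if (!positive && c.genuine) ++fn;
    }
    precision = (tp + fp > 0) ? static_cast<double>(tp) / (tp + fp) : 1.0;
    recall    = (tp + fn > 0) ? static_cast<double>(tp) / (tp + fn) : 0.0;
}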

We believe VeinDeep performed best for the following reason. Kernel functions apply a penalty when a pixel in the reference image has no matching pixel close to the corresponding location in the test image. The input vein pattern images are often not perfectly aligned because of jitter. Key point selection in VeinDeep means there are fewer potential points to penalise compared to Kernel distance.

5.3.4 Matching Speed

Figure 5.11 shows the mean time to process and match vein patterns on the PC and Compute Stick platforms. We wanted to test if there was any unacceptable lag in response time. We started to measure the time taken after reading the test image into memory and stopped once the similarity scores were computed. VeinDeep is the fastest by far, taking 466 milliseconds on the Compute Stick. This is under 1/3 the time of Hausdorff distance and under 1/6 the time of Kernel distance.

Figure 5.9: The precision / recall results of vein pattern matching algorithms. VeinDeep, with a precision of 0.98 and a recall of 0.83, edges out Kernel distance, which has a precision of 0.90 at a similar recall level.

Figure 5.10: The confusion matrices for VeinDeep, Kernel and Hausdorff distance when precision is at approximately 0.98.

The results coincide with a cursory analysis of the asymptotic upper bound. We exclude the pre-processing step as all 3 algorithms use the same procedure. We first define nz(K1) as the number of non-zero pixels in K1 and P(K1) as the number of key points found in K1. Given a reference vein pattern image K1 and a test pattern K2, and assuming mathematical operations such as exp(−||k1 − k2||²/σ²) are O(1), we get the following upper bounds.

Kernel distance consists of 3 parts as shown in Formula 2.1. All 3 parts require a quadratic number of steps. The following expression is the complexity.

O(nz(K1)² + nz(K2)² + nz(K1) × nz(K2)) (5.18)

Hausdorff distance consists of 2 parts as shown in Formula 2.2. The inner expression requires nz(K2) steps. The outer expression requires nz(K1) repetitions of the inner loop. The total complexity is the following.

O(nz(K1) × nz(K2)) (5.19)

VeinDeep has a quadratic asymptotic upper bound like Kernel distance due to implementation similarities. However, the input length is based on P(K1) and P(K2). The following expression is the complexity.

O(P(K1)² + P(K2)² + P(K1) × P(K2)) (5.20)

We know that P (K1) ≤ nz(K1) and P (K2) ≤ nz(K2). However, under most circumstances the number of non-zero pixels vastly exceeds the number of key points. So under most circumstances VeinDeep is faster.

5.3.5 Memory Use

Figure 5.11: The mean run time to match a pair of vein pattern images.

Figure 5.12: The mean memory usage to match a pair of vein pattern images.

Figure 5.12 shows the mean memory usage to process and match vein patterns on the Compute Stick platform. We wanted to know if the memory usage fit within our embedded system. We recorded the peak memory usage while processing each test image and computed the mean value. VeinDeep used 6 MiB, Kernel distance used 9 MiB and Hausdorff used 14 MiB. All 3 algorithms used a small portion of total available memory.

5.3.6 Changing the Kernel Width

All parameters except one in VeinDeep deal with pre-processing of the vein pattern image. Figure 5.13 shows the precision / recall values of changing σ, the kernel width. As previously explained, a smaller kernel width leads to greater penalties when a pixel in the reference image does not have a corresponding pixel in the same location in the test image. We can see that choosing anything outside a narrow range led to non-optimal results. Optimal values for this parameter must be determined empirically.
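To make the effect concrete, consider a single pair of corresponding key points offset by 3 pixels due to jitter (an illustrative figure, not one measured in our experiments):

KF(s, t, \sigma) = \exp\!\left(-\frac{\lVert s - t \rVert^2}{\sigma^2}\right), \qquad \lVert s - t \rVert = 3: \quad \exp(-9/25) \approx 0.70 \ (\sigma = 5), \qquad \exp(-9/1) \approx 1.2 \times 10^{-4} \ (\sigma = 1).

With a narrow kernel even small jitter drives the cross term towards zero, inflating the score for genuine matches. In the opposite limit, as σ grows very large every KF term tends to 1 and the score collapses to (|S| − |T|)², which depends only on the sequence lengths and no longer discriminates between hands. This is consistent with the narrow usable range seen in Figure 5.13.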


Figure 5.13: The precision / recall values of VeinDeep after changing σ. σ is the kernel width, which controls the penalty when a point in the reference image does not closely correspond with a point in the test image. Changing this value outside a narrow range led to non-optimal results. It is the only parameter in VeinDeep which is not related to pre-processing of the vein pattern image.


5.4 Summary

The goal for this chapter was to present VeinDeep. We use vein pattern recognition to secure a depth sensor equipped smartphone. We designed a custom vein pattern recognition algorithm for this purpose. As far as we are aware this is the first time depth sensors have been used for vein pattern recognition. VeinDeep is designed to stop opportunistic access, such as a phone left unattended.

For testing we recorded 240 vein patterns, taken from 40 different hands, provided by 20 participants. The system achieved a precision of 0.98 with a recall of 0.83. We compared this to the similar algorithms Kernel and Hausdorff distance, which achieved lower precisions of 0.90 and 0.50 respectively at the same recall rate. VeinDeep matched vein patterns much more efficiently on our embedded Compute Stick, using only 1/6 the run time and 2/3 the memory of Kernel distance, and 1/3 the run time and under half the memory of Hausdorff distance.

Chapter 6

Conclusion

In this thesis we had 2 motivations. We wanted to show the kinds of applications which can take advantage of depth sensor equipped embedded devices and to build the necessary supporting algorithms. In addressing these motivations we have built 3 prototypes. We have also advanced the state of the art in algorithms for processing depth sensor data. The following is a summary of the 3 main chapters of this thesis.

Chapter 3 QuickFind fast segmentation and object detection.

• A prototype augmented reality object assembly aid.

• A segmentation algorithm which is better than the common FF. It works exclusively with depth maps, while FF requires good lighting due to fusion with conventional cameras.

• An object detection algorithm which is better than the popular HOG and the state of the art HONV. QuickFind has better detection rates, is faster, and uses less memory and power.

Chapter 4 WashInDepth fast hand gesture recognition.


• A prototype hand wash monitoring system.

• A hand gesture recognition algorithm which is an extension of QuickFind.

• The algorithm exceeds the gesture recognition accuracy of HOG and HONV and performs faster.

Chapter 5 VeinDeep fast vein pattern recognition.

• A prototype designed to secure a depth sensor equipped smartphone.

• The first instance where a depth sensor has been re-purposed for vein pattern recognition.

• The algorithm exceeds the vein pattern recognition accuracy of Kernel and Hausdorff distances, performs faster and uses less memory.

We end with a few points on potential future research.

6.1 Future Work

We list some possible improvements to the 3 main chapters.

• One possible improvement for all 3 algorithms is to take advantage of parallel processing. Modern embedded systems also possess multicore processors. Each algorithm uses image filters which loop through the input. Each iteration does not require write access to shared resources. Therefore each algorithm can be sped up using parallelism.


• We would like to test with improved sensors in future work. Errors would be reduced by having higher resolution and lower noise. Both WashInDepth and VeinDeep are limited by sensor performance. In WashInDepth the background was very close to the target subject and we were working at the closest operating distance to maximise resolution. This led to noise after background removal. In VeinDeep we are forced to operate very close to the sensor because of the low resolution.

• We would like to test the power consumption of actual depth sensor equipped embedded devices. We have mainly simulated the devices in this thesis, as at the time of writing they are just appearing on the market.

• Under active IR illumination we should be able to retrieve object textures. This allows detection of objects with similar shape but different texture, which could improve the detection rate.

• Many gestures look alike from a single angle. A multi-sensor system should greatly improve gesture recognition accuracy of WashInDepth.

In Chapter 1 we explained that depth sensor equipped embedded devices remain relatively underutilised. This is because no one fully understands the implications of this new mode of sensing for pervasive computing. We think the work presented in this thesis can be applied to other areas of research, which we have listed in the following points.

• An IR and depth sensor equipped smartphone can be used as a mobile nutrition monitor. This is possible because there already exist smartphone based dietary monitors [34]. Spectroscopy of foodstuffs can be achieved using multi-spectral cameras [58]. We know a depth sensor is a type of multi-spectral camera. Combining these 2 ideas could produce a nutrition monitor.


• There already exist systems which use depth sensors for face recognition [59]. We can scan the facial veins as an extra feature to boost recognition rates.

• Depth sensors have been used to monitor behaviour [60]. We can extend this idea using object detection and person identification to get context aware behaviour monitoring.

Bibliography

[1] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et al., “Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera,” in Proceedings of the 24th annual ACM symposium on User interface software and technology, pp. 559–568, ACM, 2011.

[2] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, “Real-time human pose recognition in parts from single depth images,” Communications of the ACM, vol. 56, no. 1, pp. 116–124, 2013.

[3] C. Greenwood, S. Nirjon, J. Stankovic, H. J. Yoon, H.-K. Ra, S. Son, and T. Park, “Kinspace: Passive obstacle detection via kinect,” in European Conference on Wireless Sensor Networks, pp. 182–197, Springer, 2014.

[4] M. Firman, D. Thomas, S. Julier, and A. Sugimoto, “Learning to discover objects in rgb-d images using correlation clustering,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1107–1112, IEEE, 2013.

[5] N. Pinto, Y. Barhomi, D. D. Cox, and J. J. DiCarlo, “Comparing state-of-the-art visual features on invariant object recognition tasks,” in Applications of computer vision (WACV), 2011 IEEE workshop on, pp. 463–470, IEEE, 2011.

[6] D. Pagliari and L. Pinto, “Calibration of kinect for xbox one and comparison between the two generations of microsoft sensors,” Sensors, vol. 15, no. 11, pp. 27569–27589, 2015.

[7] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 886–893, IEEE, 2005.


[8] K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view rgb-d object dataset,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 1817–1824, IEEE, 2011.

[9] S. Tang, X. Wang, X. Lv, T. X. Han, J. Keller, Z. He, M. Skubic, and S. Lao, “Histogram of oriented normal vectors for object recognition with a depth sensor,” in Asian conference on computer vision, pp. 525–538, Springer, 2012.

[10] J. E. Hopcroft and R. E. Tarjan, “Efficient algorithms for graph manipulation,” 1971.

[11] L. Wang and G. Leedham, “A thermal hand vein pattern verification system,” in International Conference on Pattern Recognition and Image Analysis, pp. 58–65, Springer, 2005.

[12] Q. Zhang, Y. Zhou, D. Wang, and X. Hu, “Personal authentication using hand vein and knuckle shape point cloud matching,” in Biometrics: Theory, Applications and Systems (BTAS), 2013 IEEE Sixth International Conference on, pp. 1–6, IEEE, 2013.

[13] A. Johnson, Spin-Images: A Representation for 3-D Surface Matching. PhD thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, August 1997.

[14] N. Silberman and R. Fergus, “Indoor scene segmentation using a structured light sensor,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 601–608, IEEE, 2011.

[15] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conference on Computer Vision, pp. 746–760, Springer, 2012.

[16] L. Xia, C.-C. Chen, and J. K. Aggarwal, “Human detection using depth information by kinect,” in CVPR 2011 WORKSHOPS, pp. 15–22, IEEE, 2011.

[17] P. Scovanner, S. Ali, and M. Shah, “A 3-dimensional sift descriptor and its application to action recognition,” in Proceedings of the 15th ACM international conference on Multimedia, pp. 357–360, ACM, 2007.

[18] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool, “Hough transform and 3d surf for robust three dimensional classification,” in European Conference on Computer Vision, pp. 589–602, Springer, 2010.

[19] Z. Ren, J. Yuan, J. Meng, and Z. Zhang, “Robust part-based hand gesture recognition using kinect sensor,” IEEE transactions on multimedia, vol. 15, no. 5, pp. 1110–1120, 2013.


[20] A. Kurakin, Z. Zhang, and Z. Liu, “A real time system for dynamic hand gesture recognition with a depth sensor,” in Signal Processing Conference (EUSIPCO), 2012 Proceedings of the 20th European, pp. 1975–1979, IEEE, 2012.

[21] Y. Li, “Hand gesture recognition using kinect,” in 2012 IEEE International Conference on Computer Science and Automation Engineering, pp. 196–199, IEEE, 2012.

[22] C. Keskin, F. Kıraç, Y. E. Kara, and L. Akarun, “Real time hand pose estimation using depth sensors,” in Consumer Depth Cameras for Computer Vision, pp. 119–137, Springer, 2013.

[23] I. Oikonomidis, N. Kyriazis, and A. A. Argyros, “Tracking the articulated motion of two strongly interacting hands,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1862–1869, IEEE, 2012.

[24] O. Oreifej and Z. Liu, “Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 716–723, 2013.

[25] X. Yang, C. Zhang, and Y. Tian, “Recognizing actions using depth motion maps-based histograms of oriented gradients,” in Proceedings of the 20th ACM international conference on Multimedia, pp. 1057–1060, ACM, 2012.

[26] M. Watanabe, T. Endoh, M. Shiohara, and S. Sasaki, “Palm vein authentication technology and its applications,” in Proceedings of the biometric consortium conference, pp. 19–21, 2005.

[27] P.-O. Ladoux, C. Rosenberger, and B. Dorizzi, “Palm vein verification system based on sift matching,” in International Conference on Biometrics, pp. 1290–1298, Springer, 2009.

[28] A. Kumar and K. V. Prathyusha, “Personal authentication using hand vein triangulation and knuckle shape,” IEEE Transactions on Image processing, vol. 18, no. 9, pp. 2127–2136, 2009.

[29] Y. Ding, D. Zhuang, and K. Wang, “A study of hand vein recognition method,” in IEEE International Conference Mechatronics and Automation, 2005, vol. 4, pp. 2106–2110, IEEE, 2005.

[30] L. Wang and G. Leedham, “Near-and far-infrared imaging for vein pattern biometrics,” in 2006 IEEE International Conference on Video and Signal Based Surveillance, pp. 52–52, IEEE, 2006.

[31] L. Wang, G. Leedham, and S. Cho, “Infrared imaging of hand vein patterns for biometric purposes,” IET computer vision, vol. 1, no. 3/4, p. 113, 2007.


[32] J. M. Phillips and S. Venkatasubramanian, “A gentle introduction to the kernel distance,” arXiv preprint arXiv:1103.1625, 2011.

[33] K. Wolf, A. Schmidt, A. Bexheti, and M. Langheinrich, “Lifelogging: You’re wearing a camera?,” Pervasive Computing, IEEE, vol. 13, no. 3, pp. 8–12, 2014.

[34] J. Shang, K. Sundara-Rajan, L. Lindsey, A. Mamishev, E. Johnson, A. Teredesai, and A. Kristal, “A pervasive dietary data recording system,” in Pervasive Computing and Communications Workshops (PERCOM Workshops), 2011 IEEE International Conference on, pp. 307–309, IEEE, 2011.

[35] C.-W. You, N. D. Lane, F. Chen, R. Wang, Z. Chen, T. J. Bao, M. Montes-de Oca, Y. Cheng, M. Lin, L. Torresani, et al., “Carsafe app: Alerting drowsy and distracted drivers using dual cameras on smartphones,” in Proceeding of the 11th annual international conference on Mobile systems, applications, and services, pp. 13–26, ACM, 2013.

[36] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015.

[37] E. W. Weisstein, “King graph,” 2003. http://mathworld.wolfram.com/KingGraph.html.

[38] E. W. Weisstein, “Grid graph,” 2001. http://mathworld.wolfram.com/GridGraph.html.

[39] M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan, “Time bounds for selection,” Journal of computer and system sciences, vol. 7, no. 4, pp. 448–461, 1973.

[40] C. D. Manning, P. Raghavan, H. Schütze, et al., Introduction to information retrieval, vol. 1, ch. 15.2.1, p. 327. Cambridge University Press, Cambridge, 2008.

[41] D. Pittet, B. Allegranzi, and J. Boyce, “The world health organization guidelines on hand hygiene in health care and their consensus recommendations,” Infection Control, vol. 30, no. 07, pp. 611–622, 2009.

[42] H. Yamahara, H. Takada, and H. Shimakawa, “Behavior detection based on touched objects with dynamic threshold determination model,” in Smart Sensing and Context, pp. 142–158, Springer, 2007.

[43] V. Verdiramo, “Hand wash monitoring system and method,” Oct. 28 2008. US Patent 7,443,305.


[44] V. Galluzzi, T. Herman, and P. Polgreen, “Hand hygiene duration and technique recognition using wrist-worn sensors,” in Proceedings of the 14th International Conference on Information Processing in Sensor Networks, pp. 106–117, ACM, 2015.

[45] M. A. S. Mondol and J. A. Stankovic, “Harmony: A hand wash monitoring and reminder system using smart watches,” in Proceedings of the 12th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pp. 11–20, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2015.

[46] G. Velvizhi, G. Anupriya, G. Sucilathangam, M. Ashihabegum, T. Jeyamurugan, and N. Palaniappan, “Wristwatches as the potential sources of hospital-acquired infections,” Journal of Clinical & Diagnostic Research, vol. 6, no. 5, 2012.

[47] J. Han, L. Shao, D. Xu, and J. Shotton, “Enhanced computer vision with microsoft kinect sensor: A review,” Cybernetics, IEEE Transactions on, vol. 43, no. 5, pp. 1318–1334, 2013.

[48] A. Ghosh, S. Ameling, J. Zhou, G. Lacey, E. Creamer, A. Dolan, O. Sherlock, and H. Humphreys, “Pilot evaluation of a ward-based automated hand hygiene training system,” American journal of infection control, vol. 41, no. 4, pp. 368–370, 2013.

[49] C. C. Aggarwal and T. Abdelzaher, “Integrating sensors and social networks,” in Social Network Data Analytics, pp. 379–412, Springer, 2011.

[50] K. Wac, “Smartphone as a personal, pervasive health informatics services platform: literature review,” arXiv preprint arXiv:1310.7965, 2013.

[51] M. Harbach, E. von Zezschwitz, A. Fichtner, A. De Luca, and M. Smith, “It’s a hard lock life: A field study of smartphone (un)locking behavior and risk perception,” in Symposium On Usable Privacy and Security (SOUPS 2014), pp. 213–230, 2014.

[52] M. Jakobsson, E. Shi, P. Golle, and R. Chow, “Implicit authentication for mobile devices,” in Proceedings of the 4th USENIX conference on Hot topics in security, pp. 9–9, USENIX Association, 2009.

[53] A. De Luca, A. Hang, F. Brudy, C. Lindner, and H. Hussmann, “Touch me once and i know it’s you!: implicit authentication based on touch screen patterns,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 987–996, ACM, 2012.


[54] A. Bianchi, I. Oakley, V. Kostakos, and D. S. Kwon, “The phone lock: audio and haptic shoulder-surfing resistant pin entry methods for mobile devices,” in Proceedings of the fifth international conference on Tangible, embedded, and embodied interaction, pp. 197–200, ACM, 2011.

[55] M. Dell’Amico, P. Michiardi, and Y. Roudier, “Password strength: An empirical analysis.,” in INFOCOM, vol. 10, pp. 983–991, 2010.

[56] A. M. Smith, M. C. Mancini, and S. Nie, “Second window for in vivo imaging,” Nature nanotechnology, vol. 4, no. 11, p. 710, 2009.

[57] C. D. Mutto, P. Zanuttigh, and G. M. Cortelazzo, Time-of-flight cameras and microsoft kinect (TM). Springer Publishing Company, Incorporated, 2012.

[58] M. Goel, E. Whitmire, A. Mariakakis, T. S. Saponas, N. Joshi, D. Morris, B. Guenter, M. Gavriliu, G. Borriello, and S. N. Patel, “Hypercam: hyperspectral imaging for ubiquitous computing applications,” in Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 145–156, ACM, 2015.

[59] B. Y. Li, A. S. Mian, W. Liu, and A. Krishna, “Using kinect for face recognition under varying poses, expressions, illumination and disguise,” in Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pp. 186–192, IEEE, 2013.

[60] S. Nirjon, C. Greenwood, C. Torres, S. Zhou, J. A. Stankovic, H. J. Yoon, H.-K. Ra, C. Basaran, T. Park, and S. H. Son, “Kintense: A robust, accurate, real-time and evolving system for detecting aggressive actions from streaming 3d skeleton data,” in Pervasive Computing and Communications (PerCom), 2014 IEEE International Conference on, pp. 2–10, IEEE, 2014.
